BHyVe suspend/resume feature
米司 伊織 (Iori YONEJI)
Suspend/resume, a feature that saves the state of a running virtual machine and later restores it, helps many hypervisor users. This project will add the feature to BHyVe.
With suspend/resume support in a hypervisor, virtual machine state can
be saved so that a virtual machine can continue running
across a shutdown/reboot of the hypervisor, which would be much more convenient for many users.
Also, suspend/resume is a key part of implementing virtual machine live migration
and offline migration.
I'll implement suspend/resume for BHyVe to make BHyVe more convenient.
The project goal is support for the suspend/resume feature.
That means virtual machine state can be saved to a file and restored from a file.
Below is the detailed plan of the project:
1. Command line interface supporting suspend/resume
The existing bhyvectl command is too complicated and looks more like a debug command than a control command.
General users would like simpler management tools,
so I will make a new bhyve management tool.
* To perform suspend
# /usr/sbin/bhyvemgr suspend <vmname> <filename>
* To perform resume
# /usr/sbin/bhyvemgr resume <vmname> <filename>
The example shows manual suspend/resume done by a user, but it is also possible to invoke it from an init.d script(*1).
If an init.d script suspends VMs on shutdown/reboot of the host machine and resumes them on startup of the host machine, VMs can keep running across a shutdown/reboot automatically.
2. How does it work
With the feature above, virtual machines must be controllable through a simple set of commands.
Guest memory is mapped with mmap() on the device file /dev/vmm/.*, so the suspend command will save the memory into a binary file after stopping memory operations.
The file that the suspend/resume commands use consists of 2 parts:
- Serialized registers, stored as a msgpack dictionary file (a large number of tiny data items).
- A copy of the memory file (huge flat linear data).
I think they should be bundled together with tar, and the tar file should be compressed because the memory file will be big if it is raw.
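As a rough illustration of the "copy of memory file" part, the sketch below writes a mapped memory region to an already-open file descriptor and copes with short writes. The function name save_guest_memory is hypothetical, and the sketch assumes the caller has already mapped the guest memory and stopped the guest; bundling with tar and compression are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of bhyve): write len bytes starting at
 * mem to outfd. write() may accept fewer bytes than requested, so loop
 * until everything is written.
 * Returns 0 on success, -1 on error (errno is set by write()).
 */
int
save_guest_memory(const void *mem, size_t len, int outfd)
{
	const char *p = mem;

	while (len > 0) {
		ssize_t n = write(outfd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal, retry */
			return (-1);
		}
		p += n;
		len -= (size_t)n;
	}
	return (0);
}
```

A caller would pass the address and length of the mapped guest memory together with the descriptor of the output file.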
Suspend will be implemented as follows.
1.(bhyvemgr) Get the vmctx, find its process, and send a stop signal.
2.(libvmmapi&vmm.ko) Save the registers held in hardware (VMCS) after the bhyve process enters suspending mode.
3.(bhyve) Save or flush host side state. Transmitter queues are emptied, and receiver queues and the configuration queue are saved. The state saved here is sent to bhyvemgr.
4.(bhyvemgr) The results of the 2nd and 3rd operations are stored in a file (I think msgpack is suited for this usage; the bhyve host version must also be stored).
5.(bhyvemgr) Memory is saved to a file. This also saves the guest side state of devices (that is, device driver state).
6.(bhyvemgr) All the state collected is written into a compressed file.
The 2nd operation should probably be processed in one library call, and the call should return a structure representing virtual machine state.
This library call uses the ioctl explained below.
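The requirement in step 4 that the bhyve host version be stored could look roughly like the sketch below. It uses a plain C struct instead of msgpack to stay short, and the magic string, the struct layout, the function names, and the "versions must match exactly" policy are all assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte tag identifying a state file. */
#define SNAP_MAGIC	"BHYVSNAP"

/* Hypothetical header written in front of the saved state. */
struct snap_header {
	char		magic[8];
	uint32_t	host_version;	/* version of the bhyve host that saved it */
};

/* Fill in the header on suspend. */
void
snap_header_init(struct snap_header *h, uint32_t host_version)
{
	memcpy(h->magic, SNAP_MAGIC, sizeof(h->magic));
	h->host_version = host_version;
}

/*
 * Check the header on resume. Returns 0 if the file looks usable,
 * -1 if the tag is wrong or the file was written by another version.
 */
int
snap_header_check(const struct snap_header *h, uint32_t running_version)
{
	if (memcmp(h->magic, SNAP_MAGIC, sizeof(h->magic)) != 0)
		return (-1);
	if (h->host_version != running_version)
		return (-1);
	return (0);
}
```

Rejecting a mismatched version is the most conservative choice; a real implementation might instead accept a range of compatible versions.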
Resume will be implemented as follows.
1.(bhyvemgr) Launch a new virtual machine like bhyveload.
2.(bhyvemgr) Unpack and load the memory file to /dev/vmm/*.
3. Restore host side state. This step is also similar to bhyveload.
4.(libvmmapi&vmm.ko) Restore registers from the state file and set them in the VMCS.
5.(bhyvemgr) Go into non-root mode.
The 3rd and 4th operations use a library call built on ioctl(_, VM_SET_VCPUSTATE, _).
3. Internal design of suspend/resume
=saving/restoring virtual CPU state(kernel side VM state)=
The Intel VMCS(*2) is a set of fields containing many values such as guest registers, host registers, and LAPIC(*3) registers.
Most guest registers must be saved, and the LAPIC interrupt information registers must be saved as well.
BHyVe already provides get/set methods for the VMCS in its ioctl handler in vmm_dev.c, so what needs to be done is mapping the VMCS to abstract virtual Intel 64 CPU(s).
BHyVe has only Intel support now, but AMD support is coming, so I'm going to implement the same features for AMD CPUs.
This feature consists of the 3 parts below.
- VMCS (part of guest registers, guest non-register state, part of LAPIC registers)
This work includes defining an abstract CPU state structure, an ioctl handler to get VMCS fields, a function mapping the VMCS to the abstract CPU structure, a function mapping the abstract CPU structure back to the VMCS, and a setter function for VMCS fields.
- struct vmxctx (guest registers not in the VMCS)
- vLAPIC registers (not in the VMCS)
These two should also be packed into the abstract CPU state structure above. Setter/getter functions are required for each.
Userland will be able to access the abstract CPU state structure through ioctl() calls such as the following:
ioctl(ctx->fd, VM_GET_VCPUSTATE, vmstate);
ioctl(ctx->fd, VM_SET_VCPUSTATE, vmstate);
where "ctx->fd" is a virtual machine descriptor and "vmstate" is an abstract virtual machine structure including the number of CPUs and a pointer to the virtual CPU structures.
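To make the shape of this interface more concrete, here is a minimal sketch of what the abstract structures and the userland wrappers might look like. Everything in it is an assumption: the field list of struct vcpu_state, the layout of struct vm_state, the ioctl request numbers, and the wrapper names are placeholders, and the real definitions would be decided during the project.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Hypothetical architecture-neutral vCPU state; the real field set is TBD. */
struct vcpu_state {
	uint64_t	gpr[16];	/* general purpose registers */
	uint64_t	rip;
	uint64_t	rflags;
	uint64_t	cr0, cr3, cr4;
};

/* Hypothetical abstract VM state: number of CPUs and a pointer to them. */
struct vm_state {
	int			ncpu;
	struct vcpu_state	*vcpus;
};

/* Placeholder request codes; the numbers are made up for this sketch. */
#define VM_GET_VCPUSTATE	_IOWR('v', 100, struct vm_state)
#define VM_SET_VCPUSTATE	_IOW('v', 101, struct vm_state)

/* Allocate a zeroed state for ncpu virtual CPUs; NULL on failure. */
struct vm_state *
vm_state_alloc(int ncpu)
{
	struct vm_state *s;

	if (ncpu <= 0 || (s = calloc(1, sizeof(*s))) == NULL)
		return (NULL);
	s->vcpus = calloc((size_t)ncpu, sizeof(*s->vcpus));
	if (s->vcpus == NULL) {
		free(s);
		return (NULL);
	}
	s->ncpu = ncpu;
	return (s);
}

void
vm_state_free(struct vm_state *s)
{
	if (s != NULL) {
		free(s->vcpus);
		free(s);
	}
}

/* Thin wrappers; vmfd is the virtual machine descriptor (ctx->fd). */
int
vm_get_vcpustate(int vmfd, struct vm_state *s)
{
	return (ioctl(vmfd, VM_GET_VCPUSTATE, s));
}

int
vm_set_vcpustate(int vmfd, const struct vm_state *s)
{
	return (ioctl(vmfd, VM_SET_VCPUSTATE, s));
}
```

Because the structure carries a pointer, the kernel side would have to copy the per-CPU array in and out separately; that handling is not shown here.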
=saving/restoring virtual device state(userland side VM state)=
The userland program (/usr/sbin/bhyve) emulates virtual devices such as block devices, NICs, and the console.
The state of emulated devices such as virtio-net or virtio-blk must be saved and restored by the suspend/resume function.
Here's a list of devices and what must be handled for each:
- virtio-blk: the virtqueue, vtblk_config, and pci_vtblk_softc need to be saved/restored.
  The disk image file name also needs to be saved/restored.
- virtio-net: the 3 virtqueues (tx, rx, ctl) and pci_vtnet_softc need to be saved/restored.
- UART: the UART control registers and receive buffer need to be saved/restored.
- IOAPIC: the IOAPIC registers need to be saved/restored.
Also, the device configurations (list of slot,driver,configinfo) need to be saved/restored.
Note that if passthrough devices are assigned to the VM, suspend should FAIL,
because BHyVe cannot save and restore the state of a real device.
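The passthrough rule above could be enforced by a check over the saved device configuration list before suspend starts. The sketch below is only an illustration: struct dev_config and can_suspend are hypothetical names, and it assumes the passthrough driver is identified by the string "passthru" as in bhyve's -s slot,driver,configinfo option.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one entry of the slot,driver,configinfo list. */
struct dev_config {
	const char	*slot;
	const char	*driver;
	const char	*configinfo;
};

/*
 * Return 1 if the VM may be suspended, 0 if any device is a passthrough
 * device whose hardware state cannot be saved.
 */
int
can_suspend(const struct dev_config *devs, size_t ndevs)
{
	for (size_t i = 0; i < ndevs; i++) {
		/* Assumed driver name for passthrough devices. */
		if (strcmp(devs[i].driver, "passthru") == 0)
			return (0);
	}
	return (1);
}
```

With this check, a suspend request for a VM that has a passthrough device would be rejected up front instead of producing an incomplete state file.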
(*1): BHyVe has to support background VM execution before such an init.d script can be implemented.
(*2): This application uses Intel CPU specific terms such as "Intel VT-x" and "VMCS", but this suspend/resume implementation will support AMD-V as well. This is why we define an abstract CPU state.
(*3): LAPIC may look like an Intel-specific term, but it is the in-core interrupt controller implemented on modern x86 CPUs.