Bare KVM: Use the KVM device directly via Rust

Introduction

KVM is an abbreviation of "Kernel-based Virtual Machine". It is a Linux kernel module, merged into the kernel in 2007, that still actively evolves with new features and bug fixes. Hardware support has expanded from x86 processors to almost all modern CPUs, including ARM.

However, in contrast to the piles of documentation introducing QEMU, libvirt, or even OpenStack, articles detailing the use of the bare KVM device are rare.

Firecracker Project

A friend recently recommended the Firecracker project to me. It aims to provide a secure container environment for cloud-native workloads. The foundation of this project is KVM, which isolates each container in a separate VM for stronger security guarantees.

Firecracker is written in Rust. I have preferred Rust for a long time for its speed, memory-safety model, and elegant, readable syntax. For those with strong memory-safety requirements, Firecracker is one of the best choices.

The goal of this article

This article walks through creating, initializing, and booting a VM by manipulating the KVM device file directly from Rust code. After that, a few lines of machine code (assembled from assembly language) will be executed in the newly created VM.

The example runs on an x86_64 machine with Linux kernel 4.18. Running it on ARM or other architectures requires some modification to the assembly.

How To Use KVM device

KVM is exposed by the Linux kernel as a special device file: /dev/kvm. Like most device files in Linux, users interact with it through the ioctl() system call.

The KVM device file supports a long list of ioctl() commands, covering VM creation, vCPU creation, memory setup, and more. The complete list can be found in the kernel documentation:

https://github.com/torvalds/linux/blob/master/Documentation/virt/kvm/api.rst

Note: to operate on the KVM device file, the user must be in the kvm group, or a permission error will occur.
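As a quick sanity check, the sketch below (using the kvm-ioctls crate introduced later) opens /dev/kvm and prints the KVM API version; it fails with a permission error if the current user cannot access /dev/kvm:

use kvm_ioctls::Kvm;

fn main() {
    // Opening /dev/kvm fails here if the user is not in the kvm group.
    let kvm = Kvm::new().expect("cannot open /dev/kvm");
    // Wraps KVM_GET_API_VERSION; the stable KVM API reports version 12.
    println!("KVM API version: {}", kvm.get_api_version());
}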

Procedures to set up a VM

To set up a runnable VM, six steps are needed:

  1. Instantiate KVM by opening the /dev/kvm file.
  2. Create a VM with the KVM_CREATE_VM ioctl() command.
  3. Initialize guest memory. This tells KVM the layout and size of the guest VM's memory.
  4. Create one or more vCPUs, the virtual CPUs of the VM.
  5. Initialize general-purpose and special registers, i.e., set the initial state of the CPU.
  6. Run code on the vCPU, executing machine code and handling exceptions and I/O requests.

The kvm-ioctls crate

The rust-vmm organization maintains a repo named kvm-ioctls, a Rust wrapper around KVM's ioctl() calls. It depends on the kvm-bindings crate, which is generated directly from the C headers by bindgen.

This crate provides the primitives to interact with the KVM device file and exposes a consistent, safe interface to Rust users.
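The snippets that follow assume both crates are declared as dependencies in Cargo.toml and that roughly these imports are in scope (a sketch; version numbers are omitted):

use std::io::Write;
use std::ptr::null_mut;
use std::slice;

use kvm_ioctls::{Kvm, VcpuExit};
// Items from kvm_bindings are referenced fully qualified below,
// e.g. kvm_bindings::kvm_userspace_memory_region.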

Run the machine

I have pushed the code to my GitHub:

https://github.com/d0u9/blog_samples/tree/master/kvm/bare_kvm

Instantiate KVM

Instantiation is just opening the KVM device file.

let kvm = Kvm::new().unwrap();

The new() function does an open() on the device file /dev/kvm:

pub fn new() -> Result<Self> {
    let fd = Self::open_with_cloexec(true)?;
    Ok(unsafe { Self::from_raw_fd(fd) })
}

Create a VM

Create a new virtual machine; under the hood this calls ioctl() with the KVM_CREATE_VM command.

let vm = kvm.create_vm().unwrap();
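Optionally, one can first confirm that the host's KVM supports the features used below. A small sketch using kvm-ioctls' check_extension(), which wraps the KVM_CHECK_EXTENSION ioctl(); Cap::UserMemory corresponds to KVM_CAP_USER_MEMORY:

use kvm_ioctls::Cap;

// Make sure userspace-defined memory regions are supported before relying
// on set_user_memory_region() below.
assert!(kvm.check_extension(Cap::UserMemory));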

Initialize Guest Memory

First, allocate a chunk of contiguous memory for the VM in userspace with the mmap() function.

let mem_size = 0x4000;
let load_addr: *mut u8 = unsafe {
    libc::mmap(
        null_mut(),                     // let the kernel pick the address
        mem_size,                       // 16 KiB
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_ANONYMOUS | libc::MAP_SHARED | libc::MAP_NORESERVE,
        -1,                             // anonymous mapping: no backing fd
        0,
    ) as *mut u8
};

Then, define the guest memory region. The guest memory region is described by a C structure in the kernel, struct kvm_userspace_memory_region {}, and it is defined as below:

struct kvm_userspace_memory_region {
      __u32 slot;
      __u32 flags;
      __u64 guest_phys_addr;
      __u64 memory_size; /* bytes */
      __u64 userspace_addr; /* start of the userspace allocated memory */
};

The slot field identifies this memory region.

The guest_phys_addr field gives the guest physical address at which this region is mapped into the VM.

The memory_size field specifies the length of the region in bytes.

The userspace_addr field is the start address of the memory allocated in userspace.

The companion kvm-bindings crate provides an identical binding named kvm_bindings::kvm_userspace_memory_region.

let slot = 0;
let guest_addr = 0x1000;
let mem_region = kvm_bindings::kvm_userspace_memory_region {
    slot,
    guest_phys_addr: guest_addr,
    memory_size: mem_size as u64,
    userspace_addr: load_addr as u64,
    flags: kvm_bindings::KVM_MEM_LOG_DIRTY_PAGES,
};

The kvm_bindings::KVM_MEM_LOG_DIRTY_PAGES flag controls whether KVM tracks writes to memory within the slot.

Finally, initialize VM’s memory region with this structure.

unsafe { vm.set_user_memory_region(mem_region).unwrap() };
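As an aside, because the region was registered with KVM_MEM_LOG_DIRTY_PAGES, KVM maintains a dirty-page bitmap for this slot. After the guest has run, it could be fetched with kvm-ioctls' get_dirty_log(), which wraps the KVM_GET_DIRTY_LOG ioctl(); a minimal sketch using the vm, slot, and mem_size values from above:

// One bit per guest page; pages written since the last call are reported dirty.
let dirty_bitmap: Vec<u64> = vm.get_dirty_log(slot, mem_size).unwrap();
println!("dirty pages bitmap: {:?}", dirty_bitmap);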

Create one or more vCPUs

Creating a vCPU wraps the KVM_CREATE_VCPU ioctl(); the argument is the vCPU's index.

let vcpu_fd = vm.create_vcpu(0).unwrap();
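A single vCPU suffices here. For an SMP guest, one would repeat the call with distinct indices, as in the hypothetical sketch below (not used in this example); each additional vCPU needs its own register setup and run loop:

// Hypothetical second vCPU; indices must be unique within a VM.
let _vcpu_fd1 = vm.create_vcpu(1).unwrap();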

Initialize general purpose and special registers

First, get and set the general-purpose registers of the vCPU.

let mut vcpu_regs = vcpu_fd.get_regs().unwrap();
vcpu_regs.rip = guest_addr;
vcpu_regs.rax = 2;
vcpu_regs.rbx = 3;
vcpu_regs.rflags = 2;
vcpu_fd.set_regs(&vcpu_regs).unwrap();

vcpu_regs.rip = guest_addr sets the instruction pointer to the guest physical start address of the memory region we mapped.

vcpu_regs.rflags = 2 follows an x86 convention: bit 1 (the second-lowest bit) of the rflags register is reserved and always set to 1. Volume 1 of the "Intel® 64 and IA-32 Architectures Software Developer's Manuals" details this.

Then, the special registers. Zeroing the code segment's base and selector makes real-mode address translation (base + offset) resolve rip directly to our guest physical address:

let mut vcpu_sregs = vcpu_fd.get_sregs().unwrap();
vcpu_sregs.cs.base = 0;
vcpu_sregs.cs.selector = 0;
vcpu_fd.set_sregs(&vcpu_sregs).unwrap();

Run code on the vCPU

The VM's CPU and memory are ready to go. The last step is loading machine code into memory and executing it.

First, prepare the machine code as a byte array:

let asm_code = &[
    0x31, 0xc0,             /* xor    %ax,    %ax */
    0x31, 0xdb,             /* xor    %bx,    %bx */
    // LOOP:
    0x01, 0xd8,             /* add    %bx,    %ax */
    0x83, 0xc3, 0x01,       /* add    $1,     %bx */
    0x83, 0xfb, 0x14,       /* cmp    $20,    %bx */
    0x7e, 0xf6,             /* jle    LOOP        */
    0xba, 0xff, 0x0e,       /* mov    $0xeff, %dx */
    0xee,                   /* out    %al,    %dx */
    0xf4,                   /* hlt                */
];

How to generate these bytes will be discussed later.

Then write the machine code into the userspace buffer that is mapped to the VM's physical memory:

unsafe {
    // View the mmap()'ed host memory as a mutable byte slice and copy the
    // machine code to its start, i.e., to guest physical address 0x1000.
    let mut slice = slice::from_raw_parts_mut(load_addr, mem_size);
    slice.write(asm_code).unwrap();
}

Finally, execute:

loop {
    match vcpu_fd.run().expect("run failed") {
        VcpuExit::Hlt => {
            println!("VM halts");
            break;
        }
        VcpuExit::IoOut(addr, data) => {
            println!("Address: {:#x}, Data: {:?}", addr, data);
        }
        r => {
            println!("Unknown halts: {:?}", r);
            break;
        }
    }
}

The vcpu_fd.run() function keeps the guest running until some "event" causes a VM exit. A structure named struct kvm_run {} describes the reason run() returned. As for the details behind the Rust wrapper: a pointer to struct kvm_run {} can be obtained by mmap()-ing the vCPU's fd and reading from offset 0 of that mapping.

struct kvm_run {
	...
	__u32 exit_reason;
	...
	union {
		...
		struct {
#define KVM_EXIT_IO_IN  0
#define KVM_EXIT_IO_OUT 1
			__u8 direction;
			__u8 size; /* bytes */
			__u16 port;
			__u32 count;
			__u64 data_offset; /* relative to kvm_run start */
		} io;
		...
	}
	...
}	
include/uapi/linux/kvm.h

Each time run() returns 0, KVM writes the reason for the exit into the structure's __u32 exit_reason field.

The exit reasons that interest us here are: VcpuExit::Hlt and VcpuExit::IoOut.

VcpuExit::Hlt is specific to the x86 processor family, which has the hlt instruction. This exit reason occurs when the vCPU executes hlt.

The VcpuExit::IoOut reason is generic to all processors. It occurs when the CPU executes a port I/O instruction that KVM itself cannot satisfy. Along with the VcpuExit::IoOut reason, the port number and the data the CPU wrote to the port are returned.

For our sample assembly code above, the port is 0x0eff, and the data is 210 (the sum of the integers from 0 to 20).

Inspect the running result

Run the example with cargo run.

cargo run

And the result is as expected:

Address: 0xeff, Data: [210]
VM halts

How to generate machine code

An x86 CPU runs in real mode after booting by default; this is a compatibility decision by Intel that lets ancient code run on a modern processor.

Real mode is quite different from the protected mode we are familiar with. The default operand size is 16 bits in real mode, and less than 1 MB of memory is addressable.

Due to these constraints, assembly code compiled by GCC in its default 32-bit mode is not suitable for real-mode execution. For example, the 0x66 prefix, the Operand Size Override Prefix, has the opposite meaning in real mode: in a 16-bit environment, 0x66 tells the CPU to treat the following instruction as having a 32-bit operand.

Write assembly

We write a simple assembly program that calculates the sum of the integers from 0 to 20, i.e., 1 + 2 + 3 + … + 20, and writes the result to the I/O port at address 0xeff. Why sum up to 20 instead of 100 or some other number? The answer is overflow. Only one byte is written to the I/O port in our example. The maximum unsigned value of a byte is 255, and 1 + 2 + … + 20 = 210, a value that just fits in the range.
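As a quick check of that arithmetic in Rust, the sum indeed fits in a single unsigned byte:

// Sum 0 + 1 + ... + 20 using u8 arithmetic; 210 <= 255, so no overflow.
let sum: u8 = (0u8..=20).sum();
assert_eq!(sum, 210);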

The assembly code is simple:

    .code16
    .section ".text"

start:
    xor     %ax,    %ax
    xor     %bx,    %bx

loop:
    add     %bx,    %ax
    add     $1,     %bx
    cmp     $20,    %bx
    jle     loop

    mov     $0xeff, %dx
    out     %al,    %dx
    hlt

It is worth noting the .code16 directive. This directive marks the file to be assembled as 16-bit code, i.e., with the default register size of a real-mode x86 CPU.

Turn the assembly code into machine binary

Assemble this to ELF binary:

as -march=i386 -32 test.s -o test.o

as is the assembler provided by the GNU toolchain.

To obtain the machine code bytes, use the objdump tool. objdump decodes the binary as 32-bit code by default, so the assembly it prints is not the same as what we just wrote. To dump the assembly correctly, use the options below:

objdump -d -mi8086 test.o
# or
objdump -d -Maddr16,data16 test.o

The 8086 CPU natively runs in real mode. Using 8086 as the target architecture makes objdump produce the same assembly we wrote.

The sample print:

test.o:     file format elf32-i386


Disassembly of section .text:

00000000 <start>:
   0:   31 c0                   xor    %ax,%ax
   2:   31 db                   xor    %bx,%bx

00000004 <loop>:
   4:   01 d8                   add    %bx,%ax
   6:   83 c3 01                add    $0x1,%bx
   9:   83 fb 14                cmp    $0x14,%bx
   c:   7e f6                   jle    4 <loop>
   e:   ba ff 0e                mov    $0xeff,%dx
  11:   ee                      out    %al,(%dx)
  12:   f4                      hlt

Generate the Rust variable

For convenience, the Rust array can be generated directly by running the script below:

tmpfile=$(mktemp) && \
	objcopy -O binary -j .text test.o $tmpfile && \
	od -A n -t x1 $tmpfile \
	| sed -r -e 's/^\s+//g' -e 's/\ +/\n/g' \
	| awk 'BEGIN { \
               cnt=0; \
               line=""; \
               ident="    "; \
               print "let asm_code = &["\
           } \
           { \
               if (cnt<3) { \
                   line=line "0x" $1 ", "; \
                   cnt++; \
               } else { \
                   cnt=0; \
                   print ident line "0x" $1 ","; line=""; \
               } \
           } \
           END { \
               print ident line; print "];"
           }' \
     ; \
	rm $tmpfile

Output:

let asm_code = &[
    0x31, 0xc0, 0x31, 0xdb,
    0x01, 0xd8, 0x83, 0xc3,
    0x01, 0x83, 0xfb, 0x14,
    0x7e, 0xf6, 0xba, 0xff,
    0x0e, 0xee, 0xf4,
];

Conclusion

This is a fundamental article that introduces the basic ideas of running assembly code on an x86 virtual machine under KVM. The Rust wrapper hides many C details and provides a neat interface for running machine code. In effect, we stopped at the point of pushing the power button of a real x86 box.

There is still much work to do before loading a fully functional OS such as Linux: setting up page tables, switching to protected mode, and handling CPU exceptions and interrupts. This work is complex and far from trivial; it involves much of the knowledge needed to design and implement an entire OS and bootloader. However, we have taken the first step of a long march.

References

https://david942j.blogspot.com/2018/10/note-learning-kvm-implement-your-own.html
https://github.com/firecracker-microvm/firecracker
https://wiki.osdev.org/Real_Mode
https://stackoverflow.com/a/19064127/3824053
https://0x00sec.org/t/realmode-assembly-writing-bootable-stuff-part-1/2901
https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt