Guide Virtualization: Modern Approaches and Applications - Part II

This is the second article in the series. This section draws upon "Running multiple operating systems concurrently on an IA32 PC using virtualization techniques" by Kevin Lawton, released under the GNU Free Documentation License, Version 1.1. It is not listed in the references, though, because no direct references are made to it.
Some information has also been obtained from VMware's patent applications, but there is no confirmation that the techniques they describe are actually used in the real world.

Other articles in this series are:
http://www.techenclave.com/forums/virtualization-modern-approaches-and-applications-part-75013.html


This article deals with one technique of virtualization - dynamic recompilation.

3. DYNAMIC RECOMPILATION

A conventional emulator (e.g. Bochs) emulates the microprocessor, executing each guest CPU instruction by calling a software subroutine on the host machine that simulates the function of that CPU instruction. This level of abstraction allows the guest machine to run on host machines with a different type of microprocessor, and is relatively simple to develop and port, but is also very slow.
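As a rough illustration of this dispatch cost, here is a minimal sketch of a pure-emulation loop for an invented toy ISA (not real x86): every guest instruction costs at least one host subroutine call.

```python
# Toy model of a conventional emulator's fetch/decode/dispatch loop.
# The instruction set and encodings are invented for illustration.

class ToyCPU:
    def __init__(self):
        self.regs = {"a": 0, "b": 0}
        self.pc = 0

    # One host subroutine per guest instruction.
    def op_load(self, reg, value):   # LOAD reg, imm
        self.regs[reg] = value

    def op_add(self, dst, src):     # ADD dst, src
        self.regs[dst] += self.regs[src]

    def run(self, program):
        # The dispatch loop is the source of the slowdown: every guest
        # instruction is executed via a full host call.
        while self.pc < len(program):
            opcode, *args = program[self.pc]
            getattr(self, "op_" + opcode)(*args)
            self.pc += 1

cpu = ToyCPU()
cpu.run([("load", "a", 2), ("load", "b", 3), ("add", "a", "b")])
print(cpu.regs["a"])  # 5
```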

A more efficient technique involves dynamically recompiling a block of instructions the first time it is used. Subsequent executions use a cached version of the recompiled block and hence execute faster. This approach is used by Virtual PC for Mac OS X and by QEMU.
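The caching idea can be sketched as follows. The "compilation" step here is simulated (the names and the toy block format are invented for illustration), but the structure - translate on first use, reuse the cached result afterwards - is the point.

```python
# Toy translation cache: the first execution of a block pays the compile
# cost; later executions reuse the cached host function.

translation_cache = {}
compile_count = 0

def translate_block(addr, guest_code):
    """Pretend-compile a guest block into a host callable."""
    global compile_count
    compile_count += 1
    ops = list(guest_code[addr])          # snapshot the guest instructions
    def compiled(state):
        for op, arg in ops:
            if op == "add":
                state["acc"] += arg
        return state
    return compiled

def execute(addr, guest_code, state):
    if addr not in translation_cache:     # translate only on first use
        translation_cache[addr] = translate_block(addr, guest_code)
    return translation_cache[addr](state)

code = {0x100: [("add", 1), ("add", 2)]}
state = {"acc": 0}
for _ in range(3):                        # re-executions hit the cache
    execute(0x100, code, state)
print(state["acc"], compile_count)        # 9 1
```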

A faster approach is possible when the host architecture is the same as the guest architecture. In x86 emulation, only the sensitive instructions that do not fault at reduced privilege need to be emulated or otherwise handled. User mode and virtual-8086 code can be executed directly on the processor (considering 32 bit mode only), as its privilege level is unchanged; only kernel mode code and real mode code need to be dynamically recompiled. The dynamically translated code is placed in a region at the end of the (guest) address space, which can be protected and made invisible using the segmentation mechanisms. There are thus 3 modules in the execution unit - a binary translation sub-system; a direct execution sub-system; and an execution decision module/sub-system that implements a decision function for discriminating between directly executable and non-directly executable VM instructions and instructs the VMM to execute accordingly [5]. This method can only be used when the guest VM has the same architecture as the host. Moreover, while the VM runs in user mode, a hosted solution requires drivers to change the IDT and GDT (among other structures). This approach is used by VMware, Virtual PC for Windows and kqemu.

It must be understood that this form of dynamic recompilation does not simply replace offending instructions, nor does it run kernel-level code in user mode without modification. The entire VM runs at reduced privilege, and the guest kernel mode code must be modified to allow this.

Several schemes can be built up, each more advanced (and complete) than the previous one. Note that in all of these schemes guest execution is performed at reduced privilege (PL 3). The schemes refer to the binary translation sub-system.

Scheme 1) Execution starts at a well defined address (usually in the ROM) and follows a sequence with various branches in between. The VMM can fetch and scan instructions up to the first branch, and place a breakpoint there. Executing the code will result in a breakpoint exception, trapping to the VMM. The VMM can then repeat the procedure before transferring control to the appropriate location. Hence each section of code is scanned before execution. During scanning, problematic instructions can either be replaced with emulation code or preceded by a breakpoint, allowing the VMM to handle them.
Insertion of these software breakpoints can cause problems, though, in the case of self-modifying or self-reading code. The scheme also involves a lot of trapping and scanning of instructions and hence has low efficiency. However, it must be remembered that all of this need be done only for kernel mode code.
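Scheme 1's scan step might be sketched like this, assuming a toy instruction stream and invented sets of branch and privileged opcodes:

```python
# Sketch of Scheme 1's scanner: walk forward from the current address,
# marking a breakpoint before each problematic (privileged) instruction
# and one at the terminating branch, so the VMM regains control there.

BRANCHES = {"jmp", "call", "ret"}
PRIVILEGED = {"cli", "sti", "lgdt"}   # instructions the VMM must emulate

def scan(code, start):
    """Return the breakpoint addresses for the section starting at `start`."""
    breakpoints = []
    addr = start
    while code[addr][0] not in BRANCHES:
        if code[addr][0] in PRIVILEGED:
            breakpoints.append(addr)   # trap here so the VMM can emulate it
        addr += 1
    breakpoints.append(addr)           # trap at the branch before transfer
    return breakpoints

code = {0: ("mov",), 1: ("cli",), 2: ("add",), 3: ("jmp", 7)}
print(scan(code, 0))  # [1, 3]
```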

Scheme 2) Efficiency can be improved by keeping track of sections of code that have already been scanned. A breakpoint exception can occur for two reasons (as in Scheme 1): a problematic instruction (which must be emulated) or a branch. In the case of a branch, if the target address points to a section of code that has already been analyzed (scanned), no further work is required and the control transfer can be effected immediately; only a new section needs to be scanned. Moreover, if the branch address is a (non-computed) constant and the target code has already been analyzed, there is no need for the breakpoint, and it can be removed. This results in less trapping and more native execution. Branch targets which are dynamically computed still need to be monitored by breakpoints, though.
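A sketch of the bookkeeping in Scheme 2, with the scan itself elided; the decision logic for keeping or removing a breakpoint follows the rules above:

```python
# Scheme 2 sketch: remember which addresses have been scanned; a branch to
# an already-scanned constant target no longer needs its breakpoint, while
# computed branches keep theirs.

scanned = set()

def handle_branch(target, target_is_constant):
    """Called when a breakpointed branch traps to the VMM.
    Returns (needs_scan, keep_breakpoint)."""
    needs_scan = target not in scanned
    if needs_scan:
        scanned.add(target)           # scan of the new section elided
    # A constant branch to already-analyzed code can run natively from
    # now on; computed branches must keep their breakpoint.
    keep_breakpoint = not (target_is_constant and not needs_scan)
    return needs_scan, keep_breakpoint

print(handle_branch(0x200, True))   # (True, True)   first visit: scan it
print(handle_branch(0x200, True))   # (False, False) now runs natively
print(handle_branch(0x300, False))  # (True, True)   computed: keep the bp
```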

Scheme 3) The above cases do not consider the situation in which the code is written to (self-modifying code, for example). In this case, code which was earlier scanned may need to be reworked. To allow this, we use the page protection mechanisms. After a code page is analyzed and dynamically recompiled (with possible insertion of breakpoints/emulation code), it is write protected. Since multiple (linear) pages can map to the same physical page (frame), all such pages must be write protected. Any attempt to write to this page will now cause a page fault. The VMM can then adopt several different strategies. The simplest involves allowing the write and removing that code page from the list of analyzed code sections; the next access to that section will require analysis once again. A more efficient approach involves storing metadata about the page beforehand. Using this metadata it is possible to decide intelligently whether the entire code must be rescanned or only minimal changes are required. Depending on the instruction being replaced and the instructions being written, the VMM can either leave the code marked as analyzed (a native instruction replaced by a native instruction, or a replacement in unanalyzed code), remove the entire code section from the scanned list, or immediately scan the new code section.
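The simplest write-fault strategy from Scheme 3 reduces to bookkeeping like the following (frame numbers are arbitrary):

```python
# Scheme 3 sketch: analyzed code pages are write-protected; on a write
# fault, the simplest strategy allows the write and drops the analysis,
# forcing a rescan on the next execution.

analyzed = set()          # frames whose code has been scanned
write_protected = set()   # really: every linear page mapping each frame

def on_page_analyzed(frame):
    analyzed.add(frame)
    write_protected.add(frame)

def on_write_fault(frame):
    """Simplest strategy: permit the write, invalidate the analysis."""
    analyzed.discard(frame)
    write_protected.discard(frame)   # next execution triggers a rescan

on_page_analyzed(42)
on_write_fault(42)
print(42 in analyzed)  # False
```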

Scheme 4) The above scheme brings paging into the picture, and hence scanning should be considered at page granularity. Consider page-local branches. When first encountered they can be marked as requiring virtualization (breakpoint inserted); at a later stage, after the target address is scanned, they can be marked for native execution (breakpoint removed). The original scheme is therefore a form of lazy processing.

Instead of lazy processing, it is also possible to prescan. In this case, whenever a page-local branch occurs (to a non-computed address), one (unconditional) or both (conditional) branch targets can be scanned ahead of time. This can be done recursively; after the recursion, the branch instruction can be executed natively from then on. Code at the lower levels of recursion can execute branches natively or not, as required. Recursion terminates whenever an out-of-page jump occurs, an already scanned instruction is encountered, or an instruction that straddles a page boundary is encountered. A maximum recursion depth is established to avoid excessive recursion. The advantage of this scheme is that if the page is used heavily before replacement, most of its execution is native and the repetitive trapping inherent in the previous schemes is avoided.
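The prescanning recursion might look like the following sketch, with an invented toy instruction format; recursion stops at out-of-page targets, already-scanned addresses, and a depth limit:

```python
# Prescanning sketch: follow page-local branches recursively, scanning
# both arms of conditional branches, bounded by a maximum depth.

PAGE_SIZE = 4096
MAX_DEPTH = 8

def prescan(code, addr, page_base, scanned, depth=0):
    while True:
        if addr in scanned or depth > MAX_DEPTH:
            return                       # already analyzed, or too deep
        if not (page_base <= addr < page_base + PAGE_SIZE):
            return                       # out-of-page target: stop here
        scanned.add(addr)
        op = code.get(addr, ("hlt",))
        if op[0] == "jcc":               # conditional: scan both targets
            prescan(code, op[1], page_base, scanned, depth + 1)
            addr += 1                    # then follow the fall-through
            depth += 1
        elif op[0] == "jmp":
            addr = op[1]                 # unconditional: follow the target
            depth += 1
        elif op[0] == "ret":
            return
        else:
            addr += 1                    # ordinary instruction

code = {0: ("mov",), 1: ("jcc", 5), 2: ("mov",), 3: ("ret",),
        5: ("mov",), 6: ("ret",)}
scanned = set()
prescan(code, 0, 0, scanned)
print(sorted(scanned))  # [0, 1, 2, 3, 5, 6]
```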

Scheme 5) The above schemes do not handle self-reading code. Because of all the inserted breakpoints, the code has been modified considerably. This can cause problems: branches will no longer branch correctly, and self-reading code may also fail. In any case it is not practical to insert breakpoints in place, so even in the previous schemes a copy is required of any page that is modified. All modifications are made to the copy, and execution is performed from that modified copy. Branches have to be adjusted accordingly: in particular, the instruction pointer has to be emulated, and any branch to a code block that has been modified has to be rerouted to the modified copy. All reads take place from the actual code. All this is required because page protection does not distinguish between reading and executing code.
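Scheme 5's separation of the executed copy from the readable original can be sketched as below (toy "instructions" standing in for page bytes):

```python
# Scheme 5 sketch: breakpoints go into a private copy of the code page;
# execution (and branches into the page) use the copy, while reads see
# the unmodified original.

originals = {}    # page -> original contents
copies = {}       # page -> instrumented, executable copy

def instrument_page(page, code):
    originals[page] = code
    # Toy instrumentation: replace a privileged instruction with a
    # breakpoint marker in the executable copy only.
    copies[page] = ["BP" if insn == "cli" else insn for insn in code]

def fetch_for_execution(page):
    return copies[page]       # branches into `page` are rerouted here

def fetch_for_read(page):
    return originals[page]    # self-reading code sees original bytes

instrument_page(7, ["mov", "cli", "ret"])
print(fetch_for_execution(7))  # ['mov', 'BP', 'ret']
print(fetch_for_read(7))       # ['mov', 'cli', 'ret']
```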

Scheme 6) Paging does not offer a method to allow execution from a page while blocking read/write. Such a facility would allow easy detection and handling of self modifying/reading code. This is possible, however, by making use of the separate instruction and data TLBs available on most new processors (Pentium upwards).

First, the instruction and data TLB entries for the code page are invalidated using the INVLPG instruction. The page table entry is then made ring 3 accessible (user bit set). A private mapping to the physical page is created, i.e. another virtual page which maps to the same physical frame. This private mapping is used to write a RET instruction into the code page. The RET is then called through the normal mapping: it immediately returns to the instruction after the call, and in doing so has loaded the instruction TLB with an entry corresponding to the normal mapping, which is user mode accessible. The private mapping is then used to restore the original contents over the RET. Since the private mapping was used for inserting and removing the RET, there is no entry in the data TLB corresponding to the normal mapping; only the private mapping is cached (in the data TLB). The page table entry now has its user/supervisor bit cleared (not ring 3 accessible). Control is then transferred to the ring 3 code.

Code can continue to be fetched and executed from the page, as the TLB entry is marked as ring 3 accessible, and has not been updated yet (NOTE: it is the responsibility of the system programmer to maintain coherency, and the TLB will not be updated automatically when the PTE is updated). However, when there is an attempt to read/write from the page, there is no corresponding entry in the data TLB. Hence the access will refer to the PTE which has the U/S bit cleared. As a result, the access will cause a page fault. This will enable the VMM to detect self modifying/reading code and take appropriate action.

At any moment, the instruction TLB entry for the page may be evicted by the processor (as per the TLB caching policy in use). In that case, an attempt to execute code from the page will cause a page fault, trapping to the VMM, which can then repeat the above procedure.
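The effect of the stale-ITLB trick can be modelled with two toy lookup paths. The key point is that instruction fetches hit the stale user-accessible ITLB entry, while data accesses miss the DTLB and fall through to the now-protected PTE:

```python
# Toy model of the split-TLB trick: a stale ITLB entry lets ring 3 keep
# fetching from the page, while data accesses consult the PTE (U/S bit
# cleared) and fault. Addresses and structures are invented.

pte_user_ok = {0x1000: False}   # PTE: U/S bit cleared, not ring 3 accessible
itlb = {0x1000: True}           # stale entry planted by the RET call
dtlb = {}                       # no entry for the normal mapping

def fetch_allowed(page):
    """Instruction fetch: the stale ITLB entry satisfies it."""
    return itlb.get(page, pte_user_ok[page])

def read_allowed(page):
    """Data access: DTLB miss falls through to the PTE -> page fault."""
    if page in dtlb:
        return dtlb[page]
    return pte_user_ok[page]    # False = fault, trapping to the VMM

print(fetch_allowed(0x1000), read_allowed(0x1000))  # True False
```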
The above schemes are only a general approach to dynamic recompilation. Implementation details may vary widely from the aforementioned schemes depending on the virtualization solution being used. For example, VMware provides direct execution for ring 2 and 3 code and invokes the binary translation unit for PL 0 and 1 code [5].

3.1 Other issues involved in dynamic recompilation techniques
Guest descriptor tables and non-reversible segmentation – the VMM saves the state of the host and VM descriptor tables and switches between them on transitions. An interrupt descriptor table must also be created to handle the exceptions that cause transitions from the guest to the VMM, as well as the native guest interrupts. Descriptor loading must also be handled: the guest system code (which normally runs in rings 0..2) runs in ring 3, so attempts to load descriptors having DPL 0..2, which would normally succeed, will fail (general protection fault). To handle this, instructions that load segment registers or examine their contents must be virtualized.

Guest descriptor tables cannot be used directly in any case. This could allow the VM to take over the machine. So, the VMM maintains a shadow copy of the VM descriptor tables, and these tables are stored and restored during every switch rather than the actual VM (guest) tables. In effect, the shadow tables are the real tables. There is also a segment tracing module that monitors the VM descriptor tables and maintains coherency between the VM and shadow descriptor tables.

The x86 architecture uses non-reversible segmentation [1] - that is, the state of a segment loaded in the processor (the cached descriptor) cannot be reconstructed once its contents in memory have been modified. This leads to considerable difficulties in virtualization: whenever the contents of the segment registers are changed (say, when switching VMs or trapping to the VMM), the cached contents are lost, and when the registers are reloaded (switching back to the VM) the cached contents will not be restorable. This would lead to an inconsistency between native execution and execution in a VM. To maintain coherency, the VMM reserves one entry in the GDT for each segment register of the processor (per VM).

Some correspondence exists between the VM tables and the shadow tables. In (early versions of) VMware this correspondence [5] was as below:
For data and code segments, the shadow descriptor is a copy of the virtual machine entry, except that the linear range of the shadow entry must not overlap with that of the VMM. The shadow entry's range is truncated to ensure this, and any errors due to truncation are handled by the GPF handler. Also, entries with DPL 0 are shifted to DPL 1, which is where the binary translation unit runs. In the case of gate descriptors the target code segment selector is zeroed out. This results in a GPF, allowing the VMM to handle the gated call (and thus control situations which would normally result in a privilege change).
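These conversion rules can be sketched as below; the field names and the VMM's base address are invented for illustration:

```python
# Sketch of the shadow-descriptor conversion rules: truncate the linear
# range away from the VMM's reserved region, shift DPL 0 to DPL 1, and
# zero the selector in gate descriptors so they fault to the VMM.

VMM_BASE = 0xFFC00000   # assumed start of the VMM's reserved linear range

def shadow_entry(entry):
    e = dict(entry)
    if e["type"] == "gate":
        e["selector"] = 0            # forces a GPF; VMM handles the call
        return e
    # Truncate so the segment cannot reach into the VMM's range.
    top = e["base"] + e["limit"]
    if top >= VMM_BASE:
        e["limit"] = VMM_BASE - e["base"] - 1
    if e["dpl"] == 0:
        e["dpl"] = 1                 # binary translation runs at DPL 1
    return e

s = shadow_entry({"type": "code", "base": 0, "limit": 0xFFFFFFFF, "dpl": 0})
print(hex(s["limit"]), s["dpl"])  # 0xffbfffff 1
```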

To prevent the VM from loading a VMM or cached descriptor during direct execution (stored with the shadow table), the DPL is kept lower (higher privilege) than the lowest CPL for direct execution. Hence any such attempt will cause a trap, and a switch to binary translation using the cached entries. In binary translation, both the VM and the VMM will run at the same privilege level, so each load of the segment registers must be emulated to prevent loading of non-shadow descriptors.

Segment tracing, mentioned above, maintains coherency. LGDT-like instructions cause a GPF and are emulated by scanning and converting all the entries of the VM's LDT and GDT into the shadow tables. Memory traces are installed on the VM's LDT and GDT. If a write occurs to a descriptor which is not currently loaded, the corresponding shadow entry is simply modified. If the write corresponds to a loaded descriptor, the old shadow entries are first copied to the cached entries and then updated. Binary translation must now be used, because there are two notions of the current segment selectors.

Again, other products (and indeed, other versions of VMware) may use quite different approaches.

3.2 Paging and memory allocation
Before starting a virtual machine, some amount of physical memory is granted to it. This memory is granted by the host to the VMM, which then allocates it to its guest VMs. A VM must perform paging within its own virtual disk. Thus the VMM must be able to virtualize the virtual memory handling process. This is done in a similar fashion to how segmentation is handled – a shadow copy of the page tables is maintained for each VM. Coherency is maintained by placing a trace on the page table entries. If more physical memory needs to be assigned than is available, some of the "virtual physical memory" of the VM will be paged in and out of memory by the host OS, just as it would be for any other application.
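The shadow page table write trace reduces to the following sketch (the guest-physical-to-host-physical translation is a hypothetical stand-in):

```python
# Shadow page table sketch: the VMM mirrors every traced write to the
# guest's page table into the shadow table that the MMU actually walks.

guest_pt = {}
shadow_pt = {}    # what the hardware MMU actually uses

def translate_to_host(guest_pfn):
    # Hypothetical mapping from "virtual physical" frames to host frames.
    return guest_pfn + 0x10000

def on_guest_pte_write(vpn, guest_pfn):
    """The trace on the guest page table traps every PTE write to the
    VMM, which propagates the change into the shadow table."""
    guest_pt[vpn] = guest_pfn
    shadow_pt[vpn] = translate_to_host(guest_pfn)

on_guest_pte_write(3, 7)
print(shadow_pt[3] == 0x10000 + 7)  # True
```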

A mention should be made here of QEMU, which uses a portable version of dynamic recompilation. While dynamic recompilation is normally heavily CPU dependent, QEMU uses a portable approach that allows it to act as an emulator for several architectures, hosted across many others. It can also be used in user mode to run Linux binaries compiled for one architecture on another architecture. In conjunction with WINE, this allows the use of Windows binaries on PowerPC, Sparc, and ARM based systems running a corresponding build of Linux.

REFERENCES
[1] G.J. Popek and R.P. Goldberg, "Formal Requirements for Virtualizable Third-Generation Architectures", Comm. ACM, July 1974, pp. 412-421.
[5] Devine et al., "Virtualization system including a virtual machine monitor for a computer with a segmented architecture", USPTO, May 2002.
 