Guide Virtualization: Modern Approaches and Applications - Part I

This is actually a seminar which I presented some months ago. I will post it here as a multi-part series.


Part 1 of the series addresses the following issues - what is virtualization? And why is it so difficult?

1. INTRODUCTION

1.1 Definition

Virtualization is a framework or methodology of dividing the resources of a computer into multiple execution environments, by applying one or more concepts or technologies such as hardware and software partitioning, time-sharing, partial or complete machine simulation, emulation, quality of service, and many others. Each virtual machine runs an independent operating system instance. In other words, virtualization creates multiple “execution instances”. While virtualization generally involves creating virtual machines having an identical architecture to the underlying physical machine, in some cases the underlying architecture may be completely different from the virtual architecture. In such a case, the virtual machine monitor (the software which provides the illusion of virtual machines) is acting as an emulator.

1.2 Requirements for a virtualizable architecture:

The (most basic) Popek and Goldberg requirement [1] for virtualization of a computing platform (processor architecture) is that every access to privileged state from user mode causes a trap. Any architecture that satisfies this requirement is natively virtualizable.
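
The requirement can be read as a simple set check: every instruction that touches privileged state (a "sensitive" instruction) must also be one that traps in user mode (a "privileged" instruction). The sketch below illustrates that check. The instruction table is a small hand-picked sample, not a full ISA description, and the function names are made up for this example.

```python
# Hypothetical, hand-picked instruction table for illustration only.
# "sensitive":  the instruction reads or changes privileged machine state.
# "privileged": the instruction traps when executed outside the most
#               privileged mode.
INSTRUCTIONS = {
    "LGDT": {"sensitive": True, "privileged": True},
    "LIDT": {"sensitive": True, "privileged": True},
    "SGDT": {"sensitive": True, "privileged": False},
    "SIDT": {"sensitive": True, "privileged": False},
    "ADD": {"sensitive": False, "privileged": False},
}


def non_trapping_sensitive(instructions):
    """Return the sensitive instructions that do NOT trap in user mode."""
    return sorted(
        name
        for name, props in instructions.items()
        if props["sensitive"] and not props["privileged"]
    )


def is_natively_virtualizable(instructions):
    """The requirement holds only if no sensitive instruction escapes trapping."""
    return not non_trapping_sensitive(instructions)


if __name__ == "__main__":
    print("Offending instructions:", non_trapping_sensitive(INSTRUCTIONS))
    print("Natively virtualizable:", is_natively_virtualizable(INSTRUCTIONS))
```

With this sample table the check fails because of SGDT and SIDT, which is the kind of x86 problem Section 2.3 discusses.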

1.3 Role of a virtual machine monitor:

A virtual machine monitor is a software system that partitions a single physical machine into multiple virtual machines. Virtualization of a computer environment involves virtualizing three components, viz. processor, memory and I/O. This is similar to the way multitasking operating systems run: each process functions as if it were the only process executing on the machine (ignoring IPC for the purpose of this discussion). To provide this illusion, the OS, running in kernel mode on behalf of the process, must provide certain facilities. The task scheduler allocates time to each process, and activates and deactivates it accordingly. Each process's memory space is isolated and protected by the memory management subsystem of the OS. I/O is always routed through the kernel via system calls. Hence the illusion of isolation rests on the fact that the OS operates in kernel mode and can place restrictions on the actions of user mode processes. The OS, therefore, traditionally has full control over the hardware of a computer.

However, when multiple operating systems run in concurrent virtual machines (or even a single guest OS runs in a VM hosted within another OS), this assumption no longer holds. The Virtual Machine Monitor must run at a privilege level above the guest operating systems. This is necessary to protect and isolate the virtual machines from each other. For example, while traditionally each OS has full access to the entire range of memory, in a VM such access must be limited to a well-defined subset. To ensure this, the VMM must either 1) take over all memory allocation for each guest OS, which would be a slow emulation process causing many incompatibilities, or 2) permit guest operating systems the freedom to allocate memory by themselves, but validate all changes to the segmentation/paging tables of the guest operating systems so as to ensure that they do not access a non-permitted memory range. Similarly, guest operating systems believe they have exclusive control of an I/O device. The VMM must be able to provide them this illusion when it is actually not true. All I/O accesses must also somehow be monitored by the VMM.
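
Option 2 above can be pictured as a gatekeeper sitting in front of the guest's page tables. The following is a minimal sketch of that idea, assuming each guest owns one contiguous range of machine frames. The class and method names are hypothetical and do not come from any real VMM.

```python
class GuestMemoryPolicy:
    """Toy model of a VMM validating a guest's page-table updates.

    Assumption for this sketch: the guest owns one contiguous range of
    machine frames, [first_frame, first_frame + frame_count).
    """

    def __init__(self, first_frame, frame_count):
        self.allowed = range(first_frame, first_frame + frame_count)
        self.page_table = {}  # guest virtual page -> machine frame

    def update_mapping(self, virtual_page, machine_frame):
        """Apply a page-table write only if the frame belongs to this guest."""
        if machine_frame not in self.allowed:
            raise PermissionError(
                f"frame {machine_frame} is outside the guest's allocation"
            )
        self.page_table[virtual_page] = machine_frame


if __name__ == "__main__":
    guest = GuestMemoryPolicy(first_frame=100, frame_count=50)
    guest.update_mapping(0, 120)       # accepted: frame 120 is in 100..149
    try:
        guest.update_mapping(1, 200)   # rejected: frame 200 is not ours
    except PermissionError as err:
        print("VMM refused update:", err)
```

The guest keeps choosing its own mappings, and the VMM only gets involved to accept or refuse each change.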

The virtual machine monitor can either run as a process of a host operating system (a hosted implementation), in which case it runs partly in kernel mode and partly in user mode (the VMs), or it can function as a standalone unit, not hosted on any operating system. In the standalone case it can no longer make use of the facilities provided by a host (I/O, drivers, power management etc.) and must take care of all such issues itself. This seminar concentrates on hosted solutions, as they focus more on virtualization issues than on hardware issues, and are also more challenging to develop because they must generally modify or hook into the host operating system in some way.

1.4 Challenges in virtualization:

Successful partitioning of a machine to support the concurrent execution of multiple operating systems poses several challenges.
  • Virtual machines must be isolated from one another: it is not acceptable for the execution of one to adversely affect the performance of another.
  • It is necessary to support a variety of different operating systems to accommodate the heterogeneity of popular applications.
  • The performance overhead introduced by virtualization should be small.
This seminar considers virtualization on the x86 platform (though many of the technologies considered can be used on other architectures as well). The remainder of this seminar is organized in the following manner. Section 2 discusses the numerous challenges to virtualization on the x86 platform. Section 3 describes binary translation techniques for virtualization. Sections 4 and 5 describe Paravirtualization on the Denali and Xen platforms respectively. Section 6 describes the new virtualization extensions to the x86 platform. To round off, Section 7 describes the applications and advantages of virtualization.


2. DIFFICULTIES IN VIRTUALIZATION ON THE IA-32/X86 ARCHITECTURE


IA-32/x86 microprocessors provide protection based on the concept of a 2-bit privilege level, 0 being most privileged and 3 least privileged [2]. For an OS to control the CPU, some modules must run at privilege level 0. As explained in 1.3 above, a guest OS cannot be granted such control and cannot execute at privilege level 0. Thus, IA-based VMMs must use ring deprivileging, a technique that runs all guest software at a privilege level greater than 0. There are two possible schemes: guest operating systems can run either at privilege level 1 (the 0/1/3 model) or at privilege level 3 (the 0/3/3 model). Although the 0/1/3 model supports simpler VMMs, it cannot be used for guests running in 64-bit mode, as segmentation-based protection is not available in that mode [3]. Ring deprivileging causes numerous virtualization challenges [4].


2.1 Ring aliasing.

Ring aliasing refers to problems that arise when software is run at a privilege level other than the level for which it was written. An example in IA-32 is the PUSH instruction (which pushes its operand onto the stack) when executed with the CS register as its operand, since part of CS is the current privilege level (CPL). In this case a guest OS could easily determine that it is not running at privilege level zero. Other examples relate to I/O instructions, the IOPL, and instructions such as VERR, in which the CPL is factored into the checks performed.
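
To make the PUSH CS example concrete: the privilege level sits in the low two bits of the selector value, so a guest kernel only has to mask the value it pushed. The sketch below models that check on plain integers and does not execute any x86 instruction. The selector values are illustrative and the function names are hypothetical.

```python
def privilege_level(selector):
    """Return the privilege level held in the low two bits of a selector."""
    return selector & 0b11


def guest_detects_deprivileging(cs_value):
    """A kernel written for ring 0 expects the CS value it pushed to decode to 0."""
    return privilege_level(cs_value) != 0


if __name__ == "__main__":
    # Illustrative selector values (same descriptor index, different level).
    for label, cs in [("native, ring 0", 0x08),
                      ("0/1/3 model, ring 1", 0x09),
                      ("0/3/3 model, ring 3", 0x0B)]:
        print(f"{label}: CS={cs:#06x} "
              f"level={privilege_level(cs)} "
              f"detected={guest_detects_deprivileging(cs)}")
```

Because PUSH CS does not trap, the VMM gets no chance to substitute the value the guest expects to see.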

2.2 Address-space compression.

Operating systems are designed to have access to the full range of the linear address space. A VMM must reserve for itself some portion of the guest’s virtual address space. In the minimal case, the VMM requires a small portion of the guest’s linear address space (usually part of the higher address range) to be able to transition from the guest to the VMM. In the maximal case it can use a large chunk, so as to be able to access the guest’s data easily. The VMM must prevent guest access to those portions of the guest virtual address space. Any guest attempt to access this region must trap to the VMM, which must then emulate the memory access.

2.3 Nonfaulting access to privileged state.

Privilege-based protection generally prevents unprivileged software from accessing privileged CPU state. In most cases, attempted accesses result in exceptions (this is the major Popek and Goldberg requirement). However, there are some IA-32 instructions that access privileged state but do not fault when executed outside kernel mode. For example, the instructions LGDT, LIDT, LLDT and LTR execute only at privilege level zero, but software can execute the corresponding store instructions SGDT, SIDT, SLDT and STR at any privilege level. Using these instructions, a guest OS might be able to determine that it is not running at full privilege level.

2.4 Adverse impacts on guest transitions.

Ring de-privileging interferes with the SYSENTER and SYSEXIT instructions that support low latency system calls. SYSENTER always effects a transition to privilege level 0, and SYSEXIT will fault if executed outside that privilege level. Hence the VMM must emulate every guest execution of these two instructions.

2.5 Interrupt Virtualization.

IA-32 uses the interrupt flag (IF) in the EFLAGS register to control interrupt masking. VMMs restrict the ability of guests to enable/disable interrupts. A guest will fault when it attempts to do so. However since this is done quite frequently, it can result in a large performance drop. There are also virtual interrupts that must be delivered by the VMM only when the guest has unmasked interrupts.
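
One way to picture this is a per-guest "virtual IF" that the VMM keeps in software: guest attempts to change IF fault into the VMM, which updates the virtual flag, and virtual interrupts are queued until that flag is set. The sketch below is a toy model under those assumptions. The class and method names are hypothetical, and it deliberately ignores details such as the CPU masking interrupts again when a handler is entered.

```python
from collections import deque


class VirtualCPU:
    """Toy model of interrupt virtualization for one guest."""

    def __init__(self):
        self.virtual_if = False   # the guest's view of EFLAGS.IF
        self.pending = deque()    # virtual interrupts awaiting delivery
        self.delivered = []       # vectors handed to the guest, in order

    def raise_interrupt(self, vector):
        """VMM side: a virtual device wants to interrupt the guest."""
        self.pending.append(vector)
        self._deliver_if_unmasked()

    def emulate_cli(self):
        """Guest tried to mask interrupts and faulted; the VMM emulates it."""
        self.virtual_if = False

    def emulate_sti(self):
        """Guest tried to unmask interrupts and faulted; the VMM emulates it."""
        self.virtual_if = True
        self._deliver_if_unmasked()

    def _deliver_if_unmasked(self):
        # Simplification: drain the whole queue once the guest is unmasked.
        while self.virtual_if and self.pending:
            self.delivered.append(self.pending.popleft())


if __name__ == "__main__":
    cpu = VirtualCPU()
    cpu.raise_interrupt(32)   # guest is masked, so this is only queued
    cpu.emulate_sti()         # unmasking triggers delivery of vector 32
    print("delivered:", cpu.delivered)
```

Every emulate_cli/emulate_sti call in this model stands for a fault into the VMM, which is where the performance cost mentioned above comes from.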

2.6 Ring Compression.

Ring deprivileging uses privilege-based mechanisms to protect the VMM from guest software. Segment limits do not apply to the 64-bit mode of x86-64 processors (AMD64/EM64T extensions). Hence paging must be used in this mode to limit access to address ranges. Since IA-32 paging does not distinguish between levels 0-2 (there are only two levels – user/supervisor), the guest OS must run at privilege level 3. Thus the guest OS will run at the same privilege level as guest applications. This is called ring compression. Figure 1 describes this phenomenon.
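
The argument above can be checked with a tiny model: paging collapses the four rings into two classes, so it can only separate ring 3 from everything else. The sketch below encodes that rule and applies it to both deprivileging models. The function and variable names are invented for this example.

```python
def paging_class(ring):
    """IA-32 paging sees two classes: rings 0-2 are supervisor, ring 3 is user."""
    if ring not in (0, 1, 2, 3):
        raise ValueError("x86 privilege levels are 0..3")
    return "user" if ring == 3 else "supervisor"


def paging_can_protect(protected_ring, untrusted_ring):
    """True if page-level protection can shield protected_ring from untrusted_ring."""
    return (paging_class(protected_ring) == "supervisor"
            and paging_class(untrusted_ring) == "user")


# (VMM ring, guest OS ring, guest application ring)
MODELS = {"0/1/3": (0, 1, 3), "0/3/3": (0, 3, 3)}

if __name__ == "__main__":
    for name, (vmm, guest_os, guest_app) in MODELS.items():
        print(name,
              "| VMM shielded from guest OS:", paging_can_protect(vmm, guest_os),
              "| guest OS shielded from apps:", paging_can_protect(guest_os, guest_app))
```

With paging as the only protection mechanism, the 0/1/3 model cannot shield the VMM from the guest OS, while the 0/3/3 model shields the VMM but leaves the guest OS at the same level as its own applications.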

2.7 Access to hidden CPU state.

Some components of CPU state are not represented in any software-accessible register. The descriptor caches for the segment registers are among these. There is no mechanism to save and restore these hidden components.

While further discussion will be restricted to the IA-32/x86/x86-64 architecture, the techniques described are applicable in some form or other to other architectures as well.

REFERENCES

[1] G.J. Popek and R.P. Goldberg, “Formal Requirements for Virtualizable Third-Generation Architectures”, Comm. ACM, July 1974, pp. 412-421.
[2] Intel Corp., “Intel Architecture Software Developer’s Manual Volume 1: Basic Architecture”.
[3] AMD Corp., “AMD64 Architecture Programmer’s Manual Volume 2: System Programming”.
[4] Rich Uhlig et al., “Intel Virtualization Technology”, Computer, May 2005, pp. 48-56.
 
OK, I tried.... don't know how to make it any prettier. Parts 3 and 4 should have more diagrams and will therefore look better. This thing looks like a lot of text, but then again, that is what scientific papers are all about, no?
 
Wonderful take on "Virtualization" would read in-depth later, also would do a bit of formatting :)

As always, great going, keep it up :cheers:

Edit : Leave the formatting part to me ;)
 
gr8 topic.........will study it carefully l8er...i was really bored too study this and segmentation stuff......king boss reps pending..
 
hey nice one kk.

now some questions. Let's say the VMM controller is working at priv0. How is it actually controlling the memory in a multi OS environment (native virtualization)? eg does it limit the actual memory capacity available to each OS as a whole, or does it only come into effect when data is (or might be) put in memory allocated to another OS?

Btw is the allocation of memory dynamic? it should be imo based on requirements in a general VMM.
imo if why not build a cluster based on specific requirements(server farm)?
 
Well there are many approaches to what you asked.
VMWare for example, actually pages out the "physical" memory allocated to the guest OSes (Consider 3 levels of memory - hardware memory which is conventionally called physical memory, "physical memory" allocated to the guest OS, and "virtual memory" within the guest). IIRC Xen does not page out the physical memory allocated to a guest, so each guest is limited to a subset of the actual physical memory.
But this concept of hardware memory (ie 3 levels rather than the traditional bi-level arrangement) is what allows various policies. In the 3rd part of this series I will explain Xen in detail - and then you will get the true idea behind Virtualization. Xen and Denali are amazing in their approach, and it is surprising that it was not thought of much before.
However, this much is universal - each guest OS "sees" a fixed amount of "physical memory", set at guest boot time.
 
hmm.. k look forward to the other articles then ;)

while i had heard about this, but took up a different field to study so had lost track. nice to read it in a concise manner again :)
 
I just came across this. While I haven't read it yet, I'm going to certainly give it a shot in the next couple of days. So I'll save my comments for then :)
 