This is actually a seminar which I presented some months ago. I will post it here as a multi-part series.
Part 1 of the series addresses two questions: what is virtualization, and why is it so difficult?
1. INTRODUCTION
1.1 Definition
Virtualization is a framework or methodology of dividing the resources of a computer into multiple execution environments, by applying one or more concepts or technologies such as hardware and software partitioning, time-sharing, partial or complete machine simulation, emulation, quality of service, and many others. Each virtual machine runs an independent operating system instance. In other words, virtualization creates multiple “execution instances”. While virtualization generally involves creating virtual machines having an identical architecture to the underlying physical machine, in some cases, the underlying architecture may be completely different from the virtual architecture. In such a case, the virtual machine monitor (the software which provides the illusion of virtual machines) is acting as an emulator.
1.2 Requirements for a virtualizable architecture:
The most basic Popek and Goldberg requirement [1] for virtualization of a computing platform (processor architecture) is that every access to privileged state from user mode must cause a trap. Any architecture that satisfies this requirement is natively virtualizable.
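As a rough illustration of this requirement, the sketch below models an instruction set as a table that records, for each instruction, whether it touches privileged state and whether it traps in user mode. The names `Instr`, `non_trapping_sensitive` and `is_natively_virtualizable`, and the three-entry sample table, are assumptions made for illustration; this is not a complete model of any real processor.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    sensitive: bool           # reads or writes privileged state
    traps_in_user_mode: bool  # faults when executed outside kernel mode


def non_trapping_sensitive(isa):
    """Return the names of instructions that touch privileged state without trapping."""
    return [i.name for i in isa if i.sensitive and not i.traps_in_user_mode]


def is_natively_virtualizable(isa):
    """The requirement holds only if no sensitive instruction escapes the trap."""
    return not non_trapping_sensitive(isa)


# Hypothetical, heavily simplified excerpt of IA-32 (see section 2.3 for SGDT).
IA32_SAMPLE = [
    Instr("ADD", sensitive=False, traps_in_user_mode=False),
    Instr("LGDT", sensitive=True, traps_in_user_mode=True),
    Instr("SGDT", sensitive=True, traps_in_user_mode=False),
]

print(non_trapping_sensitive(IA32_SAMPLE))
print(is_natively_virtualizable(IA32_SAMPLE))
```

A single non-trapping sensitive instruction is enough to break the property, which is why the IA-32 instructions discussed in section 2 matter so much.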
1.3 Role of a virtual machine monitor:
A virtual machine monitor is a software system that partitions a single physical machine into multiple virtual machines. Virtualizing a computer environment involves virtualizing three components: processor, memory and I/O. This is similar to the way multitasking operating systems work: each process functions as if it were the only process executing on the machine (ignoring IPC for the purpose of this discussion). To provide this illusion, the OS, running in kernel mode on behalf of the process, must provide certain facilities. The task scheduler allocates time to each process, and activates and deactivates it accordingly. Each process’s memory space is isolated and protected by the OS, using the hardware memory management unit. I/O is always routed through the kernel via system calls. The illusion of isolation is possible because the OS operates in kernel mode and can place restrictions on the actions of user mode processes. The OS, therefore, traditionally has full control over the hardware of a computer.
However, when multiple operating systems run in concurrent virtual machines (or even when a single guest OS runs in a VM hosted within another OS), this assumption no longer holds. The virtual machine monitor (VMM) must run at a privilege level above the guest operating systems. This is necessary to protect and isolate the virtual machines from each other. For example, while each OS traditionally has full access to the entire range of memory, in a VM such access must be limited to a well-defined subset. To ensure this, the VMM must either 1) take over all memory allocation for each guest OS, which would be a slow emulation process causing many incompatibilities, or 2) permit guest operating systems to allocate memory by themselves, but validate all changes to their segmentation/paging tables so as to ensure that they do not access memory outside the permitted range. Similarly, each guest operating system believes it has exclusive control of an I/O device. The VMM must provide this illusion even though it is not actually true. All I/O accesses must therefore also be monitored by the VMM.
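The second option can be sketched as a check that the VMM applies whenever a guest changes its page tables. The code below is a minimal sketch under stated assumptions: 4 KB pages, one contiguous range of physical frames per guest, and hypothetical names (`GuestMemoryPolicy`, `rejected_mappings`). A real VMM would have to decode actual page table entries rather than (virtual, physical) address pairs.

```python
PAGE_SIZE = 4096  # assumed page size


class GuestMemoryPolicy:
    """The range of physical frames one guest is permitted to map."""

    def __init__(self, first_frame, frame_count):
        self.first_frame = first_frame
        self.frame_count = frame_count

    def allows(self, physical_address):
        frame = physical_address // PAGE_SIZE
        return self.first_frame <= frame < self.first_frame + self.frame_count


def rejected_mappings(policy, new_entries):
    """Return the (virtual, physical) mappings that fall outside the guest's range."""
    return [(v, p) for v, p in new_entries if not policy.allows(p)]


# Hypothetical guest that owns frames 256..511 (physical 0x100000..0x1FFFFF).
policy = GuestMemoryPolicy(first_frame=256, frame_count=256)
update = [(0x0000, 0x100000), (0x1000, 0x200000)]
print(rejected_mappings(policy, update))
```

The guest keeps control over how it lays out its own address space; the VMM only vetoes mappings that point outside the memory assigned to that guest.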
The virtual machine monitor can run as a process of a host operating system (a hosted implementation), in which case it runs partly in kernel mode and partly in user mode (the VMs). It can also function as a standalone unit, not hosted on any operating system, in which case it can no longer make use of the facilities provided by a host (I/O, drivers, power management etc.) and must take care of all such issues itself. This seminar concentrates on hosted solutions, as they focus more on virtualization issues than on hardware issues, and are also more challenging to develop, as they must generally modify or hook into the host operating system in some way.
1.4 Challenges in virtualization:
Successful partitioning of a machine to support the concurrent execution of multiple operating systems poses several challenges.
- Virtual machines must be isolated from one another: it is not acceptable for the execution of one to adversely affect the performance of another.
- It is necessary to support a variety of different operating systems to accommodate the heterogeneity of popular applications.
- The performance overhead introduced by virtualization should be small.
2. DIFFICULTIES IN VIRTUALIZATION ON THE IA-32/X86 ARCHITECTURE
IA-32/x86 microprocessors provide protection based on the concept of a 2-bit privilege level, 0 being most privileged and 3 least privileged [2]. For an OS to control the CPU, some of its modules must run at privilege level 0. As explained in 1.3 above, a guest OS cannot be granted such control and cannot execute at privilege level 0. Thus, IA-based VMMs must use ring deprivileging, a technique that runs all guest software at a privilege level greater than 0. There are two possible schemes: guest operating systems can run either at privilege level 1 (the 0/1/3 model) or at privilege level 3 (the 0/3/3 model). Although the 0/1/3 model supports simpler VMMs, it cannot be used for guests running in 64-bit mode on x86-64 processors, as segment limits are not enforced in that mode [3] (see 2.6). Ring deprivileging causes numerous virtualization challenges [4].
2.1 Ring aliasing.
Ring aliasing refers to problems that arise when software is run at a privilege level other than the level for which it was written. An example in IA-32 is the PUSH instruction (which pushes its operand onto the stack) when executed with the CS register, part of which holds the current privilege level (CPL). In this case a guest OS could easily determine that it is not running at privilege level zero. Other examples relate to I/O, IOPL, VERR etc., where the CPL is factored into the checks performed.
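To make the PUSH CS example concrete, the sketch below decodes a 16-bit segment selector value as a guest might after pushing CS and reading it back from the stack. The selector layout (bits 0-1 privilege level, bit 2 table indicator, bits 3-15 index) follows the IA-32 definition; the function names and the sample selector values are hypothetical.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into its three fields."""
    return {
        "index": selector >> 3,                        # descriptor table index
        "table": "LDT" if selector & 0x4 else "GDT",   # table indicator bit
        "rpl": selector & 0x3,                         # for CS, this is the CPL
    }


def reveals_deprivileging(pushed_cs):
    """A kernel written for ring 0 sees a non-zero CPL in its own CS value."""
    return (pushed_cs & 0x3) != 0


# Hypothetical values: the same code segment seen natively and under a 0/1/3 VMM.
native_cs = 0x08
deprivileged_cs = 0x09
print(decode_selector(native_cs), reveals_deprivileging(native_cs))
print(decode_selector(deprivileged_cs), reveals_deprivileging(deprivileged_cs))
```

Because PUSH CS does not trap, the VMM gets no opportunity to substitute the value the guest expects.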
2.2 Address-space compression.
Operating systems are designed to have access to the full range of the linear address space. A VMM must reserve for itself some portion of the guest’s virtual address space. In the minimal case, the VMM requires a small portion of the guest’s linear address space (usually part of the higher address range) to be able to transition from the guest to the VMM. In the maximal case, it can use a large chunk, so as to be able to access the guest’s data easily. The VMM must prevent guest access to those portions of the guest virtual address space. Any guest attempt to access this region must trap to the VMM, which must then emulate the memory access.
2.3 Nonfaulting access to privileged state.
Privilege-based protection generally prevents unprivileged software from accessing privileged CPU state. In most cases, attempted accesses result in exceptions (this is the major Popek and Goldberg requirement). However, there are some IA-32 instructions that access privileged state but do not fault when executed in non-kernel mode. For example, the instructions LGDT, LIDT, LLDT and LTR execute only at privilege level zero. However, software can execute the equivalent read instructions SGDT, SIDT, SLDT and STR at any privilege level. Using these instructions, a guest OS might be able to determine that it is not running at full privilege level.
2.4 Adverse impacts on guest transitions.
Ring deprivileging interferes with the SYSENTER and SYSEXIT instructions that support low-latency system calls. SYSENTER always effects a transition to privilege level 0, and SYSEXIT will fault if executed outside that privilege level. Hence the VMM must emulate every guest execution of these two instructions.
2.5 Interrupt Virtualization.
IA-32 uses the interrupt flag (IF) in the EFLAGS register to control interrupt masking. VMMs restrict the ability of guests to enable/disable interrupts, so a guest will fault when it attempts to do so. However, since guests do this quite frequently, the resulting faults can cause a large performance drop. There are also virtual interrupts that must be delivered by the VMM only when the guest has unmasked interrupts.
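One way to picture this is a VMM that keeps a per-guest copy of the interrupt flag and a queue of pending virtual interrupts. The sketch below is a simplified model under several assumptions: the class and method names are hypothetical, every guest CLI/STI is assumed to fault into `emulate_cli`/`emulate_sti`, and delivery is modeled on an interrupt gate (which clears IF on entry), so the guest must re-enable interrupts before the next one is delivered.

```python
from collections import deque


class VirtualCPU:
    def __init__(self):
        self.virtual_if = False  # the guest's view of EFLAGS.IF
        self.pending = deque()   # virtual interrupts awaiting delivery
        self.delivered = []      # vectors handed to the guest, in order

    def raise_virtual_interrupt(self, vector):
        self.pending.append(vector)
        self._deliver_if_unmasked()

    def emulate_cli(self):
        # Called after the guest's CLI faulted into the VMM.
        self.virtual_if = False

    def emulate_sti(self):
        # Called after the guest's STI faulted into the VMM.
        self.virtual_if = True
        self._deliver_if_unmasked()

    def _deliver_if_unmasked(self):
        if self.virtual_if and self.pending:
            self.delivered.append(self.pending.popleft())
            self.virtual_if = False  # interrupt-gate behaviour: IF cleared on entry


cpu = VirtualCPU()
cpu.raise_virtual_interrupt(32)  # hypothetical vector; guest has interrupts masked
cpu.emulate_sti()                # guest unmasks, the pending interrupt is delivered
print(cpu.delivered)
```

Every `emulate_cli` and `emulate_sti` call in this model corresponds to a fault and a round trip through the VMM, which is where the performance cost described above comes from.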
2.6 Ring Compression.
Ring deprivileging uses privilege-based mechanisms to protect the VMM from guest software. Segment limits do not apply to the 64-bit mode of x86-64 processors (AMD64/EM64T extensions). Hence paging must be used in this mode to limit access to address ranges. Since IA-32 paging does not distinguish between levels 0-2 (there are only two levels – user/supervisor), the guest OS must run at privilege level 3. Thus the guest OS will run at the same privilege level as guest applications. This is called ring compression. Figure 1 describes this phenomenon.
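The reasoning above can be condensed into a few lines. In this sketch, `paging_level` and `paging_protects_vmm_from` are hypothetical helper names, and VMM pages are assumed to be marked supervisor-only.

```python
def paging_level(ring):
    """Map a privilege level (0-3) to the only distinction paging makes."""
    if ring not in (0, 1, 2, 3):
        raise ValueError("privilege level must be 0..3")
    return "user" if ring == 3 else "supervisor"


def paging_protects_vmm_from(guest_ring):
    """Supervisor-only VMM pages are safe only from code that paging sees as user."""
    return paging_level(guest_ring) == "user"


# With paging as the only protection, a guest kernel at ring 1 can still reach VMM pages,
for ring in (1, 3):
    print(ring, paging_level(ring), paging_protects_vmm_from(ring))

# so the guest kernel is pushed to ring 3, the same level as guest applications.
guest_kernel_ring = guest_application_ring = 3
print(paging_level(guest_kernel_ring) == paging_level(guest_application_ring))
```

The last line shows the cost: once both run at ring 3, paging alone can no longer separate the guest kernel from its own applications either.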
2.7 Access to hidden CPU state.
Some components of CPU state are not represented in any software-accessible register. The descriptor caches for the segment registers are among these. There is no mechanism to save and restore these hidden components.
While further discussion will be restricted to the IA-32/x86/x86-64 architecture, the techniques described are applicable in one form or another to other architectures as well.
REFERENCES
[1] G.J. Popek and R.P. Goldberg, “Formal Requirements for Virtualizable Third-Generation Architectures”, Comm. ACM, July 1974, pp. 412-421.
[2] Intel Corp., “Intel Architecture Software Developer’s Manual Volume 1: Basic Architecture”.
[3] AMD Corp., “AMD64 Architecture Programmer’s Manual Volume 2: System Programming”.
[4] Rich Uhlig et al., “Intel Virtualization Technology”, Computer, May 2005, pp. 48-56.