Tuesday, June 1, 2010

How Hardware Virtualization Works (Part 3)

This is the third in a series of posts about how hardware virtualization works. Start with Part 1 for the context.

Translate, Trap and Map

The basic Trap and Map technique described previously depends crucially on a hardware feature: the hardware must be able to trap on every instruction that could affect other virtual machines. Prior to the introduction of Intel's and AMD's specific additional hardware virtualization support, that was not true. For example, setting the real-time clock was, in fact, not a trappable instruction; it wasn't even restricted to supervisors. (Note: not all Intel processors have virtualization support today; this is apparently done to segment the market.)

Yet VMware and others did provide, and continue to provide, hardware virtualization on such older systems. How? By using a load-time binary scan and patch. (See figure below.) Whenever a section of memory was marked executable – making that marking was, thankfully, trappable – the hypervisor would immediately scan the executable binary for troublesome instructions and replace each one with a trap instruction. In addition, of course, it augmented the bag o' bits for that virtual machine with information saying what each of those traps was supposed to do originally.
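The scan-and-patch idea can be sketched in a few lines. This is a toy simulation, not real x86 machine code: the opcode values, the `SET_CLOCK` name, and the one-byte-per-instruction encoding are all invented for illustration. A real hypervisor has to decode genuine variable-length instructions, which is far messier.

```python
# Toy simulation of load-time binary scan-and-patch. The opcodes here
# are invented; a real hypervisor scans actual x86 machine code.

SET_CLOCK = 0xA1   # hypothetical privileged instruction that does NOT trap
TRAP      = 0xCC   # stand-in for a trap opcode (in spirit, like x86 INT3)

def patch_executable(code):
    """Replace each troublesome instruction with a trap, and remember
    what the original was so the trap handler can emulate it later."""
    patched = bytearray(code)
    trap_table = {}                  # address -> original opcode to emulate
    for addr, op in enumerate(code):
        if op == SET_CLOCK:
            patched[addr] = TRAP
            trap_table[addr] = op
    return bytes(patched), trap_table

code = bytes([0x01, SET_CLOCK, 0x02])
patched, traps = patch_executable(code)
print(patched.hex())   # -> 01cc02: the SET_CLOCK byte is now a trap
print(traps)           # the hypervisor's record of what to emulate where
```

The `trap_table` plays the role of the "bag o' bits" above: when the trap fires at run time, the hypervisor looks up the address, sees what the guest was really trying to do, and emulates it safely.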

Now, many software companies are not fond of the idea of someone else modifying their shipped binaries, and can even get sticky about things like support if that is done. Also, my personal reaction is that this is a horrendous kluge. But it is a necessary kluge, needed to get around hardware deficiencies, and it has proven to work well in thousands, if not millions, of installations.

Thankfully, it is not necessary on more recent hardware releases.


Whether or not the hardware traps all the right things, there is still unavoidable overhead in hardware virtualization. For example, think back to my prior comments about dealing with virtual memory. You can imagine the complex hoops a hypervisor must repeatedly jump through when the operating system in a client machine is setting up its memory map at application startup, or adjusting the working sets of applications by manipulating its map of virtual memory.

One way around overhead like that is to take a long, hard look at how prevalent you expect virtualization to be, and seriously ask: Is this operating system ever really going to run on bare metal? Or will it almost always run under a hypervisor?

Some operating system development streams decided the answer to that question is: No bare metal. A hypervisor will always be there. Examples: Linux with the Xen hypervisor, IBM AIX, and of course the IBM mainframe operating system z/OS (no mainframe has been shipped without virtualization since the mid-1980s).

If that's the case, things can be more efficient. If you know a hypervisor is always really behind memory mapping, for example, provide an actual call to the hypervisor to do things that have substantial overhead. For example: Don't do your own memory mapping, just ask the hypervisor for a new page of memory when you need it. Don't set the real-time clock yourself, tell the hypervisor directly to do it. (See figure below.)

This technique has become known as paravirtualization, and can lower the overhead of virtualization significantly. A set of "para-APIs" invoking the hypervisor directly has even been standardized, and is available in Xen, VMware, and other hypervisors.
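The paravirtual call path can be sketched as a dispatch on a hypercall number. This is a toy; the hypercall numbers, names, and semantics here are invented and do not match Xen's or anyone's real interface, but the shape is the same: the guest OS calls the hypervisor directly instead of touching hardware and waiting to be trapped.

```python
# Toy paravirtual hypercall interface. Numbers and names are invented;
# real hypercall ABIs (e.g. Xen's) are register-based and more involved.

HV_SET_CLOCK = 1
HV_GET_PAGE  = 2

free_pages = [7, 3, 9]      # real pages the hypervisor can hand out
clock = {"time": 0}

def hypercall(number, arg=None):
    """A paravirtualized guest OS calls this directly: one call,
    no trap-decode-emulate dance."""
    if number == HV_SET_CLOCK:
        clock["time"] = arg          # hypervisor touches the real clock
        return 0
    if number == HV_GET_PAGE:
        return free_pages.pop()      # hand the guest a fresh real page
    return -1                        # unknown hypercall

hypercall(HV_SET_CLOCK, 1275350400)
page = hypercall(HV_GET_PAGE)
print(clock["time"], page)
```

Contrast this with the scan-and-patch path: there, the hypervisor must intercept an instruction the guest didn't know was special; here, the guest asks for exactly what it wants, which is why the overhead drops so much.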

The concept of paravirtualization actually dates back to around 1973 and the VM operating system developed at the IBM Cambridge Scientific Center. They had the not-unreasonable notion that the right way to build a time-sharing system was to give every user his or her own virtual machine, a notion somewhat like today's virtual desktop systems. The operating system running in each of those VMs used paravirtualization, but it wasn't called that back in the Computer Jurassic.

Virtualization is, in computer industry terms, a truly ancient art.

The next post covers the lowest-overhead technique used in virtualization, then input/output, and draws some conclusions. (Link will be added when it is posted.)


kme said...

Why do you say that setting the real time clock was not a trappable instruction?

Setting the real time clock involves executing OUT to IO ports 0x70 and 0x71. The IOPL field in EFLAGS allows the OUT instruction to be restricted to ring 0 code. Attempts by lower privileged code to execute it will cause an exception that traps into the hypervisor. This has been the case since the 80386.

Greg Pfister said...

Hi, kme.

There I was referring to the past, prior to Intel VT-x (Vanderpool) and AMD-V (Pacifica), when there was no "ring -1" to run the hypervisor in.

In those days, yes, you could (and almost always did) restrict such instructions to ring 0, but that didn't help -- the guest OS was already in ring 0, and could execute the instruction directly. So there was no way for the hypervisor (also in ring 0) to grab control when needed.

It's kind of amazing to me that VMware and Xen ran on such systems (Xen used paravirtualization, VMware translate-and-trap), but there was a commercial need, with people willing to buy it, so run they did.

Thanks for the comment! It made me go back and check out the history in detail, to be sure I was right, which is useful. Reference to the situation in those bygone days, just when VT and AMD-V were being introduced:


('ware wrapping!)

- Greg

kme said...

Hi Greg,

Thanks for the reply. My understanding is that in VMware's case, even with binary translation the guest OS still runs in ring 3. Binary translation is only used for the instructions that behave differently (rather than faulting) in rings above 0 - for example, if the guest OS uses MOV to examine its CS selector. The RTC wouldn't be one of these.

(I would think that if the guest OS is run in actual ring 0, then preventing a hypervisor escape is essentially impossible).

Sassa said...

I would have thought that binary patching helps reduce the overhead by replacing HW trapping with SW trapping. I don't know what VMware does, but it seems very hard to permit any kind of code to run in ring 0 safely with just binary patching - think of self-modifying code, like compressed executables or code emitted by JIT compilers. There are many ways to set up code execution before the page contains executable code.

I would have thought the hypervisor will permit the guests to run in ring > 0 and that way trap instructions requiring ring-0 privileges (setting up protected mode permissions, memory maps, IO, interrupt handlers). Actually, you could do binary patching just when such instruction traps for the first time - then you don't need to worry about misunderstanding the code and patching the wrong thing.

The other problem with virtualizing IO is that the hypervisor needs to understand the device protocol, so that the guest isn't preempted in the middle of setting up the IO (and another guest getting a device in an inconsistent state).
