Thursday, May 27, 2010

How Hardware Virtualization Works (Part 2)

This is the second in a series of posts about how hardware virtualization works. See Part 1 to catch it from the start.

The Goal

The goal of hardware virtualization is to maintain, for all the code running in a virtual machine, the illusion that it is running on its own, private, stand-alone piece of hardware. What a provider is giving you is a lease on your own private computer, after all.

"All code" includes all applications, all middleware like databases or LAMP stacks, and crucially, your own operating system, including the ability to run different operating systems, like Windows and Linux, on the same hardware, simultaneously. Hence: isolation of virtual machines from each other is key. Each should think it still "owns" all of its own hardware.

The illusion isn't always perfect. With sufficient diligence, operating system code can figure out that it isn't running on bare metal. Usually, however, that happens only when code is written with the specific aim of finding that out.

Trap and Map

The basic technique used is often referred to as "trap and map." Imagine you are a thread of computation in a virtual machine, running on one processor of a multiprocessor that is also running other virtual machines.

So off you go, pounding away, directly executing instructions on your own processor, running directly on bare hardware. There is no simulation or, at this point, software of any kind involved in what you are doing; you manipulate the real physical registers, use the real physical adders, floating-point units, cache, and so on. You are running asfastas thehardwarewillgo. Fastasyoucan. Poundingoncache, playingwithpointers, keepinghardwarepipelinesfull, until…


You attempt to execute an instruction that would change the state of the physical machine in a way that would be visible to other virtual machines. (See the figure nearby.)

Just altering the value in your own register file doesn't do that, and neither does, for example, writing into your own section of memory. That's why you can do such things at full-bore hardware speed.

Suppose, however, you attempt to do something like set the real-time clock, the one master real-time clock for the whole physical machine. Having that clock altered out from under other running virtual machines would not be very good at all for their health. You aren't allowed to do things like that.

So, BAM, you trap. You are wrenched out of user mode, or out of supervisor mode, up into a new, higher privilege mode; call it hypervisor mode. There, the hypervisor looks at what you wanted to do (change the real-time clock) and looks in a bag of bits it keeps that holds the description of your virtual machine. In particular, it grabs the value recording the offset between the hardware real-time clock and your real-time clock, alters that offset appropriately, returns the appropriate settings to you, and gives you back control. Then you start runningasfastasyoucan again. If you later read the real-time clock, the analogous sequence happens, adding that stored offset to the value in the hardware real-time clock.
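The clock bookkeeping above can be sketched as a toy model. This is purely illustrative: the class and method names are invented for the sketch, and a real hypervisor does this in trap handlers on real hardware, not in Python.

```python
class VMState:
    """The hypervisor's 'bag of bits' describing one virtual machine."""
    def __init__(self):
        self.clock_offset = 0  # this VM's clock minus the hardware clock

class Hypervisor:
    def __init__(self):
        self.hw_clock = 1000   # the one master real-time clock (stand-in value)
        self.vms = {}          # vm_id -> VMState

    def create_vm(self, vm_id):
        self.vms[vm_id] = VMState()

    def trap_set_clock(self, vm_id, wanted):
        # Guest tried to set the clock. Don't touch the hardware clock;
        # record an offset in this VM's state instead.
        self.vms[vm_id].clock_offset = wanted - self.hw_clock

    def trap_read_clock(self, vm_id):
        # Guest reads its clock: hardware value plus the stored offset.
        return self.hw_clock + self.vms[vm_id].clock_offset

hv = Hypervisor()
hv.create_vm("a")
hv.create_vm("b")
hv.trap_set_clock("a", 5000)       # VM "a" sets its clock forward
print(hv.trap_read_clock("a"))     # 5000: VM "a" sees its own setting
print(hv.trap_read_clock("b"))     # 1000: VM "b" is untouched
```

The point of the sketch: the guest's "write" never reaches the shared hardware clock, so the other virtual machines' view of time is undisturbed.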

Not every such operation is as simple as computing an offset, of course. For example, a client virtual machine's supervisor attempting to manipulate its virtual memory mapping is a rather more complicated case to deal with, a case that involves maintaining an additional layer of mapping (kept in the bag 'o bits): A map from the hardware real memory space to the "virtually real" memory space seen by the client virtual machine. All the mappings involved can be, and are, ultimately collapsed into a single mapping step; so execution directly uses the hardware that performs virtual memory mapping.
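That collapsing of layers can be shown in miniature. In this hypothetical sketch, both mapping layers are just dictionaries of page numbers; the composed "shadow" table is what the hypervisor would hand to the real memory-mapping hardware so that translation stays a single step.

```python
# Guest's own page table: guest-virtual page -> "virtually real" page.
guest_page_table = {0: 7, 1: 3}

# Hypervisor's layer (from the bag 'o bits): "virtually real" page -> hardware page.
host_map = {7: 42, 3: 9}

# Collapse the two layers into one mapping the real MMU can use directly.
shadow = {gv: host_map[gr] for gv, gr in guest_page_table.items()}

print(shadow)  # {0: 42, 1: 9}: one translation step at hardware speed
```

When the guest edits its page table, it traps; the hypervisor recomputes the affected shadow entries and resumes the guest, which then runs translations at full hardware speed again.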

Concerning Efficiency

How often do you BAM? Unhelpfully, this is clearly application-dependent. But the answer in practice, setting aside input/output for the moment, is: not often at all. It's usually a small fraction of the total time spent in the supervisor, which itself is usually a small fraction of the total run time. As a coarse guide, think in terms of overhead that is well under 5%, or in other words, for most purposes, negligible. I/O-intensive programs can see substantially higher numbers, though, unless you have access to the very latest in hardware virtualization support; then it's negligible again. A little more about that later.

I originally asked you to imagine you were a thread running on one processor of a multiprocessor. What happens when this isn't the case? You could be running on a uniprocessor, or, as is commonly the case, there could be more virtual machines than physical processors or processor hardware threads. For such cases, hypervisors implement a time-slicing scheduler that switches among the virtual machine clients. It's usually not as complex as the schedulers in modern operating systems, but it suffices. This might be pointed to as a source of overhead: you're only getting a fraction of the whole machine! But assuming we're talking about a commercial server, you were only using 12% or so of it anyway, so that's not a problem. A more serious problem arises when you have less real memory than all the virtual machines need; virtualization does not reduce aggregate memory requirements. But with enough memory, many virtual machines can be hosted on a single physical system with negligible degradation.

The next post covers more of the techniques used to do this, getting around some hardware limitations (translate/trap/map) and efficiency issues (paravirtualization). (Link will be added when it is posted.)


Heshsham said...

Nice explanation!
By the way why is GPU virtualization far more difficult?

Greg Pfister said...

Hi, Heshsham. Thanks!

Assuming by GPU virtualization you mean cutting up a single GPU attached to a server, giving part to each of several VMs running on the server, then... no.

Since GPUs are IO devices, this is IO virtualization; see part 4 of this set of posts about that. It can be done and will be, eventually, but in general isn't there yet.

Part 4 is at .

Also, I don't think GPUs are likely to do the kind of trap-and-map time slicing discussed here. They're more likely to be partitioned, with separate parts dedicated to the VMs they serve; that's how most IO devices that do this do it now (mainly network adapters).

This was briefly touched on in part 1 of this series. It's expanded in a later post about the varieties of virtualization: .
