Friday, June 19, 2009

Why Virtualize? A Primer for HPC Guys

Some folks who primarily know High-Performance Computing (HPC) think virtualization is a rather silly thing to do. Why add another layer of gorp above a perfectly good CPU? Typical question, which I happened to get in email:

I'm perplexed by the stuff going on the app layer -

first came the chip + programs

then came - chip+OS+ applications

then came - Chip+Hypervisor+OS+applications

So for a single unit of compute, the capability keeps decreasing while extra layers are added again and again. How does this help?

I mean, virtualization came about for consolidation, thus reducing the price of the H/W. Is it now being associated with something else?

In answering that question, I have two comments:

First Comment:

Though this really doesn't matter to the root question, you're missing a large bunch of layers. After your second should be

chip + OS + middleware + applications

Where middleware expands to many things: Messaging, databases, transaction managers, Java Virtual Machines, .NET framework, etc. How you order the many layers within middleware, well, that can be argued forever; but in any given instance they obviously do have an order (often with bypasses).

So there are many more layers than you were considering.

How does this help in general? The usual way infrastructure software helps: It lets many people avoid writing everything they need from scratch. But that's not what you're really asking, which was why you would need virtualization in the first place. See below.

Second Comment:

What hypervisors -- really, virtual machines; hypervisors are one implementation of that notion -- do is more than consolidation. Consolidation is, to be sure, the killer app of virtualization; it's what put virtualization on the map.

But hypervisors, in particular, do something else: They turn a whole system configuration into a bag of bits, a software abstraction decoupled from the hardware on which it is running. A whole system, ready to run, becomes a file. You can store it, copy it, send it somewhere else, publish it, and so on.

For example, you can:

  • Stop it for a while (like hibernate: a snapshot of the running state, not of the disk contents).
  • Restart on the same machine, for example after hardware maintenance.
  • Restart on a different machine (e.g., VMware's VMotion; others have it under different names).
  • Copy it – deploy additional instances. This is a core technology of cloud computing that enables "elasticity." (That, and apps structured so this can work.)
  • By adding an additional layer, run it on a system with a different architecture from the original.
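
To make the "system becomes a file" idea concrete, here is a minimal toy sketch in Python. It is not how any real hypervisor works: ToyVM, suspend, resume and clone are hypothetical names invented for illustration, and the "machine state" is just a counter and a list. The only point is that once the state is a bag of bits, stopping, restarting and copying are ordinary file-style operations.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ToyVM:
    """Hypothetical stand-in for a whole running system (not a real VM)."""
    name: str
    program_counter: int = 0
    memory: list = field(default_factory=list)

    def step(self) -> None:
        # "Run" one instruction: store a result and advance the counter.
        self.memory.append(f"result-{self.program_counter}")
        self.program_counter += 1


def suspend(vm: ToyVM) -> bytes:
    """Freeze the VM into bytes that can be stored, copied, or sent elsewhere."""
    return json.dumps(asdict(vm)).encode("utf-8")


def resume(image: bytes) -> ToyVM:
    """Rebuild a runnable VM from a stored image, on this host or another."""
    return ToyVM(**json.loads(image.decode("utf-8")))


def clone(image: bytes, new_name: str) -> ToyVM:
    """Deploy an additional instance from the same image."""
    vm = resume(image)
    vm.name = new_name
    return vm


vm = ToyVM("web-1")
vm.step()
vm.step()
image = suspend(vm)               # stop it for a while
restored = resume(image)          # restart, possibly on a different machine
replica = clone(image, "web-2")   # copy it: deploy another instance
restored.step()                   # the restored VM picks up where it left off
```

The restored instance continues from the saved program counter, while the replica starts from the same saved point under a new name; neither depends on the object that was originally running.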

Most of these things have their primary value in commercial computing. The classic HPC app is a batch job: Start it, run it, it's done. Commercial computing's focus nowadays tends to be: Start it, run it, run it, run it, keep it running even though nodes go down, keep it running through power outages, earthquakes, terrorist strikes, … Think web sites, or, before them, transaction systems to which bank ATMs connect. Not that commercial batch doesn't still exist, nor is it less important; payrolls are still met, although there's less physical check printing now. But continuously operating application systems have been a focus of commercial computing for quite a while now.

Of course, some HPC apps are more continuous, too. I'm thinking of analyzing continuous streams of data, e.g., environmental or astronomical data that is continuously collected. Things like SETI@home or Folding@home could use it, too, running beside interactive use in a separate machine, continually, unhampered by your kids following dubious links and getting viruses/trojans. But those are in the minority, so far. Grossly enormous-scale HPC with 10s or 100s of thousands of nodes will soon have to think in these terms as the mean time between failures (MTBF) of the system as a whole drops below the run time of jobs, but there it's an annoyance, a problem, not an intentional positive aspect of the application like it is for commercial.
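
A back-of-the-envelope sketch of why node count matters for failures. It assumes nodes fail independently at a constant rate, so the failure rates add and the whole-system MTBF is roughly the per-node MTBF divided by the number of nodes. The five-year per-node figure is a made-up number for illustration, not a measurement.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Expected time until the first failure anywhere in the system.

    Assumes independent nodes with constant failure rates, so the rates
    add and the system MTBF is the node MTBF divided by the node count.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


# Hypothetical figure: each node fails about once every five years.
node_mtbf = 5 * HOURS_PER_YEAR

for nodes in (1_000, 10_000, 100_000):
    hours = system_mtbf_hours(node_mtbf, nodes)
    print(f"{nodes:>7} nodes: a failure about every {hours:.2f} hours")
```

Under those assumptions, a 100,000-node system sees some node fail roughly every 26 minutes, far shorter than a typical long-running job.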

Grid application deployment could do it à la clouds, but except for hibernate-style checkpoint/restart, I don't see that any time soon. They effectively have a kind of virtualization already, at a higher level (like Microsoft does its cloud virtualization in Azure above and assuming .NET).

The traditional performance cost of virtualization is anathema to HPC, too. But that's trending to near zero. With appropriate hardware support (Intel VT, AMD-V, similar things from others), it has gone away for processor & memory. It's still there for IO, but that can be fixed; IBM zSeries native IO has had essentially no overhead for years. The rest of the world will have to wait for PCIe to finalize its virtualization story and for IO vendors to implement it in their devices; that will come about in a patchy manner, I'd predict, with high-end communication devices (like InfiniBand adapters) leading the way.

So that's what virtualization gets you: isolation (for consolidation) and abstraction from hardware into a manipulable software object (for lots else). Also, security, which I didn't get into here.

Not much for most HPC today, as I read it, but a lot of value for commercial.


Anonymous said...

Greg, I see it taking some time before native I/O virtualization works robustly in networking/storage/IPC adapters, via PCI-SRIOV. The main challenge is that once you let a virtual machine talk directly to the adapter, the latter must include fault isolation mechanisms, to ensure that a crashing VM does not take down the whole adapter (and therefore, system). This means that the adapter must scrupulously check each doorbell write, work descriptor content and scatter/gather list issued to it by a VM for 'correctness'. The definition of 'correctness' is a little slippery here and no one has attempted to define it. At minimum, it means "The adapter does not lock up, no matter what the VM fires at it." But should an adapter allow a rogue VM to put garbage on the wire (no lockups, but potentially serious trouble downstream)? Another budding standard, PCI-ATS, will have to be running robustly in the host's north bridge to screen out flaky host memory addresses given to the adapter by misbehaving VMs. Frankly, I see it being years before this stuff is ready for prime time.

Greg Pfister said...

I absolutely agree with your point that for IO virtualization to work within adapters, the adapters must be able to isolate VM use completely. You have a good list in there of what that means: No matter what any VM does, the adapter should (a) not lock up; (b) not affect any other partition’s operation – a superset of (a); (c) not violate the host’s memory partitioning; (d) not put utter garbage on the wire, like bad checksums.

These are stiff requirements. But they’re the same requirements CPU/memory virtualization meets now (maybe except for garbage on the wire). There’s no reason IO adapters can’t meet them, too; the technology is there.
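
As a rough illustration of requirement (c), here is a minimal sketch of the kind of check involved: reject any scatter/gather entry that is not wholly inside memory the submitting VM owns. This is a simplification in Python, not how SR-IOV or ATS hardware is actually specified; MemoryRegion, validate_scatter_gather, the max_entries limit and the addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """A contiguous range of host memory assigned to one VM (hypothetical)."""
    start: int
    length: int

    def contains(self, address: int, size: int) -> bool:
        # The buffer must be non-empty and lie entirely inside the region.
        return (size > 0
                and self.start <= address
                and address + size <= self.start + self.length)


def validate_scatter_gather(entries, allowed_regions, max_entries=64) -> bool:
    """Return True only if every (address, size) entry is inside memory the
    submitting VM owns. Bad input is rejected rather than acted upon."""
    if not 0 < len(entries) <= max_entries:
        return False
    for address, size in entries:
        if not any(r.contains(address, size) for r in allowed_regions):
            return False
    return True


# One VM owns addresses 0x1000 up to (but not including) 0x2000.
vm_a_memory = [MemoryRegion(start=0x1000, length=0x1000)]

good = [(0x1000, 0x200), (0x1800, 0x800)]
bad = [(0x1000, 0x200), (0x1F00, 0x200)]  # second entry runs past the region

print(validate_scatter_gather(good, vm_a_memory))
print(validate_scatter_gather(bad, vm_a_memory))
```

The check itself is cheap; the hard part is doing it for every request at line rate, for every VM, without any path that skips it.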

There is a hangup, of course: adapter vendors need to absorb a new skill set, and, more importantly, a new mind set. This will involve kicking and screaming. But the additional hardware cost, as in CPUs, really won’t be big. Someone just needs to break the dam, provide the support needed for fast IO, and get to charge a premium.

Will that take a long time? I'm not sure. The issue is psychology, not technology.

admiyo said...

Clear and concise, well written. As someone who has worked in both HPC and virtualization, I have had to make these points before, but not as coherently as you've laid it out here.

It is worth pointing out that the single system image discussion we had before is a form of virtualization as well, just the reverse of what we are talking about here. Basically, virtualization will allow you to break the 1-1 relationship between machine and OS instance.

Greg Pfister said...


About multi-system SSI, yes, “virtualization” could be and has been used to describe it, but I think such usage confuses the issue because multi-system SSI breaks the normal SMP programming model (as discussed in prior posts). We’ll have to agree to disagree.

Rob Peglar said...

I started in HPC in 1978 and this current debate/viewpoint is strangely reminiscent of the 'real vs. virtual memory' brouhaha. The 'real' advocates liked their base & bounds with fixed sized partitions, and no silly virtualization technique, no matter how robust, could sway them. OTOH, the virtual memory advocates saw it as a way to make multiprogramming easier and actually give batch jobs a larger memory 'space' than they had - read - could afford. The real guys did rollout/rollin and (in some cases) overlays, the virtual guys did paging.

So it is today. Those with infinite real resources to burn will burn them and chide those who don't have such resources. Those without infinite resources will virtualize them and find a way.

There is also perceived geek cred involved. The real guys believe they earn geek creds by doing all sorts of gyrations to control their real resources and their workloads. Likewise, the virtual guys believe they earn geek creds by eschewing these techniques and instead focus on slick hiding/lying methods. After all, virtualization is the fine art of lying to the layer above you :-)

But as far as accelerators go, there will always be a conflict between the standard and the proprietary. Remember, the accelerator folks are not standing still either, and even though the charts above show 5-years-and-out behavior, this assumes the accelerator does not change during that time.

Cat versus mouse - an ancient and honorable battle :-)

Digvijay "VJ" Singh Rathore said...

Greg, I appreciate your reply to my question in the email, and it does make sense, but again, as Rob Peglar says, it still seems like "Cat and Mouse". It's convenience vs. perfection on a compute, and it's here for the taking.
