Friday, January 23, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (5)

This is part 5 of a multi-post sequence on this topic, which began here. This is the crux: the final part, attempting to answer the question

Why Hasn't SSI Taken Over The World?

After last-minute productus interruptus pull-outs and half-hearted product introductions by so many major industry players – IBM, HP, Sun, Tandem, Intel, SCO – you have to ask what's wrong.

Semi-random events like business conditions, odd circumstances, and company politics always play a large part in whether any product sees the light of day, but over multiple attempts those should average out: It can't just be a massive string of bad luck. Can it? This stuff just seems so cool. It practically breathes breathless press-release hype. Why isn't it everywhere?

Well, I'm fully open to suggestions.

I will offer some possibilities here.

Marketing: A Tale of Focus Groups

In one of the last, biggest pushes for this within IBM, sometime around 2002, a real marketer got involved, got funding, and ran focus groups with several sets of customers in multiple cities to get a handle on who would buy it, and for how much. I was involved in the evaluation and listened to every dang cassette tape of every dang multi-hour session. I practically kicked those tapes in frustration at some points.

The format: After a meet-and-greet with a round of donuts or cookies or whatever, a facilitator explained the concept to the group and asked them "Would you buy this? How much would you pay?" They then discussed it among themselves, and gave their opinions.

It turned out that the customers in the focus groups divided neatly into two completely separate groups:

Naives

These were the "What's a cluster?" people. They ran a small departmental server of their own, but had never even heard of the concept of a cluster. They had no clue what a single system image was or why anybody would conceivably want one, and the short description provided certainly didn't enlighten them. This was the part that nearly had me kicking the tapes; I knew I could have explained it better. Now, however, I realize that what they were told was probably fairly close to the snippets they would have heard in the normal course of marketing and trade press articles.

Result: They wouldn't buy it, because they didn't understand what it was or why they should be interested.

Experts

These guys were sysadmins of their own clusters, and knew every last minute detail of the area. They'd built clusters (usually small ones), configured cluster databases, and kept them running come hell or high water with failover software. The Naives must have thought they were talking Swahili.

They got the point, they got it fast, and they got it deep. They immediately understood implications far beyond what the facilitator said. They were amazed that it was possible, and thought it was really neat, although some expressed doubt that anybody could actually make it work, as in "You mean this is actually possible? Geez."

After a short time, however, all of them, every last one, zeroed in on one specific flaw: Operating system updates are not HA (highly available), because in many cases you can't run different fix levels on different nodes simultaneously. Sometimes it can work, but not always. These folks did rolling OS upgrades all the time – update one node's OS, see if it falls over, then move on to the next node, and so on; it's a standard way to avoid planned outages.

Result: They wouldn't buy it either, because, as they repeatedly said, they didn't want to go backwards in availability.

That was two out of two. Nobody would buy it. It's difficult to argue with that market estimate.

But what if those problems were fixed? The explanation can be fixed, for sure; as I said, I all but kicked the cassettes in frustration. Doing kernel updates without an outage is pretty hard, but in all but the worst cases it could also be done, with enough effort.

Even were that done, I'm not particularly hopeful, for the several other reasons discussed below.

Programming Model

As Seymour Cray said of virtual memory, "Memory is like an orgasm: It's better when you don't have to fake it." (Thank you, Eugene Miya and his comp.sys.super FAQ.)

That remains true when it's shared virtual memory bouncing between nodes, even though no disk accesses are likely involved. Even were an application or framework written to scale up in shared-memory multiprocessor style over many multi-cores in many nodes – and significant levels of multiprocessor scaling will likely be achieved as the single-chip core count rises – that application is going to perform much better if it is rewritten to partition its data so each node can do nearly all of its accesses into local memory.
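To make that concrete, here's a minimal sketch in C of the partitioned style, using pthreads on a single node as a stand-in for processes on many nodes (all names are hypothetical): each worker sums only its own contiguous slice, so nearly every access is local, with one small combine step at the end. Under SSI-style distributed shared memory, the unpartitioned alternative – every worker chasing data all over one shared structure – would turn each stray access into a page bouncing between nodes.

/* Sketch of data partitioning: each worker touches only its own slice.
 * Single-node pthreads stand-in for the multi-node case; on a DSM/SSI
 * system, accesses outside a worker's slice would bounce pages between
 * nodes. All names here are hypothetical. */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define NELEMS   (1 << 20)

static double data[NELEMS];

struct slice { size_t lo, hi; double sum; };

static void *partial_sum(void *arg)
{
    struct slice *s = arg;
    double sum = 0.0;
    for (size_t i = s->lo; i < s->hi; i++)   /* all accesses stay local */
        sum += data[i];
    s->sum = sum;
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    struct slice s[NWORKERS];
    size_t chunk = NELEMS / NWORKERS;

    for (size_t i = 0; i < NELEMS; i++)
        data[i] = 1.0;

    for (int w = 0; w < NWORKERS; w++) {
        s[w] = (struct slice){ w * chunk, (w + 1) * chunk, 0.0 };
        pthread_create(&tid[w], NULL, partial_sum, &s[w]);
    }

    double total = 0.0;
    for (int w = 0; w < NWORKERS; w++) {     /* one small combine step */
        pthread_join(tid[w], NULL);
        total += s[w].sum;
    }
    printf("total = %.0f\n", total);
    return 0;
}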

But hey, why bother with all that work? Lots of applications have already been set up to scale on separate nodes, so why not just run multiple instances of those applications, and tune them to run on separate nodes? It achieves the same purpose, and you just run the same code. Why not?

Because most of the time it won't work. They haven't been written to run as multiple copies on the same OS. Apache is the poster child for this. Simple, silly things get in the way, like using the same names for temp files and other externally-visible entities. So you modify the file system, letting each instance have its own /tmp and other files… But now you've got to find all those cases.
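To sketch the kind of collision involved – the application and file names here are hypothetical – the first fopen() below is the bug, and mkstemp() is the standard single-node fix of generating an instance-unique name:

/* Two instances of the same app on one OS clobber each other's
 * scratch file if the name is hard-coded; mkstemp() gives each
 * instance its own. App and file names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* The problem: every instance uses the same well-known name,
     * so instance 2 silently clobbers instance 1's data. */
    FILE *bad = fopen("/tmp/myapp.scratch", "w");
    if (bad) fclose(bad);

    /* The fix: ask the kernel for an instance-unique name. */
    char name[] = "/tmp/myapp.scratch.XXXXXX";
    int fd = mkstemp(name);          /* replaces XXXXXX, creates exclusively */
    if (fd < 0) { perror("mkstemp"); return 1; }
    printf("this instance's scratch file: %s\n", name);
    close(fd);
    unlink(name);                    /* clean up the demo file */
    return 0;
}

The hard part isn't the fix, it's the hunt: every hard-coded file name, lock file, port number, and shared-memory key in the application is another such case.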

The massively dominant programming model of our times runs each application on its own copy of the operating system. That has been a major cause of server sprawl, and of the resultant killer app for virtualization. The issue isn't just silly duplicate file names, although that is still there. The issue is also performance isolation, fault isolation, security isolation, and even inter-departmental political isolation. "Modern" operating systems simply haven't hardened the isolation between their separate jobs – not because it's impossible; mainframe OSs did it and still do – but because nobody cares any more. Virtualization is used instead to create isolated copies of the OS.

But virtualizing to one OS instance per node on top of a single-system-image OS that unifies the nodes, that's – I'm struggling for strong enough words here. Crazy? Circular? Möbius-strip-like? It would negate the whole point of the SSI.

Decades of training, tool development, and practical experience are behind the one application / one OS / one node programming model. It has enormous cognitive momentum. Multinode single system image is simply swimming upstream against a very strong current on this one, even if in some cases it may seem simpler, at least initially, to implement parallelism using shared virtual memory across multiple nodes. This alone seems to guarantee isolation to a niche market.

Scaling and Single-OS Semantics

Scaling up an application is a massive logical AND. The hardware must scale AND the software must scale. For the hardware to scale, the processing must scale AND the memory bandwidth must scale AND the IO must scale. For the software to scale, the operating system, middleware, AND application must all scale. If anything at all in the whole stack does not scale, you are toast: You have a serial bottleneck.
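The arithmetic behind "you are toast" is just Amdahl's Law: lump everything in the stack that doesn't scale into one serial fraction s, and the best possible speedup on N nodes is 1/(s + (1-s)/N). A small sketch, with made-up serial fractions for each layer:

/* Amdahl's Law sketch: even small serial fractions anywhere in the
 * stack cap the speedup. The per-layer fractions are made-up
 * illustrative numbers, not measurements. */
#include <stdio.h>

static double amdahl(double serial, int n)
{
    return 1.0 / (serial + (1.0 - serial) / n);
}

int main(void)
{
    /* Hypothetical serial fractions: OS + middleware + application. */
    double serial = 0.01 + 0.02 + 0.02;              /* 5% total */
    for (int n = 2; n <= 1024; n *= 4)
        printf("%5d nodes: speedup %6.1f (asymptotic limit %.0f)\n",
               n, amdahl(serial, n), 1.0 / serial);
    return 0;
}

With just 5% serial anywhere in that AND-chain, 1024 nodes buy you a speedup of about 20. It doesn't matter which layer contributes the serial part; the result is the same.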

There are many interesting and useful applications that scale to hundreds or thousands of parallel nodes. This is true both of batch-like big single operations – HPC simulations, MapReduce over Internet-scale data – and the continuous massive river of transactions running through web sites like Amazon.com and eBay. So there are many applications that scale, along with their middleware, on hardware – piles of separate computers – that scales more-or-less trivially. When there's a separate operating system on each node, that part scales trivially, too.

But what happens when the separate operating systems are replaced with a kernel-level SSI operating system? Can it scale, too? That LCC's code was used on the Intel Paragon seems to say that it can.

However, a major feature and benefit of the whole idea of SSI is that the single system image matches the semantics of a single-node OS exactly. Getting that exact match is not easy. At one talk I heard, Jerry Popek estimated that getting the last 5% of those semantics was over 80% of the work, but provided 95% of the benefit – because it guarantees that if any piece of code ran on one node, it will simply run on the distributed version. That guarantee is a powerful feature.

Unfortunately, single-node operating systems simply weren't designed with multi-node scaling in mind. The poster child for this one is the Unix/Linux file position pointer, which is associated with the file handle, not with the process manipulating the file. The intent, relied on by many programs, is that a parent process can open a file, read and process some of it, then pass the handle to a child, which continues reading and processing more of the file; when the child ends, the parent picks up where the child left off.

It's how command-line arguments traditionally get passed: The parent reads far enough to see what kind of child to start up, starts it, and implicitly passes the rest of the input file – the arguments – to the child, which slurps them up.

On a single-node system, this is actually the simplest thing to do: You just use one index, associated with the handle, which everybody increments. For a parallel system, that one index is a built-in serial bottleneck. Spawn a few hundred child processes to work in parallel, and watch them contend for one common input command string. The file pointer isn't the only such case, but it's an obvious, egregious example.
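Here's a minimal C demonstration of that shared position: after fork(), parent and child share one file offset through the same open file description, so the child's reads advance the parent's position. That is exactly what makes the serial hand-off convenient, and exactly what makes it a bottleneck in parallel.

/* Demonstrates the shared Unix file position: after fork(), parent
 * and child share one offset, so the child's reads advance the
 * parent's position too. Uses /etc/passwd only because it reliably
 * exists and is longer than 8 bytes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char buf[4];
    int fd = open("/etc/passwd", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    read(fd, buf, 4);            /* parent consumes bytes 0..3 */

    if (fork() == 0) {           /* child inherits the same open file description */
        read(fd, buf, 4);        /* child consumes bytes 4..7 */
        _exit(0);
    }
    wait(NULL);

    /* Prints "offset 8": the child's read moved the parent's position.
     * Spawn a few hundred children doing this in parallel, and that
     * one offset becomes the serialization point. */
    printf("parent now at offset %ld\n", (long)lseek(fd, 0, SEEK_CUR));
    close(fd);
    return 0;
}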

So how did they make it work on the Intel Paragon? By sacrificing strict semantic compatibility. 

For a one-off, high-end scientific supercomputer (it's not clear it was meant to be a one-off, but that's life), that's not a big deal. One would assume a porting effort or the writing of a lot of brand-new code. For a general, higher-volume offering, cutting such corners would mean trouble; too much code wants to be reused.

Conclusions

Making a single image of an operating system span multiple computers sounds like an absolutely wonderful idea, with many clear benefits. It seems to almost magically solve a whole range of problems, cleanly and clearly. Furthermore, it very clearly can be done; it's appeared in a whole string of projects and low-volume products. In fact, it's been around for decades, in a stream of attempts that are a real tribute to the perseverance of the people involved, and to the seductiveness of the concept.

But it has never really gotten any traction in the marketplace. Why?

Well, maybe it was just bad luck. It could happen. You can't rule it out.

On the other hand, maybe (a) you can't sell it; (b) it's at right angles to the vastly dominant application programming model; (c) it scales less well than many applications you would like to run on it.

Bummer.


__________________________


(Postscript: Are there lessons here for cloud computing? I suspect so, but haven't worked it out, and at this point my brain is tired. See you later, on some other topic.)

11 comments:

admiyo said...

I think the availability issue is a red herring. There are many ways to work around it, the most obvious being a dual head-node approach: Instead of one system of 500 nodes, you represent it as 2 systems of 250, and each can scale up or down accordingly. Migrate all compute nodes over to the non-upgrading system, upgrade the other one, then migrate the nodes back. This is a simplification, but it shows how you can deal with the problem. There are other approaches, too.

I worked at Penguin Computing, which does a limited kernel SSI: no shared FS, just a unified process space. It is targeted at the HPC space only and sells quite well. This is, however, the only place where SSI really seems to make sense.

What many sysadmins in the Windows world do is run one network service per server, for fear that the service will not only fail, but take down the server itself. This has led to a proliferation of machines, each at 2% utilization. Dealing with this mess is what keeps VMware in business.

That being said, good article. I think the nature of the OS is still evolving. I suspect that in the future we will see some aspect of kernel SSI as an underlying layer, with virtualization being used to isolate the scopes for smaller tasks, and certain high-performance tasks spanning multiple nodes. The SSI layer will be slow to change for the very reasons you note above, but the individual virtualized OS instances will be a wide mix of OSes, some new, some very old.

Greg Pfister said...

@admiyo,

I don't understand your solution to the HA issue.

Updates that don't involve changing the inter-node protocols are not a problem. But if you do update those protocols:

- If each group of 250 is a separate system image, you can't migrate jobs between them.
- If all 500 are in the same image, you have the problem I described: You can't update fewer than all of them, because the protocols changed.

There is a workaround: always make every update capable of running versions N & N+1 of the protocols simultaneously. But you still can't switch everything to N+1 at once – each node has to switch to N+1 incrementally, speaking N+1 only with the nodes that have already switched and staying at N with the others.
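To sketch what that implies – the wire format below is entirely hypothetical – every message has to carry a version tag, and every node has to keep decode paths for both versions alive for the entire upgrade window:

/* Hypothetical sketch of the N & N+1 workaround: each message is
 * tagged with a protocol version, and a node keeps both decoders
 * alive until no version-N node remains in the cluster. */
#include <stdint.h>
#include <stdio.h>

#define PROTO_N   7
#define PROTO_N1  8

struct msg_hdr { uint16_t version; uint16_t type; uint32_t len; };

static void handle_v7(const struct msg_hdr *h) { printf("old path, type %u\n", (unsigned)h->type); }
static void handle_v8(const struct msg_hdr *h) { printf("new path, type %u\n", (unsigned)h->type); }

static int dispatch(const struct msg_hdr *h)
{
    switch (h->version) {
    case PROTO_N:  handle_v7(h); return 0;  /* can't drop until the last N node is gone */
    case PROTO_N1: handle_v8(h); return 0;
    default:       return -1;               /* unknown version: refuse, don't guess */
    }
}

int main(void)
{
    /* Mid-upgrade, messages at both versions arrive interleaved, and
     * a sender must also track which version each peer accepts. */
    struct msg_hdr from_old = { PROTO_N,  3, 0 };
    struct msg_hdr from_new = { PROTO_N1, 3, 0 };
    dispatch(&from_old);
    dispatch(&from_new);
    return 0;
}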

That seems pretty messy to me. Not impossible, just messy, particularly when N+1 introduces significant new function.

Oh, and yes, I completely agree that OS instability is one of the big reasons to isolate functions to a single OS image.

admiyo said...

Greg,

Not sure I understand the question. As I said before, the solution at Penguin was targeted at the supercomputing world, and had different parameters than your standard Linux system. The compute nodes are stateless. Any shared state is done via NFS. Reparenting a compute node to a different master involves either a reboot or something less drastic that refreshes the OS image. Jobs don't get migrated: you wait until they are done and do upgrades between large-scale, long-running tasks.

Earlier versions of Scyld Beowulf were written with multi-master control in mind, but it didn't really get you anything. Due to the way the code sat on top of the Linux kernel, both masters had to run the exact same version of the kernel, and that didn't really allow for rolling upgrades. Reparenting works just as well when the kernels match, because a single compute node is not going to be running more than one task: It is going to be running a few processes that are all part of one large MPI-type task. If the kernels match, you can reparent cheaply. If they don't, you can recycle the node.

The code that sets up the internode protocols is all pushed at job launch time. The one exception is InfiniBand, but even then, it is pushed out from the head node when the compute nodes boot. Shared storage might be used for task data, but not for operating system work. The compute node OS is kept light, and the root FS stays in RAM.

Greg Pfister said...

OK, given that the worker nodes are refreshed each time you start a new job, and stay constant through each job, I agree that the problem isn't there in the first place for workers.

The masters having to run the same version of the kernel is closer to the lack of rolling upgrade I was really referring to.

admiyo said...

Nodes are not necessarily refreshed, but all libraries are fetched from the head nodes. There were a lot of issues getting the caching mechanism right, but the general consensus from customers was that it was just better to fetch the libraries from the head node each time and pay a little penalty on startup. If a job starts once a week, spending an extra few milliseconds fetching libraries seems like an acceptable penalty.

Thanks for the replies.

Anonymous said...

I finally got around to reading all five parts of this, after letting it sit in my feed-reader for weeks waiting for enough quiet time to devote to it. Thanks for a very clear and sensible account!

xian said...

"But virtualizing to one OS instance per node on top of a single-system-image OS that unifies the nodes, that's – I'm struggling for strong enough words here. Crazy? Circular? Möbius-strip-like?"

Well said!

Overall, this is a wonderful set of posts - can we call it literate computing? Looking forward to your next book. Picked a topic yet?

Greg Pfister said...

Xian,

Thanks! I'm a sucker for positive feedback. :-)

Next book topic will be along the lines of this series of posts. I have an outline tying it all together around a theme of what we do, now that Moore's Law doesn't scale frequency: Parallel of various stripes, accelerators, virtualization, clouds, etc.

I just have to get down to doing the work on it. But I keep getting pulled off on other stuff. Frustrating. Which is one reason I started this blog, to try to get things going.

Dan said...

Hi guys,

I can't believe it took me this long to find such a great article! I'm an astrophysicist by trade (def. not a computer programmer/admin), and would really like to get into HPC 'on the cheap'. There's a program I use that is parallel in the sense that it spawns N copies of itself on N cores on a multi-core machine. Very primitive stuff. We have a ton of Linux machines, which I'd love to link together to behave as one machine with a single IP address, so my primitive parallel program could work.

My question is, do you think SSI is the way to go? Or do you have any other alternatives?

Thanks so much!!

Dan.

Greg Pfister said...

Hi, Dan.

Thanks for the compliment!

I'm pretty sure there aren't any commercial SSI solutions, probably for the reasons I cite above. There may still be some experimentation going on. Some googling might find that, but I don't know of any myself.

From the description of your app, it would probably run well on a big SGI NUMA system, but that's hardly on the cheap.

Have you looked at all at some of the cloud computing facilities available? Like Hadoop. It starts up a copy of a program on each of N machines, giving each a pointer to a file, or part of a file, of input data; it then collects the resulting files, feeding them all into a final program you specify. Higher-level management above it is available, too (Eucalyptus, Open Nebula), but it may be simpler for you to just use Hadoop directly. All of the above is open source, freely downloadable, and runs on Linux.

Good luck!

Greg

Dan said...

Hi Greg,

Thanks so much for taking the time to reply. Yep, that was pretty much my take on it - no commercial SSI solutions. I might take a look at kerrighed, which is an open-source version. I've already started reading about Hadoop and Eucalyptus - thanks for the tips!

I do think that I might end up concluding that an SGI system is the way to go. I spoke to one of their sales reps a couple of weeks ago, who said that I'd be looking at $35k for a 32-core machine. I'll see if I can con some cash out of NASA/NSF. Not holding my breath, but that would be awesome.

Once again, thanks, and I'll keep you posted.

Regards,

Dan.
