This is part 5 of a multi-post sequence on this topic, which began here. This is the crux: the final part, attempting to answer the question
Why Hasn't SSI Taken Over The World?
After last-minute productus interruptus pull-outs and half-hearted product introductions by so many major industry players – IBM, HP, Sun, Tandem, Intel, SCO – you have to ask what's wrong.
Semi-random events like business conditions, odd circumstances, and company politics always play a large part in whether any product sees the light of day, but over multiple attempts those should average out: It can't just be a massive string of bad luck. Can it? This stuff just seems so cool. It practically breathes breathless press-release hype. Why isn't it everywhere?
Well, I'm fully open to suggestions.
I will offer some possibilities here.
Marketing: A Tale of Focus Groups
In one of the last, biggest pushes for this within IBM, sometime around 2002, a real marketer got involved, got funding, and ran focus groups with several sets of customers in multiple cities to get a handle on who would buy it, and for how much. I was involved in the evaluation and listened to every dang cassette tape of every dang multi-hour session. I practically kicked them in frustration at some points.
The format: After a meet-and-greet with a round of donuts or cookies or whatever, a facilitator explained the concept to the group and asked them "Would you buy this? How much would you pay?" They then discussed it among themselves, and gave their opinions.
It turned out that the customers in the focus groups divided neatly into two completely separate groups:
Naives
These were the "What's a cluster?" people. They ran a small departmental server of their own, but had never even heard of the concept of a cluster. They had no clue what a single system image was or why anybody would conceivably want one, and the short description provided certainly didn't enlighten them. This was the part that nearly had me kicking the tapes; I knew I could have explained it better. Now, however, I realize that what they were told was probably fairly close to the snippets they would have heard in the normal course of marketing and trade press articles.
Result: They wouldn't buy it because they wouldn't understand what it was and why they should be interested.
Experts
These guys were sysadmins of their own clusters, and knew every last minute detail of the area. They'd built clusters (usually small ones), configured cluster databases, and kept them running come hell or high water with failover software. The Naives must have thought they were talking Swahili.
They got the point, they got it fast, and they got it deep. They immediately understood implications far beyond what the facilitator said. They were amazed that it was possible, and thought it was really neat, although some expressed doubt that anybody could actually make it work, as in "You mean this is actually possible? Geez."
After a short time, however, all of them, every last one, zeroed in on one specific flaw: Operating system updates are not HA (highly available), because in many cases you can't run different fix levels on different nodes simultaneously. Sometimes it can work, but not always. These folks did rolling OS upgrades all the time: update one node's OS, see whether it falls over, move on to the next node, and so on. It's a standard way to avoid planned outages.
Result: They wouldn't buy it either, because, as they repeatedly said, they didn't want to go backwards in availability.
That was two out of two. Nobody would buy it. It's difficult to argue with that market estimate.
But what if those problems were fixed? The explanation can be fixed, for sure; as I said, I all but kicked the cassettes in frustration. Doing kernel updates without an outage is pretty hard, but in all but the worst cases it could also be done, with enough effort.
Even were that done, I'm not particularly hopeful, for the several other reasons discussed below.
Programming Model
As Seymour Cray said of virtual memory, "Memory is like an orgasm: It's better when you don't have to fake it." (Thank you, Eugene Miya and his comp.sys.super FAQ.)
That remains true when it's shared virtual memory bouncing between nodes, even though disk accesses probably aren't even involved. Even were an application or framework written to scale up in shared-memory multiprocessor style over many multi-cores in many nodes – and significant levels of multiprocessor scaling will likely be achieved as the single-chip core count rises – that application is going to perform much better if it is rewritten to partition its data so each node can do nearly all accesses into local memory.
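To make the locality point concrete, here's a toy sketch of my own devising (threads standing in for nodes; names and sizes invented): the same summation done two ways, one where every worker hammers a single shared total, and one where each worker owns its own slice of the data and a private accumulator. On a multinode shared-virtual-memory system, the shared version turns into memory pages ping-ponging between nodes; the partitioned version keeps nearly every access local.

```c
/* locality_sketch.c -- illustration only; build with: cc -pthread locality_sketch.c */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define N (1 << 20)

static double data[N];

/* "Shared" style: every worker updates one global total under a lock.
   On a multinode SSI with shared virtual memory, that total (and its
   lock) would bounce between nodes on nearly every update. */
static double shared_total = 0.0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *sum_shared(void *arg) {
    long id = (long)arg;
    for (long i = id; i < N; i += NWORKERS) {
        pthread_mutex_lock(&total_lock);
        shared_total += data[i];          /* every access hits shared state */
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}

/* "Partitioned" style: each worker owns a contiguous slice and a private
   accumulator; only the final partial sums are combined. Nearly all
   accesses stay in the worker's own (local) memory. */
static double partial[NWORKERS];

static void *sum_partitioned(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NWORKERS), hi = lo + (N / NWORKERS);
    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += data[i];
    partial[id] = local;                  /* one shared write per worker */
    return NULL;
}

int main(void) {
    pthread_t t[NWORKERS];
    for (long i = 0; i < N; i++) data[i] = 1.0;

    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&t[w], NULL, sum_partitioned, (void *)w);
    double total = 0.0;
    for (long w = 0; w < NWORKERS; w++) {
        pthread_join(t[w], NULL);
        total += partial[w];
    }
    printf("partitioned total = %.0f\n", total);

    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&t[w], NULL, sum_shared, (void *)w);
    for (long w = 0; w < NWORKERS; w++)
        pthread_join(t[w], NULL);
    printf("shared total      = %.0f\n", shared_total);
    return 0;
}
```

Note that the partitioned version is also the one that survives the jump from threads on one node to processes on many nodes almost unchanged: you just swap the final combining step for a message or a reduction.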
But hey, why bother with all that work? Lots of applications have already been set up to scale on separate nodes, so why not just run multiple instances of those applications, and tune them to run on separate nodes? It achieves the same purpose, and you just run the same code. Why not?
Because most of the time it won't work. They haven't been written to run multiple copies on the same OS. Apache is the poster child for this. Simple, silly things get in the way, like using the same names for temp files and other externally-visible entities. So you modify the file system, letting each have an instance-specific /temp and other files… But now you've got to find all those cases.
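For flavor, here's a trivial, entirely hypothetical sketch of the kind of collision involved (the names are made up; this is not Apache's actual code): a fixed, well-known scratch-file name versus names that are unique per instance.

```c
/* temp_name_sketch.c -- hypothetical illustration of the collision */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Collision-prone: every instance of the program uses the same name.
       Run two copies on one OS and they silently stomp on each other. */
    const char *fixed_name = "/tmp/myapp.scratch";

    /* Instance-specific: fold in something unique per instance (here the
       PID), or better, let mkstemp(3) pick a unique name atomically. */
    char unique_name[64];
    snprintf(unique_name, sizeof unique_name, "/tmp/myapp.scratch.%ld",
             (long)getpid());

    char tmpl[] = "/tmp/myapp.XXXXXX";
    int fd = mkstemp(tmpl);              /* unique file, no race */
    if (fd < 0) { perror("mkstemp"); return 1; }

    printf("collision-prone: %s\n", fixed_name);
    printf("per-instance:    %s\n", unique_name);
    printf("mkstemp picked:  %s\n", tmpl);

    close(fd);
    unlink(tmpl);                        /* clean up the scratch file */
    return 0;
}
```

The fix is one line here; the problem is finding every place an application quietly assumed it was the only copy of itself on the machine.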
The massively dominant programming model of our times runs each application on its own copy of the operating system. That has been a major cause of server sprawl, and reining in that sprawl became the killer app for virtualization. The issue isn't just silly duplicate file names, although that is still there. The issue is also performance isolation, fault isolation, security isolation, and even inter-departmental political isolation. "Modern" operating systems simply haven't implemented hardened isolation between their separate jobs, not because it's impossible – mainframe OSs did it and still do – but because nobody cares any more. Virtualization is instead used to create isolated copies of the OS.
But virtualizing to one OS instance per node on top of a single-system-image OS that unifies the nodes, that's – I'm struggling for strong enough words here. Crazy? Circular? Möbius-strip-like? It would negate the whole point of the SSI.
Decades of training, tool development, and practical experience are behind the one application / one OS / one node programming model. It has enormous cognitive momentum. Multinode single system image is simply swimming upstream against a very strong current on this one, even if in some cases it may seem simpler, at least initially, to implement parallelism using shared virtual memory across multiple nodes. This alone seems enough to confine it to a niche market.
Scaling and Single-OS Semantics
Scaling up an application is a massive logical AND. The hardware must scale AND the software must scale. For the hardware to scale, the processing must scale AND the memory bandwidth must scale AND the IO must scale. For the software to scale, the operating system, middleware, AND application must all scale. If anything at all in the whole stack does not scale, you are toast: You have a serial bottleneck.
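The classic way to put a number on that serial-bottleneck warning is Amdahl's law. Here's a back-of-the-envelope sketch, with serial fractions I made up purely for illustration, just to show how fast "a little bit serial" eats your scaling:

```c
/* amdahl_sketch.c -- back-of-the-envelope illustration of a serial bottleneck */
#include <stdio.h>

/* Amdahl's law: if a fraction s of the work is serial (doesn't scale),
   speedup on n nodes is 1 / (s + (1 - s) / n). */
static double speedup(double serial_fraction, double nodes) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes);
}

int main(void) {
    double fractions[] = { 0.0, 0.01, 0.05, 0.10 };  /* made-up examples */
    int nodes[] = { 10, 100, 1000 };

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("serial=%4.0f%%  nodes=%5d  speedup=%8.1fx\n",
                   fractions[i] * 100.0, nodes[j],
                   speedup(fractions[i], (double)nodes[j]));
    return 0;
}
```

Even a 1% serial residue caps you at roughly 91x on 1,000 nodes; at 5% you top out near 20x no matter how much hardware you throw at it.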
There are many interesting and useful applications that scale to hundreds or thousands of parallel nodes. This is true both of batch-like big single operations – HPC simulations, MapReduce over Internet-scale data – and the continuous massive river of transactions running through web sites like Amazon.com and eBay. So there are many applications that scale, along with their middleware, on hardware – piles of separate computers – that scales more-or-less trivially. When there's a separate operating system on each node, that part scales trivially, too.
But what happens when the separate operating systems are replaced with a kernel-level SSI operating system? Can it still scale? That LCC's code was used on the Intel Paragon seems to say that it can.
However, a major feature and benefit of the whole idea of SSI is that the single system image matches the semantics of a single-node OS exactly. Getting that exact match is not easy. At one talk I heard, Jerry Popek estimated that getting the last 5% of those semantics was over 80% of the work, but provided 95% of the benefit – because it guarantees that if any piece of code ran on one node, it will simply run on the distributed version. That guarantee is a powerful feature.
Unfortunately, single-node operating systems simply weren't designed with multi-node scaling in mind. The poster child for this one is Unix/Linux systems' file position pointer, which is associated with the file handle, not with the process manipulating the file. The intent, used in many programs, is that a parent process can open a file, read and process some, then pass the handle to a child, which continues reading and processing more of the file; when the child ends, the parent can pick up where the child left off. It's how command-line arguments traditionally get passed: The parent reads far enough to see what kind of child to start up, starts it, and implicitly passes the rest of the input file – the arguments – to the child, which slurps them up. On a single-node system, this is actually the simplest thing to do: You just use one index, associated with the handle, which everybody increments. For a parallel system, that one index is a built-in serial bottleneck. Spawn a few hundred child processes to work in parallel, and watch them contend for a common input command string. The file pointer isn't the only such case, but it's an obvious egregious example.
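If you want to see that shared offset with your own eyes, here's a minimal, self-contained sketch of the underlying POSIX behavior (the file name and contents are invented for the demo): the offset lives with the open file description, so after a fork() the parent and child advance one and the same index.

```c
/* shared_offset_sketch.c -- demonstrates the single shared file offset
   across fork(): parent reads some, child continues, parent resumes. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/offset_demo.txt";   /* made-up demo file */

    /* Create a small input file: 26 letters. */
    int wfd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (wfd < 0) { perror("open for write"); return 1; }
    write(wfd, "abcdefghijklmnopqrstuvwxyz", 26);
    close(wfd);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[8];

    /* Parent reads the first 5 bytes; the shared offset is now 5. */
    read(fd, buf, 5);
    printf("parent read : %.5s\n", buf);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child inherits the same open file description, hence the same
           offset: it picks up exactly where the parent stopped. */
        read(fd, buf, 5);
        printf("child read  : %.5s\n", buf);
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* Parent resumes where the child left off: one index, shared by all. */
    read(fd, buf, 5);
    printf("parent again: %.5s\n", buf);

    close(fd);
    unlink(path);
    return 0;
}
```

That single shared index is exactly what's convenient on one node, and exactly what becomes a point of contention when the "children" are hundreds of processes spread across a cluster.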
So how did they make it work on the Intel Paragon? By sacrificing strict semantic compatibility.
For a one-off high-end scientific supercomputer (not clear it was meant to be one-off, but that's life), it's not a big deal. One would assume a porting effort or the writing of a lot of brand new code. For a general higher-volume offering, cutting such corners would mean trouble; too much code wants to be reused.
Conclusions
Making a single image of an operating system span multiple computers sounds like an absolutely wonderful idea, with many clear benefits. It seems to almost magically solve a whole range of problems, cleanly and clearly. Furthermore, it very clearly can be done; it's appeared in a whole string of projects and low-volume products. In fact, it's been around for decades, in a stream of attempts that are a real tribute to the perseverance of the people involved, and to the seductiveness of the concept.
But it has never really gotten any traction in the marketplace. Why?
Well, maybe it was just bad luck. It could happen. You can't rule it out.
On the other hand, maybe (a) you can't sell it; (b) it's at right angles to the vastly dominant application programming model; (c) it scales less well than many applications you would like to run on it.
Bummer.
__________________________
(Postscript: Are there lessons here for cloud computing? I suspect so, but haven't worked it out, and at this point my brain is tired. See you later, on some other topic.)