So, now everybody's staring in rapt attention as Intel provides a peek at its upcoming eight-core chip. When they're not speculating about Larrabee replacing Cell on PlayStation 4, that is.
Sigh.
I often wish the guts of computers weren't so totally divorced from everyday human experience.
Just imagine if computers could be seen, heard, or felt as easily as, for example, cars. That would make what has gone on over the last few years instantly obvious; we'd actually understand it. It would be as if a guy from the computer (car) industry and a consumer had this conversation:
"Behold! The car!" says car (computer) industry guy. "It can travel at 15 miles per hour!"
"Oh, wow," says consumer guy, "that thing is fantastic. I can move stuff around a lot faster than I could before, and I don't have to scoop horse poop. I want one!"
Time passes.
"Behold!" says the industry guy again, "the 30 mile-an-hour car!"
"Great!" says consumer guy. "I can really use that. At 15 mph, it takes all day to get down to town. This will really simplify my life enormously. Gimme, gimme!"
Time passes once more.
"Behold!" says you-know-who, "the 60 mph car!"
"Oh, I need one of those. Now we can visit Aunt Sadie over in the other county, and not have to stay overnight with her 42 cats. Useful! I'll buy it!"
Some more time.
"Behold!" he says, "Two 62-mph cars!"
"Say what?"
"It's a dual car! It does more!"
"What is that supposed to mean? Look, where's my 120 mph car?"
"This is better! It's 124 mph. 62 plus 62."
"Bu… Wha… Are you nuts? Or did you just arrived from Planet Meepzorp? That's crazy. You can't add up speeds like that."
"Sure you can. One can deliver 62 boxes of muffins per hour, so the two together can deliver 124. Simple."
"Muffins? You changed what mph means, from speed to some kind of bulk transport? Did we just drop down the rabbit hole? Since when does bulk transport have anything to do with speed?"
"Well, of course the performance doubling doesn't apply to every possible workload or use. Nothing ever really did, did it? And this does cover a huge range. For example, how about mangos? It can do 124 mph on those, too. Or manure. It applies to a huge number of cases."
"Look, even if I were delivering mangos, or muffins, or manure, or even mollusks …"
"Good example! We can do those, too."
"Yeah, sure. Anyway, even if I were doing that, and I'm not saying I am, mind you, I'd have to hire another driver, make sure both didn't try to load and unload at the same time, pay for more oil changes, and probably do ten other things I didn't have to do before. If I don't get every one of them exactly right, I'll get less than your alleged 124 whatevers. And I have to do all that instead of just stepping on the gas. This is an enormous pain."
"We have your back on those issues. We're giving Jeb here – say hello, Jeb –"
"Quite pleased to meet you, I'm sure. Be sure to do me the honor of visiting my Universal Mango Loading Lab sometime."
"…a few bucks to get that all worked out for you."
"Hey, I'm sure Jeb is a fine fellow, but right down the road over there, Advanced Research has been working on massively multiple loading for about forty years. What can Jeb add to that?"
"Oh, that was for loading special High-Protein Comestibles, not every day mangos and muffins. HPC is a niche market. This is going to be used by everybody!"
"That is supposed to make it easier? Come on, give me my real 120 mile per hour car. That's a mile, not a munchkin, a monkey, a mattock, or anything else, just a regular, old, mile. That's what I want. In fact, that's what I really need."
"Sorry, the tires melt. That's just the way it is; there is no choice. But we'll have a Quad Car soon, and then eight, sixteen, thirty-two! We've even got a 128-car in our labs!"
"Oh, good grief. What on God's Green Earth am I going to do with a fleet of 128 cars?"
…
Yeah, yeah, I know, a bunch of separate computers (cars) isn't the same as a multi-processor. They're different kinds of things, like a pack of dogs is different from a single multi-headed dog, and the programming is very different. But parallel is still parallel, and anyway Microsoft and others will just virtualize each N-processor chip into N separate machines in servers. I'd bet the high-core-count multi-cores ultimately morph into a cluster-on-a-chip as time goes on anyway, passing through NUMA-on-a-chip on the way.
But it's still true that:
- Computers no longer go faster. We just get more of them. Yes, clock speeds still rise, but it's like watching grass grow compared to past rates of increase. Lots of software engineers really haven't yet digested this; they still expect hardware to bail them out like it used to.
- The performance metrics got changed out from under us. SPECrate is muffins per hour: it measures how many copies of a benchmark finish per unit time (throughput), not how fast any single one runs (see the sketch after this list).
- Various hardware vendors are funding labs at Berkeley, UIUC, and Stanford to work on using multi-core chips better, of course. Best of luck with your labs, guys, and I hope you manage to do a lot better than 40 years of DARPA/NSF funding achieved. Oh, but that was a niche.
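To make the muffin metric concrete, here's a minimal sketch, in Python, of throughput versus speed; every number in it is invented for illustration:

```python
# A minimal sketch of muffins-per-hour vs. miles-per-hour: adding
# processors multiplies throughput (jobs finished per hour) but does
# nothing for how long any single job takes. All numbers are invented.
JOB_TIME_HOURS = 2.0  # time for one single-threaded job (one "delivery")

def throughput(num_processors: int) -> float:
    """Jobs completed per hour across all processors (the SPECrate view)."""
    return num_processors / JOB_TIME_HOURS

def latency(num_processors: int) -> float:
    """Hours until one particular job finishes; extra processors don't help."""
    return JOB_TIME_HOURS

for n in (1, 2, 4):
    print(f"{n} processor(s): {throughput(n):.1f} jobs/hour; "
          f"one job still takes {latency(n):.1f} hours")
```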
My point in all of this is not to protest the rising of the tide. It's coming in. Our feet are already wet. "There is no choice" is a phrase I've heard a lot, and it's undeniably true. The tires do melt. (I sometimes wonder "Choice to do what?" but that's another issue.)
Rather, my point is this: We have to internalize the fact that the world has changed – not just casually admit it on a theoretical level, but really feel it, in our gut.
That internalization hasn't happened yet.
We should have reacted to multi-core systems like consumer guy seeing the dual car and hearing the crazy muffin discussion, instantly recoiling in horror, recognizing the marketing rationalizations as somewhere between lame and insane. Instead, we hide the change from ourselves, for example letting companies call a multi-core system "a processor" (singular) because it's packaged on one chip, when they should be laughed at so hard even their public relations people are too embarrassed to say it.
Also, we continue to casually talk in terms that suggest a two-processor system has the power of one processor running twice as fast – when they really can't be equated, except at a level of abstraction so high that miles are equated to muffins.
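The classic way to put a number on that gap is Amdahl's law: if only a fraction p of a program's work can be spread across n processors, the overall speedup is 1/((1-p) + p/n), no matter how many processors you throw at it. A minimal sketch, with illustrative guesses for p:

```python
# Amdahl's law: if a fraction p of the work parallelizes perfectly over
# n processors and the rest stays serial, overall speedup is
# 1 / ((1 - p) + p / n). The p values below are illustrative guesses.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.5, 0.9, 0.99):
    print(f"p = {p:.2f}: 2 procs -> {amdahl_speedup(p, 2):.2f}x, "
          f"128 procs -> {amdahl_speedup(p, 128):.2f}x")
# Even with 90% of the work parallel, 128 processors yield under 10x.
```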
We need to understand that we've gone down a rabbit hole. So many standard assumptions no longer hold that we can't even enumerate them.
To ground this discussion in real GHz and performance, here's an example of what I mean by breaking standard assumptions.
In a discussion on Real World Technologies' Forums about the recent "Intel Larrabee in Sony PS4" rumors, it was suggested that Sony could, for backward compatibility, just emulate the PS3's Cell processor on Larrabee. After all, Larrabee is several processor generations after Cell, and it has much higher performance. As I mentioned elsewhere, the Cell cranks out "only" 204 GFLOPS (peak), while public information about Larrabee puts it somewhere between 640 GFLOPS (peak) and 1280 GFLOPS (peak), depending on what assumptions you make; call it an even 1 TFLOPS.
With that kind of performance difference, making a Larrabee act like a Cell should be a piece of cake, right? All those old games will run just as fast as before. The emulation technology (just-in-time compiling) is there, and the inefficiency introduced (not much) will be covered up by the faster processor. No problem. Standard thing to do. Anybody competent would think of it.
Not so fast. That's pre-rabbit-hole thinking. Those are all flocks of muffins flying past, not simple speed. Down in the warrens where we are now, it's possible for Larrabee to be both faster and slower than Cell.
In simple speed, the newest Cell's clock rate is actually noticeably faster than expected for Larrabee. Cell has shipped for years at 3.2 GHz; the more recent PowerXCell version uses newer fabrication technology to lower power (heat), not to increase speed. Public Larrabee estimates say that when it ships (late 2009 or 2010) it will run somewhere around 2 GHz, so in that sense Cell is about 1.6X faster than Larrabee (both are in-order, and both count FLOPS double by having a multiply-add).
Larrabee is "faster" only because it contains much more stuff – many more transistors – to do more things at once than Cell does. This is true at two different levels. First, it has more processors: Cell has 8, while Larrabee at least 16 and may go up to 48. Second, while both Cell and Larrabee gain speed by lining up several numbers and operating on all of them at the same time (SIMD), Larrabee lines up more numbers at once than Cell: The GFLOPS numbers above assume Larrabee does 16 operations at once (512-bit vector registers), but Cell does only four operations at once (128-bit vector registers). To get maximum performance on both of them, you have to line up that many numbers at once. Unless you do, performance goes down proportionally.
This means that to match the Cell performance shipping today (and for the past several years), next year's Larrabee would have to not just emulate it, but extract more parallelism than is directly expressed in the program being emulated. It has to find more things to do at once than were there to begin with.
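To see why, do the lane arithmetic with the vector widths just cited; a minimal sketch, where everything beyond those two widths is illustrative:

```python
# Lane arithmetic for emulation, using the vector widths cited above.
GUEST_LANES = 4    # Cell: 128-bit vectors, four 32-bit operations at once
HOST_LANES = 16    # Larrabee: 512-bit vectors, sixteen at once

# If the emulator translates one guest vector op into one host vector op,
# most of the host's lanes sit idle:
naive_utilization = GUEST_LANES / HOST_LANES  # 0.25, three-quarters wasted

# To fill the hardware, it must find this many *independent* guest ops
# and fuse them into each host op; that's parallelism the original
# program never expressed directly:
ops_to_fuse = HOST_LANES // GUEST_LANES  # 4
print(f"naive lane use: {naive_utilization:.0%}; must fuse {ops_to_fuse} guest ops per host op")
```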
I'm not saying that's impossible; it's probably not. But it's certainly not at all as straightforward as it would have been before we went down the rabbit hole. (And I suspect that "not at all as straightforward" may be excessively delicate phrasing.)
Ah, but how many applications really use all the parallelism in Cell – get all its parts cranking at once? Some definitely do, and people figure out how to do more every day. But it's not a huge number, in part because Cell lacks the usual, nice, maximally convenient programming model exhibited by mainstream systems, and claimed for Larrabee; it traded away some of that convenience for all that speed. The idea was that Cell was not for "normal" programming; it was for game programming, with most of the action in intense, tight, hand-coded loops doing image creation from models. That happened, but certainly not all the time, and anecdotally not very often at all.
Question: Does that make the problem easier, or harder? Don't answer too quickly, and remember that we're talking about emulating from existing code, not rewriting from scratch.
A final thought about assumption breaking and Cell's notorious programmability issues compared with the usual simpler-to-use organizations: We may, one day, look back and say "It sure was nice back then, but we no longer have the luxury of using such nice, simple programming models." It'll be muffins all the way down. I just hope that we've merely gone down the rabbit hole, and not crossed the Mountains of Madness.