Sunday, February 15, 2009

What Multicore Really Means (and a Larrabee/Cell Example)

So, now everybody's staring in rapt attention as Intel provides a peek at its upcoming eight-core chip. When they're not speculating about Larrabee replacing Cell on PlayStation 4, that is.


I often wish the guts of computers weren't so totally divorced from everyday human experience.

Just imagine if computers could be seen, heard, or felt as easily as, for example, cars. That would make what has gone on over the last few years instantly obvious; we'd actually understand it. It would be as if a guy from the computer (car) industry and a consumer had this conversation:

"Behold! The car!" says car (computer) industry guy. "It can travel at 15 miles per hour!"

"Oh, wow," says consumer guy, "that thing is fantastic. I can move stuff around a lot faster than I could before, and I don't have to scoop horse poop. I want one!"

Time passes.

"Behold!" says the industry guy again, "the 30 mile-an-hour car!"

"Great!" says consumer guy. "I can really use that. At 15 mph, it takes all day to get down to town. This will really simplify my life enormously. Gimme, gimme!"

Time passes once more.

"Behold!" says you-know-who, "the 60 mph car!"

"Oh, I need one of those. Now we can visit Aunt Sadie over in the other county, and not have to stay overnight with her 42 cats. Useful! I'll buy it!"

Some more time.

"Behold!" he says, "Two 62-mph cars!"

"Say what?"

"It's a dual car! It does more!"

"What is that supposed to mean? Look, where's my 120 mph car?"

"This is better! It's 124 mph. 62 plus 62."

"Bu… Wha… Are you nuts? Or did you just arrived from Planet Meepzorp? That's crazy. You can't add up speeds like that."

"Sure you can. One can deliver 62 boxes of muffins per hour, so the two together can deliver 124. Simple."

"Muffins? You changed what mph means, from speed to some kind of bulk transport? Did we just drop down the rabbit hole? Since when does bulk transport have anything to do with speed?"

"Well, of course the performance doubling doesn't apply to every possible workload or use. Nothing ever really did, did it? And this does cover a huge range. For example, how about mangos? It can do 124 mph on those, too. Or manure. It applies to a huge number of cases."

"Look, even if I were delivering mangos, or muffins, or manure, or even mollusks …"

"Good example! We can do those, too."

"Yeah, sure. Anyway, even if I were doing that, and I'm not saying I am, mind you, I'd have to hire another driver, make sure both didn't try to load and unload at the same time, pay for more oil changes, and probably do ten other things I didn't have to do before. If I don't get every one of them exactly right, I'll get less than your alleged 124 whatevers. And I have to do all that instead of just stepping on the gas. This is an enormous pain."

"We have your back on those issues. We're giving Jeb here – say hello, Jeb –"

"Quite pleased to meet you, I'm sure. Be sure to do me the honor of visiting my Universal Mango Loading Lab sometime."

"…a few bucks to get that all worked out for you."

"Hey, I'm sure Jeb is a fine fellow, but right down the road over there, Advanced Research has been working on massively multiple loading for about forty years. What can Jeb add to that?"

"Oh, that was for loading special High-Protein Comestibles, not every day mangos and muffins. HPC is a niche market. This is going to be used by everybody!"

"That is supposed to make it easier? Come on, give me my real 120 mile per hour car. That's a mile, not a munchkin, a monkey, a mattock, or anything else, just a regular, old, mile. That's what I want. In fact, that's what I really need."

"Sorry, the tires melt. That's just the way it is; there is no choice. But we'll have a Quad Car soon, and then eight, sixteen, thirty-two! We've even got a 128-car in our labs!"

"Oh, good grief. What on God's Green Earth am I going to do with a fleet of 128 cars?"

Yeah, yeah, I know, a bunch of separate computers (cars) isn't the same as a multi-processor. They're different kinds of things, like a pack of dogs is different from a single multi-headed dog. See illustrations here. The programming is very different. But parallel is still parallel, and anyway Microsoft and others will just virtualize each N-processor chip into N separate machines in servers. I'd bet the high-number multi-cores ultimately morph into a cluster-on-a-chip as time goes on, anyway, passing through NUMA-on-a-chip on the way.

But it's still true that:

  • Computers no longer go faster. We just get more of them. Yes, clock speeds still rise, but it's like watching grass grow compared to past rates of increase. Lots of software engineers really haven't yet digested this; they still expect hardware to bail them out like it used to.
  • The performance metrics got changed out from under us. SPECrate is muffins per hour.
  • Various hardware vendors are funding labs at Berkeley, UIUC, and Stanford to work on using them better, of course. Best of luck with your labs, guys, and I hope you manage to do a lot better than was achieved by 40 years of DARPA/NSF funding. Oh, but that was a niche.

My point in all of this is not to protest the rising of the tide. It's coming in. Our feet are already wet. "There is no choice" is a phrase I've heard a lot, and it's undeniably true. The tires do melt. (I sometimes wonder "Choice to do what?" but that's another issue.)

Rather, my point is this: We have to internalize the fact that the world has changed – not just casually admit it on a theoretical level, but really feel it, in our gut.

That internalization hasn't happened yet.

We should have reacted to multi-core systems like consumer guy seeing the dual car and hearing the crazy muffin discussion, instantly recoiling in horror, recognizing the marketing rationalizations as somewhere between lame and insane. Instead, we hide the change from ourselves, for example letting companies call a multi-core system "a processor" (singular) because it's packaged on one chip, when they should be laughed at so hard even their public relations people are too embarrassed to say it.

Also, we continue to casually talk in terms that suggest a two-processor system has the power of one processor running twice as fast – when they really can't be equated, except at a level of abstraction so high that miles are equated to muffins.

We need to understand that we've gone down a rabbit hole. So many standard assumptions no longer hold that we can't even enumerate them.

To ground this discussion in real GHz and performance, here's an example of what I mean by breaking standard assumptions.

In a discussion on Real World Technologies' Forums about the recent "Intel Larrabee in Sony PS4" rumors, it was suggested that Sony could, for backward compatibility, just emulate the PS3's Cell processor on Larrabee. After all, Larrabee is several processor generations after Cell, and it has much higher performance. As I mentioned elsewhere, the Cell cranks out "only" 204 GFLOPS (peak), while public information about Larrabee puts it somewhere between 640 GFLOPS (peak) and 1280 GFLOPS (peak), depending on what assumptions you make, so call it an even 1 TFLOP.

With that kind of performance difference, making a Larrabee act like a Cell should be a piece of cake, right? All those old games will run just as fast as before. The emulation technology (just-in-time compiling) is there, and the inefficiency introduced (not much) will be covered up by the faster processor. No problem. Standard thing to do. Anybody competent should think of it.

Not so fast. That's pre-rabbit-hole thinking. Those are all flocks of muffins flying past, not simple speed. Down in the warrens where we are now, it's possible for Larrabee to be both faster and slower than Cell.

In simple speed, the newest Cell's clock rate is actually noticeably faster than expected for Larrabee. Cell has shipped for years at 3.2 GHz; the more recent PowerXCell version uses newer fabrication technology to lower power (heat), not to increase speed. Public Larrabee estimates say that when it ships (late 2009 or 2010) it will run somewhere around 2 GHz, so in that sense Cell is about 1.6X faster than Larrabee (both are in-order, and both count FLOPS double by having a multiply-add).

Larrabee is "faster" only because it contains much more stuff – many more transistors – to do more things at once than Cell does. This is true at two different levels. First, it has more processors: Cell has 8, while Larrabee has at least 16 and may go up to 48. Second, while both Cell and Larrabee gain speed by lining up several numbers and operating on all of them at the same time (SIMD), Larrabee lines up more numbers at once than Cell: The GFLOPS numbers above assume Larrabee does 16 operations at once (512-bit vector registers), but Cell does only four operations at once (128-bit vector registers). To get maximum performance on both of them, you have to line up that many numbers at once. Unless you do, performance goes down proportionally.
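Those peak numbers fall straight out of arithmetic on the specs. Here's a minimal sketch that reproduces them, assuming the usual peak formula (cores × SIMD lanes × 2 flops per cycle for multiply-add × clock); the Larrabee core count and clock are the public estimates, not confirmed specs:

```python
def peak_gflops(cores, simd_lanes, ghz):
    # Peak GFLOPS = cores x SIMD lanes x 2 flops/cycle (multiply-add) x GHz
    return cores * simd_lanes * 2 * ghz

# Cell: 8 SPEs, 4-wide SIMD (128-bit registers), 3.2 GHz
print(peak_gflops(8, 4, 3.2))       # 204.8 -- the "204 GFLOPS" peak above

# Larrabee (estimates): 16 cores, 16-wide SIMD (512-bit), ~2 GHz
print(peak_gflops(16, 16, 2.0))     # 1024.0 -- "call it an even 1 TFLOP"

# The proportionality in the text: fill only 4 of Larrabee's 16 lanes
# (i.e., run Cell-width code) and the achievable peak drops by the same factor.
print(peak_gflops(16, 16, 2.0) * 4 / 16)    # 256.0
```

The spread in the public Larrabee numbers (640 to 1280 GFLOPS) is just different assumed core counts and clocks plugged into this same formula.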

This means that to match the Cell performance of today (and of several years ago), next year's Larrabee would have to not just emulate it, but extract more parallelism than is directly expressed in the program being emulated. It has to find more things to do at once than were there to begin with.

I'm not saying that's impossible; it's probably not. But it's certainly not at all as straightforward as it would have been before we went down the rabbit hole. (And I suspect that "not at all as straightforward" may be excessively delicate phrasing.)

Ah, but how many applications really use all the parallelism in Cell – get all its parts cranking at once? Some definitely do, and people figure out how to do more every day. But it's not a huge number, in part because Cell does not have the usual, nice, maximally convenient programming model exhibited by mainstream systems, and claimed for Larrabee; Cell traded that off, in part, for all that speed. The idea was that Cell was not for "normal" programming; it was for game programming, with most of the action in intense, tight, hand-coded loops doing image creation from models. That happened, but certainly not all the time, and anecdotally not very often at all.

Question: Does that make the problem easier, or harder? Don't answer too quickly, and remember that we're talking about emulating from existing code, not rewriting from scratch.

A final thought about assumption breaking and Cell's notorious programmability issues compared with the usual simpler-to-use organizations: We may, one day, look back and say "It sure was nice back then, but we no longer have the luxury of using such nice, simple programming models." It'll be muffins all the way down. I just hope that we've merely gone down the rabbit hole, and not crossed the Mountains of Madness.


David M. Chess said...

Great post as usual. You got very close to, but never quite spelled out, how Cell and Larrabee compare at the bad extreme, where they're both running a (roughly) single-threaded computation. How odd does that end up looking?

Greg Pfister said...

Thanks, David.

Single thread? OK, but are there (parallel) SIMD operations in that "one" thread, or is it totally serial?

Cell has roughly a 1.6X clock advantage (3.2 GHz vs. about 2 GHz), but Larrabee has 4X wider SIMD. If you can use all of Larrabee's SIMD, I'd say Larrabee surely wins.

If you can't use a lot of Larrabee's SIMD -- 16 is pretty wide -- then I'd bet on Cell.
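To put rough numbers on that tradeoff, here's a back-of-the-envelope sketch using the clock figures from the post (3.2 GHz for Cell, an estimated 2 GHz for Larrabee) and treating single-thread throughput as just clock times SIMD lanes filled; this ignores memory bandwidth and everything else:

```python
cell_ghz, cell_lanes = 3.2, 4    # shipping Cell SPE: 3.2 GHz, 128-bit SIMD
lrb_ghz = 2.0                    # estimated Larrabee clock; 512-bit (16-lane) SIMD

def rate(ghz, lanes_used):
    # Single-thread throughput ~ clock x SIMD lanes actually filled
    return ghz * lanes_used

# All 16 Larrabee lanes filled: its 4X width beats Cell's clock edge
full_simd = rate(lrb_ghz, 16) / rate(cell_ghz, cell_lanes)
print(round(full_simd, 2))   # 2.5 -- Larrabee wins

# Only 4 lanes usable on Larrabee: Cell's clock advantage wins
narrow = rate(lrb_ghz, 4) / rate(cell_ghz, cell_lanes)
print(round(narrow, 3))      # 0.625 -- Larrabee at 5/8 of Cell's speed
```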

All this assumes similar memory bandwidth etc. I don't have those numbers at the tip of my tongue, and it will be app-dependent anyway.

This assumes both are coded from scratch, of course, which isn't the case I talked about.

I think one of the things about our new world is that we'll have to wean ourselves from general performance numbers.

bblainey said...

Great post Greg. For sure, you've illustrated that, while emulating one processor on another at high speed has historically been a huge challenge, the same problem when we're deeply into the parallel and SIMD space is a challenge of a completely different order of magnitude.

However, the real message I took away from the post was that we need to find a new way to measure, compare and communicate the performance of systems. Clock frequency has never been an accurate measure of performance but, for most people, it was close enough and easy to understand. So what do we say about these muffin-shipping systems and, worse, what do we say about systems that go slower but somehow ship more muffins? As your caricature illustrates, the current communications from chip and system companies is doing little but confusing users. But opening up the hood and explaining how the engine works is not a viable alternative. What is a *simple*, good-enough way to talk about the performance of these processors and systems?

Steve Rogers said...

Another issue with clock speed as a rough performance measure is that many systems, especially consumer grade, lack sufficient cooling for sustained operation at their specified speeds. Even with single core processors you often have to do your own measurements if you really are concerned about performance. Multicore is more complex, but the processor manufacturers are really just trading one kind of hand waving for another.

Taking advantage of multicore performance requires new programming paradigms, or new combinations of old ones. As you say we have gone down a rabbit hole, and the computing world is becoming stranger than most people imagine.

Alex UK said...

Brilliant article Greg, thank you very much.

Anonymous said...

Not to be unkind, but your post reminds me why IBM was destroyed by MS, Intel, Nvidia, AMD, and every software person that helped explode new business models on the internet.

Your blather about multi-core is right when taken to extreme, but any concept tends to fall apart when taken to extremes. The dual core processor was a massive step forward for Intel and AMD. The quad core processor is a very significant step. However, it is certainly true that the law of diminishing returns kicks in hard when an 8-core CPU is considered, and such a beast is likely to offer near zero advantages for the vast majority of users currently using a PC with an old school OS like Windows or Linux.

Your suggestion that this is proven by failed attempts to find general purpose programming concepts that benefit from multi-processor systems is only partially correct. Why? Because a single user, multi-core PC is usually running a large number of simultaneous tasks that quite happily map to a multi-core system. No one task has to run on more than one core, or even have more than one thread.

The problem is the greater number of PC tasks can all be accommodated, using time-slicing, on a single core, because they are not processor intensive.

The problem with multi-core in the real world (a world I'd argue that IBM long since left) is that as soon as a software algorithm is developed that takes advantage of current PC CPU designs (like 3d-rasterising, or video-decoding), the algorithm is quickly baked into a standalone ASIC, providing a much cheaper and far more power efficient solution.

To sell more than 4 cores in the only space that commercially matters, the PC market space, the problem is not one of parallel programming, but finding useful software tasks that will continue to run best on the CPU, and require more CPU power than is currently available in a 4-core x 3-to-4 GHz system.

The real 'research' is in the statistical knowledge of the type of software that the hundreds of millions of PC users need and have on their systems. Video is video. Sound is sound. My point is that both have clear desirable limits to ultimate useful quality, and beyond that point no improvement is significant in the market place. Likewise with text. At the moment, computers (in mass use) are 'de-evolving' (courtesy of Intel's Atom chip), and the significance of this is we truly no longer have a good idea what uses to put our existing levels of CPU power to.

Parallel computing belongs in the area of specific designs to match specific algorithms, like motion estimation (for video compression) or 3d game rendering. Here data flow considerations are usually at least as important as the multi-core approach, given that the speed of processing is usually memory bound (for an achievable commercial cost- something I know IBM never had to worry about in its military research contracts).

The Core architecture of the Sony PS3 was a horrifically dumb approach to solving a computer graphics problem (Sony made the mistake of consulting old men whose knowledge of computer science came from textbooks that should have long been forgotten). In the end, Sony had to admit defeat, and place an Nvidia GPU in their console, relegating the Core design to a general purpose CPU.

Larrabee comes from a different place- namely the dreams of 3D games programmers that began work before hardware solutions like 3DFX's Voodoo 1 were available. Here the question was "what CPU architecture would best allow the programming of 3D rasterising algorithms?" Larrabee has been designed by the man responsible for the software engine that rasterised iD's Quake 1 game, and the man responsible for the same for Epic's Unreal.

Larrabee is NOT a general purpose multi-core CPU like Intel and AMD's current 2 and 4-core x86 chips. It is a 'wet dream' CPU for software engineers that pre-date good 3D games hardware. Larrabee's problem is that good ideas in computer science/engineering from one period in time are rarely good ideas years later (as companies like IBM have learnt to their great cost).

Traditional CPU designs have one massive problem, and it is this: a simple requirement for logical processing that can easily be designed at a hardware level may not be efficiently available as an expression of CPU instructions. In other words, the programmer needs to do something in the inner-loop; the logic of this may be trivial, but the CPU has to emulate this logic with a hideous chain of slow hardware events.

An example is seen in the field of video compression. x264 is the world's best software encoder for AVC mp4, and it is open-source. It scales to multi-core systems, of course (giving people a real use for 8+ core Intel and AMD chips). One of the sponsoring companies, Avail, needed a faster solution today, and explored the possibility of moving some of the code onto Nvidia GPU systems. Despite the amazing performance of Nvidia and ATI solutions, the programming design was a bad match to the particular logic problems of the encoding algorithms, and Avail realised that they could use cheap ASIC design to target the key inner loops. The REAL world of parallel programming.

Back to our multi-core desktop processors, and there is one application that seems core hungry beyond even 4 cores- namely the only type of application that uses the full power of most people's systems, computer games.

While 'deadheads' attempt to convince that failed parallel programming research in the past by dinosaurs proves something absolute, computer games show that one can easily have applications that consist of many simultaneous parallel tasks, tasks that happily spread themselves across all the available cores.

Take a projected 'reality simulation' game evolved from the current titles that show the best progress, like 'Fallout' and 'GTA4'. This future game needs 'AI' (a misuse of the term, I know), character animation, physics, weather simulation, environmental sound processing, massively complex pre-scene analysis for lighting, and on and on and on.

Every task has to complete in time for the next frame to be rendered. Individual tasks may, or may not be parallel program friendly, but each task will certainly be core friendly.

IBM has no business in computer games, so games don't exist to IBM people. In the REAL world, computer games drive the vast majority of commercial computer hardware development. Are the most interesting, and commercially valuable, computer games multi-core friendly- the answer is a resounding YES. Problem is, the dusty old (out-of-date) computer science textbooks and research papers on multi-core systems that certain people learnt their craft from won't even acknowledge the existence of computer games.

If the argument is that multi-core is a problem, because non-game apps rarely benefit, people in the real-world say "what are these non-game apps that need more power anyway?". As I have said above, such apps either don't exist, or rapidly find themselves moved to hardware (for example, dual-core chips were initially mostly useful for decoding hi-definition, Bluray like content, but now a small segment of the GPU on even ATI's and Nvidia's cheapest product decodes hi-def video with no CPU hit, and using far less power than a CPU solution).

People who worked in an age where parallel programming was researched and funded to help produce 'better' nuclear weapons are ill-suited to understand the current state of affairs.

Multi-core processors are for online game servers, render farms producing new Hollywood CGI films and FX, and state-of-the-art computer games. General purpose parallel programming languages (or their non-existence) is a red-herring. Core in the PS3 is a red-herring (the chip was supposed to do the graphics- had Sony understood that Core was going to fail at graphics, they would have used a traditional processor instead). Larrabee is designed as a GPU, and shares a lot of commonality with the GPUs from Nvidia and ATI.

One final question. If Intel and Microsoft and IBM spend such vast amounts of money on 'computer science' why do the best ideas constantly come from individuals that are rich only in intellect, and largely free from the restrictions of past rules and understandings? What would an Intel, or IBM or Microsoft designed Internet have looked like? Even in my worst nightmares, I hope I never imagine such a thing.

Or, to put it another way, mega engineering companies existed in the 19th century, but the aeroplane and allied industries evolved from people far removed from these 'braindead' fossils. Old men are ill placed to tell young men what is possible, what is a good idea, or what the future will look like. Every generation has their textbooks saying "these are the rules", and woe betide any generation that treats these documents as a bible.

Do you know how many pins the original RS-232 connector had, in order to send data slowly down a single line? This nonsense is what you get when research is paid for by government grants linked to military projects. I'm sure the original 80-pin (or whatever) RS-232 connector had 50 patents covering its design, but so what- wrong is wrong- foolish is foolish- no matter how many 'textbooks' repeat the nonsense. How many people were taught complete garbage about the maximum transmission speeds of copper phone lines before the days of ADSL.

Old people feel intimidated by the fact that the rules they lived and worked by are obsolete. If they are sensible, they simply retire gracefully. If they are foolish, they waste their time warning why young people are doomed to failure.

The solutions for multi-core use will not be coming from formal university research. The world has moved on.

Anonymous said...

You should name specific books.

You're telling me there is no value to be derived from such works as: Art Of Comp Programming, the hardware software interface, the c programming language, and works in electrical and computer engineering.

Honestly, you're not saying anything new. Stupid people that cannot innovate will always learn and apply knowledge through rote memorization. Theory is meant to be understood; it's not dangerous.

You bash computer science without knowing shit about it. Knowing logic/assembly/architecture/organization has never hurt anyone. The problem is with the field being flooded with mindless fodder in it for the check, too stupid to understand and only looking for application purposes.

Computer science is a mathematical discipline, along with engineering. Approaching it from that perspective, with ENTHUSIASM for the field itself rather than the job market, will (and in my experience almost always has) lead to a strong force in either the software engineering field or computer science.

Constantine said...

...Because a single user, multi-core PC is usually running a large number of simultaneous tasks that quite happily map to a multi-core system. No one task has to run on more than one core, or even have more than one thread...

What are these simultaneous tasks which you are running every day on a desktop PC? Mapping a word processor to one core and a spreadsheet to the other? These tasks are not parallel and don't need multi-core.

Parallel processing is most useful in a server environment and when used for computations. Now, how would NVIDIA CUDA be useful in a server environment? It was designed as a graphical accelerator. One consequence of this is that you can actually send computation tasks to the GPU only between video refreshes, for a fraction of a millisecond. TESLA is not a solution either, because of power requirements, price, etc. The CUDA programming model is terrible, hard to maintain, etc., etc.

About government and military projects: without them we could not have had Silicon Valley - read the history. This is directly related also to the Internet, GPS, etc. This is a fact. Substantial scientific achievements need long-term commitment and funding.

Kurt said...

I liked your post, and am glad I found your blog. The topics you discussed are generally internalized from the beginning in embedded network processing. I am wondering what your thoughts are regarding multicore in 10-40G data-plane systems.

Thanks for writing. I will be following your blog closely!

Greg Pfister said...

Thanks, Kurt.

Multicore on control / data plane embedded network systems has actually been around for a while, and will continue, and expand, of course -- it has to. I'm far from an expert in that realm, but it seems the obvious direction is greater parallelism in the number of data streams (packets or other units) handled simultaneously. The parallel data transfer requirements will get fierce, of course.

On this general topic, you may be interested in my post on the recently-announced "Japan Inc." effort; see .


Anonymous said...

@Anonymous, who posted the long rant

Some of the stuff you write is spot-on, some is bullshit. It is okay to bash old ideas, old ways of thinking - but not, old people. For example I had a CS professor as my thesis supervisor who was over 80 years old. Many thought he was a dinosaur who didn't know shit, just because he was old. Fact is, every day he cranked out more truly original, new and novel ideas than any of the 20-something, drivelling idiots he had to teach in class. Just one example out of millions.
