The Perils of Parallel: August 2009

There's a rumor that Intel is planning to integrate Larrabee, its forthcoming high-end graphics / HPC accelerator, into its processors in 2012. A number of things about this make a great deal of sense at a detailed level, but there's another level at which you have to ask "What will this mean?"

Some quick background, first: Larrabee is a well-publicized Intel product-of-the-future, where this particular future is late this year ('09) or early next year ('10). It's Intel's first foray into the realm of high-end computer graphics engines. Nvidia and ATI (part of AMD now) are now the big leagues of that market, with CUDA and Stream products respectively. While Larrabee, CUDA and Stream differ greatly, all three use parallel processing to get massive performance. Larrabee may be in the 1,000 GFLOPS range, while today Nvidia is 518 GFLOPS and ATI is 2,400 GFLOPS. In comparison, Intel's latest Core i7 processor reaches about 50 GFLOPS.

Integrating Larrabee into the processor (or at least its package) fits well with what's known of Intel's coarse roadmap, illustrated below (from canardpc.com, self-proclaimed "un scandale"):

"Ticks" are new lithography, meaning smaller chips; Intel "just" shrinks the prior design. "Tocks" keep the same lithography, but add new architecture or features. So integrating Larrabee on the "Haswell" tock makes sense as the earliest point at which it could be done.

The march of the remainder of Moore's Law – more transistors, same clock rate – makes such integration possible, and, for cost purposes, inevitable.

The upcoming "Westmere" parts start this process, integrating Intel's traditional "integrated graphics" onto the same package with the processor; "integrated" here means low-end graphics integrated into the processor-supporting chipset that does IO and other functions. AMD will do the same. According to Jon Peddie Research, this will destroy the integrated graphics market. No surprise there: same function, one less chip to package on a motherboard, probably lower power, and… free. Sufficiently, anyway. Like Internet Explorer built into Windows for "free" (subsidized) destroying Netscape, this will just come with the processor at no extra charge.

We will ultimately see the same thing for high-end graphics. 2012 for Larrabee integration just puts a date on it. AMD will have to follow suit with ATI-related hardware. And now you know why Nvidia has been starting its own X86 design, a pursuit that otherwise would make little sense.

Obviously, this will destroy the add-in high-end graphics market. There might be some residual super-high-end graphics left for the super-ultimate gamer, folks who buy or build systems with multiple high-end cards now, but whether there will be enough volume for that to survive at all is arguable.

Note that once integration is a fact, "X inside" will have graphics architecture implications it never had before. You pick Intel, you get Larrabee; AMD, ATI. Will Apple go with Intel/Larrabee, AMD/ATI, or whatever Nvidia cooks up? They began OpenCL, to abstract the hardware, but as an interface it is rather low-level and reflective of Nvidia's memory hierarchy. Apple will have to make the same choice that PC users will make individually, but for its entire user base.

That's how, and "why" in a low-level technical hardware sense. It is perfectly logical that, come 2012, every new PC and Mac will have what by then will probably be around 2,000 GFLOPS. This is serious computing power. On your lap.

What the heck are most customers going to do with this? Will there be a Windows GlassWax and Mac OS XII Yeti where the user interface is a full 3D virtual world, and instead of navigating a directory tree to find things, you do dungeon crawls? Unlikely, but I think more likely than verbal input, even really well done, since talking aloud isn't viable in too many situations. Video editing, yes. Image search, yes too, but that's already here for some, and there are only so many times I want to find all the photos of Aunt Bessie. 3D FaceSpace? Maybe, but if it were a big win, I think it would already exist in 2.5D. Same for simple translations of the web pages into 3D. Games? Sure, but that's targeting a comparatively narrow user base, with increasingly less relevance to gameplay. And it's a user base that may shrink substantially due to cloud gaming (see my post Twilight of the GPU?).

It strikes me that this following of one's nose on hardware technology is a prime example of what Robert Capps brought up in a recent Wired article (The Good Enough Revolution: When Cheap and Simple Is Just Fine) quoting Clay Shirky, an NYU new media studies professor, who was commenting on CDs and lossless compression compared to MP3:

"There comes a point at which improving upon the thing that was important in the past is a bad move," Shirky said in a recent interview. "It's actually feeding competitive advantage to outsiders by not recognizing the value of other qualities." In other words, companies that focus on traditional measures of quality—fidelity, resolution, features—can become myopic and fail to address other, now essential attributes like convenience and shareability. And that means someone else can come along and drink their milk shake.

It may be that Intel is making a bet that the superior programmability of Larrabee compared with strongly graphics-oriented architectures like CUDA and Stream will give it a tremendous market advantage once integration sets in: Get "Intel Inside" and you get all these wonderful applications that AMD (Nvidia?) doesn't have. That, however, presumes that there are such applications. As soon as I hear of one, I'll be the first to say they're right. In the meantime, see my admittedly sarcastic post just before this one.

My solution? I don't know of one yet. I just look at integrated Larrabee and immediately think peacock, or Irish Elk – 88 lbs. of antlers, 12 feet tip-to-tip.

Megaloceros Giganteus, the Irish Elk, as integrated Larrabee.
Based on an image that is Copyright Pavel Riha, used with implicit permission
(Wikimedia Commons, GNU Free Documentation License)

They're extinct. Will the traditional high-performance personal computer also go extinct, leaving us with a volume market occupied only by the successors of netbooks and smart phones?

---------------------------------------

The effect discussed by Shirky makes predicting the future based on current trends inherently likely to fail. That happens to apply to me at the moment. I have, with some misgivings, accepted an invitation to be on a panel at the Anniversary Celebration of the Coalition for Academic Scientific Computation.

The misgivings come from the panel topic: HPC - the next 20 years. I'm not a futurist. In fact, futurists usually give me hives. I'm collecting my ideas on this; right now I'm thinking of democratization (2TF on everybody's lap), really big data, everything bigger in the cloud, parallelism still untamed but widely used due to really big data. I'm not too happy with those, since they're mostly linear extrapolations of where we are now, and ultimately likely to be as silly as the flying car extrapolations of the 1950s. Any suggestions will be welcome, particularly suggestions that point away from linear extrapolations. They'll of course be attributed if used. I do intend to use a Tweet from Tim Bray (timbray) to illustrate the futility of linear extrapolation: “Here it is, 2009, and I'm typing SQL statements into my telephone. This is not quite the future I'd imagined.”

Tim Sweeney recently gave a keynote at High Performance Graphics 2009 titled "The End of the GPU Roadmap" (slides). Tim is CEO and founder of Epic Games, producers of over 30 games including Gears of War, as well as the Unreal game engine used in hundreds of games. There are lots of really interesting points in that 74-slide presentation, but my biggest keeper is slide 71:

[begin quote]

Lessons learned: Today's hardware is too hard!

If it costs X (time, money, pain) to develop an efficient single-threaded algorithm, then…
- Multithreaded version costs 2X
- PlayStation 3 Cell version costs 5X
- Current "GPGPU" version costs: 10X or more
Over 2X is uneconomical for most software companies!
This is an argument against:
- Hardware that requires difficult programming techniques
- Non-unified memory architectures
- Limited "GPGPU" programming models

[end quote]

Judging from the prior slides, by "GPGPU" Tim apparently means the DirectX 10 pipeline with programmable shaders.
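To make the slide's arithmetic concrete, here is a minimal Python sketch that applies its 2X threshold to the quoted multipliers. The table and function names are hypothetical, made up for illustration; the numbers are the ones on the slide, with the "GPGPU" entry treated as a lower bound since the slide says "10X or more".

```python
# Relative development costs from the quoted slide, in units of X
# (the cost of an efficient single-threaded implementation).
# The "GPGPU" figure is a lower bound: the slide says "10X or more".
RELATIVE_COST = {
    "single-threaded": 1.0,
    "multithreaded": 2.0,
    "PlayStation 3 Cell": 5.0,
    "GPGPU": 10.0,
}

# The slide's threshold: "Over 2X is uneconomical for most software companies!"
ECONOMIC_LIMIT = 2.0


def is_economical(target: str, limit: float = ECONOMIC_LIMIT) -> bool:
    """Return True if the target's relative cost does not exceed the limit."""
    return RELATIVE_COST[target] <= limit


def economical_targets(limit: float = ECONOMIC_LIMIT) -> list[str]:
    """List the targets that stay at or under the limit, in table order."""
    return [name for name in RELATIVE_COST if is_economical(name, limit)]


if __name__ == "__main__":
    for name, cost in RELATIVE_COST.items():
        verdict = "ok" if is_economical(name) else "uneconomical"
        print(f"{name:20s} {cost:4.1f}X  {verdict}")
```

Read this way, only the single-threaded and multithreaded versions clear the bar, which is the slide's point: anything that needs the Cell or "GPGPU" treatment is priced out for most software companies.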

I'm not sure what else to make of this beyond rehashing Tim's words, and I'd rather point you to his slides than start doing that. The overall tenor somewhat echoes comments I made in one of my first posts; it continues to be the most hit-on page of this blog, so I must have said something useful there.

I will note, though, that Tim's estimates of effort are based on very extensive experience – with game programming. For low-ish levels of parallelism, like 4 or 8, multithreading adds zero cost to typical commercial applications already running under a competent transaction monitor. It just works, since they're already at that level of software multithreading for other reasons (like achieving overlap with IO waits). Of course, that's not at all universally true for commercial applications, particularly for high levels of parallelism, no matter how much cloud evangelists talk about elasticity.
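As a rough illustration of why that kind of multithreading comes nearly for free, here is a small Python sketch of independent requests overlapping their IO waits on a fixed-size thread pool. It is a toy under stated assumptions, not a model of any real transaction monitor: `handle_request` and `run_requests` are hypothetical names, and the IO wait is simulated with a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id: int, io_wait_s: float = 0.05) -> str:
    """Hypothetical transaction: mostly waiting on IO (simulated by sleep)."""
    time.sleep(io_wait_s)  # stands in for a disk or network wait
    return f"request {request_id} done"


def run_requests(n_requests: int, n_workers: int) -> tuple[list[str], float]:
    """Run requests on a fixed-size thread pool; return results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() hands each request to a free worker and keeps input order.
        results = list(pool.map(handle_request, range(n_requests)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 4, 8):
        _, elapsed = run_requests(n_requests=8, n_workers=workers)
        print(f"{workers} worker(s): {elapsed:.2f} s")
```

The per-request code is identical at every worker count; only the pool size changes. That is the sense in which such applications pick up low levels of parallelism without being rewritten, because the requests are independent and spend most of their time waiting.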

Once again, thanks to my friend who is expert at finding things like this slide set (it's not on the conference web site) and doesn't want his name mentioned.

Short post this time.

Sunday, August 30, 2009

A Larrabee in Every PC and Mac

Parallelism Needs a Killer Application

Saturday, August 15, 2009

Today’s Graphics Hardware is Too Hard