Memory bandwidth. And, most likely, software cost. Now that I've given you the punch lines, here's the rest of the story.
Larrabee, Intel's venture into high-performance graphics (and accelerated HPC), the root of months of trash talk between Intel and Nvidia, is well known to have been delayed sine die: The pre-announced 2010 product won't happen, although some units will be available for software development, and no new date has been given for a product. It's also well known for being an architecture that's clearly programmable with standard thinking and tools, as opposed to other GPGPUs (Nvidia, ATI/AMD), which look like something from another planet to anybody but a graphics wizard. In that context, a talk at Stanford earlier this month by Tom Forsyth, one of the Larrabee architects at Intel, is an interesting event.
Tom's topic was "The Challenges of Larrabee as a GPU," and it began with Tom carefully stating the official word on Larrabee (delay) and then interpreting it: "Essentially, the first one isn't as cool as we hoped, and so there's no point in trying to sell it, because no one would buy it." Fair enough. He promised they'd get it right on the 2nd, 3rd, 4th, or whatever try. Everybody's doing what they were doing before the announcement; they'll just keep on truckin'.
But, among many other interesting things, he also said the target memory bandwidth, presumably required for adequate performance on the software renderer being written (and rewritten, and rewritten…), was to be able to read 2 bytes / thread / cycle.
He said this explicitly, vehemently, and quite repeatedly, further asserting that they were going to try to maintain that number in the future. And he's clearly designing code around that assumption. Here's a quote I copied: "Every 10 instructions, dual-issue means 5 clocks, that's 10 bytes. That's it. Starve." And most good code will be memory-limited.
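Just to spell out the arithmetic buried in that quote, here's a quick back-of-envelope sketch in C; the numbers are my reading of what he said, not anything Intel has published:

```c
#include <stdio.h>

/* My reading of the arithmetic behind the quote: at dual issue,
 * 10 instructions take about 5 clocks, and at 2 bytes/thread/clock
 * the memory budget for that whole window is only 10 bytes. */
int main(void)
{
    int instructions = 10;
    int issue_width  = 2;                               /* dual issue */
    int clocks       = instructions / issue_width;      /* 5 clocks   */
    int bytes_per_thread_per_clock = 2;                 /* the stated target */

    printf("%d clocks x %d bytes/clock = %d bytes to feed %d instructions\n",
           clocks, bytes_per_thread_per_clock,
           clocks * bytes_per_thread_per_clock, instructions);
    return 0;
}
```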
The thing is: 2 bytes / cycle / thread is a lot. It's so big that a mere whiff of it would get HPC people, die-hard old-school HPC people, salivating. Let's do some math:
Let's say there are 100 processors (high end of numbers I've heard). 4 threads / processor. 2 GHz (he said the clock was measured in GHz).
That's 100 cores x 4 threads x 2 GHz x 2 bytes = 1,600 GB/s.
Let's put that number in perspective:
- It's moving more than the entire contents of a 1.5 TB disk drive every second.
- It's more than 100 times the bandwidth of Intel's shiny new QuickPath system interconnect (12.8 GB/s per direction).
- It would soak up the output of 33 banks of DDR3 SDRAM, each running all three channels (64 bits per channel, 192 bits in total) for 48 GB/s aggregate per bank.
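If you want to check my arithmetic, here's the whole back-of-envelope calculation as a little C sketch; the core count, clock, and comparison figures are the assumptions listed above, not published Intel numbers:

```c
#include <stdio.h>

/* Back-of-envelope check of the aggregate bandwidth and the comparisons
 * above.  All inputs are my assumptions, not Intel's published numbers. */
int main(void)
{
    double cores   = 100;     /* high end of core counts I've heard */
    double threads = 4;       /* per core */
    double clock   = 2e9;     /* Hz; "measured in GHz" */
    double bytes   = 2;       /* per thread per cycle */

    double bw = cores * threads * clock * bytes;              /* bytes/s */
    printf("aggregate bandwidth: %.0f GB/s\n", bw / 1e9);     /* 1600   */

    printf("QuickPath links (12.8 GB/s each): %.0f\n", bw / 12.8e9);  /* ~125  */
    printf("DDR3 banks (48 GB/s each):        %.1f\n", bw / 48e9);    /* ~33.3 */
    return 0;
}
```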
What are the levers we have to reduce it? Processor count and clock rate, and that seems to be it. They need those 4 threads / processor (it's at the low end of what really keeps their 4-stage pipe busy). He said the clock rate was "measured in GHz," so 1 GHz is a floor there. That's still 800 GB/s. Down to 25 processors we go; I don't know about you, but much lower than 24 cores starts moving out of the realm I was led to expect. But 25 processors still gives 200 GB/s. That's still probably impossible, but it's starting to get into the realm of feasibility. Nvidia's Fermi, for example, is estimated as having in excess of 96 GB/s.
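Here's that lever-pulling as a quick sweep, again just a sketch, with threads/core and bytes/thread/cycle held fixed as the talk implies:

```c
#include <stdio.h>

/* Sweep the only two levers I see (core count and clock rate), holding
 * threads/core = 4 and bytes/thread/cycle = 2 fixed per the talk. */
int main(void)
{
    const double threads = 4, bytes = 2;
    const int    cores[]  = { 100, 50, 25 };
    const double clocks[] = { 2e9, 1e9 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 2; j++)
            printf("%3d cores @ %.0f GHz -> %5.0f GB/s\n",
                   cores[i], clocks[j] / 1e9,
                   cores[i] * threads * clocks[j] * bytes / 1e9);
    return 0;
}
/* 100 cores @ 1 GHz gives 800 GB/s; 25 cores @ 1 GHz gives 200 GB/s. */
```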
So maybe I'm being a dummy: He's not talking about main memory bandwidth, he's talking about bandwidth to cache memory. But then the number is too low. Take his 10 instructions, dual-issue, 5 clocks, 10 bytes example: You can get a whole lot more than 10 bytes out of an L1 cache in 5 clocks, not even counting the fact that it's probably multi-ported (able to do multiple accesses in a single cycle).
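To put a rough number on "a whole lot more": assuming, purely for illustration, one L1 load per clock through either a modest 128-bit port or Larrabee's full 512-bit vector width (the port widths are my assumptions, not Intel's specs), the same five-clock window looks like this:

```c
#include <stdio.h>

/* Rough comparison of the stated 10-byte budget against what an L1 cache
 * could deliver in the same 5-clock window.  Port widths are illustrative
 * assumptions, not published Larrabee specifications. */
int main(void)
{
    int clocks      = 5;     /* the "10 instructions, dual-issue" window      */
    int narrow_port = 16;    /* bytes/clock: a modest 128-bit L1 load port    */
    int vector_port = 64;    /* bytes/clock: one 512-bit vector load per clock */

    printf("budget from the talk: 10 bytes\n");
    printf("modest L1 port:  %3d bytes in %d clocks\n", clocks * narrow_port, clocks);
    printf("vector-wide L1:  %3d bytes in %d clocks\n", clocks * vector_port, clocks);
    return 0;
}
```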
So why don't other GPU vendors have the same problem? I suspect it's at least partly because they have several separate, specialized memories, all explicitly controlled. The OpenCL memory model, for example, includes four separate memory spaces: private, local, constant, and global (cached). These are all explicitly managed, and if you align your stars correctly they can all be running simultaneously. (See OpenCL Fundamentals, Episode 2.) In contrast, Larrabee has to choke it all through one general-purpose memory.
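For the curious, here's roughly what those four spaces look like at the source level: an OpenCL C kernel sketch of my own (purely illustrative, not any vendor's code), where each address-space qualifier maps onto a different physical memory that can be busy at the same time:

```c
/* OpenCL C sketch (mine, illustrative only): each address-space qualifier
 * below maps to a different physical memory on a typical GPU, and accesses
 * to them can be in flight simultaneously. */
__kernel void scale(__global const float *in,   /* global:   off-chip DRAM          */
                    __constant float *coeffs,   /* constant: small broadcast memory */
                    __local float *tile,        /* local:    on-chip scratchpad     */
                    __global float *out)
{
    __private float acc;                        /* private:  registers              */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];                        /* stage a block into the scratchpad */
    barrier(CLK_LOCAL_MEM_FENCE);               /* whole work-group syncs            */

    acc = tile[lid] * coeffs[0];                /* these reads never touch DRAM      */
    out[gid] = acc;                             /* one global write per work-item    */
}
```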
Now, switching gears:
Tom also said that the good news is that they can do 20 different rendering pipelines, all on the same hardware, since it's a software renderer; and the bad news is that they have to. He spoke of shipping a new renderer, optimized for a hot new game, six months after the game shipped.
Doesn't this imply that they expect their software rendering pipeline to be fairly marginal, since they're forced to make that massive continuing investment? When asked why others didn't do that, he indicated that they don't have a choice; the pipeline's in hardware, so one size has to fit all. Well, in the first place, that's far less true with newer architectures; both Nvidia and ATI (AMD) are fairly programmable these days (they'd say "very programmable," I'm sure). In the second place, if it works adequately, who cares if you don't have a choice? In the third place, there's a feedback loop between applications and the hardware: Application developers work to match what they do to the hardware that's most generally available. That's the case in general, but it's particularly true with fancy graphics. So the games will be designed around what the hardware does well, anyway.
And I don't know about you, but in principle I wouldn't be really excited about having to wait 6 months to play the latest and greatest game at adequate performance. (In practice, I'm never the first one out of the blocks for a new game.)
I have to confess that I've really got a certain amount of fondness for Larrabee. Its architecture seems so much more sane and programmable than the Nvidia/ATI gradual mutation away from fixed-function units. But these bandwidth and programming issues really bother me, and they shake loose some uncomfortable memories: The last time I recall Intel attempting, as in Larrabee, to build whatever hardware the software wanted (or appeared to want on the surface), what came out was the ill-fated and nearly forgotten iAPX 432, renowned for hardware support of multitasking, object-oriented programming, and even garbage collection, and for being 4x slower than an 80286 at the same clock frequency.
Different situation, different era, different kind of hardware design, I know. But it still makes me worry.
(Acknowledgement: The graphic comparison to a 1.5 TB disk transfer was suggested by my still-wishing-to-remain-anonymous colleague, who also pointed me to the referenced video. This post generally benefited from email discussion with him.)