Thursday, January 28, 2010

The Problem with Larrabee

Memory bandwidth. And, most likely, software cost. Now that I've given you the punch lines, here's the rest of the story.

Larrabee, Intel's venture into high-performance graphics (and accelerated HPC), the root of months of trash talk between Intel and Nvidia, is well-known to have been delayed sine die: The pre-announced 2010 product won't happen, although some number will be available for software development, and no new date has been given for a product. It's also well-known for being an architecture that's clearly programmable with standard thinking and tools, as opposed to other GPGPUs (Nvidia, ATI/AMD), which look like something from another planet to anybody but a graphics wizard. In that context, a talk at Stanford earlier this month by Tom Forsyth, one of the Larrabee architects at Intel, is an interesting event.

Tom's topic was The Challenges of Larrabee as a GPU, and it began with Tom carefully stating the official word on Larrabee (delay) and then interpreting it: "Essentially, the first one isn't as cool as we hoped, and so there's no point in trying to sell it, because no one would buy it." Fair enough. He promised they'd get it right on the 2nd, 3rd, 4th, or whatever try. Everybody's doing what they were doing before the announcement; they'll just keep on truckin'.

But, among many other interesting things, he also said the target memory bandwidth – presumably required for adequate performance on the software renderer being written (and rewritten, and rewritten…) – was to be able to read 2 bytes / thread / cycle.

He said this explicitly, vehemently, and quite repeatedly, further asserting that they were going to try to maintain that number in the future. And he's clearly designing code to that assertion. Here's a quote I copied: "Every 10 instructions, dual-issue means 5 clocks, that's 10 bytes. That's it. Starve." And most good code will be memory-limited.

The thing is: 2 bytes / cycle / thread is a lot. It's so big that a mere whiff of it would get HPC people, die-hard old-school HPC people, salivating. Let's do some math:

Let's say there are 100 processors (high end of numbers I've heard). 4 threads / processor. 2 GHz (he said the clock was measured in GHz).

That's 100 cores x 4 threads x 2 GHz x 2 bytes = 1600 GB/s.
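The arithmetic above can be written out as a small sketch. The function name and parameter names are mine, purely for illustration; the inputs are the guesses from the text, not confirmed Larrabee specifications.

```python
def aggregate_bandwidth_gb_s(cores, threads_per_core, clock_ghz, bytes_per_thread_cycle):
    """Aggregate read bandwidth in GB/s if every thread sustains the target rate."""
    # GHz times bytes/cycle is GB/s for one thread; scale by the thread count.
    return cores * threads_per_core * clock_ghz * bytes_per_thread_cycle

# Assumed figures from the text: 100 cores, 4 threads each, 2 GHz, 2 bytes/thread/cycle.
total = aggregate_bandwidth_gb_s(cores=100, threads_per_core=4,
                                 clock_ghz=2, bytes_per_thread_cycle=2)
print(total)  # 1600
```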

Let's put that number in perspective:

  • It's moving more than the entire contents of a 1.5 TB disk drive every second.
  • It's more than 100 times the bandwidth of Intel's shiny new QuickPath system interconnect (12.8 GB/s per direction).
  • It would soak up the output of 33 banks of DDR3-SDRAM, all three channels, 192 bits across the three channels, 48 GB/s aggregate per bank.

In other words, it's impossible. Today. It might be that Intel is pausing Larrabee to wait for product shipment of some futuristic memory technology, like the 3D stacked chips with direct vias (vertical wires) passing all the way through the RAM chip to the processor stacked on it (Exascale Ambitions talk at Salishan 20 by Bill Camp, Intel’s Chief Architect/CTO of HPC, p. 21). Tom referred to the memory system designers as wizards beyond his comprehension; but even so, such exotica seems a flaky assumption to me.
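The comparisons in the list above are simple ratios, and a rough sketch makes them easy to check. The constant names are illustrative; the values are the ones quoted in the post.

```python
REQUIRED_GB_S = 1600      # the 100-core, 2 GHz estimate
DISK_GB = 1500            # a 1.5 TB drive
QPI_GB_S = 12.8           # QuickPath, per direction, as quoted
DDR3_BANK_GB_S = 48       # three-channel DDR3 aggregate, as quoted

disks_per_second = REQUIRED_GB_S / DISK_GB
qpi_directions = REQUIRED_GB_S / QPI_GB_S
ddr3_banks = REQUIRED_GB_S / DDR3_BANK_GB_S

print(f"{disks_per_second:.2f} disks/s, "
      f"{qpi_directions:.0f} QuickPath directions, "
      f"{ddr3_banks:.1f} DDR3 banks")
```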

What are the levers we have to reduce it? Processor count, clock rate, and that seems to be it. They need those 4 threads / processor (it's at the low end of really keeping their 4-stage pipe busy). He said the clock rate was "measured in GHz," so 1 GHz is a floor there. That's still 800 GB/s. Down to 25 processors we go; I don't know about you, but much lower than 24 cores starts moving out of the realm I was led to expect. But 25 processors still gives 200 GB/s. This is still probably impossible, but starting to get in the realm of feasibility. Nvidia's Fermi, for example, is estimated as having in excess of 96 GB/s.
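To see how far each lever moves the number, here is a minimal sketch that sweeps the three scenarios discussed above. The helper name and the scenario list are my own; the 96 GB/s Fermi figure is the estimate quoted in the text.

```python
def bandwidth_gb_s(cores, clock_ghz, threads_per_core=4, bytes_per_thread_cycle=2):
    """Required bandwidth in GB/s under the 2 bytes/thread/cycle reading."""
    return cores * threads_per_core * clock_ghz * bytes_per_thread_cycle

FERMI_ESTIMATE_GB_S = 96  # estimate quoted in the text

# (cores, clock in GHz): the starting point, then each lever pulled in turn.
scenarios = [(100, 2), (100, 1), (25, 1)]
results = {s: bandwidth_gb_s(*s) for s in scenarios}

for (cores, ghz), gb_s in results.items():
    ratio = gb_s / FERMI_ESTIMATE_GB_S
    print(f"{cores:3d} cores @ {ghz} GHz: {gb_s:4d} GB/s ({ratio:.1f}x the Fermi estimate)")
```

Even the most conservative row is still roughly twice the Fermi estimate.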

So maybe I'm being a dummy: He's not talking about main memory bandwidth, he's talking about bandwidth to cache memory. But then the number is too low. Take his 10 instructions, dual issue, 5 clocks, 10 bytes example: You can get a whole lot more than 10 bytes out of an L1 cache in 5 clocks, not even counting the fact that it's probably multi-ported (able to do multiple accesses in a single cycle).

So why don't other GPU vendors have the same problem? I suspect it's at least partly because they have several separate, specialized memories, all explicitly controlled. The OpenCL memory model, for example, includes four separate memory spaces: private, local, constant, and global (cached). These are all explicitly managed, and, if you align your stars correctly, can all be running simultaneously. (See OpenCL Fundamentals, Episode 2). In contrast, Larrabee has to choke it all through one general-purpose memory.

Now, switching gears:

Tom also said that the good news is that they can do 20 different rendering pipelines, all on the same hardware, since it's a software renderer; and the bad news is that they have to. He spoke of shipping a new renderer optimized to a new hot game six months after the game shipped.

Doesn't this imply that they expect their software rendering pipeline to be fairly marginal – so they are forced to make that massive continuing investment? When asked why others didn't do that, he indicated that they didn't have a choice; the pipeline's in hardware, so that one size fits all. Well, in the first place that's far less true with newer architectures; both Nvidia and ATI (AMD) are fairly programmable these days (they'd say "very programmable," I'm sure). In the second place, if it works adequately, who cares if you don't have a choice? In the third place, there's a feedback loop between applications and the hardware: Application developers work to match what they do to the hardware that's most generally available. This is the case in general, but is particularly true with fancy graphics. So the games will be designed to what the hardware does well, anyway.

And I don't know about you, but in principle I wouldn't be really excited about having to wait 6 months to play the latest and greatest game at adequate performance. (In practice, I'm never the first one out of the blocks for a new game.)

I have to confess that I've really got a certain amount of fondness for Larrabee. Its architecture seems so much more sane and programmable than the Nvidia/ATI gradual mutation away from fixed-function units. But these bandwidth and programming issues really bother me, and shake out some uncomfortable memories: The last time I recall Intel attempting, as in Larrabee, to build whatever hardware the software wanted (or appeared to want on the surface), what came out was the ill-fated and nearly forgotten iAPX 432, renowned for hardware support of multitasking, object-oriented programming, and even garbage collection – and for being 4x slower than an 80286 of the same frequency.

Different situation, different era, different kind of hardware design, I know. But it still makes me worry.

(Acknowledgement: The graphic comparison to 1.5 TB disk transfer was suggested by my still-wishing-to-remain-anonymous colleague, who also pointed me to the referenced video. This post generally benefited from email discussion with him.)


Rex Guo said...

Nice post.

I feel that Larrabee's problem is the x86 IA itself.
The opcodes are large to begin with and were never designed to be bandwidth-efficient. So this could be a case of hammer and nail.

David Kanter said...

Greg - great post and thanks for the link! I didn't see Forsyth's talk, but it sounds like something on my agenda for this weekend. 2B/(cycle*thread) does sound quite impressive.

It's a little hard to figure out how this compares to the more traditional metric (Bytes/FLOP). The (cycle*thread) is going to be specific to each microarchitecture (i.e. each cycle a LRB core can do 16 DP FLOPs with 4 threads, while Nehalem can only do 4 DP FLOPs on 2 threads). Each cycle, a Nehalem core gets the equivalent of 2.7B of memory bandwidth (plus cache bandwidth).

The tricky part is that Nehalem and LRB are totally different beasts. On a per cycle basis, each LRB thread can do 2X the compute of a Nehalem thread...

Anyway, after playing around with the numbers, it turns out that Nehalem has an incredible bytes/FLOP ratio. It's much higher than any GPU with reasonable double precision performance (i.e. ATI GPUs or NV's Fermi). On top of that, you have your amazing cache hierarchy (32B/cycle from each L1 @ 3GHz = 96GB/s, 384GB/s for the whole chip).
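The L1 figures in the parenthesis can be reproduced with a short sketch. The core count of 4 is my assumption, inferred from 384 / 96; the comment does not state it.

```python
L1_BYTES_PER_CYCLE = 32   # per core, as quoted in the comment
CLOCK_GHZ = 3             # as quoted
CORES = 4                 # assumption: inferred from 384 / 96, not stated

per_core_gb_s = L1_BYTES_PER_CYCLE * CLOCK_GHZ   # bytes/cycle x GHz = GB/s
chip_gb_s = per_core_gb_s * CORES

print(per_core_gb_s, chip_gb_s)  # 96 384
```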

Reading between the lines then, it sounds like what Tom is saying is that they want to figure out a way to provide the massive compute power of a GPU, but without sacrificing the rich B/FLOP of a CPU.

Clearly, GDDR5 alone won't cut it - so the only answer I can come up with is that they might try some sort of high speed DRAM on the same package.


Andrew Richards said...

Hi Greg,

Thanks for a great article and link. Very interesting analysis. My reading of what they're saying about bandwidth is slightly different. I think they're talking about memory bandwidth in a slightly different way.

What's different about a Pentium core as opposed to a normal GPU core is the complexity of address generation. A Pentium, when doing a load or store, takes not just a 32-bit or 64-bit address, but also a segment register and a page table. In Pentium processors, you have an address generation unit, which does this complex transformation for each load and store. This is a bit slow, so you want to hide that latency. The way LRB hides the latency is by quickly switching to another one of those 4 threads while the load is happening.

What's different about LRB is that you have 512-bit, 16-wide vector load/stores. And not just a simple 512-bit load/store, but also scatter/gather load/stores. And for GPU tasks, or data-parallel tasks, the scatter/gather loads are critical. But a scatter/gather load/store requires 16 logical-to-physical address translations, 16 cache checks, 16 cache-coherency checks. And these can't be done in a single cycle. This does affect local and global memory accesses. So, effectively, memory latency and bandwidth for GPU code are worse than the 4/5/6 cycles you would expect for this type of architecture.

Andrew Richards said...

Tom seems to be saying conflicting things about this, so it's presumably quite complex, and subject to change. But he does say that it probably takes at least 16 cycles to load from 16 different addresses, unless you're hitting the same cache-line.

Maybe this doesn't sound important, but it depends on how you're programming it. For standard shader execution, where I would expect 16 shaders to run in parallel on a single LRB core thread, it means any load inside a shader program probably takes at least 16 cycles, if you get L1$ hits for all loads. Some of those cycles of latency may be hidden by the other 3 sets of 16 shader threads (out of the total of 4 sets of 16 shader threads), but not if the other threads do a memory load. So, this isn't too bad, but it assumes L1$ hits.

"2 bytes per core per clock of bandwidth" works out as 1 32-bit load/store every 2 cycles, or 1 512-bit gather/scatter load/store every 32 cycles. The 4 threads per core don't help here (that's hiding latency, not throughput). So, you can only issue 1 load or store every 32 instructions, which for graphics doesn't sound enough to me. Remember that this is all assuming L1$ hits.

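The load/store intervals in the previous paragraph follow directly from the 2 bytes per core per clock budget. A minimal sketch, with an illustrative helper name of my own:

```python
BYTES_PER_CORE_CLOCK = 2  # the budget quoted from the talk

def cycles_per_access(access_bits, bytes_per_clock=BYTES_PER_CORE_CLOCK):
    """Cycles of bandwidth budget consumed by one access of the given width."""
    access_bytes = access_bits // 8
    return access_bytes // bytes_per_clock

scalar_interval = cycles_per_access(32)    # one 32-bit load/store
vector_interval = cycles_per_access(512)   # one 512-bit vector load/store

print(scalar_interval, vector_interval)  # 2 32
```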
Is Tom saying that if you write code to reduce power consumption, the clock goes faster? That seems odd, but the suggestion to me is that the LRB has to slow down if the power consumption goes too high, so write your code to consume less power.

It may be just a simplification for the presentation, but that isn't a DirectX11 pipeline. Maybe one of the bits of software still to be done? DX11 features should benefit Larrabee, though.

Greg Pfister said...

Thanks for the serious and well-thought-out comments, all.

About LRB 16-address vector loads -- I've been worried about that one for some time. Waiting for 16 cache misses would be crazy. What I heard was Tom saying, basically, "so don't do that" -- organize your data better so it all comes out of one or a few cache lines. This implies some line size dependent coding, but that's not too horrible, I suppose.

About power -- yes, I did a double take on that too. I had to back up the video and listen to it a few more times. Vector predication significantly affecting power, and speed, surprised me a lot. Apparently the dynamic power when the Mul/Add unit is used is seriously higher than its static standby power, enough to potentially kick in a temperature -> clock speed feedback loop to stabilize temperature.

A while back in a post on the next 50 years I suggested that one would pay for computing by how much power it used, not by how many cores or whatever you occupied. Looks like it's starting to come true sooner than I thought.

Gary Lauterbach said...

Greg -

Thanks for the Larrabee write-up, I enjoy reading your blog.

My understanding of Tom's talk differs a bit from some of your numbers. On one slide Tom clearly has a bullet that says 2 bytes from memory per core per cycle. In your calculation of memory bandwidth requirements you used threads rather than cores; this bumps the memory BW by 4x. I believe the threads do not run concurrently on a core (if they did they wouldn't "hide" memory latency).

With another set of assumptions:

1.5 GHz clock
32 cores
2 bytes / cycle /core

Results in a memory BW of 96 GB/s, the same as the Fermi estimate.
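As a rough check of these assumptions, the per-core reading and the per-thread reading differ by exactly the thread count. The function name is illustrative; the inputs are the ones listed above.

```python
def memory_bw_gb_s(clock_ghz, cores, bytes_per_cycle_per_core):
    """Memory bandwidth in GB/s when the budget is counted per core."""
    return clock_ghz * cores * bytes_per_cycle_per_core

THREADS_PER_CORE = 4

per_core_reading = memory_bw_gb_s(clock_ghz=1.5, cores=32, bytes_per_cycle_per_core=2)
# Counting the same 2 bytes per thread instead of per core inflates the result 4x.
per_thread_reading = per_core_reading * THREADS_PER_CORE

print(per_core_reading, per_thread_reading)  # 96.0 384.0
```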

I agree it will be difficult to continue to scale Larrabee memory BW in the future, but isn't that one of the Achilles' heels of all multi-core designs?


Andrew Richards said...

My biggest concern is the figure of 1.25 texture cache misses per sample. That figure looks very strange.

Greg Pfister said...

@Gary - I think you're right; it's there on p. 49 of the slides. But how do you write code to that spec? He said 10 instructions, 10 bytes, but counting instructions is a per thread affair. If that's divided by 4 for all 4 threads, it's 2.5 bytes/10 instructions, which is hard to reach. He also said he assumed 2 of the threads are stalled, at any given time, so maybe only the running ones count? 5 bytes/10 instructions is still really limiting.
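The three byte budgets mentioned here (10, 2.5, and 5 bytes per 10 instructions) come from dividing the same per-core allowance among different numbers of threads. A small sketch, using names of my own invention:

```python
ISSUE_WIDTH = 2            # dual issue, as stated in the talk
BYTES_PER_CORE_CYCLE = 2   # per-core reading of the slide

def bytes_per_window(instructions, sharing_threads):
    """Memory bytes available to one thread while it issues `instructions`."""
    clocks = instructions / ISSUE_WIDTH
    core_bytes = clocks * BYTES_PER_CORE_CYCLE
    return core_bytes / sharing_threads

whole_core = bytes_per_window(10, sharing_threads=1)    # budget not shared
all_threads = bytes_per_window(10, sharing_threads=4)   # split across 4 threads
running_only = bytes_per_window(10, sharing_threads=2)  # 2 of 4 threads stalled

print(whole_core, all_threads, running_only)  # 10.0 2.5 5.0
```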

@andrew - I'm not calibrated on texture cache misses. What's normal?


Gary Lauterbach said...

I think the key is that the threads are interleaved, not simultaneous. 10 instructions in 5 cycles assumes the full issue rate of 2 per cycle is sustained. In those 5 cycles an average of 2 bytes per cycle can be loaded from memory. These are the raw rates for the core pipeline. Any given thread "sees" these same inst./memory BW ratios but at an average rate of 1/4 the raw core clock rate since the core is "timeshared" across 4 threads.

Larrabee with a small dispatch width of 2 and running carefully scheduled code that is sustaining 2 ipc makes SMT not beneficial. IMT is used to "hide" the latency of L1 cache misses. I use quotes on hide because the actual wall clock time is still there, it's just that the thread that caused the miss is not running so its instruction stream does not perceive the miss latency.

Andrew Richards said...

The AMD 5870 achieves 10-60 billion texture samples per second. Multiplying that by 1.5 gives an unreasonable number of cache misses, if going to DRAM. Especially when you consider the other memory-bandwidth-intensive things going on.

Andrew Richards said...

I can't make any sense of it. It looks like I've misunderstood, given those numbers, but what Tom says is that the texture caches don't work very well because of the binning architecture of the renderer. Because the renderer is rendering triangles out of order, you lose the texture locality of the renderer. When rendering in a game, you order the triangles so that you're drawing triangles with the same textures. But that ordering is lost by the binning into tiles. So you end up either having to have impractically large texture caches, or accepting you'll get a lot of cache misses. Are these just L1 cache misses that go to L2 cache? Tom seems to be saying not, given his diagram of the system and he says that the L2 cache is used for the frame buffer tile and binning data. I must be missing something.

Gary Lauterbach said...

Normalizing to a flops/operand basis I think Tom may have a point in saying most algorithms will be memory starved. The Larrabee VPU can perform 16 SP mul-adds per cycle, 32 flops, during which 2 bytes of memory data can be delivered, 1/2 operand => 64 flops per operand from memory. Note that this ratio holds for DP as well.

Using an operand/flops metric I get the following for the Larrabee VPU:

Registers: 3 operands per flop
L1 cache: 1 operand per flop
Memory: 1/64 operand per flop
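The memory row, and the claim that the ratio is the same for DP, can be checked with a short sketch. It assumes the 512-bit vector width mentioned earlier in the thread and counts a mul-add as two flops; the function name is illustrative.

```python
def flops_per_memory_operand(operand_bytes, vector_bits=512, mem_bytes_per_cycle=2):
    """Flops the VPU can perform per operand delivered from memory."""
    lanes = vector_bits // (operand_bytes * 8)    # 16 for SP, 8 for DP
    flops_per_cycle = lanes * 2                   # a mul-add counts as 2 flops
    operands_per_cycle = mem_bytes_per_cycle / operand_bytes  # 0.5 SP, 0.25 DP
    return flops_per_cycle / operands_per_cycle

sp_ratio = flops_per_memory_operand(operand_bytes=4)
dp_ratio = flops_per_memory_operand(operand_bytes=8)

print(sp_ratio, dp_ratio)  # 64.0 64.0
```

Halving the lane count and halving the operands delivered per cycle cancel out, which is why the ratio does not change between SP and DP.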

Paulius Micikevicius said...

Regarding flops/byte subdiscussion, I think there are two points.

The first one, byte throughput, was addressed by Gary. Specifically, don't count instructions or bytes per thread, but rather from the core point of view. Since threads are interleaved (just like on NV or AMD GPUs), it's the aggregate that you care about. So, we know it's 2 bytes of 'main memory' per clock and dual-issue per clock (main memory is in quotes, because that's GPU, not host CPU, DRAMs). The perfectly balanced code then has 1:1 instruction:byte ratio. Now, depending on your data type, the instruction count converts to a different number of operations. Say we're looking at fp32, so 16-wide vector. Now we have 17:1 flop:byte ratio for the perfect balance (depending on how you count fmad it may be higher).

The second issue is how theoretical memory bandwidth is counted. In the original post Greg assesses that 200 GB/s is probably impossible, using 96 GB/s Fermi bandwidth as reference. However, it depends on how you count it. Current GPUs (AMD's 5870, NV's GTX 285) already have 150+ GB/s bandwidth. Now, if you're looking at a copy where each byte is moved twice across the bus (load then store), but count each byte once, you halve that to 75+ GB/s (which is fine, as long as you use the same convention for all processors). Though I would argue that many codes are dominated by read bandwidth, writes consuming much less bandwidth. So, a few hundred GB/s in the next few years doesn't sound that far fetched.

Phil Taylor said...

wrt Tom's comments about texture and cache, he was referencing the separate cache available to the TxS unit.


page 9

The effect of losing the locality of texture references due to the binning style architecture means the TxS cache isn't as effective as originally planned. Upsizing the cache does help some.

Andrew Richards said...

Hi Phil, Thanks for that. It would be interesting to see how much benefit you get for how much texture cache (although I suspect that's very game-dependent.) Tom gave some interesting numbers in his talk, which just beg more questions. It seems to me that texture caches are much more effective with today's games than they were a year or 2 ago.

dan said...

Hey Greg,

Love your site. I have it on Google Reader and dip into it from time to time.

I'm glad to see that someone else remembers the iAPX 432. When I talk about it to people, it leads to a lot of head shaking. The wiki page doesn't say it, but I remember the instruction set as having been Huffman encoded to reduce text memory usage.

Dan Robinson
