The Perils of Parallel: March 2010

The recent near-simultaneous announcements of IBM's POWER7 and Intel's Itanium 9300 ("Tukwila") invites some background on how those differ in essential architecture, why VLIW (Itaniam's architecture) never was a good idea in the first place, and how they compare to the alternative simple multicore systems such as Sun's (now Oracle's) Niagara.

Why do this when this blog is about parallel stuff, and they're all, except Sun, devoted to single-stream performance? This isn't "The Perils of Serial," and won't become that, but it nevertheless is wise to know thy enemy. Besides, it will explain what "small simple core" really means compared with the alternatives.

So: POWER is Superscalar, Itanium is VLIW (Very Long Instruction Word; Intel calls its version EPIC for Explicitly Parallel Instruction Computer), and Niagara is composed of many simple cores.

Everything clear now?

I hope not, since I've written a pile of stuff here that gives my understanding of how those architectures differ. The explanations will deliberately leave out a whole lot of detail in an attempt to render them approachable, but I'll give some hints about what I've left out. I'd appreciate any feedback indicating that I got things wrong, or left out issues that I mistakenly thought weren't key.

Finally, before I forget: The real killer tech among the announcements is none of this architecture stuff. It's some silicon technology in POWER7. IBM's fabs have managed to put DRAM on the same silicon as the POWER7 logic, something known as embedded DRAM, or eDRAM. This is a real feat, since the fab process to make DRAM and the process to make logic are very dissimilar; SRAM, in contrast, uses the same process as logic, which is one reason it's often seen in caches (another is that it's faster than DRAM). eDRAM lets POWER7 have a huge cache (32MB) that eats far less chip area because it is dense DRAM instead of more sprawling SRAM; uses less power than equivalent-size SRAM; and puts that cache right on the same silicon as the processors, reducing its access latency by 6 times compared with off-chip cache (they say). Hats off!

Starting Point: A Very Simple Processor

In its guts, a CPU has functional blocks of logic that are connected roughly like this (very simplified):

There's a unit that gets the next instruction (Fetch), another that figures out what that instruction says to do (Decode), and then another that finally does what it says (Execute). Execution, however, itself uses different hardware units for the different kinds of things that an instruction could say to do. So it's actually split up like this (for example, and deliberately leaving out memory access and writeback for simplicity):

The Integer unit is designed to do integer operations, like adding, masking, rotating bits, etc. The Float unit is similarly designed to do only floating-point operations, like add, multiply-add, and so on. The branch unit decides whether to specify where the next instruction fetched should come from. The three share just about nothing; they're completely different logic functions.

In what we have so far, only one instruction is being worked on at a time: One instruction is fetched, then that same instruction is passed to decode, then that same instruction goes to one of the execution units, and only then, after the current instruction is over, does the next instruction get fetched. If we have one clock tick for each of those operations, it does one instruction every three clock ticks, at best.

Once upon a time computers actually did just that: they did one instruction, period, at a time. Then someone introduced the joys of pipelining:

Now we can be doing three instructions simultaneously. Starting from the right:

We are executing instruction N, actually doing what it said to do – an integer add, for example.
At the same time, we're decoding the next instruction to be done, instruction N+1.
And at the same time as that, we're getting hold of (Fetching) the instruction to be done before that one, namely instruction N+2.

"Fetch" now becomes free-running; it just keeps getting the next instruction. This, however, produces a problem: Branches. If instruction N is a branch whose execution says the next instruction performed should be at N+2289 (for example), you do not want to fetching and decoding N+1 and N+2. So you dump them out, replacing them with null operations while the instruction at N+2289 is fetched. The time wasted doing those nulled-out operations constitutes the infamous pipeline stall. There are good heuristics for avoiding this kind of pipeline stall; they basically involve making an educated guess about what the next instruction is going to be. For example, some processors keep a branch history table, a cache of where recent branches went. Then, as soon as a branch is fetched, the next fetch done is from the location that followed the branch last time. This works really well for every iteration of a loop but the last, for example. (And the first.)

We are now running quite a bit faster. Except for pipeline stalls, we finish an instruction every clock cycle, rather than every three. This is good. But there's an opportunity to do even better staring us in the face, if we know where to look.

The Road to Superscalar

Look at those three execution units on the right: Integer, Float, and Branch. Two out of three of them are sitting idle all the time, since no one instruction ever uses them all at once. But it's easy to think of cases where they could be running simultaneously.

For example, suppose X, Y, and Z are floats; and A, B, and C are integers. Then you could do these two instructions at the same time:

X = Y + Z /* fadd (fregY, fregX, fregZ)

A = B + C /* add (iregB, iregC, iregA)

These two use different logic – one the integer execution unit, the other the float execution unit. And they use different registers – one reads and writes integer registers, the other just uses float registers. This kind of thing happens all the time in computational loops, with the integer operation counting an index or something. Nothing intrinsically says you can't do them at the same time, so why not do that?

Well, because not every pair of instructions, or trio, or whatever, is independent the way those two are. Suppose we had these two successive instructions:

A=A+1

B=A+1

Were these done at the same time, B won't see the updated value of A. Crash!

We have to pick out the independent sets of instructions. How can we pull that off?

Superscalar

Like this:

First, we have to fetch faster than one instruction per cycle. No problem there; the units of storage in instruction caches, cache lines, hold well more than one instruction each.

And we have to be able to decode more than one instruction per cycle. Just replicate the decoding logic (not illustrated).

Finally, we have to hold all those instructions at the same time in a queue, the instruction queue. If we don't have them all at the same time, it's rather difficult to do them at the same time. (The queue's length we'll talk about later, but generally it's pretty short, like four.)

That queue isn't just a passive queue; it has to have some special logic winding among its storage elements:

It has to compare each instruction to all others in the queue (logic not shown), and flag dependencies: This instruction uses a register that one uses – no good, we must do them in order. Or: these two are independent in the data they work on, but need the same execution unit – no good, must do them in order.
Whenever it finds some that are independent, it shoves that whole group, simultaneously, at their appropriate execution units. Wham! All done at once. The need to send all at once is why there are so many paths to the execution unit.

That's superscalar. It's "super" because, under the right conditions, it can execute more than one instruction per cycle. It's "scalar" because, well, it's not vector. That's another way, and the only other way at the time this was invented, to do more than one operation per cycle. For example, specify two sets of N operands each, have one instruction that says "add them all," and have a pile of adders available to apply to the job.

Superscalar can get even more super than the previous diagram indicates. What if instructions are independent, but they both use the integer unit, or both use the float unit? Simple: Add more units.

Multiple float units have obvious uses (see vector case above). However, multiple integer units are more often useful, since lots of address calculations go on independent of program-explicit integer operations. Also, integer units are a lot smaller and less complicated that float units. Most superscalar designs have at least two integer units, and often more.

Superscalar designs like the whole IBM POWER series can do very well. What you get out of them is, of course, heavily dependent on the instruction stream – how independent the instructions are from each other. Lots of compiler technology has been applied to this, with the result that aggressive superscalar designs can perform pure computational floating-point loops pretty much at their limit, cranking out multiple results per cycle, until other system aspects intervene (like memory bandwidth). But they can't turn a pig's ear into a silk purse. Operating system code – test a bit and branch, test another bit and branch, etc. – is really hard to get down to even one instruction per cycle; usually it's more than that.

Superscalar Woes

Despite being rather good, the superscalar approach has one glaring problem: That magical queue in the middle. It has to compare every instruction with every other instruction to test independence; and it has to have paths to every execution unit.

Every-to-every checking that must complete in a single cycle (otherwise there's no point) and all those paths, add up to a lot of logic and wires. In fact, the queue and its associated logic grows in size roughly as the square of the queue length. What that means is that a queue eight long is hard but can be accomplished – POWER7 and earlier designs can issue 8 instructions per cycle – but a queue 16 long would be fairly exorbitant.

Question: How do you get past this?

Answer: Punt to software.

Very Large Instruction Word (VLIW)

That's what VLIW does. (Intel calls their version EPIC.) It looks roughly like this:

Seem familiar? Almost all I did is replace the words "instruction queue" with the word "parcels." That's because the prior diagram didn't show all the comparison logic among the instructions; VLIW has none. (I also dropped a few of the paths to the execution units; more on that later.)

So how does it get away with not checking dependencies among instructions? The compiler does it. The "parcels" correspond to more-or-less normal instructions, and a single instruction is now a whole queue-load of those parcels.

That's why it's called "Very Long Instruction Word." The whole collection of parcels is one "instruction," always fetched as a unit, and executed as a unit. All in parallel.

So what does the compiler do when there aren't enough independent parcels to fill up a whole Very Long Instruction? It just pads out the instruction with null operation parcels. And use intensely clever compiler techniques to try to avoid that as much as possible.

There are, additionally, some fun issues concerning executing a number of loop iterations that isn't evenly divisible by the number of parcels, but it can be done. For example, you can have several loop exit paths of different widths; or you can use masking techniques to disable execution of particular parcels according to a vector-like test.

In addition to ditching the n² comparison logic, VLIW designs can also cut down on the number of paths to execution units by simply mandating that only certain parcels can have certain operations.

In the diagram above, for example, only the first parcel can be a branch. There's no loss of generality or performance in doing that, because all parcels are done at once, so you can always move a branch to the first position; and you don't need more than one, since it's a bit hard to do two branches, to two different locations, simultaneously. Superscalar can't do that reduction because a branch could be in any position in the queue, and there can be multiple branches; they just have to be done sequentially (or punted with a pipeline stall).

My other data path changes are pretty arbitrary, but fairly representative; I've allowed for only two floats at the same time, but three integer ops.

With that quadratic factor out of the way, the sky's the limit in terms of the parallelism you can incorporate. Sixteen-way, 32-way, go for it. You'll need a lot of execution units, but what the heck, that's what the customer is paying for.

Sounds good, right? Nah.

Why VLIW Always Was a Bad Idea

There are two reasons why I've always thought VLIW is a bad idea. The first one below is by far the lesser, but it's still significant. The second is, for me, the killer.

Problem 1: Compile is specific to implementation. In particular, what if you have a compiled binary you bought off the shelf. It was compiled several years ago. Now you go out and buy a shiny new VLIW system that executes a new, different number of parcels in parallel. What happens to your binaries? Two cases, but either way you're screwed:

If the new number is smaller – say, your binary is compiled for 4, but but your hardware does 2 – at least the code works. But your code is full of those null operations needed to pad things out to 4, so you waste time doing those. Your OS in particular (test a bit and branch) will probably run twice as slow as it needs to.
If the new number is larger – say, your binary is compiled for 4, but your hardware does 8 – the program doesn't work. The hardware tries to do 8 at a time, but only groups of 4 are independent. As a result, dependent instructions run at the same time and clobber each other's data. Intel does have a fix for this: A bit that says "independence stops here!" The compiler sets the bit at the boundaries it compiled for, and the hardware respects it. So the program works. But it only runs as fast as the old hardware. Your new wider hardware doesn't help; you got nothing for that purchase.

Problem 2: Not enough parallelism. So you can do all this parallelism. Is it present in a single thread of execution to exploit in the first place? Um, no. It's certainly not there in operating systems (test a bit and branch), databases, and other commercial software. Most technical codes find it hard to average over four in a single thread. Parallelism for multiple threads is there, but it doesn't translate down to single threads very well at all.

Now, before all the graphics guys go ballistic on me over Problem 2, consider: Yes, there's a ton of single-thread parallelism in graphics. But it's well-handled by SIMD (vector) techniques. Those are far cheaper to implement in hardware, and furthermore don't have Problem 1, since it's relatively simple to parameterize the code to the length of the SIMD (vector) registers. Also, with SIMD, you can double down and do more smaller-data operations at once, with the same hardware; VLIW can't do that.

Problem 2 has been known for a very long time, even from the inception of VLIW. I remember being in an IBM Research seminar by Josh Fisher of Yale, back when he originally proposed it in the early 1980s. I recall asking "But is there enough single-thread parallelism?" (in so many words). Response: "It gets rid of the quadratic-size logic!" (in so many words). Me again: "But is the parallelism there to exploit?" Respose: "But it's such a neat idea, it gets rid of the quadratic-size logic!" And so on. (To be fair, I don't recall whether I was talking to Fisher at the time, or to someone who invited him. Probably an inviter.)

As to the experimental truth of this, well, there's probably a lot of experimentation and compiler jockeying swirling around the fact that Itanium 2 issues eight parcels per cycle, the same as POWER7.

I think Intel was sold a bill of good on this; some folks from HP were the prime movers in its joint inception.

But it hardly matters. Itanium, incorporating the new magic of VLIW and moving on from the X86 instruction set, was originally going to be Intel's path to 64-bit computing. If anybody remembers that. Then AMD did a successful, upward-compatible, X86-compatible 64-bit architecture. The die was cast, and Itanium became a long-term dead end. (Just like IBM and the PS/2's microchannel vs. the ISA I/O bus.)

Can it be made to work adequately? Sure. Intel's Itanium is far from a pig; it's established a small place in the big-iron category. Additionally, there are still people doing research on VLIW, and more power to them. Just because something is unpalatable initially doesn't mean that gnawing on it some more won't uncover something quite worthwhile.

By the Way, I Left Out Some Hard Stuff

Like, a lot of hard stuff. In particular, once you start initiating integer and FP ops at the same time, in Superscalar or otherwise, they probably don't finish in the same order they were started.

Oops. Or OoO (Out-of-Order) execution.

This produces significant additional complexity in the dependence analysis needed to tell when instructions in the queue (or whole parcel groups) can start executing. And it just gets worse when you include memory operations, which I've left out for simplicity, just because memory operations take a long time.

Furthermore, with all those instructions running at the same time in different units, which one(s) do you punt when a branch comes along?

And in addition, doing multiple instructions per cycle is not very useful unless you can store back multiple results per cycle into your register file. (For example, POWER 7 can retire 5 results per cycle.) There's also the other end: You have to be able to read multiple registers simultaneously, too.

All this can be, and has been, handled. But it's hardly trivial. It requires lots of transistors to squeeze a single instruction stream to be as parallel as it can, which isn't very.

Enter Simple Multicore

So pity the poor core designer/architect, trying to make systems go ever faster, pounding his head against this rock of an obstacle, being more clever than anybody thought possible, but still getting ever-decreasing returns from tsunamis of transistors thrown against the problem.

Is it any wonder that they punted to software again?

But that's not what is said. I suspect many don't realize that's what they did.

What I'm referring to is, of course, the use of many simple cores, immortalized in thousands of presentations of near-meaningless diagrams like this one:

This is so common it's now part of the standard canon. Everybody knows it. It's accepted fact. Don't believe it, and you've branded yourself a Neanderthal. Right?

Basically, it claims that four (always seems to be four…) small, simple cores, the kind described earlier as just pipelined, give more performance than you get with one of those old-fashioned, big, bad, hot, complicated cores.

Well, sure. They can. But let's all understand that this is punting to software big time: Instead of finding parallelism in one instruction stream, programs must be explicitly split up into multiple instruction streams.

And, of course, the kinds of performance measures always quoted for lots of simple cores always assume perfect scaling – splitting into four actually achieves four times the performance. This doesn't happen. You don't get perfect performance out of OoO, Superscalar or VLIW, either, and the comparison does take those well-known limitations into account. So it's the known (to hardware guys) imperfect compared with the unknown (to hardware guys) perfect, an inherently unfair comparison.

Fortunately, there actually is very good scaling possible on a lot of commercial processing; think many individual accesses to a web site. Many, not all, technical / HPC programs scale rather well, too.

Unfortunately, none of those is the client-side volume market needed for any killer app which might support the industry. It may work, with additional hardware support, for power reduction (see my prior post about Parallel PowerPoint), but so far that's a road not taken.

The Perils of Parallel

Wednesday, March 10, 2010

OnLive Goes Live June 17 2010

Tuesday, March 2, 2010

POWER, Itanium, Niagara: Superscalar vs. VLIW vs. Simple Multicore