The Perils of Parallel: POWER, Itanium, Niagara: Superscalar vs. VLIW vs. Simple Multicore

Tuesday, March 2, 2010

The recent near-simultaneous announcements of IBM's POWER7 and Intel's Itanium 9300 ("Tukwila") invite some background on how those differ in essential architecture, why VLIW (Itanium's architecture) never was a good idea in the first place, and how they compare to the alternative simple multicore systems such as Sun's (now Oracle's) Niagara.

Why do this when this blog is about parallel stuff, and they're all, except Sun, devoted to single-stream performance? This isn't "The Perils of Serial," and won't become that, but it nevertheless is wise to know thy enemy. Besides, it will explain what "small simple core" really means compared with the alternatives.

So: POWER is Superscalar, Itanium is VLIW (Very Long Instruction Word; Intel calls its version EPIC for Explicitly Parallel Instruction Computing), and Niagara is composed of many simple cores.

Everything clear now?

I hope not, since I've written a pile of stuff here that gives my understanding of how those architectures differ. The explanations will deliberately leave out a whole lot of detail in an attempt to render them approachable, but I'll give some hints about what I've left out. I'd appreciate any feedback indicating that I got things wrong, or left out issues that I mistakenly thought weren't key.

Finally, before I forget: The real killer tech among the announcements is none of this architecture stuff. It's some silicon technology in POWER7. IBM's fabs have managed to put DRAM on the same silicon as the POWER7 logic, something known as embedded DRAM, or eDRAM. This is a real feat, since the fab process to make DRAM and the process to make logic are very dissimilar; SRAM, in contrast, uses the same process as logic, which is one reason it's often seen in caches (another is that it's faster than DRAM). eDRAM lets POWER7 have a huge cache (32MB) that eats far less chip area because it is dense DRAM instead of more sprawling SRAM; uses less power than equivalent-size SRAM; and puts that cache right on the same silicon as the processors, reducing its access latency by 6 times compared with off-chip cache (they say). Hats off!

Starting Point: A Very Simple Processor

In its guts, a CPU has functional blocks of logic that are connected roughly like this (very simplified):

There's a unit that gets the next instruction (Fetch), another that figures out what that instruction says to do (Decode), and then another that finally does what it says (Execute). Execution, however, itself uses different hardware units for the different kinds of things that an instruction could say to do. So it's actually split up like this (for example, and deliberately leaving out memory access and writeback for simplicity):

The Integer unit is designed to do integer operations, like adding, masking, rotating bits, etc. The Float unit is similarly designed to do only floating-point operations, like add, multiply-add, and so on. The Branch unit decides whether a branch is taken and, if so, where the next instruction fetched should come from. The three share just about nothing; they're completely different logic functions.

In what we have so far, only one instruction is being worked on at a time: One instruction is fetched, then that same instruction is passed to decode, then that same instruction goes to one of the execution units, and only then, after the current instruction is over, does the next instruction get fetched. If we have one clock tick for each of those operations, it does one instruction every three clock ticks, at best.

Once upon a time computers actually did just that: they did one instruction, period, at a time. Then someone introduced the joys of pipelining:

Now we can be doing three instructions simultaneously. Starting from the right:

We are executing instruction N, actually doing what it said to do – an integer add, for example.
At the same time, we're decoding the next instruction to be done, instruction N+1.
And at the same time as that, we're getting hold of (Fetching) the instruction to be done after that one, namely instruction N+2.

"Fetch" now becomes free-running; it just keeps getting the next instruction. This, however, produces a problem: Branches. If instruction N is a branch whose execution says the next instruction performed should be at N+2289 (for example), you do not want to be fetching and decoding N+1 and N+2. So you dump them out, replacing them with null operations while the instruction at N+2289 is fetched. The time wasted doing those nulled-out operations constitutes the infamous pipeline stall. There are good heuristics for avoiding this kind of pipeline stall; they basically involve making an educated guess about what the next instruction is going to be. For example, some processors keep a branch history table, a cache of where recent branches went. Then, as soon as a branch is fetched, the next fetch done is from the location that followed the branch last time. This works really well for every iteration of a loop but the last, for example. (And the first.)
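The "every iteration but the last (and the first)" behavior can be seen in a toy model. What follows is a minimal sketch, not any real processor's predictor: the function name run_loop, the addresses 100 and 90, and the rule "guess fall-through when there is no history" are all assumptions made up for illustration.

```python
def run_loop(iterations, branch_pc=100, loop_top=90):
    """Count wrong guesses for one loop-closing branch.

    The branch at branch_pc jumps back to loop_top on every iteration
    except the last, where it falls through to branch_pc + 1.
    """
    history = {}          # branch address -> address that followed it last time
    mispredictions = 0
    for i in range(iterations):
        actual_next = loop_top if i < iterations - 1 else branch_pc + 1
        # No history yet: assume the free-running fetch just keeps going.
        predicted_next = history.get(branch_pc, branch_pc + 1)
        if predicted_next != actual_next:
            mispredictions += 1
        history[branch_pc] = actual_next   # remember where it went this time
    return mispredictions


print(run_loop(10))     # wrong on the first and last iterations only
print(run_loop(1000))   # still just two wrong guesses
```

In this model the number of wrong guesses stays at two no matter how long the loop runs, which is why such a simple table pays off on loops.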

We are now running quite a bit faster. Except for pipeline stalls, we finish an instruction every clock cycle, rather than every three. This is good. But there's an opportunity to do even better staring us in the face, if we know where to look.
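The arithmetic behind "every clock cycle, rather than every three" can be sketched as follows. This is a rough model under stated assumptions: exactly three stages, one tick per stage, and a flush that costs two ticks (the two younger instructions sitting in Fetch and Decode). The function names are invented for the example.

```python
STAGES = 3  # Fetch, Decode, Execute, as in the simplified picture above


def cycles_unpipelined(n_instructions):
    """Each instruction goes through every stage before the next is fetched."""
    return STAGES * n_instructions


def cycles_pipelined(n_instructions, flushes=0):
    """One instruction finishes per tick once the pipeline is full.

    The first instruction needs STAGES ticks; each later one adds a tick.
    Every flush throws away the STAGES - 1 younger instructions in flight.
    """
    if n_instructions == 0:
        return 0
    return (STAGES - 1) + n_instructions + flushes * (STAGES - 1)


for n in (1, 10, 1000):
    print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

For long runs with few flushes the ratio approaches three, the number of stages; every flush eats into that.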

The Road to Superscalar

Look at those three execution units on the right: Integer, Float, and Branch. Two out of three of them are sitting idle all the time, since no one instruction ever uses them all at once. But it's easy to think of cases where they could be running simultaneously.

For example, suppose X, Y, and Z are floats; and A, B, and C are integers. Then you could do these two instructions at the same time:

X = Y + Z /* fadd (fregY, fregZ, fregX) */

A = B + C /* add (iregB, iregC, iregA) */

These two use different logic – one the integer execution unit, the other the float execution unit. And they use different registers – one reads and writes integer registers, the other just uses float registers. This kind of thing happens all the time in computational loops, with the integer operation counting an index or something. Nothing intrinsically says you can't do them at the same time, so why not do that?

Well, because not every pair of instructions, or trio, or whatever, is independent the way those two are. Suppose we had these two successive instructions:

A=A+1

B=A+1

Were these done at the same time, B wouldn't see the updated value of A. Crash!

We have to pick out the independent sets of instructions. How can we pull that off?

Superscalar

Like this:

First, we have to fetch faster than one instruction per cycle. No problem there; the units of storage in instruction caches, cache lines, hold well more than one instruction each.

And we have to be able to decode more than one instruction per cycle. Just replicate the decoding logic (not illustrated).

Finally, we have to hold all those instructions at the same time in a queue, the instruction queue. If we don't have them all at the same time, it's rather difficult to do them at the same time. (The queue's length we'll talk about later, but generally it's pretty short, like four.)

That queue isn't just a passive queue; it has to have some special logic winding among its storage elements:

It has to compare each instruction to all others in the queue (logic not shown), and flag dependencies: This instruction uses a register that one uses – no good, we must do them in order. Or: these two are independent in the data they work on, but need the same execution unit – no good, must do them in order.
Whenever it finds some that are independent, it shoves that whole group, simultaneously, at their appropriate execution units. Wham! All done at once. The need to send all at once is why there are so many paths to the execution unit.
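The two checks just described (shared registers, shared execution unit) can be sketched in a few lines. This is a deliberately simplified model, not how any real issue queue is built: it assumes one execution unit of each kind, only looks at an in-order run from the head of the queue, and the names Instr, conflicts, and issue_group are invented for the example. Real hardware does all of these comparisons at once in logic, not in a loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    unit: str      # "int", "float", or "branch"
    dest: str      # register written
    srcs: tuple    # registers read


def conflicts(a, b):
    """True if b (later in program order) cannot start alongside a."""
    shares_register = (
        a.dest in b.srcs       # b reads what a writes
        or b.dest in a.srcs    # b overwrites what a still has to read
        or a.dest == b.dest    # both write the same register
    )
    return shares_register or a.unit == b.unit


def issue_group(queue):
    """Pick instructions from the head of the queue until one conflicts
    with something already chosen for this cycle."""
    group = []
    for instr in queue:
        if any(conflicts(earlier, instr) for earlier in group):
            break
        group.append(instr)
    return group


queue = [
    Instr("X = Y + Z", "float", "X", ("Y", "Z")),
    Instr("A = B + C", "int", "A", ("B", "C")),
    Instr("B = A + 1", "int", "B", ("A",)),
]
print([i.text for i in issue_group(queue)])   # the first two go together
```

The third instruction is held back twice over: it reads A, which the second one writes, and it needs the same integer unit.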

That's superscalar. It's "super" because, under the right conditions, it can execute more than one instruction per cycle. It's "scalar" because, well, it's not vector. That's another way, and the only other way at the time this was invented, to do more than one operation per cycle. For example, specify two sets of N operands each, have one instruction that says "add them all," and have a pile of adders available to apply to the job.

Superscalar can get even more super than the previous diagram indicates. What if instructions are independent, but they both use the integer unit, or both use the float unit? Simple: Add more units.

Multiple float units have obvious uses (see vector case above). However, multiple integer units are more often useful, since lots of address calculations go on independent of program-explicit integer operations. Also, integer units are a lot smaller and less complicated than float units. Most superscalar designs have at least two integer units, and often more.

Superscalar designs like the whole IBM POWER series can do very well. What you get out of them is, of course, heavily dependent on the instruction stream – how independent the instructions are from each other. Lots of compiler technology has been applied to this, with the result that aggressive superscalar designs can perform pure computational floating-point loops pretty much at their limit, cranking out multiple results per cycle, until other system aspects intervene (like memory bandwidth). But they can't turn a pig's ear into a silk purse. Operating system code – test a bit and branch, test another bit and branch, etc. – is really hard to get down to even one cycle per instruction; usually it takes more than that.

Superscalar Woes

Despite being rather good, the superscalar approach has one glaring problem: That magical queue in the middle. It has to compare every instruction with every other instruction to test independence; and it has to have paths to every execution unit.

Every-to-every checking that must complete in a single cycle (otherwise there's no point), plus all those paths, adds up to a lot of logic and wires. In fact, the queue and its associated logic grow in size roughly as the square of the queue length. What that means is that a queue eight long is hard but can be accomplished – POWER7 and earlier designs can issue 8 instructions per cycle – but a queue 16 long would be fairly exorbitant.
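To put rough numbers on "the square of the queue length": counting only the pairwise comparisons, and ignoring the wiring to the execution units, a queue of n entries needs n(n-1)/2 of them. A small sketch, with the function name made up for the example:

```python
def pairwise_checks(queue_length):
    """Number of distinct instruction pairs that must be compared."""
    return queue_length * (queue_length - 1) // 2


for n in (4, 8, 16):
    print(n, pairwise_checks(n))
# 4 -> 6, 8 -> 28, 16 -> 120
```

Doubling the queue from 8 to 16 roughly quadruples the comparison logic, and all of it still has to settle within one cycle.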

Question: How do you get past this?

Answer: Punt to software.

Very Long Instruction Word (VLIW)

That's what VLIW does. (Intel calls their version EPIC.) It looks roughly like this:

Seem familiar? Almost all I did is replace the words "instruction queue" with the word "parcels." That's because the prior diagram didn't show all the comparison logic among the instructions; VLIW has none. (I also dropped a few of the paths to the execution units; more on that later.)

So how does it get away with not checking dependencies among instructions? The compiler does it. The "parcels" correspond to more-or-less normal instructions, and a single instruction is now a whole queue-load of those parcels.

That's why it's called "Very Long Instruction Word." The whole collection of parcels is one "instruction," always fetched as a unit, and executed as a unit. All in parallel.

So what does the compiler do when there aren't enough independent parcels to fill up a whole Very Long Instruction? It just pads out the instruction with null operation parcels. And it uses intensely clever compiler techniques to try to avoid that as much as possible.
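The padding can be illustrated with a toy packer. This is only a sketch of the idea, nowhere near a real VLIW compiler: it packs greedily in program order, does no reordering or loop unrolling, ignores which parcel positions can feed which execution units, and represents each operation as a hypothetical (dest, srcs) pair.

```python
NOP = "nop"


def depends(earlier, later):
    """Each op is (dest, srcs). True if the pair must stay in program order."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    return e_dest in l_srcs or l_dest in e_srcs or e_dest == l_dest


def pack(ops, width):
    """Start a new long instruction whenever the current one is full or the
    next op depends on a parcel already in it; pad the rest with nops."""
    words, current = [], []
    for op in ops:
        if len(current) == width or any(depends(p, op) for p in current):
            words.append(current + [NOP] * (width - len(current)))
            current = []
        current.append(op)
    if current:
        words.append(current + [NOP] * (width - len(current)))
    return words


def nop_fraction(words):
    """Share of parcel slots that ended up holding a null operation."""
    slots = sum(len(w) for w in words)
    return sum(1 for w in words for p in w if p == NOP) / slots


ops = [
    ("A", ("A",)),       # A = A + 1
    ("B", ("A",)),       # B = A + 1
    ("X", ("Y", "Z")),   # X = Y + Z
    ("C", ("B",)),       # C = B + 1
]
words = pack(ops, 4)
print(len(words), nop_fraction(words))
```

With these four operations and a width of 4, the dependent chain forces three long instructions, and most of the slots hold nops. That is the waste the clever compiler techniques are trying to squeeze out.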

There are, additionally, some fun issues concerning executing a number of loop iterations that isn't evenly divisible by the number of parcels, but it can be done. For example, you can have several loop exit paths of different widths; or you can use masking techniques to disable execution of particular parcels according to a vector-like test.

In addition to ditching the n² comparison logic, VLIW designs can also cut down on the number of paths to execution units by simply mandating that only certain parcels can have certain operations.

In the diagram above, for example, only the first parcel can be a branch. There's no loss of generality or performance in doing that, because all parcels are done at once, so you can always move a branch to the first position; and you don't need more than one, since it's a bit hard to do two branches, to two different locations, simultaneously. Superscalar can't do that reduction because a branch could be in any position in the queue, and there can be multiple branches; they just have to be done sequentially (or punted with a pipeline stall).

My other data path changes are pretty arbitrary, but fairly representative; I've allowed for only two floats at the same time, but three integer ops.

With that quadratic factor out of the way, the sky's the limit in terms of the parallelism you can incorporate. Sixteen-way, 32-way, go for it. You'll need a lot of execution units, but what the heck, that's what the customer is paying for.

Sounds good, right? Nah.

Why VLIW Always Was a Bad Idea

There are two reasons why I've always thought VLIW is a bad idea. The first one below is by far the lesser, but it's still significant. The second is, for me, the killer.

Problem 1: Compile is specific to implementation. In particular, what if you have a compiled binary you bought off the shelf. It was compiled several years ago. Now you go out and buy a shiny new VLIW system that executes a new, different number of parcels in parallel. What happens to your binaries? Two cases, but either way you're screwed:

If the new number is smaller – say, your binary is compiled for 4, but your hardware does 2 – at least the code works. But your code is full of those null operations needed to pad things out to 4, so you waste time doing those. Your OS in particular (test a bit and branch) will probably run twice as slow as it needs to.
If the new number is larger – say, your binary is compiled for 4, but your hardware does 8 – the program doesn't work. The hardware tries to do 8 at a time, but only groups of 4 are independent. As a result, dependent instructions run at the same time and clobber each other's data. Intel does have a fix for this: A bit that says "independence stops here!" The compiler sets the bit at the boundaries it compiled for, and the hardware respects it. So the program works. But it only runs as fast as the old hardware. Your new wider hardware doesn't help; you got nothing for that purchase.
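Both cases can be seen in a toy cycle count. This is an illustrative model only, not Itanium's actual instruction encoding: it assumes the compiler marked "independence stops here" after every group it built for 4-wide hardware, that the hardware never looks past such a mark, and that nops occupy issue slots like anything else. The binary below is hypothetical.

```python
import math

NOP = "nop"


def cycles(groups, hw_width):
    """Ticks to run the binary when a stop follows every compiled group.

    A group narrower than the machine still costs a full tick; a group
    wider than the machine is split over several ticks, nops included.
    """
    return sum(math.ceil(len(group) / hw_width) for group in groups)


# Three long instructions, compiled for 4-wide hardware.
binary = [
    ["a", NOP, NOP, NOP],
    ["b", "c", NOP, NOP],
    ["d", NOP, NOP, NOP],
]

for width in (2, 4, 8):
    print(width, cycles(binary, width))
# 2 -> 6, 4 -> 3, 8 -> 3
```

The 2-wide machine takes twice as long, partly to chew through nops, and the 8-wide machine takes exactly as long as the 4-wide one it replaced.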

Problem 2: Not enough parallelism. So you can do all this parallelism. Is it present in a single thread of execution to exploit in the first place? Um, no. It's certainly not there in operating systems (test a bit and branch), databases, and other commercial software. Most technical codes find it hard to average over four in a single thread. Parallelism for multiple threads is there, but it doesn't translate down to single threads very well at all.

Now, before all the graphics guys go ballistic on me over Problem 2, consider: Yes, there's a ton of single-thread parallelism in graphics. But it's well-handled by SIMD (vector) techniques. Those are far cheaper to implement in hardware, and furthermore don't have Problem 1, since it's relatively simple to parameterize the code to the length of the SIMD (vector) registers. Also, with SIMD, you can double down and do more smaller-data operations at once, with the same hardware; VLIW can't do that.

Problem 2 has been known for a very long time, even from the inception of VLIW. I remember being in an IBM Research seminar by Josh Fisher of Yale, back when he originally proposed it in the early 1980s. I recall asking "But is there enough single-thread parallelism?" (in so many words). Response: "It gets rid of the quadratic-size logic!" (in so many words). Me again: "But is the parallelism there to exploit?" Response: "But it's such a neat idea, it gets rid of the quadratic-size logic!" And so on. (To be fair, I don't recall whether I was talking to Fisher at the time, or to someone who invited him. Probably an inviter.)

As to the experimental truth of this, well, there's probably a lot of experimentation and compiler jockeying swirling around the fact that Itanium 2 issues eight parcels per cycle, the same as POWER7.

I think Intel was sold a bill of goods on this; some folks from HP were the prime movers in its joint inception.

But it hardly matters. Itanium, incorporating the new magic of VLIW and moving on from the X86 instruction set, was originally going to be Intel's path to 64-bit computing. If anybody remembers that. Then AMD did a successful, upward-compatible, X86-compatible 64-bit architecture. The die was cast, and Itanium became a long-term dead end. (Just like IBM and the PS/2's microchannel vs. the ISA I/O bus.)

Can it be made to work adequately? Sure. Intel's Itanium is far from a pig; it's established a small place in the big-iron category. Additionally, there are still people doing research on VLIW, and more power to them. Just because something is unpalatable initially doesn't mean that gnawing on it some more won't uncover something quite worthwhile.

By the Way, I Left Out Some Hard Stuff

Like, a lot of hard stuff. In particular, once you start initiating integer and FP ops at the same time, in Superscalar or otherwise, they probably don't finish in the same order they were started.

Oops. Or OoO (Out-of-Order) execution.

This produces significant additional complexity in the dependence analysis needed to tell when instructions in the queue (or whole parcel groups) can start executing. And it just gets worse when you include memory operations, which I've left out for simplicity, just because memory operations take a long time.

Furthermore, with all those instructions running at the same time in different units, which one(s) do you punt when a branch comes along?

And in addition, doing multiple instructions per cycle is not very useful unless you can store back multiple results per cycle into your register file. (For example, POWER7 can retire 5 results per cycle.) There's also the other end: You have to be able to read multiple registers simultaneously, too.

All this can be, and has been, handled. But it's hardly trivial. It requires lots of transistors to squeeze a single instruction stream to be as parallel as it can, which isn't very.

Enter Simple Multicore

So pity the poor core designer/architect, trying to make systems go ever faster, pounding his head against this rock of an obstacle, being more clever than anybody thought possible, but still getting ever-decreasing returns from tsunamis of transistors thrown against the problem.

Is it any wonder that they punted to software again?

But that's not what is said. I suspect many don't realize that's what they did.

What I'm referring to is, of course, the use of many simple cores, immortalized in thousands of presentations of near-meaningless diagrams like this one:

This is so common it's now part of the standard canon. Everybody knows it. It's accepted fact. Don't believe it, and you've branded yourself a Neanderthal. Right?

Basically, it claims that four (always seems to be four…) small, simple cores, the kind described earlier as just pipelined, give more performance than you get with one of those old-fashioned, big, bad, hot, complicated cores.

Well, sure. They can. But let's all understand that this is punting to software big time: Instead of finding parallelism in one instruction stream, programs must be explicitly split up into multiple instruction streams.

And, of course, the kinds of performance measures always quoted for lots of simple cores always assume perfect scaling – splitting into four actually achieves four times the performance. This doesn't happen. You don't get perfect performance out of OoO, Superscalar or VLIW, either, and the comparison does take those well-known limitations into account. So it's the known (to hardware guys) imperfect compared with the unknown (to hardware guys) perfect, an inherently unfair comparison.
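One standard way to see why four cores rarely give four times the performance is Amdahl's law, which the discussion above does not name but which captures the same point. A minimal sketch, assuming a fixed fraction of the work is inherently serial and the rest splits perfectly across cores (itself an optimistic assumption); the function name is made up for the example.

```python
def speedup(cores, serial_fraction):
    """Amdahl's law: speedup over one core when serial_fraction of the
    work cannot be split and the remainder divides evenly across cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


for s in (0.0, 0.1, 0.25):
    print(s, round(speedup(4, s), 2))
# 0.0 -> 4.0, 0.1 -> 3.08, 0.25 -> 2.29
```

Only the first row, with no serial work at all, matches the four-times figure that the simple-core slides assume.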

Fortunately, there actually is very good scaling possible on a lot of commercial processing; think many individual accesses to a web site. Many, not all, technical / HPC programs scale rather well, too.

Unfortunately, none of those is the client-side volume market needed for any killer app which might support the industry. It may work, with additional hardware support, for power reduction (see my prior post about Parallel PowerPoint), but so far that's a road not taken.

23 comments:

RPG said...: Hi Greg,

3 things

1) Great blog. Please keep it up for the sake of people like us. :)

2) I thought that server side apps could gobble up just about as much cores and threads as you could throw at them. AMD/Intel/Sun etc., all are screaming "moar corez" all the time. Even more loudly in the server market. And people are lapping it up.

3) ILP is limited. Absolutely. It is limited whether you extract it in hw (OoOE) or sw (VLIW). Then why should it be a killer for Itanium? From 30k feet, it would appear that even if an Itanium core could go no faster than a Nehalem core, it should take less area. In the serial era, that made Itanium a disappointment, but in parallel era, it should play right into VLIW's hands (modulo the compiler troubles).; March 2, 2010 at 11:30 PM
kme said...: I'm interested in the way you dismiss operating system code as containing very little inherent parallelism. It seems to me that the most performance-sensitive OS code is likely to be in areas like disk elevators and scheduling algorithms, which could well contain non-trivial inherent parallelism.

At any rate, it should be easy enough to collect data on this - the Linux kernel supports Itanium, so you should be able to analyse a real-world binary to determine the occupancy of each VLIW word. Do you know if anyone has done this?; March 3, 2010 at 4:54 AM
Krzysztof Narkowicz said...: Another great post. Maybe next one could be about out of order execution, LHS...?

BTW I don't see why you say that Superscalar/VLIW is bad because OS/usual software won't use it optimally. IMHO that doesn't matter because this kind of software doesn't need a lot of CPU power. If someone needs perf, for example for games, he will write the right code.; March 3, 2010 at 6:14 AM
Fabricio Cannini said...: Talking about the demise of Itanium, i still wonder why HP ditched Alpha in favor of it. Correct me if i'm wrong, but AFAICS it was pretty advanced for its time. Why oh why ...; March 3, 2010 at 11:42 AM
Anonymous said...: HP had to ditch Alpha even though it was a lovable race horse, because business market share wasn't there, and chip design and production costs expected to become exponential.

Itanium had a great future as long as AMD64 wasn't out, but shipment delayed too much to get rid of pesky AMD. A new architecture/ISA for Intel meant no X-licencing for AMD, and also out the door for the horrid x86 architecture, in a newer simpler architecture with more potential.
My opinion: VLIW is not a killer problem, and benefits the icache with smaller instructions/parcels without going CISC, but not being out-of-order is because it limits hyperthreading implementations. You want to find parallelism both at compile time and at runtime, in either spots you can't see it all. You want to squash no-ops with another thread's useful workload.; March 3, 2010 at 2:12 PM
Noah Mendelsohn said...: Great post, Greg. I've put a link to it in my blog at http://blog.arcanedomain.com/2010/03/a-nice-little-tutorial-on-modern-trends-in-processor-architecture/ .

Noah; March 3, 2010 at 7:45 PM
樊川先生 said...: Hi Greg, great blog, but I have some technical questions. Are dependence-checking structures not needed for the dynamically scheduled superscalars that use Tomasulo's algorithm, since out-of-order execution can perform register renaming that can handle WAW and WAR hazards? Second, I also hear of another term called "window," defined as the group of instructions that can be executed simultaneously. The window size is typically larger than the issue width, that is, the size of the issue queue in your blog. Did you forget to mention this?; March 6, 2010 at 11:33 PM
Anonymous said...: Greg -

Very nice blog, and especially the most recent entries.

Next up -- I think you should try to tackle the data parallel offload computing exemplified by modernized GPUs (and other devices) in the same common sense, easy to understand vocabulary.

Chuck (an old colleague); March 7, 2010 at 1:50 PM
Alan M. Feldstein said...: 樊川先生,

Dependence checking is required whenever there is multiple issue (static or dynamic). Multiple issue can be thought of as an order change from serial to parallel. "Out-of-order" is then just an order change of higher degree. As for register renaming, it is one technique for eliminating dependences, performed after dependence checking tells you what dependences are troublesome.

You are correct that "window" in dynamic multiple issue corresponds to the width of Greg's instruction queue. Maximum issue width is typically smaller (correct again) and is limited by the number of execution units (Greg calls them "hardware units"). I would like to point out also that the same concept of "window" can be applied to compile time. In fact, it is another advantage of the EPIC approach that this compile-time window can be so much larger (free of the practical limitations of transistors and wires), allowing more ILP to be found.; March 8, 2010 at 6:04 PM
Chandra said...: This is good Article. The last method in this article - simple & in-order & multiple core processors - give more performance than those heavy & complex & out-of-order processors... That is what Tilera doing... that was Intel's attempt with Larrabee... What were the problems with Larrabee? it is indefinitely postponed :( or, was it only experimental (I don't think so, considering nVidia vs. Intel war) - on the way to something great??

I have one request - if you continue this article... one direction may be Dynamically Reconfigurable Processors (IMEC ADRES) - its CGA is like VVLIW (very very long)... can you write something on that, and its potential for parallel processing (lot of ILP within one core, and there can be Multicore-RP)?; March 9, 2010 at 5:37 AM
樊川先生 said...: Hi Alan M. Feldstein，
For dependence checking, I mean using comparators to check the RAW hazards among instructions, since for statically scheduled processors, instructions with RAW cannot be issued simultaneously. But for dynamic ones, out-of-order execution enables instructions with RAW hazards to be issued at the same time, since reservation stations and/or ROB can perform register renaming. So I don't think the comparators used in statically scheduled processors are needed in dynamic ones, are they?; March 14, 2010 at 3:16 AM
Jonathan Hinkle said...: Greg, thank you for an engaging blog post as usual. I'm starting to become a regular reader.

There was an interesting distinction to make in one of the major questions posed in your post. I noticed that though similar, the question, "But is the parallelism there to exploit?" is not the same as "But is there enough single-thread parallelism?".

I would totally agree that there may not be enough single-thread parallelism to exploit for a VLIW system to make sense with all of its complexities. (design complexity and code incompatibility are two of VLIW's major drawbacks)

However, I think there is enough parallelism to exploit from a different perspective. Much may depend on who's defined as the user. If cloud computing continues to grow with less work done in front of you versus somewhere else, hardware will continue to trend towards running multiple users' applications simultaneously. Individual application (or user) performance may not benefit much except from the small improvements being made in individual thread execution. It seems like this will continue to progress however it can. Yet, from the perspective of the datacenter manager who buys the server with a VLIW or Superscalar processor, there could be significant advantages. Different applications (especially from different users) are usually inherently independent from each other. I think VLIW could do well in this space if SaaS takes off for workloads that already have plenty of performance (maybe the majority of simple consumer apps?).; April 2, 2010 at 12:11 PM
Greg Pfister said...: Belatedly getting back to all these comments (THANKS!), one at a time:

@RPG - Yes, server-side is able to use lots of cores, but in many cases by virtualizing, splitting each N-way multicore into N virtual machines or more. See my much earlier post about MS Azure. They still like multicore, but mainly because it smashes more machines (virtual) into a single box.

More replies coming in separate comments.; April 5, 2010 at 5:51 PM
Greg Pfister said...: @kme - OS parallelism: Data has been collected, over a long period of time, just by watching how many cycles per instruction are used inside kernels when run on superscalar systems. Can't point to public studies; I know this from my years in IBM. But OS code was absolutely *notorious* for being crap on superscalar, taking, rather than 4-5 instructions per cycle, >1 cycle per instruction.

There is a key word in there: "was." My direct info is a couple of years old. For all I know people have been busy mucking around in kernels making them more ILP-friendly. But I suspect it's not changed a whole lot, and it was really bad. Some watching of performance counters could answer it for real, though.; April 5, 2010 at 5:56 PM
Greg Pfister said...: @Kris -

Thanks!

I didn't mean to say superscalar/VLIW was "bad." Well, at least not superscalar. 3-4-way ILP (instruction level parallelism) is regularly achievable outside of nasty cases like the OS, and that factor of 3-4 performance is definitely worth it. It's why Intel's cores in its mainstream processors are still nice big fat guys (OoO superscalar). Only when they go heavy multicore, like Larrabee, do they go back to simple cores to save space.

I just think VLIW was the wrong answer to the wrong question.; April 5, 2010 at 6:00 PM
Greg Pfister said...: @Fabricio -

I agree, Alpha was a great architecture and implementation. As to why HP ditched it, hard to say for sure. I suspect because they were committed to Itanium externally (with Intel), had no such commitment to Alpha, and, well, resources are always limited.; April 5, 2010 at 6:03 PM
Greg Pfister said...: @樊川先生, and @Alan -

Both right and a little bit off.

Yes, Tomasulo's algorithm does checking that's very similar to what I'm talking about here with multiple issue. But Tomasulo's algorithm, as originally described, did that checking (plus RAW and WAR hazards) on one instruction at a time. While multiple instructions could be executing at the same time, multiple instructions didn't start at the same time.

With superscalar, the whole idea is to start up multiple instructions simultaneously. So the checking that Tomasulo's algorithm does has to be carried out among *all* the instructions in the queue at the *same* *time*. This requires a lot more hardware than was feasible at the time Tomasulo was working on this -- it's Tomasulo times N^2.

I didn't get into RAW (read after write) or WAR (write after read) explicitly in this article, bundling all such things into the general word "dependency." While I can't explicitly point to designs that check RAW and WAR too, using register renaming to get around them, I'm sure such designs exist. I could, and probably should, have listed it in the section where I listed hard stuff I left out.

The checking here; April 6, 2010 at 10:55 AM
Greg Pfister said...: @Chandra --

Yes, Tilera is based on multiple single cores. So is Intel's Larrabee, and numerous others. If you want to put the maximum number of cores on a chip, the natural thing to do is make them simple and hence small.

A quick Google told me that IMEC ADRES is a processor for embedded systems. There are *many* more embedded processors sold than "general purpose" cores, for sure, in everything from phones to cars to dishwashers. So I can't dispute its importance.

But I also haven't yet gotten my head around the enormous zoo of architectures used in that area. I've seen multiple simple cores, reconfigurable instruction sets, etc. So while I may mention it casually I'm not likely to focus on anything there.; April 6, 2010 at 11:19 AM
Greg Pfister said...: @anonymous/Chuck - Hi!

Yes, I probably should say something comprehensible about GPUs. Problem is, outside of Larrabee (which is really GP) it's necessary to be an expert in rendering to really grok those guys. I used to be into graphics (Ph.D. thesis in the area), but got massively out of touch.

I've been trying. But I'm not really there yet. Everything seems to be either written in Swahili, or (like graphics textbooks) ignore special-purpose hardware. :-(; April 6, 2010 at 11:25 AM
Greg Pfister said...: @Jonathan --

Superscalar, or even VLIW, architectures are well-used in clouds and clusters and similar systems.

But the reason is not that there are many different things going on at the same time. Many completely independent tasks are parallel, to be sure, but that parallelism cannot be exploited by the techniques I described in this post.

This post, and the first two guys discussed (superscalar, VLIW), deal only with parallelism within a single instruction stream. Not multiple streams. A VLIW compiler does tease out multiple loop iterations to do simultaneously, but don't get fooled; that's not multiple independent threads. It's more like taking a single thread and doubling it back on itself.

So why did I start this reply saying they're good in clouds? Because the vast majority of the individual code threads run on parallel systems of that ilk have to go as fast as they can to get the response time required of individual queries or transactions. Superscalar (or VLIW) is the way to do that. It's why Intel's latest Xeon (Nehalem) is a superscalar, OoO, big, fat, processor. And is being deployed in clouds.; April 6, 2010 at 11:54 AM
Gionatan said...: Hi Greg,
thanks for the wonderful post.

I have some question about scalar vs SIMD vs VLIW:
1) at the same instruction width (eg: 4 x scalar, 1x4 wide vector, 1x4 parcels-wide VLIW) which is better in terms of power and space efficiency?
2) with 4 x scalar I should be able to match the throughput of 1x4 scalar/VLIW, but I'd have the "quadratic-size logic" problem. But speaking about vector vs VLIW, what are the advantages of using one over the other?

Excuse me for my imperfect english and thanks for your reply.; September 6, 2010 at 10:01 AM
Greg Pfister said...: Hi @Gionatan,

I'm sorry, but to me that sounds a whole lot like a homework problem. So I'm only going to answer in general terms, with embedded hints.

First, it depends a whole lot on the target applications. HPC? Database? PowerPoint? Dedicated MRI processor? If the code doesn't have sufficient parallelism, or sufficient parallelism of a *specific* type in some cases (SIMD/vector), probably superscalar is as good as anything.

If there's a lot of parallelism of the right type(s), then for power I'd just count the total logic on the chip, and pick the smallest amount.

Greg; September 6, 2010 at 1:12 PM
Gionatan said...: Hi Greg,
thanks for your reply ;)

Keep us informed with these very interesting posts...; September 10, 2010 at 9:30 AM