I'd guess everyone has heard something of the large, public flame war that erupted between Intel and Nvidia about whose product is or will be superior: Intel's Larrabee, or Nvidia's CUDA platforms. There have been many detailed analyses posted about these, such as who has (or will have) how many FLOPS, how much bandwidth per cycle, and how many nanoseconds of latency when everything lines up right. Of course, all of this is “peak values,” which still means “the values the vendor guarantees you cannot exceed” (Jack Dongarra’s definition), and one can argue forever about how much programming genius or which miraculous compiler is needed to get what fraction of those values.
Such discussion, it seems to me, ignores the elephant in the room. I think a key point, if not the key point, is that this is an issue of MIMD (Intel Larrabee) vs. SIMD (Nvidia CUDA).
If you question this, please see the update at the end of this post. Yes, Nvidia is SIMD, not SPMD.
I’d like to point to a Wikipedia article on those terms, from Flynn’s taxonomy, but their article on SIMD has been corrupted by Intel’s and others’ redefinition of “SIMD” to mean “vector.” I mean the original meaning. So this post becomes much longer.
MIMD (Multiple Instruction, Multiple Data) refers to a parallel computer that runs an independent, separate program – that’s the “multiple instruction” part – on each of its simultaneously-executing parallel units. SMPs and clusters are MIMD systems. You have multiple independent programs barging along, doing things that may or may not have anything to do with each other. When they are related, they barge into each other at least occasionally, hopefully as intended by the programmer, to exchange data or to synchronize their otherwise totally separate operation. Quite regularly the barging is unintended, leading to a wide variety of insidious data- and time-dependent bugs.
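If you've never seen the MIMD style in the small, here's a minimal sketch in ordinary threaded C++ (my own toy example, not anybody's product code): two independent instruction streams that only interact where the programmer intends, via one lock.

```
// Minimal MIMD sketch: two independent instruction streams sharing one
// counter. The mutex is the one *intended* barge-in; leave it out and you
// get exactly the insidious data-dependent bugs described above.
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long shared_total = 0;
    std::mutex lock;

    auto worker = [&](long n) {
        long local = 0;
        for (long i = 0; i < n; ++i) local += i;  // private work, no interaction
        std::lock_guard<std::mutex> guard(lock);  // the one intended rendezvous
        shared_total += local;
    };

    std::thread a(worker, 1000000L), b(worker, 2000000L);
    a.join(); b.join();
    std::cout << shared_total << "\n";
    return 0;
}
```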
SIMD (Single Instruction, Multiple Data) refers to a parallel computer that runs the EXACT SAME program – that’s the “single instruction” part – on each of its simultaneously-executing parallel units. When ILLIAC IV, the original 1960s canonical SIMD system, basically a rectangular array of ALUs, was originally explained to me late in grad school (I think possibly by Bob Metcalfe), it was put this way:
Some guy sits in the middle of the room, shouts ADD!, and everybody adds.
I was a programming language hacker at the time (LISP derivatives), and I was horrified. How could anybody conceivably use such a thing? Well, first, it helps that when you say ADD! you really say something like “ADD Register 3 to Register 4 and put the result in Register 5,” and everybody has their own set of registers. That at least lets everybody have a different answer, which helps. Then you have to bend your head so all the world is linear algebra: Add matrix 1 to matrix 2, with each matrix element in a different parallel unit. Aha. Makes sense. For that. I guess. (Later I wrote about 150 KLOC of APL, which bent my head adequately.)
Unfortunately, the pure version doesn’t make quite enough sense, so Burroughs, Cray, and others developed a close relative called vector processing: You have a couple of lists of values, and say ADD ALL THOSE, producing another list whose elements are the pairwise sums of the originals. The lists can be in memory, but dedicated registers (“vector registers”) are more common. Rather than pure parallel execution, vectors lend themselves to pipelining of the operations done. That doesn’t do it all in the same amount of time – longer vectors take longer – but it’s a lot more parsimonious of hardware. Vectors also provide a lot more programming flexibility, since rows, columns, diagonals, and other structures can all be vectors. However, you still spend a lot of thought lining up all those operations so you can do them in large batches. Notice, however, that it’s a lot harder (but not impossible) for one parallel unit (or pipelined unit) to unintentionally barge into another’s business. SIMD and vector, when you can use them, are a whole lot easier to debug than MIMD because SIMD simply can’t exhibit a whole range of behaviors (bugs) possible with MIMD.
Intel’s SSE and variants, as well as AMD and IBM’s equivalents, are vector operations. But the marketers apparently decided “SIMD” was a cooler name, so this is what is now often called SIMD.
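Just so the renaming is concrete, here's a sketch of the pairwise ADD written with SSE intrinsics (my own example, assuming an x86 with SSE and glossing over alignment details). Whatever you call it, it's one instruction operating on a short, fixed-length batch of data, repeated down the lists.

```
// Pairwise ADD of two lists, four floats per instruction, SSE style.
// This is "vector" in the classic sense: one instruction, one short
// fixed-length batch of data, marched down the lists.
#include <xmmintrin.h>

void add_lists(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load 4 elements of each list
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 pairwise sums at once
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];         // leftover tail, one at a time
}
```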
Bah, humbug. This exercises one of my pet peeves: Polluting the language for perceived gain, or just from ignorance, by needlessly redefining words. It damages our ability to communicate, causing people to have arguments about nothing.
Anyway, ILLIAC IV, the CM-1 Connection Machine (which, bizarrely, worked on lists – elements distributed among the units), and a variety of image processing and hard-wired graphics processors have been rather pure SIMD. Clearspeed’s accelerator products for HPC are a current example.
Graphics, by the way, is flat-out crazy mad for linear algebra. Graphics multiplies matrices multiple times for each endpoint of each of thousands or millions of triangles; then, in rasterizing, for each scanline across each triangle it interpolates a texture or color value, with additional illumination calculations involving normals to the approximated surface, doing the same operations for each pixel. There’s an utterly astonishing amount of repetitive arithmetic going on.
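To make "repetitive" concrete, here's a toy version of just the per-vertex piece (a made-up 4x4 transform of my own; a real pipeline does several of these, plus all the per-pixel rasterization work):

```
// Toy per-vertex transform: the same 4x4 matrix multiply, over and over,
// for every vertex of every triangle. This is the kind of repetitive
// arithmetic that SIMD hardware eats for breakfast.
struct Vec4 { float x, y, z, w; };

void transform(const float m[4][4], const Vec4* in, Vec4* out, int n) {
    for (int i = 0; i < n; ++i) {
        const Vec4 v = in[i];
        out[i].x = m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z + m[0][3]*v.w;
        out[i].y = m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z + m[1][3]*v.w;
        out[i].z = m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z + m[2][3]*v.w;
        out[i].w = m[3][0]*v.x + m[3][1]*v.y + m[3][2]*v.z + m[3][3]*v.w;
    }
}
```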
Now that we’ve got SIMD and MIMD terms defined, let’s get back to Larrabee and CUDA, or, strictly speaking, the Larrabee architecture and CUDA. (I’m strictly speaking in a state of sin when I say “Larrabee or CUDA,” since one’s an implementation and the other’s an architecture. What the heck, I’ll do penance later.)
Larrabee is a traditional cache-coherent SMP, programmed as a shared-memory MIMD system. Each independent processor does have its own vector unit (SSE stuff), but all 8, 16, 24, 32, or however many cores it has are independent executors of programs, as is each of the threads within those cores. You program it like MIMD, working in each program to batch together operations for each program’s vector (SIMD) unit.
CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls that SIMT, where the “T” stands for “thread” and I refuse to look up the rest because this has a perfectly good term already existing: MSIMD, for Multiple SIMD. (See pet peeve above.) The instructions it can do are organized around a graphics pipeline, which adds its own set of issues that I won’t get into here.
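Here's the flavor of it in a minimal CUDA kernel sketch (my own toy code, not anything from Nvidia): one instruction stream, executed by however many threads you launch, each on its own element.

```
// Minimal CUDA sketch: one kernel (one instruction stream) executed by
// thousands of threads, each applying the same ADD to its own element.
// The guy in the middle of the room shouting "ADD!", in silicon.
__global__ void add_all(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each unit's own data index
    if (i < n) out[i] = a[i] + b[i];
}

// Host side: launch enough threads to cover n elements, e.g.
//   add_all<<<(n + 255) / 256, 256>>>(d_a, d_b, d_out, n);
```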
Which is better? Here are basic arguments:
For a given technology, SIMD always has the advantage in raw peak operations per second. After all, it mainly consists of as many adders, floating-point units, shaders, or what have you, as you can pack into a given area. There’s little other overhead. All the instruction fetching, decoding, sequencing, etc., are done once, and shouted out, um, I mean broadcast. The silicon is mainly used for function, the business end of what you want to do. If Nvidia doesn’t have gobs of peak performance over Larrabee, they’re doing something really wrong. Engineers who have never programmed don’t understand why SIMD isn’t absolutely the cat’s pajamas.
On the other hand, there’s the problem of batching all those operations. If you really have only one ADD to do, on just two values, and you really have to do it before you do a batch (like, it’s testing for whether you should do the whole batch), then you’re slowed to the speed of one single unit. This is not good. Average speeds get really screwed up when you average with a zero. Also not good is the basic need to batch everything. My own experience in writing a ton of APL, a language where everything is a vector or matrix, is that a whole lot of APL code is written that is basically serial: One thing is done at a time.
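To put a number on averaging with a zero, here's a quick Amdahl-style back-of-the-envelope (my own arithmetic, assuming a 1024-wide machine and various amounts of irreducibly one-at-a-time work):

```
// Back-of-the-envelope Amdahl arithmetic for a wide SIMD machine:
//   speedup = 1 / (serial_fraction + (1 - serial_fraction) / width)
// Even 1% of one-at-a-time work caps a 1024-wide machine below 100x.
#include <cstdio>

int main() {
    const double width = 1024.0;
    const double fractions[] = {0.001, 0.01, 0.05, 0.10};
    for (double serial : fractions) {
        double speedup = 1.0 / (serial + (1.0 - serial) / width);
        std::printf("serial %.1f%% -> speedup %.0fx out of a possible %.0fx\n",
                    serial * 100.0, speedup, width);
    }
    return 0;
}
```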
So Larrabee should have a big advantage in flexibility, and also familiarity. You can write code for it just like SMP code, in C++ or whatever your favorite language is. You are potentially subject to a pile of nasty bugs that simply don’t exist under SIMD, but if you stick religiously to using only parallel primitives pre-programmed by some genius chained in the basement, you’ll be OK.
[Here’s some free advice. Do not ever even program a simple lock for yourself. You’ll regret it. Case in point: A friend of mine is CTO of an
But what about the experience with these architectures in HPC? We should be able to say something useful about that, since MIMD vs. SIMD has been a topic forever in HPC, where forever really means back to ILLIAC days in the late 60s.
It seems to me that the way Intel's headed corresponds to how that topic finally shook out: A MIMD system with, effectively, vectors. This is reminiscent of the original, much beloved, Cray SMPs. (Well, probably except for Cray’s even more beloved memory bandwidth.) So by the lesson of history, Larrabee wins.
However, that history played out over a time when clock rates kept rising steadily, generation after generation, so MIMD systems could simply ride single-processor speedups.
Yo ho. Excuse me. We’re not in that world any more. Clock rates aren’t improving like that any more; they’re virtually flat. But density improvement is still going strong, so those SIMD guys can keep packing more and more units onto chips.
Ha, right back at ‘cha: MIMD can pack more of their stuff onto chips, too, using the same density. But… It’s not sit on your butt time any more. Making 100s of processors scale up performance is not like making 4 or 8 or even 64 scale up. Providing the same old SMP model can be done, but will be expensive and add ever-increasing overhead, so it won’t be done. Things will trend towards the same kinds of synch done in SIMD.
Furthermore, I've seen game developer interviews where they strongly state that Larrabee is not what they want; they like GPUs. They said the same when IBM had a meeting telling them about Cell, but then they just wanted higher clock rates; presumably everybody's beyond that now.
Pure graphics processing isn’t the end point of all of this, though. For game physics, well, maybe my head just isn't built for SIMD; I don't understand how it can possibly work well. But that may just be me.
If either doesn't win in that game market, the volumes won't exist, and how well it does elsewhere won't matter very much. I'm not at all certain Intel's market position matters; see Itanium. And, of course, execution matters. There Intel at least has a (potential?) process advantage.
I doubt Intel gives two hoots about this issue, since a major part of their motivation is to ensure that the X86 architecture rules the world everywhere.
But, on the gripping hand, does this all really matter in the long run? Can Nvidia survive as an independent graphics and HPC vendor? More density inevitably will lead to really significant graphics hardware integrated onto silicon with the processors, so it will be “free,” in the same sense that Microsoft made Internet Explorer free, which killed Netscape. AMD sucked up ATI for exactly this reason. Intel has decided to build the expertise in house, instead, hoping to rise above their prior less-than-stellar graphics heritage.
My take for now is that CUDA will at least not get wiped out by Larrabee for the foreseeable future, just because Intel no longer has
-----------------------------------------------------------------
Update 8/24/09.
There was some discussion on Reddit of this post; it seems to have aged off now – Reddit search doesn’t find it. I thought I'd comment on it anyway, since this is still one of the most-referenced posts I've made, even after all this time.
Part of what was said there was that I was off-base: Nvidia wasn’t SIMD, it was SPMD (separate instructions for each core). Unfortunately, some of the folks there appear to have been confused by Nvidia-speak. But you don’t have to take my word for that. See this excellent tutorial from SIGGRAPH 09 by Kayvon Fatahalian of Stanford. On pp. 49-53, he explains that in “generic-speak” (as opposed to “Nvidia-speak”) the Nvidia GeForce GTX 285 does have 30 independent MIMD cores, but each of those cores is effectively 1024-way SIMD: It works with groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap. So you have to think SIMD out to 1024-wide if you want to get the parallel performance that is theoretically possible. Yes, then you have to use MIMD (SPMD) 30-way on top of that, but if you don’t have a lot of SIMD you just won’t exploit the hardware.
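If you want to see why the SIMD thinking still matters underneath the SIMT surface, here's a tiny kernel sketch (my own toy again, assuming the usual 32-wide group behavior): threads that share an instruction stream but branch differently get both paths executed, one after the other, with the non-participants masked off.

```
// Divergence sketch: every thread nominally has its own "program," but
// threads sharing an instruction stream that branch differently have both
// paths run serially, with the losers masked off each time. Think SIMD,
// or the hardware quietly throws away much of your throughput.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] < 0.0f)
        out[i] = -in[i];        // this path runs while the other threads wait...
    else
        out[i] = in[i] * 2.0f;  // ...then this one runs while the first group waits.
}
```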
27 comments:
Many thanks for this piece. Having to write a paper on vector processing, I found it a rather enjoyable and informative read.
I think MIMD is superior, because it is relatively easy to design a program that is thread-parallel in an efficient way, at least for a not-too-high number of processors.
On the other hand, SIMD only fits very limited kinds of algorithms, like rasterization or multiplying big matrices together. SIMD's brute force isn't enough to win on problems that can be solved efficiently by far more "intelligent" algorithms.
@Joe - You're welcome! Glad you found it useful and enjoyable.
@TuomasT -
I personally agree with you; MIMD is more general and more "natural" in use.
But don't count SIMD out. An awful lot of people spent a lot of time over decades (Cray 1 was a mid-70s system) figuring out how to apply vectors to a wide variety of HPC problems that are not obviously parallel. So it's broader in applicability than initially appears, especially on data-rich problems where the sheer number of, well, numbers gets huge.
SIMD is often easier to program than MIMD in the sense that you don't need to worry about locking.
I think SIMD is going to prevail in situations where people need to minimize power consumption and transistor count: why waste any power on control if you can avoid it?
The set of problems where SIMD works well is larger than most people think: often you'd end up choosing a different approach to solving a problem on a SIMD architecture than you would on MIMD. If I were writing a neutronics code for designing nuclear reactors, for instance, I'd use a monte carlo code on a MIMD machine but I'd really need to use a diffusion code on a SIMD machine.
SIMD systems scale to large N well, so long as you can benefit from a big N. MIMD algorithms that scale to more than 16 or so processors are exotic.
So far as graphics cards go, I think that Nvidia's approach would produce a more powerful graphics card for today's applications; however, Larrabee could be stronger for different rendering approaches (procedural generation, ray tracing). Also, the market for powerful graphics cards is tiny compared to the market for weak graphics cards. A coprocessor that competes with entry-level graphics cards and can be repurposed to speed up other tasks could be attractive if the price is right.
It is hard enough to write a program that can get reasonable performance out of a fully capable Core 2 Quad CPU.
The idea of hundreds or even thousands of MIMD cores, each running a lot slower than a current core, is not exactly good news. Many algorithms are not easy to run in parallel, and once we get to a certain point (coming soon) only embarrassingly parallel algorithms will get any faster.
We all should be worried not so much about the CUDA programming model (which is horrible) but about the fact that CPUs are ultimately going the same way and too few programs are going to benefit.
Hi, Paul.
I mostly agree with you. In particular, yes: SIMD is applicable in more situations than most people think.
However, I think you weren't totally engaged when you wrote "MIMD algorithms that scale to more than 16 or so processors are exotic." MapReduce? The Monte Carlo codes you mentioned?
Perhaps you are thinking MIMD = shared memory, which in this article it mostly does (Larrabee's shared cache is partitionable), but that isn't an absolute.
Greg
Cell still isn't a very good game processor. A bunch of ALUs with no cache, thanks IBM! There's an example of a theoretical throughput which will never be seen in the real world...
I don't see on-die graphics processing coming near the off-die solutions; you give up too much area (area that "real" processors need for cache).
The fly in the ointment for MIMD is operations per bandwidth. Using MIMD in a fully general way costs in terms of redundant instruction fetch and partially utilized cache lines which need to be flushed back to main memory. You can solve these problems by effectively programming the sea of MIMD processors as if they were a SIMD machine, but why not just design the hardware that way in the first place?
The problem of making good use of MIMD architectures is imagined, not real. If you program in a conventional language (C, C++, C#, Java...) you have trouble imagining how you would make use of a thousand-core machine. If you program in Erlang (or Haskell, or Mozart) you don't. In fact, you long for this sort of platform.
Did you know that nVidia is going to put MIMD cores in their GT300 GPU:
http://www.hardware-infos.com/news.php?news=2663&sprache=1
@Dan O - Each SPE in Cell does have cache, and it's coherent with others and with PPE memory. It is loaded / written back by program control, though, using ops that can overlap other processing. Some fairly non-obvious codes have been ported to it using that, with good speedup; a tour de force of overlap in some cases. I'm not saying it's good to have yet another programming model, but it's what the customer (Sony) wanted.
@Anonymous - Yes, instruction bandwidth was one of the arguments for vectors back in Cray 1 days. I'd be hard pressed to say that's an issue now. Data bandwidth from cache line breakage is likely more of an issue.
@Jesper - No, I didn't know that. Thanks for the link. Kind of reminds me of the switch to MIMD by the Connection Machine people, way back when. They were religiously SIMD, too, and wrote a paper explaining why they changed. Maybe Nvidia will eventually tell us their reasons.
-- Greg
I can't say I know too much about either architecture, but perhaps Intel is investing in the idea that ray-tracing is the future of 3d gaming and applications. It's my understanding (which is limited) that current video cards specialize in rasterization and related operations, but would not be suited for ray-tracing.
Hi, B.
Ah, the ray-tracing issue.
Intel did make large noises about ray-tracing, with demos of ray-traced Doom and the like, saying that it was the future of gaming and rasterizers (like Nvidia & AMD/ATI) were doomed.
This was followed by flak from Nvidia (we can ray-trace too! "we will open an enormous can of whoop ass" (actual quote)).
Then there were published interviews with game developers who basically said WDDNS ray tracing.
Then came a long blog entry from an Intel Larrabee developer saying that it would first and foremost be an awesome rasterizer.
So, whatever, I'm sure it will be an awesome rasterizer, since not doing DirectX/OpenGL would be deadly; but it probably won't be quite as awesome as its Nvidia contemporary, which will probably be more application-area specific. But I think Larrabee will also more easily do game physics well, and, yes, do ray tracing without standing on its head, if the game guys decide that's what they want.
(Maybe I should flesh this story out as a full blog post, including references & explanations of ray-tracing and the like.)
-- Greg
Thank you for an interesting article literally oozing with wisdom and experience.
I think, however, you have missed a real issue paired with the paradigm shift in graphics processing: graphics programming itself. The only real improvements in graphics programming have originated essentially from hacks and workarounds using primarily linear algebra. So instead of actually calculating reflection and refraction, cubemaps and bumpmaps are first generated and applied as faux graphics, a good-enough-to-look-at alternative.
The real paradigm shift now needed in the graphics industry is not how to force faux-graphics principles onto a shifting processing environment, but to accept that a whole new set of graphics processing tools and techniques needs to be created from the ground up (well, almost): tools focused on the benefits of SIMD for raytracing, so that as the number of cores increases (toward infinity, think gigacore) the benefits to realtime graphics processing can be realised by graphics programming techniques.
This is the reason there are virtually no games or APIs able to take advantage of the 8 cores in Cell, and why the PS3 is such a slow starter compared to the PS1 and PS2.
Game programmers, and game studios who have developed an extensive arsenal of faux-graphics techniques, are now faced with the stark realisation that it is all moot in the new paradigm of graphics processing. And those who are first to realise and accept this fate will win and start to gain competitive advantage.
And so, as "B" pointed out above, raytracing is the natural direction of graphics processing, and the "game guys" will have to follow suit (kicking and screaming).
Hi, Stephen. Also thanks for the kind words.
And boy, do I ever agree with you about rasterization being a tall pile of kludges. All the dominant techniques have no justification beyond "it pretty much looks OK, doesn't it?"
I've seen some heavy-duty data visualization software do its number on physics simulation output. It used the actual laws of physics and optics. Side-by-side with the kludges, and in this case even ray-tracing is just a higher-order kludge (ain't no rays like that in physics), you can really see the difference. Soaked up a huge system to do it, too, and was nowhere near a real-time display -- after all, it's doing a physics simulation.
Unfortunately, my take is that the comparison has to be side-by-side, rather like doing an A/B compare when buying high-end audio gear. Otherwise, it's hard to see it. When I (used to) do that, I always had the niggling fear that I was getting taken by the salesguy, since I would *never* be able to tell the difference at home, playing only B.
I think that will apply here. Eventually, yes, general better results will become the basic ante for many games. It'll take a while.
-- Greg
Are you aware that some projects related to medical research have started using Nvidia GPUs, such as GPUGRID?
http://www.gpugrid.net/
It allows volunteers to contribute any spare time on their Nvidia GPUs to do calculations related to medical research.
Hello,
"each of those cores is effectively 1024-way SIMD: It has with groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap."
I think your off-base with this affirmation, the SIMD is 32 ways and not 1024. The 32 contexts runs SPMD and not SIMD. I mean that each group of 32 "fragments" wich Nvidia called "warp" execute one instruction, but not necessarily the same one at the same time that the others warps.
Architecturally speaking, you have 8 ALUs for 1 fetch/decode unit. The ALUs run 2 times faster than that F/D unit, and the latter produces an instruction every two cycles.
Thanks a million. Currently doing a research paper on CUDA for medical applications, very useful article!
Ran across this. Can't resist replying, especially since Greg Pfister has replied.
The GPUs' SIMD is more powerful than ILLIAC IV's SIMD. And certainly more powerful than Cray or Intel MMX/SSE vector-style SIMD.
In particular, the GPUs all have a program counter per vector lane or slice or ... whatever you call the individual "SIMD" elements. A program counter per vector lane or element, rather than a program counter for the entire vector instruction.
This seems like a small thing, but it is highly important to me as a computer architect. It greatly reduces the need for predication or vector masking in the instruction set. It basically means that each vector lane could execute as a separate thread, etc.
I have posted and blogged about this extensively. Google me.
In particular, look at the slides for the talk I was invited to give at UC Berkeley Parlab, back when I was still an employee of Intel (on Larrabee).
* Berkeley ParLab talk on Coherent Threading (2009)
  o Coherent Threading: Coherent Vector Lane Threading (SIMT, DIMT, NIMT); Microarchitectures Intermediate Between SISD and MIMD, Scalar and SIMD parallel vector
  o http://docs.google.com/fileview?id=0B5qTWL2s3LcQNGE3NWI4NzQtNTBhNS00YjgyLTljZGMtNTA0YjJmMGIzNDEw&hl=en
I really do think that MIMD > SIMT > SIMD in flexibility, but also in cost.
I really do think that somebody in GPU land invented something new.
Nvidia's term SIMT opened my eyes; but I think my term "coherent threading" describes it more accurately.
Hi, Andy, and many thanks for that comment.
I've read through the presentation you linked, and have to say you had me already on page 11 -- predicates per lane squeezing out wasted cycles. I agree, that kind of consideration shows that SIMT is an important new variation, different from classic SIMD. I certainly don't understand all the implications here, but at least now I know there is something I need to understand.
(I will note that your presentation was in 2009, and this post was originally published in 2008, so I'm officially immune from a charge of not having done my homework. :-) )
-- Greg
For the record I feel I must respond to the comment from "greg" (the one without a last name) that "[i]f you program in Erlang (or Haskell, or Mozart) you don't...have trouble imagining how you would make use of a thousand core machine....In fact you long for this sort of platform":
I have a couple of pieces of pretty strong evidence that this is not the case.
First, I developed a high-frequency trading application in Haskell that required parallelism on current hardware. It simply can't keep up with the market data stream on one or even two cores. (Our development and simulation systems were four-core/eight-thread Intel i7 processors.) While the development was significantly easier (outside of the learning curve of Haskell itself) than using more traditional languages, this applied pretty much equally to all aspects of the program; parallelism exploitation didn't get any extra help beyond the help Haskell gives for everything. And, as it turned out, there were no unusual parallel programming techniques that provided any advantage either; it was all pretty much the same threading and synchronization stuff one does in any other language. (When it came down to it, not even software transactional memory was useful, though we tried it.)
(Don't take this to say that I don't believe Haskell offers some features that provide advantages when it comes to exploiting parallelism. Having data be non-mutable by default is one of these. But that's a small help: it doesn't magically solve the problem of taking advantage of available parallelism that Haskell still shares with all other general-purpose programming languages.)
Second, have a look at a paper presented at ICFP in 2008: Experience report: erlang in acoustic ray tracing. (E-mail me if you need a copy and can't find one on the web.) These folks were doing acoustic ray tracing in parallel using C++, and decided to try rewriting their program in Erlang, with the expected (though apparently not to them) result: "Our C++ implementation outperformed the Erlang program by at least 12x." Yes, this was a somewhat laughable move; clearly someone confused concurrency with parallelism. But it's a good result to have in the literature for times like these.
To elaborate on AF Glew's point on March 15, 2010 8:09 PM, NVIDIA's SIMT is more powerful than old fashioned SIMD.
To the programmer, it presents the illusion that all threads are independent, but to maximize efficiency (reduce instruction processing), there needs to be many threads executing the same code.
Currently, CUDA only allows every adjacent 32 threads to benefit from SIMT, which can cause low throughput if there's a lot of control flow divergence. But they can always relax that restriction and allow more threads at the same program location to benefit from SIMT. In my opinion, in most performance-critical code there are not many divergent execution paths, so SIMT should be completely adequate and full MIMD is not needed.
Plus, data parallelism is much more scalable than task parallelism, so SIMD is more important.
So, I think NVIDIA is very wise to create SIMT, which retains the efficiency of SIMD, but with much more flexibility (none of that SSE grief) by presenting the illusion that all threads are independent.
I think I finally have my head around SIMT, which I didn't when originally writing this post (quite a while ago).
I believe it to be a very useful and new variation on SIMD, and deserves its time in the sun separate from original SIMD notions. But it's still SIMD.
I've a new long post about MIMD/SIMD/SIMT stuff that will appear in a bit. I'll put a pointer here when it is posted.
Programming on the Cell isn't all that hard. Their physics and GL libraries already use high-level SPU libraries to speed up processing... It's one of those esoteric architectures which relies on streaming data more than on stored data.
Thanks for this informative post.
It was great reading your article and subsequent comments and counter-comments. I am a new entrant to CUDA programming looking to apply it on fractal image compression that involves extensive pattern matching of ranges with domains. I never thought that MIMD could be another option for image compression as it involves lots of data-independent and function-independent processing.
Many thanks for a robust comparison analysis of SIMD vs MIMD.