The Perils of Parallel: September 2008

Thursday, September 25, 2008

Jealousy (Off Topic)

Pretty far off-topic, but I feel like pointing out that at the moment I'm jealous of my son. He lives in Shanghai, China, and is about to go on a "golden week" (national holiday week) trip to Wulingyuan. That's the paradigm of those impossible-seeming mist-shrouded, tiny, steep mountains. He's also visiting an ancient city there, Fenghuang, which looks really interesting..

I'm positively green.

I really hope to get there some day myself.

On the other hand, I'm not particularly jealous of his recent work schedule. He's a legal assistant at a Chinese law firm, so he gets traditional Chinese two-hour lunches. On the other hand, he's also expected to exhibit the standard Chinese virtue of hard work by regularly working his brains out until 2 AM. Retirement's a lot less stressful.

Vive la (Killer App) Révolution!

I've pointed to finding applications that are embarrassingly parallel (EP), meaning it is essentially trivial for them to use parallel hardware, as a key way to avoid problems. Why should be the case?

Actually, what I’m trying to find are not just EP applications, but rather killer apps for mass-market parallel systems: Mass-market applications which force many people to buy a “manycore” system, because it’s the only thing that runs them well. In fact, I’m thinking now that EP is probably a misstatement, because that’s really not the point. If a killer app’s parallelism is so convoluted that only three people on the planet understand it, who cares? The main requirement is that the other 6.6 Billion minus 3 of us can’t live without it (or at least a sufficiently large fraction of that 6.6B-3). Those three might get obscenely rich. That’s OK, just so long as the rest of us don’t go down the tubes.

I’m thinking here by analogy to two prior computer industry revolutions: The introduction of the personal computer, and the cluster takeover. The latter is slightly less well-known, but it was there.

The introduction of the PC was clearly a revolution. It created the first mass market for microprocessors and related gear, as well as the first mass market for software. Prior to that, both were boutique products – a substantial boutique, but still a boutique. Certainly the adoption of the PC by IBM spurred this; it meant software developers could rely on there being a product that would last long enough, and be widespread enough, for them to afford to invest in product development. But none of that would have mattered if at the time this revolution began, microprocessors hadn’t gotten just (barely) fast enough to run a killer application – the spreadsheet. Later, with a bit more horsepower, desktop processing became a second level (and boosted Macs substantially) but the first big bump was the spreadsheet. It was the reason for the purchase of PCs in large numbers by businesses, which kicked everything into gear: The fortuitous combination of a particular form and level of computing power and an application using that form and power.

Note that nobody really lost in the PC revolution. A few people, Intel and Microsoft in particular, really won. But the revenue for other microprocessors didn’t decline, they just didn’t have the enormous increases of those two.

The takeover of servers by clusters was another revolution, ignited by another conjunction of fast-enough microprocessors and a killer app. This time, the killer app was the Internet. Microprocessors got just fast enough that they could provide adequate turnaround time for a straightforward “read this page” web request, and once that happened you just didn’t need single big computers any more to serve up your website, no matter how many requests came in: Just spray the request across a pile of very cheap servers. This revolution definitely had losers. It was a big part of what in IBM was referred to as “our near-death experience”; it’s hard to fight a difference in cost of about 300X (which is what I heard it estimated within IBM development for mainframe vs. commodity clusters, for cases where the clusters were adequate – big sore point there). Later Sun succumbed to this, too, even though they started out being one of the spears in IBM’s flanks. Once again, capability plus application produced revolution.

We have another revolution going on now in the turn to explicit parallelism. But so far, it’s not the same kind of revolution because while the sea-change in computer power is there, the killer app, so far, is not – or at least it’s not yet been widely recognized.

We need that killer app. Not a simpler means of parallel programming. What are you going to do with that programming? Parallel Powerpoint? The mind reels.

(By the way, this is perfect – it’s exactly what I hoped to have happen when I started this blog. Trying to respond to a question, I rethought a position and ended with a stronger story as a result. Thanks to David Chess for his comment!)

Wednesday, September 24, 2008

IO, IO, We Really need IO

Now for something completely different:

I had always wondered what the heck was going on with recent large disk systems. They're servers, for heaven's sake, with directly attached disks and, often, some kind of nonvolatile cache. IBM happily makes money selling dual Power systems with theirs, while almost everybody else uses Intel. If your IO system were decent, shouldn't you just attach the disks directly? Is the whole huge, expensive infrastructure of adapters, fibre-channel (soon Data Center Ethernet (probably copyright Cisco)), switches, etc., reall the result of the mechanical millisecond delays in disks?

Apparently I'm not the only person thinking this. According to Is flash a cache or pretend disk drive? By Chris Mellor, an interview with Rick White, one of the three founders of Fusion-io, you just plop your NAND flash directly to your PCIe bus. Then use it as a big disk cache. You have to use some algorithms to spread out the writes so you don't wear out the NAND too fast, but lots of overhead just goes away.

Of course, if you want to share that cache among multiple systems, there are a whole lot of other issues; it's a big block of shared memory. But start with "how the heck to you connect to it?" LAN/SAN time again, with processors on both ends to do the communication. Dang.

Relevance to the topic of this blog: As more compute power is concentrated in systems, you need more and faster IO, or there's a whole new reason why this show may have problems.

Clarifying the Black Hole

The announcement of this blog in LinkedIn’s High Performance and Supercomputing group produced some responses indicating that I need to clarify what this blog is about. Thanks for that discussion, folks. One of my (semi-?) writers' block problems with this book is clearly explaining exactly what the issue is.

Here’s a whack at that, an overall outline of the situation as I see it. Everything in this outline requires greater depth of explanation, with solid numbers and references. This is just the top layer, and I’m really unsatisfied with it. Hang in there with me, prod me with questions if there are parts you are really itchy about, and I’ll get there.

First, I don’t mean to imply or say that there is some dastardly conspiracy to force the industry to use explicit parallelism. The basic problems are (1) physics; (2) the exhaustion of other possibilities. There simply isn't any choice -- but, no choice to do what? To increase performance.

This industry has lived and breathed dramatically increasing year-to-year single-thread performance since the 1960s. The promise has always been: Do nothing, and your programs run 45% faster this time next year, on cheaper hardware. That’s an incredible situation, the only time it’s ever happened in history, and it’s a situation that’s the very bedrock of any number of industry business and technical assumptions. Now we are moving to a new paradigm: To actually achieve the additional performance available in new systems, programmers must do something. What they must do is a major effort, it’s not clear how broadly applicable it is, and it’s so difficult that few can master it. That, to put it mildly, is a major paradigm shift. (And it’s not being caused by some nefarious cabal; as I said, it’s physics.)

This is not a problem for server systems. Those have been broadly parallel – clusters / farms / warehouses -- since the mid-90s, and server processing has arguably been parallelized since the mid-70s on mainframe SMPs. Transaction monitors, web server software like Apache, and other infrastructure all have enabled this parallelism (but the applications had to be there).

Unfortunately, servers alone don't produce the semiconductor volumes that keep an important fraction of this industry moving forward – a fraction, not all of the industry, but certainly the part that's arguably key, to say nothing of the loudest, and will make the biggest ruckus as it increasingly runs into trouble. The industry is going to be a very different place in the foreseeable future.

At this point, it’s reasonable to ask that if this is actually true, why aren’t the combined voices of the blogosphere, industry rags, and industry stock analysts all shouting it from the housetops? Why are these statements believable? Where’s the consensus? Are we in conspiracy theory territory?

No, we’re not. I think we’re in the union of two territories. First and foremost, we’re in the “left hand doesn’t know what the right hand is doing” territory, greatly enhanced by the narrow range of people with the cross-domain expertise needed to fully understand the problem. There was a flare-up of concern, flashing through the blogosphere and the technical news outlets back in 2004, but it was focused (almost correctly) on crying “Moore’s Law Is Ending!” so it was squashed when the technorati high priests responded “No, Moore’s Law isn’t ending, you dummies” which it isn’t, in the original literal sense. But saying that is a fine case of not seeing the forest for the veins on the leaves of the trees, or, in this case, not seeing a chasm because you’re a botanist, not a geologist.

That is the main territory, or at least I hope so. But having lived in the industry for quite a while, I understand that there’s also a form of denial – or hope – going on: There have definitely been occasions, well-remembered by management and business leaders, where those excitable, barely comprehensible techno-nerds have cried wolf, said the sky was falling, and it didn’t. This produces a reaction like: Nah, this can’t really be that big a deal; it’s the wrong answer; obviously we should not panic, since things have always happened to make such issues just go away. Also, aren’t they really posturing for more funding?

Unfortunately, they’re not. And the truth of the situation is starting to come across, with some industry funding for parallel programming, and pleas for a Manhattan project on parallel programming showing up. For reasons detailed in my 101 Parallel Languages series of posts, I think those are looking under the wrong lamppost.

My choice of the place to look is for applications that are “embarrassingly parallel,” where there’s no search for the parallelism – it’s obvious – and little need for explicit parallel programming. There are a few possibilities there (I’m partial to virtual worlds), but I’m far from certain that they’ll be widespread enough to pick up the slack in the client space. So I fear we may be in for a significant, long-term, downturn for companies whose business relies on the replacement cycle for client systems. This is not a pleasant prospect, since those companies are presently at the heart of the industry.

101 Parallel Languages (Part 3, the last)

So why, after all the effort spent on parallel programming languages, have none seen broad use? I posed this question on the mailing list of the IEEE Technical Committee on Scalable Computing, and got a tremendously, gratifyingly high-quality response from people including major researchers in the field, a funding director of NSF, and programming language luminaries.

Nobody disagreed with the premise: MPI (message-passing interface) has a lock on parallel programming, with OpenMP (straightforward shared memory) a distant second. Saying the rest are in the noise level is being overly generous. The discussion was completely centered around reasons why that is the case.

After things settled down, I was requested to collect it for the on-line newsletter of the IEEE TCSC. You can read the whole thing here.

I also boiled down the results as much as I could; many reasons were provided, all of them good. Here’s that final summary, which appears at the end of the newsletter article, turned into prose instead of powerpoint bullets (which I originally used).

A key reason, and probably the key reason, is application longevity and portability. Applications outlive any given generation of hardware, so developers must create them in languages they are sure will be available over time – which is impossible except for broadly used, standard languages. You can’t assume you can just create a new compiler for wonderful language X on the next hardware generation, since good compilers are very expensive to develop. (I first proposed this as a reason; there was repeated agreement.) That longevity issue particularly scares independent software vendors (ISVs). (Pointed out by John D. McCalpin, from his experience at SGI.)

The investment issue isn’t just for developing a compiler; languages are just one link in an ecosystem. All links needed for success. A quote: “the perfect parallel programming language will not succeed [without, at introduction] effective compilers on a wide range of platforms,… massive support for re-writing applications,” analysis, debugging, databases, etc.” (Horst Simon, Lawrence Berkeley Laboratory Associate Lab Director CS; from a 2004 NRC report.)

Users are highly averse to using new languages. They have invested a lot of time and effort in the skills they use writing programs in existing languages, and as a result are reluctant to change. “I don't know what the dominant language for scientific programming will be in 10 years, but I do know it will be called Fortran.” (quoted by Allan Gottlieb, NYU). And there’s a Catch-22 in effect: A language will not be used unless it is popular, but it’s not popular unless it’s used by many people. (Edelman, MIT)

Since the only point of parallelism is performance, high quality object code required at the introduction of the new language; this is seldom the case, and of course adds to the cost issue. (Several)

I can completely understand the motivation for research in new languages; my Ph.D. thesis was a new language. There is always frustration with the limitations of existing languages, and also hubris: Much more than most infrastructure development, when you do language development you are directly influencing how people think, how they structure their approach to problems. This is heady stuff. (I am now of the opinion that nobody should presume to work on a new language until they have written at least 200,000 lines of code themselves – a requirement that would have ditched my thesis – but that’s another issue.)

I also would certainly not presume to stop work on language development. Many useful ideas come out of that, whose developers could not initially conceive of outside of a new language, but often enough then become embedded in the few standard languages and systems. Key example: SmallTalk, the language in which object-oriented programming was conceived, later to be embedded in several other languages. (I suspect that language-based Aspect Oriented Programming is headed that way, too.)

But I’d be extremely surprised to find the adoption of some new parallel language – E.g., Sun’s Fortress, Intel’s Ct – as a significant element in taming parallelism. That’s searching under the wrong lamppost.

I think the way forward is to discover killer applications that “embarrassingly parallel” – so clear and obvious in their parallelism that running them in parallel is not a Computer Science problem, and certainly not a programming language problem, because it’s so straightforward.

Tuesday, September 23, 2008

101 Parallel Languages (Part 2)

So, is there some history and practice that can shed some like on the issue of helping people use parallel computers? There certainly is, in the High-Performance Computing (HPC) community.

The HPC community has been trying to flog the parallel programming language horse info life for almost forty years. These are the people who do scientific and technical computing (mostly), including things like fluid dynamics of combustion in jet engines, pricing of the now-infamous tranches of mortgage-backed securities, and so on. They’re motivated. Faster means more money to the financial guys (I’ve heard estimates of millions of dollars a minute if you’re a millisecond faster than the competition), fame and tenure to scientists, product deadlines for crash simulators, and so on. They include the guys who use (and get the funding for) the government-funded record-smashing installations at national labs like Los Alamos, Livermore, Sandia, etc., the places always reported on the front pages of newspapers as “the world’s fastest computer.” (They’re not, because they’re not one computer, but let that pass for now.)

For HPC, no computer has ever been fast enough and likely no computer ever will be fast enough. (That’s my personal definition of HPC, by the way.) As a result, the HPC community has always wanted to get multiple computers to gang up on problems, meaning operate in parallel, so they have always wanted to make parallel computers easier. Also, it’s not like those people are exactly stupid. We’re talking about a group that probably has the highest density of Ph.D.’s outside a University faculty meeting. Also, they’ve got a good supply of very highly motivated, highly intelligent, mostly highly skilled workers in the persons of graduate students and post-docs.

They have certainly succeeded in using parallelism, in massive quantities. Some of those installations have 1000s of computers, all whacking on the same problem in parallel, and are the very definition of the term “Massively Parallel Processing.”

So they’ve got the motivation, they’ve got the skills, they’ve got the tools, they’ve had the time, and they’ve made it work. What programming methods do they use, after banging on the problem since the late 1960s?

No parallel languages, to a very good approximation. Of course, there are some places that use some of them (Ken, don’t rag on me about HPF in Japan or somewhere), but it’s in the noise. They use:

1. Message-Passing Interface (MPI), a subroutine package callable from Fortran, C, C++, and so on. It does basic “send this to that program on that parallel node” message passing, plus setup, plus collective operations [“Is everybody done with Phase 3?” (also known as a “barrier”) “Who’s got the largest error value – Is it small enough that we can stop?” (also known as a “reduction”)]. There are implementations of MPI exploiting special high performance inter-communication network hardware like InfiniBand, Myrinet, and others, as well as plain old Ethernet LAN, but at its core you get “send X to Y,” and manually write plain old serial code to call procedures to do that.

2. As a distant second, OpenMP. It provides similarly basic use of multiple processors sharing memory, that uses and uses more support at the language level. The big difference is the ability to do a FOR loop in parallel (FOR N=1 to some_big_number, do THIS (parameterized by N)).

That’s it. MPI is king, with some OpenMP. This is basic stuff: pass a message, do a loop. This seems a puny result after forty years of work. In fact, it wouldn’t be far from the truth to say that the major research result of this area is that parallel programming languages don’t work.

Why is this?

I effectively convened my own task force on this, in the open, asking the question on a mailing list. A susprising (to me) number of significant lights in the HPC parallelism community chimed in; this is clearly a significant and/or sore point for everybody.

I’ll talk about that in my next post here.

Monday, September 22, 2008

101 Parallel Languages (Part 1)

A while ago, at a “technical” meeting of the Grand Dukes and Peers of the IBM realm, the Corporate Senior VP of Software was heard to comment (approximately) “If all these new processors are going to be parallel, don’t our clients need a new programming language to deal with this?”

This, of course, produced vigorous agreement and the instantaneous creation of a task force to produce a recommendation. Since even the whim of a Senior VP is a source of funding, the recommendation was, of course, running on rails to a preordained conclusion: Fund research and/or productization of research in some parallel language.

I happened to end up participating in that task force (this was far from automatic). In the airport on the way to its opening meeting, I happened to run into Professor Jim Brown of UT, who’s been doing parallel at least as long as I have, with more emphasis on software than I. Since I knew what his reaction was going to be, I couldn’t resist; I told him why I was going. I wasn’t disappointed. He began muttering, stamped around in a circle in frustration, then whipped around and pointed his finger at me with beetled brows and a scowl, saying “You don’t agree with that, do you?”

I surely didn’t. I told him I was arriving carrying the result of about one evening of web searching: A list of 101 parallel programming languages, all in active development, all with zero significant use. (Spreadsheet here.)

He smiled, I smiled back, and we went on our respective ways.

Why that mutual reaction? The issue is certainly not the underlying sentiment expressed by that Senior VP. That’s right on target, a completely appropriate concern: Don’t we (not just IBM) really need to provide as much help as possible to customers, to deal with these changes? It’s also clearly a correct insight that if your company helps more than others, you will win.

The issue is “programming language.” And history. And current practice.

-- To be continued, in my next posting. --

(I’m trying the experiment of breaking up long streams of thought into multiple posts. The huge ones up to now seem ungainly. Comments on that?)

Friday, September 19, 2008

History Is Written by the Victors

There's a new book to be out in October 2008, published by Intel Press, The Business Value of Service Oriented Grids, by Enrique Castro-Leon, Jackson He, Mark Chang, and Parviz Peiravi. A few chapters are available online, or just the preface in a slightly different print layout.

First, my congratulations to Enrique and his co-authors. Getting any book out the door is more work than anybody can appreciate who hasn't done it. Kudos!

I won't comment on the main content, since I'm not really qualified, "business value" books just aren't my cup of tea, and N years (for large N) in IBM left me profoundly allergic to SOA. SOA roolz. OK? It's wonderful. But I've only read one discussion of it, about 5 pages total, that didn't make my skin crawl.

However, part 3 of the download (Chapter 1) had this, which is relevant to the issues of this blog:

...in 2004 Intel found that a successor to the Pentium 4 processor, codenamed Prescott would have hit a "thermal wall."... the successor chip would have run too hot and consumed too much power at a time when power consumption was becoming an important factor for customer acceptance.

Big positive here: This event has been published, by Intel.

However, heat and power consumption "becoming an important factor for customer acceptance" is an excessively delicate way of putting it.

As I recall, that proposed chip family was so hot that all major system vendors actually refused it. They stood up to Intel, a rather major thing to do, and rejected the proposed design as impossible to package in practical, shippable products. This was a huge deal inside Intel, causing the cancellation of all main-line processor projects and the elevation to stardom of a previously denigrated low-power design created by an outlier lab in Israel, far from the center of mass of development.

So history gets exposed in a way that placates the shareholders.

The preface to this book also had some comments about the ancient history of virtualization that reminded me that I was there. Fodder for a future post.

(My thanks to a good friend and ex-co-worker who pointed me to this book and provided the title phrase.)

Larrabee vs. Nvidia, MIMD vs. SIMD

I'd guess everyone has heard something of the large, public, flame war that erupted between Intel and Nvidia about whose product is or will be superior: Intel Larrabee, or Nvidia's CUDA platforms. There have been many detailed analyses posted about details of these, such as who has (or will have) how many FLOPS, how much bandwidth per cycle, and how many nanoseconds latency when everything lines up right. Of course, all this is “peak values,” which still means “the values the vendor guarantees you cannot exceed” (Jack Dongarra’s definition), and one can argue forever about how much programming genius or which miraculous compiler is needed to get what fraction of those values.

Such discussion, it seems to me, ignores the elephant in the room. I think a key point, if not the key point, is that this is an issue of MIMD (Intel Larrabee) vs. SIMD (Nvidia CUDA).

If you question this, please see the update at the end of this post. Yes, Nvidia is SIMD, not SPMD.

I’d like to point to a Wikipedia article on those terms, from Flynn’s taxonomy, but their article on SIMD has been corrupted by Intel and others’ redefinition of SIMD to “vector.” I mean the original. So this post becomes much longer.

MIMD (Multiple Instruction, Multiple Data) refers to a parallel computer that runs an independent separate program – that’s the “multiple instruction” part – on each of its simultaneously-executing parallel units. SMPs and clusters are MIMD systems. You have multiple, independent programs barging along, doing things that may have nothing to do with each other, or may not. When they are related, they barge into each other at least occasionally, hopefully as intended by the programmer, to exchange data or to synchronize their otherwise totally separate operation. Quite regularly the barging is unintended, leading to a wide variety of insidious data- and time-dependent bugs.

SIMD (Single Instruction, Multiple Data) refers to a parallel computer that runs the EXACT SAME program – that’s the “single instruction” part – on each of its simultaneously-executing parallel units. When ILIAC IV, the original 1960s canonical SIMD system, basically a rectangular array of ALUs, was originally explained to me late in grad school (I think possibly by Bob Metcalfe) it was put this way:

Some guy sits in the middle of the room, shouts ADD!, and everybody adds.

I was a programming language hacker at the time (LISP derivatives), and I was horrified. How could anybody conceivably use such a thing? Well, first, it helps that when you say ADD! you really say something like “ADD Register 3 to Register 4 and put the result in Register 5,” and everybody has their own set of registers. That at least lets everybody have a different answer, which helps. Then you have to bend your head so all the world is linear algebra: Add matrix 1 to matrix 2, with each matrix element in a different parallel unit. Aha. Makes sense. For that. I guess. (Later I wrote about 150 KLOC of APL, which bent my head adequately.)

Unfortunately, the pure version doesn’t make quite enough sense, so Burroughs, Cray, and others developed a close relative called vector processing: You have a couple of lists of values, and say ADD ALL THOSE, producing another list whose elements are the pair wise sums of the originals. The lists can be in memory, but dedicated registers (“vector registers”) are more common. Rather than pure parallel execution, vectors lend themselves to pipelining of the operations done. That doesn’t do it all in the same amount of time – longer vectors take longer – but it’s a lot more parsimonious of hardware. Vectors also provide a lot more programming flexibility, since rows, columns, diagonals, and other structures can all be vectors. However, you still spend a lot of thought lining up all those operations so you can do them in large batches. Notice, however, that it’s a lot harder (but not impossible) for one parallel unit (or pipelined unit) to unintentionally barge into another’s business. SIMD and vector, when you can use them, are a whole lot easier to debug than MIMD because SIMD simply can’t exhibit a whole range of behaviors (bugs) possible with MIMD.

Intel’s SSE and variants, as well as AMD and IBM’s equivalents, are vector operations. But the marketers apparently decided “SIMD” was a cooler name, so this is what is now often called SIMD.

Bah, humbug. This exercises one of my pet peeves: Polluting the language for perceived gain, or just from ignorance, by needlessly redefining words. It damages our ability to communicate, causing people to have arguments about nothing.

Anyway, ILLIAC IV, the CM-1 Connection Machine (which, bizarrely, worked on lists – elements distributed among the units), and a variety of image processing and hard-wired graphics processors have been rather pure SIMD. Clearspeed’s accelerator products for HPC are a current example.

Graphics, by the way, is flat-out crazy mad for linear algebra. Graphics multiplies matrices multiple times for each endpoint of each of thousands or millions of triangles; then, in rasterizing, for each scanline across each triangle it interpolates a texture or color value, with additional illumination calculations involving normals to the approximated surface, doing the same operations for each pixel. There’s an utterly astonishing amount of repetitive arithmetic going on.

Now that we’ve got SIMD and MIMD terms defined, let’s get back to Larrabee and CUDA, or, strictly speaking, the Larrabee architecture and CUDA. (I’m strictly speaking in a state of sin when I say “Larrabee or CUDA,” since one’s an implementation and the other’s an architecture. What the heck, I’ll do penance later.)

Larrabee is a traditional cache-coherent SMP, programmed as a shared-memory MIMD system. Each independent processor does have its own vector unit (SSE stuff), but all 8, 16, 24, 32, or however many cores it has are independent executors of programs. As are each of the threads in those cores. You program it like MIMD, working in each program to batch together operations for each program’s vector (SIMD) unit.

CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls that SIMT, where the “T” stands for “thread” and I refuse to look up the rest because this has a perfectly good term already existing: MSIMD, for Multiple SIMD. (See pet peeve above.) The instructions it can do are organized around a graphics pipeline, which adds its own set of issues that I won’t get into here.

Which is better? Here are basic arguments:

For a given technology, SIMD always has the advantage in raw peak operations per second. After all, it mainly consists of as many adders, floating-point units, shaders, or what have you, as you can pack into a given area. There’s little other overhead. All the instruction fetching, decoding, sequencing, etc., are done once, and shouted out, um, I mean broadcast. The silicon is mainly used for function, the business end of what you want to do. If Nvidia doesn’t have gobs of peak performance over Larrabee, they’re doing something really wrong. Engineers who have never programmed don’t understand why SIMD isn’t absolutely the cat’s pajamas.

On the other hand, there’s the problem of batching all those operations. If you really have only one ADD to do, on just two values, and you really have to do it before you do a batch (like, it’s testing for whether you should do the whole batch), then you’re slowed to the speed of one single unit. This is not good. Average speeds get really screwed up when you average with a zero. Also not good is the basic need to batch everything. My own experience in writing a ton of APL, a language where everything is a vector or matrix, is that a whole lot of APL code is written that is basically serial: One thing is done at a time.

So Larrabee should have a big advantage in flexibility, and also familiarity. You can write code for it just like SMP code, in C++ or whatever your favorite language is. You are potentially subject to a pile of nasty bugs that aren’t there, but if you stick to religiously using only parallel primitives pre-programmed by some genius chained in the basement, you’ll be OK.

[Here’s some free advice. Do not ever even program a simple lock for yourself. You’ll regret it. Case in point: A friend of mine is CTO of an Austin company that writes multithreaded parallel device drivers. He’s told me that they regularly hire people who are really good, highly experienced programmers, only to let them go because they can’t handle that kind of work. Granted, device drivers are probably a worst-case scenario among worst cases, but nevertheless this shows that doing it right takes a very special skill set. That’s why they can bill about $190 per hour.]

But what about the experience with these architectures in HPC? We should be able to say something useful about that, since MIMD vs. SIMD has been a topic forever in HPC, where forever really means back to ILLIAC days in the late 60s.

It seems to me that the way Intel's headed corresponds to how that topic finally shook out: A MIMD system with, effectively, vectors. This is reminiscent of the original, much beloved, Cray SMPs. (Well, probably except for Cray’s even more beloved memory bandwidth.) So by the lesson of history, Larrabee wins.

However, that history played out over a time when Moore’s Law was producing a 45% CAGR in performance. So if you start from basically serial code, which is the usual place, you just wait. It will go faster than the current best SIMD/vector/offload/whatever thingy in a short time and all you have to do is sit there on your dumb butt. Under those circumstances, the very large peak advantage of SIMD just dissipates, and doing the work to exploit it simply isn’t worth the effort.

Yo ho. Excuse me. We’re not in that world any more. Clock rates aren’t improving like that any more; they’re virtually flat. But density improvement is still going strong, so those SIMD guys can keep packing more and more units onto chips.

Ha, right back at ‘cha: MIMD can pack more of their stuff onto chips, too, using the same density. But… It’s not sit on your butt time any more. Making 100s of processors scale up performance is not like making 4 or 8 or even 64 scale up. Providing the same old SMP model can be done, but will be expensive and add ever-increasing overhead, so it won’t be done. Things will trend towards the same kinds of synch done in SIMD.

Furthermore, I've seen game developer interviews where they strongly state that Larrabee is not what they want; they like GPUs. They said the same when IBM had a meeting telling them about Cell, but then they just wanted higher clock rates; presumably everybody's beyond that now.

Pure graphics processing isn’t the end point of all of this, though. For game physics, well, maybe my head just isn't build for SIMD; I don't understand how it can possibly work well. But that may just be me.

If either doesn't win in that game market, the volumes won't exist, and how well it does elsewhere won't matter very much. I'm not at all certain Intel's market position matters; see Itanium. And, of course, execution matters. There Intel at least has a (potential?) process advantage.

I doubt Intel gives two hoots about this issue, since a major part of their motivation is to ensure than the X86 architecture rules the world everywhere.

But, on the gripping hand, does this all really matter in the long run? Can Nvidia survive as an independent graphics and HPC vendor? More density inevitably will lead to really significant graphics hardware integrated onto silicon with the processors, so it will be “free,” in the same sense that Microsoft made Internet Explorer free, which killed Netscape. AMD sucked up ATI for exactly this reason. Intel has decided to build the expertise in house, instead, hoping to rise above their prior less-than-stellar graphics heritage.

My take for now is that CUDA will at least not get wiped out by Larrabee for the foreseeable future, just because Intel no longer has Moore’s 45% CAGR on its side. Whether it will survive as a company depends on many things not relevant here, and on how soon embedded graphics becomes “good enough” for nearly everybody, and "good enough" for HPC.

-----------------------------------------------------------------

Update 8/24/09.

There was some discussion on Reddit of this post; it seems to have aged off now – Reddit search doesn’t find it. I thought I'd comment on it anyway, since this is still one of the most-referenced posts I've made, even after all this time.

Part of what was said there was that I was off-base: Nvidia wasn’t SIMD, it was SPMD (separate instructions for each core). Unfortunately, some of the folks there appear to have been confused by Nvidia-speak. But you don’t have to take my word for that. See this excellent tutorial from SIGGRAPH 09 by Kayvon Fatahalian of Stanford. On pp. 49-53, he explains that in “generic-speak” (as opposed to “Nvidia-speak”) the Nvidia GeForce GTX 285 does have 30 independent MIMD cores, but each of those cores is effectively 1024-way SIMD: It has with groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap. So, to get performance you have to think SIMD out to 1024 if you want to get the parallel performance that is theoretically possible. Yes, then you have to use MIMD (SPMD) 30-way on top of that, but if you don’t have a lot of SIMD you just won’t exploit the hardware.

Thursday, September 18, 2008

Background: Me and This Blog

Some people still remember me as the author of In Search of Clusters, as I found out when I began posting to the Google group cloud-computing. A hearty Thank You! to all of them. For the rest of you, why not buy a copy? It's still in print, and I still get royalties. :-) (Really only a semi-smiley. Royalties are good.)

For the rest of you, and as a reminder, here’s a short bio:

I was until recently a Distinguished Engineer in the IBM Systems & Technology Group in Austin, Texas. I’m the author of In Search of Clusters, currently in its second edition and still occasionally referred to as “the bible of clusters” even seven years after its publication. I was also, back in the '80s, Chief Scientist of the RP3 massively-parallel computing project in IBM Research (joint with NYU). My job in Austin was to be a leader of future system architecture work, particularly in the areas of accelerators and appliances, and I was chair of the InfiniBand industry standard management workgroup. I've worked on parallel computing for over 30 years, and hold something around 30 patents in parallel computing and computer communications.

I’m currently retired, living in Austin, TX, and intending to move to Colorado (just North of Denver), where I've been invited to be part-time on the faculty of CSU - as soon as I can sell my house. That was supposed to be over and done three or four months ago, but things have ground to a trickle now that buyers find it virtually impossible to get a mortgage. Austin was pretty immune to housing slowdowns until that happened. Dang. Anyway.

In the meantime, I’m working on another book. Its working title is The End of Computing (As We Knew It), and my intention with it is to expose the hole into which the computer industry may be plunging by turning to explicit parallelism. I'm not implying there's any other realistic choice, mind you, but I got a snoot-full while in IBM of how (a) hardware engineers haven’t a clue what a bitch that is to program; (b) most software engineers haven’t a clue that this is being done to them; (c) many upper-management and analyst types have their heads firmly stuck in the sand on what this all will mean.

It particularly bugs me that people still blather on about how Moore's Law will keep on trucking for decades. Maybe it will, interpreted literally. But the Moore's Law that will keep on trucking has been castrated. It lacks a key element (frequency scaling) that drove the computing industry for the last four or more decades. This is a classic case of experts focussing on the veins in the leaves on the trees and ignoring the ravine they're about to fall into.

More. I have this suspicion that many people who really understand how deep into the doodoo we're going are weasel-wording it deliberately. No point in frightening the hoi polloi, now, is there? Maybe there's a cure, who knows, we're not there yet, hm? Horsepuckey.

Or else they're just scared to death and hoping like crazy they'll wake up one morning and somebody will have solved the problem. Unfortunately, that ignores 30 years of history, my friends.

That's the topic. Obviously, I feel strongly about it. This is good. It's motivational.

But...

I am sorry, and somewhat ashamed, to have to say that I’ve made nearly no progress on actually writing the book over the last year. I've messed around starting a couple of chapters, and have an outline, but that's all.

I have been spending a lot of time just keeping up with what’s being said all around this subject, and have amassed a huge number of web references and comments – 600+ MB of them, in fact. Stitching it all together is, however, requiring more focused continuous effort than retirement with a decent pension and waiting for a house to sell seems induce. (Why do it today? There’s always tomorrow.)

It finally occurred to me, through a devious chain of events, that it might help make some of my data collection and musings public.

Hence this blog.

You can expect to find here comments on things like:

Semiconductor technology – in a simple way. I’m no expert in this area, but it’s the basis of the whole issue.
What it takes to make parallel computer hardware, like SMPs and clusters (farms, etc.), interconnects and communications issues in particular.
Issues and difficulties involved in programming that hardware, and what has been learned in about 40 years of experience in the HPC arena (a point seemingly lost in most discussions).
Ways to make use of that hardware that don’t require explicit parallel programming by the mass of programmers, like virtualization, transaction processing.
Yet Another Way to use all those transistors: Accelerators. And why they may now finally have legs and stay around for a while (hint: you can’t win against a 45% CAGR increase in clock speed).
Graphics accelerators and, in particular, Larrabee vs. Nvidia / ATI(AMD). Why it’s needed. Who’s betting on which characteristic, whether they know it or not. (Lessons from HPC.)
Possible killer apps for parallel systems, like virtual worlds, graphics, stream processing, cloud computing (sort of), grids (sort of).
What this all means to the industry.

Readers of In Search of Clusters will notice a lot of familiar material there, and a lot that's new. One big differnce is that I'm going to try to aim for a more popular audience on this one, trying to make it more accessible to more people. There are two reasons for this: First, I think the topic needs to be brought more out into the open, outside the confines of the industry. Second, that sells lots more copies of any book.

In this blog I’ll probably also grouch and rave about some things that have nothing to do with any of the above, just to keep from getting my head clear now and again. Like a short lesson in not to try to carry a three-foot Tai Chi (Taijichuan) Jian, a three-foot straight sword, when flying from Beijing to Shanghai. Or my struggles with using Word to produce a book manuscript, which I apparently have to these days.

I'm also newly twitter-ified (twitified?) as GregPfister, and will try to keep that channel stuffed, too. But this will be where the data is, of necessity.

Oh, and anybody know what the situation is on copyright for blog contents? Until I know otherwise, I’m going to have to be really conservative about posting excerpts of work in progress on the new book.

Anyway, Hi!

I'm here.

(tap, tap) This thing on?

Anybody listening?