Monday, November 15, 2010

The Cloud Got GPUs


Amazon just announced, on the first full day of SC10 (SuperComputing 2010), the availability of Amazon EC2 (cloud) machine instances with dual Nvidia Fermi GPUs. According to Amazon's specification of instance types, this "Cluster GPU Quadruple Extra Large" instance contains:

  • 22 GB of memory
  • 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
  • 2 x NVIDIA Tesla "Fermi" M2050 GPUs
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: Very High (10 Gigabit Ethernet)

So it looks like the future virtualization features of CUDA really are aimed at using GPUs in the cloud, as I mentioned in my prior post.
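
For anyone who wants to poke at one, here's a minimal sketch of what requesting such an instance looks like through the EC2 API, using the boto Python library. The API name for the type is cg1.4xlarge; the AMI ID, key pair, and security group below are placeholders you'd replace with your own (the type needs one of the HVM AMIs Amazon provides for cluster instances).

# Minimal sketch: launching a Cluster GPU instance via boto.
# The AMI ID, key pair, and security group are placeholders.
import boto

conn = boto.connect_ec2()  # credentials come from the environment or ~/.boto

reservation = conn.run_instances(
    'ami-00000000',               # placeholder: an HVM AMI for cluster instances
    instance_type='cg1.4xlarge',  # the Cluster GPU Quadruple Extra Large type
    key_name='my-keypair',        # placeholder key pair name
    security_groups=['default'],
)
instance = reservation.instances[0]
print("launched %s, state: %s" % (instance.id, instance.state))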

One of these XXXXL instances costs $2.10 per hour for Linux; Windows users need not apply. Or, if you reserve an instance for a year – for $5630 – you then pay just $0.74 per hour during that year. (Prices quoted from Amazon's price list as of 11/15/2010; no doubt they will decrease over time.)
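
Doing the back-of-the-envelope arithmetic on those two rates: the reservation only pays off once the $5630 fee has been earned back out of the $1.36-per-hour difference, which takes roughly

\[
  \frac{\$5630}{\$2.10/\text{hr} - \$0.74/\text{hr}} \approx 4{,}140 \text{ hours} \approx 47\% \text{ of a year.}
\]

Below that utilization, plain on-demand pricing is the cheaper way to go.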

This became such hot news that GPU was a trending topic on Twitter for a while.

For those of you who don't watch such things, many of the Top500 HPC sites – the 500 supercomputers worldwide that are fastest at the Linpack benchmark – have nodes featuring Nvidia Fermi GPUs. This year that list notoriously includes, in the top slot, the system causing the heaviest breathing at present: the Tianhe-1A at the National Supercomputer Center in Tianjin, China.

I wonder how well this will do in the market. Cloud elasticity – the ability to add or remove nodes on demand – is usually a big cloud selling point for commercial use (expand for the holiday rush, drop nodes afterward). How much that will really matter for HPC applications isn't clear to me, since those are usually batch jobs, not continuously operating services that grow and shrink the way commercial web services do. So it has to live on price alone. The price above doesn't feel all that inexpensive to me, but I'm not well calibrated in HPC costs these days, and don't know how it compares with, for example, the cost of running the same calculation on TeraGrid. Ad hoc, extemporaneous use of HPC is another possible market; I'm sure it exists, but I'm not sure how much of it exists.

Then again, how about services running games, including the rendering? I wonder whether, for example, the communications secret sauce OnLive uses to stream rendered game video fast enough for first-person shooters could operate out of Amazon instances. Even if it can't, games that can tolerate a tad more latency may work. Games targeting small screens, which need less rendering effort, are another possibility. That could crater startup costs for companies offering games over the web.

Time will tell. For accelerators, we certainly are living in interesting times.

Thursday, November 11, 2010

Nvidia Past, Future, and Circular


I'm getting tired of writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one post, getting it all out of the way.

Past Fermi Product Mix

For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from Investor Village:


Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've pointed out, this will be a real problem as Intel's and AMD's on-die GPUs assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already started shipping its Zacate integrated-GPU chip to manufacturers.

Future Fermis

Recently Nvidia's chief executive, Jen-Hsun Huang, gave an interview on what they are looking at as future features of the Fermi architecture. The things he mentioned were: (a) more development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in order:

More CUDA: When asked why not OpenCL, he said it's because other people are working on OpenCL, and they're the only ones doing CUDA. That answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was: why don't they make OpenCL, a standard, work as well on their gear as their proprietary CUDA does? The answer, of course, is that OpenCL doesn't get them lock-in – which one doesn't say in an interview.

Virtual memory and pre-emption: A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There is probably some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: cloud-based systems nearly all use virtual machines (for very good reason; see the link), splitting each system node into N virtual machines. Virtual memory and pre-emption allow the GPU to participate in that virtualization. The virtual memory part is, I would guess, mainly intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [UPDATE: Just after this was published, John Carmack (of id Software) wrote a piece laying out the case for paging into GPUs. So it may be useful in games, and generally.]


Direct InfiniBand attachment: At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from the GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory of other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see whether everybody's done. (g) If not done, shove the resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think the copying to and from main memory could be avoided, since the GPUs are the ones doing all the computing: just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice it's not quite that straightforward; but it is at least worth looking at.
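
For concreteness, here's a rough sketch of one pass around that cycle in Python, using the pycuda and mpi4py bindings; the kernel, the halo size, and the convergence test are stand-ins rather than a real application, but steps (a) through (d) are exactly the data motion that a direct GPU-to-interconnect attachment would aim to cut out.

# Sketch of the (a)-(g) cycle: pycuda for the GPU, mpi4py for the interconnect.
# The kernel body, HALO size, and convergence test are stand-ins.
import numpy as np
import pycuda.autoinit              # creates a CUDA context on this node's GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
from mpi4py import MPI

comm = MPI.COMM_WORLD
left, right = (comm.rank - 1) % comm.size, (comm.rank + 1) % comm.size

N, HALO = 1 << 20, 1024
u = np.ones(N, dtype=np.float32)
d_u = cuda.mem_alloc(u.nbytes)

step = SourceModule("""
__global__ void step(float *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] *= 0.5f;        /* stand-in for the real computation */
}
""").get_function("step")

done = False
while not done:
    cuda.memcpy_htod(d_u, u)                                       # (a) shove data out to the GPU
    step(d_u, np.int32(N), block=(256, 1, 1), grid=(N // 256, 1))  # (b) compute on the GPU
    cuda.memcpy_dtoh(u, d_u)                                       # (c) suck data back to main memory

    halo_in = np.empty(HALO, dtype=np.float32)
    comm.Sendrecv(u[:HALO], dest=left,                             # (d) exchange with neighbor nodes
                  recvbuf=halo_in, source=right)
    u[-HALO:] = halo_in                                            # (e) merge the new data

    local_done = int(np.max(np.abs(u)) < 1e-6)                     # (f) is everybody done?
    done = bool(comm.allreduce(local_done, op=MPI.MIN))            # (g) if not, push it all out again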

So, that's what may be new in Nvidia CUDA / Fermi land. Each of those is at least marginally justifiable, and some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of the dueling Nvidia / AMD (ATI) announcements of about a year ago.

That was the time of the Fermi announcement, which, compared with prior Nvidia hardware, doubled everything, yada yada, and added… ECC. And support for C++ and the like, and respectably fast double-precision floating point.

At that time, Tech Report said the AMD Radeon HD 5870 doubled everything, yada again, and added… a fancy new anisotropic filtering algorithm for smoothing out textures applied at any angle, and supersampling for better antialiasing.

Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?

The Wheel of Reincarnation

The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by T. H. Myer and Ivan Sutherland. There are probably hundreds of renditions of it floating around the web; here's mine.

Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.

So, you beef up your IO device, say by adding the ability to go through a whole list of X, Y locations and put a dot up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains. So you tell the device to repeat the list on its own. But that's really limiting. It would be much more convenient if you could tell it to go do another list all by itself, like by embedding the next list's address in the block of X, Y data. This takes a bit of thought, since it means adding a code to everything, so the device can tell X, Y pairs from next-list addresses; but it's clearly worth it, so in it goes.

Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.

Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.

At some stage it looks really useful to add conditionals, too, so…

Somewhere along the line, to make this a 21st century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.

Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work.
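
To see just how much of a computer we've backed into, here's a toy rendition of the device this narrative builds, written as a little display-list interpreter; the opcodes and encoding are invented purely for illustration, but they track the steps above: dots, jumps to the next list, subroutine calls for re-used objects, scaling arithmetic, and a conditional.

# Toy display-list "processor" showing where the wheel has taken us.
# The opcodes and encoding are invented for illustration only.
DOT, JUMP, CALL, RET, SCALE, JUMP_IF_ZERO, HALT = range(7)

def run(program):
    """Interpret a display list; return the dots it would put on the screen."""
    dots, stack, pc, scale = [], [], 0, 1
    while True:
        op = program[pc]
        if op == DOT:                          # put a dot at (x, y)
            x, y = program[pc + 1], program[pc + 2]
            dots.append((x * scale, y * scale))
            pc += 3
        elif op == JUMP:                       # go do another list all by itself
            pc = program[pc + 1]
        elif op == CALL:                       # re-usable object, e.g. a text character
            stack.append(pc + 2)
            pc = program[pc + 1]
        elif op == RET:
            pc = stack.pop()
        elif op == SCALE:                      # hello, arithmetic
            scale = program[pc + 1]
            pc += 2
        elif op == JUMP_IF_ZERO:               # the conditional that sneaked in
            pc = program[pc + 2] if program[pc + 1] == 0 else pc + 3
        elif op == HALT:
            return dots

glyph = [DOT, 1, 1, DOT, 2, 1, RET]   # a re-usable two-dot "character"
main  = [SCALE, 2, CALL, 5, HALT]     # 5 = index where the glyph begins
print(run(main + glyph))              # -> [(2, 2), (4, 2)]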

And it's spending all its time doing nothing but putting silly dots on a screen.

How about freeing it up to do something more useful by adding a separate device to it to do that?

This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.

Every incremental stage in this process was very well justified, and Myer and Sutherland say they saw examples (in 1968!) of systems that had gone more than twice around the wheel: a graphics processor hanging on a graphics processor hanging on a graphics processor. These multiple cycles are often justified when there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: it's got a graphics processor on a processor that uses a server somewhere else.

I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans & Sutherland Computer Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system for their second planned graphics system (LDS-2). It was, as asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.

Just like Nvidia is talking about attaching InfiniBand directly to its cards.

Also, in the mid-80s at IBM Research, after the successful completion of an effort to build a special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.

Just like Nvidia is adding virtualization to its systems.

Each incremental step is justified – that's always the case with the wheel – just as, in the discussion above, I showed a justification for every general-purpose addition to the Nvidia architecture.

The issue here is not that this is all necessarily bad. It just is. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not.

With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have been playing the general-purpose game for decades, usually with far more resources than you have.

It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.