Thursday, November 11, 2010

Nvidia Past, Future, and Circular

I'm getting tired of writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one, getting it all out of the way.

Past Fermi Product Mix

For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from Investor Village:

Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've pointed out, this will be a real problem as Intel's and AMD's on-die GPUs assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already started shipping its Zacate integrated-GPU chip to manufacturers.

Future Fermis

Recently Nvidia's chief executive Jen-Hsun Huang gave an interview on what they are looking at for future features in the Fermi architecture. Things he mentioned were: (a) More development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in that order:

More CUDA: When asked why not OpenCL, he said because other people are working on OpenCL and they're the only ones doing CUDA. This answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was why they don't work to make OpenCL, a standard, work as well as their proprietary CUDA on their gear? Of course the answer is that OpenCL doesn't get them lock-in, which one doesn't say in an interview.

Virtual memory and pre-emption: A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There probably is some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: Cloud-based systems nearly all use virtual machines (for very good reason; see the link), splitting each system node into N virtual machines. Virtual memory and pre-emption allow the GPU to participate in that virtualization. The virtual memory part is, I would guess, more intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [UPDATE: Just after this was published, John Carmack (of Id Software) wrote a piece laying out the case for paging into GPUs. So that may be useful in games and generally.]

Direct InfiniBand attachment: At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory in other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see if everybody's done. (g) If not done, shove the resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think that the copying to and from main memory could be avoided since the GPUs are the ones doing all the computing: Just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice, however, it's not quite that straightforward; but it is at least worth looking at.
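To make the copy counting concrete, here is a toy model of that cycle in Python. It is only a sketch under loud assumptions: the names (Node, run_staged, run_direct) are made up for illustration, no real GPU or InfiniBand API is involved, the "kernel" just adds one to each value, the inter-node exchange is a single boundary value swapped between two nodes, and a fixed iteration count stands in for the "is everybody done" test in (f).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    host: list                                  # data held in main memory
    gpu: list = field(default_factory=list)     # data held in GPU memory
    host_copies: int = 0                        # transfers between main memory and GPU


def compute(values):
    # Stand-in for step (b); a real kernel would run on the GPU.
    return [v + 1 for v in values]


def exchange(left, right):
    # Steps (d) and (e): swap one boundary value between neighbours, in place.
    left[-1], right[0] = right[0], left[-1]


def run_staged(iterations):
    """Every iteration stages data through main memory, as in (a)-(g)."""
    a, b = Node([0, 0, 0]), Node([10, 10, 10])
    for _ in range(iterations):
        for n in (a, b):
            n.gpu = list(n.host)        # (a) main memory -> GPU
            n.gpu = compute(n.gpu)      # (b) compute on the GPU
            n.host = list(n.gpu)        # (c) GPU -> main memory
            n.host_copies += 2
        exchange(a.host, b.host)        # (d), (e) between main memories
    return a, b


def run_direct(iterations):
    """Hypothetical variant: GPUs exchange boundary data with each other."""
    a, b = Node([0, 0, 0]), Node([10, 10, 10])
    for n in (a, b):
        n.gpu = list(n.host)            # load the GPU once, up front
        n.host_copies += 1
    for _ in range(iterations):
        for n in (a, b):
            n.gpu = compute(n.gpu)
        exchange(a.gpu, b.gpu)          # GPU to GPU, no main-memory staging
    for n in (a, b):
        n.host = list(n.gpu)            # copy results back once, at the end
        n.host_copies += 1
    return a, b
```

In this model both variants end with the same data, but the staged loop pays two main-memory copies per node per iteration, while the direct one pays two per node in total. It says nothing about the part that is "not quite that straightforward" on real hardware.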

So, that's what may be new in Nvidia CUDA / Fermi land. Each of those is at least marginally justifiable, some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of dueling Nvidia / AMD (ATI) announcements of about a year ago.

That was the time of the Fermi announcement, which compared with prior Nvidia hardware doubled everything, yada yada, and added… ECC. And support for C++ and the like, and good speed double-precision floating-point.

At that time, Tech Report said that the AMD Radeon HD 5870 doubled everything, yada again, and added … a fancy new anisotropic filtering algorithm for smoothing out texture applications at all angles, and supersampling to better avoid aliasing.

Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?

The Wheel of Reincarnation

The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by T. H. Myer and Ivan Sutherland. There are probably hundreds of renditions of it floating around the web; here's mine.

Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.

So, you beef up your IO device, like by adding the ability to go through a whole list of X, Y locations and putting dots up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains. So you tell it to repeat. But that's really limiting. It would be much more convenient if you could tell the device to go do another list all by itself, like by embedding the next list's address in the block of X,Y data. This takes a bit of thought, since it means adding a code to everything, so the device can tell X,Y pairs from next-list addresses; but it's clearly worth it, so in it goes.

Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.

Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.

At some stage it looks really useful to add conditionals, too, so…
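The steps so far, short of conditionals, fit in a few lines of Python. This is a sketch only: the opcode names (DOT, JUMP, CALL, RET, SCALE, HALT) and the memory layout are invented for illustration and are not taken from any real display processor.

```python
def run_display_list(memory, start, max_steps=1000):
    """Walk tagged display-list entries and return the dots that get drawn."""
    dots = []
    stack = []      # return addresses for CALL / RET
    scale = 1       # current scale factor applied to every dot
    pc = start
    for _ in range(max_steps):      # guard against lists that never halt
        op, *args = memory[pc]
        if op == "DOT":             # tagged X,Y pair: put a dot there
            x, y = args
            dots.append((x * scale, y * scale))
            pc += 1
        elif op == "JUMP":          # next-list address embedded in the data
            pc = args[0]
        elif op == "CALL":          # subroutine branch, e.g. a text character
            stack.append(pc + 1)
            pc = args[0]
        elif op == "RET":
            pc = stack.pop()
        elif op == "SCALE":         # hello, arithmetic
            scale = args[0]
            pc += 1
        elif op == "HALT":
            return dots
        else:
            raise ValueError(f"unknown opcode {op!r}")
    raise RuntimeError("display list did not halt")


# A tiny example: draw one dot, draw a two-dot "glyph" via a subroutine,
# draw the glyph again at twice the size, then continue in another list.
memory = {
    0: ("DOT", 1, 1),
    1: ("CALL", 10),
    2: ("SCALE", 2),
    3: ("CALL", 10),
    4: ("JUMP", 20),
    10: ("DOT", 0, 0),
    11: ("DOT", 0, 1),
    12: ("RET",),
    20: ("DOT", 5, 5),
    21: ("HALT",),
}
dots = run_display_list(memory, start=0)
```

Note that the "IO device" already has a program counter, a stack, and a register. That is where this is heading.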

Somewhere along the line, to make this a 21st century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.

Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work.

And it's spending all its time doing nothing but putting silly dots on a screen.

How about freeing it up to do something more useful by adding a separate device to it to do that?

This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.

Every incremental stage in this process was very well-justified, and Myer and Sutherland say they saw examples (in 1968!) of systems that were more than twice around the wheel: A graphics processor hanging on a graphics processor hanging on a graphics processor. These multi-cycles are often justified if there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: It's got a graphics processor on a processor that uses a server somewhere else.

I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans and Sutherland Computing Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system on their second planned graphics system (LDS-2). It was, as was asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.

Just like Nvidia is talking about attaching InfiniBand directly to its cards.

Also, in the mid-80s in IBM Research, after the successful completion of an effort to build a special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.

Just like Nvidia is adding virtualization to its systems.

Each incremental step is justified – that's always the case with the wheel – just as, in the discussion above, I showed a justification for every one of the general-purpose additions to Nvidia's architecture.

The issue here is not that this is all necessarily bad. It just is. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not.

With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase after ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have, usually with far more resources than you have, been playing the general-purpose game for decades.

It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.


Anonymous said...

better preemption is for better desktop interactivity I guess .. virtual memory doesn't have to translate to demand paging, and hence your latency argument; you can pipeline swap-in/outs, and having a page table supported system allows non contiguous surfaces I suppose.

I suppose nvidia does push IBM offerings in HPC to the background eh :-)

Greg Pfister said...

IBM, yes indeed. There was even an announcement that they're reviving Cell - and nobody else is paying for it! Sounds to me like some high exec got tired of all those big systems with Fermis, and no follow-on to Cell in RoadRunner.

Pipelining swap in/out - well, I said there are undoubtedly some cases that work. I think just being able to allocate GPU jobs without worrying about contiguous storage issues is enough justification.

Dale Innis said...

Interesting stuff! Various MMO / Virtual World people are talking about using Fermi/Tesla or ATI Fusion or whatever to do server-side rendering of your orcs or spaceships or virtual strippers or artworks or whatever (see for instance the Blue Mars / OTOY announcement).

Wondering if you have any insights into the technical and business aspects of that? I suspect that latency spikes might be a real problem for twitch games (but not so much for sandbox virtual worlds), and I'm wondering if it's economically feasible for your typical MMO provider to, basically, buy high-end graphics rendering for every player. But I don't actually know how the costs scale...

Alan Commike said...

I expect it's not Infiniband per se that they'll be putting in their chips, but RDMA capability. There's value to being able to DMA GPU to GPU, whether from the same host or remote hosts with the same API.

I've looked into doing this with external support chips and custom FPGA work. I'm glad they're doing it in-house.

Anonymous said...

I'm wondering: do you actually have any experience in writing any kind of GPGPU code? Namely, I can certainly agree with your points on graphics and such, but when GPGPU is involved, then if you actually got engaged in writing code, you would know that both OpenCL as a standard, as well as its current implementation, are such a crap that it is no wonder that NVIDIA couldn't care less about it...

Greg Pfister said...


Games in the cloud are here already -- including twitch games. OnLive has them. See my post for a discussion of the issues at its announcement, and for my review of trying it on an FPS: It works.

Latency does not seem to be a problem -- but bandwidth may well be. Gamers may well bump up against ISP limits. (See my posts.)

I think MMORPGs are a clear target for that kind of technology.

I don't know whether GPU virtualization is a necessary economic component for this, though. It may be, but at least for FPS, you may need a whole card's power for a player. Is there an economic advantage to getting big cards and using them partitioned? I don't know; that will depend on detailed numbers.


Greg Pfister said...


Huang specifically said InfiniBand. You can, of course, do RDMA across InfiniBand, and that's what I suspect would be used.

If you're going to RDMA card-to-card over any significant distance with a large number of cards, you would have to invent a transport anyway. Using IB saves you that trouble, and hooks you into the existing IB ecosystem, with many benefits there.


Greg Pfister said...


Yes, I know OpenCL is way behind CUDA. If Nvidia had been putting the effort into it that they have into CUDA, that wouldn't be the case.

And yes, you're right, I've not written GPGPU code. I have written SIMD code, though, and code (and a compiler) for stranger architectures than that.


Andrew Richards said...

I think the difference between what's happening now, and what happened in the past, is that CPUs hit higher levels of performance, so you could implement custom graphics hardware on the CPU. That isn't happening now, because GPUs have become so fast and so power-efficient, that CPUs will never catch up. So, we're in a different situation where GPUs are here to stay.

GPUs are gradually adding increased levels of programmability. This helps HPC, but it also helps graphics and games. These new technologies (like virtualization) help graphics because they enable developers to do more complex lighting calculations. Lighting is incredibly complicated to get right, and there is a huge amount of variation involved.

As the lighting calculations get more complicated, and as we use more complex ways of generating and processing geometry, then it becomes increasingly useful for a GPU to be able to share data structures with a CPU. This is also true for GPGPU.

Sharing data structures between GPU and CPU requires closer access to the main bus, virtual memory support and pre-emption. So, all of the things NVIDIA list as future desirables will enable innovation in graphics as well as other areas.

NVIDIA are banking on a time in the future when the GPU becomes more important to customers than the CPU. For many users, that time is coming.

Greg Pfister said...

Hi, Andrew.

I guess I should have been clearer: I don't think GPUs shouldn't become more general; the CPU performance flattening you cite is a good reason. I just think developers should be aware of the historical tendency to generalize, and also be aware that it can reduce the good coming from specialization.

Regarding GPU paging in particular, it so happens that just today John Carmack (of Id Software) wrote a piece laying out the case for paging into GPUs. So there's a case where I may have it wrong above. I'm going to edit the main text to point to this - not my usual practice, but this is clear enough to warrant it.


Andrew Richards said...

Yes, I know you don't think GPUs shouldn't become more general, but I do! But it has to happen in a way that doesn't compromise GPUs' ability to achieve high performance/watt, high bandwidth, and (especially) high graphics performance. So, the solutions may be similar, but not the same, to what we do on CPUs.

Paging is quite nasty on a GPU. GPUs don't stall much, seeing as they execute in lock-step. So, stalling for a page-fault could lock your GPU until the OS has sorted out the page-fault problem, but the OS might need the GPU to do something in the meantime: like show an error on the UI. GPUs could have thousands of memory-operations in-flight at the point of a page fault. That's hard to communicate to an OS.

Vegar said...

Greg, this was a very good read. I just don't see any real new objections against NVIDIA here. Sure they are allowed to focus on their CUDA, and their Cg shaders, and other areas where they are in the lead. Of course there is a lot of lock-in when using NVIDIA's API. Smart people know the trade-offs connected to lock-in.

For a lot of people the NVIDIA interfaces work fine. For instance, many are using other languages than the native ones for GPGPU. Having different options on how to run some Python code, on CUDA, OpenCL, or DirectCompute is NEVER a downside. It may on the other hand be very USEFUL for unit testing, and for finding the driver bugs.

John Carmack may get his paging soon too. But stuff like that is difficult to do without help from MS and the other GPU vendors. Even if the spec seems to be good, I still think we need input from other people on how to do this.

I'm not at all worried about the integrated GPU in Intel and AMD CPUs.

My perspective is that the serious obstacle for GPGPU is the LACK OF MARKET. What I learned in school and work is that using new technology in a new market is very risky. Just having 2 codepaths for a simple renderer may in some cases be a lifesaver, when the drivers are bad.

AMD and Intel will actually help the GPU market, by putting a GPU in every computer. At the same time the price/performance will be better for NVIDIA mid to high range for those who need the speed. This kind of competition will also force NVIDIA to deliver cheap parallel performance.

And I hope NVIDIA will deliver both MPI and Infiniband extensions in their CUDA toolkit. In most practical implementations the GPU part and the MPI part deal with the same memory areas, and switch between each other.
