
Tuesday, May 17, 2011

Sandy Bridge Graphics Disappoints


See update at the end of this post: New drivers released.
Well, I'm bummed out.

I was really looking forward to purchasing a new laptop that had one of Intel's new Sandy Bridge chips. That's the chip with integrated graphics which, while it wouldn't exactly rock, would at least be adequate for games at midrange settings. No more fussing around comparing added discrete graphics chips, fewer scorch marks on my lap, and other associated goodness would ensue.

The pre-ship performance estimates and hands-on trials said that would be possible, as I pointed out in Intel Graphics in Sandy Bridge: Good Enough. This would have had the side effect of pulling the rug out from under Nvidia's GPU volumes, forcing the HPC market to pull its own weight – meaning traditional HPC price tags (see Nvidia-based Cheap Supercomputing Coming to an End). That would have been an earthquake, since most of the highest-end HPC systems now get their peak speeds from Nvidia CUDA accelerators, a situation in no small part due to their (relatively) low prices arising from high graphics volumes.

Then TechSpot had to go and do a performance comparison of low-end graphics cards, and later, just as a side addition, throw in measurements of Sandy Bridge graphics, too.

Now, I'm sufficiently old-fashioned in my language that I really try to avoid even marginally obscene terms, even if they are in widespread everyday use, but in this case I have to make an exception:

Damn, Sandy Bridge really sucks at graphics.

It's the lowest of the low in every case. It's unusable for every game tested (and they tested quite a few), unless you're on some time-dilation drug that makes less than 15 frames per second seem zippy. Some frame rates – at medium settings – are in single digits.

With Sandy Bridge, Intel has solidly maintained its historic lock on the worst graphics performance in the industry. This, by the way, is with the Intel i7 chips overclocked to 3.4GHz. That should also overclock the graphics (unless Intel is doing something I don't know about with the graphics clock).

Ah, but possibly there is a "3D" fix for this coming soon? Ivy Bridge, the upcoming 22nm shrink of Sandy Bridge (the Intel "tick" following the Sandy Bridge "tock"), has those wondrous new much-promoted transistors. Heh. Intel says Ivy Bridge will have – drum roll – 30% faster graphics than Sandy Bridge.

See prior marginal obscenity.

Intel does tend to sandbag future performance estimates, but not by enough to lift 30% up to 200-300%; that's what would be needed to produce what people were saying Sandy Bridge would do. Is that all we get from those "3D" transistors? The way the Intel media guys are going on about 3D, I expected Tri-Gate (which can be two- or five- or whatever-gate) to give me an Avatar-like mind meld or something.

All that stuff about on-chip integrated graphics taking over the low-end high-volume market for discrete graphics just isn't going to happen this year with Sandy Bridge, or later with Ivy Bridge. As further salt in my wound, Nvidia is even seeing a nice revenue uptick from selling discrete graphics add-ons to new Sandy Bridge systems. It's not that I have anything against Nvidia. I just didn't think that uptick, of all things, was going to happen.

This doesn't change my opinion that GPUs integrated on-chip will ultimately take over the low-end graphics market. As the real Moore's Law – the law about transistor densities, not clock rates – continues to march on, it's inevitable that on-chip integrated graphics will be just fine for low- and medium-range games. It just won't happen soon with Intel products.

Ah, but what about AMD? Their Fusion chips with integrated graphics, which they call APUs, are supposed to be rather good. Performance information leaked on message boards about their upcoming A4-3400, A6-3650 and A8-3850 APUs makes them sound as good as, well, um, as good as Sandy Bridge was supposed to be. Hm.

Several years ago I heard a high-level AMD designer say that people looking for performance from Fusion were going to be disappointed; it was strictly a cost/performance product. Things could have changed since then, but chip design lead times are still multi-year.

In any event, this time I think I'll wait until shipped products are tested before declaring victory.

Meanwhile, here I go again, flipping back and forth between laptop specs and GPU specs, as usual.

Sigh.


UPDATE May 23, 2011

Intel has just released new drivers for Sandy Bridge. The press release says they provide “up to 40% performance improvements on select games, support for the latest games like Valve’s Portal 2 and Stereoscopic 3D playback on DisplayPort monitors.”

At this time I don't know of test results that would confirm whether this really makes a difference, but if it’s real, and applies broadly enough, it might be just barely enough to make the Ivy Bridge chip the beginning of the end for low-end discrete graphics.


Tuesday, January 11, 2011

Intel-Nvidia Agreement Does Not Portend a CUDABridge or Sandy CUDA


Intel and Nvidia reached a legal agreement recently in which they cross-license patents, stop suing each other over chipset interfaces, and oh, yeah, Nvidia gets $1.5B from Intel in five easy payments of $300M each.

This has been covered in many places, like here, here, and here, but in particular Ars Technica originally led with a headline about a Sandy Bridge (Intel GPU integrated on-chip with CPUs; see my post if you like) using Nvidia GPUs as the graphics engine. Ars has since retracted that (see the web page referenced above), replacing the original web page. (The URL still reads "bombshell-look-for-nvidia-gpu-on-intel-processor-die.")

Since that's been retracted, maybe I shouldn't bother bringing it up, but let me be more specific about why this is wrong, based on my reading of the actual legal agreement (redacted, meaning a confidential part was deleted). Note: I'm not a lawyer, although I've had to wade through lots of legalese over my career; so this is based on an "informed" layman's reading.

Yes, they have cross-licensed each other's patents. So if Intel does something in its GPU that is covered by an Nvidia patent, no suits. Likewise, if Nvidia does something covered by Intel patents, no suits. This is the usual intention of cross-licensing deals: Each side has "freedom of action," meaning they don't have to worry about inadvertently (or not) stepping on someone else's intellectual property.

It does mean that Intel could, in theory, build a whole dang Nvidia GPU and sell it. Such things have happened historically, usually without cross-licensing, and are uncommon (IBM mainframe clones, X86 clones); but as a practical matter, wholesale inclusion of one company's processor design into another company's products is a hard job. There is a lot to a large digital widget not covered by the patents – numbers of undocumented, implementation-specific corner cases that can mess up full software compatibility, without which there's no point. Finding them all is a massive undertaking.

So switching to a CUDA GPU architecture would be a massive undertaking, and furthermore it's a job Intel apparently doesn't want to do. Intel has its own graphics designs, with years of the design / test / fabricate pipeline already in place; and between the ill-begotten Larrabee (now MIC) and its own specific GPUs and media processors, Intel has demonstrated that they really want to do graphics in house.

Remember, what this whole suit was originally all about was Nvidia's chipset business – building stuff that connects processors to memory and IO. Intel's interfaces to the chipset were patent protected, and Nvidia was complaining that Intel didn't let Nvidia get at the newer ones, even though they were allegedly covered by a legal agreement. It's still about that issue.

This makes it surprising that, buried down in section 8.1, is this statement:

"Notwithstanding anything else in this Agreement, NVIDIA Licensed Chipsets shall not include any Intel Chipsets that are capable of electrically interfacing directly (with or without buffering or pin, pad or bump reassignment) with an Intel Processor that has an integrated (whether on-die or in-package) main memory controller, such as, without limitation, the Intel Processor families that are code named 'Nehalem', 'Westmere' and 'Sandy Bridge.'"

So all Nvidia gets is the old FSB (front side bus) interfaces. They can't directly connect into Intel's newer processors, since those interfaces are still patent protected, and those patents aren't covered. They have to use PCI, like any other IO device.

So what did Nvidia really get? They get bupkis, that's what. Nada. Zilch. Access to an obsolete bus interface. Well, they get bupkis plus $1.5B, which is a pretty fair sweetener. Seems to me that it's probably compensation for the chipset business Nvidia lost when there was still a chipset business to have, which there isn't now.

And both sides can stop paying lawyers. On this issue, anyway.

Postscript

Sorry, this blog hasn't been very active recently, and a legal dispute over obsolete busses isn't a particularly wonderful re-start. At least it's short. Nvidia's Project Denver – sticking a general-purpose ARM processor in with a GPU – might be an interesting topic, but I'm going to hold off on that until I can find out what the architecture really looks like. I'm getting a little tired of just writing about GPUs, though. I'm not going to stop that, but I am looking for other topics on which I can provide some value-add.

Thursday, November 11, 2010

Nvidia Past, Future, and Circular


I'm getting tired about writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one, getting it all out of the way.

Past Fermi Product Mix

For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from Investor Village:


Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've pointed out, this will be a real problem as Intel's and AMD's on-die GPUs assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already started shipping its Zacate integrated-GPU chip to manufacturers.

Future Fermis

Recently Nvidia's chief executive Jen-Hsun Huang gave an interview on what they are looking at for future features in the Fermi architecture. Things he mentioned were: (a) More development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in that order:

More CUDA: When asked why not OpenCL, he said because other people are working on OpenCL and they're the only ones doing CUDA. This answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was: why don't they make OpenCL, a standard, work as well as their proprietary CUDA on their own gear? Of course the answer is that OpenCL doesn't get them lock-in, which one doesn't say in an interview.

Virtual memory and pre-emption: A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There probably is some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: Cloud-based systems nearly all use virtual machines (for very good reason; see the link), splitting each system node into N virtual machines. Virtual memory and pre-emption allow the GPU to participate in that virtualization. The virtual memory part is, I would guess, more intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [UPDATE: Just after this was published, John Carmack (of id Software) wrote a piece laying out the case for paging into GPUs. So that may be useful in games and generally.]


Direct InfiniBand attachment: At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from the GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory of other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see if everybody's done. (g) If not done, shove the resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think that the copying to and from main memory could be avoided, since the GPUs are the ones doing all the computing: just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice, however, it's not quite that straightforward; but it is at least worth looking at.
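To make that cycle concrete, here is a minimal sketch of one iteration in today's CUDA-plus-MPI style – my own toy code, with made-up names (step_kernel, the halo buffers and sizes), not anything Nvidia has published. Steps (a) and (c) are the cudaMemcpy calls that direct InfiniBand attachment would aim to eliminate.

```cpp
// Sketch only: one iteration of the (a)-(g) cycle above. Hypothetical names,
// no error checking, and a trivial stand-in kernel.
#include <cstring>
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void step_kernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0;                      // stand-in for the real computation
}

void one_iteration(double *host_data, double *dev_data,
                   double *send_halo, double *recv_halo,
                   int n, int halo, int left, int right) {
    // (a) Shove data from main memory out to the GPU.
    cudaMemcpy(dev_data, host_data, n * sizeof(double), cudaMemcpyHostToDevice);

    // (b) Compute on the GPU.
    step_kernel<<<(n + 255) / 256, 256>>>(dev_data, n);
    cudaDeviceSynchronize();

    // (c) Suck data back from the GPU into main memory.
    cudaMemcpy(host_data, dev_data, n * sizeof(double), cudaMemcpyDeviceToHost);

    // (d) Exchange boundary data with neighboring nodes over the interconnect.
    memcpy(send_halo, host_data, halo * sizeof(double));
    MPI_Sendrecv(send_halo, halo, MPI_DOUBLE, left,  0,
                 recv_halo, halo, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // (e) Merge the received data with what's in main memory.
    memcpy(host_data + n - halo, recv_halo, halo * sizeof(double));

    // (f), (g): the caller tests for convergence and calls this again if not done.
}
```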

So, that's what may be new in Nvidia CUDA / Fermi land. Each of those is at least marginally justifiable, some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of dueling Nvidia / AMD (ATI) announcements of about a year ago.

That was the time of the Fermi announcement, which compared with prior Nvidia hardware doubled everything, yada yada, and added… ECC. And support for C++ and the like, and good speed double-precision floating-point.

At that time, Tech Report said that the AMD Radeon HD 5870 doubled everything, yada again, and added … a fancy new anisotropic filtering algorithm for smoothing out texture applications at all angles, and supersampling for better antialiasing.

Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?

The Wheel of Reincarnation

The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by T. H. Myer and Ivan Sutherland. There are probably hundreds of renditions of it floating around the web; here's mine.

Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.

So, you beef up your IO device, like by adding the ability to go through a whole list of X, Y locations and putting dots up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains. So you tell it to repeat. But that's really limiting. It would be much more convenient if you could tell the device to go do another list all by itself, like by embedding the next list's address in a block of X,Y data. This takes a bit of thought, since it means adding a code to everything, so the device can tell X,Y pairs from next-list addresses; but it's clearly worth it, so in it goes.

Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.

Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.

At some stage it looks really useful to add conditionals, too, so…
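To make the progression concrete, here's a toy display-list interpreter – purely my own sketch, not Myer and Sutherland's design – with dots, a jump code, and subroutine call/return. It's easy to see where scaling, arithmetic, and conditionals would slot in next.

```cpp
// Toy display-list "processor": the wheel's first few turns in ~30 lines.
#include <cstdio>
#include <vector>

enum Op { DOT, JUMP, CALL, RET, END };      // DOT x y | JUMP addr | CALL addr | RET | END
struct Word { Op op; int a, b; };

void run(const std::vector<Word>& list, int pc) {
    std::vector<int> return_stack;
    while (true) {
        const Word& w = list[pc];
        switch (w.op) {
        case DOT:  std::printf("dot at (%d,%d)\n", w.a, w.b); pc++; break;
        case JUMP: pc = w.a; break;                                   // "next list's address"
        case CALL: return_stack.push_back(pc + 1); pc = w.a; break;   // re-usable pattern
        case RET:  pc = return_stack.back(); return_stack.pop_back(); break;
        case END:  return;                   // a real device would loop back and refresh
        }
    }
}

int main() {
    // Main list starts at 0; a re-usable "character" pattern lives at address 5.
    std::vector<Word> list = {
        {DOT, 0, 0}, {DOT, 1, 0},                // 0-1: plain dots
        {CALL, 5, 0}, {DOT, 9, 9},               // 2: draw the shared pattern; 3: another dot
        {END, 0, 0},                             // 4: done
        {DOT, 4, 4}, {DOT, 4, 5}, {RET, 0, 0}    // 5-7: the shared pattern
    };
    run(list, 0);
    return 0;
}
```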

Somewhere along the line, to make this a 21st century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.

Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work.

And it's spending all its time doing nothing but putting silly dots on a screen.

How about freeing it up to do something more useful by adding a separate device to it to do that?

This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.

Every incremental stage in this process was very well-justified, and Myer and Sutherland say they saw examples (in 1968!) of systems that were more than twice around the wheel: A graphics processor hanging on a graphics processor hanging on a graphics processor. These multi-cycles are often justified if there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: It's got a graphics processor on a processor that uses a server somewhere else.

I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans & Sutherland Computer Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system on their second planned graphics system (LDS-2). It was, as was asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.

Just like Nvidia is talking about attaching InfiniBand directly to its cards.

Also, in the mid-80s in IBM Research, after the successful completion of an effort to build a special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.

Just like Nvidia is adding virtualization to its systems.

Each incremental step is justified – that's always the case with the wheel – just as, in the discussion above, I showed a justification for every general-purpose addition to the Nvidia architecture.

The issue here is not that this is all necessarily bad. It just is. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not.

With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase after ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have, usually with far more resources than you have, been playing the general-purpose game for decades.

It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.

Saturday, September 4, 2010

Intel Graphics in Sandy Bridge: Good Enough


As I and others expected, Intel is gradually rolling out how much better the graphics in its next generation will be. Anandtech got an early demo part of Sandy Bridge and checked out the graphics, among other things. The results show that the "good enough" performance I argued for in my prior post (Nvidia-based Cheap Supercomputing Coming to an End) will be good enough to sink third party low-end graphics chip sets. So it's good enough to hurt Nvidia's business model, and make their HPC products fully carry their own development burden, raising prices notably.

The net is that for this early chip, with early device drivers, at low but usable resolution (1024x768), there's adequate performance on games like "Batman: Arkham Asylum," "Call of Duty MW2," and a bunch of others, significantly including "World of Warcraft." And it'll play Blu-ray 3D, too.

Anandtech's conclusion is "If this is the low end of what to expect, I'm not sure we'll need more than integrated graphics for non-gaming specific notebooks." I agree. I'd add desktops, too. Nvidia isn't standing still, of course; on the low end they are saying they'll do 3D, too, and will save power. But integrated graphics are, effectively, free. It'll be there anyway. Everywhere. And as a result, everything will be tuned to work best on that among the PC platforms; that's where the volumes will be.

Some comments I've received elsewhere on my prior post have been along the lines of "but Nvidia has such a good computing model and such good software support – Intel's rotten IGP can't match that." True. I agree. But.

There's a long history of ugly architectures dominating clever, elegant architectures that are superior targets for coding and compiling. Where are the RISC-based CAD workstations of 15+ years ago? They turned into PCs with graphics cards. The DEC Alpha, MIPS, Sun SPARC, IBM POWER and others, all arguably far better exemplars of the computing art, have been trounced by X86, which nobody would call elegant. Oh, and the IBM zSeries, also high on the inelegant ISA scale, just keeps truckin' through the decades, most recently at an astounding 5.2 GHz.

So we're just repeating history here. Volume, silicon technology, and market will again trump elegance and computing model.



PostScript: According to Bloomberg, look for a demo at Intel Developer Forum next week.

Wednesday, August 11, 2010

Nvidia-based Cheap Supercomputing Coming to an End

Nvidia's CUDA has been hailed as "Supercomputing for the Masses," and with good reason. Amazing speedups on scientific / technical code have been reported, ranging from a mere 10X through hundreds. It's become a darling of academic computing and a major player in DARPA's Exascale program, but performance alone is not the reason; it's price. For that computing power, they're incredibly cheap. As Sharon Glotzer of UMich noted, "Today you can get 2GF for $500. That is ridiculous." It is indeed. And it's only possible because CUDA is subsidized by sinking the fixed costs of its development into the high volumes of Nvidia's mass market low-end GPUs.

Unfortunately, that subsidy won't last forever; its end is now visible. Here's why:

Apparently ignored in the usual media fuss over Intel's next and greatest, Sandy Bridge, is the integration of Intel's graphics onto the same die as the processor chip.

The current best integration is onto the same package, as illustrated by the current best part, Clarkdale (a.k.a. Westmere). There, the processor is in 32nm silicon technology, and the graphics, with the memory controller, is in 45nm silicon technology. Yes, the graphics-and-memory-controller die is the larger chip.

Intel has not been touting higher graphics performance from this tighter integration. In fact, Intel's press releases for Clarkdale claimed that being on two dies wouldn't reduce performance because they were in the same package. But unless someone has changed the laws of physics as I know them, that's simply false; at a minimum, eliminating off-chip drivers will reduce latency substantially. Also, being on the same die as the processor implies the same process, so graphics (and memory control) goes all the way from 45nm to 32nm, the same as the processor, in one jump; this certainly will also result in increased performance. For graphics, this is a very loud Intel "Tock" in its "Tick-Tock" (silicon / architecture) alternation.

So I'll semi-fearlessly predict some demos of midrange games out of Intel when Sandy Bridge is ready to hit the streets, which hasn't been announced in detail aside from being in 2011.

Probably not coincidentally, mid-2011 is when AMD's Llano processor sees daylight. Also in 32nm silicon, it incorporates enough graphics-related processing to be an apparently decent DX11 GPU, although to my knowledge the architecture hasn't been disclosed in detail.

Both of these are lower-end units, destined for laptops, and intent on keeping a tight power budget; so they're not going to run high-end games well or be a superior target for HPC. It seems that they will, however, provide at least adequate low-end, if not midrange, graphics.

Result: All of Nvidia's low-end market disappears by the end of next year.

As long as passable performance is provided, integrated into the processor equates with "free," and you can't beat free. Actually, it equates with cheaper than free, since there's one less chip to socket onto the motherboard, eliminating socket space and wiring costs. The power supply will probably shrink slightly, too.

This means the end of the low-end graphics subsidy of high-performance GPGPUs like Nvidia's CUDA. It will have to pay its own way, with two results:

First, prices will rise. It will no longer have a huge advantage over purpose-built HPC gear. The market for that gear is certainly expanding. In a long talk at the 2010 ISC in Berlin, Intel's Kirk Skaugen (VP of Intel Architecture Group and GM, Data Center Group, USA) stated that HPC was now 25% of Intel's revenue – a number double the HPC market I last heard a few years ago. But larger doesn't mean it has anywhere near the volume of low-end graphics.

DARPA has pumped more money in, with Nvidia leading a $25M chunk of DARPA's Exascale project. But that's not enough to stay alive. (Anybody remember Thinking Machines?)

The second result will be that Nvidia becomes a much smaller company.

But for users, it's the loss of that subsidy that will hurt the most. No more supercomputing for the masses, I'm afraid. Intel will have MIC (son of Larrabee); that will have a partial subsidy since it probably can re-use some X86 designs, but that's not the same as large low-end sales volumes.

So enjoy your "supercomputing for the masses," while it lasts.

Thursday, July 15, 2010

OnLive Follow-Up: Bandwidth and Cost


As mentioned earlier in OnLive Works! First Use Impressions, I've tried OnLive, and it works quite well, with no noticeable lag and fine video quality. As I've discussed, this could affect GPU volumes, a lot, if it becomes a market force, since you can play high-end games with a low-end PC. However, additional testing has confirmed that users will run into bandwidth and data usage issues, and the cost is not what I'd like for continued use.

To repeat some background, for completeness: OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. It lets you run the highest-end games on very inexpensive systems, avoiding the cost of a rip-roaring gamer system. I've noted previously that this could hurt the mass market for GPUs, since OnLive doesn't need much graphics on the client. But there were serious questions (see my post Twilight of the GPU?) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?

As I said earlier, and can re-confirm: Video, check. I found no problems there; no artifacts, including in displayed text. Lag, hence gameplay, is perfectly adequate, at least for my level of skill. Those with sub-millisecond reflexes might feel otherwise; I can't tell. There's confirmation of the low lag from Eurogamer, which measured it at "150ms - similar to playing … locally".


Bandwidth

Bandwidth, on the other hand, does not present a pretty picture.

When I was playing or watching action, OnLive continuously ran at about 5.8% - 6.4% utilization of a 100 Mb/sec LAN card. (OnLive won't run on WiFi, only on a wired connection.) This rate is very consistent. Displayed image resolution didn't cause it to vary outside that range, whether it was full-screen on my 1600 x 900 laptop display, full-screen on my 1920 x 1080 monitor, or windowed to about half the laptop screen area (which was the window size OnLive picked without input from me). When looking at static text displays, like OnLive control panels, it dropped down to a much smaller amount, in the 0.01% range; but that's not what you want to spend time doing with a system like this.

I observed these values playing (Borderlands) and watching game trailers for a collection of "coming soon" games like Deus Ex, Drive, Darksiders, Two Worlds, Driver, etc. If you stand still in a non-action situation, it does go down to about 3% (of 100 Mb/sec) for me, but with action games that isn't the point.

6.4% of 100 Mb/sec is about 2.9 GB (bytes) per hour. That hurts.
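Spelling out the conversion, using the 6.4% worst case I saw: 6.4% of 100 Mb/sec is 6.4 Mb/sec, or 0.8 MB/sec; 0.8 MB/sec x 3,600 seconds is about 2.9 GB per hour.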

My ISP, Comcast, considers over 250 GB/month "excessive usage" and grounds for terminating your account if you keep doing it regularly. That limit and OnLive's bandwidth together mean that over a 30-day period, Comcast customers can't play more than 3 hours a day without being considered "excessive."


Prices

I also found that prices are not a bargain, unless you're counting the money you save using a bargain PC – one that costs, say, what a game console costs.

First, you pay for access to OnLive itself. For now that can be free, but after a year it's slated to be $4.95 a month. That's scarcely horrible. But you can't play anything with just access; you need to also buy a GamePass for each game you want to play.

A Full GamePass, which lets you play it forever (or, presumably, as long as OnLive carries the game) is generally comparable to the price of the game itself, or more for the PC version. For example, the Borderlands Full GamePass is $29.99, and the game can be purchased for $30 or less (one site lists it for $3! (plus about $9 shipping)). F.E.A.R. 2 is $19.99 GamePass, and the purchase price is $19-$12. Assassin's Creed II was a loser, with GamePass for $39.99 and purchased game available for $24-$17. The standalone game prices are from online sources, and don't include shipping, so OnLive can net a somewhat smaller total. And you can play it on a cheap PC, right? Hmmm. Or a console.

There are also, in many cases, 5 day and 3 day passes, typically $9-$7 for 5-day and $4-$6 for 3-day. As a try before you buy, maybe those are OK, but 30 minute free demos are available, too, making a reasonably adequate try available for free.

Not all the prices are that high. There's something called AAAAAAA, which seems to consist entirely of falling from tall buildings, with a full GamePass for $9.99; and Brain Challenge is $4.99. I'll bet Brain Challenge doesn't use much bandwidth, either.

The correspondence between Full GamePass and the retail price is obviously no coincidence. I wouldn't be surprised at all to find that relationship to be wired into the deals OnLive has with game publishers. Speculation, since I just don't know: Do the 5 or 3 day pass prices correspond to normal rental rates? I'd guess yes.


Simplicity & the Mac Factor

A real plus for OnLive is simplicity. Installation is just pure dead simple, and so is starting to play. Not only do you not have to acquire the game, there's no installation and no patching; you just select the game, get a GamePass (zero time with a required pre-registered credit card), and go. Instant gratification.

Then there's the Mac factor. If you have only Apple products – no console and no Windows PC – you are simply shut out of many games unless you pursue the major hassle of BootCamp, which also requires purchasing a copy of Windows and doing the Windows maintenance. But OnLive runs on Macs, so a wide game experience is available to you immediately, without a hassle.


Conclusion

To sum up:

Positive: great video quality, great playability, hassle-free instant gratification, and the Mac factor.

Negative: Marginally competitive game prices (at best) and bandwidth, bandwidth, bandwidth. The cost can be argued, and may get better over time, but your ISP cutting you off for excessive data usage is pretty much a killer.

So where does this leave OnLive and, as a consequence, the market for GPUs? I think the bandwidth issue says that OnLive will have little impact in the near future.

However, this might change. Locally, Comcast TV ads showing off their "Xfinity" rebranding had a small notice indicating that 105 Mb data rates would be available in the future. It seems those have disappeared, so maybe it won't happen. But a 10X data rate improvement wouldn't mean much if you also didn't increase the data usage cap, and a 10X usage cap increase would completely eliminate the bandwidth issue.

Or maybe the Net Neutrality guys will pick this up and succeed. I'm not sure on that one. It seems like trying to get water from a stone if the backbone won't handle it, but who knows?

The proof, however, is in the playing and its market share, so we can just watch to see how this works out. The threat is still there, just masked by bandwidth requirements.

(And I still think virtual worlds should evaluate this technology closely. Installation difficulty is a key inhibitor to several markets there, forcing extreme measures – like shipping laptops already installed – in one documented case; see Living In It: A Tale of Learning in Second Life.)

Tuesday, July 6, 2010

OnLive Works! First Use Impressions


I've tried OnLive, and it works. At least for the games I tried, it seems to work quite well, with no noticeable lag and fine video quality. But I'm not sure about the bandwidth issue yet, or the cost.

OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. I've noted previously that this could hurt the mass market for GPUs, since it doesn't need much graphics on the client. But there were serious questions (see my post Twilight of the GPU?) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?

As I said above: Lag, check. Video, check. I found no problems there. Bandwidth, inconclusive. Cost, ditto. More data will answer those, but I've not had the chance to gather it yet. Here's what I did:

I somehow was "selected" from their wait-list as an OnLive founding member, getting me free access for a year – which doesn't mean I play totally free for a year; see below – and tried it out today, playing free 30-minute demos of Assassin's Creed II a little bit, and Borderlands enough for a good impression.

Assassin's Creed II was fine through initial cutscenes and minor initial movement. But when I reached the point where I was reborn as a player in medieval times, I ran into a showstopper. As an introduction to the controls, the game wanted me to press <squiggle_icon> to move my legs. <squiggle_icon>, unfortunately, corresponds to no key on my laptop. I tried everything plus shift, control, and alt variations, and nothing worked. In the process I accidentally created a brag clip, went back to the OnLive dashboard, and did some other obscure things I never did figure out, but never did move my legs. I moved my arms with about four different key combinations, but the game wasn't satisfied with that. So I ditched it. For all I know there's something on the OnLive web site explaining this, but I didn't look enough to find it.

I was much more successful with Borderlands, a post-apocalyptic first-person shooter. I completed the initial training mission, leveled up, and was enjoying myself when the demo time – 30 minutes, which I consider adequately generous – ran out. Targeting and firing seemed to be just as good as native games on my system. I played both in a window and in fullscreen mode, and at no time was there noticeable lag or any visual artifacts. It just played smoothly and nicely.

I wanted to try Dragon Age – I'm more of an RPG guy – but while it shows up on the web site, I couldn't find it among the games available for play on the live system.

This is not to say there weren't hassles and pains involved in getting going. Here are some details.

First, my environment: The system I used is a Sony Vaio VGN-2670N, with Intel Core Duo @ 2.66 GHz, a 1600x900 pixel display, with 4GB RAM and an Nvidia GeForce 9300M; but the Nvidia display adapter wasn't being used. For those of you wondering about speed-of-light delays, my location is just North of Denver, CO, so this was all done more than 1000 miles from the closest server farm they have (Dallas, TX). My ISP is Comcast cable, nominally providing 10 Mb/sec; I have seen it peak as high as 15 Mb/sec in spurts during downloads. My OS is 32-bit Windows Vista. (I know…)

There was a minor annoyance at the start, since their client installer refuses to even try using Google Chrome as the browser. IE, Firefox, and Safari are supported. But that only required me to use IE, which I shun, for the install; it's not used running the client.

The much bigger pain is that OnLive adamantly refuses to run over Wifi. The launcher checks, gives you one option – exit – and points you to a FAQ, which pointer gets a 404 (page not found). I did find the relevant FAQ manually on the web site. There they apologize and say it "does indeed work well with good quality Wi-Fi connections, and in the future OnLive will support wireless" but initially they're scared of bad packet-dropping low-signal-strength crud. I can understand this; they're fighting an uphill battle convincing people this works at all, and do not need a multitude complaining they don't work when the problem is crummy Wi-Fi. (Or WiFi in a coffee shop – a more serious issue; see bandwidth discussion below.)

Nevertheless, this is a pain for me. I had to go down in the basement and set up a chair where my router is, next to my water heater, to get a wired connection. When I did go down there, after convincing Vista (I know!) to actually use the wired connection, things went as described above.

That leaves one question: Bandwidth. My ISP, Comcast, has a 250 GB/month limit beyond which I am an "excessive user" and apparently get a stern talking-to, followed by account termination if I don't mend my excessive ways. Up to now, this has been far from an issue. With OnLive, it may be a significant limitation.

Unfortunately, I didn't monitor my network use carefully when using OnLive, and ran out of time to go back and do better monitoring. I'll report more when I've done that. However, checking some numbers provided by Comcast after the fact, I can see the possibility that averaging four hours a day is all the OnLive I could do and not get terminated, since my hour of use may (just may) have sucked down 2 GB. This could be a significant issue, limiting OnLive to only very casual users, but I need better measurement to be sure.

This also points to a reason for not initially allowing Wifi that they didn't mention: I doubt your local free Wifi hot spot in a Starbucks or McDonald's is really up to the task of serving several OnLive players all day.

Finally, there's cost. What I have free is access to the OnLive system; after a year that's $4.95/month (which may be a "founding member" deal). But to play other than a free demo, I need to purchase a PlayPass for each game played. I didn't do that, and still need to check that cost. Sorry, time limitations again.

So where does this leave the market for GPUs? With the information I have so far, all I can say is that the verdict is inconclusive. I think they really have the lag and display issues licked; those just aren't a problem. If I'm wrong about the bandwidth (entirely possible), and the PlayPasses don't cost too much, it could over time deal a large blow to the mass market for GPUs, which among other problems would sink the volumes that make them relatively inexpensive for HPC use.

On the other hand, if the bandwidth and cost make OnLive suitable only for very casual gaming, there may actually be a positive effect on the GPU market, since OnLive could be used as a very good "try before you buy" facility. It worked for me; I've been avoiding first-person shooters in favor of RPGs, but found the Borderlands demo to be a lot more fun than I expected.



Finally, I'll just note that Second Life recently changed direction and is saying they're going to move to a browser-based client. They, and other virtual world systems, might do well to consider instead a system using this type of technology. It would expand the range of client systems dramatically, and, even though there is a client, simplify use dramatically.

Friday, May 28, 2010

No, Larrabee is Not Dead


We interrupt our series of posts on virtualization for a public service announcement: The numerous reports of Larrabee being dead are, at a minimum, greatly exaggerated. (Note, a significant addition after posting was made below, marked in red.)

Larrabee is the highly-publicized erstwhile mega discrete graphics chip from Intel, the subject of flame wars with Nvidia's CEO, whose initial product introduction was cancelled last December because its performance wasn't yet competitive.

Now, a recent Technology @ Intel blog post about Intel graphics (An Update On Our Graphics-related Programs) has resulted in a flurry of "Larrabee is Dead!" postings. There's Anandtech (Intel Kills Larrabee GPU), Device Magazine (Intel Cancels Larrabee Project), PCWorld (Intel Cancels Larrabee), ZDNet (Intel officially (again) kills off Larrabee), the Inquirer (Larrabee will not be), and … I'll stop there, since a quick Google will find you many more.

In the minority is eWeek (Intel Clarifies Graphics Plans, Hints at HPC Project), taking a balanced "this is what the blog actually said" approach, with SemiAccurate (Larrabee alive and well) on the other side, considering those "dead Larrabee" posts a case of mass flakiness. I agree.

All the doom-sayers should actually listen to what Paul Otellini, Intel's CEO, said at Intel's Investor Meeting of May 11, 2010 – not to someone else's interpretation of what he heard his cousin's dog say Otellini said, but to the words he actually said. The full webcasts are archived and publicly available. In particular, listen to the last segment, Q&A, starting at time 1:39, when someone named Hans asked about Larrabee with the comment that it hadn't appeared in the prior presentations.

Here's my partial transcript of the response, and I urge you not to believe me. It's a pain to transcribe, and I'm sure I got something wrong. Go listen to the webcast. Some words of mine, and a comment or two, are inserted in brackets [like this].





"Everything you saw in the roadmap today does not have Larrabee built into it. … [our mainstream product will be] based on evolving our mainstream integrated graphics products… [but there will be a] sea change in the architecture with Sandy Bridge… by going onto the chip and moving from two generations behind silicon to [current] silicon you get… best of class [best for integrated graphics, which isn't saying much, but even that will be a huge change for Intel if it happens].
"… In terms of Larrabee, we did not stop the project. If we made any mistake with Larrabee, we probably should not have talked about something that was high risk and long term. We have not stopped the project. We have shipped STVs out. We're looking at how and when to bring it to market. It still has very very high promise in areas of throughput computing and in terms of a general reprogrammable graphics engine using small IA cores. We still like the idea. But we've taken the risk associated with a new architecture out of our roadmap over the next few years so we have the flexibility to stay competitive while still working on it."
Nothing in the tech blog entry says anything more than is said above. It's not in the roadmap, so when the blog says "We will not bring a discrete graphics product to market, at least in the short-term" – the statement that fired up the nay-sayers' posts – the blog is simply re-stating the official party line, an action which is doubtless far from an accident.

Also, please note that Otellini did say, twice, "we did not stop the project" and that they are looking at how and when to bring it to market. All of this is completely consistent with the position taken last January that I reported in another post (The Problem with Larrabee), relating things I picked up in a talk by Tom Forsyth about "the future of Intel graphics despite what the press says." There has been no change.



**ADDED**

Contrast this with another case: InfiniBand. Intel realized, late in the game and after publicity, that its initial IB product would also be noncompetitive. There, the response was to just fold up the shop, completely. People were reassigned, and the organization ceased to be. (I was there when this all happened, as a significant, working, designing, writing committee-leading IBM rep on the IB standard committee.) The situation is totally different for Larrabee; the shop is open and working, and high-level statements saying so have been repeated. This indicates a very different attitude, a continuing commitment to the technology.

With that kind of consistent high-level corporate backing, to say nothing of the large number of very talented individuals working on Larrabee, were they to kill it there would be a bit more of a ruckus than a sentence or two in a tech blog post.

It must have been a very slow news day.

We now return you to our previously scheduled blog posts, which will resume after Memorial Day.

Monday, May 3, 2010

All Hail the GPU! – a Tweetstream


I recently attended a talk at Colorado State University (CSU) by Sharon Glotzer, and tweeted what was going on in real time. Someone listening said he missed the start, and suggested I blog it. So, here is a nearly zero-effort blog post of my literal tweetstream, complete with hashtags.

I did add a few comments at the end, and some in the middle [marked like this]

Value add: This is a CSU Information Science and Technology (ISTeC) Distinguished Lecture Series. The slides and a video of the lecture will soon appear on the page summarizing all of the lectures. Keep scrolling down to the bottom. Many prior lectures are there, too.

Starting Tweetstream:

At CSU talk by Sharon Glotzer UMich Ann Arbor "All Hail the GPU: How the Video Game Industry is Transforming Molecular & Materials Selection [no hashtag because I hit the 140 character limit]

Right now everybody's waiting to get the projector working #HailGPU

At meetup b4 talk, She said they redid their code in CUDA and overnight got 100X speedup #HailGPU [talked to her for about 3 minutes]

Backup projector found & works, presentation starting soon, I hope. #HailGPU

None of her affiliations is CS or EE -- all in materials, fluids, etc. "Good to be talking about our new tools." #HailGPU

First code she ever wrote as a grade student was for the CM-2. 64K procs. #HailGPU

Image examples of game graphics 2003, 05, 08 - 03 looks like Second Life. #HailGPU

Also b4 talk, said when they moved to Fermi they got another 2X. Not bothered using CUDA; says "it's easy." #HailGPU

Just reviewing CPU vs. GPU arch now. #HailGPU

"Typical scientific apps running on GPUs are getting 75% of peak speed." Hoowha? #HailGPU [This is an almost impossibly large efficiency. Says more about her problems than about GPUs.]

"Huge infusion from DARPA to make GPGPUs" -- Huh? Again. Who? When? #HailGPU

"Today you can get 2GF for $500. That is ridiculous." #HailGPU [bold obviously added here, to better indicate what she said]

Answer to Q: Nvidia got huge funding from DARPA to develop GPGPU technology over last 5 years. #HailGPU [I didn't know that. It makes all kinds of sense.]

"If you've ever written MPI code, CUDA is easy. Summer school students do it productively. Docs 1st rate." #HailGPU [MPI? As in message-passing? Leads to CUDA, which is stream? Say what? Must be a statement unique to her problem domain.]

Who should use them? Folks with data-parallel problems. Yes, indeed. #HailGPU

She works on self-assembly of molecules. Like lipids self-assembled into membranes. #HailGPU

Her group doing materials that change (Terminator), multi-function & sensors (Iron Man), cloaking (illustration was a blank :-) #HailGPU [cloaking as in "Klingons"] [Bah.]

Said those kinds of things are "what the material science community is doing now." #HailGPU

Hm, not seeing tweets from anybody else. Is this thing working? // ra_livesey @gregpfister - it most certainly is, keep going [Just wanted some feedback; wasn't seeing anything else.]

Her prob, Molecular Dynamics, is F=ma across a bazillion particles a bazillion times. Yeah, data parallel. #HailGPU [The second bazillion is doing the first bazillion over a bazillion time steps.]

First generates neighbor list for each particle - what particles does each particle interact with? Mainly based on distance. #HailGPU

Response to Q: Says can reduce neighbor calc from N^2 to less (but not "Barnes-Hut"), but no slides for that. #HailGPU

Typically have ~100 neighbors per particle. #HailGPU [Aha! This is where a chunk of the 100X speedup comes from! For each molecule or whatever, do exactly the same code in simple SIMD parallel for all 100 neighbors, at the same time, just varying their locations. If they had 100 threads; I think they do, would have to check. !Added in edit to this post!]

Says get same perf on $1200 GPU workstation as on $20,000 cluster. (whole MD code HOOMD-Blue) #HailGPU [I think I may have the numbers slightly wrong here – may have been $40,000, etc. – but the spirit is right; see the slides and presentation for what she exactly said.]

Most people would rewrite code for 3X speedup. For 100X, do it yesterday. #HailGPU

Done work on "patchy nanotetrahedra" forming strands that bundle together spontaneously. #HailGPU

"Monte Carlo not so data parallel" (I don't agree.) #HailGPU

Used to be a guy at IBM who did molecular dynamics on mainframes with attached vector procs. It is easy to parallelize. #HailGPU [Very, very easy. See "bazillions" above. In addition, lots of floating-point computing at each individual F=ma calculation.]

Guy at IBM was Enrico something-or-other. Forget last name. #HailGPU [Unfortunately, the only things after "Enrico" that come to my mind are "Fermi" -- which I know is wrong -- and, for some unknown psychological reason, "vermicelli." Also um, wrong. But tasty.]

Worked on how water molecules interacted. Thought massively parallel was trash. #myenemy #HailGPU

Trying to design material that, when you do something like turn on a light, changes: become opaque, start flowing, etc. #HailGPU

Also studying Tetris as a "primitive model of complex patchy particles" Like crystal structures form. #HailGPU

Students named their software suite "Glotzilla". Uh huh. She doesn't object. Self-analysis code called Freud. #HailGPU

My general take: MD simulation is a field in need of massive compute capabilities, is pleasantly parallel, more FPUs=good. #HailGPU

Answer to post-talk Q: Her Monte Carlo affects state of the system, can't accept moves that isn't legal and affects others. Strange.

Limits of GPU usability relate to memory size. They can do 100K particles, with limited-range interaction. #HailGPU

So if you have a really large-scale problem, can't use GPUs without going off-chip and losing a LOT. #HailGPU

Talk over, insane volume of tweets will now cease. #HailGPU

End of Tweetstream.

I went to a lunch with her, but didn't get a chance to ask any meaningful questions. Well, this depends on what your definition of "meaningful" is; she grew up in NYC, and therefore thinks thin-crusted, soft, drippy pizza is the only pizza. As do I. But she folds it. Heresy! That muffles the flavor!

More (or less) seriously, molecular dynamics has always been an area in which it is really fairly simple to achieve tremendous parallel efficiency: Many identical calculations (except for the data), lots of floating-point for each calculation (electric charge force, van der Waals forces, etc.), not a whole lot of different data required for each calculation. I have no doubt whatsoever that she gets 75% efficiency; I wouldn't be surprised at even better results. But I think it would be a mistake to think it's easy to extend such results outside that area. It was probably well worth DARPA's investment, though, in terms of the materials science enabled. I mean, cloaking? Really?
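For anyone who hasn't seen why MD maps onto GPUs so cleanly, here's a bare-bones sketch of the per-particle force loop – my own illustration, not HOOMD-Blue's actual code, with placeholder names and a plain Lennard-Jones 12-6 force. One thread per particle, each walking its ~100-entry neighbor list, with plenty of floating-point per pair:

```cpp
// Sketch of a CUDA neighbor-list force kernel: one thread per particle.
// MAX_NEIGHBORS, the array layout, and epsilon = sigma = 1 are all illustrative.
#include <cuda_runtime.h>

#define MAX_NEIGHBORS 128

__global__ void lj_forces(const float4 *pos,        // x, y, z per particle
                          const int *neighbors,     // MAX_NEIGHBORS slots per particle
                          const int *num_neighbors,
                          float4 *force, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;

    // Same code for every particle, ~100 iterations each, lots of FLOPs per pair:
    // exactly the shape of problem a GPU likes.
    for (int k = 0; k < num_neighbors[i]; ++k) {
        int j = neighbors[i * MAX_NEIGHBORS + k];
        float dx = pi.x - pos[j].x;
        float dy = pi.y - pos[j].y;
        float dz = pi.z - pos[j].z;
        float inv_r2 = 1.0f / (dx*dx + dy*dy + dz*dz);
        float inv_r6 = inv_r2 * inv_r2 * inv_r2;
        // Lennard-Jones 12-6: force magnitude divided by r.
        float fmag = 24.0f * inv_r2 * inv_r6 * (2.0f * inv_r6 - 1.0f);
        fx += fmag * dx;  fy += fmag * dy;  fz += fmag * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}
```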

Thursday, January 28, 2010

The Problem with Larrabee


Memory bandwidth. And, most likely, software cost. Now that I've given you the punch lines, here's the rest of the story.

Larrabee, Intel's venture into high-performance graphics (and accelerated HPC), the root of months of trash talk between Intel and Nvidia, is well-known to have been delayed sine die: The pre-announced 2010 product won't happen, although some number will be available for software development, and no new date has been given for a product. It's also well-known for being an architecture that's clearly programmable with standard thinking and tools, as opposed to other GPGPUs (Nvidia, ATI/AMD), which look like something from another planet to anybody but a graphics wizard. In that context, a talk at Stanford earlier this month by Tom Forsyth, one of the Larrabee architects at Intel, is an interesting event.

Tom's topic was The Challenges of Larrabee as a GPU, and it began with Tom carefully stating the official word on Larrabee (delay) and then interpreting it: "Essentially, the first one isn't as cool as we hoped, and so there's no point in trying to sell it, because no one would buy it." Fair enough. He promised they'd get it right on the 2nd, 3rd, 4th, or whatever try. Everybody's doing what they were doing before the announcement; they'll just keep on truckin'.

But, among many other interesting things, he also said the target memory bandwidth – presumably required for adequate performance on the software renderer being written (and rewritten, and rewritten…) – was to be able to read 2 bytes / thread / cycle.

He said this explicitly, vehemently, and quite repeatedly, further asserting that they were going to try to maintain that number in the future. And he's clearly designing code to that assertion. Here's a quote I copied: "Every 10 instructions, dual-issue means 5 clocks, that's 10 bytes. That's it. Starve." And most good code will be memory-limited.

The thing is: 2 bytes / cycle / thread is a lot. It's so big that a mere whiff of it would get HPC people, die-hard old-school HPC people, salivating. Let's do some math:

Let's say there are 100 processors (high end of numbers I've heard). 4 threads / processor. 2 GHz (he said the clock was measured in GHz).

That's 100 cores x 4 threads x 2 GHz x 2 bytes = 1600 GB/s.

Let's put that number in perspective:



  • It's moving more than the entire contents of a 1.5 TB disk drive every second.
  • It's more than 100 times the bandwidth of Intel's shiny new QuickPath system interconnect (12.8 GB/s per direction).
  • It would soak up the output of 33 banks of DDR3-SDRAM, all three channels, 192 bits per bank, 48 GB/s aggregate per bank.
In other words, it's impossible. Today. It might be that Intel is pausing Larrabee to wait for product shipment of some futuristic memory technology, like the 3D stacked chips with direct vias (vertical wires) passing all the way through the RAM chip to the processor stacked on it (Exascale Ambitions talk at Salishan 20 by Bill Camp, Intel’s Chief Architect/CTO of HPC, p. 21). Tom referred to the memory system designers as wizards beyond his comprehension; but even so, such exotica seems a flaky assumption to me.

What are the levers we have to reduce it? Processor count, clock rate, and that seems to be it. They need those 4 threads / processor (it's at the low end of really keeping their 4-stage pipe busy). He said the clock rate was "measured in GHz," so 1 GHz is a floor there. That's still 800 GB/s. Down to 25 processors we go; I don't know about you, but much lower than 24 cores starts moving out of the realm I was led to expect. But 25 processors still gives 200 GB/s. This is still probably impossible, but starting to get in the realm of feasibility. Nvidia's Fermi, for example, is estimated as having in excess of 96 GB/s.
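If you want to twiddle those levers yourself, the arithmetic is trivial to mechanize; the configurations below are just the ones discussed here, not anything Intel has confirmed:

```cpp
// Required bandwidth = cores x threads/core x clock (GHz) x 2 bytes/thread/cycle.
#include <cstdio>

static double required_gb_per_sec(int cores, int threads_per_core, double ghz) {
    // GHz is 1e9 cycles/sec and a GB is 1e9 bytes, so the units cancel nicely.
    return cores * threads_per_core * ghz * 2.0;
}

int main() {
    std::printf("100 cores, 4 threads, 2 GHz: %6.0f GB/s\n", required_gb_per_sec(100, 4, 2.0));
    std::printf("100 cores, 4 threads, 1 GHz: %6.0f GB/s\n", required_gb_per_sec(100, 4, 1.0));
    std::printf(" 25 cores, 4 threads, 2 GHz: %6.0f GB/s\n", required_gb_per_sec(25, 4, 2.0));
    return 0;
}
```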

So maybe I'm being a dummy: He's not talking about main memory bandwidth, he's talking about bandwidth to cache memory. But then the number is too low. Take his 10 instructions, dual-issue, 5 clocks, 10 bytes example: you can get a whole lot more than 10 bytes out of an L1 cache in 5 clocks, not even counting the fact that it's probably multi-ported (able to do multiple accesses in a single cycle).
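
As a rough sanity check on that claim – assuming a fairly ordinary L1 of that era that can deliver one 16-byte load per cycle (some cores manage two) – the 5-clock window in Tom's example could supply roughly eight times the 10 bytes budgeted:

```python
# Hypothetical L1 delivery over Tom's 5-clock window (my assumed numbers, single-ported).
loads_per_cycle, bytes_per_load, clocks = 1, 16, 5
print(loads_per_cycle * bytes_per_load * clocks, "bytes")  # 80 bytes vs. the 10-byte budget
```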

So why don't other GPU vendors have the same problem? I suspect it's at least partly because they have several separate, specialized memories, all explicitly controlled. The OpenCL memory model, for example, includes four separate memory spaces: private, local, constant, and global (cached). These are all explicitly managed, and if you align your stars correctly they can all be running simultaneously. (See OpenCL Fundamentals, Episode 2.) In contrast, Larrabee has to choke it all through one general-purpose memory.
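
To make that contrast concrete, here's a minimal, hypothetical sketch (using pyopencl; the kernel and buffer names are mine, purely for illustration) in which the kernel names each of the four spaces explicitly – exactly the kind of hand management that Larrabee's single general-purpose memory avoids, for better or worse:

```python
# A toy kernel that explicitly touches all four OpenCL memory spaces.
# Requires pyopencl and any OpenCL-capable device; names are illustrative only.
import numpy as np
import pyopencl as cl

kernel_src = """
__constant float scale = 2.0f;                       // constant space
__kernel void scale_add(__global const float *x,     // global space
                        __global float *y,
                        __local float *scratch)      // local (per-workgroup) space
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float tmp = x[gid] * scale;                      // private space (registers)
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);
    y[gid] += scratch[lid];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, kernel_src).build()

n, group = 1024, 64
x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

prg.scale_add(queue, (n,), (group,), x_buf, y_buf, cl.LocalMemory(group * 4))
cl.enqueue_copy(queue, y, y_buf)
print(y[:4])   # [1. 3. 5. 7.]
```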

Now, switching gears:

Tom also said that the good news is that they can do 20 different rendering pipelines, all on the same hardware, since it's a software renderer; and the bad news is that they have to. He spoke of shipping a new renderer, optimized for a new hot game, six months after the game shipped.

Doesn't this imply that they expect their software rendering pipeline to be fairly marginal – so they are forced to make that massive continuing investment? When asked why others didn't do that, he indicated that they didn't have a choice; the pipeline's in hardware, so that one size fits all. Well, in the first place that's far less true with newer architectures; both Nvidia and ATI (AMD) are fairly programmable these days (they'd say "very programmable," I'm sure). In the second place, if it works adequately, who cares if you don't have a choice? In the third place, there's a feedback loop between applications and the hardware: Application developers work to match what they do to the hardware that's most generally available. This is the case in general, but is particularly true with fancy graphics. So the games will be designed to what the hardware does well, anyway.

And I don't know about you, but in principle I wouldn't be really excited about having to wait 6 months to play the latest and greatest game at adequate performance. (In practice, I'm never the first one out of the blocks for a new game.)

I have to confess that I've really got a certain amount of fondness for Larrabee. Its architecture seems so much more sane and programmable than the Nvidia/ATI gradual mutation away from fixed-function units. But these bandwidth and programming issues really bother me, and shake loose some uncomfortable memories: The last time I recall Intel attempting, as in Larrabee, to build whatever hardware the software wanted (or appeared to want on the surface), what came out was the ill-fated and nearly forgotten iAPX 432, renowned for hardware support of multitasking, object-oriented programming, and even garbage collection – and for being 4x slower than an 80286 of the same frequency.

Different situation, different era, different kind of hardware design, I know. But it still makes me worry.



(Acknowledgement: The graphic comparison to a 1.5 TB disk transfer was suggested by my still-wishing-to-remain-anonymous colleague, who also pointed me to the referenced video. This post generally benefited from email discussion with him.)

Thursday, November 26, 2009

Oh, for the Good Old Days to Come

I recently had a glorious flashback to 2004.

Remember how, back then, when you got a new computer you would be just slightly grinning for a few weeks because all your programs were suddenly so crisp and responsive? You hadn't realized your old machine had a rubbery-feeling delay responding to your clicks and key-presses until zip! You booted the new machine for the first time, and wow. It just felt good.

I hadn't realized how much I'd missed that. My last couple of upgrades have been OK. I've gotten a brighter screen, better graphics, lighter weight, and so on. They were worth it, intellectually at least. But the new system zip, the new system crispness of response – it just wasn't there.

I have to say I hadn't consciously noticed that lack because, basically, I mostly didn't need it. How much faster do you want a word processor to be, anyway? So I muddled along like everyone else, all our lives just a tad more drab than they used to be.

Of course, the culprit denying us this small pleasure has been the flattening of single-thread performance wrought by the half-death of Moore's Law. Used to be, after a couple of years' delay you would naturally get a system that ran 150% or 200% faster, so everything just went faster. All your programs were rejuvenated, and you noticed, instantly. A few weeks or so later you were of course used to it. But for a while, life was just a little bit better.

That hasn't happened for nigh unto five years now. Sure, we have more cores. I personally didn't get much use out of them. None of my regular programs perk up. But as I said, I really didn't notice, consciously.

So what happened to make me realize how deprived I – and everybody else – has been? The Second Life client.

I'd always been less than totally satisfied with how well SL ran on my system. It was usable. But it was rubbery. Click to walk or turn and it took just a little … time before responding. It wasn't enough to make things truly unpleasant (except when lots of folks were together, but that's another issue). But it was enough to be noticeably less than great. I just told myself, what the heck, it's not Quake but who cares, that's not what SL is about.

Then for reasons I'll explain in another post, I was motivated to reanimate my SL avatar. It hadn't seen any use for at least six months, so I was not at all surprised to find a new SL client required when I connected. I downloaded, installed, and cranked it up.

Ho. Ly. Crap.

The rubber was gone.

There were immediate, direct responses to everything I told it to do. I proceeded to spend much more time in SL than I originally intended, wandering around and visiting old haunts just because it was so pleasant. It was a major difference, on the order of the difference I used to encounter when using a brand-new system. It was like those good old days of CPU clock-cranking madness. The grin was back.

So was this "just" a new, better software release? Well, of course it was that. But I wouldn't have bothered writing this post if I hadn't noticed two other things:

First, my CPU utilization meter was often pegged. Pegged, as in 100% utilization, where flooring just one of my two CPUs reads only 50%. When I looked a little deeper, I saw that the one, single SL process was regularly over 50%. I've not looked at any of the SL documentation on this, but from that data I can pretty confidently say that this release of the SL client can make effective use of both cores simultaneously. It's the only program I've got with that property.

Second, my thighs started burning. Not literally. But that heat tells me when my discrete GPU gets cranking. So, this client was also exercising the GPU, to good effect.

Apparently, this SL client actually does exploit the theoretical performance improvements from graphics units and multiple cores that had been lying around unused in my system. I was, in effect, pole-vaulted about two system generations down the road – that's how long it's been since there was a discernible difference. The SL client is my first post-Moore client program.

All of this resonates for me with the recent SC09 (Supercomputing Conference 2009) keynote by Intel's Justin Rattner. Unfortunately, conference rules meant it wasn't recorded (boo!), but reports are that he told the crowd they were in a stagnant, let us not say decaying, business unless they got their butts behind pushing the 3D web. (UPDATE: Intel has posted video of Rattner's talk.)

Say What? No. For me, particularly following the SL experience above, this is not a "Say What?" moment. It makes perfect sense. Without a killer application, the chip volumes won't be there to keep down the costs of the higher-end chips used in non-boutique supercomputers. Asking that audience for a killer app, though, is like asking an industrial assembly-line designer for next year's toy fashion trends. Killer apps have to be client-side and used by the masses, or the volumes aren't there.

Hence, the 3D Web. This would take the kind of processing in the SL client, which can take advantage of multicore and great graphics processing, and put it in something that everybody uses every day: the browser. Get a new system, crank up the browser, and bang! you feel the difference immediately.

Only problem: Why does anybody need the web to be 3D? This is the same basic problem with virtual worlds: OK, here's a virtual world. You can run around and bump into people. What, exactly, do you do there? Chat? Bogus. That's more easily done, with more easily achieved breadth of interaction, on regular (2D) social networking sites. (Hence Google's virtual world failure.)

There are things that virtual worlds and a "3D web" can, potentially, excel at; but that's a topic for a later post.

In the meantime, I'll note that in a great crawl-before-you-walk development, there are real plans to use graphics accelerators to speed up the regular old 2D web by speeding up page rendering. Both Microsoft and Mozilla (IE and Firefox) are saying they'll bring accelerator-based speedups to browsers (see CNET and Bas Schouten's Mozilla blog), using Direct2D and DirectWrite to exploit specialized graphics hardware.

One could ask what good it is to render a Twitter page twice as fast. (That really was one of the quoted results.) What's the point? Asking that, however, would only prove that One doesn't Get It. You boot your new system, crank up the browser and bam! Everything you do there, and you do more and more there, has more zip. The web itself – the plain, old 2D web – feels more directly connected to your inputs, to your wishes; it feels more alive. Result?

The grin will be back. That's the point.

Sunday, August 30, 2009

A Larrabee in Every PC and Mac

There's a rumor that Intel is planning to integrate Larrabee, its forthcoming high-end graphics / HPC accelerator, into its processors in 2012. A number of things about this make a great deal of sense at a detailed level, but there's another level at which you have to ask "What will this mean?"

Some quick background, first: Larrabee is a well-publicized Intel product-of-the-future, where this particular future is late this year ('09) or early next year ('10). It's Intel's first foray into the realm of high-end computer graphics engines. Nvidia and ATI (now part of AMD) are the big leagues of that market, with CUDA and Stream products respectively. While Larrabee, CUDA and Stream differ greatly, all three use parallel processing to get massive performance. Larrabee may be in the 1,000 GFLOPS range, while today Nvidia is at 518 GFLOPS and ATI at 2,400 GFLOPS. In comparison, Intel's latest Core i7 processor reaches about 50 GFLOPS.

Integrating Larrabee into the processor (or at least its package) fits well with what's known of Intel's coarse roadmap, illustrated below (from canardpc.com, self-proclaimed "un scandale"):


"Ticks" are new lithography, meaning smaller chips; Intel "just" shrinks the prior design. "Tocks" keep the same lithography, but add new architecture or features. So integrating Larabee on the "Haswell" tock makes sense as the earliest point at which it could be done.

The march of the remainder of Moore's Law – more transistors, same clock rate – makes such integration possible, and, for cost purposes, inevitable.

The upcoming "Westmere" parts start this process, integrating Intel's traditional "integrated graphics" onto the same package as the processor; "integrated graphics" here means the low-end graphics traditionally built into the processor-supporting chipset that does IO and other functions. AMD will do the same. According to Jon Peddie Research, this will destroy the integrated graphics market. No surprise there: same function, one less chip to package on a motherboard, probably lower power, and… free. Effectively free, anyway. Like Internet Explorer built into Windows for "free" (subsidized) destroying Netscape, this will just come with the processor at no extra charge.

We will ultimately see the same thing for high-end graphics. 2012 for Larrabee integration just puts a date on it. AMD will have to follow suit with ATI-related hardware. And now you know why Nvidia has been starting its own X86 design, a pursuit that otherwise would make little sense.

Obviously, this will destroy the add-in high-end graphics market. There might be some residual super-high-end graphics left for the super-ultimate gamer, folks who buy or build systems with multiple high-end cards now, but whether there will be enough volume for that to survive at all is arguable.

Note that once integration is a fact, "X inside" will have graphics architecture implications it never had before. You pick Intel, you get Larrabee; AMD, ATI. Will Apple go with Intel/Larrabee, AMD/ATI, or whatever Nvidia cooks up? Apple began OpenCL to abstract the hardware, but as an interface it is rather low-level and reflective of Nvidia's memory hierarchy. Apple will have to make the choice that PC users will make individually, but make it for its entire user base.

That's the how, and the "why" in a low-level technical hardware sense. It is perfectly logical that, come 2012, every new PC and Mac will have what by then will probably be around 2,000 GFLOPS. This is serious computing power. On your lap.

What the heck are most customers going to do with this? Will there be a Windows GlassWax and Mac OS XII Yeti where the user interface is a full 3D virtual world, and instead of navigating a directory tree to find things, you do dungeon crawls? Unlikely, but I think more likely than verbal input, even really well done, since talking aloud isn't viable in too many situations. Video editing, yes. Image search, yes too, but that's already here for some, and there are only so many times I want to find all the photos of Aunt Bessie. 3D FaceSpace? Maybe, but if it were a big win, I think it would already exist in 2.5D. Same for simple translations of web pages into 3D. Games? Sure, but that's targeting a comparatively narrow user base, one for which raw graphics power has less and less relevance to gameplay. And it's a user base that may shrink substantially due to cloud gaming (see my post Twilight of the GPU?).

It strikes me that this following of one's nose on hardware technology is a prime example of what Robert Capps brought up in a recent Wired article (The Good Enough Revolution: When Cheap and Simple Is Just Fine) quoting Clay Shirky, an NYU new media studies professor, who was commenting on CDs and lossless compression compared to MP3:

"There comes a point at which improving upon the thing that was important in the past is a bad move," Shirky said in a recent interview. "It's actually feeding competitive advantage to outsiders by not recognizing the value of other qualities." In other words, companies that focus on traditional measures of quality—fidelity, resolution, features—can become myopic and fail to address other, now essential attributes like convenience and shareability. And that means someone else can come along and drink their milk shake.

It may be that Intel is making a bet that the superior programmability of Larrabee compared with strongly graphics-oriented architectures like CUDA and Stream will give it a tremendous market advantage once integration sets in: Get "Intel Inside" and you get all these wonderful applications that AMD (Nvidia?) doesn't have. That, however, presumes that there are such applications. As soon as I hear of one, I'll be the first to say they're right. In the meantime, see my admittedly sarcastic post just before this one.

My solution? I don't know of one yet. I just look at integrated Larrabee and immediately think peacock, or Irish Elk – 88 lbs. of antlers, 12 feet tip-to-tip.




Megaloceros giganteus, the Irish Elk, as integrated Larrabee.
Based on an image that is Copyright Pavel Riha, used with implicit permission
(Wikipedia Commons, GNU Free Documentation License)

They're extinct. Will the traditional high-performance personal computer also go extinct, leaving us with a volume market occupied only by the successors of netbooks and smart phones?


---------------------------------------


The effect discussed by Shirky makes predicting the future based on current trends inherently likely to fail. That happens to apply to me at the moment. I have, with some misgivings, accepted an invitation to be on a panel at the Anniversary Celebration of the Coalition for Academic Scientific Computation.

The misgivings come from the panel topic: HPC - the next 20 years. I'm not a futurist. In fact, futurists usually give me hives. I'm collecting my ideas on this; right now I'm thinking of democratization (2TF on everybody's lap), really big data, everything bigger in the cloud, parallelism still untamed but widely used due to really big data. I'm not too happy with those, since they're mostly linear extrapolations of where we are now, and ultimately likely to be as silly as the flying car extrapolations of the 1950s. Any suggestions will be welcome, particularly suggestions that point away from linear extrapolations. They'll of course be attributed if used. I do intend to use a Tweet from Tim Bray (timbray) to illustrate the futility of linear extrapolation: “Here it is, 2009, and I'm typing SQL statements into my telephone. This is not quite the future I'd imagined.”

Saturday, August 15, 2009

Today’s Graphics Hardware is Too Hard

Tim Sweeney recently gave a keynote at High Performance Graphics 2009 titled "The End of the GPU Roadmap" (slides). Tim is CEO and founder of Epic Games, producer of over 30 games including Gears of War, as well as the Unreal game engine used in hundreds of games. There are lots of really interesting points in that 74-slide presentation, but my biggest keeper is slide 71:

[begin quote]

Lessons learned: Today's hardware is too hard!

  • If it costs X (time, money, pain) to develop an efficient single-threaded algorithm, then…
    • Multithreaded version costs 2X
    • PlayStation 3 Cell version costs 5X
    • Current "GPGPU" version costs: 10X or more
  • Over 2X is uneconomical for most software companies!
  • This is an argument against:
    • Hardware that requires difficult programming techniques
    • Non-unified memory architectures
    • Limited "GPGPU" programming models

[end quote]

Judging from the prior slides, by "GPGPU" Tim apparently means the DirectX 10 pipeline with programmable shaders.

I'm not sure what else to make of this beyond rehashing Tim's words, and I'd rather point you to his slides than start doing that. The overall tenor somewhat echoes comments I made in one of my first posts; it continues to be the most hit-on page of this blog, so I must have said something useful there.

I will note, though, that Tim's estimates of effort are based on very extensive experience – with game programming. For low-ish levels of parallelism, like 4 or 8, multithreading adds zero cost to typical commercial applications already running under a competent transaction monitor. It just works, since they're already at that level of software multithreading for other reasons (like achieving overlap with IO waits). Of course, that's not at all universally true for commercial applications, particularly for high levels of parallelism, no matter how much cloud evangelists talk about elasticity.

Once again, thanks to my friend who is expert at finding things like this slide set (it's not on the conference web site) and doesn't want his name mentioned.

Short post this time.

Monday, July 20, 2009

Why Accelerators Now?

Accelerators have always been the professional wrestlers of computing. They're ripped, trash-talking superheroes, whose special signature moves and bodybuilder physiques promise to reduce diamond-hard computing problems to soft blobs quivering in abject surrender. Wham! Nvidia "The Green Giant" CUDA body-slams a Black-Scholes financial model! Shriek! Intel "bong-da-Dum-da-Dum" Larrabee cobra-clutches a fast Fourier transform into agonizing surrender!

And they're "green"! Many more FLOPS/OPS/whatever per watt and per square inch of floor space than your standard server! And I'm using way too many exclamation points!!

Logical Sidebar: What is an accelerator, anyway? My definition: An accelerator is a device optimized to enhance the performance or function of a computing system. An accelerator does not function on its own; it requires invocation from host programs. This is by intention and design optimization, not physics, since an accelerator may contain general purpose system parts (like a standard processor), be substantially software or firmware, and (recursively) contain other accelerators. The strategy is specialization; there is no such thing as a "general-purpose" accelerator. Claims to the contrary usually assume just one application area, usually HPC, but there are many kinds of accelerators – see the table appearing later. The big four "general purpose" GPUs – IBM Cell, Intel Larrabee, Nvidia CUDA, ATI/AMD Stream – are just the tip of the iceberg. The architecture of accelerators is a glorious zoo that is home to the most bizarre organizations imaginable, a veritable Cambrian explosion of computing evolution.

So, if they're so wonderful, why haven't accelerators already taken over the world?

Let me count the ways:

Nonstandard software that never quite works with your OS release; disappointing results when you find out you're going 200 times faster – on 5% of the whole problem; lethargic data transfer whose overhead squanders the performance; narrow applicability that might exactly hit your specific problem, or might not when you hit the details; difficult integration into system management and software development processes; and a continual need for painful upgrades to the next, greatest version with its different nonstandard software and new hardware features; etc.

When everything lines up just right, the results can be fantastic; check any accelerator company's web page for numerous examples. But getting there can be a mess. Anyone who was a gamer in the bad old days before Microsoft DirectX is personally familiar with this; every new game was a challenge to get working on your gear. Those perennial problems are also the reason for a split reaction in the finance industry to computational accelerators. The quants want them; if they can make them work (and they're always optimists), a few milliseconds advantage over a competitor can yield millions of dollars per day. But their CIOs' reaction is usually unprintable.

I think there are indicators that this worm may well be turning, though, allowing many more types of accelerators to become far more mainstream. Which implies another question: Why is this happening now?

Indicators

First of all, vendors seem to be embracing actual industry software standards for programming accelerators. I'm referring here to the Khronos Group's OpenCL, which Nvidia, AMD/ATI, Intel, and IBM, among others, are supporting. This may well replace proprietary interfaces like Nvidia's CUDA API and AMD/ATI's CTM, and in doing so have an effect as good and simplifying as Microsoft's DirectX API series, which eliminated a plethora of problems for graphics accelerators (GPUs).

Another indicator is that connecting general systems to accelerators is becoming easier and faster, reducing both the CPU overhead and latency involved in transferring data and kicking off accelerator operations. This is happening on two fronts: IO, and system bus.

On the IO side, there's AMD developing and intermittently showcasing its Torrenza high-speed connection. In addition, the PCI-SIG will, presumably real soon now, publish PCIe version 3.0, which contains architectural features designed for lower-overhead accelerator attachment, like atomic operations and caching of IO data.

On the system bus side, both Intel and AMD have been licensing their internal inter-processor system busses as attachment points to selected companies. This is the lowest-overhead, fastest way to communicate that exists in any system; the latencies are measured in nanoseconds and the data rates in gigabytes per second. This indicates a real commitment to accelerators, because foreign attachment directly to one's system bus was heretofore unheard-of, for very good reason. The protocols used on system busses, particularly the aspects controlling cache coherence, are mind-numbingly complex. They're the kinds of things best developed and used by a team whose cubes/offices are within whispering range of each other. When they don't work, you get nasty intermittent errors that can corrupt data and crash the system. Letting a foreign company onto your system bus is like agreeing to the most intimate unprotected sex imaginable. Or doing a person-to-person mutual blood transfusion. Or swapping DNA. If the other guy is messed up, you are toast. Yet, it's happening. My mind is boggled.

Another indicator is the width of the market. The vast majority of the accelerator press has focused on GPGPUs, but there are actually a huge number of accelerators out there, spanning an oceanic range of application areas. Cryptography? Got it. Java execution? Yep. XML processing? – not just parsing, but schema validation, XSLT transformations, XPaths, etc. – Oh, yes, that too. Here's a table of some of the companies involved in some of the areas. It is nowhere near comprehensive, but it will give you a flavor:

Beyond companies making accelerators, there is a collection of companies who are accelerator arms dealers – they live by making technology that's particularly good for creating accelerators, like semi-custom single-chip systems with your own specified processing blocks and/or instructions. Some names: Cavium, Freescale Semiconductor, Infineon, LSI Logic, Raza Microelectronics, STMicroelectronics, Teja, Tensilica, Britestream. That's not to leave out the FPGA vendors, who make custom hardware simple by providing chips that are seas of gates and functions you can electrically wire up as you like.

Why Now?

That's all fine and good, but. Tech centers around the world are littered with the debris of failed accelerator companies. There have always been accelerator companies in a variety of areas, particularly floating point computing and the offloading of communications protocols (chiefly TCP/IP); efforts date back to the early 1970s. Is there some fundamental reason why the present surge won't crash and burn like it always has?

A list can certainly be made of how circumstances have changed for accelerator development. There didn't used to be silicon foundries, for example. Or Linux. Or increasingly capable building blocks like FPGAs. I think there's a more fundamental reason.

Until recently, everybody has had to run a Red Queen's race with general purpose hardware. There's no point in obtaining an accelerator if by the time you convince your IT organization to allow it, order it, receive it, get it installed, and modify your software to use it, you could have gone faster by just sitting there on your butt, doing nothing, and getting a later generation general-purpose system. When the general-purpose system has gotten twice as fast, for example, the effective value of your accelerator has halved.

How bad a problem is this? Here's a simple graph that illustrates it:

What the graph shows is this: Suppose you buy an accelerator that does something 10 times faster than the fastest general-purpose "commodity" system does, today. Now, assume GP systems increase in speed as they have over the last couple of decades, a 45% CAGR. After just two years, you're only 5 times faster. The value of your investment in that accelerator has been halved. After four years, it's nearly divided by 5. After five years, the advantage is essentially worthless; the general-purpose systems have nearly caught up, and soon after that they pass it.

This is devastating economics for any company trying to make a living by selling accelerators. It means they have to turn over designs continually to keep their advantage, and furthermore, they have to time their development very carefully – a schedule slip means they have effectively lost performance. They're in a race with the likes of Intel, AMD, IBM, and whoever else is out there making systems out of their own technology, and they have nowhere near the resources being applied to general purpose systems (even if they are part of Intel, AMD, and IBM).

Now look at what happens when the rate of increase slows down:

Look at that graph, keeping in mind that the best guess for single-thread performance increase over time is now in the range of 10%–15% CAGR. Now your hardware design can actually provide value for five years. You have some slack in your development schedule.
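
For what it's worth, the arithmetic behind both graphs is easy to sketch – assume a nominal 10x starting advantage (my number, just to show the shape of the curves) and divide by compound general-purpose growth:

```python
# Erosion of a 10x accelerator advantage under two single-thread growth rates.
def advantage_over_time(initial=10.0, cagr=0.45, years=6):
    return [initial / (1 + cagr) ** t for t in range(years)]

for cagr in (0.45, 0.12):   # the old ~45% CAGR world vs. a ~10-15% CAGR world
    print(f"{cagr:.0%} CAGR:", ["%.1fx" % a for a in advantage_over_time(cagr=cagr)])
# 45% CAGR: ['10.0x', '6.9x', '4.8x', '3.3x', '2.3x', '1.6x']
# 12% CAGR: ['10.0x', '8.9x', '8.0x', '7.1x', '6.4x', '5.7x']
```

In this simple model the raw edge at year five is down to about 1.6x in the fast-growth case – small enough for data-transfer overhead and integration pain to erase – while well over half of it survives in the slow-growth case.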

It means that the flattening of single-thread performance – the half-death of Moore's Law – makes accelerators economically viable to a degree they never have been before.

Cambrian explosion? I think it's going to be a Cambrian Fourth-of-July, except that the traditional finale won't end soon.

Objections and Rejoinders

I've heard a couple of objections raised to this line of thinking, so I may as well bring them up and try to shoot them down right away.

Objection 1: Total performance gains aren't slowing down, just single-thread gains. Parallel performance continues to rise. To use accelerators you have to parallelize anyway, so you just apply that to the general purpose systems and the accelerator advantage goes away again.

Response: This comes from the mindset that accelerator = GPGPU. GPGPUs all get their performance from explicit parallelism, and, as the "GP" part says, that parallelism is becoming more and more general purpose. But the world of accelerators isn't limited to GPGPUs; some use architectures that simply (hah! It isn't often simple) embed algorithms directly in silicon. The guts of a crypto accelerator aren't anything like a general-purpose processor, for example, and conventional parallelism on conventional general-purpose processors will lose out to it. And in any event, this is comparing past apples to present oranges: Previously you did not have to do anything at all to reap the performance benefit of faster systems. This objection assumes that you do have to do something – parallelize code – and that something is far from trivial. Avoiding it may be a major benefit of accelerators.

Objection 2: Accelerator, schaccelerator, if a function is actually useful it will get embedded into the instruction set of general purpose systems, so the accelerator goes away. SIMD operations are an example of this.

Response: This will happen, and has happened, for some functions. But how did anybody get the experience to know what instruction set extensions were the most useful ones? Decades of outboard floating point processing preceded SIMD instructions. AMD says it will "fuse" graphics functions with processors – and how many years of GPU development and experience will let it pick the right functions to do that with? For other functions, well, I don't think many CPU designers will be all that happy absorbing the strange things done in, say, XML acceleration hardware.