Monday, July 20, 2009

Why Accelerators Now?

Accelerators have always been the professional wrestlers of computing. They're ripped, trash-talking superheroes, whose special signature moves and bodybuilder physiques promise to reduce diamond-hard computing problems to soft blobs quivering in abject surrender. Wham! Nvidia "The Green Giant" CUDA body-slams a Black-Scholes equation financial model! Shriek! Intel "bong-da-Dum-da-Dum" Larrabee cobra clutches a fast Fourier transform to agonizing surrender!

And they're "green"! Many more FLOPS/OPS/whatever per watt and per square inch of floor space than your standard server! And I'm using way too many exclamation points!!

Logical Sidebar: What is an accelerator, anyway? My definition: An accelerator is a device optimized to enhance the performance or function of a computing system. An accelerator does not function on its own; it requires invocation from host programs. This is by intention and design optimization, not physics, since an accelerator may contain general-purpose system parts (like a standard processor), be substantially software or firmware, and (recursively) contain other accelerators. The strategy is specialization; there is no such thing as a "general-purpose" accelerator. Claims to the contrary typically assume just one application area, usually HPC, but there are many kinds of accelerators – see the table appearing later. The big four "general-purpose" GPUs – IBM Cell, Intel Larrabee, Nvidia CUDA, ATI/AMD Stream – are just the tip of the iceberg. The architecture of accelerators is a glorious zoo, home to the most bizarre organizations imaginable, a veritable Cambrian explosion of computing evolution.

So, if they're so wonderful, why haven't accelerators already taken over the world?

Let me count the ways:

- Nonstandard software that never quite works with your OS release.
- Disappointing results when you find out you're going 200 times faster – on 5% of the whole problem.
- Lethargic data transfer whose overhead squanders the performance.
- Narrow applicability that might exactly hit your specific problem, or might not once you get into the details.
- Difficult integration into system management and software development processes.
- A continual need for painful upgrades to the next, greatest version with its different nonstandard software and new hardware features.
- Etc.

When everything lines up just right, the results can be fantastic; check any accelerator company's web page for numerous examples. But getting there can be a mess. Anyone who was a gamer in the bad old days before Microsoft DirectX is personally familiar with this; every new game was a challenge to get working on your gear. Those perennial problems are also the reason for a split reaction in the finance industry to computational accelerators. The quants want them; if they can make them work (and they're always optimists), a few milliseconds advantage over a competitor can yield millions of dollars per day. But their CIOs' reaction is usually unprintable.

I think there are indicators that this worm may well be turning, though, allowing many more types of accelerators to become far more mainstream. Which implies another question: Why is this happening now?

Indicators

First of all, vendors seem to be embracing actual industry software standards for programming accelerators. I'm referring here to the Khronos Group's OpenCL, which Nvidia, AMD/ATI, Intel, and IBM, among others, are supporting. This may well replace proprietary interfaces like Nvidia's CUDA API and AMD/ATI's CTM, and in doing so have the same salutary, simplifying effect that Microsoft's DirectX API series had for graphics accelerators (GPUs), where it eliminated a plethora of compatibility problems.
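
For a concrete feel of what programming to such a standard looks like, here's a minimal sketch of an OpenCL 1.x host program in C. The vector-scaling kernel, the buffer size, and the "first device found" choice are illustrative assumptions of mine, not anything prescribed by the standard or by a particular vendor's toolkit.

```c
/* A minimal OpenCL 1.x host program: build a tiny kernel from source and run
 * it on whatever device the first platform reports. Error handling is
 * collapsed into one macro to keep the sketch short. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define CHECK(err) do { if ((err) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); \
    exit(1); } } while (0)

/* The accelerated function itself, as OpenCL C source: scale a vector. */
static const char *kernel_src =
    "__kernel void scale(__global float *data, float alpha) {\n"
    "    int i = get_global_id(0);\n"
    "    data[i] *= alpha;\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float host_buf[N];
    for (int i = 0; i < N; i++) host_buf[i] = (float)i;

    /* Find a platform and any device on it: GPU, CPU, or other accelerator. */
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    CHECK(clGetPlatformIDs(1, &platform, NULL));
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL));

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    CHECK(err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    CHECK(err);

    /* Compile the kernel source for this device at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    CHECK(err);
    CHECK(clBuildProgram(prog, 1, &device, NULL, NULL, NULL));
    cl_kernel kern = clCreateKernel(prog, "scale", &err);
    CHECK(err);

    /* Copy the data over, run one work-item per element, and read it back. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host_buf), host_buf, &err);
    CHECK(err);
    cl_float alpha = 2.0f;
    CHECK(clSetKernelArg(kern, 0, sizeof(buf), &buf));
    CHECK(clSetKernelArg(kern, 1, sizeof(alpha), &alpha));
    size_t global_size = N;
    CHECK(clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global_size, NULL,
                                 0, NULL, NULL));
    CHECK(clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_buf),
                              host_buf, 0, NULL, NULL));

    printf("host_buf[10] = %.1f (expected 20.0)\n", host_buf[10]);

    clReleaseMemObject(buf);
    clReleaseKernel(kern);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```

The point of a standard like this is that the same host code and kernel source should run unchanged on Nvidia, AMD/ATI, Intel, or IBM implementations; only the performance should differ.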

Another indicator is that connecting general systems to accelerators is becoming easier and faster, reducing both the CPU overhead and latency involved in transferring data and kicking off accelerator operations. This is happening on two fronts: IO, and system bus.

On the IO side, there's AMD developing and intermittently showcasing its Torrenza high-speed connection. In addition, the PCI-SIG will, presumably real soon now, publish PCIe version 3.0, which contains architectural features designed for lower-overhead accelerator attachment, like atomic operations and caching of IO data.

On the system bus side, both Intel and AMD have been licensing their internal inter-processor system busses as attachment points to selected companies. This is the lowest-overhead, fastest way to communicate that exists in any system; the latencies are measured in nanoseconds and the data rates in gigabytes per second. This indicates a real commitment to accelerators, because foreign attachment directly to one's system bus was heretofore unheard-of, for very good reason. The protocols used on system busses, particularly the aspects controlling cache coherence, are mind-numbingly complex. They're the kinds of things best developed and used by a team whose cubes/offices are within whispering range of each other. When they don't work, you get nasty intermittent errors that can corrupt data and crash the system. Letting a foreign company onto your system bus is like agreeing to the most intimate unprotected sex imaginable. Or doing a person-to-person mutual blood transfusion. Or swapping DNA. If the other guy is messed up, you are toast. Yet, it's happening. My mind is boggled.

Another indicator is the width of the market. The vast majority of the accelerator press has focused on GPGPUs, but there are actually a huge number of accelerators out there, spanning an oceanic range of application areas. Cryptography? Got it. Java execution? Yep. XML processing? – not just parsing, but schema validation, XSLT transformations, XPaths, etc. – Oh, yes, that too. Here's a table of some of the companies involved in some of the areas. It is nowhere near comprehensive, but it will give you a flavor.

Beyond companies making accelerators, there is a collection of companies that are accelerator arms dealers – they live by making technology that's particularly good for creating accelerators, like semi-custom single-chip systems with your own specified processing blocks and/or instructions. Some names: Cavium, Freescale Semiconductor, Infineon, LSI Logic, Raza Microelectronics, STMicroelectronics, Teja, Tensilica, Britestream. That's not to leave out the FPGA vendors, who make custom hardware simple by providing chips that are seas of gates and functions you can electrically wire up as you like.

Why Now?

That's all fine and good, but. Tech centers around the world are littered with the debris of failed accelerator companies. There have always been accelerator companies in a variety of areas, particularly floating point computing and the offloading of communications protocols (chiefly TCP/IP); efforts date back to the early 1970s. Is there some fundamental reason why the present surge won't crash and burn like it always has?

A list can certainly be made of how circumstances have changed for accelerator development. There didn't use to be silicon foundries, for example. Or Linux. Or increasingly capable building blocks like FPGAs. I think there's a more fundamental reason.

Until recently, everybody has had to run a Red Queen's race with general purpose hardware. There's no point in obtaining an accelerator if by the time you convince your IT organization to allow it, order it, receive it, get it installed, and modify your software to use it, you could have gone faster by just sitting there on your butt, doing nothing, and getting a later generation general-purpose system. When the general-purpose system has gotten twice as fast, for example, the effective value of your accelerator has halved.

How bad a problem is this? Here's a simple graph that illustrates it:

What the graph shows is this: Suppose you buy an accelerator that does something 10 times faster than the fastest general-purpose "commodity" system does today. Now assume GP systems increase in speed as they have over the last couple of decades, a 45% CAGR. After just two years, you're only 5 times faster; the value of your investment in that accelerator has been halved. After four years, it's been divided by nearly 5. After five or six years, it's essentially worthless; soon after that, it's actually slower than a general-purpose system.

This is devastating economics for any company trying to make a living by selling accelerators. It means they have to turn over designs continually to keep their advantage, and furthermore, they have to time their development very carefully – a schedule slip means they have effectively lost performance. They're in a race with the likes of Intel, AMD, IBM, and whoever else is out there making systems out of their own technology, and they have nowhere near the resources being applied to general purpose systems (even if they are part of Intel, AMD, and IBM).

Now look at what happens when the rate of increase slows down:

Look at that graph, keeping in mind that the best guess for single-thread performance increases over time is now in the range of 10%-15% CAGR. Now your hardware design can actually provide value for five years. You have some slack in your development schedule.
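
If you want to check the arithmetic behind both graphs, here's a minimal sketch. The 10x starting advantage and the 45% CAGR come from the discussion above; the 12% figure is just an illustrative midpoint of that 10%-15% range.

```c
/* Back-of-the-envelope version of the two graphs: a fixed accelerator
 * advantage eroding as general-purpose (GP) single-thread performance
 * compounds at a given annual growth rate. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double initial_advantage = 10.0;   /* accelerator starts 10x faster */
    const double rates[]  = { 0.45, 0.12 };  /* historical ~45% vs. slowed ~12% CAGR */
    const char  *labels[] = { "45% CAGR (historical)", "12% CAGR (slowed)" };

    for (int r = 0; r < 2; r++) {
        printf("%s\n", labels[r]);
        for (int year = 0; year <= 6; year++) {
            /* GP performance grows as (1 + CAGR)^year; the accelerator stands still. */
            double advantage = initial_advantage / pow(1.0 + rates[r], year);
            printf("  year %d: accelerator is %.2fx a then-current GP system\n",
                   year, advantage);
        }
    }
    return 0;
}
```

At 45% a year, the 10x advantage is down to about 4.8x after two years and about 1.1x after six; at 12% a year, it is still above 5x after six years.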

It means that the slowdown in single-thread performance growth under Moore's Law makes accelerators economically viable to a degree they never have been before.

Cambrian explosion? I think it's going to be a Cambrian Fourth-of-July, except that the traditional finale won't end soon.

Objections and Rejoinders

I've heard a couple of objections raised to this line of thinking, so I may as well bring them up and try to shoot them down right away.

Objection 1: Total performance gains aren't slowing down, just single-thread gains. Parallel performance continues to rise. To use accelerators you have to parallelize anyway, so you just apply that to the general purpose systems and the accelerator advantage goes away again.

Response: This comes from the mindset that accelerator = GPGPU. GPGPUs all get their performance from explicit parallelism, and, as the "GP" part says, that parallelism is becoming more and more general purpose. But the world of accelerators isn't limited to GPGPUs; some use architectures that simply (hah! it isn't often simple) embed algorithms directly in silicon. The guts of a crypto accelerator aren't anything like a general-purpose processor, for example, and conventional parallelism on conventional general-purpose processors will lose out to it. And in any event, this is comparing past apples to present oranges: previously you did not have to do anything at all to reap the performance benefit of faster systems. This objection assumes that you do have to do something – parallelize code – and that something is far from trivial. Avoiding it may be a major benefit of accelerators.

Objection 2: Accelerator, schaccelerator, if a function is actually useful it will get embedded into the instruction set of general purpose systems, so the accelerator goes away. SIMD operations are an example of this.

Response: This will happen, and has happened, for some functions. But how did anybody get the experience to know what instruction set extensions were the most useful ones? Decades of outboard floating point processing preceded SIMD instructions. AMD says it will "fuse" graphics functions with processors – and how many years of GPU development and experience will let it pick the right functions to do that with? For other functions, well, I don't think many CPU designers will be all that happy absorbing the strange things done in, say, XML acceleration hardware.