Monday, July 20, 2009

Why Accelerators Now?

Accelerators have always been the professional wrestlers of computing. They're ripped, trash-talking superheroes whose special signature moves and bodybuilder physiques promise to reduce diamond-hard computing problems to soft blobs quivering in abject surrender. Wham! Nvidia "The Green Giant" CUDA body-slams a Black-Scholes financial model! Shriek! Intel "bong-da-Dum-da-Dum" Larrabee cobra-clutches a fast Fourier transform into agonizing surrender!

And they're "green"! Many more FLOPS/OPS/whatever per watt and per square inch of floor space than your standard server! And I'm using way too many exclamation points!!

Logical Sidebar: What is an accelerator, anyway? My definition: An accelerator is a device optimized to enhance the performance or function of a computing system. An accelerator does not function on its own; it requires invocation from host programs. This is by intention and design optimization, not physics, since an accelerator may contain general purpose system parts (like a standard processor), be substantially software or firmware, and (recursively) contain other accelerators. The strategy is specialization; there is no such thing as a "general-purpose" accelerator. Claims to the contrary usually assume just one application area, usually HPC, but there are many kinds of accelerators – see the table appearing later. The big four "general purpose" GPUs – IBM Cell, Intel Larrabee, Nvidia CUDA, ATI/AMD Stream – are just the tip of the iceberg. The architecture of accelerators is a glorious zoo that is home to the most bizarre organizations imaginable, a veritable Cambrian explosion of computing evolution.

So, if they're so wonderful, why haven't accelerators already taken over the world?

Let me count the ways:

Nonstandard software that never quite works with your OS release.

Disappointing results when you find out you're going 200 times faster – on 5% of the whole problem.

Lethargic data transfer whose overhead squanders the performance.

Narrow applicability that might exactly hit your specific problem, or might not once you get into the details.

Difficult integration into system management and software development processes.

And a continual need for painful upgrades to the next, greatest version, with its different nonstandard software and new hardware features. Etc.

When everything lines up just right, the results can be fantastic; check any accelerator company's web page for numerous examples. But getting there can be a mess. Anyone who was a gamer in the bad old days before Microsoft DirectX is personally familiar with this; every new game was a challenge to get working on your gear. Those perennial problems are also the reason for a split reaction in the finance industry to computational accelerators. The quants want them; if they can make them work (and they're always optimists), a few milliseconds advantage over a competitor can yield millions of dollars per day. But their CIOs' reaction is usually unprintable.

I think there are indicators that this worm may well be turning, though, allowing many more types of accelerators to become far more mainstream. Which implies another question: Why is this happening now?

Indicators

First of all, vendors seem to be embracing actual industry software standards for programming accelerators. I'm referring here to the Khronos Group's OpenCL, which Nvidia, AMD/ATI, Intel, and IBM, among others, are supporting. This may well replace proprietary interfaces like Nvidia's CUDA API and AMD/ATI's CTM, and in doing so have an effect as good and simplifying as Microsoft's DirectX API series, which eliminated a plethora of problems for graphics accelerators (GPUs).
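To make that concrete, here is a rough sketch of what vendor-neutral accelerator code looks like under OpenCL: one kernel source string and one set of host calls that are meant to run unchanged on Nvidia, AMD/ATI, IBM, or Intel devices. This is only an illustration of the programming model (error checking omitted, and the kernel name "scale" is made up for the example), not anybody's production code.

    /* scale.c -- minimal OpenCL sketch: scale a float array on whatever
     * accelerator the platform exposes. Error checking omitted for brevity. */
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global float *x, const float a) {\n"
        "    int i = get_global_id(0);\n"
        "    x[i] = a * x[i];\n"
        "}\n";

    int main(void) {
        enum { N = 1024 };
        float data[N];
        for (int i = 0; i < N; i++) data[i] = (float)i;

        /* Find a platform and a device; any vendor's OpenCL stack will do. */
        cl_platform_id plat;   clGetPlatformIDs(1, &plat, NULL);
        cl_device_id dev;      clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Ship the data to the device and build the kernel from source. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* Launch one work-item per element, then read the result back. */
        float a = 2.0f;
        size_t global = N;
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(a), &a);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

        printf("data[3] = %f\n", data[3]);  /* expect 6.0 */
        return 0;
    }

The boilerplate isn't the point; the point is that none of those calls names a vendor, which is exactly what DirectX did for graphics.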

Another indicator is that connecting general systems to accelerators is becoming easier and faster, reducing both the CPU overhead and latency involved in transferring data and kicking off accelerator operations. This is happening on two fronts: IO, and system bus.

On the IO side, AMD has been developing and intermittently showcasing its Torrenza high-speed connection. In addition, the PCI-SIG will, presumably real soon now, publish PCIe version 3.0, which contains architectural features designed for lower-overhead accelerator attachment, such as atomic operations and caching of IO data.

On the system bus side, both Intel and AMD have been licensing their internal inter-processor system busses as attachment points to selected companies. This is the lowest-overhead, fastest way to communicate that exists in any system; latencies are measured in nanoseconds and data rates in gigabytes per second. This indicates a real commitment to accelerators, because foreign attachment directly to one's system bus was heretofore unheard-of, for very good reason. The protocols used on system busses, particularly the aspects controlling cache coherence, are mind-numbingly complex. They're the kinds of things best developed and used by a team whose cubes/offices are within whispering range of each other. When they don't work, you get nasty intermittent errors that can corrupt data and crash the system. Letting a foreign company onto your system bus is like agreeing to the most intimate unprotected sex imaginable. Or doing a person-to-person mutual blood transfusion. Or swapping DNA. If the other guy is messed up, you are toast. Yet, it's happening. My mind is boggled.

Another indicator is the width of the market. The vast majority of the accelerator press has focused on GPGPUs, but there are actually a huge number of accelerators out there, spanning an oceanic range of application areas. Cryptography? Got it. Java execution? Yep. XML processing? – not just parsing, but schema validation, XSLT transformations, XPath evaluation, etc. – Oh, yes, that too. Here's a table of some of the companies involved in some of the areas. It is nowhere near comprehensive, but it will give you a flavor:

Beyond companies making accelerators, there is a collection of companies who are accelerator arms dealers – they live by making technology that's particularly good for creating accelerators, like semi-custom single-chip systems with your own specified processing blocks and/or instructions. Some names: Cavium, Freescale Semiconductor, Infineon, LSI Logic, Raza Microelectronics, STMicroelectronics, Teja, Tensilica, Britestream. That's not to leave out the FPGA vendors, who make custom hardware simple by providing chips that are seas of gates and functions you can electrically wire up as you like.

Why Now?

That's all fine and good, but. Tech centers around the world are littered with the debris of failed accelerator companies. There have always been accelerator companies in a variety of areas, particularly floating point computing and the offloading of communications protocols (chiefly TCP/IP); efforts date back to the early 1970s. Is there some fundamental reason why the present surge won't crash and burn like it always has?

A list can certainly be made of how circumstances have changed for accelerator development. There didn't used to be silicon foundries, for example. Or Linux. Or increasingly capable building blocks like FPGAs. I think there's a more fundamental reason.

Until recently, everybody has had to run a Red Queen's race with general purpose hardware. There's no point in obtaining an accelerator if by the time you convince your IT organization to allow it, order it, receive it, get it installed, and modify your software to use it, you could have gone faster by just sitting there on your butt, doing nothing, and getting a later generation general-purpose system. When the general-purpose system has gotten twice as fast, for example, the effective value of your accelerator has halved.

How bad a problem is this? Here's a simple graph that illustrates it:

What the graph shows is this: Suppose you buy an accelerator that does something 10 times faster than the fastest general-purpose "commodity" system does today. Now, assume GP systems increase in speed as they have over the last couple of decades, a 45% CAGR. After only two years, you're only about 5 times faster. The value of your investment in that accelerator has been halved. After four years, it has dropped by nearly a factor of 5. Somewhere past the six-year mark it's worthless; it's actually slower than a general-purpose system.
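For the arithmetically inclined, the curve behind that graph is nothing more than the initial speedup divided by the compound growth of the general-purpose systems. Here's a back-of-the-envelope sketch using the 10X and 45% figures from the text; the program itself is just an illustration:

    /* advantage.c -- how an accelerator's edge erodes as general-purpose
     * (GP) systems keep getting faster at a given compound annual rate. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double speedup = 10.0;   /* accelerator is 10X a GP system at year 0 */
        double cagr    = 0.45;   /* GP single-thread growth, 45% per year    */

        for (int year = 0; year <= 7; year++) {
            double edge = speedup / pow(1.0 + cagr, year);
            printf("year %d: advantage = %.2fX%s\n",
                   year, edge, edge < 1.0 ? "  <-- slower than GP" : "");
        }
        return 0;
    }

At 45% it prints roughly 10.0X, 6.9X, 4.8X, 3.3X, 2.3X, 1.6X, 1.1X, and 0.7X for years zero through seven; the edge is gone shortly after year six.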

This is devastating economics for any company trying to make a living by selling accelerators. It means they have to turn over designs continually to keep their advantage, and furthermore, they have to time their development very carefully – a schedule slip means they have effectively lost performance. They're in a race with the likes of Intel, AMD, IBM, and whoever else is out there making systems out of their own technology, and they have nowhere near the resources being applied to general purpose systems (even if they are part of Intel, AMD, and IBM).

Now look at what happens when the rate of increase slows down:

Look at that graph, keeping in mind that the best guess for single-thread performance increases over time is now in the range of 10%-15% CAGR. Now your hardware design can actually provide value for five years. You have some slack in your development schedule.
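Rerunning the sketch above with cagr set to 0.12, squarely in that 10%-15% range, shows the 10X accelerator still holding roughly a 5.7X edge after five years; that is essentially the story the second graph tells.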

It means that the slowdown in single-thread performance growth, even as Moore's Law keeps delivering transistors, makes accelerators economically viable to a degree they never have been before.

Cambrian explosion? I think it's going to be a Cambrian Fourth of July, except that the traditional finale won't be ending any time soon.

Objections and Rejoinders

I've heard a couple of objections raised to this line of thinking, so I may as well bring them up and try to shoot them down right away.

Objection 1: Total performance gains aren't slowing down, just single-thread gains. Parallel performance continues to rise. To use accelerators you have to parallelize anyway, so you just apply that to the general purpose systems and the accelerator advantage goes away again.

Response: This comes from the mindset that accelerator = GPGPU. GPGPUs all get their performance from explicit parallelism, and, as the "GP" part says, that parallelism is becoming more and more general purpose. But the world of accelerators isn't limited to GPGPUs; some use architectures that simply (hah! it's seldom that simple) embed algorithms directly in silicon. The guts of a crypto accelerator aren't anything like a general-purpose processor, for example, and conventional parallelism on conventional general-purpose processors will lose out to it. And in any event, this is comparing past apples to present oranges: previously you did not have to do anything at all to reap the performance benefit of faster systems. This objection assumes that you do have to do something – parallelize code – and that something is far from trivial. Avoiding it may be a major benefit of accelerators.

Objection 2: Accelerator, schaccelerator, if a function is actually useful it will get embedded into the instruction set of general purpose systems, so the accelerator goes away. SIMD operations are an example of this.

Response: This will happen, and has happened, for some functions. But how did anybody get the experience to know which instruction set extensions were the most useful ones? Decades of outboard floating point processing preceded SIMD instructions. AMD says it will "fuse" graphics functions with processors – and how many years of GPU development and experience did it take to know which functions are the right ones to fuse? For other functions, well, I don't think many CPU designers will be all that happy absorbing the strange things done in, say, XML acceleration hardware.
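As a concrete picture of what "embedded into the instruction set" ends up looking like, here is a tiny sketch using x86 SSE intrinsics, one of the SIMD extensions the objection has in mind. The function name and the example itself are mine, just to show the end state; it is not a claim about how any particular function got there.

    /* saxpy_sse.c -- y = a*x + y, four floats per instruction via SSE.
     * Work that once needed an outboard floating-point accelerator is now
     * a handful of instructions in every commodity x86 processor. */
    #include <xmmintrin.h>
    #include <stdio.h>

    /* Assumes n is a multiple of 4, to keep the example short. */
    static void saxpy(float a, const float *x, float *y, int n) {
        __m128 va = _mm_set1_ps(a);               /* broadcast a into 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);      /* load 4 floats of x */
            __m128 vy = _mm_loadu_ps(y + i);      /* load 4 floats of y */
            vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
            _mm_storeu_ps(y + i, vy);             /* store 4 results */
        }
    }

    int main(void) {
        float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[8] = {0};
        saxpy(2.0f, x, y, 8);
        printf("y[7] = %f\n", y[7]);  /* expect 16.0 */
        return 0;
    }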

7 comments:

melonakos said...

Hi Greg,

Thanks for the article. Great point about why accelerators (including non-GPU-based accelerators) are more likely to stick today than yesterday.

In the case of GPUs, it seems to me that they also have a big advantage in being useful for something other than acceleration - video games. This means that they're better poised to survive the long haul of "crossing the chasm" than previous accelerator attempts. Also, as games rely more on physics simulations, GPUs will need more and more support for accelerated computation, and the drive for acceleration will continue to serve a dual purpose.

Great blog - glad I found it!

Best,

John

Greg Pfister said...

Hi, John, and thanks for the kudos.

You're of course correct that GPGPUs can ride on game system volumes, and that will indeed help them stick.

But while that’s a factor, the size of that effect can be debated. Game system volumes are smaller than the general business PC / netbook / commercial server / etc. market. I don’t know the relative volumes of HPC and higher-end games, but I think the higher-end PC gamer market is waning (and HPC is waxing). This is one reason Nvidia is explicitly branching out into HPC. Gaming may wane faster in the future; see my post on “Twilight of the GPU” (FPS shooters meet cloud computing).

From the point of view of HPC, this isn't all bad. If the game market were the fanatical focus of GPUs, nobody would (for example) spend silicon on full-precision floating point; it increases the cost without benefit to that market. Some HPC is perfectly happy with single precision, but it wasn't games that caused IBM to do a second Cell BE iteration that enhanced the performance of double-precision floating point.

So I still think the slowing of single-thread performance increases is the primary issue for accelerator survival, including GPGPUs.

Also, sorry, but I have to pick a nit with one statement you made: "useful for something other than acceleration - video games." The geometric transformations, texture mapping, rasterization, etc., that GPUs do for games are also acceleration. It's a lot of work, and being much better at it than general CPUs is the key reason GPUs exist.

Thanks for reading and commenting!

Greg

Greg Pfister said...

Oops, sorry -- somehow the settings for comments wouldn't let anybody comment except for registered users of something or other. That was not my intention.

Anybody can now comment, even anonymously.

Greg

Rich Salz said...

Nice post. I posted a couple paragraphs on why accelerators for appliances can make sense. Thanks for the subject line. :)

Greg Pfister said...

Thanks, Rich.

About some points in your blog:

I think you raise a good point. Appliances are a great locale for accelerators, and can insulate users from many of the problems I list.

That does, of course, in general depend on the interface. If it's a standard, then things can be sweet, and standard interfaces are a natural for some appliances -- ship them SQL or XML schemas, a Java jar file, or what have you, and you may be home free.

There may be limitations. For example, game consoles are effectively game appliances. Clearly, gamers are insulated. But the game developers may not be. Similarly, the people writing the Java apps or SQL or schemas may not be.

Greg

Jure Sah said...

I realize this post is a couple months old but...

I am wondering if all this talk of accelerators is really very relevant to the end-user of a PC. While the theory might work out well right down to the chipset interconnect, one finds very, very frequently that things do not work nearly as well as advertised. People very rarely compare the real-life performance of the system they have bought, and the market is littered with improvised hardware, where the ad says it works well, but in reality it does not work well at all.

Perhaps since you mention OpenCL, a good example is OpenGL? On many old AGP-based systems, software using OpenGL-based hardware acceleration will underperform compared to CPU emulation of the same tasks, running on hardware where the accelerator and the CPU are from the same era. Upon deeper inspection one comes to realize that the OpenGL implementation of some of these functions was simply very poor, and none of the games used them.

What bugs me is that nobody knew, nobody cared and nobody ever fixed it. The only explanation I can think of as to why it was done in such a way is that the architecture was laid out this way in order to maintain compatibility with fancy feature/concept X. As an end-user of this hardware I find this disturbing, as somebody whose job it was to produce fast hardware was in fact focusing on something quite different, and the fact that the end result wasn't very efficient never prevented them from selling it as the next greatest thing.

In summary, I think a lot of these fancy technologies are invented, made to work in a system that's not really designed for it, effectively destroyed and then sold as if it worked as originally planned. And nobody notices that it in fact does not.

Greg Pfister said...

Hi, Jure.

I fully agree with what you said above, historically. That's what I was trying to summarize in the 6th "let me count the ways" paragraph above.

Agreed, I didn't emphasize that the result to end users is bleah. Nothing. No gain. That is a point worth making.

The point of the rest of this post, though, was to lay out a number of reasons why that might just be changing.

As to what that might mean to end users, take a look at my most recent post: http://perilsofparallel.blogspot.com/2009/11/oh-for-good-old-days-to-come.html

Greg
