Wednesday, September 23, 2009

HPC – The Next Twenty Years

The Coalition for Academic Scientific Computation had its 20-year anniversary celebration symposium recently, and I was invited to participate on a panel with the topic HPC – The Next 20 Years. I thought it would be interesting to write down here what I said in my short position presentation. Eventually all the slides of the talks will be available; I’ll update this post when I know where.

First, my part.

Thank you for inviting me here today. I accepted with misgivings, since futurists give me hives. So I stand here now with a kind of self-induced autoimmune disorder.

I have no clue about what high-performance computing will look like 20 years from now.

(Later note: I was rather surprised that none of the other panelists said that themselves; they all agreed, though, when I did.)

So, I asked a few of my colleagues. The answers can be summarized simply, since there were only three, really:

A blank stare. This was the most common reaction. Like “Look, I have a deadline tomorrow.”

Laughter. I understand that response completely.

And, finally, someone said: What an incredible opportunity! You get to make totally outrageous statements that you’ll never be held accountable for! How about offshore data centers, powered by wave motion, continuously serviced by autonomous robots with salamander-level consciousness, spidering around replacing chicklet-sized compute units, all made by the world’s largest computer vendor – Haier! [They make refrigerators.] And lots of graphs, all going up to the right!

There’s a man after my own heart. I clearly owe him a beer.

And he’s got a lot more imagination than I have. I went out and boringly looked for some data.

What I found was the chart below, from the ITRS, the International Technology Roadmap for Semiconductors, a consortium sponsored by other semiconductor consortia for the purpose of creating and publishing roadmaps. It’s their 2008 update, the latest published, since they meet in December to do the deed. Here it is:

Oooooo. Complicated. Lots of details! Even the details have details, and lesser details upon ‘em. Anything with that much detail obviously must be correct, right?

My immediate reaction to this chart, having created thousands of technical presentations in my non-retired life, is that this is actually a transparent application of technical presentation rule #34: Overwhelm the audience with detail.

The implied message this creates is: This stuff is very, very complicated. I understand it. You do not. Therefore, obviously, I am smarter than you. So what I say must be correct, even if your feeble brain cannot understand why.

It doesn’t go out a full 20 years, but does go to 2020, and it says that by then we’ll be at 10 nm feature sizes, roughly a quarter of what’s shipping today. Elsewhere it elaborates on how this will mean many hundreds of processors per chip, multi-terabit flash chips, and other wonders.

But you don’t have to understand all of that detail. You want to know what that chart really means? I’ll tell you. It means this, in a bright, happy, green:

Everything’s Fine!

We’ll just keep rolling down the road, with progress all the way. No worries, mate!

Why does it mean that? Because it’s a future roadmap. Any company publishing a roadmap that does not say “Everything’s Fine!” is clearly inviting everybody to short their stock. The enormous compromise that must be performed within a consortium of consortia clearly must say that, or agreement could not conceivably be reached.

That said, I note two things on this graph:

First, the historical points on the left really don’t say to me that the linear extrapolation will hold at that slope. They look like they’re flattening out. Another year or so of data would make that more clear one way or another, but for now, it doesn’t look too supportive of the extrapolated future predictions.

Second, a significant update for 2008 is noted as changing the slope from a 2.5-year cycle to a 3-year cycle of making improvements. In conjunction with the first observation, I’d expect future updates to increase the cycle length even more, gradually flattening out the slope, extending the period over which the improvements will be made.

The implication: Moore’s Law won’t end with a bang; it will end with a whimper. It will gradually fade out in a period stretching over at least two decades.

I lack the imagination to say what happens when things really flatten out; that will depend on a lot of things other than hardware or software technology. But in the period leading up to this, there are some things I think will happen.

First, computing will in general become cheaper – but not necessarily that much faster. Certainly it won’t be much faster per processor. Whether it will be faster taking parallelism into account we’ll talk about in a bit.

Second, there will be a democratization of at least some HPC: Everybody will be able to do it. Well before 20 years are out, high-end graphics engines will be integrated into traditional high-end PC CPUs (see my post A Larrabee in Every PC and Mac). That means there will be TeraFLOPS on everybody’s lap, at least for some values of “lap”; lap may really be pocket or purse.

Third, computing will be done either on one’s laptop / cellphone / whatever; or out in a bloody huge mist/fog/cloud -like thing somewhere. There may be a hierarchy of such cloud resources, but I don’t think anybody will get charged up about what level they happen to be using at the moment.

Those resources will not be the high-quality compute cycles most of the people in the room – huge HPC center managers and users – are usually concerned with. They’ll be garbage computing; the leftovers when Amazon or Google or Microsoft or IBM are finished doing what they want to do.

Now, there’s nothing wrong with dumpster-diving for computing. That, after all, is what many of the original clusters were all about. In fact, the first email I got after publishing the second edition of my book said, roughly, “Hey, love the book, but you forgot my favorite cluster – Beowulf.” True enough. Tom Sterling’s first press release on Beowulf came out two weeks after my camera-ready copy was shipped. “I use that,” he continued. “I rescued five PCs from the trash, hooked them up with cheap Ethernet, put Linux on them, and am doing [some complicated technical computing thing or other, I forget] on them. My boss was so impressed he gave me a budget of $600 to expand it!”

So, garbage cycles. But really cheap. In lots of cases, they’ll get the job done.

Fourth, as we get further out, you won’t get billed by how many processors or memory or racks you use – but by how much power your computation takes. And possibly by how much bandwidth it consumes.

Then there’s parallelism.

I’m personally convinced that there will be no savior architecture or savior language that makes parallel processing simple or easy. I’ve lived through a good four decades of trying to find such a thing, with significant funding available, and nothing’s emerged. For languages in particular, take a look at my much earlier post series about there being 101 Parallel Languages, none of which are in real use. We’ve got MPI – a package for doing message-passing – and that’s about it. Sometimes OpenMP (for shared memory) gets used, but it’s a distant second.

That’s the bad news. The good news is that it doesn’t matter in many cases, because the data sets involved will be absolutely humongous. Genomes, sensor networks, multimedia streams, the entire corpus of human literature will all be out there. This will offer enormous amounts of the kinds of parallelism traditionally derided as “embarrassingly parallel” because it didn’t pose any kind of computer science challenge: There was nothing interesting to say about it because it was too easy to exploit. So despite the lack of saviors in architecture and languages, there will be a lot of parallel computing. There are people now trying to call this kind of computation “pleasantly parallel.”
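For concreteness, “embarrassingly parallel” just means independent pieces of work with nothing to communicate between them. A toy Python sketch of the pattern (the word-count task here is purely my own illustration, not anything from the talk):

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each chunk is processed completely independently: no communication,
    # no synchronization. That independence is what makes the problem
    # "embarrassingly" (or "pleasantly") parallel.
    return len(chunk.split())

if __name__ == "__main__":
    # A stand-in for a huge corpus, pre-split into independent pieces.
    chunks = ["the quick brown fox", "jumps over", "the lazy dog"]
    with Pool(processes=3) as pool:
        counts = pool.map(count_words, chunks)
    print(sum(counts))  # prints 9
```

Scale the chunk list up to genomes or sensor streams and the structure is exactly the same; the only hard part is feeding the data in.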

Probably the biggest challenges will arise in getting access to the highest-quality, most extensive exemplars of such huge data sets.

The traditional kind of “hard” computer-science-y parallel problems may well still be an area of interest, however, because of a curious physical fact: The amount of power consumed by a collection of processing elements goes up linearly with the number of processors; but it also goes up as the square of the clock frequency. So if you can do the same computation, in the same time, with more processors that run more slowly, you use less power. This is much less macho than traditional massive parallelism. “I get twice as much battery life as you” just doesn’t compete with “I have the biggest badass computer on the planet!” But it will be a significant economic issue. From this perspective, parallel office applications – such as parallel browsers, and even the oft-derided Parallel PowerPoint – actually make sense, as a way to extend the life of your cell phone battery charge.

Finally, I’d like to remind everybody of something that I think was quite well expressed by Tim Bray, when he tweeted this:

Here it is, 2009, and I'm typing SQL statements into my telephone. This is not quite the future I'd imagined.

The future will be different – strangely different – from anything we now imagine.

* * * *

That was my presentation. I’ll put some notes about some interesting things I heard from others in a separate post.

Friday, September 11, 2009

Of Muffins and Megahertz

Some readers have indicated, offline, that they liked the car salesman dialog about multicore systems that appeared in What Multicore Really Means. So I thought it might be interesting to relate the actual incident that prompted the miles- to muffins-per-hour performance switch I used.

If you haven't read that post, be warned that what follows will be more meaningful if you do.

It was inspired by a presentation to upper-level management of a 6-month-plus study of what was going on in the silicon concerning clock rate, and what if anything could be done about it. This occurred several years ago. I was involved, but not charged with providing any of the key slides. Well, OK, not one of my slides ended up being used.

It of course began with the usual "here's the team, here's how hard we worked" introduction.

First content chart: a cloud of data points collected from all over the industry that showed performance – specifically SPECINT, an integer benchmark – keeling over. It showed a big, obvious switch from the usual rise in performance, with a rough curve fit, breaking to a much lower predicted performance increase from now on. Pretty impressive. The obvious conclusion: Something major has happened. Things are different. There is big trouble.

Now, there's a rule for executive presentations: Never show a problem without proposing a solution. (Kind of like never letting a crisis go to waste.) So,

Second chart: a very similar-looking cloud of data points, sailing on at the usual growth rate for many years to come – labeled as multiprocessor (MP) results, what the industry would do in response. Yay, no problem! It's all right! We just keep on going! MP is the future! Lots of the rest of the pitch was about various forms of MP, from normal to bizarre.

Small print on second chart: It graphed SPECRATE. Not SPECINT.

SPECINT is a single-processor measure of performance. SPECRATE is, basically, how many completely separate SPECINTs you can do at once. Like, say, instead of the response time of PowerPoint, you get the incredibly useful measure of how many different PowerPoint slides you can modify at the same time. Or you change from miles per hour to muffins per hour.
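To make the bait-and-switch concrete, here’s a toy comparison (all numbers invented; SPEC’s actual methodology is considerably more involved):

```python
# Latency vs. throughput: the heart of the SPECINT / SPECRATE switch.
single_core_time = 10.0   # seconds for ONE task on the old, fast core
new_core_time = 12.0      # the new core is SLOWER per task...
n_cores = 8               # ...but there are eight of them on the chip

# SPECINT-style view: how fast does one task finish?
latency_ratio = single_core_time / new_core_time     # < 1: a regression

# SPECRATE-style view: how many independent tasks finish per second?
old_throughput = 1 / single_core_time
new_throughput = n_cores / new_core_time
throughput_ratio = new_throughput / old_throughput   # > 1: "no problem!"
```

Graph the first number and performance keels over; graph the second and the happy cloud of points sails on. Same chip, same data.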

Nothing on any slide or in any verbal statements referred to the difference. The chart makers - mostly high-level silicon technology experts - knew the difference, at least in theory. At least some of them did. I know others definitely did not in any meaningful sense.

In any event, throughout the entire rest of the presentation they displayed no inclination to inform anybody what it really meant. They didn't even distinguish the good result: typical server tasks can in general make really good use of parallelism. (See IT Departments Should NOT Fear Multicore.)

I was aghast. I couldn't believe that would be presented, like that, no matter what political positioning was going on. But I "knew better" than to say anything. Those charts were a result not just of data mining the industry for performance data but of data mining the company politically to get something that would reflect best on everybody involved. Speak up, and you get told that you don't know the big picture.

My opinion about feeding nonsense to anybody should be obvious from this blog. I don't think I'm totally blind in the political spectrum, but hey, guys, come on. That's blatant.

One hopes that the people who were the target knew enough to know the difference. I suspect that the whole point of the exercise, from their point of view, was just to really, firmly, nail down the point that the first chart – SPECINT keeling over – was a physical fact, and not just one of the regularly-scheduled pitches from the silicon folks for more development funds because they were in trouble. The target audience probably stopped paying attention after that first slide.

I don't mean to imply above that the gents who are responsible for the physical silicon don't regularly have real problems; they do. But this situation was a problem of a whole different dimension.

It still is.

Tuesday, September 8, 2009

Japan, Inc., vs. Intel – Not Really

Fujitsu, Toshiba, Panasonic, Renesas Technology, NEC, Hitachi and Canon, with Japan's Ministry of Economy, Trade and Industry supplying 3-4 billion yen, are pooling resources to build a new "super CPU" for consumer electronics by the end of 2012, according to an article in Forbes. It's being publicized as "taking on Intel."

The design is based on the work of Hironori Kasahara, professor of computer science at Waseda University, and is allegedly extremely power-efficient. It even "runs on solar cells that will use less than 70% of the power consumed by normal ones." Man, I hate silly marketing talk, especially when subject to translation.

El Reg also picked up on this development.

Why a new CPU design? To jump to the conclusion: I don't know. I don't see it. Not clear what's really going on here.

Digging around for info runs into an almost impenetrable wall of academic publisher copyrights. I did find one downloadable paper from back in 2006, and what looks like a conference poster session exhibit, and a friend got me a copy of a more recent paper that gave a few more clues.

The main advances here appear to be in Kasahara's OSCAR compiler, which produces a hierarchical coarse-grain task graph that is statically scheduled on multiprocessors by the compiler itself. The lowest levels appear to target all the way down to an accelerator. I'm not enough of a compiler expert to judge this, but fine, I'll agree it works. A compiler doesn't require a new CPU design.

The multicore system targeted – and of course there's no guarantee this is what the funded project will ultimately produce – seems to be a conventional cache-coherent MP integrated with a Rapport Kilocore-style reconfigurable 2D array of few-bit (don't know how many, likely 2 or 4) ALUs and multipliers. Some inter-processor block data transfer and straightforward synchronization registers are there, too. Use of the accelerator can produce the usual kinds of accelerator speedups, like 24X over one core for MP3 encoding.

Except for their specific accelerator, this is fairly standard stuff for the embedded market. So far, I don't see anything justifying the huge cost of developing a new architecture and, more importantly, producing the never-ending stream of software support it requires: compilers, SDKs, development systems, simulators, etc.

One feature that does not appear standard is the power control. Apparently individual cores can have their frequency and voltage changed independently. For example, one core can run full tilt while another is at half-speed and a third is at quarter-speed. Embedded systems today, like IBM/Freescale PowerPC and ARM licensees, typically just provide on and off, with several variants of "off" using less power the longer it takes to turn on again.

All the scheduling, synchronization, and power control is under the control of the compiler. This is particularly useful when subtask times are knowable and you have a deadline that's less than flat-out performance. In those circumstances, the compiler can arrange execution to optimize power. For example, 60% less energy is needed to run a computational fluid dynamics benchmark (Applu) and 87% less for mpeg2encode. As a purely automatic result, this is pretty good. It didn't, in this case, use the accelerator.
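A toy sketch of the deadline idea (the greedy policy and all numbers here are my own invention, not OSCAR's actual scheduling algorithm): when a task has slack before its deadline, run it at the lowest frequency that still makes the deadline, since in the simplified power model above, energy per task drops linearly with frequency.

```python
def pick_frequency(work, deadline, freqs=(0.25, 0.5, 1.0)):
    # Choose the lowest relative frequency that still finishes `work`
    # (seconds at full speed) within `deadline`.
    for f in sorted(freqs):
        if work / f <= deadline:
            return f
    return max(freqs)  # no choice: run flat out and miss the deadline

def relative_energy(work, freq):
    # Energy = power * time; with power ~ freq**2 and time = work / freq,
    # energy per task scales linearly with frequency in this toy model.
    return (freq ** 2) * (work / freq)

f = pick_frequency(work=1.0, deadline=4.0)   # 0.25: plenty of slack
saving = 1 - relative_energy(1.0, f) / relative_energy(1.0, 1.0)
# saving is 0.75: 75% less energy, just by exploiting the slack
```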

Enough for a new architecture? I wouldn't think so. I don't see why they wouldn't, for example, license ARM or PowerPC and thereby get a huge leg up on the support software. Something else is driving this, and I'm not sure what. The Intel reference is, of course, just silly; it is instead competing with the very wide variety of embedded system chips. Of course, those have volumes 1000s of times larger than desktops and servers, so any perceived advantage has a huge multiplier.

Oh, and there’s no way this can be the basis of a new general-purpose desktop or server system. The synchronization and power control, which is key to the OSCAR compiler’s operation, has to be directly accessible in user mode for the compiler to play with. This is standard in embedded systems that run only one application, forever (like your car transmission), but necessarily anathema in a “general-purpose” system.