
Monday, August 15, 2011

IBM Dumps Blue Waters – Final Curtain on the Old Days

IBM has pulled out of the much-touted Blue Waters supercomputer project, a joint effort of IBM and the National Center for Supercomputing Applications at the University of Illinois that was supposed to produce one petaflops of sustained performance by the end of 2012. Googling “IBM Blue Waters” and selecting “news” will give you a bevy of reports on this, so I’m going to refrain from duplicating what everybody else has said.

I don’t have any inside scoop on this, in the sense that I have no under-the-table secret contacts or communication channels back into IBM. However, I can connect some dots already out there, based on my experience leading one flashy HPC project (RP3) back in the 1980s (possibly the first IBM did), and being close to such projects after that. My conclusion: There has been a major change in IBM executive management’s attitude towards flashy HPC projects, a change that is probably the final shoe dropping on the “good old days” of IT architecture research.

I deduce the attitude change from an HPCwire interview with Herb Schultz, marketing manager for IBM's Deep Computing unit, in which he said a while ago that “There is really no appetite in IBM anymore – with some of the leadership changes over the last few years – for revenue that has no profit with it.”

So, IBM wants to make money on its high performance computing products. What’s wrong with that? Nothing. As every IBM manager is taught in their first management training – at least I was – the purpose of IBM isn’t to advance technology, or make the world a better place, or be a good corporate citizen; it’s to make money. (Those were the multiple choices in a quiz, by the way.) It’s perfectly obvious that any company that doesn’t make money, and thereby stay in business, can’t do anything. It’s like the first and most important rule of breathing I was taught in Tai Chi, which was: Breathe. If you don’t do that, you won’t be around long.

But as everyone should also know, there’s a focus on making money now, directly, measurably; and there’s setting up to make more money in the future. The first is needed; but if done exclusively, without the second, your corporate lifetime is also being limited – rather like living on a tasty but unhealthy diet.

I recall distinctly the response of Ralph Gomory, then IBM Senior VP of Science and Technology, to a cadre of high-level development managers who were complaining about the cost of some HPC project and proposing to kill it. He told them, approximately, “This will make you money in ways you can’t conceive of.” He was right. What such projects return isn’t money, directly; it’s column-inches on the front page of the New York Times and similar media.

This works. I’ve recounted in a much earlier post a case I was involved in where an IBM account rep absolutely owned the entire IT account of a large, conservative retailer in the Midwest – because an IBM RISC system was given the credit for beating Kasparov. (Winning Jeopardy! hardly has the same cachet.)

Also, while it may be hard to fathom now, there was a time when computer architecture and hardware development research was simply pursued for its own sake, primarily because we might find something out by doing it, without knowing what that might be.

This also works. My personal example of that is tree saturation[1] (a.k.a. congestion spreading, but in non-lossy networks), which I and Alan Norton serendipitously discovered in the RP3 project. I distinctly recall involuntarily standing and my whole body stiffening when I looked at the graphs revealing it, and realized what was happening. It was my own personal “eureka!” kind of moment. We’d no clue we’d find that, and it was the occasion of my only recursive award – an award from IBM research for getting an award for the paper on it. Gomory (who, coincidentally, was Research Division president at the time) said that was exactly the kind of thing he had hoped to get from RP3.

However, two things have changed since then: There’s a much stronger focus on showing results today (which the IBM stock price rise duly reflects). And the cost of entry has become quite a bit higher, particularly entries like Blue Waters.

Back when Gomory said what I recounted above, IBM was riding high on steady income from mainframes and their software. Those still bring in substantial money, particularly via the drag of software along with them (which the hardware guys aren’t allowed to count… grrr…). Now, though, the software business has moved on to the much more competitive arena of stand-alone software products that run on a variety of platforms. And of course there is now the whole services business, which practically didn’t exist back then.

In addition, the cost of entry has skyrocketed. Back when I was involved in RP3, we had a contract with DARPA that brought in a whole $1M or so, which paid something like half the real bill. Compare that with El Reg’s estimate that a single Blue Waters rack is an $8M proposition, with over 200 racks needed for the final configuration, and you’re over $1B. Those are all rough numbers, and they’re retail, not cost (an impossible number to pin down from outside), but you can see how the table stakes have gotten beyond many of the highest high rollers’ stashes.

So I’m going to label this pullout from Blue Waters as the ringing down of the last curtain on an era of free-wheeling, profit-unconstrained research into computer architecture and systems.

It was fun while it lasted, but now, no matter what you do, the issue is where and when the profit comes out. That’s normal now, but I think we need to remember that it was not always so.


[1] I’d like to give a URL for that, but it was back in the early ’80s, pre-web. There are lots of papers still out there about avoiding or fixing it (many wrong) that you can find by Googling “tree saturation,” though. Finally figured out how to fix it in InfiniBand. Complicated. Possibly not worth the effort. Added: Since someone asked, here’s the bibliographic information on the paper: G. F. Pfister and V. A. Norton, “‘Hot spot’ contention and combining in multistage interconnection networks,” IEEE Transactions on Computers, vol. C-34, no. 10, pp. 943-948, 1985.

Saturday, September 4, 2010

Intel Graphics in Sandy Bridge: Good Enough


As I and others expected, Intel is gradually rolling out how much better the graphics in its next generation will be. Anandtech got an early demo part of Sandy Bridge and checked out the graphics, among other things. The results show that the "good enough" performance I argued for in my prior post (Nvidia-based Cheap Supercomputing Coming to an End) will indeed be good enough to sink third-party low-end graphics chipsets. So it's good enough to hurt Nvidia's business model and make their HPC products fully carry their own development burden, raising prices notably.

The net is that for this early chip, with early device drivers, at a low but usable resolution (1024x768), there's adequate performance on games like "Batman: Arkham Asylum," "Call of Duty MW2," and a bunch of others, significantly including "World of Warcraft." And it'll play Blu-ray 3D, too.

Anandtech's conclusion is "If this is the low end of what to expect, I'm not sure we'll need more than integrated graphics for non-gaming specific notebooks." I agree. I'd add desktops, too. Nvidia isn't standing still, of course; on the low end they're saying they'll do 3D, too, and will save power. But integrated graphics are, effectively, free. They'll be there anyway. Everywhere. And as a result, everything will be tuned to work best on them among the PC platforms; that's where the volumes will be.

Some comments I've received elsewhere on my prior post have been along the lines of "but Nvidia has such a good computing model and such good software support – Intel's rotten IGP can't match that." True. I agree. But.

There's a long history of ugly architectures dominating clever, elegant architectures that are superior targets for coding and compiling. Where are the RISC-based CAD workstations of 15+ years ago? They turned into PCs with graphics cards. The DEC Alpha, MIPS, Sun SPARC, IBM POWER and others, all arguably far better exemplars of the computing art, have been trounced by x86, which nobody would call elegant. Oh, and the IBM zSeries, also high on the inelegant ISA scale, just keeps truckin' through the decades, most recently at an astounding 5.2 GHz.

So we're just repeating history here. Volume, silicon technology, and market will again trump elegance and computing model.



PostScript: According to Bloomberg, look for a demo at Intel Developer Forum next week.

Tuesday, June 8, 2010

Ten Ways to Trash your Performance Credibility


Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming.

Danger: The showiest blooms are toxic to your credibility.

(My wife is planting flowers these days. Can you tell?)

There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:

"I would have to be pretty drunk to believe that."

Your great, customer-verified, most wonderful results have trashed your credibility.

Here are some examples:

In a recent talk, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.

In an online discussion on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a Quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under non-disclosure agreement.

In a video interview, a customer doing cell pathology image analysis reports their task going from 400 minutes to 65 milliseconds, for a speedup of just under 370,000X. (Update: Typo, he really does say "minutes" in the video.)
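
Taking those claims at face value, the arithmetic behind them is easy to check. A quick sketch (plain Python, using only the before/after figures quoted above):

```python
# Sanity-check the speedup arithmetic behind the claims quoted above.

def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of original runtime to new runtime."""
    return before_seconds / after_seconds

# Cray anecdote: 12 hours down to 1.2 seconds.
cray = speedup(12 * 3600, 1.2)          # roughly 36,000X

# Cell-pathology anecdote: 400 minutes down to 65 milliseconds.
pathology = speedup(400 * 60, 0.065)    # just under 370,000X

print(f"Cray claim: {cray:,.0f}X")
print(f"Pathology claim: {pathology:,.0f}X")
```

The numbers check out exactly as quoted; the problem isn't the division, it's what's being divided.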

None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find him- or herself dismissed as, well, the polite version would be that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.

What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.

This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler without vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.
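
The accounting in that story generalizes: the end-to-end speedup people quote factors cleanly into a cleanup share and a hardware share. A sketch with made-up timings (the numbers below are hypothetical, chosen only to illustrate the decomposition):

```python
# Decompose an end-to-end speedup into a cleanup share and a hardware
# share. All timings are hypothetical (seconds), per the story above.
t_grotty = 1000.0   # original grotty code, original hardware
t_clean  = 40.0     # cleaned-up code, no special hardware enabled
t_accel  = 25.0     # cleaned-up code, special hardware enabled

total_speedup    = t_grotty / t_accel   # 40X, the number people quote
cleanup_speedup  = t_grotty / t_clean   # 25X came just from the cleanup
hardware_speedup = t_clean / t_accel    # 1.6X is the hardware's real share

# The end-to-end number is the product of the two shares.
assert abs(total_speedup - cleanup_speedup * hardware_speedup) < 1e-9
print(f"{total_speedup:.0f}X total = {cleanup_speedup:.0f}X cleanup "
      f"x {hardware_speedup:.1f}X hardware")
```

Run the middle measurement, and the hardware's contribution drops from impressive to honest.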

The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related modernized version of this story is told in a recent paper Believe It or Not! Multi-core CPUs can Match GPU Performance, where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's using, great things can happen.

This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that POWER7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution and, using that information, present your results with credit given to the appropriate actions.

The situation we're in is identical to the one that led someone (wish I remembered who), decades ago, to write a short paper titled, approximately, Ten Ways to Lie about Parallel Processing. I thought I kept a copy, but if I did I can't find it. It was back at the dawn of whatever, and I can't find it now even with Google Scholar. (If anyone out there knows the paper I'm referencing, please let me know.) Got it! It's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, by David H. Bailey. Thank you, Roland!

In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:

  1. Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task.
  2. Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,
  3. Pick a problem size that just barely fits into the local memory you have available. Why? See #4.
  4. Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not.
  5. Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.
  6. Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.
  7. Allow a slightly different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't.
  8. Change the operating system. Pick the one that does IO to your device fastest.
  9. Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!
  10. Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.
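
Item 7's caveat is easy to demonstrate for yourself. The sketch below (plain Python doubles, but the same holds in any IEEE 754 arithmetic) hunts for random triples where distributing the multiply changes the rounded result:

```python
# Demonstrate item 7: in floating point, a*(x+y) is not always a*x + a*y,
# because each individual operation rounds to the nearest representable
# double, and the two expressions round at different points.
import random

random.seed(42)  # reproducible run

trials = 10_000
mismatches = 0
for _ in range(trials):
    a, x, y = (random.random() for _ in range(3))
    if a * (x + y) != a * x + a * y:
        mismatches += 1

print(f"{mismatches} of {trials} random triples violate distributivity")
```

The differences are only a unit or two in the last place, but "slightly different answer" is exactly what item 7 is warning about.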

This, of course, is just a start. I'm sure there are another ten or a hundred out there.
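
The first item on the list is worth quantifying, because it's so common. Amdahl's law puts a hard ceiling on the whole-task speedup when only a fraction of the runtime is accelerated. A sketch, using the hypothetical 5% kernel from item 1:

```python
# Amdahl's law: if only a fraction f of the total runtime is accelerated
# by a factor s, the whole-task speedup is 1 / ((1 - f) + f / s).

def overall_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

# A kernel that is 5% of total runtime, accelerated 100X:
print(overall_speedup(0.05, 100.0))   # barely over 1.05X overall

# Even an infinitely fast kernel can't beat 1 / (1 - f):
print(1.0 / (1.0 - 0.05))             # a ceiling of about 1.053X
```

Quote the 100X kernel number alone and you've overstated the whole-task result by two orders of magnitude.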

A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes.

For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list before you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.

*****************

Thanks to John Melonakos of Accelereyes for discussion and sharing his thoughts on this topic.