The Perils of Parallel: January 2012

On The One Hand:

Oak Ridge National Laboratory (ORNL) is heading for a 20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to 18,000 GPUs.

This is, of course, neither a secret nor news. Look here, or here, or here if you haven’t heard; it was particularly trumpeted at SC11 last November. They’re upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere 2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring 10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.

The upgrade of the Jaguar Cray XT5 system will introduce new Cray XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big performance numbers come from new XK6 nodes, which replace two (half) of the AMDs with Nvidia Tesla 3000-series Kepler compute accelerator GPUs, as shown in the diagram. (The blue blocks are Cray’s Gemini inter-node communications.)

The actual performance is a range because it will “depend on how many (GPUs) we can afford to buy,” according to Jeff Nichols, ORNL's associate lab director for scientific computing. 20 PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all the nodes are XK6s with their GPUs.
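As a rough sanity check on those figures, here is a small back-of-the-envelope sketch. It uses only the numbers quoted above (20 PFLOPS peak, 18,000 XK6 nodes, Jaguar's 2.3 PFLOPS) and assumes, purely for illustration, that peak performance divides evenly across nodes. The function and variable names are mine, not anything from ORNL or Cray.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the post.
# Assumption (mine): peak FLOPS is spread evenly over all XK6 nodes.

PFLOPS = 1e15
TFLOPS = 1e12


def per_node_tflops(system_pflops: float, nodes: int) -> float:
    """Average peak TFLOPS each node must supply to reach the system total."""
    return system_pflops * PFLOPS / nodes / TFLOPS


titan_peak_pflops = 20.0   # upper end of the quoted 10-20 PFLOPS range
xk6_nodes = 18_000         # node count at which 20 PFLOPS is reached
jaguar_pflops = 2.3        # Jaguar's quoted performance

per_node = per_node_tflops(titan_peak_pflops, xk6_nodes)
growth = titan_peak_pflops / jaguar_pflops

print(f"{per_node:.2f} TFLOPS per XK6 node")   # about 1.11
print(f"{growth:.1f}x Jaguar")                 # about 8.7
```

In other words, each XK6 node has to deliver a bit over a teraflop on average, which is why the total hinges on how many GPU-equipped nodes actually get bought.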

All this seems like a straightforward march of progress these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business as usual. The only news, and it is significant, is that it’s actually being done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs are, for good reason, the way to go these days. Lots and lots of GPUs.

On The Other Hand:

Oak Ridge has applications totaling at least 5 million lines of code most of which “does not run on GPGPUs and probably never will due to cost and complexity” [emphasis added by me].

That’s what was said at an Intel press briefing at SC11 by Robert Harrison, a corporate fellow at ORNL and director of the National Institute for Computational Sciences hosted at ORNL. He is the person working to get codes ported to Knights Ferry, a pre-product software development kit based on Intel’s MIC (Many Integrated Core) architecture. (See my prior post MIC and the Knights for a short description of MIC and links to further information.)

Video of that entire briefing is available, but the things I’m referring to are all the way towards the end, starting at about the 50 minute mark. The money slide out of the entire set is page 30:

(And I really wish whoever was making the video didn’t run out of memory, or run out of battery, or have to leave for a potty break, or whatever else right after this page was presented; it's not the last.)

The presenters said that they had actually ported “tens of millions” of lines of code, most functioning within one day. That does not mean the codes performed well in one day – see MIC and the Knights for important issues there – but Harrison did say that they had decades of experience making vector codes work well, going all the way back to the Cray 1.

What Harrison says in the video about the possibility of GPU use is actually quite a bit more emphatic than the statement on the slide:

“Most of this software, I can confidently say since I'm working on them ... will not run on GPGPUs as we understand them right now, in part because of the sheer volume of software, millions of lines of code, and in large part because the algorithms, structures, and so on associated with the applications are just simply don't have the massive parallelism required for fine grain [execution].”

All this is, of course, right up Intel’s alley, since their target for MIC is source compatibility: Change a command-line flag, recompile, done.

I can’t be alone in seeing a disconnect between the Titan hype and these statements. They make it sound like they’re busy building a system they can’t use, and I have too much respect for the folks at ORNL to think that could be true.

So, how do we resolve this conundrum? I can think of several ways, but they’re all speculation on my part. In no particular order:

- The 20 PFLOPS number is public relations hype. The contract with Cray is apparently quite flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they like, presumably including zero. That’s highly unlikely, but it does allow a “try some and see if you like it” approach which might result in rather few XK6 nodes installed.

- Harrison is being overly conservative. When people really get down to it, perhaps porting to GPGPUs won’t be all that painful -- particularly compared with the vectorization required to really make MIC hum.

- Those MLOCs aren’t important for Jaguar/Titan. Unless you have a clearance a lot higher than the one I used to have, you have no clue what they are really running on Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or what they run there may slip smoothly onto GPGPUs, or they may be so important a GPGPU porting effort is deemed worthwhile.

- MIC doesn’t arrive on time. MIC is still vaporware, after all, and the Jaguar/Titan upgrade is starting now. (It’s a bit delayed because AMD’s having trouble delivering those Interlagos Opterons, but the target start date is already past.) The earliest firm deployment date I know of for MIC is at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. Its new Stampede system uses MIC and deploys in 2013.

- Upgrading is a lot simpler and cheaper – in direct cost and in operational changes – than installing something that could use MIC. After all, Cray likes AMD, and uses AMD’s inter-CPU interconnect to attach their Gemini inter-node network. This may not hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia chips are attached by PCI-e links. PCI-e is what Knights Ferry and Knights Corner (the product version) use, so one could conceivably plug them in.

- MIC is too expensive.

That last one requires a bit more explanation. Nvidia Teslas are, in effect, subsidized by the volumes of their plain graphics GPUs. These use the same architecture and can to a significant degree re-use chip designs. As a result, the development cost to get Tesla products out the door is spread across a vastly larger volume than the HPC market provides, allowing much lower pricing than would otherwise be the case. Intel doesn’t have that volume booster, and the price might turn out to reflect that.

That Nvidia advantage won’t last forever. Every time AMD sells a Fusion system with GPU built in, or Intel sells one of their chips with graphics integrated onto the silicon, another nail goes into the coffin of low-end GPU volume. (See my post Nvidia-based Cheap Supercomputing Coming to an End; the post turned out to be too optimistic about Intel & AMD graphics performance, but the principle still holds.) However, this volume advantage is still in force now, and may result in a significantly higher cost for MIC-based units. We really have no idea how Intel’s going to price MIC, though, so this is speculation until the MIC vapor condenses into reality.

Some of the resolutions to this Tesla/MIC conflict may be totally bogus, and reality may reflect a combination of reasons, but who knows? As I said above, I’m speculating, a bit caught…

I’m just a little bit caught in the middle

MIC is a dream, and Tesla’s a riddle

I don’t know what to say, can’t believe it all, I tried

I’ve got to let it go

And just enjoy the show.[1]

[1] With apologies to Lenka, the artist who actually wrote the song the girl sings in Moneyball. Great movie, by the way.

Monday, January 9, 2012

20 PFLOPS vs. 10s of MLOC: An Oak Ridge Conundrum