The Perils of Parallel: 20 PFLOPS vs. 10s of MLOC: An Oak Ridge Conundrum

On The One Hand:

Oak Ridge National Laboratory (ORNL) is heading for a 20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to 18,000 GPUs.

This is, of course, neither a secret nor news. Look here, or here, or here if you haven’t heard; it was particularly trumpeted at SC11 last November. They’re upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere 2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring 10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.

The upgrade of the Jaguar Cray XT5 system will introduce new Cray XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big performance numbers come from new XK6 nodes, which replace two (half) of the AMDs with Nvidia Tesla 3000-series Kepler compute accelerator GPUs, as shown in the diagram. (The blue blocks are Cray’s Gemini inter-node communications.)

The actual performance is a range because it will “depend on how many (GPUs) we can afford to buy," according to Jeff Nichols, ORNL's associate lab director for scientific computing. 20 PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all the nodes are XK6s with their GPUs.
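For a rough sense of scale, here is a back-of-envelope sketch of what those figures imply per node. It assumes that peak performance divides evenly across nodes, which is my simplification and not anything ORNL or Cray has stated; the function and constant names are made up for illustration.

```python
def tflops_per_node(total_pflops: float, nodes: int) -> float:
    """Average peak TFLOPS per node, assuming performance divides evenly."""
    return total_pflops * 1000.0 / nodes  # 1 PFLOPS = 1000 TFLOPS


TITAN_TOP_PFLOPS = 20.0   # upper end of the quoted range
TITAN_MAX_NODES = 18_000  # XK6 node count quoted for the 20 PFLOPS figure
JAGUAR_PFLOPS = 2.3       # Jaguar's figure before the upgrade

per_node = tflops_per_node(TITAN_TOP_PFLOPS, TITAN_MAX_NODES)
speedup = TITAN_TOP_PFLOPS / JAGUAR_PFLOPS

print(f"{per_node:.2f} TFLOPS per XK6 node")
print(f"{speedup:.1f}x Jaguar")
```

Under that even-split assumption, the top-end number works out to a bit over 1 TFLOPS per node and roughly 8.7 times Jaguar.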

All this seems like a straightforward march of progress these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business as usual. The only news, and it is significant, is that it’s actually being done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs are, for good reason, the way to go these days. Lots and lots of GPUs.

On The Other Hand:

Oak Ridge has applications totaling at least 5 million lines of code, most of which “does not run on GPGPUs and probably never will due to cost and complexity” [emphasis added by me].

That’s what was said at an Intel press briefing at SC11 by Robert Harrison, a corporate fellow at ORNL and director of the National Institute for Computational Sciences hosted at ORNL. He is the person working to get codes ported to Knights Ferry, a pre-product software development kit based on Intel’s MIC (Many Integrated Core) architecture. (See my prior post MIC and the Knights for a short description of MIC and links to further information.)

Video of that entire briefing is available, but the things I’m referring to are all the way towards the end, starting at about the 50 minute mark. The money slide out of the entire set is page 30:

(And I really wish whoever was making the video didn’t run out of memory, or run out of battery, or have to leave for a potty break, or whatever else right after this page was presented; it's not the last.)

Harrison said that they had actually ported “tens of millions” of lines of code, most functioning within one day. That does not mean they performed well in one day – see MIC and the Knights for important issues there – but he did say that they had decades of experience making vector codes work well, going all the way back to the Cray 1.

What Harrison says in the video about the possibility of GPU use is actually quite a bit more emphatic than the statement on the slide:

"Most of this software, I can confidently say since I'm working on them ... will not run on GPGPUs as we understand them right now, in part because of the sheer volume of software, millions of lines of code, and in large part because the algorithms, structures, and so on associated with the applications are just simply don't have the massive parallelism required for fine grain [execution]."
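To make the "massive parallelism required for fine grain" point concrete, here is a toy sketch. It is purely illustrative and not drawn from any ORNL code; both function names are hypothetical. The first loop is the kind of work a GPU likes, since every element can be computed independently. The second carries a dependence from one iteration to the next, so as written it has to run in order.

```python
def scale_all(xs, a):
    # Each output depends only on its own input, so all elements
    # could be computed at the same time (fine-grained parallelism).
    return [a * x for x in xs]


def running_recurrence(xs, a):
    # Each step needs the result of the previous step (a loop-carried
    # dependence), so as written the iterations must run one after another.
    out = []
    state = 0.0
    for x in xs:
        state = a * state + x
        out.append(state)
    return out


print(scale_all([1.0, 2.0, 3.0], 2.0))
print(running_recurrence([1.0, 2.0, 3.0], 2.0))
```

Some recurrences like the second one can be restructured into parallel form, but that means reworking the algorithm rather than recompiling, which is exactly the kind of cost being described.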

All this is, of course, right up Intel’s alley, since their target for MIC is source compatibility: Change a command-line flag, recompile, done.

I can’t be alone in seeing a disconnect between the Titan hype and these statements. They make it sound like they’re busy building a system they can’t use, and I have too much respect for the folks at ORNL to think that could be true.

So, how do we resolve this conundrum? I can think of several ways, but they’re all speculation on my part. In no particular order:

- The 20 PFLOPS number is public relations hype. The contract with Cray is apparently quite flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they like, presumably including zero. That’s highly unlikely, but it does allow a “try some and see if you like it” approach which might result in rather few XK6 nodes installed.

- Harrison is being overly conservative. When people really get down to it, perhaps porting to GPGPUs won’t be all that painful -- particularly compared with the vectorization required to really make MIC hum.

- Those MLOCs aren’t important for Jaguar/Titan. Unless you have a clearance a lot higher than the one I used to have, you have no clue what they are really running on Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or what they run there may slip smoothly onto GPGPUs, or they may be so important a GPGPU porting effort is deemed worthwhile.

- MIC doesn’t arrive on time. MIC is still vaporware, after all, and the Jaguar/Titan upgrade is starting now. (It’s a bit delayed because AMD’s having trouble delivering those Interlagos Opterons, but the target start date is already past.) The earliest firm deployment date I know of for MIC is at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. Its new Stampede system uses MIC and deploys in 2013.

- Upgrading is a lot simpler and cheaper – in direct cost and in operational changes – than installing something that could use MIC. After all, Cray likes AMD, and uses AMD’s inter-CPU interconnect to attach their Gemini inter-node network. This may not hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia chips are attached by PCI-e links. PCI-e is what Knights Ferry and Knights Corner (the product version) use, so one could conceivably plug them in.

- MIC is too expensive.

That last one requires a bit more explanation. Nvidia Teslas are, in effect, subsidized by the volumes of their plain graphics GPUs. These use the same architecture and can to a significant degree re-use chip designs. As a result, the development cost to get Tesla products out the door is spread across a vastly larger volume than the HPC market provides, allowing much lower pricing than would otherwise be the case. Intel doesn’t have that volume booster, and the price might turn out to reflect that.

That Nvidia advantage won’t last forever. Every time AMD sells a Fusion system with GPU built in, or Intel sells one of their chips with graphics integrated onto the silicon, another nail goes into the coffin of low-end GPU volume. (See my post Nvidia-based Cheap Supercomputing Coming to an End; the post turned out to be too optimistic about Intel & AMD graphics performance, but the principle still holds.) However, this volume advantage is still in force now, and may result in a significantly higher cost for MIC-based units. We really have no idea how Intel’s going to price MIC, though, so this is speculation until the MIC vapor condenses into reality.

Some of the resolutions to this Tesla/MIC conflict may be totally bogus, and reality may reflect a combination of reasons, but who knows? As I said above, I’m speculating, a bit caught…

I’m just a little bit caught in the middle

MIC is a dream, and Tesla’s a riddle

I don’t know what to say, can’t believe it all, I tried

I’ve got to let it go

And just enjoy the show.[1]

[1] With apologies to Lenka, the artist who actually wrote the song the girl sings in Moneyball. Great movie, by the way.

6 comments:

Jeff Squyres said...: The Intel slide about "Experience with Knights Ferry... unparalleled productivity". I find the wording choice quite amusing. :-); January 9, 2012 at 6:53 PM
Greg Pfister said...: @Jeff - Aargh! Can't believe I didn't see that one. Thanks!; January 9, 2012 at 7:10 PM
Greg Pfister said...: @Daniel - thanks!

There's no question that Nvidia certainly has a good market in ARM chips, including Tegra, and that won't go away any time soon.

Unfortunately, volumes on ARM chips don't really help Tesla and Tesla-like GPGPUs. That's a different design point, and doesn't share much if any silicon.

They surely will re-use their ARM cores in their new Project Denver HPC offerings. However, the big SIMD/SIMT whomper also on that chip won't, in the future, have a high-volume counterpart to juice its volumes, and that's a significant design effort.

I happened to speak with Bill Dally, Nvidia CTO, a few months ago and brought up this point. He basically hoped the low end didn't go away too soon.; January 9, 2012 at 7:22 PM
Anonymous said...: Greg, with regard to the number of codes that can be ported to run on a GPU-based Titan system: Have you considered how many codes were ever actually run on Roadrunner at LANL? From everything I've read and heard, it is extremely challenging to program Roadrunner and get good performance out of the Cell processors (Roadrunner is a hybrid system, with AMD hosts and Cell accelerators). I'd be willing to bet that the only codes successfully ported to or written for Roadrunner involved little more than one or two simple but CPU intensive kernels; I didn't hear of any "multiphysics" codes being ported to Roadrunner. From a production computing standpoint, Roadrunner is largely a failure. From a "pushing the limits of technology" standpoint, though, it might be considered a success.

That's a pretty bad precedent for a GPU-based Titan, since Titan is intended for production computing. However, there are far more developers with experience programming Nvidia GPUs than there are programming Cell, so perhaps the situation won't be so bad.

My bet is that a handful of codes will be able to use the GPUs in Titan effectively, with at least a few of them having a kernel that can be tuned well enough to pull a big number that ORNL can tout. The rest of the codes, which can't effectively be ported to use GPUs, will just run on the CPUs; they'll get work done but won't make good use of the system. Oddly enough, that sounds like what I've heard about Roadrunner...; January 15, 2012 at 8:43 PM
Greg Pfister said...: No, I hadn't considered that because, frankly, I didn't really know. Certainly Roadrunner has a similar set of issues.

Thanks for the information!; January 16, 2012 at 12:00 PM
Chad Brewbaker said...: IMHO AMD's APU or an Atom/ARM chip packaged with a commodity GPU is the way to go. Who really cares about running a petaflop's worth of particles due to the extreme latency between iterations which is at minimum the speed of light distance between compute nodes? 99% of particle simulation computations fit on a single high end GPU. Big data problems are all about caching and striping across multiple machines to lower bottlenecks and have failover if a node dies which will start happening with probability one in the very near future.; January 16, 2012 at 3:01 PM

Monday, January 9, 2012
