On The One Hand:
Oak Ridge National Laboratory (ORNL) is heading for a
20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to
18,000 GPUs.
This is, of course, neither a secret nor news. Look here,
or here,
or here
if you haven’t heard; it was particularly trumpeted at SC11 last November. They’re
upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere
2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring
10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.
The upgrade of the Jaguar Cray XT5 system will introduce new Cray
XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big
performance numbers come from new XK6
nodes, each of which replaces one (half) of the AMDs with an Nvidia Tesla 3000-series Kepler
compute accelerator GPU, as shown in the diagram. (The blue blocks are Cray’s Gemini
inter-node communication links.)
The actual performance is a range because it will “depend
on how many (GPUs) we can afford to buy,” according
to Jeff Nichols, ORNL's associate lab director for scientific computing. 20
PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all
the nodes are XK6s with their GPUs.
All this seems like a straightforward march of progress
these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business
as usual. The only news, and it is significant, is that it’s actually being
done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs
are, for good reason, the way to go these days. Lots and lots of GPUs.
On The Other Hand:
Oak Ridge has applications totaling at least 5 million
lines of code, most of which “does not run on GPGPUs and probably never will due to
cost and complexity” [emphasis added by me].
That’s what was said at an Intel press briefing at SC11
by Robert Harrison, a corporate fellow at ORNL and director of the National
Institute of Computational Sciences hosted at ORNL. He is the person working to get codes ported to
Knights Ferry, a pre-product software development kit based on Intel’s MIC
(Many Integrated Core) architecture. (See my prior post MIC
and the Knights for a short description of MIC and links to further
information.)
Video
of that entire briefing is available, but the things I’m referring to are all
the way towards the end, starting at about the 50-minute mark. The money slide
out of the entire
set is page 30:
(And I really wish whoever was making the video hadn’t run out of memory, or run out of battery, or had to leave for a potty break, or whatever else happened right after this page was presented; it wasn’t the last slide.)
The presenters said that they had actually ported “tens of millions” of lines of code, most functioning within one day. That does not mean the codes performed well in one day – see MIC and the Knights for important issues there – but they did say that they had decades of experience making vector codes work well, going all the way back to the Cray 1.
What Harrison says in the video about the possibility of GPU use is actually quite a bit more emphatic than the statement on the slide:
“Most of this software, I can confidently say since I'm working on them ... will not run on GPGPUs as we understand them right now, in part because of the sheer volume of software, millions of lines of code, and in large part because the algorithms, structures, and so on associated with the applications just simply don't have the massive parallelism required for fine grain [execution].”
All this is, of course, right up Intel’s alley, since
their target for MIC is source compatibility: Change a command-line flag, recompile,
done.
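To make that pitch concrete, here is the sort of perfectly ordinary OpenMP code Intel is talking about. This is a sketch of mine, not anything Intel showed: there is nothing MIC-specific in it, and the claim is that it simply gets recompiled with a different target flag. Since Intel hasn’t published that flag, the build notes in the comments are placeholders, not product details.

    /* saxpy.c: a perfectly ordinary OpenMP loop; nothing in it is MIC-specific.
     * Intel's pitch is that this same source simply gets recompiled for MIC.
     *
     *   Host build (real):         gcc -O2 -fopenmp saxpy.c -o saxpy
     *   MIC build (hypothetical):  same source plus a MIC target flag; Intel
     *                              has not published that flag, so any spelling
     *                              of it here would be a guess, not a fact.
     */
    #include <stdio.h>
    #include <stdlib.h>

    /* y = a*x + y over n elements. The "port" is the one pragma below; making
     * it run fast on MIC (vectorizing the inner work, blocking for cache) is
     * the part that still takes real effort, per MIC and the Knights. */
    static void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;

        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(n, 3.0f, x, y);

        printf("y[0] = %f\n", y[0]);   /* expect 5.000000 */
        free(x);
        free(y);
        return 0;
    }

Of course, getting code like this to compile for MIC and getting it to actually use MIC’s wide vector units are two very different things, which is exactly the gap discussed in MIC and the Knights.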
I can’t be alone in seeing a disconnect between the Titan
hype and these statements. They make it sound like they’re busy building a
system they can’t use, and I have too much respect for the folks at ORNL to
think that could be true.
So, how do we resolve this conundrum? I can think of
several ways, but they’re all speculation on my part. In no particular order:
- The 20 PFLOP
number is public relations hype. The contract with Cray is apparently quite
flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they
like, presumably including zero. That’s highly unlikely, but it does allow a “try
some and see if you like it” approach which might result in rather few XK6 nodes
installed.
- Harrison is
being overly conservative. When people really get down to it, perhaps
porting to GPGPUs won’t be all that painful -- particularly compared with the
vectorization required to really make MIC hum.
- Those MLOCs aren’t
important for Jaguar/Titan. Unless you have a clearance a lot higher than
the one I used to have, you have no clue what they are really running on
Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or
what they run there may slip smoothly onto GPGPUs, or they may be so important that a
GPGPU porting effort is deemed worthwhile.
- MIC doesn’t
arrive on time. MIC is still vaporware, after all, and the Jaguar/Titan
upgrade is starting now. (It’s a bit delayed because AMD’s having
trouble delivering those Interlagos Opterons, but the target start date is
already past.) The earliest firm deployment date I know of for MIC is at the Texas
Advanced Computing Center (TACC) at The University of Texas at Austin. Its new
Stampede system uses MIC
and deploys in 2013.
- Upgrading is a
lot simpler and cheaper – in direct cost and in operational changes – than installing
something that could use MIC. After all, Cray likes AMD, and uses AMD’s
inter-CPU interconnect to attach their Gemini inter-node network. This may not
hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia
chips are attached by PCI-e links. PCI-e is what Knights Ferry and Knights
Corner (the product version) use, so one could conceivably plug them in.
- MIC is too
expensive.
That last one requires a bit more explanation. Nvidia Teslas
are, in effect, subsidized by the volumes of their plain graphics GPUs. These
use the same architecture and can, to a significant degree, re-use chip designs.
As a result, the development cost to get Tesla products out the door is spread
across a vastly larger volume than the HPC market provides, allowing much lower
pricing than would otherwise be the case. Intel doesn’t have that volume
booster, and the price might turn out to reflect that.
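To see the shape of that argument, plug in some purely made-up numbers: amortize, say, $300 million of development cost over 30 million consumer GPUs that share the design and you’ve added $10 per chip; amortize the same $300 million over 100,000 HPC-only accelerators and you’ve added $3,000 per chip, before the first wafer is paid for. The real figures are anybody’s guess, but that is the lever Nvidia has and Intel, so far, doesn’t.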
That Nvidia advantage won’t last forever. Every time AMD
sells a Fusion system with GPU built in, or Intel sells one of their chips with
graphics integrated onto the silicon, another nail goes into the coffin of low-end
GPU volume. (See my post Nvidia-based
Cheap Supercomputing Coming to an End; the post turned out to be too
optimistic about Intel & AMD graphics performance, but the principle still
holds.) However, this volume advantage is still in force now, and may result in
a significantly higher cost for MIC-based units. We really have no idea how
Intel’s going to price MIC, though, so this is speculation until the MIC vapor
condenses into reality.
Some of the resolutions to this Tesla/MIC conflict may be
totally bogus, and reality may reflect a combination of reasons, but who knows?
As I said above, I’m speculating, a bit caught…
I’m just a little bit caught
in the middle
MIC is a dream, and Tesla’s a
riddle
I don’t know what to say, can’t
believe it all, I tried
I’ve got to let it go
And just enjoy the show.[1]
[1]
With apologies to Lenka, the artist who actually wrote the song the girl sings in
Moneyball. Great movie, by the way.
6 comments:
The Intel slide about "Experience with Knights Ferry... unparalleled productivity". I find the wording choice quite amusing. :-)
@Jeff - Aargh! Can't believe I didn't see that one. Thanks!
@Daniel - thanks!
There's no question that Nvidia certainly has a good market in ARM chips, including Tegra, and that won't go away any time soon.
Unfortunately, volumes on ARM chips don't really help Tesla and Tesla-like GPGPUs. That's a different design point, and doesn't share much if any silicon.
They surely will re-use their ARM cores in their new Project Denver HPC offerings. However, the big SIMD/SIMT whomper also on that chip won't, in the future, have a high-volume counterpart to juice its volumes, and that's a significant design effort.
I happened to speak with Bill Dally, Nvidia CTO, a few months ago and brought up this point. He basically hoped the low end didn't go away too soon.
Greg, with regard to the number of codes that can be ported to run on a GPU-based Titan system: Have you considered how many codes were ever actually run on Roadrunner at LANL? From everything I've read and heard, it is extremely challenging to program Roadrunner and get good performance out of the Cell processors (Roadrunner is a hybrid system, with AMD hosts and Cell accelerators). I'd be willing to bet that the only codes successfully ported to or written for Roadrunner involved little more than one or two simple but CPU intensive kernels; I didn't hear of any "multiphysics" codes being ported to Roadrunner. From a production computing standpoint, Roadrunner is largely a failure. From a "pushing the limits of technology" standpoint, though, it might be considered a success.
That's a pretty bad precedent for a GPU-based Titan, since Titan is intended for production computing. However, there are far more developers with experience programming Nvidia GPUs than there are programming Cell, so perhaps the situation won't be so bad.
My bet is that a handful of codes will be able to use the GPUs in Titan effectively, with at least a few of them having a kernel that can be tuned well enough to pull a big number that ORNL can tout. The rest of the codes, which can't effectively be ported to use GPUs, will just run on the CPUs; they'll get work done but won't make good use of the system. Oddly enough, that sounds like what I've heard about Roadrunner...
No, I hadn't considered that because, frankly, I didn't really know. Certainly Roadrunner has a similar set of issues.
Thanks for the information!
IMHO AMD's APU or an Atom/ARM chip packaged with a commodity GPU is the way to go. Who really cares about running a petaflop's worth of particles due to the extreme latency between iterations which is at minimum the speed of light distance between compute nodes? 99% of particle simulation computations fit on a single high end GPU. Big data problems are all about caching and striping across multiple machines to lower bottlenecks and have failover if a node dies which will start happening with probability one in the very near future.