On The One Hand:
Oak Ridge National Laboratory (ORNL) is heading for a
20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to
18,000 GPUs.
This is, of course, neither a secret nor news. Look here,
or here,
or here
if you haven’t heard; it was particularly trumpeted at SC11 last November. They’re
upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere
2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring
10 to 20 petaflops. Jaguar and Titan are shown below. Presumably more interesting panel art will ultimately be provided for Titan.
The upgrade of the Jaguar Cray XT5 system will introduce new Cray
XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big
performance numbers come from new XK6
nodes, which replace one (half) of the two AMDs with an Nvidia Tesla 3000-series Kepler
compute accelerator GPU, as shown in the diagram. (The blue blocks are Cray’s Gemini
inter-node communications.)
The actual performance is a range because it will “depend
on how many (GPUs) we can afford to buy,” according
to Jeff Nichols, ORNL's associate lab director for scientific computing. 20
PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all
the nodes are XK6s with their GPUs.
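A quick back-of-envelope check (my arithmetic, not ORNL’s, and peak-rate hand-waving at that): hitting 20 PFLOPS with 18,000 GPU-equipped nodes means each node has to contribute roughly

\[
\frac{20\ \text{PFLOPS}}{18{,}000\ \text{nodes}} \approx 1.1\ \text{TFLOPS per node},
\]

which is in the right neighborhood for what a Kepler-class accelerator plus its Opteron host is expected to deliver in peak double precision.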
All this seems like a straightforward march of progress
these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business
as usual. The only news, and it is significant, is that it’s actually being
done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs
are, for good reason, the way to go these days. Lots and lots of GPUs.
On The Other Hand:
Oak Ridge has applications totaling at least 5 million
lines of code, most of which “does not run on GPGPUs and probably never will due to
cost and complexity” [emphasis added by me].
That’s what was said at an Intel press briefing at SC11
by Robert Harrison, a corporate fellow at ORNL and director of the National
Institute for Computational Sciences hosted at ORNL. He is the person working to get codes ported to
Knights Ferry, a pre-product software development kit based on Intel’s MIC
(Many Integrated Core) architecture. (See my prior post MIC
and the Knights for a short description of MIC and links to further
information.)
Video
of that entire briefing is available, but the things I’m referring to are all
the way towards the end, starting at about the 50 minute mark. The money slide
out of the entire
set is page 30:
(And I really wish whoever was making the video didn’t
run out of memory, or run out of battery, or have to leave for a potty break, or whatever else
right after this page was presented; it's not the last.)
The presenters said that they had actually ported “tens
of millions” of lines of code, most functioning within one day. That does not
mean they performed well in one day – see MIC
and the Knights for important issues there – but Harrison did say that they had
decades of experience making vector codes work well, going all the way back to
the Cray 1.
What Harrison says in the video about the possibility of
GPU use is actually quite a bit more emphatic than the statement on the slide:
“Most of this software, I can
confidently say since I'm working on them ... will not run on GPGPUs as we
understand them right now, in part because of the sheer volume of software,
millions of lines of code, and in large part because the algorithms,
structures, and so on associated with the applications are just simply don't
have the massive parallelism required for fine grain [execution].”
All this is, of course, right up Intel’s alley, since
their target for MIC is source compatibility: Change a command-line flag, recompile,
done.
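To make the contrast concrete, here is a toy sketch of my own (nothing to do with ORNL’s actual codes): a dead-simple SAXPY loop as it might sit in legacy source, next to what a GPGPU port turns it into. Intel’s pitch is that the top function just recompiles for MIC; the CUDA version shows the kernel restructuring and explicit PCI-e data shuffling a GPU demands even for a loop that does have plenty of fine-grain parallelism.

```
// Toy illustration only -- my sketch, not ORNL code or vendor reference code.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Legacy-style host loop: per Intel's pitch, code like this just
// recompiles for MIC (how well it vectorizes is another question).
void saxpy_host(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// The GPGPU port: the loop body becomes a kernel, one fine-grained
// thread per element -- the "massive parallelism" Harrison says much
// of this software simply doesn't have.
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The porting tax: explicit device allocation and copies over PCI-e.
    float *dx, *dy;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

    saxpy_kernel<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %g (expect 4)\n", y[0]);

    cudaFree(dx); cudaFree(dy);
    free(x); free(y);
    return 0;
}
```

Even this trivially parallel example needs its data staged across PCI-e by hand; now imagine that exercise across millions of lines of irregular, decades-old code.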
I can’t be alone in seeing a disconnect between the Titan
hype and these statements. They make it sound like they’re busy building a
system they can’t use, and I have too much respect for the folks at ORNL to
think that could be true.
So, how do we resolve this conundrum? I can think of
several ways, but they’re all speculation on my part. In no particular order:
- The 20 PFLOPS
number is public relations hype. The contract with Cray is apparently quite
flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they
like, presumably including zero. That’s highly unlikely, but it does allow a “try
some and see if you like it” approach which might result in rather few XK6 nodes
installed.
- Harrison is
being overly conservative. When people really get down to it, perhaps
porting to GPGPUs won’t be all that painful -- particularly compared with the
vectorization required to really make MIC hum.
- Those MLOCs aren’t
important for Jaguar/Titan. Unless you have a clearance a lot higher than
the one I used to have, you have no clue what they are really running on
Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or
what they run there may slip smoothly onto GPGPUs, or they may be so important a
GPGPU porting effort is deemed worthwhile.
- MIC doesn’t
arrive on time. MIC is still vaporware, after all, and the Jaguar/Titan
upgrade is starting now. (It’s a bit delayed because AMD’s having
trouble delivering those Interlagos Opterons, but the target start date is
already past.) The earliest firm deployment date I know of for MIC is at the Texas
Advanced Computing Center (TACC) at The University of Texas at Austin. Its new
Stampede system uses MIC
and deploys in 2013.
- Upgrading is a
lot simpler and cheaper – in direct cost and in operational changes – than installing
something that could use MIC. After all, Cray likes AMD, and uses AMD’s
HyperTransport inter-CPU interconnect to attach its Gemini inter-node network. This may not
hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia
chips are attached by PCI-e links. PCI-e is what Knights Ferry and Knights
Corner (the product version) use, so one could conceivably plug them in.
- MIC is too
expensive.
That last one requires a bit more explanation. Nvidia Teslas
are, in effect, subsidized by the volumes of their plain graphics GPUs. These
use the same architecture and can, to a significant degree, re-use chip designs.
As a result, the development cost to get Tesla products out the door is spread
across a vastly larger volume than the HPC market provides, allowing much lower
pricing than would otherwise be the case. Intel doesn’t have that volume
booster, and the price might turn out to reflect that.
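To put entirely made-up numbers on that amortization argument:

\[
\text{unit price} \approx \text{marginal cost} + \frac{D}{N},
\]

where \(D\) is the development cost to be recouped and \(N\) is the unit volume. A hypothetical \$300M development bill spread over 10 million consumer GPUs adds \$30 per chip; spread over only 100,000 HPC-only accelerators, it adds \$3,000. The numbers are invented, but the shape of the curve is why consumer volume matters.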
That Nvidia advantage won’t last forever. Every time AMD
sells a Fusion system with GPU built in, or Intel sells one of their chips with
graphics integrated onto the silicon, another nail goes into the coffin of low-end
GPU volume. (See my post Nvidia-based
Cheap Supercomputing Coming to an End; the post turned out to be too
optimistic about Intel & AMD graphics performance, but the principle still
holds.) However, this volume advantage is still in force now, and may result in
a significantly higher cost for MIC-based units. We really have no idea how
Intel’s going to price MIC, though, so this is speculation until the MIC vapor
condenses into reality.
Some of the resolutions to this Tesla/MIC conflict may be
totally bogus, and reality may reflect a combination of reasons, but who knows?
As I said above, I’m speculating, a bit caught…
I’m just a little bit caught
in the middle
MIC is a dream, and Tesla’s a
riddle
I don’t know what to say, can’t
believe it all, I tried
I’ve got to let it go
And just enjoy the show.[1]
[1]
With apologies to Lenka, the artist who actually wrote the song the girl sings in
Moneyball. Great movie, by the way.