Also, there was much attempted discussion of the 2012
product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion
was much attempted by me, anyway, with decidedly limited success. There were
some hints, and some things can be deduced, but the real KC hasn’t stood up
yet. That reticence is probably a turn for the better, since KF is the direct
descendant of Intel’s Larrabee graphics engine, which was quite prematurely
trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only
to eventually be dropped – to become KF. A bit more circumspection is now
certainly called for.
This circumspection does, however, make it difficult to
separate what I learned into neat KF or KC buckets; KC is just too well hidden
so far. Here are my best guesses, answering questions I received from Twitter
and elsewhere as well as I can.
If you’re unfamiliar with MIC or KF or KC, you can call
up a plethora of resources on the web that will tell you about it; I won’t be
repeating that information here. Here’s a relatively recent one: Intel
Larrabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore
anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one
chip. KC has “50 or more.” In addition, and crucially for much of the
discussion below, each core has an enhanced and expanded vector / SIMD unit. You
can think of that as an extension of SSE or AVX, but 512 bits wide and with
many more operations available.
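To put that width in concrete terms, here is a minimal sketch of my own (not Intel code): 512 bits is 16 single-precision or 8 double-precision operands per vector register, so a compiler that vectorizes the trivial loop below could, in principle, advance 16 floats (or 8 doubles) per vector operation.

    /* Illustration only: 512-bit registers hold 512/32 = 16 floats or
     * 512/64 = 8 doubles, so a vectorizing compiler could execute this
     * loop 16 (or 8) elements per vector instruction. */
    void scale(float *x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }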
An aside: Intel’s department of code names is fond of
using place names – towns, rivers, etc. – for the externally-visible names of
development projects. “Knight’s Ferry” follows that tradition; it’s a town up
in the Sierra Nevada Mountains in central California. The only “Knight’s
Corner” I could find, however, is a “populated area,” not even a real town,
probably a hamlet or development, in central Massachusetts. This is at best an
unlikely name source. I find this odd; I wish I’d remembered to ask about it.
Is It Real?
The MIC architecture is apparently as real as it can be.
There are multiple generations of the MIC chip in roadmaps, and Intel has
committed to supply KC (product-level) parts to the University of Texas TACC
by January 2013, so at least the second generation is as guaranteed to be
real as a contract makes it. I was repeatedly told by Intel execs I interviewed
that it is as real as it gets, that the MIC architecture is a long-term
commitment by Intel, and it is not transitional – not a step to other,
different things. This is supposed to be the Intel highly-parallel technical
computing accelerator architecture, period, a point emphasized to me by several
people. (They still see a role for Xeon, of course, so they don't think of MIC as the only
technical computing architecture.)
More importantly, Joe Curley (Intel HPC marketing) gave
me a reason why MIC is real, and intended to be architecturally stable: HPC and
general technical computing are about a third of Intel’s server business. Further,
that business tends to be a very profitable third since those customers tend to
buy high-end parts. MIC is intended to slot directly into that business,
obviously taking the money that is now increasingly spent on other accelerators
(chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as
discussed below, Intel’s intention for MIC is to greatly widen the pool of
customers for accelerators.
The Big Feature: Source Compatibility
There is absolutely no question that Intel regards source
compatibility as a primary, key feature of MIC: Take your existing programs,
recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag),
and they run on KF. I have zero doubt that this will also be true of KC and is
planned for every future release in their road map. I suspect it’s why there is
a MIC – why they did it, rather than just burying Larrabee six feet deep. No
binary compatibility, though; you need to recompile.
You do need to be on Linux; I heard no word about
Microsoft Windows. However, Microsoft Windows 8 has a new
task manager display designed to better visualize many more cores – up
to 640 of them. So who knows; support is up to Microsoft.
Clearly, to get anywhere, you also need to be
parallelized in some form; KF has support for MPI (messaging), OpenMP (shared
memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading
Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s
a real Linux, by the way, that runs on a few of the MIC cores; I was told
“you can SSH to it.” The rest of the cores run some form of microkernel. I see
no reason they would want any of that to become more restrictive on KC.
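To make "just recompile" concrete, here is a minimal sketch (mine, not Intel's) of the kind of existing OpenMP code that, per Intel's source-compatibility claim, should need only a retarget; the exact compiler invocation in the comment is my assumption, based on the "-mmic" flag mentioned above.

    /* Sketch of an existing shared-memory OpenMP program. Per Intel's claim,
     * the same source runs on Xeon or MIC; the MIC build would be something
     * like (assumed, based on the "-mmic" flag):
     *     icc -openmp -mmic saxpy.c -o saxpy_mic
     */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static float x[N], y[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The same parallelism already used on multi-core Xeons today. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f, max threads = %d\n", y[0], omp_get_max_threads());
        return 0;
    }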
If you can pull off source compatibility, you have something
that is wonderfully easy to sell to a whole lot of customers. For example, Sriram
Swaminarayan of LANL has
noted (really interesting video there) that over 80% of HPC users have,
like him, a very large body of legacy code they need to carry into the future.
“Just recompile” promises to bring back the good old days of clock speed
increases when you just compiled for a new architecture and went faster. At
least it does if you’ve already gone parallel on X86, which is far from
uncommon. No messing with newfangled, brain-bending languages (like CUDA or
OpenCL) unless you really want to. This collection of customers is large,
well-funded, and not very well-served by existing accelerator architectures.
Right. Now, for all those readers screaming at me "OK, it runs, but does it perform?" –
Well, not necessarily.
The problem is that to get MIC – certainly KF, and it
might be more so for KC – to really perform, on many applications you must get its
512-bit-wide SIMD / vector unit cranking away. Jim
Reinders regaled me with a tale of a four-day port to MIC, where, surprised
it took that long (he said), he found that it took one day to make it run (just
recompile), and then three days to enable wider SIMD / vector execution. I
would not be at all surprised to find that this is pleasantly optimistic. After
all, Intel cherry-picked the recipients of KF, like CERN, which has one of the
world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications
in the known universe. (See my post Random
Things of Interest at IDF 2011.)
Where, on this SIMD/vector issue, are the 80% of folks
with monster legacy codes? Well, Sriram (see above) commented that when LANL
tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes
with the horsepower coming from attached IBM Cell blades – they had a problem
because to perform well, the Cell SPUs needed to crank up their two-way
SIMD / vector units. Furthermore, they still have difficulty using earlier
Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s
8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.
On the other hand, getting good performance on other accelerators,
like Nvidia’s, requires much wider SIMD; they need 100s of units cranking,
minimally. Full-bore SIMD may in some cases be simpler to exploit than
SIMD/vector instructions. But even going through gigabytes of grotty old
FORTRAN code just to insert notations saying “do this loop in parallel,” without
breaking the code, can be arduous. The programming language, by the way, is not
the issue. Sriram reminded me of the old saying that great FORTRAN coders, who
wrote the bulk of those old codes, can write FORTRAN in any language.
But wait! How can these guys be choking on 2-way
parallelism when they have obviously exploited thousands of cluster nodes in
parallel? The answer is that we have here two different forms of parallelism;
the node-level one is based on scaling the amount of data, while the SIMD-level
one isn’t.
In physical simulations, which many of these codes
perform, what happens in this simulated
galaxy, or this airplane wing, bomb,
or atmosphere column over here has a
relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as
perturbations, smoothed out with a few more global iterations. That’s the basis
of the node-level parallelism, with communication between nodes. It can also
readily be the basis of processor/core-level parallelism across the cores of a
single multiprocessor. (One basis of those kinds of parallelism, anyway; other
techniques are possible.)
Inside any given galaxy, wing, bomb, or atmosphere column,
however, quantities tend to be much more tightly coupled to each other. (Consider,
for example, R² force laws; irrelevant when sufficiently far apart, dominant
when close.) Changing the way those tightly-coupled calculations are done can
often strongly affect the precision of the results, the mathematical properties
of the solution, or even whether you ever converge to any solution. That part
may not be simple at all to parallelize, even two-way, and exploiting SIMD /
vector forces you to work at that level. (For example, you can get into trouble
when naively going parallel and/or SIMD changes the computation from Gauss-Seidel iteration to
Gauss-Jacobi iteration. I went into this in more detail way back in my book In Search of Clusters (Prentice-Hall), Chapter
9, "Basic Programming Models and Issues.") To be sure, not all applications
have this problem; those that don’t often can easily spin up into thousands of
operations in parallel at all levels. (Also, multithreaded “real” SIMD, as
opposed to vector SIMD, can in some cases
avoid some of those problems. Note the hedging: "in some cases," "some of.")
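To make the Gauss-Seidel versus Gauss-Jacobi point concrete, here is a minimal sketch of my own (not drawn from any of the codes discussed): a serial 1-D relaxation sweep uses neighbor values already updated in the same pass; vectorize or parallelize it naively and every element sees only old values, which is Jacobi iteration – mathematically a different method, with different convergence behavior.

    /* Serial Gauss-Seidel sweep: u[i] sees u[i-1] as already updated earlier
     * in this same sweep.  The loop iterations are therefore dependent. */
    void gauss_seidel_sweep(double *u, const double *f, int n)
    {
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1] - f[i]);
    }

    /* What a naive parallel/SIMD version actually computes: a Jacobi sweep,
     * in which every u[i] sees only last sweep's values (u_old).  The
     * iterations are now independent, but the mathematics has changed. */
    void jacobi_sweep(double *u_new, const double *u_old,
                      const double *f, int n)
    {
        for (int i = 1; i < n - 1; i++)
            u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1] - f[i]);
    }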
The difficulty of exploiting parallelism in tightly-coupled
local computations implies that those 80% are in deep horse puckey no matter
what. You have to carefully consider everything (even, in some cases,
parenthesization of expressions, forcing order of operations) when changing
that code. Needing to do this to exploit MIC’s SIMD suggests an opening for
rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually
necessary for Intel, too, and if you do it our way you get” tons more
performance / lower power / whatever.
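As a tiny example of the parenthesization point above: floating-point addition is not associative, so a SIMD or parallel reduction that regroups terms can silently change the answer.

    #include <stdio.h>

    int main(void)
    {
        float big = 1.0e8f, small = 3.0f;   /* adjacent floats near 1e8 are 8 apart */

        /* Left-to-right, as the original serial code computes it:
         * each small addend is rounded away individually. */
        float serial = (big + small) + small;       /* 100000000.0 */

        /* Regrouped, as a vectorized or parallel reduction might do it:
         * the small addends combine before meeting the big one. */
        float regrouped = big + (small + small);    /* 100000008.0 */

        printf("serial = %.1f, regrouped = %.1f\n", serial, regrouped);
        return 0;
    }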
Can compilers help here? Sure, they can always eliminate a
pile of gruntwork. Automatically vectorizing compilers have been working quite
well since the 80s, and progress continues to be made in disentangling the aliasing
problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or
semi-commercial) products from people like CAPS
and The Portland Group get better results
if you tell them what’s what, with annotations. Those, of course, must be very
carefully applied across mountains of old codes. (They even emit CUDA and
OpenCL these days.)
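For flavor, here is a small sketch of what such an annotation looks like; the example is mine, using the standard C99 "restrict" qualifier (a promise to the compiler that the pointers don't alias), which addresses exactly the aliasing problem mentioned above. Vendor-specific directives (Intel's "#pragma ivdep", the CAPS and Portland Group annotations) play a similar hinting role.

    /* Without help, the compiler must assume a, b, and c could overlap in
     * memory, which inhibits vectorization.  The C99 "restrict" qualifier
     * asserts they do not, freeing the compiler to vectorize the loop. */
    void tri_add(double *restrict a, const double *restrict b,
                 const double *restrict c, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }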
By the way, at least some of the parallelism often exploited
by SIMD accelerators (as opposed to SIMD / vector) derives from what I called
node-level parallelism above.
Returning to the main discussion, Intel’s MIC has the
great advantage that you immediately get a simply ported, working program; and,
in the cases that don’t require SIMD operations to hum, that may be all you
need. Intel is pushing this notion hard. One IDF session presentation was
titled “Program the SAME Here and Over There” (caps were in the title). This is
a very big win, and can be sold easily because customers want to believe that
they need do little. Furthermore, you will probably always need less SIMD /
vector width with MIC than with GPGPU-style accelerators. Only experience over
time will tell whether that really matters in a practical sense, but I suspect
it does.
Several Other Things
Here are other MIC facts/factlets/opinions, each needing
far less discussion.
How do you get from one MIC to another MIC? MIC, both KF
and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does
not have a PCIe root complex, so cannot source PCIe. It must be attached to a
standard compute node. So all anybody was talking about was going down PCIe to
node memory, then back up PCIe to a different MIC, all at least partially under
host control. Maybe one could use peer-to-peer PCIe device transfers, although
I didn’t hear that mentioned. I heard nothing about separate busses directly
connecting MICs, like the ones that can connect dual GPUs. This PCIe use is
known to be a bottleneck, um, I mean, “known to require using MIC on
appropriate applications.” Will MIC be that way for ever and ever? Well, “no
announcement of future plans”, but “typically what Intel has done with
accelerators is eventually integrate them onto a package or chip.” They are
“working with others” to better understand “the optimal arrangement” for
connecting multiple MICs.
What kind of memory semantics does MIC have? All I heard
was flat cache coherence across all cores, with ordering and synchronizing
semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple
cores. Not 32-way Linux, though, just 4-way (16, including threads). (Now that
I think of it, did that count threads? I don’t know.) I asked whether the other
cores ran a micro-kernel and got a nod of assent. It is not the same Linux that
they run on Xeons. In some ways that’s obvious, since those microkernels on
the other cores have to be managed; whether other things changed I don't know. Each
core has a private cache, and all memory is globally accessible.
Synchronization will likely change in KC. That’s how I
interpret Jim
Reinders’ comment that current synchronization is fine for 32-way, but over
40 will require some innovation. KC has been said to be 50 cores or more, so
there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100%
necessary for source code to run (as
opposed to perform), I think that
might be a candidate for the chopping block at some point.
Is there adequate memory bandwidth for apps that strongly
stream data? The answer was that they were definitely going to be competitive,
which I interpret as saying they aren’t going to break any records, but will be
good enough for less stressful cases. Some quite knowledgeable people I know
(non-Intel) have expressed the opinion that memory chips will be used in stacks
next to (not on top of) the MIC chip in the product, KC. Certainly that would
help a lot. (This kind of stacking also appears in a leaked picture of a “far
future prototype” from Nvidia, as well as an
Intel Labs demo at IDF.)
Power control: Each core is individually controllable,
and you can run all cores flat out, in their highest power state, without
melting anything. That’s definitely true for KF; I couldn’t find out whether
it’s true for KC. Better power controls than used in KF are now present in
Sandy Bridge, so I would imagine that at least that better level of support
will be there in KC.
Concluding Thoughts
Clearly, I feel the biggest point here is Intel’s
planned commitment over time to a stable architecture that is source code
compatible with Xeon. Stability and source code compatibility are clear selling
points to the large fraction of the HPC and technical computing market that
needs to move forward a large body of legacy applications; this fraction is not
now well-served by existing accelerators. Also important is the availability of
familiar tools, and more of them, compared with popular accelerators available
now. There’s also a potential win in being able to evolve existing programmer skills,
rather than replacing them. Things do change with the much wider core- and
SIMD-level parallelism in MIC, but it’s a far less drastic change than that
required by current accelerator products, and it starts in a familiar place.
Will MIC win in the marketplace? Big honking SIMD units,
like Nvidia ships, will always produce more peak performance, which makes it
easy to grab more press. But Intel’s architectural disadvantage in peak juice
is countered by process advantage: They’re always two generations ahead of the
fabs others use; KC is a 22nm part, with those famous “3D” transistors. It
looks to me like there’s room for both approaches.
Finally, don’t forget that Nvidia in particular is here
now, steadily increasing its already massive momentum, while a product version
of MIC remains pie in the sky. What happens when the rubber meets the road with
real MIC products is unknown – and the track record of Larrabee should give
everybody pause until reality sets well into place, including SIMD issues,
memory coherence and power (neither discussed here, but not trivial), etc.
I think a lot of people would, or should, want MIC to
work. Nvidia is hard enough to deal with in reality that two best paper awards
were given at the recently concluded IPDPS
2011 conference – the largest and most prestigious academic parallel computing conference
– for papers that may as well have been titled “How I actually managed to do
something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like
a shortest-path graph algorithm (PHAST) are not exactly what one typically expects
to run well on a GPGPU. Nevertheless, this is not a good sign. People should
not have to do work at the level of intellectual academic accolades to get something
done – anything! – on a truly useful computer architecture.
Hope aside, a lot of very difficult hardware and software
still has to come together to make MIC work. And…
Larrabee was supposed to be real, too.
**************************************************************
Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!