At the recent Intel Developer Forum (IDF), I was given the
opportunity to interview James Reinders. James is a Director and the Chief
Software Evangelist of Intel’s Software and Services Group in Oregon, and the
conversation ranged far and wide, from programming languages, to frameworks, to
transactional
memory, to the use of CUDA, to Matlab, to vectorizing
for execution on Intel’s MIC
(Many Integrated Core) architecture.
Intel-provided information about James:
James Reinders is an expert on parallel computing.
James is a senior engineer who joined Intel Corporation in 1989 and has
contributed to projects including the systolic array systems WARP and iWarp, and
the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and
architecture work for multiple Intel processors and parallel systems. James has
been a driver behind the development of Intel as a major provider of software
development products, and serves as their chief software evangelist. His most
recent book is “Intel Threading Building Blocks” from O'Reilly Media, which has
been translated into Japanese, Chinese, and Korean. James has published
numerous articles and contributed to several books; one of his current projects
is co-authoring a new book on parallel programming, to be released in 2012.
I recorded our conversation; what follows is a transcript.
Also, I used Twitter to crowd-source questions, and some of my comments refer
to picking questions out of the list that resulted. (Thank you to all who
responded!)
This is #2 in a series of three such transcripts. I’ll
have at least one additional post about IDF 2011, summarizing the things I
learned about MIC and the Intel “Knight’s” accelerator boards using them, since
some important things I learned came from outside the interviews.
Full disclosure: As I originally noted in a prior
post, Intel paid for me to attend IDF. Thanks, again. It was a great
experience, since I’d never before attended.
Occurrences of [] indicate words I added for clarification
or comment post-interview.
Pfister:
[Discussing where I’m coming from, crowd-sourced question list, HPC & MIC
focus here.] So where would you like to start?
Reinders:
Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so
do your best. It has been for a long, long time, so hopefully I have a very
realistic view of what works and what doesn’t work. I think I surprise some
people with optimism about where we’re going, but I have some good reasons to
see there’s a lot of things we can do in the architecture and the software that
I think will surprise people to make that a lot more approachable than you
would expect. Amdahl’s law is still there, but some of the difficulties that we
have with the systems in terms of programming, the nondeterminism that gets
involved in the programming, which you know really destroys the paradigm of
thinking how to debug, those are solvable problems. That surprises people a
lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago,
computers are so much faster and it benefits the tools. Think about how much
more the tools can do. You know, your compiler still compiles in about the same
time it did 10 years ago, but now it’s doing a lot more, and now that multicore
has become very prevalent in our hardware architecture, there are some hooks
that we are going to get into the hardware that will solve some of the
debugging problems that debugging tools can’t do by themselves because we can
catch the attention of the architects and we understand enough that there’s
some give-and-take in areas that might surprise people, that they will suddenly
have a tool where people say “how’d you solve that problem?” and it’s over
there under the covers. So I’m excited about that.
[OK, so everybody forgive me for not jumping right away on
his fix for nondeterminism. What he meant by that was covered later.]
Pfister: So,
you’re optimistic?
Reinders:
Optimistic that it’s not the end of the world.
Pfister: OK.
Let me tell you where I’m coming from on that. A while back, I spent an evening
doing a web survey of parallel programming languages, and made a spreadsheet of
101 parallel programming languages [see my much earlier post, 101
Parallel Programming Languages].
Reinders: [laughs]
You missed a few.
Pfister: I’m
sure I did. It was just one night. But not one of those was being used. MPI and
OpenMP, that was it.
Reinders: And
Erlang has had some limited popularity, but is dying out. They’re a lot like AI
and some other things. They help solve some problems, and then if the idea is
really an advance, you’ll see something from that materialize in C or C++,
Java, or C#. Those languages teach us something that we then use where we
really want it.
Pfister: I
understand what you’re saying. It’s like MapReduce being a large-scale version
of the old LISP mapcar.
Reinders:
Which was around in the early 70s. A lot of people picked up on it, it’s not a
secret but it’s still, you know, on the edge.
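[Post-interview note: for readers who haven’t met mapcar, here is the analogy
in miniature. This is my own hedged sketch, not something James showed; C++’s
std::transform and std::accumulate stand in for the “map” and “reduce” halves
of the pattern.]

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> values{120, 87, 240};

    // "map": apply a function independently to every element --
    // what LISP's (mapcar ...) has done since the early 70s.
    std::vector<int> squared(values.size());
    std::transform(values.begin(), values.end(), squared.begin(),
                   [](int n) { return n * n; });

    // "reduce": fold the mapped results into one value.
    int total = std::accumulate(squared.begin(), squared.end(), 0);

    // MapReduce is the same pattern scaled out: the map step runs on
    // thousands of machines and the reduce step merges their results.
    std::cout << total << '\n';  // prints 79569
}
```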
Pfister: I
heard someone say recently that there was a programming crisis in the early
80s: How were you going to program all those PCs? It was solved not by
programming, but by having three or four frameworks, such as Excel or Word,
that some experts in a dark room wrote, everybody used, and it spread like
crazy. Is there anything now like that which we could hope for?
Reinders: You
see people talk about Python, you see Matlab. Python is powerful, but I think
it’s sort of trapped between general-purpose programming and Matlab. It may
be a big enough area; it certainly has a lot of followers. Matlab is a good
example. We see a lot of people doing a lot in Matlab. And then they run up
against barriers. Excel has the same thing. You see Excel grow up, and people
do incredibly hairy things. We worked with Microsoft a few years ago, and they’ve
added parallelism to Excel, and it’s extremely important to some people. Some
people have spreadsheets out there that do unbelievable things. You change one
cell, and it would take a computer from just a couple of years ago and just
stall it for 30 minutes while it recomputes. [I know of people in the finance
industry who go out for coffee for a few hours if they accidentally hit F5.] Now
you can do that in parallel. I think people do gravitate towards those
frameworks, as you’re saying. So which ones will emerge? I think there’s hope.
I think Matlab is one; I don’t know that I’d put my money on that being the huge
one. But I do think there’s a lot of opportunity for that to hide this compute
power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did
that, they removed something that you would have programmed over and over
again, made it accessible without it looking like programming.
Pfister:
People did spreadsheets without realizing they were programming, because it was
so obvious.
Reinders:
Yes, you’re absolutely right. I tend to think of it in terms of libraries,
because I’m a little bit more of an engineer. I do see development of important
libraries that use unbelievable amounts of compute power behind them and then
simply do something that anyone could understand. Obviously image processing is
one [area], but there are other manipulations that I think people will just
routinely be able to throw into an application, but what stands behind them is
an incredibly complex library that uses compute power to manipulate that data.
You see Apple use a lot of this in their user interface, just doing this
[swipes] or that to the screen, I mean the thing behind that uses parallelism
quite well.
Pfister: But
this [swipes] [meaning the thing you do] is simple.
Reinders:
Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the
modern equivalent of using spreadsheets or Word. It’s the user interfaces, and
they are demanding a lot behind them. It’s unbelievable how much compute power
that can use.
Pfister: Yes,
it is. And I really wonder how many times you’re going to want to scan your
pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a
couple of days.
Reinders:
Right, but I think rather than that being an activity, it’s just something your
computer does for you. It disappears. Most of us don’t want to organize things,
we want it just done. And Google’s done that on the web. Instead of keeping a
million bookmarks to find something, you do a search.
Pfister:
Right. I used to have this incredible tree of bookmarks, and could never find
anything there.
Reinders:
Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.
Pfister: I
remember when it was a major feature of Firefox that it provided searching of
your bookmarks.
Reinders:
[Laughter]
Pfister: You
mentioned nondeterminism. Are there any things in the hardware that you’re
thinking of? IBM Blue Gene just said they have transactional memory, in
hardware, on a chip. I’m dubious.
Reinders:
Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long
time, we being the industry, Intel included. At first we hoped “Wow, get rid of
locks, we’ll all just use transactional memory, it’ll just work.” Well, the
shortest way I can say why it doesn’t work is that software people want
transactions to be arbitrarily large, and hardware needs it to be constrained,
so it can actually do what you’re asking it to do, like holding a buffer.
That’s a nonstarter.
So now what’s happening? Sun was looking at this with Rock [Sun’s Rock
processor], a hybrid technique, and unfortunately they didn’t bring that to
market. Nobody outside the team knows exactly what happened, but the project as
a whole failed, rather than transactional memory itself being the cause of
death. But they had a hard time
figuring out how you engineer that buffering. A lot of smart people are looking
at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a
single socket. It makes sense to me why a constraint like that would be
buildable. The hard part is then how do you wrap that into a programming model.
Blue Gene’s obviously a very high end machine, so those developers have more
familiarity with constraints and dealing with it. Making it general purpose is
a very hard problem, very attractive, but I think that at the end of the day, all
transactional memory will do is be another option, that may be less
error-prone, to use in frameworks or toolkits. I don’t see a big shift in
programming model where people say “Oh, I’m using transactional memory.” It’ll
be a piece of infrastructure that toolkits like Threading Building Blocks or
OpenMP or Cilk+ use. It’ll be important for us in that it gives better
guarantees.
The things I more had in mind is you’re seeing a whole
class of tools. We’ve got a tool that can do deadlock and race detection
dynamically and find it; a very, very good tool. You see companies like
TotalView looking at what they would call replaying, or unwinding, going
backwards, with debuggers. The problem with debuggers if your program’s
nondeterministic is you run it to a breakpoint and say, whoa, I want to see
what happened back here, what we usually do is just pop out of the debugger and
run it with an earlier breakpoint, or re-run it. If the program is
nondeterministic, you don’t get what you want. So the question is, can the
debugger keep enough information to back up? Well, the thing that backing up
in debugging, deadlock detection, and race detection all have in
common is that they tend to run two or three orders of magnitude slower when
you’re using those techniques. Well, that’s not compelling. But, the cool part
is, with the software, we’re showing how to detect those – just a thousand
times slower than real time.
Now we have the cool engineering problem: Can you make it
faster? Is there something you could do in the software or the hardware and
make that faster? I think there is, and a lot of people do. I get really
excited when you solve a hard problem, can you replay a debug, yeah, it’s too
slow. We use it to solve really hard problems, with customers that are really
important, where you hunker down for a week or two using a tool that’s a
thousand times slower to find the bug, and you’re so happy you found it – I
can’t stand out in a booth and market and have a million developers use it.
That won’t happen unless we get it closer to real time. I think that will
happen. We’re looking at ways to do that. It’s a cooperative thing between
hardware and software, and it’s not just an Intel thing; obviously the Blue
Gene team worries about these things, Sun’s team was worried about them. There’s
actually a lot of discussion between those small teams. There aren’t that many
people who understand what transactional memory is or how to implement it in
hardware, and the people who do talk to each other across companies.
[In retrospect, while transcribing this, I find the sudden
transition back to TM to be mysterious. Possibly James was veering away from
unannounced technology, or possibly there’s some link between TM and 1000x
speedups of playback. If there is such a link, it’s not exactly instantly obvious
to me.]
Pfister: At a
minimum, at conferences.
Reinders:
Yes, yes, and they’d like to see the software stack on top of them come
together, so they know what hardware to build to give the software model,
whatever it turns out to be, what it needs. One of the things we learned about transactional memory
is that the software model is really hard. We have a transactional memory
compiler that does it all in software. It’s really good. We found that when
people used it, they treated transactional memory like locks and created new
problems. They didn’t write a transactional memory program from scratch to use
transactional memory, they took code they wrote for locks and tried to use
transactional memory instead of locks, and that creates problems.
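[Post-interview note: a minimal sketch of what “treating transactional memory
like locks” looks like. The transactional syntax here is GCC’s later -fgnu-tm
extension (__transaction_atomic), used purely for illustration; it is not
Intel’s prototype STM compiler, and the account/balance example is mine.]

```cpp
#include <mutex>

std::mutex m;
long balance = 0;

// Lock-based original: two separate critical sections.
bool sufficient_locked(long amount) {
    std::lock_guard<std::mutex> g(m);
    return balance >= amount;
}
void withdraw_locked(long amount) {
    std::lock_guard<std::mutex> g(m);
    balance -= amount;
}

// Naive port: each lock region mechanically becomes its own transaction.
bool sufficient_tm(long amount) {
    bool ok;
    __transaction_atomic { ok = (balance >= amount); }
    return ok;
}
void withdraw_tm(long amount) {
    __transaction_atomic { balance -= amount; }
}

bool try_withdraw(long amount) {
    // Problem carried over from the lock mindset: the check and the
    // update are separate atomic regions, so another thread can drain
    // the balance in between. Written *for* TM from scratch, the whole
    // check-then-act would be one transaction.
    if (sufficient_tm(amount)) {
        withdraw_tm(amount);
        return true;
    }
    return false;
}
```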
Pfister: The
one place I’ve seen where rumors showed someone actually using it was the
special-purpose Java machine Azul. 500 plus processors per rack, multiple
racks, point-to-point connections with a very pretty diagram like a rosette.
They got into a suit war with Sun. And some of the things they showed were
obvious applications of transactional memory.
Reinders:
Hmm.
Pfister: Let’s
talk about support for things like MIC. One question I had was that things like
CUDA, which let you just annotate your code, well, do more than that. But I
think CUDA was really a big part of the success of Nvidia.
Reinders: Oh,
absolutely. Because how else are you going to get anything to go down your
shader pipeline for a computation if you don’t give a model? And by lining up
with one model, no matter the pros or cons, or how easy or hard it was, it gave
a mechanism, actually a low-level mechanism, that turns out to be predictable
because the low-level mechanism isn’t trying to do anything too fancy for you,
it’s basically giving you full control. That’s a problem to get a lot of people
to program that way, but when a programmer does program that way, they get what
the hardware can give them. We don’t need a fancy compiler that gets it right
half the time on top of that, right? Now everybody in the world would like a
fancy compiler that always got it right, and when you can build that, then CUDA
and that sort of thing just poof! Gone. I’m not sure that’s a tractable problem
on a device that’s not more general than that type of pipeline.
So, the challenge I see with CUDA, and OpenCL, and even
C++ AMP is that they’re going down the road of saying look, there are going to
be several classes of devices, and we need you the programmer to write a
different version of your program for each class of device. Like in OpenCL, you
can take a function and write a version for a CPU, for a GPU, a version for an
accelerator. So in this terminology, OpenCL is proposing CPU is like a Xeon,
GPU is like a Tesla, an accelerator something like MIC. We have a hard enough
problem getting one version of an optimized program written. I think that’s a
fatal flaw in this thing being widely adopted. I think we can bring those together.
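[Post-interview note: for concreteness, here is the device-class split James
is describing, at the OpenCL host-API level. These are standard OpenCL 1.x
calls; error handling is omitted, and the point is only that
CL_DEVICE_TYPE_CPU, _GPU, and _ACCELERATOR are distinct targets you would
write and tune for separately.]

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    // OpenCL's three device classes; in James's mapping, CPU ~ Xeon,
    // GPU ~ Tesla, ACCELERATOR ~ MIC.
    const cl_device_type classes[] = {
        CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR
    };

    for (cl_device_type type : classes) {
        cl_device_id dev;
        cl_uint count = 0;
        if (clGetDeviceIDs(platform, type, 1, &dev, &count) != CL_SUCCESS ||
            count == 0)
            continue;  // this class isn't present on this machine

        char name[256];
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
        // In practice you would now build a different, separately tuned
        // kernel for each class -- the "three versions" problem.
        std::printf("device class %llx: %s\n",
                    (unsigned long long)type, name);
    }
}
```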
What you really are trying to say is that part of your
program is going to be restrictive enough that it can be vectorized, done in
parallel. I think there are alternatives to this that will catch on and
mitigate the need to write much code in OpenCL and in CUDA. The other flaw with
those techniques is that in a world where you have a GPU and a CPU, the GPU’s
got a job to do on the user interface, and so far we’ve not described what
happens when applications mysteriously try to send some to the GPU, some to the
CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK,
mea culpa for not interrupting and
mentioning Tianhe-1A.] AMD has proposed in their future architectures that
they’re going to produce a meta-language that OpenCL targets, and then the
hardware can target some to the GPU, and some to the CPU. So I understand the
problem, and I don’t know if that solution’s the right one, but it highlights
that the problem’s understood if you write too much OpenCL code. I’m personally
more of a believer that we find higher-level programming interfaces like Cilk
Plus’s array notations – add array notations to C that explicitly tell you to
vectorize – and the compiler can figure out whether that’s SSE, is it AVX, is it
the 512-bit wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But
don’t pollute the programming language by telling the programmer to write three
versions of your code. The good news is, though, if you do use OpenCL or CUDA
to do that, you have extreme control of the hardware and will get the best
hardware results you can, and we learn from that. I just think the learnings
are going to drive us to more abstract programming models. That’s why I’m a big
believer in the Cilk Plus stuff that we’re doing.
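[Post-interview note: a tiny example of the array notation James is referring
to, as it appeared in Intel’s Cilk Plus. It compiled with Intel’s compiler of
that era (and, later, a GCC branch); the saxpy function itself is mine.]

```cpp
// Cilk Plus array notation: array[start : length] describes a section.
// The programmer states the elementwise operation once; the compiler
// picks the vector width -- SSE (4 floats), AVX (8), or MIC (16).
void saxpy(float a, const float* x, float* y, int n) {
    y[0:n] = a * x[0:n] + y[0:n];
}
```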
Pfister: But
how many users of HPC systems are interested in squeezing that last drop of
performance out?
Reinders: HPC
users are extremely interested in squeezing performance if they can keep a
single source code that they can run everywhere. I hear this all the time, you
know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it
on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever,
just don’t tell me it has to be rewritten in a strange language that’s only
supported on your machine. It’s pretty consistent. So the success of CUDA, to
be used on those machines, it’s limited in a way, but it’s been exciting. But
it’s been a strain on the people who have done that, because CUDA code’s not
going to run on an Intel machine [Well, actually, the Portland Group has a CUDA C/C++ compiler
targeting x86. I do not know how good the output code performance is.]. OpenCL
offers some opportunities to run everywhere, but then has problems of
abstraction. Nvidia will talk about 400X speedups, which aren’t real, well that
depends on your definition of “real”.
Pfister:
Let’s not start on that.
Reinders: OK,
well, what we’re seeing constantly is that vectorization is a huge challenge.
You talk to people who have taken their cluster code and moved it to MIC
[Cluster? No shared memory?], very consistently they’ll tell us stories like,
oh, “We ported in three days.” The Intel
marketing people are like “That’s great! Three days!” I ask why the heck
did it take you three days? Everybody tells me the same thing: It ran right
away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few
days to vectorize because otherwise performance was terrible. They’re trying to
use the 512-bit-wide vectors, and their original code was written using SSE
[Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations].
They can’t automatically translate, you have to restructure the loop because
it’s 512 bits wide – that should be automated, and if we don’t get that
automated in the next decade we’ve made a huge mistake as an industry. So I’m
hopeful that we have solutions to that today, but I think a standardized
solution to that will have to come forward.
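[Post-interview note: an illustration, mine rather than James’s, of why the
port takes “a few days.” A loop hand-written with SSE intrinsics bakes the
width of 4 floats into its structure; MIC’s 16-wide vectors need the loop
rebuilt. The SSE half uses the standard <xmmintrin.h> intrinsics; the MIC half
is sketched in a comment using the Knights Corner _mm512_* family, for shape
rather than as drop-in code.]

```cpp
#include <xmmintrin.h>  // SSE: 128-bit registers, 4 floats at a time

// "Original code written using SSE with intrinsics": the width 4 is
// hard-coded into the loop bounds, the stride, and the remainder.
void add_sse(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)  // scalar remainder -- also width-dependent
        c[i] = a[i] + b[i];
}

// The MIC version is not a rename. Stride, remainder handling, and
// alignment rules all change at 512 bits, roughly:
//
//     for (; i + 16 <= n; i += 16) {
//         __m512 va = _mm512_load_ps(a + i);   // 64-byte aligned loads
//         __m512 vb = _mm512_load_ps(b + i);
//         _mm512_store_ps(c + i, _mm512_add_ps(va, vb));
//     }
//
// That mechanical-but-pervasive rewrite is what James says should be
// automated, and what array notations let the compiler do instead.
```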
Pfister: I
really wonder about that, because wildly changing the degree of parallelism, at
least at a vector level – if it’s not there in the code today, you’ve just got
to rewrite it.
Reinders:
Right, so we’ve got low-hanging fruit, we’ve got codes that have the
parallelism today, we need to give them a better way of specifying it. And then
yes, over time, those need to migrate to that [way of specifying parallelism in
programs]. But migrating the code where you have to restructure it a lot, and
then you do it all in SSE intrinsics, that’s very painful. If it feels more
readable, more intuitive, like array extensions to the language, I give it
better odds. But it’s still algorithmic transformation. They have to teach
people where to find their data parallelism; that’s where all the scaling is in
an application. If you don’t know how to expose it or write programs that
expose it, you won’t take advantage of this shift in the industry.
Pfister: Yes.
Reinders: I’m
supposed to make sure you wander down at about 11:00.
Pfister: Yes,
I’ve got to go to the required press briefing, so I guess we need to take off.
Thanks an awful lot.
Reinders:
Sure. If there are any other questions we need to follow up on, I’ll be happy
to talk to you. I hope I’ve knocked off a few of your questions.
Pfister: And
then some. Thanks.
[While walking down to the press briefing, I asked James
whether the synchronization features he had from the X86 architecture were
adequate for MIC. He said that they were OK for the 30 or so cores in Knight’s Ferry,
but when you got above 40, they would need to do something additional.
Interestingly, after the conference, there was an Intel
press release about the Intel/Dell “home run” win at TACC – using Knight’s
Corner, “an innovative design that includes more than 50 cores.” This dovetails
with what Joe Curley told me about Knight’s Corner not being the same as
Knight’s Ferry. Stay tuned for the next interview.]