At the recent Intel Developer Forum (IDF), I was given
the opportunity to interview Joe Curley, Director, Technical Computing Marketing
of Intel’s Datacenter & Connected Systems Group in Hillsboro.
Intel-provided information about Joe:
Joe Curley serves Intel® Corporation as director of
marketing for technical computing in the Data Center Group. The technical
computing marketing team manages marketing for high-performance computing (HPC)
and workstation product lines as well as future Intel® Many Integrated Core
(Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities that led up to the announcement of the Intel® MIC Architecture in May of 2010.
Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng
Labs in a series of marketing and engineering leadership roles.
I recorded our conversation; what follows is a transcript.
Also, I used Twitter to crowd-source questions, and some of my comments refer
to picking questions out of the list that generated. (Thank you! to all who
responded.)
This is the last in a series of three such transcripts. Hallelujah!
Doing this has been a pain. I’ll have at least one additional post about IDF
2011, summarizing the things I learned about MIC and the Intel “Knight’s”
accelerator boards using them, since some important things I learned were outside the interviews. But some were in the interviews, including here.
Full disclosure: As I originally noted in a prior
post, Intel paid for me to attend IDF. Thanks, again. It was a great
experience, since I’d never before attended.
Occurrences of [] indicate words I added for
clarification or comment post-interview.
[We began by discovering we had similar deep backgrounds, both
starting in graphics hardware. I designed & built a display processor (a
prehistoric GPU), he built “the most efficient frame buffer controller you
could possibly make”. Guess which one of us is in marketing?]
A: My experience in the [HPC] business really started relatively
recently, a little under five years ago, [when] I started working on many-core
processors. I won’t be able to go into history, but I can at least tell you
what we’re doing and why.
Q: Why don’t we start there? At a high level, what are
you doing, and why? High level for what you are doing, and as much detail on “why”
as you can provide.
A: We have to narrow the question. So, at Intel, what we’re
after first of all is in what we call our Technical Computing Marketing Group
inside Data Center Group. That has really three major objectives. The first one
is to specify the needs for high performance computing, how we can help our
customers and developers build the best high performance computing systems.
Q: Let me stop you for a second right there. My
impression for high performance computing is that they are people whose needs
are that they want more. Just more.
A: Oh, yes, but more at what cost? What cost of power, what cost of programmability, what cost of size? How are we going to build the I/O system to handle it affordably, or use the fabric of the day?
Q: Yes, they want more, but they want it at two
bytes/FLOPS of memory bandwidth and communication bandwidth.
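[For scale: at 2 bytes/FLOPS, a 1 TFLOPS part would need 2 TB/s of memory bandwidth. Parts of this era deliver roughly an order of magnitude less, which is why that's a wish-list ratio.]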
A: There’s an old thing called the Dilbert Spec, which is
“I want it all, and by the way, can it be free?” But that’s not really what
people tell us they want. People in HPC have actually been remarkably pragmatic
about what it takes to develop innovation. So they really want us to do some
things, and do them really well.
By the way, to finish what we do, we also have the
workstation segment, and the MIC Many Integrated Core product line. The
marketing for that is also in our group.
You asked “what are you doing and why.” It would probably
take forever to go across all domains, but we could go into any one of them a
little bit better.
Q: Can you give me a general “why” for HPC, and a
specific “why” for MIC?
A: Well, HPC’s a really good business. I get stunned; somebody must be asking really weird questions, asking “why are you doing HPC?”
Q: What I’ve heard is that HPC is traditionally 12% of
the market.
A: Supercomputing is a relatively small percentage of the
market. HPC and technical computing,
combined, is, not exactly, but roughly, a third of our data center business.
[emphasis added by me] Our data center business is a pretty robust business.
And high performance computing is a business that requires very high end, high
performance processors. It’s actually a very desirable business to be in, if
you can do it, and if your systems work. It’s a business we spend a lot of time
working on because it’s a good business.
Now, if you look at MIC, back in 2005 we made a tacit conclusion
that the performance of a system will come out of parallelism. Parallelism
could be expressed at Intel in a lot of different ways. You can look at it as
threads, we have this concept called hyperthreading. You can look at it as
cores. And we have the SSE instructions sitting around which are SIMD, that’s a
form of parallelism; people argue about the definition, but yes, it is. [I agree.]
So you take a look at the basic architectural constructs, ease of programming,
you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or
vectors, these common attributes have been adopted and well-used by a lot of
programmers. There are programs across the continuum of coarse- to fine-grained
parallel, embarrassingly parallel, pick your taxonomy. But there are applications where developers would be willing to trade the performance of any particular task or thread for the sum of what you can do inside the power envelope in a given period of time. Lots of people have different ways of
defining that, you hear throughput, whatever, but this is the class of
applications, and over time they’re growing.
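[To make that taxonomy concrete, here's a minimal sketch of my own, not Intel code, layering two of those forms of parallelism in one loop: threads spread across cores and hyperthreads via OpenMP, and SSE SIMD within each thread.]

/* Scale an array by a constant, in parallel two ways at once.
   Assumes n is a multiple of 4 and x is 16-byte aligned;
   compile with OpenMP enabled (e.g., -fopenmp). */
#include <xmmintrin.h>   /* SSE intrinsics */

void scale(float *x, int n, float a) {
    __m128 va = _mm_set1_ps(a);            /* broadcast a into a SIMD register */
    #pragma omp parallel for               /* thread-level: cores + hyperthreads */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(&x[i]);    /* data-level: 4 floats per SSE op */
        _mm_store_ps(&x[i], _mm_mul_ps(vx, va));
    }
}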
Q: Growing relatively, or, say, compared to commercial
processing, or…? Is the segment getting larger?
A: The number of people who have tasks they want to run
on that kind of hardware is clearly growing. One of the reasons we’re doing
MIC, maybe I should just cut it to the easiest answer, is developers and
customers asked us to.
Q: Really?
A: And they came to us with a really simple question. We
were struggling in the marketing group with how to position MIC, and one of our
developers got worked up, like “Look, you give me the parallel performance of
an accelerator, but you give me the ease of CPU programming!” Now, ease is a
funny word; you can get into religious arguments about ease. But I think what
he means is “I don’t have to re-think my algorithm, I don’t have to reorder my data set; there are some things that I don’t have to do.” So they wanted the idea of “give me this architecture and get it to scale to be wildly parallel.” And that is exactly what we’ve done with the MIC architecture. If you
think about what the Knight’s Ferry STP [? Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.] is, a 32 core, coherent, on a chip,
teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the
programming model is, surprisingly, kind of like a bunch of processor cores on
a network, which a lot of people understand and can get a lot of utility out of
in a very well-understood way. So, in a sense, we’re giving people what they
want, and that, generally, is good business. And if you don’t give them what
they want, they’ll have to go find someone else. So we’re simply doing what our
marketplace asked us for.
Q: Well, let me play a little bit of devil’s advocate
here, because MIC is very clearly derivative of Larrabee, and…
A: Knight’s Ferry is.
Q: … Knight’s Ferry is. Not MIC?
A: No. I think you have to take a look at what Larrabee was. Larrabee, by the way, was a really cool project, but what Larrabee was was a tile-rendering graphics device, which meant its design point was, first of all, a programming model derived from what you do for graphics. It’s going to be API-based, the answer it’s going to generate is going to be a pixel, and the pixel is going to have a defined level of sub-pixel accuracy. It’s a very predictable output. The internal optimizations you would make for a graphics implementation of a general many-core architecture make it one very specific implementation. Let’s
talk about the needs of the high performance computing market. I need
bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have
a frame buffer.
Q: It needed bandwidth to local memory [of which it didn’t
have enough; see my post The
Problem with Larrabee]
A: Yes, but less than you think, because the cache was
the critical element in that architecture [again, see that
post] if you look through the academic papers on that…
Q: OK, OK.
A: So, they have a common heritage, they’re both derived
out of the thoughts that came out of the Intel Labs terascale research. They’re
both many-core. But Knight’s Ferry came out with a few (and they’re only a few) modifications. But the programming model is completely different. You don’t
program a graphics device like you do a computer, and MIC is a computer.
Q: The higher-level programming model is different.
A: Correct.
Q: But it is a big, wide, cache-coherent SMP.
A: Well, yes, that’s what Knight’s Ferry is, but we haven’t talked about what Knight’s Corner is yet, and unfortunately I won’t today, and we
haven’t talked about where the product line will go from there, either. But
there are many things that will remain the same, because there are things you
can take and embellish and work and things that will be really different.
Q: But can you at least give me a hint? Is there a chance
that Knight’s Corner will be a substantially different hardware model than
Knight’s Ferry?
A: I’m going to really love to talk to you about
Knight’s Corner. [his emphasis]
Q: But not today.
A: I’m going to duck it today.
Q: Oh, man…
A: The product is going to be in our 22 nm process, and
22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to
have the buzz generated, we’ll start generating buzz. Right now, the big thing is
that we’re making the investments in the Knight’s Ferry software development
platform, to see how codes scale across the many cores, to get the environment and tools up, to let developers poke at it and find stuff (good stuff, bad stuff, in-between stuff) that allows us to adjust the product line for ongoing generations. We’ve done that
really well since we announced the architecture about 15 months ago.
Q: I was wondering what else I was going to talk about
after having talked to both John Hengeveld and Jim Reinders. This is great.
Nobody talked about where it really came from, and even hinted that there were
changes to the MIC chip [architecture].
A: Oh, no, no, many things will be the same, many things
will be different. If you’re targeting trying to do a pixel-renderer, go do a
pixel-renderer. If you’re trying to do a general-purpose computing device, do a
general-purpose computing device. You’ll see some things and say “well, it’s
all the same” and other things “wow, it’s completely different.” We’ll get
around to talking about the part when we’re a little closer.
The most important thing that James and/or John should have been talking about is the ability to not force the developer to completely and utterly re-think their problem to use your hardware.
There are two models: In an accelerator model, which is something I spent a lot
of my life working with, accelerators have the advantage of optimization. You
can say “I want to do one thing really well.” So you can then describe a
programming model for the hardware. You can say “build your data this way,
write your program this way” and if you do it will work. The problem is that
not everything fits into the box. Oh, you have sparse data. Oh, you have
recursive code.
Q: And there’s madness in that direction, because if you
start supporting that you wind yourself around to a general-purpose machine. […usually,
a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel
of Reincarnation” in this blog, haven’t I? Oh, there it is: The
Cloud Got GPUs, back in November 2010.]
A: Then it’s not an accelerator any more. The thing that
you get in MIC is the performance of one of those accelerators. We’ve shown
this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision,
without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve
shown that. So you get performance, but you’re getting it out of the general
purpose programming model.
Q: That’s running LINPACK, or… ?
A: That was an even more basic thing; I’m just talking
about SGEMM [single-precision
dense matrix multiply].
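[For the record, that’s 960/1200 = 80% of peak on SGEMM.]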
Q: I just wanted to ground the number.
A: For LU factorization, I think we showed hybrid LU,
really cool, one of the great things about this hybrid…
Q: They’re demo-ing that downstairs.
A: … OK. When the matrix size is small, I keep it on the
host; when the matrix size is large, I move it. But it’s all the same code, the
same code either place. I’m just deciding where I want to run the code
intelligently, based on the size of the matrix. You can get the exact number,
but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is
actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]
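[A sketch of the dispatch pattern he’s describing: the same factorization code runs either place, and a size threshold decides where. The threshold value and the offload shim below are my invented placeholders, not Intel’s API.]

/* Hypothetical hybrid dispatch: small matrices stay on the host,
   large ones move to the coprocessor. */
enum { MIC_CROSSOVER = 2048 };                   /* hypothetical crossover size */

void lu_factor(double *A, int n);                /* the same code, either place */
void run_on_mic(void (*f)(double *, int),
                double *A, int n);               /* hypothetical offload shim */

void hybrid_lu(double *A, int n) {
    if (n < MIC_CROSSOVER)
        lu_factor(A, n);                         /* small: keep it on the host */
    else
        run_on_mic(lu_factor, A, n);             /* large: move it to the card */
}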
Q: Yaahh, well, there are a lot of people who can deliver
something like that.
A: We’ll keep working on it and making it better and
better. So, what are we proving today? All we’ve proven today is that the
architecture is capable of performance. We’ve got a lot of work to do before we
have a product, but the architecture has shown itself to be capable. The
programming model, we have people who will speak for us, like the quotes that
came from LRZ
[data center for the universities of Munich and the Bavarian Academy of
Sciences], from Leibniz [same place], a code they couldn’t port to other
accelerators was running in two hours and optimized in two days. Now, actual
mileage may vary, see dealer for…
Q: So, there are things that just won’t run on a CUDA
model? Example?
A: Well, perhaps, again, the thing you try to get to is
whether there is evidence growing that what you say is real. So we’re having
people who are starting to be able to speak to that, and that gives people the
confidence that we’re going to be able to get there. The other thing it ends up
doing, it’s kind of an odd benefit, as people have started building their code,
trying to optimize it for MIC, they’re finding the parallelism, they’re doing
what we wanted them to do all along, they’re taking the same code on their
current cluster and they’re getting benefits right now.
Q: That’s got a long history. People would have some
grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler
couldn’t make crap out of it. So they cleaned it up, made it obvious what was
going on, and the vectorizer did its thing well. Then they put it back on the
original machine and it ran twice as fast.
A: So, one of the nice things that’s happened is that as
people are looking at ways to scale power, performance, they’re finally getting
around to dealing with parallelism. The offer that we’re trying to provide is
portable, high level, standards-based, and you can use it now.
You said “why.” That’s why. Our customers and developers
say “if you can do that, that’s really valuable.” Now. We’re four men and a
pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but
the thought and the promise and the early data is really good.
Q: OK. Well, great.
A: Was that a good use of the time?
Q: That’s a very good use of the time. Let me poke on one
thing a little bit. Conceptually, it ought to be simpler to write code to that
kind of a shared memory model and get parallelism out of the code that way.
Now, on the other hand, there was a talk – sorry, I forget his name, he was one of the software guys working on Larrabee [it was Tom Forsyth; see my post The Problem with Larrabee again] – where he said someone on the project had written four renderers, and three of them were for Larrabee. He was having one hell of a
time trying to get something that performed well. His big issue, at least what
it came down to from what I remember of the talk, was memory bandwidth.
A: Well, first of all, we’ve said Larrabee’s not a
product. As I’ve said, one of the things that is critical, you’ve got the
compute-bound, you’ve got the memory-bound, and most people are somewhere in
between, but you have to be able to handle the two edge cases. We understand
that, and we intend to deliver a really good value across the spectrum. Now,
Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation
off the silicon we used, no one cares about that, but on Knight’s Ferry, the memory bus is 256
bits wide. Relatively narrow, and for a graphics processor, very narrow. There
are definitely design decisions in how that chip was made that would limit the
bandwidth. And the memory it was designed with is slower than the memory today,
you have all of the normal things. But if you went downstairs to the show
floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic
ray-tracer.
[What follows is a bit confused. He didn’t mean the
Austrian Crown stochastic ray-tracing demo, but rather the real-time
ray-tracing demo. As I said in my immediately previous post (Random
Things of Interest at IDF 2011), the real-time demo is on a set of Knight’s
Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t
seen the real-time demo, just the stochastic one; the latter is not on Knight’s
Ferry.]
Q: I’ve seen that one. The Austrian Crown?
A: Yes.
Q: I thought that was on a cluster.
A: In the little box behind there, he’s able to scale
from one to eight Knight’s Ferries.
Q: He never told me there was a Knight’s Ferry in there.
A: Yes, it’s all Knight’s Ferry.
Q: Well, I’m going to go down there and beat on him a
little bit.
A: I’m about to point you to a YouTube site, it got
compressed and thrown up on YouTube. You can’t get the impact of the complexity
of the rays, but you can at least get the superficial idea of the responsiveness
of the system from Knight’s Ferry.
[He didn’t point me to YouTube, or I lost it, but here’s one I found.
Ignore the fact that the introduction is in Swedish or something [it's Dutch, actually]; Daniel – and it’s
Daniel, not David – speaks English, and gives a good demo. Yes, everybody in
the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the Random Things of Interest post to directly include it.]
A: Well, if you believe that what we’re going to do in our
mainstream processors is roughly double the FLOPS every generation for the next
many generations, that’s our intent. What if we can do that on the MIC line as
well? By the time you get to where ray-tracing would be practical, you could see multiples of those being integrated into a single device [added in transcription: Multiple MICs in a single device? Hierarchical MIC?] becoming practical computationally. That won’t be far from now. So, it’s a nice demo. David’s an
expert in his field, I didn’t hear what he said, but if you want to see the
device downstairs actually running a fairly strenuous graphics workload, take a
look at that.
Q: OK. I did go down there and I did see that, I just
didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.]
On that HDR display it is gorgeous. [Where “it” = the stochastically-ray-traced Austrian
Crown. It is.]
[At that point, Dave Patterson walked in, which
interrupted us. We said hello – I know Dave of old, a bit – thanks were
exchanged with Joe, and I departed.]
[I can’t believe this is the end of the last one. I
really don’t like transcribing.]
7 comments:
It may be worth noting for your readers that Knights Ferry is a teraflop of single precision performance, not double. This is due to the graphics heritage, of course, where full-speed doubles are not needed (yet). The double precision performance maxes out at a mere 175-200 GF/s as far as I know.
Presumably the product version will remedy this.
Another big open question is how big a memory size they can provide. 2GB/core is my baseline need for example.
"Swedish or something" is Dutch, actually.
Thanks, I knew someone would eventually let me know. Will update the post appropriately.
Oh, and anonymous, I did make updates to clarify that everything is SP in the current version.
2GB/core, 64 cores, 128 GB? Not unrealistic in 2013. (I picked 64 since they've said "more than 50" and that's the obvious number > 50.)
ASCI Red, not ASCII Red. It stands for Accelerated Strategic Computing Initiative, not to be confused with the character set.
Alan, thanks for the correction. Will fix.
why does Intel waste money with this crap? build a decent gpu why don't you