Thursday, September 29, 2011

A Conversation with Intel’s James Reinders at IDF 2011

At the recent Intel Developer Forum (IDF), I was given the opportunity to interview James Reinders. James is Director, Software Evangelist in Intel’s Software and Services Group in Oregon, and the conversation ranged far and wide, from programming languages, to frameworks, to transactional memory, to the use of CUDA, to Matlab, to vectorizing for execution on Intel’s MIC (Many Integrated Core) architecture.

Intel-provided information about James:
James Reinders is an expert on parallel computing. James is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including systolic array systems WARP and iWarp, and the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for multiple Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. His most recent book is “Intel Threading Building Blocks” from O'Reilly Media, which has been translated into Japanese, Chinese and Korean. James has published numerous articles, contributed to several books, and one of his current projects is co-authoring a new book on parallel programming to be released in 2012.

I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that this generated. (Thank you to all who responded!)

This is #2 in a series of three such transcripts. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knights” accelerator boards using them, since some important things I learned came from outside the interviews.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

Pfister: [Discussing where I’m coming from, crowd-sourced question list, HPC & MIC focus here.] So where would you like to start?

Reinders: Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so do your best. It has been for a long, long time, so hopefully I have a very realistic view of what works and what doesn’t work. I think I surprise some people with optimism about where we’re going, but I have some good reasons to see there’s a lot of things we can do in the architecture and the software that I think will surprise people to make that a lot more approachable than you would expect. Amdahl’s law is still there, but some of the difficulties that we have with the systems in terms of programming, the nondeterminism that gets involved in the programming, which you know really destroys the paradigm of thinking how to debug, those are solvable problems. That surprises people a lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago, computers are so much faster and it benefits the tools. Think about how much more the tools can do. You know, your compiler still compiles in about the same time it did 10 years ago, but now it’s doing a lot more, and now that multicore has become very prevalent in our hardware architecture, there are some hooks that we are going to get into the hardware that will solve some of the debugging problems that debugging tools can’t do by themselves because we can catch the attention of the architects and we understand enough that there’s some give-and-take in areas that might surprise people, that they will suddenly have a tool where people say “how’d you solve that problem?” and it’s over there under the covers. So I’m excited about that.

[OK, so everybody forgive me for not jumping right away on his fix for nondeterminism. What he meant by that was covered later.]

Pfister: So, you’re optimistic?

Reinders: Optimistic that it’s not the end of the world.

Pfister: OK. Let me tell you where I’m coming from on that. A while back, I spent an evening doing a web survey of parallel programming languages, and made a spreadsheet of 101 parallel programming languages [see my much earlier post, 101 Parallel Programming Languages].

Reinders: [laughs] You missed a few.

Pfister: I’m sure I did. It was just one night. But not one of those was being used. MPI and OpenMP, that was it.

Reinders: And Erlang has had some limited popularity, but is dying out. They’re a lot like AI and some other things. They help solve some problems, and then if the idea is really an advance, you’ll see something from that materialize in C or C++, Java, or C#. Those languages teach us something that we then use where we really want it.

Pfister: I understand what you’re saying. It’s like MapReduce being a large-scale version of the old LISP mapcar.

Reinders: Which was around in the early 70s. A lot of people picked up on it, it’s not a secret but it’s still, you know, on the edge.

Pfister: I heard someone say recently that there was a programming crisis in the early 80s: How were you going to program all those PCs? It was solved not by programming, but by having three or four frameworks, such as Excel or Word, that some experts in a dark room wrote, everybody used, and it spread like crazy. Is there anything now like that which we could hope for?

Reinders: You see people talk about Python, you see Matlab. Python is powerful, but I think it’s sort of trapped between general-purpose programming and the Matlab. It may be a big enough area; it certainly has a lot of followers. Matlab is a good example. We see a lot of people doing a lot in Matlab. And then they run up against barriers. Excel has the same thing. You see Excel grow up and people do incredibly hairy things. We worked with Microsoft a few years ago, and they’ve added parallelism to Excel, and it’s extremely important to some people. Some people have spreadsheets out there that do unbelievable things. You change one cell, and it would take a computer from just a couple of years ago and just stall it for 30 minutes while it recomputes. [I know of people in the finance industry who go out for coffee for a few hours if they accidentally hit F5.] Now you can do that in parallel. I think people do gravitate towards those frameworks, as you’re saying. So which ones will emerge? I think there’s hope. I think Matlab is one; I don’t know that I’d put my money on that being the huge one. But I do think there’s a lot of opportunity for that to hide this compute power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did that, they removed something that you would have programmed over and over again, made it accessible without it looking like programming.

Pfister: People did spreadsheets without realizing they were programming, because it was so obvious.

Reinders: Yes, you’re absolutely right. I tend to think of it in terms of libraries, because I’m a little bit more of an engineer. I do see development of important libraries that use unbelievable amounts of compute power behind them and then simply do something that anyone could understand. Obviously image processing is one [area], but there are other manipulations that I think people will just routinely be able to throw into an application, but what stands behind them is an incredibly complex library that uses compute power to manipulate that data. You see Apple use a lot of this in their user interface, just doing this [swipes] or that to the screen, I mean the thing behind that uses parallelism quite well.

Pfister: But this [swipes] [meaning the thing you do] is simple.

Reinders: Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the modern equivalent of using spreadsheets or Word. It’s the user interfaces, and they are demanding a lot behind them. It’s unbelievable the compute power that can use.

Pfister: Yes, it is. And I really wonder how many times you’re going to want to scan your pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a couple of days.

Reinders: Right, but I think rather than that being an activity, it’s just something your computer does for you. It disappears. Most of us don’t want to organize things, we want it just done. And Google’s done that on the web. Instead of keeping a million bookmarks to find something, you do a search.

Pfister: Right. I used to have this incredible tree of bookmarks, and could never find anything there.

Reinders: Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.

Pfister: I remember when it was a major feature of Firefox that it provided searching of your bookmarks.

Reinders: [Laughter]

Pfister: You mentioned nondeterminism. Are there any things in the hardware that you’re thinking of? IBM Blue Gene just said they have transactional memory, in hardware, on a chip. I’m dubious.

Reinders: Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long time, we being the industry, Intel included. At first we hoped “Wow, get rid of locks, we’ll all just use transactional memory, it’ll just work.” Well, the shortest way I can say why it doesn’t work is that software people want transactions to be arbitrarily large, and hardware needs it to be constrained, so it can actually do what you’re asking it to do, like holding a buffer. That’s a nonstarter.
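[A minimal sketch, in C, of the buffering constraint just described. This is my own toy, not any real hardware or Intel interface: the names tx_t, tx_begin, tx_write, tx_read, tx_commit, and the capacity of 4 are all hypothetical. It is single-threaded and ignores conflict detection entirely; it only shows why a fixed-size buffer puts a hard limit on how big a transaction can be.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical size of the "hardware" write buffer (an assumption). */
#define TX_CAPACITY 4

typedef struct {
    int   *addr[TX_CAPACITY];   /* where each buffered write will go */
    int    value[TX_CAPACITY];  /* value to store there on commit */
    size_t count;               /* distinct locations buffered so far */
    bool   aborted;             /* set once the buffer overflows */
} tx_t;

/* Start an empty transaction. */
void tx_begin(tx_t *tx)
{
    assert(tx != NULL);
    tx->count = 0;
    tx->aborted = false;
}

/* Buffer a write; memory is untouched until commit.
   Returns false and marks the transaction aborted if the buffer is full. */
bool tx_write(tx_t *tx, int *addr, int value)
{
    if (tx->aborted)
        return false;
    for (size_t i = 0; i < tx->count; i++) {
        if (tx->addr[i] == addr) {      /* location already buffered */
            tx->value[i] = value;
            return true;
        }
    }
    if (tx->count == TX_CAPACITY) {     /* transaction too big for the buffer */
        tx->aborted = true;
        return false;
    }
    tx->addr[tx->count] = addr;
    tx->value[tx->count] = value;
    tx->count++;
    return true;
}

/* Read through the buffer so a transaction sees its own pending writes. */
int tx_read(const tx_t *tx, const int *addr)
{
    for (size_t i = 0; i < tx->count; i++)
        if (tx->addr[i] == addr)
            return tx->value[i];
    return *addr;
}

/* Apply every buffered write, or none at all if the transaction aborted. */
bool tx_commit(tx_t *tx)
{
    if (tx->aborted)
        return false;
    for (size_t i = 0; i < tx->count; i++)
        *tx->addr[i] = tx->value[i];
    tx->count = 0;
    return true;
}
```

[A transaction that touches four locations commits; a fifth distinct tx_write returns false and the whole thing is discarded. Software would like TX_CAPACITY to be unbounded, and that is exactly what a real buffer cannot offer.]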

So now what’s happening? Rocks was looking at this in Sun, a hybrid technique, and unfortunately they didn’t bring that to market. Nobody outside the team knows exactly what happened, but the project as a whole failed, rather than saying transactional memory was the death. But they had a hard time figuring out how you engineer that buffering. A lot of smart people are looking at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a single socket. It makes sense to me why a constraint like that would be buildable. The hard part is then how do you wrap that into a programming model. Blue Gene’s obviously a very high end machine, so those developers have more familiarity with constraints and dealing with it. Making it general purpose is a very hard problem, very attractive, but I think that at the end of the day, all transactional memory will do is be another option, that may be less error-prone, to use in frameworks or toolkits. I don’t see a big shift in programming model where people say “Oh, I’m using transactional memory.” It’ll be a piece of infrastructure that toolkits like Threading Building Blocks or OpenMP or Cilk+ use. It’ll be important for us in that it gives better guarantees.

The things I more had in mind is you’re seeing a whole class of tools. We’ve got a tool that can do deadlock and race detection dynamically and find it; a very, very good tool. You see companies like TotalView looking at what they would call replaying, or unwinding, going backwards, with debuggers. The problem with debuggers if your program’s nondeterministic is you run it to a breakpoint and say, whoa, I want to see what happened back here, what we usually do is just pop out of the debugger and run it with an earlier breakpoint, or re-run it. If the program is nondeterministic, you don’t get what you want. So the question is, can the debugger keep enough information to back up? Well, the thing that backing up and debugging, deadlock detection, and race detection, all those things have in common is that they tend to run two or three orders of magnitude slower when you’re using those techniques. Well, that’s not compelling. But, the cool part is, with the software, we’re showing how to detect those – just a thousand times slower than real time.
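[To make the replay idea concrete, here is a toy sketch in C. It is not how Intel's or TotalView's tools work; I have no inside knowledge of either. The names schedule_log_t and run_interleaved are hypothetical, and rand() merely stands in for a real scheduler. The point is only that if you log every nondeterministic choice, you can re-run and get the same execution.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STEPS_PER_WORKER 3
#define TOTAL_STEPS (2 * STEPS_PER_WORKER)

/* The recorded schedule: which logical worker ran at each step. */
typedef struct {
    unsigned char choice[TOTAL_STEPS];  /* 0 = worker A, 1 = worker B */
    size_t        len;
} schedule_log_t;

/* Two logical workers share x: A adds 1, B doubles. The final value
   depends on how their steps interleave.
   replay == NULL: pick the next worker at random (a stand-in scheduler).
   replay != NULL: follow the recorded choices instead.
   out_log, if non-NULL, receives every choice actually taken. */
int run_interleaved(const schedule_log_t *replay, schedule_log_t *out_log)
{
    int x = 1;
    int left_a = STEPS_PER_WORKER;
    int left_b = STEPS_PER_WORKER;
    size_t step = 0;

    if (out_log != NULL)
        out_log->len = 0;

    while (left_a > 0 || left_b > 0) {
        int pick;
        if (left_a == 0) {
            pick = 1;                       /* only B has work left */
        } else if (left_b == 0) {
            pick = 0;                       /* only A has work left */
        } else if (replay != NULL) {
            assert(step < replay->len);
            pick = replay->choice[step];    /* deterministic replay */
        } else {
            pick = rand() % 2;              /* the nondeterminism */
        }

        if (pick == 0) { x = x + 1; left_a--; }
        else           { x = x * 2; left_b--; }

        if (out_log != NULL)
            out_log->choice[out_log->len++] = (unsigned char)pick;
        step++;
    }
    return x;
}
```

[Run it once with replay set to NULL and keep the log; pass that log back in and you get the same answer every time. The schedule A,A,A,B,B,B gives (1+1+1+1)*2*2*2 = 32, while B,B,B,A,A,A gives 1*2*2*2+1+1+1 = 11, which is the kind of difference that makes an unrecorded re-run useless for debugging. Real tools have to log vastly more than one bit per step, which is where the slowdown comes from.]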

Now we have the cool engineering problem: Can you make it faster? Is there something you could do in the software or the hardware and make that faster? I think there is, and a lot of people do. I get really excited when you solve a hard problem, can you replay a debug, yeah, it’s too slow. We use it to solve really hard problems, with customers that are really important, where you hunker down for a week or two using a tool that’s a thousand times slower to find the bug, and you’re so happy you found it – I can’t stand out in a booth and market and have a million developers use it. That won’t happen unless we get it closer to real time. I think that will happen. We’re looking at ways to do that. It’s a cooperative thing between hardware and software, and it’s not just an Intel thing; obviously the Blue Gene team worries about these things, Sun’s team was worried about them. There’s actually a lot of discussion between those small teams. There aren’t that many people who understand what transactional memory is or how to implement it in hardware, and the people who do talk to each other across companies.

[In retrospect, while transcribing this, I find the sudden transition back to TM to be mysterious. Possibly James was veering away from unannounced technology, or possibly there’s some link between TM and 1000x speedups of playback. If there is such a link, it’s not exactly instantly obvious to me.]

Pfister: At a minimum, at conferences.

Reinders: Yes, yes, and they’d like to see the software stack on top of them come together, so they know what hardware to build to give whatever the software model is what it needs. One of the things we learned about transactional memory is that the software model is really hard. We have a transactional memory compiler that does it all in software. It’s really good. We found that when people used it, they treated transactional memory like locks and created new problems. They didn’t write a transactional memory program from scratch to use transactional memory, they took code they wrote for locks and tried to use transactional memory instead of locks, and that creates problems.

Pfister: The one place I’ve seen where rumors showed someone actually using it was the special-purpose Java machine Azul. 500 plus processors per rack, multiple racks, point-to-point connections with a very pretty diagram like a rosette. They got into a suit war with Sun. And some of the things they showed were obvious applications of transactional memory.

Reinders: Hmm.

Pfister: Let’s talk about support for things like MIC. One question I had was that things like CUDA, which let you just annotate your code, well, do more than that. But I think CUDA was really a big part of the success of Nvidia.

Reinders: Oh, absolutely. Because how else are you going to get anything to go down your shader pipeline for a computation if you don’t give a model? And by lining up with one model, no matter the pros or cons, or how easy or hard it was, it gave a mechanism, actually a low-level mechanism, that turns out to be predictable because the low-level mechanism isn’t trying to do anything too fancy for you, it’s basically giving you full control. That’s a problem to get a lot of people to program that way, but when a programmer does program that way, they get what the hardware can give them. We don’t need a fancy compiler that gets it right half the time on top of that, right? Now everybody in the world would like a fancy compiler that always got it right, and when you can build that, then CUDA and that sort of thing just poof! Gone. I’m not sure that’s a tractable problem on a device that’s not more general than that type of pipeline.

So, the challenge I see with CUDA, and OpenCL, and even C++AMP is that they’re going down the road of saying look, there are going to be several classes of devices, and we need you the programmer to write a different version of your program for each class of device. Like in OpenCL, you can take a function and write a version for a CPU, for a GPU, a version for an accelerator. So in this terminology, OpenCL is proposing CPU is like a Xeon, GPU is like a Tesla, an accelerator something like MIC. We have a hard enough problem getting one version of an optimized program written. I think that’s a fatal flaw in this thing being widely adopted. I think we can bring those together.

What you really are trying to say is that part of your program is going to be restrictive enough that it can be vectorized, done in parallel. I think there are alternatives to this that will catch on and mitigate the need to write much code in OpenCL and in CUDA. The other flaw with those techniques is that in a world where you have a GPU and a CPU, the GPU’s got a job to do on the user interface, and so far we’ve not described what happens when applications mysteriously try to send some to the GPU, some to the CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK, mea culpa for not interrupting and mentioning Tianhe-1A.] AMD has proposed in their future architectures that they’re going to produce a meta-language that OpenCL targets, and then the hardware can target some to the GPU, and some to the CPU. So I understand the problem, and I don’t know if that solution’s the right one, but it highlights that the problem’s understood if you write too much OpenCL code. I’m personally more of a believer that we find higher-level programming interfaces like Cilk Plus’s array notations, add array notations to C that explicitly tells you vectorize and the compiler can figure out whether that’s SSE, is it AVX, is it the 512-bit wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But don’t pollute the programming language by telling the programmer to write three versions of your code. The good news is, though, if you do use OpenCL or CUDA to do that, you have extreme control of the hardware and will get the best hardware results you can, and we learn from that. I just think the learnings are going to drive us to more abstract programming models. That’s why I’m a big believer in the Cilk plus stuff that we’re doing.

Pfister: But how many users of HPC systems are interested in squeezing that last drop of performance out?

Reinders: HPC users are extremely interested in squeezing performance if they can keep a single source code that they can run everywhere. I hear this all the time, you know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever, just don’t tell me it has to be rewritten in a strange language that’s only supported on your machine. It’s pretty consistent. So the success of CUDA, to be used on those machines, it’s limited in a way, but it’s been exciting. But it’s been a strain on the people who have done that because CUDA code’s not going to run on an Intel machine [Well, actually, the Portland Group has a CUDA C/C++ compiler targeting x86. I do not know how good the output code performance is.]. OpenCL offers some opportunities to run everywhere, but then has problems of abstraction. Nvidia will talk about 400X speedups, which aren’t real, well that depends on your definition of “real”.

Pfister: Let’s not start on that.

Reinders: OK, well, what we’re seeing constantly is that vectorization is a huge challenge. You talk to people who have taken their cluster code and moved it to MIC [Cluster? No shared memory?], very consistently they’ll tell us stories like, oh, “We ported in three days.” The Intel marketing people are like “That’s great! Three days!” I ask why the heck did it take you three days? Everybody tells me the same thing: It ran right away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few days to vectorize because otherwise performance was terrible. They’re trying to use the 512-bit-wide vectors, and their original code was written using SSE [Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations]. They can’t automatically translate, you have to restructure the loop because it’s 512 bits wide – that should be automated, and if we don’t get that automated in the next decade we’ve made a huge mistake as an industry. So I’m hopeful that we have solutions to that today, but I think a standardized solution to that will have to come forward.
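[A small C sketch of why the width gets baked in. I have deliberately not used real SSE intrinsics so this compiles anywhere; the unrolled body of scale_add_fixed4 only stands in for a 4-lane vector operation. Both function names are hypothetical. Whether a particular compiler actually vectorizes the second version depends on the compiler, its version, and its flags, so treat that part as an assumption rather than a promise.]

```c
#include <assert.h>
#include <stddef.h>

/* Shaped like hand-written 4-wide SIMD code: the 4 (four floats = 128 bits)
   is part of the loop structure, including the leftover handling.
   Moving to 16 floats per vector means rewriting all of it. */
void scale_add_fixed4(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* stands in for one 4-lane multiply-add */
        y[i + 0] += a * x[i + 0];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)          /* remainder loop, also tied to the 4 */
        y[i] += a * x[i];
}

/* Width-agnostic: no vector width appears anywhere in the source.
   C99 `restrict` promises that x and y do not overlap, which removes
   the aliasing question a vectorizing compiler would otherwise face. */
void scale_add_portable(size_t n, float a,
                        const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

[Both compute y[i] += a * x[i]. The first has to be restructured by hand for every new vector width; the second leaves the choice of width to the compiler, which is the direction Reinders is arguing for.]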

Pfister: I really wonder about that, because wildly changing the degree of parallelism, at least at a vector level – if it’s not there in the code today, you’ve just got to rewrite it.

Reinders: Right, so we’ve got low-hanging fruit, we’ve got codes that have the parallelism today, we need to give them a better way of specifying it. And then yes, over time, those need to migrate to that [way of specifying parallelism in programs]. But migrating the code where you have to restructure it a lot, and then you do it all in SSE intrinsics, that’s very painful. If it feels more readable, more intuitive, like array extensions to the language, I give it better odds. But it’s still algorithmic transformation. They have to teach people where to find their data parallelism; that’s where all the scaling is in an application. If you don’t know how to expose it or write programs that expose it, you won’t take advantage of this shift in the industry.

Pfister: Yes.

Reinders: I’m supposed to make sure you wander down at about 11:00.

Pfister: Yes, I’ve got to go to the required press briefing, so I guess we need to take off. Thanks an awful lot.

Reinders: Sure. If there are any other questions we need to follow up on, I’ll be happy to talk to you. I hope I’ve knocked off a few of your questions.

Pfister: And then some. Thanks.

[While walking down to the press briefing, I asked James whether the synchronization features he had from the X86 architecture were adequate for MIC. He said that they were OK for the 30 or so cores in Knights Ferry, but when you got above 40, they would need to do something additional. Interestingly, after the conference, there was an Intel press release about the Intel/Dell “home run” win at TACC – using Knights Corner, “an innovative design that includes more than 50 cores.” This dovetails with what Joe Curley told me about Knights Corner not being the same as Knights Ferry. Stay tuned for the next interview.]


kme said...

In terms of having to rewrite your code for different vector widths, the latest builds of GCC have gotten very good at vectorising plain, non-vector C loops - on x86, often producing SSE output that is close to as good as you can get using intrinsics. If GCC can do it, then I'm sure ICC can too, and this obviously gives you the "just recompile and take advantage of 512 bit vectors" special sauce that was mentioned. You don't even need array annotations in the language to do this - just good aliasing analysis (which is helped by C99's restrict keyword).

Greg Pfister said...

Good alias analysis is obviously crucial, with every bit of additional programmer-supplied information a boost.

Yet even though this technology has been refined for decades, people still find it necessary to use intrinsics to get what they need. So clearly it's not omnipotent, and going to very wide widths still often requires program modifications.

Come to think of it, though, the new Sandy Bridge AVX is already 256 bits wide. So maybe if people can set up to use that, maybe it wouldn't be that big a leap to MIC's 512 bits.

David Brenner said...

I just wanted to note that I find these interviews extremely interesting. Great questions!

One other thing - it seems like we've been hearing the same thing about parallelism for the past 10 years (programmers not using parallel languages). Since I'm still in school, it seems to me like the adoption has already happened, but I guess that may not be the trend in industry. So, what's going to win out? Will it be abstraction or explicitly integrating it into other languages (C, C#, Python, etc) or are we actually going to see an embrace of languages like Haskell and Erlang?

Greg Pfister said...

Thanks! David.

It's a PITA to transcribe these, and IDF is rapidly turning into old news (which is why I try to avoid doing news: short shelf life), so I'm really glad to have the feedback.

About languages, I think I agree with what Reinders said here: If there are truly useful things in a language, the most likely thing to happen is that they get stuffed into existing "languages you really want." Usually the features are somewhat mangled in the process (rotten syntax, incomplete semantics), but that's life.

There are major barriers to widespread adoption of any brand new language: education, ecosystem, compatibility with a jillion existing LOC... Sometimes one makes it big (like Java), but that's the exception, not the rule.

Useful frameworks often have an easier time of it, I think.

Noz said...

+1 for transcribing the interviews, I like them very much, too. And I also re-read the articles about 101 parallel languages, keep it up!

Paul A. Clayton said...

With respect to "replaying, or unwinding, going backwards, with debuggers" and transactional memory, rollback is a standard component of transactions, so it might be reasonable that TM (or versioned memory) hardware might be used to facilitate such.

bartoszmilewski said...

About debugging running one or two orders of magnitude slower, my company, Corensic, makes a concurrency bug accelerator that slows programs down about 2x (just one binary order of magnitude). So this problem is solvable.
