Showing posts with label Dell. Show all posts

Thursday, September 29, 2011

A Conversation with Intel’s James Reinders at IDF 2011


At the recent Intel Developer Forum (IDF), I was given the opportunity to interview James Reinders. James is the Director, Software Evangelist, of Intel’s Software and Services Group in Oregon, and the conversation ranged far and wide, from programming languages, to frameworks, to transactional memory, to the use of CUDA, to Matlab, to vectorizing for execution on Intel’s MIC (Many Integrated Core) architecture.


Intel-provided information about James:
James Reinders is an expert on parallel computing. James is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the systolic array systems WARP and iWarp, and the world's first TeraFLOP supercomputer (ASCI Red), as well as compiler and architecture work for multiple Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. His most recent book is “Intel Threading Building Blocks” from O'Reilly Media, which has been translated into Japanese, Chinese, and Korean. James has published numerous articles, contributed to several books, and one of his current projects is co-authoring a new book on parallel programming to be released in 2012.

I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that this generated. (Thank you to all who responded!)

This is #2 in a series of three such transcripts. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

Pfister: [Discussing where I’m coming from, crowd-sourced question list, HPC & MIC focus here.] So where would you like to start?

Reinders: Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so do your best. It has been for a long, long time, so hopefully I have a very realistic view of what works and what doesn’t work. I think I surprise some people with optimism about where we’re going, but I have some good reasons to see there’s a lot of things we can do in the architecture and the software that I think will surprise people to make that a lot more approachable than you would expect. Amdahl’s law is still there, but some of the difficulties that we have with the systems in terms of programming, the nondeterminism that gets involved in the programming, which you know really destroys the paradigm of thinking how to debug, those are solvable problems. That surprises people a lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago, computers are so much faster and it benefits the tools. Think about how much more the tools can do. You know, your compiler still compiles in about the same time it did 10 years ago, but now it’s doing a lot more, and now that multicore has become very prevalent in our hardware architecture, there are some hooks that we are going to get into the hardware that will solve some of the debugging problems that debugging tools can’t do by themselves because we can catch the attention of the architects and we understand enough that there’s some give-and-take in areas that might surprise people, that they will suddenly have a tool where people say “how’d you solve that problem?” and it’s over there under the covers. So I’m excited about that.

[OK, so everybody forgive me for not jumping right away on his fix for nondeterminism. What he meant by that was covered later.]

Pfister: So, you’re optimistic?

Reinders: Optimistic that it’s not the end of the world.

Pfister: OK. Let me tell you where I’m coming from on that. A while back, I spent an evening doing a web survey of parallel programming languages, and made a spreadsheet of 101 parallel programming languages [see my much earlier post, 101 Parallel Programming Languages].

Reinders: [laughs] You missed a few.

Pfister: I’m sure I did. It was just one night. But not one of those was being used. MPI and OpenMP, that was it.

Reinders: And Erlang has had some limited popularity, but is dying out. They’re a lot like AI and some other things. They help solve some problems, and then if the idea is really an advance, you’ll see something from that materialize in C or C++, Java, or C#. Those languages teach us something that we then use where we really want it.

Pfister: I understand what you’re saying. It’s like MapReduce being a large-scale version of the old LISP mapcar.

Reinders: Which was around in the early 70s. A lot of people picked up on it, it’s not a secret but it’s still, you know, on the edge.

Pfister: I heard someone say recently that there was a programming crisis in the early 80s: How were you going to program all those PCs? It was solved not by programming, but by having three or four frameworks, such as Excel or Word, that some experts in a dark room wrote, everybody used, and it spread like crazy. Is there anything now like that which we could hope for?

Reinders: You see people talk about Python, you see Matlab. Python is powerful, but I think it’s sort of trapped between general-purpose programming and Matlab. It may be a big enough area; it certainly has a lot of followers. Matlab is a good example. We see a lot of people doing a lot in Matlab. And then they run up against barriers. Excel has the same thing. You see Excel grow up, and people doing incredibly hairy things with it. We worked with Microsoft a few years ago, and they’ve added parallelism to Excel, and it’s extremely important to some people. Some people have spreadsheets out there that do unbelievable things. You change one cell, and it would take a computer from just a couple of years ago and just stall it for 30 minutes while it recomputes. [I know of people in the finance industry who go out for coffee for a few hours if they accidentally hit F5.] Now you can do that in parallel. I think people do gravitate towards those frameworks, as you’re saying. So which ones will emerge? I think there’s hope. I think Matlab is one; I don’t know that I’d put my money on that being the huge one. But I do think there’s a lot of opportunity for that to hide this compute power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did that, they removed something that you would have programmed over and over again, made it accessible without it looking like programming.

Pfister: People did spreadsheets without realizing they were programming, because it was so obvious.

Reinders: Yes, you’re absolutely right. I tend to think of it in terms of libraries, because I’m a little bit more of an engineer. I do see development of important libraries that use unbelievable amounts of compute power behind them and then simply do something that anyone could understand. Obviously image processing is one [area], but there are other manipulations that I think people will just routinely be able to throw into an application, but what stands behind them is an incredibly complex library that uses compute power to manipulate that data. You see Apple use a lot of this in their user interface, just doing this [swipes] or that to the screen, I mean the thing behind that uses parallelism quite well.

Pfister: But this [swipes] [meaning the thing you do] is simple.

Reinders: Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the modern equivalent of using spreadsheets or Word. It’s the user interfaces, and they are demanding a lot behind them. It’s unbelievable the compute power that they can use.

Pfister: Yes, it is. And I really wonder how many times you’re going to want to scan your pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a couple of days.

Reinders: Right, but I think rather than that being an activity, it’s just something your computer does for you. It disappears. Most of us don’t want to organize things, we want it just done. And Google’s done that on the web. Instead of keeping a million bookmarks to find something, you do a search.

Pfister: Right. I used to have this incredible tree of bookmarks, and could never find anything there.

Reinders: Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.

Pfister: I remember when it was a major feature of Firefox that it provided searching of your bookmarks.

Reinders: [Laughter]

Pfister: You mentioned nondeterminism. Are there any things in the hardware that you’re thinking of? IBM Blue Gene just said they have transactional memory, in hardware, on a chip. I’m dubious.

Reinders: Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long time, we being the industry, Intel included. At first we hoped “Wow, get rid of locks, we’ll all just use transactional memory, it’ll just work.” Well, the shortest way I can say why it doesn’t work is that software people want transactions to be arbitrarily large, and hardware needs it to be constrained, so it can actually do what you’re asking it to do, like holding a buffer. That’s a nonstarter.

So now what’s happening? Sun was looking at this with Rock, a hybrid technique, and unfortunately they didn’t bring that to market. Nobody outside the team knows exactly what happened, but the project as a whole failed, rather than transactional memory being its death. But they had a hard time figuring out how you engineer that buffering. A lot of smart people are looking at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a single socket. It makes sense to me why a constraint like that would be buildable. The hard part is then how you wrap that into a programming model. Blue Gene’s obviously a very high-end machine, so those developers have more familiarity with constraints and dealing with them. Making it general purpose is a very hard problem, very attractive, but I think that at the end of the day, all transactional memory will do is be another option, one that may be less error-prone, to use in frameworks or toolkits. I don’t see a big shift in programming model where people say “Oh, I’m using transactional memory.” It’ll be a piece of infrastructure that toolkits like Threading Building Blocks or OpenMP or Cilk+ use. It’ll be important for us in that it gives better guarantees.

The things I had more in mind is that you’re seeing a whole class of tools. We’ve got a tool that can do deadlock and race detection dynamically and find it; a very, very good tool. You see companies like TotalView looking at what they would call replaying, or unwinding, going backwards, with debuggers. The problem with debuggers, if your program’s nondeterministic, is you run it to a breakpoint and say, whoa, I want to see what happened back here; what we usually do is just pop out of the debugger and run it with an earlier breakpoint, or re-run it. If the program is nondeterministic, you don’t get what you want. So the question is, can the debugger keep enough information to back up? Well, the thing that backing up in debugging, deadlock detection, and race detection all have in common is that they tend to run two or three orders of magnitude slower when you’re using those techniques. Well, that’s not compelling. But, the cool part is, with the software, we’re showing how to detect those – just a thousand times slower than real time.

Now we have the cool engineering problem: Can you make it faster? Is there something you could do in the software or the hardware to make that faster? I think there is, and a lot of people do. I get really excited when you solve a hard problem – can you replay in a debugger? Yeah, but it’s too slow. We use it to solve really hard problems, with customers that are really important, where you hunker down for a week or two using a tool that’s a thousand times slower to find the bug, and you’re so happy you found it – but I can’t stand out in a booth and market that and have a million developers use it. That won’t happen unless we get it closer to real time. I think that will happen. We’re looking at ways to do that. It’s a cooperative thing between hardware and software, and it’s not just an Intel thing; obviously the Blue Gene team worries about these things, and Sun’s team was worried about them. There’s actually a lot of discussion between those small teams. There aren’t that many people who understand what transactional memory is or how to implement it in hardware, and the people who do talk to each other across companies.

[In retrospect, while transcribing this, I find the sudden transition back to TM to be mysterious. Possibly James was veering away from unannounced technology, or possibly there’s some link between TM and 1000x speedups of playback. If there is such a link, it’s not exactly instantly obvious to me.]

Pfister: At a minimum, at conferences.

Reinders: Yes, yes, and they’d like to see the software stack on top of them come together, so they know what hardware to build to give whatever the software model is what it needs. One of the things we learned about transactional memory is that the software model is really hard. We have a transactional memory compiler that does it all in software. It’s really good. We found that when people used it, they treated transactional memory like locks and created new problems. They didn’t write a transactional memory program from scratch to use transactional memory, they took code they wrote for locks and tried to use transactional memory instead of locks, and that creates problems.

Pfister: The one place I’ve seen where rumors showed someone actually using it was the special-purpose Java machine Azul. 500 plus processors per rack, multiple racks, point-to-point connections with a very pretty diagram like a rosette. They got into a suit war with Sun. And some of the things they showed were obvious applications of transactional memory.

Reinders: Hmm.

Pfister: Let’s talk about support for things like MIC. One question I had was that things like CUDA, which let you just annotate your code, well, do more than that. But I think CUDA was really a big part of the success of Nvidia.

Reinders: Oh, absolutely. Because how else are you going to get anything to go down your shader pipeline for a computation if you don’t give a model? And by lining up with one model, no matter the pros or cons, or how easy or hard it was, it gave a mechanism, actually a low-level mechanism, that turns out to be predictable because the low-level mechanism isn’t trying to do anything too fancy for you, it’s basically giving you full control. That’s a problem to get a lot of people to program that way, but when a programmer does program that way, they get what the hardware can give them. We don’t need a fancy compiler that gets it right half the time on top of that, right? Now everybody in the world would like a fancy compiler that always got it right, and when you can build that, then CUDA and that sort of thing just poof! Gone. I’m not sure that’s a tractable problem on a device that’s not more general than that type of pipeline.

So, the challenge I see with CUDA, and OpenCL, and even C++ AMP is that they’re going down the road of saying look, there are going to be several classes of devices, and we need you the programmer to write a different version of your program for each class of device. Like in OpenCL, you can take a function and write a version for a CPU, a version for a GPU, a version for an accelerator. So in this terminology, OpenCL is proposing that a CPU is like a Xeon, a GPU is like a Tesla, and an accelerator is something like MIC. We have a hard enough problem getting one version of an optimized program written. I think that’s a fatal flaw in this thing being widely adopted. I think we can bring those together.

What you really are trying to say is that part of your program is going to be restrictive enough that it can be vectorized, done in parallel. I think there are alternatives to this that will catch on and mitigate the need to write much code in OpenCL and in CUDA. The other flaw with those techniques is that in a world where you have a GPU and a CPU, the GPU’s got a job to do on the user interface, and so far we’ve not described what happens when applications mysteriously try to send some to the GPU, some to the CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK, mea culpa for not interrupting and mentioning Tianhe-1A.] AMD has proposed in their future architectures that they’re going to produce a meta-language that OpenCL targets, and then the hardware can target some to the GPU, and some to the CPU. So I understand the problem, and I don’t know if that solution’s the right one, but it highlights that the problem’s understood if you write too much OpenCL code. I’m personally more of a believer that we find higher-level programming interfaces like Cilk Plus’s array notations – add array notations to C that explicitly tell you to vectorize, and the compiler can figure out whether that’s SSE, AVX, the 512-bit wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But don’t pollute the programming language by telling the programmer to write three versions of your code. The good news is, though, if you do use OpenCL or CUDA to do that, you have extreme control of the hardware and will get the best hardware results you can, and we learn from that. I just think the learnings are going to drive us to more abstract programming models. That’s why I’m a big believer in the Cilk Plus stuff that we’re doing.

Pfister: But how many users of HPC systems are interested in squeezing that last drop of performance out?

Reinders: HPC users are extremely interested in squeezing performance if they can keep a single source code that they can run everywhere. I hear this all the time, you know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever, just don’t tell me it has to be rewritten in a strange language that’s only supported on your machine. It’s pretty consistent. So the success of CUDA, to be used on those machines, it’s limited in a way, but it’s been exciting. But it’s been a strain on the people who have done that, because CUDA code’s not going to run on an Intel machine [Well, actually, the Portland Group has a CUDA C/C++ compiler targeting x86. I do not know how good the output code performance is.]. OpenCL offers some opportunities to run everywhere, but then has problems of abstraction. Nvidia will talk about 400X speedups, which aren’t real, well that depends on your definition of “real”.

Pfister: Let’s not start on that.

Reinders: OK, well, what we’re seeing constantly is that vectorization is a huge challenge. You talk to people who have taken their cluster code and moved it to MIC [Cluster? No shared memory?], very consistently they’ll tell us stories like, oh, “We ported in three days.” The Intel marketing people are like “That’s great! Three days!” I ask, why the heck did it take you three days? Everybody tells me the same thing: It ran right away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few days to vectorize because otherwise performance was terrible. They’re trying to use the 512-bit-wide vectors, and their original code was written using SSE [Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations]. They can’t automatically translate; you have to restructure the loop because it’s 512 bits wide – that should be automated, and if we don’t get that automated in the next decade we’ve made a huge mistake as an industry. So I’m hopeful that we have solutions to that today, but I think a standardized solution to that will have to come forward.

Pfister: I really wonder about that, because wildly changing the degree of parallelism, at least at a vector level – if it’s not there in the code today, you’ve just got to rewrite it.

Reinders: Right, so we’ve got low-hanging fruit, we’ve got codes that have the parallelism today, we need to give them a better way of specifying it. And then yes, over time, those need to migrate to that [way of specifying parallelism in programs]. But migrating the code where you have to restructure it a lot, and then you do it all in SSE intrinsics, that’s very painful. If it feels more readable, more intuitive, like array extensions to the language, I give it better odds. But it’s still algorithmic transformation. They have to teach people where to find their data parallelism; that’s where all the scaling is in an application. If you don’t know how to expose it or write programs that expose it, you won’t take advantage of this shift in the industry.

Pfister: Yes.

Reinders: I’m supposed to make sure you wander down at about 11:00.

Pfister: Yes, I’ve got to go to the required press briefing, so I guess we need to take off. Thanks an awful lot.

Reinders: Sure. If there are any other questions we need to follow up on, I’ll be happy to talk to you. I hope I’ve knocked off a few of your questions.

Pfister: And then some. Thanks.

[While walking down to the press briefing, I asked James whether the synchronization features he had from the X86 architecture were adequate for MIC. He said that they were OK for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something additional. Interestingly, after the conference, there was an Intel press release about the Intel/Dell “home run” win at TACC – using Knight’s Corner, “an innovative design that includes more than 50 cores.” This dovetails with what Joe Curley told me about Knight’s Corner not being the same as Knight’s Ferry. Stay tuned for the next interview.]

Thursday, July 29, 2010

Standards Are About the Money


Nonstandard Cloud
Standards for cloud computing are a never-ending topic of cloud buzz ranging all over the map: APIs (programming interfaces), system management, legal issues, and so on.

With a few exceptions where the motivation is obvious (like some legal issues in the EU), most of these discussions miss a key point: Standards are implemented and used if and only if they make money for their implementers.

Whether customers think they would like them is irrelevant – unless that liking is strong enough to clearly translate into increased sales, paying back the cost of defining and implementing appropriate standards. "Appropriate" always means "as close to my existing implementation as possible" to minimize implementation cost.

That is my experience, anyway, having spent a number of years as a company representative to the InfiniBand Trade Association and the PCI-SIG, along with some interaction with the PowerPC standard and observation of DMTF and IETF standards processes.

Right now there's an obvious tension, since cloud customers see clear benefits to having an industry-wide, stable implementation target that allows portability among cloud system vendors, a point well-detailed in the Berkeley report on cloud computing.

That's all very nice, but unless the cloud system vendors see where the money is coming from, standards aren't going to be implemented where they count. In particular, when there are major market leaders, like Amazon and Google right now, it has to be worth more to those leaders than the lock-in they get from proprietary interfaces. I've yet to see anything indicating that it will be, so I am not very positive about cloud standards at the present time.

But it could happen. The road to any given standard is very often devious, always political, regularly suffused with all kinds of nastiness, and of course ultimately driven throughout by good old capitalist greed. An example I'm rather familiar with is the way InfiniBand came to be, and semi-failed.

The beginning was a presentation by Justin Rattner at the 1998 Intel Developer Forum, in which he declared Intel's desire for their micros to grow up to be mainframes (mmmm… really juicy profit margins!). He thought they had everything except for IO. Busses were bad. He actually showed a slide with a diagram that could have come right out of an IBM Parallel Sysplex white paper, complete with channels and channel directors (switches) connecting banks of storage with banks of computers. That was where we need to go, he said, at a commodity price point. 

Shortly thereafter, Intel founded the Next Generation IO Forum (NGIO), inviting other companies to join in the creation of this new industry IO standard. That sounds fine, and rather a step better than IBM did when trying to foist Micro Channel architecture on the world (a dismal failure), until you read the fine print in the membership agreement. There you found a few nasties. Intel had 51% of every vote. Oh, and if you had any intellectual property (IP) (patents) in the area, it now all belonged to Intel. Several companies did join, like Dell; they like to be "tightly integrated" with their suppliers.


A few folks with a tad of IP in the IO area, like IBM and Compaq (RIP), understandably declined to join. But they couldn't just let Intel go off and define something they would then have to license. So a collection of companies – initially Compaq, HP, and IBM – founded the rival Future IO Developer's Forum (FIO). Its membership agreement was much more palatable: One company, one vote; and if you had IP that was used, you had to promise to license it with terms that were "reasonable and nondiscriminatory," a phrase that apparently means something quite specific to IP lawyers.

Over the next several months, there was a steady movement of companies out of NGIO and into FIO. When NGIO became only Intel and Dell (still tightly integrated), the two merged as the InfiniBand Trade Association (IBTA). They even had a logo for the merger itself! (See picture.) The name "InfiniBand" was dreamed up by a multi-company collection of marketing people, by the way; when a technical group member told them he thought it was a great name (a lie) they looked worried. The IBTA had, in a major victory for the FIO crowd, the same key terms and conditions as FIO. In addition, Robert's Rules of Order were to be used, and most issues were to be decided by a simple majority (of companies).

Any more questions about where the politics comes in? Let's cover devious and nasty with a sub-story:

While on one of the IBTA groups, during a contentious discussion I happened to be leading for one side, I mentioned I was going on vacation for the next two weeks. The first day I was on vacation a senior-level executive of a company on the other side in the dispute, an executive not at all directly involved in IBTA, sent an email to another senior-level executive in a completely different branch of IBM, a branch with which the other company did a very large amount of business. It complained that I "was not being cooperative" and I had said on the IBTA mailing lists that certain IBM products were bad in some way. The obvious intent was that it be forwarded to my management chain through layers of people who didn't understand (or care) what was really going on, just that I had made this key customer unhappy and had dissed IBM products. At the very least, it would have chewed up my time disentangling the mess left after it wandered around forwards for two weeks (I was on vacation, remember?); at worst, it could have resulted in orders to me to be more "cooperative," and otherwise undermined me within my own company. Fortunately, and to my wife's dismay, I had taken my laptop on vacation and watched email; and a staff guy in the different division intercepted that email, forwarded it directly to me, and asked what was going on. As a result, I could nip it all in the bud.

It's sad and perhaps nearly unbelievable that precisely the same tactic – complain at a high level through an unrelated management chain – had been used by that same company against someone else who was being particularly effective against them.

Another, shorter, story: A neighbor of mine who was also involved in a similar inter-company dispute told me that, while on a trip (and he took lots of trips; he was a regional sales manager) he happened to return to his hotel room after checking out and found people going through his trash, looking for anything incriminating.

Standards can be nasty.

Anyway, after a lot of the dust settled and IB had taken on a fairly firm shape, Intel dropped development of its IB product. Exactly why was never explicitly stated, but the consensus I heard was that compared with others' implementations in progress it was not competitive. Without the veto power of NGIO, Intel couldn't shape the standard to match what it was implementing. With Intel out, Microsoft followed suit, and the end result was InfiniBand as we see it today: A great interconnect for high-end systems that pervades HPC, but not the commodity-volume server part the founders hoped that it would be. I suspect there are folks at Intel who think they would have been more successful at achieving the original purpose if they had their veto, since then it would have matched their inexpensive parts. I tend to doubt that, since in the meantime PCI has turned into a hierarchical switched fabric (PCI Express), eliminating many of the original problems stemming from it being a bus.

All this illustrates what standards are really about, from my perspective. Any relationship with pristine technical discussions or providing the "right thing" for customers is indirect, with all motivations leading through money – with side excursions through political, devious, and just plain nasty.