The Perils of Parallel: Intel Xeon Phi Announcement (& me)

1. No, I’m not dead. Not even sick. Been a long time since a post. More on this at the end.

2. So, Intel has finally announced a product ancestrally based on the long-ago Larrabee. The architecture became known as MIC (Many Integrated Core), development vehicles were named after tiny towns (Knights Corner/Knights Ferry – one was to be the product, but I could never keep them straight), and the final product is to be known as the Xeon Phi.

Why Phi? I don’t know. Maybe it’s the start of a convention of naming High-Performance Computing products after Greek letters. After all, they’re used in equations.

A micro-synopsis (see my post MIC and the Knights for a longer discussion): The Xeon Phi is a PCIe board containing 6GB of RAM and a chip with lots (I didn’t find out how many ahead of time) of X86 cores with wide (512 bit) vector units, able to produce over 1 TeraFLOP (more about that later). The X86 cores are programmed pretty much exactly like a traditional “big” single Xeon: All your favorite compilers can be used, and it runs Linux. Note that to accomplish that, the cores must be fully cache-coherent, just like a multi-core “big” Xeon chip. Compiler mods are clearly needed to target the wide vector units, and that Linux undoubtedly needed a few tweaks to do anything useful on the 50+ cores there are per chip, but they look normal. Your old code will run on it. As I’ve pointed out, modifications are needed to get anything like full performance, but you do not have to start out with a blank sheet of paper. This is potentially a very big deal.
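For a rough feel of where "over 1 TeraFLOP" can come from, here is a back-of-envelope sketch. Only the 512-bit vector width and "50+ cores" come from the announcement; the core count of 60, the 1.05 GHz clock, and counting a fused multiply-add as two operations per lane per cycle are my assumptions for illustration, not figures from the briefing.

```python
def peak_gflops(cores, clock_ghz, vector_bits=512, element_bits=64,
                flops_per_lane_per_cycle=2):
    """Theoretical peak in GFLOPS for a chip with wide vector units.

    flops_per_lane_per_cycle=2 assumes one fused multiply-add per lane
    per cycle, counted as two floating-point operations (an assumption).
    """
    lanes = vector_bits // element_bits  # 512 / 64 = 8 double-precision lanes
    return cores * clock_ghz * lanes * flops_per_lane_per_cycle


# Hypothetical configuration: 60 cores at 1.05 GHz (assumed, not from the post).
print(peak_gflops(cores=60, clock_ghz=1.05))
```

This is a peak number only: it assumes every vector lane on every core does useful work every cycle, which real code rarely manages.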

Since I originally published this, Intel has deluged me with links to their information. See the bottom of this post if you want to read them.

So, it’s here. Finally, some of us would say, but development processes vary and may have hangups that nobody outside ever hears about.

I found a few things interesting about the announcement.

Number one is their choice as to the first product. The one initially out of the blocks is, not a lower-performance version, but rather the high end of the current generation: The one that costs more ($2649) and has high performance on double precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see significant revenue right now out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while, now, so it has seen some trial use. At national labs and (maybe?) large enterprises.

It costs more than the clear competitor, Nvidia’s Tesla boards. $2649 vs. sub-$2000. For less peak performance. (Note: I've been told that AnandTech claims the new Nvidia K20s cost >$3000. I can't otherwise confirm that.) We can argue all day about whether the actual performance is better or worse on real applications, and how much the ability to start from existing code helps, but this pricing still stands out. Not that anybody will actually pay that much; the large customer targets are always highly-negotiated deals. But the Prof. Joes and the Kacklefoos don’t have negotiation leverage.

A second odd point came up in the Q & A period of the pre-announce concall. (I was invited, which is why I’ve come out of my hole to write this.) (Guilt.) Someone asked about memory bottlenecks; it has 310GB/s to its memory, which isn’t bad, but some apps are gluttons. This prompted me to ask about the PCIe bottleneck: Isn’t it also going to be starved for data delivered to it? I was told I was doing it wrong. I was thinking of the main program running on the host, foisting stuff off to the Phi. Wrong. The main program runs on the Phi itself, so the whole program runs on the many (slower) core card.
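To put numbers on why I asked about PCIe, here is a small sketch comparing how long it takes to move one board's worth of data at each speed. The 310GB/s and 6GB figures are from the announcement; the 8 GB/s PCIe figure is my assumption (the theoretical one-direction peak of a PCIe 2.0 x16 link), since the briefing did not say which PCIe generation the board uses.

```python
def transfer_seconds(gigabytes, bandwidth_gb_per_s):
    """Time to move `gigabytes` of data at a sustained bandwidth."""
    return gigabytes / bandwidth_gb_per_s


ONBOARD_GB_PER_S = 310.0  # Phi to its own memory, from the announcement
PCIE_GB_PER_S = 8.0       # ASSUMPTION: PCIe 2.0 x16 theoretical peak, one direction
BOARD_RAM_GB = 6.0        # RAM on the board, from the announcement

onboard = transfer_seconds(BOARD_RAM_GB, ONBOARD_GB_PER_S)
pcie = transfer_seconds(BOARD_RAM_GB, PCIE_GB_PER_S)
ratio = pcie / onboard

print(f"on-board: {onboard:.4f} s, over PCIe: {pcie:.4f} s, ratio: {ratio:.2f}x")
```

Under those assumptions the host link is well over an order of magnitude slower than the card's own memory, which is the gap that running the main program on the Phi itself sidesteps.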

This means they are, at this time at least, explicitly not taking the path I’ve heard Nvidia evangelists talk about recently: Having lots and lots of tiny cores, along with a number of middle-size cores, and much fewer Great Big cores – and they all live together in a crooked little… Sorry! on the same chip, sharing the same memory subsystem so there is oodles of bandwidth amongst them. This could allow the parts of an application that are limited by single- or few-thread performance to go fast, while the parts that are many-way parallel also go fast, with little impedance mismatch between them. On Phi, if you have a single-thread limited part, it runs on just one of the CPUs, which haven’t been designed for peak single-thread performance. On the other hand, the Nvidia stuff is vaporware and while this kind of multi-speed arrangement has been talked about for a long time, I don’t know of any compiler technology that supports it with any kind of transparency.
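The single-thread penalty can be sketched with an Amdahl's-law style calculation. This is a toy model of my own, not anything Intel presented: the baseline is one "big" core of speed 1.0, and the relative speed of 0.25 for a small core is a made-up number purely for illustration.

```python
def many_slow_cores_speedup(serial_fraction, cores, core_speed):
    """Speedup over ONE big core (speed 1.0) when the whole program,
    serial part included, runs on `cores` small cores of relative
    speed `core_speed`. Toy Amdahl-style model, ignoring memory effects.
    """
    parallel_fraction = 1.0 - serial_fraction
    # Serial work runs on a single slow core; parallel work spreads over all.
    time = (serial_fraction / core_speed
            + parallel_fraction / (cores * core_speed))
    return 1.0 / time


# Hypothetical: 50 cores, each a quarter the speed of a big core.
for serial in (0.0, 0.01, 0.1):
    speedup = many_slow_cores_speedup(serial, cores=50, core_speed=0.25)
    print(f"serial fraction {serial:.2f}: {speedup:.2f}x")
```

In this model even a modest serial fraction eats most of the benefit, because the serial part is not just unaccelerated but actively slowed down by running on a small core.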

A third item, and this seems odd, is the small speedups claimed by the marketing guys: Just 2X-4X. Eh what? 50 CPUs and only 2-4X faster?

This is incredibly refreshing. The claims of accelerator-foisting companies can be outrageous to the point that they lose all credibility, as I’ve written about before.

On the other hand, it’s slightly bizarre, given that at the same conference Intel has people talking about applications that, when retargeted to Phi, get 6.6X (in figuring out graph connections on big graphs) or 4.8X (analyzing synthetic aperture radar images).

On the gripping hand, I really see the heavy hand of Strategic Marketing smacking people around here. Don’t cannibalize sales of the big Xeon E5s! They are known to make Big Money! Someone like me, coming from an IBM background, knows a whole lot about how The Powers That Be can influence how seemingly competitive products are portrayed – or developed. I’ve a sneaking suspicion this influence is why it took so long for something like Phi to reach the market. (Gee, Pete, you’re a really great engineer. Why are you wasting your time on that piddly little sideshow? We’ve got a great position and a raise for you up here in the big leagues…) (Really.)

There are rationales presented: They are comparing apples to apples, meaning well-parallelized code on Xeon E5 Big Boys compared with the same code on Phi. This is to be commended. Also, Phi ain’t got no hardware ECC for memory. Doing ECC in software on the Phi saps its strength considerably. (Hmmm, why do you suppose it doesn’t have ECC? (Hey, Pete, got a great position for you…) (Or "Oh, we're not a threat. We don't even have ECC! Nobody will do serious stuff without ECC.")) Note: Since this pre-briefing, data sheets have emerged that indicate Phi has optional ECC. Which raises two questions: Why did they explicitly say otherwise in the pre-briefing? And: What does "optional" mean?

Anyway, Larrabee/MIC/Phi has finally hit the streets. Let the benchmark and marketing wars commence.

Now, about me not being dead after all:

I don’t do this blog-writing thing for a living. I’m on the dole, as it were – paid for continuing to breathe. I don’t earn anything from this blog; those Google-supplied ads on the sides haven’t put one dime in my pocket. My wife wants to know why I keep doing it. But note: having no deadlines is wonderful.

So if I feel like taking a year off to play Skyrim, well, I can do that. So I did. It wasn't supposed to be a year, but what the heck. It's a big game. I also participated in some pleasant Grandfatherly activities, paid very close attention to exactly when to exercise some near-expiration stock options, etc.

Finally, while I’ve occasionally poked my head up on Twitter or Facebook when something interesting happened, there hasn’t been much recently. X added N more processors to the same architecture. Yawn. Y went lower power with more cores. Yawn. If news outlets weren’t paid for how many eyeballs they attracted, they would have been yawning, too, but they are, so every minute twitch becomes an Epoch-Defining Quantum Leap!!! (Complete with ironic use of the word “quantum.”) No judgement here; they have to make a living.

I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.

----------------------------------------------------------------------------------------

Intel has deluged me with links. Here they are:

Intel® Xeon Phi™ product page: http://www.intel.com/xeonphi

Intel® Xeon Phi™ Coprocessor product brief: http://intel.ly/Q8fuR1

Accelerate Discovery with Powerful New HPC Solutions (Solution Brief) http://intel.ly/SHh0oQ

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors http://intel.ly/WYsJq9

YouTube Animation Introducing the Intel® Xeon Phi™ Coprocessor http://intel.ly/RxfLtP

Intel® Xeon Phi™ Coprocessor Infographic: http://ow.ly/fe2SP

VIDEO: The History of Many Core, Episode 1: The Power Wall. http://bit.ly/RSQI4g

Diane Bryant’s presentation, additional documents and pictures will be available at Intel Newsroom

1 comment:

Todd Bezenek said...: I have a few guesses.

Intel is jockeying to get into (or back into) the Supercomputing 500.

People finally realized you can't do much without independent threads of execution.

Unified Parallel C (UPC) needs a home.

The non-embarrassingly parallel applications are jealous of their friends.

November 18, 2012 at 6:04 PM

Monday, November 12, 2012
