
Monday, November 12, 2012

Intel Xeon Phi Announcement (& me)


1. No, I’m not dead. Not even sick. Been a long time since a post. More on this at the end.

2. So, Intel has finally announced a product ancestrally based on the long-ago Larrabee. The architecture became known as MIC (Many Integrated Cores), development vehicles were named after tiny towns (Knights Corner/Knights Ferry – one was to be the product, but I could never keep them straight), and the final product is to be known as the Xeon Phi.

Why Phi? I don’t know. Maybe it’s the start of a convention of naming High-Performance Computing products after Greek letters. After all, they’re used in equations.

A micro-synopsis (see my post MIC and the Knights for a longer discussion): The Xeon Phi is a PCIe board containing 6GB of RAM and a chip with lots (I didn’t find out how many ahead of time) of X86 cores with wide (512-bit) vector units, able to produce over 1 TeraFLOP (more about that later). The X86 cores are programmed pretty much exactly like a traditional “big” single Xeon: All your favorite compilers can be used, and it runs Linux. Note that to accomplish that, the cores must be fully cache-coherent, just like a multi-core “big” Xeon chip. Compiler mods are clearly needed to target the wide vector units, and that Linux undoubtedly had a few tweaks to do anything useful on the 50+ cores there are per chip, but they look normal. Your old code will run on it. As I’ve pointed out, modifications are needed to get anything like full performance, but you do not have to start out with a blank sheet of paper. This is potentially a very big deal.
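
To make the “your old code will run on it” point concrete, here is a minimal sketch: an ordinary OpenMP loop in C that is meant to build unchanged for a regular Xeon or for the Phi. The only Intel-specific item is the “-mmic” recompile flag discussed later in these posts; the compile lines in the comment and the code itself are just my illustration, not anything Intel handed out.

    /* saxpy.c -- the same source is meant to build for a big Xeon
     * (e.g. icc -openmp saxpy.c) or, per the "just recompile" story,
     * for the Phi card (e.g. icc -openmp -mmic saxpy.c). The exact
     * compiler invocations are my assumption; -mmic is the flag
     * Intel describes. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { N = 1 << 20 };
        float *x = malloc(N * sizeof *x);
        float *y = malloc(N * sizeof *y);
        float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The pragma spreads the loop over the 50+ cores; getting real
         * performance also requires the compiler to map the loop body
         * onto the 512-bit vector units, which is where the tuning
         * effort mentioned above comes in. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        free(x);
        free(y);
        return 0;
    }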

Since I originally published this, Intel has deluged me with links to their information. See the bottom of this post if you want to read them.

So, it’s here. Finally, some of us would say, but development processes vary and may have hangups that nobody outside ever hears about.

I found a few things interesting about the announcement.

Number one is their choice as to the first product. The one initially out of the blocks is not a lower-performance version, but rather the high end of the current generation: the one that costs more ($2649) and has high performance on double-precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see significant revenue right now out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while now, so it has seen some trial use. At national labs and (maybe?) large enterprises.

It costs more than the clear competitor, Nvidia’s Tesla boards. $2649 vs. sub-$2000. For less peak performance. (Note: I've been told that AnandTech claims the new Nvidia K20s cost >$3000. I can't otherwise confirm that.) We can argue all day about whether the actual performance is better or worse on real applications, and how much the ability to start from existing code helps, but this pricing still stands out. Not that anybody will actually pay that much; the large customer targets are always highly-negotiated deals. But the Prof. Joes and the Kacklefoos don’t have negotiation leverage.

A second odd point came up in the Q & A period of the pre-announce concall. (I was invited, which is why I’ve come out of my hole to write this.) (Guilt.) Someone asked about memory bottlenecks; it has 310GB/s to its memory, which isn’t bad, but some apps are gluttons. This prompted me to ask about the PCIe bottleneck: Isn’t it also going to be starved for data delivered to it? I was told I was doing it wrong. I was thinking of the main program running on the host, foisting stuff off to the Phi. Wrong. The main program runs on the Phi itself, so the whole program runs on the many (slower) core card.

This means they are, at this time at least, explicitly not taking the path I’ve heard Nvidia evangelists talk about recently: Having lots and lots of tiny cores, along with a number of middle-size cores, and far fewer Great Big cores – and they all live together in a crooked little… Sorry! on the same chip, sharing the same memory subsystem so there is oodles of bandwidth amongst them. This could allow the parts of an application that are limited by single- or few-thread performance to go fast, while the parts that are many-way parallel also go fast, with little impedance mismatch between them. On Phi, if you have a single-thread-limited part, it runs on just one of the CPUs, which haven’t been designed for peak single-thread performance. On the other hand, the Nvidia stuff is vaporware, and while this kind of multi-speed arrangement has been talked about for a long time, I don’t know of any compiler technology that supports it with any kind of transparency.

A third item, and this seems odd, is the small speedups claimed by the marketing guys: just 2X-4X. Eh what? 50 CPUs and only 2-4X faster?

This is incredibly refreshing. The claims of accelerator-foisting companies can be outrageous to the point that they lose all credibility, as I’ve written about before.

On the other hand, it’s slightly bizarre, given that at the same conference Intel has people talking about applications that, when retargeted to Phi, get 6.6X (in figuring out graph connections on big graphs) or 4.8X (analyzing synthetic aperture radar images).

On the gripping hand, I really see the heavy hand of Strategic Marketing smacking people around here. Don’t cannibalize sales of the big Xeon E5s! They are known to make Big Money! Someone like me, coming from an IBM background, knows a whole lot about how The Powers That Be can influence how seemingly competitive products are portrayed – or developed. I’ve a sneaking suspicion this influence is why it took so long for something like Phi to reach the market. (Gee, Pete, you’re a really great engineer. Why are you wasting your time on that piddly little sideshow? We’ve got a great position and a raise for you up here in the big leagues…) (Really.)

There are rationales presented: They are comparing apples to apples, meaning well-parallelized code on Xeon E5 Big Boys compared with the same code on Phi. This is to be commended. Also, Phi ain’t got no hardware ECC for memory. Doing ECC in software on the Phi saps its strength considerably. (Hmmm, why do you suppose it doesn’t have ECC? (Hey, Pete, got a great position for you…) (Or: "Oh, we're not a threat. We don't even have ECC!" Nobody will do serious stuff without ECC.)) Note: Since this pre-briefing, data sheets have emerged that indicate Phi has optional ECC. Which raises two questions: Why did they explicitly say otherwise in the pre-briefing? And: What does "optional" mean?

Anyway, Larrabee/MIC/Phi has finally hit the streets. Let the benchmark and marketing wars commence.

Now, about me not being dead after all:

I don’t do this blog-writing thing for a living. I’m on the dole, as it were – paid for continuing to breathe. I don’t earn anything from this blog; those Google-supplied ads on the sides haven’t put one dime in my pocket. My wife wants to know why I keep doing it. But note: having no deadlines is wonderful.

So if I feel like taking a year off to play Skyrim, well, I can do that. So I did. It wasn't supposed to be a year, but what the heck. It's a big game. I also participated in some pleasant Grandfatherly activities, paid very close attention to exactly when to exercise some near-expiration stock options, etc.

Finally, while I’ve occasionally poked my head up on Twitter or FaceBook when something interesting happened, there hasn’t been much recently. X added N more processors to the same architecture. Yawn. Y went lower power with more cores. Yawn. If news outlets weren’t paid for how many eyeballs they attracted, they would have been yawning, too, but they are, so every minute twitch becomes an Epoch-Defining Quantum Leap!!! (Complete with ironic use of the word “quantum.”) No judgement here; they have to make a living.

I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.


----------------------------------------------------------------------------------------

Intel has deluged me with links. Here they are:

Intel® Xeon Phi™ product page: http://www.intel.com/xeonphi

Intel® Xeon Phi™ Coprocessor product brief:  http://intel.ly/Q8fuR1

Accelerate Discovery with Powerful New HPC Solutions (Solution Brief) http://intel.ly/SHh0oQ

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors  http://intel.ly/WYsJq9

YouTube Animation Introducing the Intel® Xeon Phi™ Coprocessor http://intel.ly/RxfLtP

Intel® Xeon Phi™ Coprocessor Infographic:  http://ow.ly/fe2SP

VIDEO: The History of Many Core, Episode 1: The Power Wall. http://bit.ly/RSQI4g

Diane Bryant’s presentation, additional documents and pictures will be available at Intel Newsroom

Monday, February 13, 2012

Transactional Memory in Intel Haswell: The Good, and a Possible Ugly

Note: Part of this post is being updated based on new information received after it was published. Please check back tomorrow, 2/18/12, for an update. 


Sorry... no update yet (2/21/12). Bad cold. Also seduction / travails of a new computer. Update soon.

I asked James Reinders at the Fall 2011 IDF whether the synchronization features he had from the X86 architecture were adequate for MIC. (Transcript; see the very end.) He said that they were good for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something different.

Now Intel has announced support for transactional memory in Haswell, the chip which follows their Ivy Bridge chip that is just starting shipment as I write this. So I think I’d now be willing to take bets that this is what James was vaguely hinting at, and will appear in Intel’s MIC HPC architecture as it ships in Knight’s Corner product. I prefer to take bets on sure things.

There has been some light discussion of Intel’s “Transactional Synchronization Extensions” (TSX), as this is formally called, and a good example of its use from James Reinders. But now that an architecture spec has been published for TSX, we can get a bit deeper into what, exactly, Intel is providing, and where there might just be a skeleton in the closet.

First, background: What is this “transactional memory” stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I believe is lurking.

Transactional Memory


The term “transaction” comes from contract law, was picked up by banking, and from there went into database systems. It refers to a collection of actions that all happen as a unit; they cannot be divided. If I give you money and you give me a property deed, for example, that happens as if it were one action – a transaction. The two parts can’t be (legally) separated; both happen, or neither. Or, in the standard database example: when I transfer money from my bank account to Aunt Sadie’s, the subtraction from my account and the addition to hers must either both happen, or neither; otherwise money is being either destroyed or created, which would be a bad thing. As you might imagine, databases have evolved a robust technology to do transactions where all the changes wind up on stable storage (disk, flash).

The notion of transactional memory is much the same: a collection of changes to memory is made all-or-nothing: Either all of them happen, as seen by every thread, process, processor, or whatever; or none of them happen. So, for example, when some program plays with the pointers in a linked list to insert or delete some list member, nobody can get in there when the update is partially done and follow some pointer to oblivion.

It applies as much to a collection of accesses – reads – as it does to changes – writes. The read side is necessary to ensure that a consistent collection of information is read and acted upon by entities that may be looking around while another is updating.

To do this, typically a program will issue something meaning “Transaction On!” to start the ball rolling. Once that’s done, everything it writes is withheld from view by all other entities in the system; and anything it reads is put on special monitoring in case someone else mucks with it. The cache coherence hardware is mostly re-used to make this monitoring work; cross-system memory monitoring is what cache coherence does.

This continues, accumulating things read and written, until the program issues something meaning “Transaction Off!”, typically called “Commit!.” Then, hraglblargarolfargahglug! All changes are vomited at once into memory, and the locations read are forgotten about.

What happens if some other entity does poke its nose into those saved and monitored locations, changing something the transactor was depending on or modifying? Well, “Transaction On!” was really “Transaction On! And, by the way, if anything screws up go there.” On reaching there, all the recording of data read and changes made has been thrown away; and there is a block of code that usually sets things up to start again, going back to the “Transaction On!” point. (The code could also decide “forget it” and not try over again.) Quitting like this in a controlled manner is called aborting a transaction. It is obviously better if aborts don’t happen a lot, just like it’s better if a lock is not subject to a lot of contention. However, note that nobody else has seen any of the changes made since “On!”, so half-mangled data structures are never seen by anyone.

Why Transactions Are a Good Thing


What makes transactional semantics potentially more efficient than simple locking is that only those memory locations read or referenced at run time are maintained consistently during the transaction. The consistency does not apply to memory locations that could be referenced, only the ones that actually are referenced.

There are situations where that’s a distinction without a difference, since everybody who gets into some particular transaction-ized section of code will bash on exactly the same data every time. Example: A global counter of how many times some operation has been done by all the processors in a system. Transactions aren’t any better than locks in those situations.

But there are cases where the dynamic nature of transactional semantics can be a huge benefit. The standard example, also used by James Reinders, is a multi-access hash table, with inserts, deletions, and lookups done by many processes, etc.

I won’t go through this in detail – you can read James’ version if you like; he has a nice diagram of a hash table, which I don’t – but consider:

With the usual lock semantics, you could simply have one coarse lock around the whole table: Only one person, read or write, gets in at any time. This works, and is simple, but all access to the table is now serialized, so will cause a problem as you scale to more processors.

Alternatively, you could have a lock per hash bucket, for fine-grained rather than coarse locking. That’s a lot of locks. They take up storage, and maintaining them all correctly gets more complex.

Or you could do either of those – one lock, or many – but also get out your old textbooks and try once again to understand those multiple reader / single writer algorithms and their implications, and, by the way, did you want reader or writer priority? Painful and error-prone.

On the other hand, suppose everybody – readers and writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”) before starting a read or a write; then does a “Commit!” when they exit. This is only as complicated as the single coarse lock (and sounds a lot like an “atomic” keyword on a class, hint hint).

Then what you can bank on is that the probability is tiny that two simultaneous accesses will look at the same hash bucket; if that probability is not tiny, you need a bigger hash table anyway. The most likely thing to happen is that nobody – readers or writers – ever accesses the same hash bucket, so everybody just sails right through, “Commit!”s, and continues, all in parallel, with no serialization at all. (Not really. See the skeleton, later.)

In the unlikely event that a reader and a writer are working on the same bucket at the same time, whoever “Commit!”s first wins; the other aborts and tries again. Since this is highly unlikely, overall the transactional version of hashing is a big win: it’s both simple and very highly parallel.
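
For the code-minded, here is a rough sketch of that transactional fast path on a hash table, written in C with the RTM-style intrinsics that correspond to the instructions described further down (immintrin.h, built with something like -mrtm). The table layout, the fallback spinlock, and the lock check inside the transaction are my own illustration, not Intel’s code and not James’ example.

    #include <immintrin.h>   /* _xbegin, _xend, _xabort */
    #include <stdatomic.h>

    #define NBUCKETS 1024
    struct bucket { long key; long val; int used; };
    static struct bucket table[NBUCKETS];
    static atomic_int fallback_lock;   /* 0 = free, 1 = held */

    static void lock_fallback(void)
    {
        while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
            ;                                     /* spin */
    }
    static void unlock_fallback(void)
    {
        atomic_store_explicit(&fallback_lock, 0, memory_order_release);
    }

    void insert(long key, long val)
    {
        struct bucket *b = &table[key % NBUCKETS];

        if (_xbegin() == _XBEGIN_STARTED) {       /* "Transaction On!" */
            /* Put the fallback lock in the read set: if some thread is in
             * the slow path, abort rather than race with its plain stores. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
                _xabort(0xff);
            b->key = key; b->val = val; b->used = 1;
            _xend();                              /* "Commit!" */
            return;
        }
        /* Abort path: the boring coarse lock. */
        lock_fallback();
        b->key = key; b->val = val; b->used = 1;
        unlock_fallback();
    }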

Transactional memory is not, of course, the only way to skin this particular cat. Azul Systems has published a detailed presentation on a Lock-Free Wait-Free Hash Table algorithm that uses only compare-and-swap instructions. I got lost somewhere around the fourth state diagram. (Well, OK, actually I saw the first one and kind of gave up.) Azul has need of such things. Among other things, they sell massive Java compute appliances, going up to the Azul Vega 3 7380D, which has 864 processors sharing 768GB of RAM. Think investment banks: take that, you massively recomplicated proprietary version of a Black-Scholes option pricing model! In Java! (Those guys don’t just buy GPUs.)

However, Azul only needs that algorithm on their port of their software stack to X86-based products. Their Vega systems are based on their own proprietary 54-core Vega processors, which have shipped with transactional memory – which they call Speculative Multi-address Atomicity – since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly new news.

Anyway, if you want this wait-free super-parallel hash table (and other things, obviously) without exploding your head, transactional memory makes it possible rather simply.

What Intel Has: RTM and HLE


Intel’s Transactional Synchronization Extensions come in two flavors: Restricted Transactional Memory (RTM) and Hardware Lock Elision (HLE).

RTM is essentially what I described above: There’s XBEGIN for “Transaction On!”, XEND for “Commit!” and XABORT if you want to manually toss in the towel for some reason. XBEGIN must be given a “there” location to go to in case of an abort. When an abort occurs, the processor state is restored to what it was at XBEGIN, except that flags are set indicating the reason for the abort (in EAX).
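
For what that looks like from C: compilers expose these instructions as intrinsics, with the EAX flags surfaced as _XABORT_* macros. Assuming that interface (immintrin.h, built with -mrtm), a retry-then-give-up wrapper looks roughly like this sketch; the helper names are mine.

    #include <immintrin.h>

    /* Run work() transactionally, retrying a couple of times on abort;
     * give up and call locked_fallback() (which should take a real lock)
     * if the transaction keeps failing or aborts explicitly. */
    int run_transactionally(void (*work)(void), void (*locked_fallback)(void))
    {
        for (int attempt = 0; attempt < 3; attempt++) {
            unsigned status = _xbegin();          /* XBEGIN: "Transaction On!" */
            if (status == _XBEGIN_STARTED) {
                work();
                _xend();                          /* XEND: "Commit!" */
                return 1;
            }
            /* We land here after an abort, state rolled back to XBEGIN,
             * with the reason encoded in the returned status (EAX). */
            if (status & _XABORT_EXPLICIT)        /* someone called XABORT */
                break;
            if (!(status & _XABORT_RETRY))        /* e.g. capacity overflow */
                break;
            /* Otherwise (e.g. a conflict with the retry hint set), loop. */
        }
        locked_fallback();
        return 0;
    }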

HLE is a bit different. All the documentation I’ve seen so far always talks about it first, perhaps because it seems like it is more familiar, or they want to brag (no question, it’s clever). I obviously think that’s confusing, so didn’t do it in that order.

HLE lets you take your existing, lock-based, code and transactional-memory-ify it: Lock-based code now runs without blocking unless required, as in the hash table example, with minimal, minuscule change that can probably be done with a compiler and the right flag.

I feel like adding “And… at no time did their fingers leave their hands!” It sounds like a magic trick.

In addition to being magical, it’s also clearly strategic for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy bet on people just wanting to recompile and run without the rewrites needed for accelerators like GPGPUs. (See my post MIC and the Knights.)

HLE works by setting a new instruction prefix – XACQUIRE – on any instruction you use to try to acquire a lock. Doing so causes there to be no change to the lock data: the lock write is “elided.” Instead it (a) takes a checkpoint of the machine state; (b) saves the address of the instruction that did this; (c) puts the lock location in the set of data that is transactionally read; and (d) does a “Transaction On!”

So everybody goes charging right through the lock without stopping, but now every location read is continually monitored, and every write is saved, not appearing in memory.

If nobody steps on anybody else’s feet – writes someone else’s monitored location – then when the instruction to release the lock is done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug flush of all the saved writes into memory, forgets everything monitored, and turns off transaction mode.

If somebody does write a location someone else has read, then we get an ABORT with its wayback machine: back to the location that tried to acquire the lock, restoring the CPU state, so everything is like it was just before the lock acquisition instruction was done. This time, though, the write is not elided: The usual semantics apply, and the code goes through exactly what it did without TSX, the way it worked before.
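
Here is a sketch of what using HLE looks like in source, assuming the form GCC exposes it in (__ATOMIC_HLE_ACQUIRE / __ATOMIC_HLE_RELEASE hints on ordinary atomics, built with -mhle); the XACQUIRE and XRELEASE prefixes land on the exchange and the store below. The lock variable and function names are my own example, not Intel’s.

    #include <immintrin.h>   /* _mm_pause */

    static int lockvar;      /* 0 = free, 1 = held */

    void elided_lock(void)
    {
        /* XACQUIRE-prefixed exchange: the write of 1 is elided and we run
         * transactionally; if the transaction aborts, we re-execute this
         * and really take the lock, exactly as the old code did. */
        while (__atomic_exchange_n(&lockvar, 1,
                                   __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
            _mm_pause();     /* spin while somebody really holds it */
    }

    void elided_unlock(void)
    {
        /* XRELEASE-prefixed store: the "Commit!" point for the elided case. */
        __atomic_store_n(&lockvar, 0,
                         __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
    }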

So, as I understand it, if you have a hash table and a read is under way, and a write to the same bucket happens, then both the read and the write abort. One of those two gets the lock and does its thing, followed by the other according to the original code. But other reads or writes that don’t have conflicts go right through.

This seems like it will work, but I have to say I’d like to see the data on real code. My gut tells me that anything which changes the semantics of parallel locking, which HLE does, is going to have a weird effect somewhere. My guess would be some fun, subtle, intermittent performance bugs.

The Serial Skeleton in the Closet


This is all powerful stuff that will certainly aid parallel efficiency in both MIC, with its 30-plus cores; and the Xeon line, with fewer but faster cores. (Fewer faster cores need it too, since serialization inefficiency gets proportionally worse with faster cores.) But don’t think for a minute that it eliminates all serialization.

I see no issue with the part of this that monitors locations read and written; I don’t know Intel’s exact implementation, but I feel sure it re-uses the cache coherence mechanisms already present, which operate without (too) much serialization.

However, there’s a reason I used a deliberately disgusting analogy when talking about pushing all the written data to memory on “Commit!” (XEND, XRELEASE). Recall that the required semantics are “all or nothing”: Every entity in the system sees all of the changes, or every entity sees none of them. (I’ve been saying “entity” because GPUs are now prone to directly access cache coherent memory, too.)

If the code has changed multiple locations during a transaction, probably on multiple cache lines, that means those changes have to be made all at once. If locations A and B both change, nobody can possibly see location A after it changed but location B before it changed. Nothing, anywhere, can get between the write of A and the write of B (or the making of both changes visible outside of cache).

As I said, I don’t know Intel’s exact implementation, so could conceivably be wrong, but for me that implies that every “Commit!” requires a whole-system serialization event: Every processor and thread in the whole system has to be not just halted, but pipelines drained. Everything must come to a dead stop. Once that stop is done, then all the changes can be made visible, and everything restarted.

Note that Intel’s TSX architecture spec says nothing about these semantics being limited to one particular chip or region. This is very good; software exploitation would be far harder otherwise. But it implies that in a multi-chip, multi-socket system, this halt and drain applies to every processor in every chip in every socket. It’s a dead stop of everything.

Well, OK, but lock acquire and release instructions always did this dead stop anyway, so likely the aggregate amount of serialization is reduced. (Wait a minute, they always did this anyway?! What the… Yeah. Dirty little secret of the hardware dudes.)

But lock acquire and release only involve one cache line at a time. “Commit!” may involve many. Writes involve letting everybody else know you’re writing a particular line, so they can invalidate it in their cache(s). Those notifications all have to be sent out, serially, and acknowledgements received. They can be pipelined, and probably are, but the process is still serial, and must be done while at a dead stop.

So, if your transactional segment of code modifies, say, 128KB spread over 2K 64-byte cache lines, you can expect a noticeable bit of serialization time when you “Commit!”. Don’t forget this issue now includes all your old-style locking, thanks to HLE, where the original locking involved updating just one cache line. This is another reason I want to see some real running code with HLE. Who knows what evil lurks between the locks?

But as I said, I don’t know the implementation. Could Intel folks have found a way around this? Maybe; I’m writing this, as I’ve indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell ships.

Enjoy your bright, shiny, new, non-blocking transactional memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug. It bites.

Monday, January 9, 2012

20 PFLOPS vs. 10s of MLOC: An Oak Ridge Conundrum


On The One Hand:

Oak Ridge National Laboratories (ORNL) is heading for a 20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to 18,000 GPUs.

This is, of course, neither a secret nor news. Look here, or here, or here if you haven’t heard; it was particularly trumpeted at SC11 last November. They’re upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere 2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring 10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.




The upgrade of the Jaguar Cray XT5 system will introduce new Cray XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big performance numbers come from new XK6 nodes, which replace two (half) of the AMDs with Nvidia Tesla 3000-series Kepler compute accelerator GPUs, as shown in the diagram. (The blue blocks are Cray’s Gemini inter-node communications.)



The actual performance is a range because it will “depend on how many (GPUs) we can afford to buy," according to Jeff Nichols, ORNL's associate lab director for scientific computing. 20 PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all the nodes are XK6s with their GPUs.

All this seems like a straightforward march of progress these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business as usual. The only news, and it is significant, is that it’s actually being done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs are, for good reason, the way to go these days. Lots and lots of GPUs.

On The Other Hand:

Oak Ridge has applications totaling at least 5 million lines of code, most of which “does not run on GPGPUs and probably never will due to cost and complexity” [emphasis added by me].

That’s what was said at an Intel press briefing at SC11 by Robert Harrison, a corporate fellow at ORNL and director of the National Institute of Computational Sciences hosted at ORNL. He is the person working to get codes ported to Knight’s Ferry, a pre-product software development kit based on Intel’s MIC (Many Integrated Core) architecture. (See my prior post MIC and the Knights for a short description of MIC and links to further information.)

Video of that entire briefing is available, but the things I’m referring to are all the way towards the end, starting at about the 50 minute mark. The money slide out of the entire set is page 30:



(And I really wish whoever was making the video didn’t run out of memory, or run out of battery, or have to leave for a potty break, or whatever else right after this page was presented; it's not the last.)

The presenters said that they had actually ported “tens of millions” of lines of code, most functioning within one day. That does not mean they performed well in one day – see MIC and the Knights for important issues there – but Harrison did say that they had decades of experience making vector codes work well, going all the way back to the Cray 1.

What Harrison says in the video about the possibility of GPU use is actually quite a bit more emphatic than the statement on the slide:

"Most of this software, I can confidently say since I'm working on them ... will not run on GPGPUs as we understand them right now, in part because of the sheer volume of software, millions of lines of code, and in large part because the algorithms, structures, and so on associated with the applications just simply don't have the massive parallelism required for fine grain [execution]."

All this is, of course, right up Intel’s alley, since their target for MIC is source compatibility: Change a command-line flag, recompile, done.

I can’t be alone in seeing a disconnect between the Titan hype and these statements. They make it sound like they’re busy building a system they can’t use, and I have too much respect for the folks at ORNL to think that could be true.

So, how do we resolve this conundrum? I can think of several ways, but they’re all speculation on my part. In no particular order:

- The 20 PFLOP number is public relations hype. The contract with Cray is apparently quite flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they like, presumably including zero. That’s highly unlikely, but it does allow a “try some and see if you like it” approach which might result in rather few XK6 nodes installed.

- Harrison is being overly conservative. When people really get down to it, perhaps porting to GPGPUs won’t be all that painful -- particularly compared with the vectorization required to really make MIC hum.

- Those MLOCs aren’t important for Jaguar/Titan. Unless you have a clearance a lot higher than the one I used to have, you have no clue what they are really running on Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or what they run there may slip smoothly onto GPGPUs, or they may be so important a GPGPU porting effort is deemed worthwhile.

- MIC doesn’t arrive on time. MIC is still vaporware, after all, and the Jaguar/Titan upgrade is starting now. (It’s a bit delayed because AMD’s having trouble delivering those Interlagos Opterons, but the target start date is already past.) The earliest firm deployment date I know of for MIC is at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. Its new Stampede system uses MIC and deploys in 2013.

- Upgrading is a lot simpler and cheaper – in direct cost and in operational changes – than installing something that could use MIC. After all, Cray likes AMD, and uses AMD’s inter-CPU interconnect to attach their Gemini inter-node network. This may not hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia chips are attached by PCI-e links. PCI-e is what Knight’s Ferry and Knight’s Corner (the product version) use, so one could conceivably plug them in.

- MIC is too expensive.

That last one requires a bit more explanation. Nvidia Teslas are, in effect, subsidized by the volumes of their plain graphics GPUs. Those use the same architecture and can to a significant degree re-use chip designs. As a result, the development cost to get Tesla products out the door is spread across a vastly larger volume than the HPC market provides, allowing much lower pricing than would otherwise be the case. Intel doesn’t have that volume booster, and the price might turn out to reflect that.

That Nvidia advantage won’t last forever. Every time AMD sells a Fusion system with GPU built in, or Intel sells one of their chips with graphics integrated onto the silicon, another nail goes into the coffin of low-end GPU volume. (See my post Nvidia-based Cheap Supercomputing Coming to an End; the post turned out to be too optimistic about Intel & AMD graphics performance, but the principle still holds.) However, this volume advantage is still in force now, and may result in a significantly higher cost for MIC-based units. We really have no idea how Intel’s going to price MIC, though, so this is speculation until the MIC vapor condenses into reality.

Some of the resolutions to this Tesla/MIC conflict may be totally bogus, and reality may reflect a combination of reasons, but who knows? As I said above, I’m speculating, a bit caught…

I’m just a little bit caught in the middle
MIC is a dream, and Tesla’s a riddle
I don’t know what to say, can’t believe it all, I tried
I’ve got to let it go
And just enjoy the show.[1]



[1] With apologies to Lenka, the artist who actually wrote the song the girl sings in Moneyball. Great movie, by the way.

Friday, October 28, 2011

MIC and the Knights

Intel’s Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s Ferry (KF) software development kit. Well, I thought it was wide display, but I’m an IDF Newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn’t consider the display “wide” in IDF terms unless it’s in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here are my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larrabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.

You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new task manager display changed to be a better visualization of many more – up to 640 – cores. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC processors; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC users have, like him, a very large body of legacy codes they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed to crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, R² force laws; irrelevant when sufficiently far, dominant when close.) Changing the way those tightly-coupled calculations are done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes from Gauss-Seidel iteration to Gauss-Jacobi iteration. I went into this in more detail way back in my book In Search of Clusters (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)
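
A tiny worked example of that Gauss-Seidel versus Gauss-Jacobi point, on a one-dimensional relaxation; the arrays and loops are purely illustrative.

    /* Gauss-Seidel: updates in place, so u[i-1] is already the NEW value.
     * The loop-carried dependence blocks naive SIMD / parallel execution. */
    void seidel_sweep(double *u, int n)
    {
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    /* Jacobi: reads only OLD values and writes a separate array, so every i
     * is independent -- trivially vectorizable and parallelizable -- but it
     * is a mathematically different iteration, with different convergence
     * behavior. That silent switch is exactly the trap described above. */
    void jacobi_sweep(const double *u_old, double *u_new, int n)
    {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; i++)
            u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
    }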

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get” tons more performance / lower power / whatever.

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)
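
As an illustration of the kind of annotation involved: restrict tells the compiler the arrays cannot alias, and a vectorization directive makes the request explicit. I am using the OpenMP-style “omp simd” spelling purely as an example here; vendor directives of the ivdep variety play the same role.

    /* Without restrict, the compiler must assume dst might overlap a or b,
     * and the aliasing problem mentioned above can keep this loop off the
     * wide vector unit entirely. */
    void scale_add(float *restrict dst, const float *restrict a,
                   const float *restrict b, float alpha, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            dst[i] = alpha * a[i] + b[i];
    }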

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.
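
To make that data path concrete, here is a hedged sketch of host-directed work in an offload-pragma style (the sort of thing Intel’s compilers provide); I did not get these details at IDF, so treat the exact spelling of the pragma and clauses as my assumption rather than gospel. The point is just that data goes down PCIe to the card and results come back the same way, under host control.

    #include <stdio.h>

    #define N 1000000
    static float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = (float)i;

        /* 'a' is shipped down PCIe to the card, the loop runs there, and
         * 'b' is shipped back up -- all under host control. A second MIC
         * (e.g. mic:1) would get its copy the same way, through host
         * memory, which is the bottleneck discussed above. On a compiler
         * without offload support the pragma is ignored and the loop
         * simply runs on the host. */
        #pragma offload target(mic:0) in(a : length(N)) out(b : length(N))
        {
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[10] = %f\n", b[10]);
        return 0;
    }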

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple cores. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on the other cores have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here is Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skills, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work at the level of intellectual academic accolades to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!