The Perils of Parallel

Monday, November 12, 2012

Intel Xeon Phi Announcement (& me)

1. No, I’m not dead. Not even sick. Been a long time since a post. More on this at the end.

2. So, Intel has finally announced a product ancestrally based on the long-ago Larrabee. The architecture became known as MIC (Many Integrated Cores), development vehicles were named after tiny towns (Knights Corner/Knights Ferry – one was to be the product, but I could never keep them straight), and the final product is to be known as the Xeon Phi.

Why Phi? I don’t know. Maybe it’s the start of a convention of naming High-Performance Computing products after Greek letters. After all, they’re used in equations.

A micro-synopsis (see my post MIC and the Knights for a longer discussion): The Xeon Phi is a PCIe board containing 6GB of RAM and a chip with lots (I didn’t find out how many ahead of time) of X86 cores with wide (512 bit) vector units, able to produce over 1 TeraFLOP (more about that later). The X86 cores a programmed pretty much exactly like a traditional “big” single Xeon: All your favorite compilers and be used, and it runs Linux. Note that to accomplish that, the cores must be fully cache-coherent, just like a multi-core “big” Xeon chip. Compiler mods are clearly needed to target the wide vector units, and that Linux undoubtedly had a few tweeks to do anything useful on the 50+ cores there are per chip, but they look normal. Your old code will run on it. As I’vepointed out, modifications are needed to get anything like full performance, but you do not have to start out with a blank sheet of paper. This is potentially a very big deal.

Since I originally published this, Intel has deluged me with links to their information. See the bottom of this post if you want to read them.

So, it’s here. Finally, some of us would say, but development processes vary and may have hangups that nobody outside ever hears about.

I found a few things interesting about about the announcement.

Number one is their choice as to the first product. The one initially out of the blocks is, not a lower-performance version, but rather the high end of the current generation: The one that costs more ($2649) and has high performance on double precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see significant revenue right now out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while, now, so it has seen some trial use. At national labs and (maybe?) large enterprises.

It costs more than the clear competitor, Nvidia’s Tesla boards. $2649 vs. sub-$2000. For less peak performance. (Note: I've been told that Anantech claims the new Nvidia K20s cost >$3000. I can't otherwise confirm that.) We can argue all day about whether the actual performance is better or worse on real applications, and how much the ability to start from existing code helps, but this pricing still stands out. Not that anybody will actually pay that much; the large customer targets are always highly-negotiated deals. But the Prof. Joes and the Kacklefoos don’t have negotiation leverage.

A second odd point came up in the Q & A period of the pre-announce concall. (I was invited, which is why I’ve come out of my hole to write this.) (Guilt.) Someone asked about memory bottlenecks; it has 310GB/s to its memory, which isn’t bad, but some apps are gluttons. This prompted me to ask about the PCIe bottleneck: Isn’t it also going to be starved for data delivered to it? I was told I was doing it wrong. I was thinking of the main program running on the host, foisting stuff off to the Phi. Wrong. The main program runs on the Phi itself, so the whole program runs on the many (slower) core card.

This means they are, at this time at least, explicitly not taking the path I’ve heard Nvidia evangelists talk about recently: Having lots and lots of tiny cores, along with a number of middle-size cores, and much fewer Great Big cores – and they all live together in a crooked little… Sorry! on the same chip, sharing the same memory subsystem so there is oodles of bandwidth amongst them. This could allow the parts of an application that are limited by single- or few-thread performance to go fast, while the parts that are many-way parallel also go fast, with little impedance mismatch between them. On Phi, if you have a single-thread limited part, it runs on just one of the CPUs, which haven’t been designed for peak single-thread performance. On the other hand, the Nvidia stuff is vaporware and while this kind of multi-speed arrangement has been talked about for a long time, I don’t know of any compiler technology that supports it with any kind of transparency.

A third item, and this seems odd, are the small speedups claimed by the marketing guys: Just 2X-4X. Eh what? 50 CPUs and only 2-4X faster?

This is incredibly refreshing. The claims of accelerator-foisting companies can be outrageous to the point that they lose all credibility, as I’ve written about before.

On the other hand, it’s slightly bizarre, given that at the same conference Intel has people talking about applications that, when retargeted to Phi, get 6.6X (in figuring out graph connections on big graphs) or 4.8X (analyzing synthetic aperture radar images).

On the gripping hand, I really see the heavy hand of Strategic Marketing smacking people around here. Don’t cannibalize sales of the big Xeon E5s! They are known to make Big Money! Someone like me, coming from an IBM background, knows a whole lot about how The Powers That Be can influence how seemingly competitive products are portrayed – or developed. I’ve a sneaking suspicion this influence is why it took so long for something like Phi to reach the market. (Gee, Pete, you’re a really great engineer. Why are you wasting your time on that piddly little sideshow? We’ve got a great position and a raise for you up here in the big leagues…) (Really.)

There are rationales presented: They are comparing apples to apples, meaning well-parallelized code on Xeon E5 Big Boys compared with the same code on Phi. This is to be commended. Also, Phi ain’t got no hardware ECC for memory. Doing ECC in software on the Phi saps its strength considerably. (Hmmm, why do you suppose it doesn’t have ECC? (Hey, Pete, got a great position for you…) (Or "Oh, we're not a threat. We don't even have ECC!" Nobody will do serious stuff without ECC.")) Note: Since this pre-briefing, data sheets have emerged that indicate Phi has optional ECC. Which raises two questions: Why did they explicitly say otherwise in the pre-briefing? And: What does "optional" mean?

Anyway, Larrabee/MIC/Phi has finally hit the streets. Let the benchmark and marketing wars commence.

Now, about me not being dead after all:

I don’t do this blog-writing thing for a living. I’m on the dole, as it were – paid for continuing to breathe. I don’t earn anything from this blog; those Google-supplied ads on the sides haven’t put one dime in my pocket. My wife wants to know why I keep doing it. But note: having no deadlines is wonderful.

So if I feel like taking a year off to play Skyrim, well, I can do that. So I did. It wasn't supposed to be a year, but what the heck. It's a big game. I also participated in some pleasant Grandfatherly activities, paid very close attention to exactly when to exercise some near-expiration stock options, etc.

Finally, while I’ve occasionally poked my head up on Twitter or FaceBook when something interesting happened, there hasn’t been much recently. X added N more processors to the same architecture. Yawn. Y went lower power with more cores. Yawn. If news outlets weren’t paid for how many eyeballs they attracted, they would have been yawning, too, but they are, so every minute twitch becomes an Epoch-Defining Quantum Leap!!! (Complete with ironic use of the word “quantum.”) No judgement here; they have to make a living.

I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.

----------------------------------------------------------------------------------------

Intel has deluged me with links. Here they are:

Intel® Xeon Phi™ product page: http://www.intel.com/xeonphi

Intel® Xeon Phi™ Coprocessor product brief: http://intel.ly/Q8fuR1

Accelerate Discovery with Powerful New HPC Solutions (Solution Brief) http://intel.ly/SHh0oQ

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors http://intel.ly/WYsJq9

YouTube Animation Introducing the Intel® Xeon Phi™ Coprocessor http://intel.ly/RxfLtP

Intel® Xeon Phi™ Coprocessor Infographic: http://ow.ly/fe2SP

VIDEO: The History of Many Core, Episode 1: The Power Wall. http://bit.ly/RSQI4g

Diane Bryant’s presentation, additional documents and pictures will be available at Intel Newsroom

Monday, February 13, 2012

Transactional Memory in Intel Haswell: The Good, and a Possible Ugly

Note: Part of this post is being updated based on new information received after it was published. Please check back tomorrow, 2/18/12, for an update.

Sorry... no update yet (2/21/12). Bad cold. Also seduction / travails of a new computer. Update soon.

I asked James Reinders at the Fall 2011 IDF whether the synchronization features he had from the X86 architecture were adequate for MIC. (Transcript; see the very end.) He said that they were good for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something different.

Now Intel has announced support for transactional memory in Haswell, the chip which follows their Ivy Bridge chip that is just starting shipment as I write this. So I think I’d now be willing to take bets that this is what James was vaguely hinting at, and will appear in Intel’s MIC HPC architecture as it ships in Knight’s Corner product. I prefer to take bets on sure things.

There have been some light discussion of Intel’s “Transactional Synchronization Extensions” (TSE), as this is formally called, and a good example of its use from James Reinders. But now that an architecture spec has been published for TSE, we can get a bit deeper into what, exactly, Intel is providing, and where there might just be a skeleton in the closet.

First, background: What is this “transactional memory” stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I believe is lurking.

Transactional Memory

The term “transaction” comes from contract law, was picked up by banking, and from there went into database systems. It refers to a collection of actions that all happen as a unit; they cannot be divided. If I give you money and you give me a property deed, for example, that happens as if it were one action – a transaction. The two parts can’t be (legally) separated; both happen, or neither. Or, in the standard database example: when I transfer money from my bank account to Aunt Sadie’s, the subtraction from my account and the addition to hers must either both happen, or neither; otherwise money is being either destroyed or created, which would be a bad thing. As you might imagine, databases have evolved a robust technology to do transactions where all the changes wind up on stable storage (disk, flash).

The notion of transactional memory is much the same: a collection of changes to memory is made all-or-nothing: Either all of them happen, as seen by every thread, process, processor, or whatever; or none of them happen. So, for example, when some program plays with the pointers in a linked list to insert or delete some list member, nobody can get in there when the update is partially done and follow some pointer to oblivion.

It applies as much to a collection of accesses – reads – as it does to changes – writes. The read side is necessary to ensure that a consistent collection of information is read and acted upon by entities that may be looking around while another is updating.

To do this, typically a program will issue something meaning “Transaction On!” to start the ball rolling. Once that’s done, everything it writes is withheld from view by all other entities in the system; and anything it reads is put on special monitoring in case someone else mucks with it. The cache coherence hardware is mostly re-used to make this monitoring work; cross-system memory monitoring is what cache coherence does.

This continues, accumulating things read and written, until the program issues something meaning “Transaction Off!”, typically called “Commit!.” Then, hraglblargarolfargahglug! All changes are vomited at once into memory, and the locations read are forgotten about.

What happens if some other entity does poke its nose into those saved and monitored locations, changing something the transactor was depending on or modifying? Well, “Transaction On!” was really “Transaction On! And, by the way, if anything screws up go there.” On reaching there, all the recording of data read and changes made has been thrown away; and there is a block of code that usually sets things up to start again, going back to the “Transaction On!” point. (The code could also decide “forget it” and not try over again.) Quitting like this in a controlled manner is called aborting a transaction. It is obviously better if aborts don’t happen a lot, just like it’s better if a lock is not subject to a lot of contention. However, note that nobody else has seen any of the changes made since “On!”, so half-mangled data structures are never seen by anyone.

Why Transactions Are a Good Thing

What makes transactional semantics potentially more efficient than simple locking is that only those memory locations read or referenced at run time are maintained consistently during the transaction. The consistency does not apply to memory locations that could be referenced, only the ones that actually are referenced.

There are situations where that’s a distinction without a difference, since everybody who gets into some particular transaction-ized section of code will bash on exactly the same data every time. Example: A global counter of how many times some operation has been done by all the processors in a system. Transactions aren’t any better than locks in those situations.

But there are cases where the dynamic nature of transactional semantics can be a huge benefit. The standard example, also used by James Reinders, is a multi-access hash table, with inserts, deletions, and lookups done by many processes, etc.

I won’t go through this is detail – you can read James’ version if you like; he has a nice diagram of a hash table, which I don’t – but consider:

With the usual lock semantics, you could simply have one coarse lock around the whole table: Only one person, read or write, gets in at any time. This works, and is simple, but all access to the table is now serialized, so will cause a problem as you scale to more processors.

Alternatively, you could have a lock per hash bucket, for fine-grained rather than coarse locking. That’s a lot of locks. They take up storage, and maintaining them all correctly gets more complex.

Or you could do either of those – one lock, or many – but also get out your old textbooks and try once again to understand those multiple reader / single writer algorithms and their implications, and, by the way, did you want reader or writer priority? Painful and error-prone.

On the other hand, suppose everybody – readers and writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”) before starting a read or a write; then does a “Commit!” when they exit. This is only as complicated as the single coarse lock (and sounds a lot like an “atomic” keyword on a class, hint hint).

Then what you can bank on is that the probability is tiny that two simultaneous accesses will look at the same hash bucket; if that probability is not tiny, you need a bigger hash table anyway. The most likely thing to happen is that nobody – readers or writers – ever accesses same hash bucket, so everybody just sails right through, “Commit!”s, and continues, all in parallel, with no serialization at all. (Not really. See the skeleton, later.)

In the unlikely event that a reader and a writer are working on the same bucket at the same time, whoever “Commit!”s first wins; the other aborts and tries again. Since this is highly unlikely, overall the transactional version of hashing is a big win: it’s both simple and very highly parallel.

Transactional memory is not, of course, the only way to skin this particular cat. Azul Systems has published a detailed presentation on a Lock-Free Wait-Free Hash Table algorithm that uses only compare-and-swap instructions. I got lost somewhere around the fourth state diagram. (Well, OK, actually I saw the first one and kind of gave up.) Azul has need of such things. Among other things, they sell massive Java compute appliances, going up to the Azul Vega 3 7380D, which has 864 processors sharing 768GB of RAM. Think investment banks: take that, you massively recomplicated proprietary version of a Black-Sholes option pricing model! In Java! (Those guys don’t just buy GPUs.)

However, Azul only needs that algorithm on their port of their software stack to X86-based products. Their Vega systems are based on their own proprietary 54-core Vega processors, which have shipped with transactional memory – which they call Speculative Multi-address Atomicity – since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly new news.

Anyway, if you want this wait-free super-parallel hash table (and other things, obviously) without exploding your head, transactional memory makes it possible rather simply.

What Intel Has: RTE and HLE

Intel’s Transactional Synchronization Extensions come in two flavors: Restricted Transactional Memory (RTE) and Hardware Lock Elision (HLE).

RTE is essentially what I described above: There’s XBEGIN for “Transaction On!”, XEND for “Commit!” and ABORT if you want to manually toss in the towel for some reason. XBEGIN must be given a there location to go to in case of an abort. When an abort occurs, the processor state is restored to what it was at XBEGIN, except that flags are set indicating the reason for the abort (in EAX).

HLE is a bit different. All the documentation I’ve seen so far always talks about it first, perhaps because it seems like it is more familiar, or they want to brag (no question, it’s clever). I obviously think that’s confusing, so didn’t do it in that order.

HLE lets you take your existing, lock-based, code and transactional-memory-ify it: Lock-based code now runs without blocking unless required, as in the hash table example, with minimal, miniscule change that can probably be done with a compiler and the right flag.

I feel like adding “And… at no time did their fingers leave their hands!” It sounds like a magic trick.

In addition to being magical, it’s also clearly strategic for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy bet on people just wanting to recompile and run without the rewrites needed for accelerators like GPGPUs. (See my post MIC and the Knights.)

HLE works by setting a new instruction prefix – XACQUIRE – on any instruction you use to try to acquire a lock. Doing so causes there to be no change to the lock data: the lock write is “elided.” Instead it (a) takes a checkpoint of the machine state; (b) saves the address of the instruction that did this; (c) puts the lock location in the set of data that is transactionally read; and (d) does a “Transaction On!”

So everybody goes charging right through the lock without stopping, but now every location read is continually monitored, and every write is saved, not appearing in memory.

If nobody steps on anybody else’s feet – writes someone else’s monitored location – then when the instruction to release the lock is done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug flush of all the saved writes into memory, forgets everything monitored, and turns off transaction mode.

If somebody does write a location someone else has read, then we get an ABORT with its wayback machine: back to the location that tried to acquire the lock, restoring the CPU state, so everything is like it was just before the lock acquisition instruction was done. This time, though, the write is not elided: The usual semantics apply, and the code goes through exactly what it did without TSE, the way it worked before.

So, as I understand it, if you have a hash table and read is under way, if a write to the same bucket happens then both the read and the write abort. One of those two gets the lock and does its thing, followed by the other according to the original code. But other reads or writes that don’t have conflicts go right through.

This seems like it will work, but I have to say I’d like to see the data on real code. My gut tells me that anything which changes the semantics of parallel locking, which HLE does, is going to have a weird effect somewhere. My guess would be some fun, subtle, intermittent performance bugs.

The Serial Skeleton in the Closet

This is all powerful stuff that will certainly aid parallel efficiency in both MIC, with it’s 30-plus cores; and the Xeon line, with fewer but faster cores. (Fewer faster cores need it too, since serialization inefficiency gets proportionally worse with faster cores.) But don’t think for a minute that it eliminates all serialization.

I see is no issue with the part of this that monitors locations read and written; I don’t know Intel’s exact implementation, but I feel sure it re-uses the cache coherence mechanisms already present, which operate without (too) much serialization.

However, there’s a reason I used a deliberately disgusting analogy when talking about pushing all the written data to memory on “Commit!” (XEND, XRELEASE). Recall that the required semantics are “all or nothing”: Every entity in the system sees all of the changes, or every entity sees none of them. (I’ve been saying “entity” because GPUs are now prone to directly access cache coherent memory, too.)

If the code has changed multiple locations during a transaction, probably on multiple cache lines, that means those changes have to be made all at once. If locations A and B both change, nobody can possibly see location A after it changed but location B before it changed. Nothing, anywhere, can get between the write of A and the write of B (or the making of both changes visible outside of cache).

As I said, I don’t know Intel’s exact implementation, so could conceivably be wrong, but that for me that implies that every “Commit!” requires a whole system serialization event: Every processor and thread in the whole system has to be not just halted, but pipelines drained. Everything must come to a dead stop. Once that stop is done, then all the changes can be made visible, and everything restarted.

Note that Intel’s TSE architecture spec says nothing about these semantics being limited to one particular chip or region. This is very good; software exploitation would be far harder otherwise. But it implies that in a multi-chip, multi-socket system, this halt and drain applies to every processor in every chip in every socket. It’s a dead stop of everything.

Well, OK, but lock acquire and release instructions always did this dead stop anyway, so likely the aggregate amount of serialization is reduced. (Wait a minute, they always did this anyway?! What the… Yeah. Dirty little secret of the hardware dudes.)

But lock acquire and release only involve one cache line at a time. “Commit!” may involve many. Writes involve letting everybody else know you’re writing a particular line, so they can invalidate it in their cache(s). Those notifications all have to be sent out, serially, and acknowledgements received. They can be pipelined, and probably are, but the process is still serial, and must be done while at a dead stop.

So, if your transactional segment of code modifies, say, 128KB spread over 512K cache lines, you can expect a noticeable bit of serialization time when you “Commit!”. Don’t forget this issue now includes all your old-style locking, thanks to HLE, where the original locking involved updating just one cache line. This is another reason I want to see some real running code with HLE. Who knows what evil lurks between the locks?

But as I said, I don’t know the implementation. Could Intel folks have found a way around this? Maybe; I’m writing this, as I’ve indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell ships.

Enjoy your bright, shiny, new, non-blocking transactional memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug. It bites.

Monday, January 9, 2012

20 PFLOPS vs. 10s of MLOC: An Oak Ridge Conundrum

On The One Hand:

Oak Ridge National Laboratories (ORNL) is heading for a 20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to 18,000 GPUs.

This is, of course, neither a secret nor news. Look here, or here, or here if you haven’t heard; it was particularly trumpeted at SC11 last November. They’re upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere 2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring 10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.

The upgrade of the Jaguar Cray XT5 system will introduce new Cray XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big performance numbers come from new XK6 nodes, which replace two (half) of the AMDs with Nvidia Tesla 3000-series Kepler compute accelerator GPUs, as shown in the diagram. (The blue blocks are Cray’s Gemini inter-node communications.)

The actual performance is a range because it will “depend on how many (GPUs) we can afford to buy," according to Jeff Nichols, ORNL's associate lab director for scientific computing. 20 PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all the nodes are XK6s with their GPUs.

All this seems like a straightforward march of progress these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business as usual. The only news, and it is significant, is that it’s actually being done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs are, for good reason, the way to go these days. Lots and lots of GPUs.

On The Other Hand:

Oak Ridge has applications totaling at least 5 million lines of code most of which “does not run on GPGPUs and probably never will due to cost and complexity” [emphasis added by me].

That’s what was said at an Intel press briefing at SC11 by Robert Harrison, a corporate fellow at ORNL and director of the National Institute of Computational Sciences hosted at ORNL. He is the person working to get codes ported to Knight’s Ferry, a pre-product software development kit based on Intel’s MIC (May Integrated Core) architecture. (See my prior post MIC and the Knights for a short description of MIC and links to further information.)

Video of that entire briefing is available, but the things I’m referring to are all the way towards the end, starting at about the 50 minute mark. The money slide out of the entire set is page 30:

(And I really wish whoever was making the video didn’t run out of memory, or run out of battery, or have to leave for a potty break, or whatever else right after this page was presented; it's not the last.)

The presenters said that they had actually ported “tens of millions” of lines of code, most functioning within one day. That does not mean they performed well in one day – see MIC and the Knights for important issues there – but he did say that they had decades of experience making vector codes work well, going all the way back to the Cray 1.

What Harrison says in the video about the possibility of GPU use is actually quite a bit more emphatic than the statement on the slide:

Most of this software, I can confidently say since I'm working on them ... will not run on GPGPUs as we understand them right now, in part because of the sheer volume of software, millions of lines of code, and in large part because the algorithms, structures, and so on associated with the applications are just simply don't have the massive parallelism required for fine grain [execution]."

All this is, of course, right up Intel’s alley, since their target for MIC is source compatibility: Change a command-line flag, recompile, done.

I can’t be alone in seeing a disconnect between the Titan hype and these statements. They make it sound like they’re busy building a system they can’t use, and I have too much respect for the folks at ORNL to think that could be true.

So, how do we resolve this conundrum? I can think of several ways, but they’re all speculation on my part. In no particular order:

- The 20 PFLOP number is public relations hype. The contract with Cray is apparently quite flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they like, presumably including zero. That’s highly unlikely, but it does allow a “try some and see if you like it” approach which might result in rather few XK6 nodess installed.

- Harrison is being overly conservative. When people really get down to it, perhaps porting to GPGPUs won’t be all that painful -- particularly compared with the vectorization required to really make MIC hum.

- Those MLOCs aren’t important for Jaguar/Titan. Unless you have a clearance a lot higher than the one I used to have, you have no clue what they are really running on Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or what they run there may slip smoothly onto GPGPUs, or they may be so important a GPGPU porting effort is deemed worthwhile.

- MIC doesn’t arrive on time. MIC is still vaporware, after all, and the Jaguar/Titan upgrade is starting now. (It’s a bit delayed because AMD’s having trouble delivering those Interlagos Opterons, but the target start date is already past.) The earliest firm deployment date I know of for MIC is at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. Its new Stampede system uses MIC and deploys in 2013.

- Upgrading is a lot simpler and cheaper – in direct cost and in operational changes – than installing something that could use MIC. After all, Cray likes AMD, and uses AMD’s inter-CPU interconnect to attach their Gemini inter-node network. This may not hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia chips are attached by PCI-e links. PCI-e is what Knight’s Ferry and Knight’s Crossing (the product version) use, so one could conceivably plug them in.

- MIC is too expensive.

That last one requires a bit more explanation. Nvidia Teslas are, in effect, subsidized by the volumes of their plain graphics GPUs. Thise use the same architecture and can to a significant degree re-use chip designs. As a result, the development cost to get Tesla products out the door is spread across a vastly larger volume than the HPC market provides, allowing much lower pricing than would otherwise be the case. Intel doesn’t have that volume booster, and the price might turn out to reflect that.

That Nvidia advantage won’t last forever. Every time AMD sells a Fusion system with GPU built in, or Intel sells one of their chips with graphics integrated onto the silicon, another nail goes into the coffin of low-end GPU volume. (See my post Nvidia-based Cheap Supercomputing Coming to an End; the post turned out to be too optimistic about Intel & AMD graphics performance, but the principle still holds.) However, this volume advantage is still in force now, and may result in a significantly higher cost for MIC-based units. We really have no idea how Intel’s going to price MIC, though, so this is speculation until the MIC vapor condenses into reality.

Some of the resolutions to this Tesla/MIC conflict may be totally bogus, and reality may reflect a combination of reasons, but who knows? As I said above, I’m speculating, a bit caught…

I’m just a little bit caught in the middle

MIC is a dream, and Tesla’s a riddle

I don’t know what to say, can’t believe it all, I tried

I’ve got to let it go

And just enjoy the show.[1]

[1] With apologies to Lenka, the artist who actually wrote the song the girl sings in Moneyball. Great movie, by the way.

Friday, October 28, 2011

MIC and the Knights

Intel’s Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s Ferry (KF) software development kit. Well, I thought it was wide display, but I’m an IDF Newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn’t consider the display “wide” in IDF terms unless it’s in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here is my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larraabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.

You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new task manager display changed to be a better visualization of many more – up to 640 – cores. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC processors; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC codes have, like him, a very large body of legacy codes they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, R² force laws; irrelevant when sufficiently far, dominant when close.) Changing the way those tightly-coupled calculations and done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes from Gauss-Seidel iteration to Gauss-Jacobi simulation. I went into this in more detail way back in my book In Search of Clusters, (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get” tons more performance / lower power / whatever.

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple nodes. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on other nodes have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here are Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skill, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work at the level of intellectual academic accolades to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!

Wednesday, October 5, 2011

Will Knight’s Corner Be Different? Talking to Intel’s Joe Curley at IDF 2011

At the recent Intel Developer Forum (IDF), I was given the opportunity to interview Joe Curley, Director, Technical Computing Marketing of Intel’s Datacenter & Connected Systems Group in Hillsboro.

Intel-provided information about Joe:

Joe Curley, serves Intel® Corporation as director of marketing for technical computing in the Data Center Group. The technical computing marketing team manages marketing for high-performance computing (HPC) and workstation product lines as well as future Intel® Many Integrated Core (Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities that lead up to the announcement of the Intel® MIC Architecture in May of 2010. Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng Labs in a series of marketing and engineering leadership roles.

I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! to all who responded.)

This is the last in a series of three such transcripts. Hallelujah! Doing this has been a pain. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews. But some were in the interviews, including here.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

[We began by discovering we had similar deep backgrounds, both starting in graphics hardware. I designed & built a display processor (a prehistoric GPU), he built “the most efficient framework buffer controller you could possibly make”. Guess which one of us is in marketing?]

A: My experience in the [HPC] business really started relatively recently, a little under five years ago, [when] I started working on many-core processors. I won’t be able to go into history, but I can at least tell you what we’re doing and why.

Q: Why don’t we start there? At a high level, what are you doing, and why? High level for what you are doing, and as much detail on “why” as you can provide.

A: We have to narrow the question. So, at Intel, what we’re after first of all is in what we call our Technical Computing Marketing Group inside Data Center Group. That has really three major objectives. The first one is to specify the needs for high performance computing, how we can help our customers and developers build the best high performance computing systems.

Q: Let me stop you for a second right there. My impression for high performance computing is that they are people whose needs are that they want more. Just more.

A: Oh, yes, but more at what cost? What cost of power, what cost of programability, what cost of size. How are we going to build the IO system to handle it affordably or use the fabric of the day.

Q: Yes, they want more, but they want it at two bytes/FLOPS of memory bandwidth and communication bandwidth.

A: There’s an old thing called the Dilbert Spec, which is “I want it all, and by the way, can it be free?” But that’s not really what people tell us they want. People in HPC have actually been remarkably pragmatic about what it takes to develop innovation. So they really want us to do some things, and do them really well.

By the way, to finish what we do, we also have the workstation segment, and the MIC Many Integrated Core product line. The marketing for that is also in our group.

You asked “what are you doing and why.” It would probably take forever to go across all domains, but we could go into any one of them a little bit better.

Q: Can you give me a general “why” for HPC, and a specific “why” for MIC?

A: Well, HPC’s a really good business. I get stunned, somebody must be Asking really weird questions, asking “why are you doing HPC?”

Q: What I’ve heard is that HPC is traditionally 12% of the market.

A: Supercomputing is a relatively small percentage of the market. HPC and technical computing, combined, is, not exactly, but roughly, a third of our data center business. [emphasis added by me] Our data center business is a pretty robust business. And high performance computing is a business that requires very high end, high performance processors. It’s actually a very desirable business to be in, if you can do it, and if your systems work. It’s a business we spend a lot of time working on because it’s a good business.

Now, if you look at MIC, back in 2005 we made a tacit conclusion that the performance of a system will come out of parallelism. Parallelism could be expressed at Intel in a lot of different ways. You can look at it as threads, we have this concept called hyperthreading. You can look at it as cores. And we have the SSE instructions sitting around which are SIMD, that’s a form of parallelism; people argue about the definition, but yes, it is. [I agree.] So you take a look at the basic architectural constructs, ease of programming, you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or vectors, these common attributes have been adopted and well-used by a lot of programmers. There are programs across the continuum of coarse- to fine-grained parallel, embarrassingly parallel, pick your taxonomy. But there are applications that developers would be willing to trade the performance of any particular task or thread for the sum of what you can do inside the power envelope at a given period of time. Lots of people have different ways of defining that, you hear throughput, whatever, but this is the class of applications, and over time they’re growing.

Q: Growing relatively, or, say, compared to commercial processing, or…? Is the segment getting larger?

A: The number of people who have tasks they want to run on that kind of hardware is clearly growing. One of the reasons we’re doing MIC, maybe I should just cut it to the easiest answer, is developers and customers asked us to.

Q: Really?

A: And they came to us with a really simple question. We were struggling in the marketing group with how to position MIC, and one of our developers got worked up, like “Look, you give me the parallel performance of an accelerator, but you give me the ease of CPU programming!” Now, ease is a funny word; you can get into religious arguments about ease. But I think what he means is “I don’t have to re-think my algorithm, I don’t have to reorder my data set, there are some things that I don’t have to do. So that they wanted to have the idea of give me this architecture and get it to scale to be wildly parallel. And that is exactly what we’ve done with the MIC architecture. If you think about what the Kinght’s Ferry STP [? Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.] is, a 32 core, coherent, on a chip, teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the programming model is, surprisingly, kind of like a bunch of processor cores on a network, which a lot of people understand and can get a lot of utility out of in a very well-understood way. So, in a sense, we’re giving people what they want, and that, generally, is good business. And if you don’t give them what they want, they’ll have to go find someone else. So we’re simply doing what our marketplace asked us for.

Q: Well, let me play a little bit of devil’s advocate here, because MIC is very clearly derivative of Larrabee, and…

A: Knight’s Ferry is.

Q: … Knight’s Ferry is. Not MIC?

A: No. I think you have to take a look at what Larrabee was. Larrabee, by the way, was a really cool project, but what Larrabee was was a tile rendering graphics device, which meant its design point, was first of all the programming model was derived from what you do for graphics. It’s going to be API-based, the answer it’s going to generate is going to be a pixel, the pixel is going to have a defined level of sub-pixel accuracy. It’s a very predictable output. The internal optimizations you would make for a graphics implementation of a general many-core architecture is one very specific implementation. Let’s talk about the needs of the high performance computing market. I need bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have a frame buffer.

Q: It needed bandwidth to local memory [of which it didn’t have enough; see my post The Problem with Larrabee]

A: Yes, but less than you think, because the cache was the critical element in that architecture [again, see that post] if you look through the academic papers on that…

Q: OK, OK.

A: So, they have a common heritage, they’re both derived out of the thoughts that came out of the Intel Labs terascale research. They’re both many-core. But Knight’s Ferry came out with a few, they’re only a few, modifications. But the programming model is completely different. You don’t program a graphics device like you do a computer, and MIC is a computer.

Q: The higher-level programming model is different.

A: Correct.

Q: But it is a big, wide, cache-coherent SMP.

A: Well, yes, that’s what Knight’s Ferry is, but we haven’t talked about what Knight’s Corner yet, and unfortunately I won’t today, and we haven’t talked about where the product line will go from there, either. But there are many things that will remain the same, because there are things you can take and embellish and work and things that will be really different.

Q: But can you at least give me a hint? Is there a chance that Knight’s Corner will be a substantially different hardware model than Knight’s Ferry?

A: I’m going to really love to talk to you about Knight’s Corner. [his emphasis]

Q: But not today.

A: I’m going to duck it today.

Q: Oh, man…

A: The product is going to be in our 22 nm process, and 22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to have the buzz generated, we’ll start generating buzz. Right now, the big thing is that we’re making the investments in the Knight’s Ferry software development platform, to see how codes scale across the many-core, to get the environment and tools up, to let developers poke at it and find stuff, good stuff, bad stuff, in between stuff, that allow us to adjust the product line for ongoing generations. We’ve done that really well since we announced the architecture about 15 months ago.

Q: I was wondering what else I was going to talk about after having talked to both John Hengeveld and Jim Reinders. This is great. Nobody talked about where it really came from, and even hinted that there were changes to the MIC chip [architecture].

A: Oh, no, no, many things will be the same, many things will be different. If you’re targeting trying to do a pixel-renderer, go do a pixel-renderer. If you’re trying to do a general-purpose computing device, do a general-purpose computing device. You’ll see some things and say “well, it’s all the same” and other things “wow, it’s completely different.” We’ll get around to talking about the part when we’re a little closer.

The most important thing that James and/or John should have been talking about is that the key thing is the ability to not force the developer to completely and utterly re-think their problem to use your hardware. There are two models: In an accelerator model, which is something I spent a lot of my life working with, accelerators have the advantage of optimization. You can say “I want to do one thing really well.” So you can then describe a programming model for the hardware. You can say “build your data this way, write your program this way” and if you do it will work. The problem is that not everything fits into the box. Oh, you have sparse data. Oh, you have recursive code.

Q: And there’s madness in that direction, because if you start supporting that you wind yourself around to a general-purpose machine. […usually, a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel of Reincarnation” in this blog, haven’t I? Oh, there it is: The Cloud Got GPUs, back in November 2010.]

A: Then it’s not an accelerator any more. The thing that you get in MIC is the performance of one of those accelerators. We’ve shown this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision, without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve shown that. So you get performance, but you’re getting it out of the general purpose programming model.

Q: That’s running LINPACK, or… ?

A: That was an even more basic thing; I’m just talking about SGEMM [single-precision dense matrix multiply].

Q: I just wanted to ground the number.

A: For LU factorization, I think we showed hybrid LU, really cool, one of the great things about this hybrid…

Q: They’re demo-ing that downstairs.

A: … OK. When the matrix size is small, I keep it on the host; when the matrix size is large, I move it. But it’s all the same code, the same code either place. I’m just deciding where I want to run the code intelligently, based on the size of the matrix. You can get the exact number, but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]

Q: Yaahh, well, there are a lot of people who can deliver something like that.

A: We’ll keep working on it and making it better and better. So, what are we proving today. All we’ve proven today is that the architecture is capable of performance. We’ve got a lot of work to do before we have a product, but the architecture has shown itself to be capable. The programming model, we have people who will speak for us, like the quotes that came from LRZ [data center for the universities of Munich and the Bavarian Academy of Sciences], from Leibnitz [same place], a code they couldn’t port to other accelerators was running in two hours and optimized in two days. Now, actual mileage may vary, see dealer for…

Q: So, there are things that just won’t run on a CUDA model? Example?

A: Well, perhaps, again, the thing you try to get to is whether there is evidence growing that what you say is real. So we’re having people who are starting to be able to speak to that, and that gives people the confidence that we’re going to be able to get there. The other thing it ends up doing, it’s kind of an odd benefit, as people have started building their code, trying to optimize it for MIC, they’re finding the parallelism, they’re doing what we wanted them to do all along, they’re taking the same code on their current cluster and they’re getting benefits right now.

Q: That’s got a long history. People would have some grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler couldn’t make crap out of it. So they cleaned it up, made it obvious what was going on, and the vectorizer did its thing well. Then they put it back on the original machine and it ran twice as fast.

A: So, one of the nice things that’s happened is that as people are looking at ways to scale power, performance, they’re finally getting around to dealing with parallelism. The offer that we’re trying to provide is portable, high level, standards-based, and you can use it now.

You said “why.” That’s why. Our customers and developers say “if you can do that, that’s really valuable.” Now. We’re four men and a pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but the thought and the promise and the early data is really good.

Q: OK. Well, great.

A: Was that a good use of the time?

Q: That’s a very good use of the time. Let me poke on one thing a little bit. Conceptually, it ought to be simpler to write code to that kind of a shared memory model and get parallelism out of the code that way. Now, on the other hand, there was a talk – sorry, I forget his name, he was one of the software guys working on Larrabee [it was Tom Forsyth; see my post The Problem with Larrabee again] said someone on the project had written four renderers, and three of them were for Larrabee. He was having one hell of a time trying to get something that performed well. His big issue, at least what it came down to from what I remember of the talk, was memory bandwidth.

A: Well, first of all, we’ve said Larrabee’s not a product. As I’ve said, one of the things that is critical, you’ve got the compute-bound, you’ve got the memory-bound, and most people are somewhere in between, but you have to be able to handle the two edge cases. We understand that, and we intend to deliver a really good value across the spectrum. Now, Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation off the silicon we used, no one cares about that, but on Knight’s Ferry, the memory bus is 256 bits wide. Relatively narrow, and for a graphics processor, very narrow. There are definitely design decisions in how that chip was made that would limit the bandwidth. And the memory it was designed with is slower than the memory today, you have all of the normal things. But if you went downstairs to the show floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic ray-tracer.

[What follows is a bit confused. He didn’t mean the Austrian Crown stochastic ray-tracing demo, but rather the real-time ray-tracing demo. As I said in my immediately previous post (Random Things of Interest at IDF 2011), the real-time demo is on a set of Knight’s Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t seen the real-time demo, just the stochastic one; the latter is not on Knight’s Ferry.]

Q: I’ve seen that one. The Austrian Crown?

A: Yes.

Q: I thought that was on a cluster.

A: In the little box behind there, he’s able to scale from one to eight Knight’s Ferries.

Q: He never told me there was a Knight’s Ferry in there.

A: Yes, it’s all Knight’s Ferry.

Q: Well, I’m going to go down there and beat on him a little bit.

A: I’m about to point you to a YouTube site, it got compressed and thrown up on YouTube. You can’t get the impact of the complexity of the rays, but you can at least get the superficial idea of the responsiveness of the system from Knight’s Ferry.

[He didn’t point me to YouTube, or I lost it, but here’s one I found. Ignore the fact that the introduction is in Swedish or something [it's Dutch, actually]; Daniel – and it’s Daniel, not David – speaks English, and gives a good demo. Yes, everybody in the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the Random Things of Interest post to directly include it.]

Well, if you believe that what we’re going to do in our mainstream processors is roughly double the FLOPS every generation for the next many generations, that’s our intent. What if we can do that on the MIC line as well? By the time you get to where ray-tracing would be practical, you could see multiple of those being integrated into a single device [added in transcription: Multiple MICs in a single device? Hierarchical MIC?] becomes practical computationally. That won’t be far from now. So, it’s a nice demo. David’s an expert in his field, I didn’t hear what he said, but it you want to see the device downstairs actually running a fairly strenuous graphics workload, take a look at that.

Q: OK. I did go down there and I did see that, I just didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.] On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian Crown. It is.]

[At that point, Dave Patterson walked in, which interrupted us. We said hello – I know Dave of old, a bit – thanks were exchanged with Joe, and I departed.]

[I can’t believe this is the end of the last one. I really don’t like transcribing.]

Sunday, October 2, 2011

Random Things of Interest at IDF 2011 (Intel Developer Forum)

I still have one IDF interview to transcribe (Joe Curley), but I’m sick of doing transcriptions. So here are a few other random things I observed at the 2011 Intel Developers Forum. It is nothing like comprehensive. It’s also not yet the promised MIC dump; that will still come.

Exhibit Hall

I found very few products I had a direct interest in, but then again I didn’t look very hard.

On the right, immediately as you enter, was a demo of a Xeon/MIC combination clocking 600-700 GFLOPS (quite assuredly single precision) doing LRU Factorization. Questions to the guys running the demo indicated: (1) They did part on the Xeon, and there may have been two of those, they weren’t sure (the diagram showed two). (2) They really learned how to say “We don’t comment on competitors” and “We don’t comment on unannounced products.”

A 6-legged robot controlled by Atom, controlled by a game controller. I included this here only because it looked funky and I took a picture (q. v.). Also, for some reason it was in constant slight motion, like it couldn’t sit still, ever.

There were three things that were interesting to me in the Intel Labs section:

One Tbit/sec memory stack: To understand why this is interesting, you need to know that the semiconductor manufacturing processes used to make DRAM and logic are quite different. Putting both on the same chip requires compromises in one or the other. The logic that must exist on DRAM chips isn’t quite as good as it could be, for example. In this project, they separated the two onto separate chips in a stack: Logic is on one, the bottom one, that interfaces with the outside world. On top of this are multiple pure memory chips, multiple layers of pure DRAM, no logic. They connect by solder bumps or something (I’m not sure), and there are many (thousands of) “through silicon vias” that go all the way through the memory chips to allow connecting a whole stack to the logic at the bottom with very high bandwidth. This whole idea eliminates the need to compromise on semiconductor processes, so the DRAM can be dense (and fast), and the logic can be fast (and low power). One result is that they can suck 1 Tbit/sec of data out of one of these stacks. This just feels right to me as a direction. Too bad they’re unlikely to use the new IBM/3M thermally conductive glue to suck heat out of the stack.

Stochastic Ray-Tracing: What it says: Ray-tracing, but allows light to be probabilistically scattered off surfaces, so, for example, shiny matte surfaces have realistically blurred reflections on them, and produce more realistic color effects on other surfaces to which they reflect. Shiny matte surfaces like the surface of the golden dome in the center of the Austrian crown, reflecting the jewels in the outer band, which was their demo image. I have a picture here, but it comes nowhere near doing this justice. The large, high dynamic range monitor they had, though – wow. Just wow. Spectacular. A guy was explaining this to me pointing to a normal monitor when I happened to glance up at the HDR one. I was like “shut up already, I just want to look at that.” To run it they used a cluster of four Xeon-based nodes, each apparently about 4U high, and that was not in real time; several seconds were required per update. But wow.

Real-Time Ray-Tracing: This has been done before; I saw it a demo on a Cell processor back in about 2006. This, however, was a much more complex scene than I’d previously viewed. It had the usual shiny classic car, but that was now in the courtyard of a much larger old palace-like building, with lots of columns and crenellations and the like. It ran on a MIC, of course – actually, several of them, all attached to the same Xeon system. Each had a complete copy of the scene data in its memory, which is unrealistic but does serve to make the problem “pleasantly parallel” (which is what I’m told is now the PC way to describe what used to be called “embarrassingly parallel”). However, the demo was still fun. Here's a video of it I found. It apparently was shot at a different event, but still the same technology demonstrated. The intro is in Swedish, or something, but it reverts to English at the demo. And yes, all the Intel Labs guys wore white lab coats. I teased them a bit on that.

Keynotes

Otellini (CEO): Intel is going hot and heavy into supporting the venerable Trusted Platform technology, a collection of technology which might well work, but upon which nobody has yet bitten. This security emphasis clearly fits with the purchase of MacAfee (everybody got a free MacAfee package at registration, good for 3 systems). “2011 may be the year the industry got serious about security!” I remain unconvinced.

Mooley Eden (General Manager, Mobile Platforms): OK. Right now, I have to say that this is the one time in the course of these IDF posts that I am going to bow to Intel’s having paid for me to attend IDF, bite my tongue rather than succumbing to my usual practice of biting the hand that feeds me, and limit my comments to:

Mooley Eden must be an acquired taste.

To learn more of my personal opinions on this subject, you are going to have to buy me a craft beer (dark & hoppy) in a very noisy bar. Since I don’t like noisy bars, and that’s an unusual combination, I consider this unlikely.

Technically… Ultrabooks, ultrabooks, “beyond thin and light.” More security. They had a lame Ninja-garbed guy on stage, trying to hack into trusted-platform-protected system, and of course failing. (Please see this.) There was also a picture of a castle with a moat, and a (deliberately) crude animation of knights trying to cross the moat and falling in. (I mention this only because it’s relevant to something below.)

People never use hibernate, because it takes too long to wake up. The solution is… to have the system wake up regularly. And run sync operations. Eh what? Is this supposed to cause your wakeup to take less time because the wakeup time is actually spent syncing? My own wakeup time is mostly wakeup. All I know is that suspend/resume used to be really fast, reliable, and smart. Then it got transplanted to Windows from BIOS and has been unsatisfactory - slow and dumb - ever since.

This was my first time seeing Windows 8. It looks like Mango phone interface. Is making phones & PCs look alike supposed to help in some way? (Like boost Windows Phone sales?) I’m quite a bit less than intrigued. It means I’m going to have to buy another laptop before Win 8 becomes the standard.

Justin Rattner (CTO): Some of his stuff I covered in my first post on IDF. One I didn’t cover was the massive deal made of CERN and the LHC (Large Hadron Collider) (“the largest machine human beings have ever created”) (everybody please now go “ooooohhh”) using MICs. Look, folks, the major high energy physics apps are embarrassingly parallel: You get a whole lot, like millions, billions, of particle collisions, gather each one’s data, and do an astounding amount of floating-point computing on each completely independent set of collision data. Separately. Hoping to find out that one is a Higgs boson or something. I saw people doing this in the late 1980s at Fermilab on a homebrew parallel system. They even had a good software framework for using it: Write your (serial) code for analyzing a collision your way, and hand it to us; we run it many times in parallel, just handing out each event’s data to an instance of your code. The only thing that would be interesting about this would be if for some reason they actually couldn’t run HEP codes very well indeed. But they can run them well. Which makes it a yawn for me. I’ve no question that the LHC is ungodly impressive, of course. I just wish it were in Texas and called something else.

Intel Fellows Panel

Some interesting questions asked and answered, many questions lame. Like: “Will high-end Xeons pass mainframes?” Silly question. Depends on what “pass” means. In the sense in which most people may mean, they already have, and it doesn’t matter. Here are some others:

Q: Besides MIC, what else is needed for Exascale? A: We’re having to go all the way down to device level. In particular, we’re looking at subthreshold or near-threshold logic. We tried that before, but failed. Devices turn out to be most efficient 20mv above threshold. May have to run at 800MHz. [Implication: A whole lot of parallelism.] Funny how they talked about near-threshold logic, and Justin Rattner just happened to have a demo of that at the next day’s keynote.

Q: Are you running out of rare metals? A: It’s a question of cost. Yes, we always try to move off expensive materials. Rare earths needed, but not much; we only use them in layers like five atoms thick.

Q: Is Moore’s Law going to end? A: This was answered by Kelin J. Kuhn, Fellow & Director of the Technology and Manufacturing Group – i.e., she really knows silicon. She noted that, by observation, at every given generation it always looks like Moore’s Law ends in two generations. But it never has. Every time we see a major impediment to the physics – several examples given, going back to the 1980s and the end of Dennard scaling – something seems to come along to avoid the problem. The exception seems to be right now: Unlike prior eras when it will end in two generations, there don't seem to be any clouds on this particular horizon at all. (While I personally know of no reason to dispute this, keep in mind that this is from Intel, whose whole existence seems tied to Moore's Law, and it's said by the woman who probably has the biggest responsibility to make it all come about.)

An aside concerning the question-taking woman with the microphone on my side of the hall: I apparently reminded her of something she hates. She kept going elsewhere, even after standing right beside me for several minutes while I had my hand raised. What I was going to ask was: This morning in the keynote we saw a castle with a moat, and several knights dropping into the moat. The last two days we also heard a lot about a knight which appears to take a ferry across the moat of PCIe. Why are you strangling a TFLOP of computation with PCIe? Other accelerator vendors don’t have a choice with their accelerators, but you guys own the whole architecture. Surely something better could be done. Does this, perhaps, indicate a lack of integration or commitment to the new architecture across the organization?

Maybe she was fitted with a wiseass detection system.

Anyway, I guess I won’t find out this year.