tag:blogger.com,1999:blog-31559082281278418622024-03-12T23:38:00.515-06:00The Perils of ParallelA blog about multicore, cloud computing, accelerators, Virtual Worlds, and likely other topics, loosely driven by the effective end of Moore’s Law. It’s also a place to try out topics for my next book.Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.comBlogger74125tag:blogger.com,1999:blog-3155908228127841862.post-39735521709791634602012-11-12T14:00:00.000-07:002012-11-15T11:59:34.514-07:00Intel Xeon Phi Announcement (& me)<br />
1. No, I’m not dead. Not even sick. Been a long time since a post. More on this at the end.<br />
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
2. So, Intel has finally announced a product ancestrally based on the long-ago Larrabee. The architecture became known as MIC (Many Integrated Cores), development vehicles were named after tiny towns (Knights Corner/Knights Ferry – one was to be the product, but I could never keep them straight), and the final product is to be known as the Xeon Phi.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Why Phi? I don’t know. Maybe it’s the start of a convention of naming High-Performance Computing products after Greek letters. After all, they’re used in equations.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A micro-synopsis (see my post <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html" target="_blank">MIC and the Knights</a> for a longer discussion): The Xeon Phi is a PCIe board containing 6GB of RAM and a chip with lots (I didn’t find out how many ahead of time) of X86 cores with wide (512-bit) vector units, able to produce over 1 TeraFLOP (more about that later). The X86 cores are programmed pretty much exactly like a traditional “big” single Xeon: All your favorite compilers can be used, and it runs Linux. Note that to accomplish that, the cores must be fully cache-coherent, just like a multi-core “big” Xeon chip. Compiler mods are clearly needed to target the wide vector units, and that Linux undoubtedly had a few tweaks to do anything useful on the 50+ cores there are per chip, but they look normal. Your old code will run on it. <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html" target="_blank">As I’ve pointed out</a>, modifications are needed to get anything like full performance, but you do not have to start out with a blank sheet of paper. This is potentially a very big deal.<o:p></o:p><br />
<br />
<i><span style="color: blue;">Since I originally published this, Intel has deluged me with links to their information. See the bottom of this post if you want to read them.</span></i></div>
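<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To make the “your old code will run” claim concrete, here’s the kind of thing I have in mind – a completely ordinary OpenMP loop in C, nothing Phi-specific in the source. The build lines in the comment are my assumption about the workflow Intel describes (a “-mmic”-style cross-compile flag); I haven’t run this on a Phi, so treat it as a sketch.</div>
<pre>
/* saxpy.c -- deliberately ordinary OpenMP C; nothing Phi-specific here.
 * Assumed build lines (my sketch of the workflow, not verified on hardware):
 *   host Xeon:  icc -openmp -O2 saxpy.c -o saxpy
 *   Xeon Phi:   icc -openmp -O2 -mmic saxpy.c -o saxpy.mic
 * Same source; on the Phi build the compiler targets the 512-bit vector units.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long n = 1000000;
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);

    for (long i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* OpenMP spreads iterations across the 50+ cores; the compiler is free
       to vectorize the loop body. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}
</pre>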
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, it’s here. Finally, some of us would say, but development processes vary and may have hangups that nobody outside ever hears about.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I found a few things interesting about the announcement.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Number one is their choice as to the first product. The one initially out of the blocks is not a lower-performance version, but rather the high end of the current generation: the one that costs more ($2649) and has high performance on double-precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see <i>significant</i> revenue <b><i>right now</i></b> out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while now, so it has seen some trial use. At national labs and (maybe?) large enterprises.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
It costs more than the clear competitor, Nvidia’s Tesla boards. $2649 vs. sub-$2000. For less peak performance. (Note: I've been told that AnandTech claims the new Nvidia K20s cost >$3000. I can't otherwise confirm that.) We can argue all day about whether the actual performance is better or worse on real applications, and how much the ability to start from existing code helps, but this pricing still stands out. Not that anybody will actually pay that much; the large customer targets are always highly-negotiated deals. But the Prof. Joes and the Kacklefoos don’t have negotiation leverage.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A second odd point came up in the Q & A period of the pre-announce concall. (I was invited, which is why I’ve come out of my hole to write this.) (Guilt.) Someone asked about memory bottlenecks; it has 310GB/s to its memory, which isn’t bad, but some apps are gluttons. This prompted me to ask about the PCIe bottleneck: Isn’t it also going to be starved for data delivered to it? I was told I was doing it wrong. I was thinking of the main program running on the host, foisting stuff off to the Phi. Wrong. The main program runs on the Phi itself, so the whole program runs on the card’s many (slower) cores.<o:p></o:p></div>
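<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Here’s the shape of what I was (wrongly) assuming, as a hedged sketch: the host runs main() and offloads hot loops to the card, shipping data across PCIe each way. The pragma syntax here is my reading of Intel’s offload compiler extensions, so don’t take the details as gospel; the point is just where the PCIe traffic sits in this model, and why Intel’s answer was, in effect, “run the whole program natively on the card instead.”</div>
<pre>
/* Offload-style sketch: host runs main(), "foists" work onto the Phi.
 * The #pragma offload syntax is my understanding of Intel's compiler
 * extensions -- illustrative only.
 */
#include <stdio.h>

#define N 1000000
static float in_data[N], out_data[N];

int main(void)
{
    for (long i = 0; i < N; i++)
        in_data[i] = (float)i;

    /* Data crosses the PCIe link here and back -- the bottleneck I asked about. */
    #pragma offload target(mic) in(in_data) out(out_data)
    {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            out_data[i] = 2.0f * in_data[i];
    }

    printf("%f\n", out_data[N - 1]);
    return 0;
}
</pre>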
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This means they are, at this time at least, explicitly not taking the path I’ve heard Nvidia evangelists talk about recently: Having lots and lots of tiny cores, along with a number of middle-size cores, and far fewer Great Big cores – and they all live together in a crooked little… Sorry! on the same chip, sharing the same memory subsystem so there are oodles of bandwidth amongst them. This could allow the parts of an application that are limited by single- or few-thread performance to go fast, while the parts that are many-way parallel also go fast, with little impedance mismatch between them. On Phi, if you have a single-thread limited part, it runs on just one of the CPUs, which haven’t been designed for peak single-thread performance. On the other hand, the Nvidia stuff is vaporware, and while this kind of multi-speed arrangement has been talked about for a long time, I don’t know of any compiler technology that supports it with any kind of transparency.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A third item, and this seems odd, is the small speedups claimed by the marketing guys: Just 2X-4X. Eh what? 50 CPUs and only 2-4X faster?<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is incredibly refreshing. The claims of accelerator-foisting companies can be outrageous to the point that they lose all credibility, <a href="http://perilsofparallel.blogspot.com/2010/06/ten-ways-to-trash-your-performance.html" target="_blank">as I’ve written about before</a>.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the other hand, it’s slightly bizarre, given that at the same conference Intel has people talking about applications that, when retargeted to Phi, get 6.6X (in figuring out graph connections on big graphs) or 4.8X (analyzing synthetic aperture radar images).<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the gripping hand, I really see the heavy hand of Strategic Marketing smacking people around here. Don’t cannibalize sales of the big Xeon E5s! They are known to make Big Money! Someone like me, coming from an IBM background, knows a whole lot about how The Powers That Be can influence how seemingly competitive products are portrayed – or developed. I’ve a sneaking suspicion this influence is why it took so long for something like Phi to reach the market. (Gee, Pete, you’re a really great engineer. Why are you wasting your time on that piddly little sideshow? We’ve got a great position and a raise for you up here in the big leagues…) (Really.)<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There are rationales presented: They are comparing apples to apples, meaning well-parallelized code on Xeon E5 Big Boys compared with the same code on Phi. This is to be commended. Also, Phi ain’t got no hardware ECC for memory. Doing ECC in software on the Phi saps its strength considerably. (Hmmm, why do you suppose it doesn’t have ECC? (Hey, Pete, got a great position for you…) (Or “Oh, we're not a threat. We don't even have ECC! Nobody will do serious stuff without ECC.”)) <b><span style="color: red;">Note: Since this pre-briefing, data sheets have emerged that indicate Phi has optional ECC. Which raises two questions: Why did they explicitly say otherwise in the pre-briefing? And: What does "optional" mean?</span></b><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Anyway, Larrabee/MIC/Phi has finally hit the streets. Let the benchmark and marketing wars commence.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now, about me not being dead after all:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I don’t do this blog-writing thing for a living. I’m on the dole, as it were – paid for continuing to breathe. I don’t earn anything from this blog; those Google-supplied ads on the sides haven’t put one dime in my pocket. My wife wants to know why I keep doing it. But note: having no deadlines is <i>wonderful</i>.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So if I feel like taking a year off to play Skyrim, well, I can do that. So I did. It wasn't supposed to be a year, but what the heck. It's a big game. I also participated in some pleasant Grandfatherly activities, paid very close attention to exactly when to exercise some near-expiration stock options, etc.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Finally, while I’ve occasionally poked my head up on Twitter or Facebook when something interesting happened, there hasn’t been much recently. X added N more processors to the same architecture. Yawn. Y went lower power with more cores. Yawn. If news outlets weren’t paid for how many eyeballs they attracted, they would have been yawning, too, but they are, so every minute twitch becomes an Epoch-Defining Quantum Leap!!! (Complete with ironic use of the word “quantum.”) No judgment here; they have to make a living.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.<br />
<br />
<br />
<div style="border-bottom: dotted windowtext 3.0pt; border: none; mso-element: para-border-div; padding: 0in 0in 1.0pt 0in;">
<div class="MsoNoSpacing" style="border: none; mso-border-bottom-alt: dotted windowtext 3.0pt; mso-padding-alt: 0in 0in 1.0pt 0in; padding: 0in;">
----------------------------------------------------------------------------------------</div>
</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel has deluged me with links. Here they are:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel® Xeon Phi™ product page: <a href="http://www.intel.com/xeonphi">http://www.intel.com/xeonphi</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel® Xeon Phi™ Coprocessor product brief: <a href="http://intel.ly/Q8fuR1">http://intel.ly/Q8fuR1</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Accelerate Discovery with Powerful New HPC Solutions
(Solution Brief) <a href="http://intel.ly/SHh0oQ">http://intel.ly/SHh0oQ</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
An Overview of Programming for Intel® Xeon® processors
and Intel® Xeon Phi™ coprocessors <a href="http://intel.ly/WYsJq9">http://intel.ly/WYsJq9</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
YouTube Animation Introducing the Intel® Xeon Phi™
Coprocessor <a href="http://intel.ly/RxfLtP">http://intel.ly/RxfLtP</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel® Xeon Phi™ Coprocessor Infographic: <a href="http://ow.ly/fe2SP">http://ow.ly/fe2SP</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
VIDEO: The History of Many Core, Episode 1: The Power
Wall. <a href="http://bit.ly/RSQI4g">http://bit.ly/RSQI4g</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Diane Bryant’s presentation, additional documents and
pictures will be available at <a href="http://newsroom.intel.com/community/intel_newsroom/blog/2012/11/12/intel-delivers-new-architecture-for-discovery-with-intel-xeon-phi-coprocessors">Intel
Newsroom</a><o:p></o:p></div>
</div>
Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com1tag:blogger.com,1999:blog-3155908228127841862.post-57639050252080748192012-02-13T19:49:00.000-07:002012-02-21T14:53:35.024-07:00Transactional Memory in Intel Haswell: The Good, and a Possible Ugly<b><span style="color: red;">Note: Part of this post is being updated based on new information received after it was published. Please check back tomorrow, 2/18/12, for an update. </span></b><br />
<b><span style="color: red;"><br /></span></b><br />
<b><span style="color: red;">Sorry... no update yet (2/21/12). Bad cold. Also seduction / travails of a new computer. Update soon.</span></b><br />
<br />
<div class="MsoNoSpacing">
</div>
<div class="MsoNoSpacing">
I asked James Reinders at the Fall 2011 IDF whether the
synchronization features he had from the X86 architecture were adequate for MIC.
(<a href="http://perilsofparallel.blogspot.com/2011/09/conversation-with-intels-james-reinders.html">Transcript</a>;
see the very end.) He said that they were good for the 30 or so cores in
Knight’s Ferry, but when you got above 40, they would need to do something different.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now Intel has <a href="http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/">announced</a>
support for transactional memory in Haswell, the chip which follows their Ivy
Bridge chip that is just starting shipment as I write this. So I think I’d now
be willing to take bets that this is what James was vaguely hinting at, and
will appear in <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">Intel’s
MIC HPC architecture</a> as it ships in the Knight’s Corner product. I prefer to
take bets on sure things.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There has been some light discussion of Intel’s
“Transactional Synchronization Extensions” (TSX), as this is formally called, and
a good <a href="http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/">example
of its use</a> from James Reinders. But now that an <a href="http://software.intel.com/en-us/avx/">architecture spec</a> has been
published for TSX, we can get a bit deeper into what, exactly, Intel is
providing, and where there might just be a skeleton in the closet.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
First, background: What is this “transactional memory”
stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I
believe is lurking. </div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
Transactional Memory</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The term “transaction” comes from contract law, was
picked up by banking, and from there went into database systems. It refers to a
collection of actions that all happen as a unit; they cannot be divided. If I give
you money and you give me a property deed, for example, that happens as if it
were one action – a transaction. The two parts can’t be (legally) separated;
both happen, or neither. Or, in the standard database example: when I transfer
money from my bank account to Aunt Sadie’s, the subtraction from my account and
the addition to hers must either both happen, or neither; otherwise money is
being either destroyed or created, which would be a bad thing. As you might
imagine, databases have evolved a robust technology to do transactions where
all the changes wind up on stable storage (disk, flash).</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The notion of transactional memory is much the same: a
collection of changes to memory is made all-or-nothing: Either all of them
happen, as seen by every thread, process, processor, or whatever; or none of
them happen. So, for example, when some program plays with the pointers in a
linked list to insert or delete some list member, nobody can get in there when
the update is partially done and follow some pointer to oblivion.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
It applies as much to a collection of accesses – reads –
as it does to changes – writes. The read side is necessary to ensure that a
consistent collection of information is read and acted upon by entities that
may be looking around while another is updating.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To do this, typically a program will issue something meaning
“Transaction On!” to start the ball rolling. Once that’s done, everything it
writes is withheld from view by all other entities in the system; and anything
it reads is put on special monitoring in case someone else mucks with it. The
cache coherence hardware is mostly re-used to make this monitoring work; cross-system
memory monitoring is what cache coherence does. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This continues, accumulating things read and written,
until the program issues something meaning “Transaction Off!”, typically called
“Commit!” Then, hraglblargarolfargahglug! All changes are vomited at once into
memory, and the locations read are forgotten about.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What happens if some other entity does poke its nose into
those saved and monitored locations, changing something the transactor was
depending on or modifying? Well, “Transaction On!” was really “Transaction On! <i>And, by the way, if anything screws up go <b><u>there</u></b></i>.” On reaching <b><i><u>there,</u></i></b>
all the recording of data read and changes made has been thrown away; and <b><i><u>there</u></i></b>
is a block of code that usually sets things up to start again, going back to
the “Transaction On!” point. (The code could also decide “forget it” and not
try over again.) Quitting like this in a controlled manner is called <b>aborting</b> a transaction. It is obviously
better if aborts don’t happen a lot, just like it’s better if a lock is not
subject to a lot of contention. However, note that nobody else has seen any of
the changes made since “On!”, so half-mangled data structures are never seen by
anyone.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
Why Transactions Are a Good Thing</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What makes transactional semantics potentially more
efficient than simple locking is that only those memory locations read or
referenced <i>at run time</i> are maintained
consistently during the transaction. The consistency does not apply to memory
locations that <b>could</b> be referenced,
only the ones that <b>actually are</b>
referenced.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There are situations where that’s a distinction without a
difference, since everybody who gets into some particular transaction-ized
section of code will bash on exactly the same data every time. Example: A global
counter of how many times some operation has been done by all the processors in
a system. Transactions aren’t any better than locks in those situations. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But there are cases where the dynamic nature of transactional
semantics can be a huge benefit. The standard example, also used by James
Reinders, is a multi-access hash table, with inserts, deletions, and lookups
done by many processes, etc.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I won’t go through this in detail – you can read <a href="http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/">James’
version</a> if you like; he has a nice diagram of a hash table, which I don’t –
but consider: </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
With the usual lock semantics, you could simply have one
coarse lock around the whole table: Only one person, reader or writer, gets in at
any time. This works, and is simple, but all access to the table is now
serialized, so will cause a problem as you scale to more processors.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Alternatively, you could have a lock per hash bucket, for
fine-grained rather than coarse locking. That’s a lot of locks. They take up
storage, and maintaining them all correctly gets more complex.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Or you could do either of those – one lock, or many – but
also get out your old textbooks and try once again to understand those multiple
reader / single writer algorithms and their implications, and, by the way, did
you want reader or writer priority? Painful and error-prone.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the other hand, suppose everybody – readers and
writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”)
before starting a read or a write; then does a “Commit!” when they exit. This
is only as complicated as the single coarse lock (and sounds a lot like an “atomic”
keyword on a class, hint hint).</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Then what you can bank on is that the probability is tiny
that two simultaneous accesses will look at the same hash bucket; if that
probability is not tiny, you need a bigger hash table anyway. The most likely
thing to happen is that nobody – readers or writers – ever accesses the same hash
bucket, so everybody just sails right through, “Commit!”s, and continues, all
in parallel, with no serialization at all. (Not really. See the skeleton, later.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In the unlikely event that a reader and a writer are
working on the same bucket at the same time, whoever “Commit!”s first wins; the
other aborts and tries again. Since this is highly unlikely, overall the
transactional version of hashing is a big win: it’s both simple and very highly
parallel.</div>
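<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For concreteness, here’s the structure we’re talking about as a little pthreads sketch of my own (not anybody’s production hash table): the simple coarse-lock version, with comments marking where “Transaction On!” and “Commit!” conceptually replace the lock operations. The appeal of the transactional version is that the code stays exactly this simple, but two threads hitting different buckets no longer serialize on that one lock.</div>
<pre>
/* Coarse-lock hash table sketch (mine, purely illustrative).
 * Every reader and writer serializes on one lock, even though two different
 * keys almost never land in the same bucket. A transactional version keeps
 * this exact structure; the lock/unlock pair becomes begin/commit.
 */
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 4096

struct entry { long key; long val; struct entry *next; };

static struct entry *table[NBUCKETS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

void insert(long key, long val)
{
    struct entry *e = malloc(sizeof *e);
    e->key = key;
    e->val = val;

    pthread_mutex_lock(&table_lock);      /* "Transaction On!" goes here */
    long b = (unsigned long)key % NBUCKETS;
    e->next = table[b];
    table[b] = e;
    pthread_mutex_unlock(&table_lock);    /* "Commit!" goes here */
}

int lookup(long key, long *val)
{
    int found = 0;
    pthread_mutex_lock(&table_lock);      /* "Transaction On!" */
    for (struct entry *e = table[(unsigned long)key % NBUCKETS]; e; e = e->next)
        if (e->key == key) { *val = e->val; found = 1; break; }
    pthread_mutex_unlock(&table_lock);    /* "Commit!" */
    return found;
}
</pre>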
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Transactional memory is not, of course, the only way to skin
this particular cat. <a href="http://www.azulsystems.com/">Azul Systems</a> has
published a detailed presentation on a <a href="http://www.azulsystems.com/about_us/presentations/lock-free-hash?reg">Lock-Free
Wait-Free Hash Table</a> algorithm that uses only compare-and-swap
instructions. I got lost somewhere around the fourth state diagram. (Well, OK,
actually I saw the first one and kind of gave up.) Azul has need of such
things. Among other things, they sell massive Java compute appliances, going up
to the <a href="http://www.azulsystems.com/products/vega/specs">Azul Vega 3
7380D</a>, which has 864 processors sharing 768GB of RAM. Think investment
banks: take <i>that</i>, you massively
recomplicated proprietary version of a Black-Scholes option pricing model! In
Java! (Those guys don’t just buy GPUs.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
However, Azul only needs that algorithm on their port of
their software stack to X86-based products. Their Vega systems are based on
their own proprietary 54-core Vega processors, which have shipped with
transactional memory – which they call Speculative Multi-address Atomicity –
since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly
new news.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Anyway, if you want this wait-free super-parallel hash
table (and other things, obviously) without exploding your head, transactional
memory makes it possible rather simply.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
What Intel Has: RTM and HLE</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel’s Transactional Synchronization Extensions come in
two flavors: Restricted Transactional Memory (RTM) and Hardware Lock Elision
(HLE). </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
RTM is essentially what I described above: There’s XBEGIN
for “Transaction On!”, XEND for “Commit!” and XABORT if you want to manually
toss in the towel for some reason. XBEGIN must be given a <b><i><u>there</u></i></b> location to go
to in case of an abort. When an abort occurs, the processor state is restored
to what it was at XBEGIN, except that flags are set indicating the reason for
the abort (in EAX).</div>
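<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Since compilers supporting this haven’t shipped as I write, here’s a hedged sketch of how I expect it to look in C, using the intrinsic names from Intel’s documentation (_xbegin / _xend / _xabort in immintrin.h, built with an -mrtm-style flag) – treat the details as assumptions. Note the fallback lock: a transaction can always abort (conflict, capacity, interrupt), so there has to be a non-speculative path that is guaranteed to make progress, and the transaction has to check that lock so it doesn’t race anyone already on the slow path.</div>
<pre>
/* RTM sketch -- intrinsic names per Intel's documentation; compiler support
 * is assumed, not something I've built and run.
 */
#include <immintrin.h>

static volatile int fallback_lock = 0;    /* simple spinlock for the slow path */

static void lock_fallback(void)
{
    while (__sync_lock_test_and_set(&fallback_lock, 1))
        while (fallback_lock)
            ;                             /* spin */
}

static void unlock_fallback(void)
{
    __sync_lock_release(&fallback_lock);
}

void increment(long *counter)
{
    unsigned status = _xbegin();          /* "Transaction On!" plus the "there" address */
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock)                /* reading it adds it to our read set... */
            _xabort(0xff);                /* ...so a slow-path holder will abort us */
        *counter += 1;                    /* speculative; invisible to everyone else */
        _xend();                          /* "Commit!" */
    } else {
        /* We arrived "there": status (returned in EAX) says why we aborted.
           Give up on speculation and take the plain lock. */
        lock_fallback();
        *counter += 1;
        unlock_fallback();
    }
}
</pre>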
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
HLE is a bit different. All the documentation I’ve seen
so far always talks about it first, perhaps because it seems like it is more
familiar, or they want to brag (no question, it’s clever). I obviously think
that’s confusing, so didn’t do it in that order.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
HLE lets you take your existing, lock-based, code and
transactional-memory-ify it: Lock-based code now runs without blocking unless
required, as in the hash table example, with minimal, minuscule change that can
probably be done with a compiler and the right flag. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I feel like adding “And… at no time did their fingers
leave their hands!” It sounds like a magic trick.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In addition to being magical, it’s also clearly strategic
for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy
bet on people just wanting to recompile and run without the rewrites needed for
accelerators like GPGPUs. (See my post <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">MIC
and the Knights</a>.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
HLE works by setting a new instruction prefix – XACQUIRE –
on any instruction you use to try to acquire a lock. Doing so causes there to
be <b>no change</b> to the lock data: the
lock write is “elided.” Instead it (a) takes a checkpoint of the machine state;
(b) saves the address of the instruction that did this; (c) puts the lock
location in the set of data that is transactionally read; and (d) does a “Transaction
On!”</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So everybody goes charging right through the lock without
stopping, but now every location read is continually monitored, and every write
is saved, not appearing in memory.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If nobody steps on anybody else’s feet – writes someone
else’s monitored location – then when the instruction to release the lock is
done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug
flush of all the saved writes into memory, forgets everything monitored, and
turns off transaction mode.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If somebody does write a location someone else has read,
then we get an ABORT with its wayback machine: back to the location that tried
to acquire the lock, restoring the CPU state, so everything is like it was just
before the lock acquisition instruction was done. This time, though, the write
is <b>not</b> elided: The usual semantics
apply, and the code goes through exactly what it did without TSX, the way it
worked before.</div>
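<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Again hedged, since nothing supporting this has shipped: here’s how I’d expect HLE to surface to programmers – the same old spinlock, with elision hints on the acquire and release. The __ATOMIC_HLE_* flag names follow the GCC-style atomic builtins and are my assumption about eventual toolchain support, not something I’ve compiled for Haswell.</div>
<pre>
/* HLE sketch: a plain spinlock with elision hints. Flag names are assumed
 * (GCC-style atomic builtins); not verified on real hardware or compilers.
 */
#include <immintrin.h>   /* _mm_pause */

static int lockvar = 0;

void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the write to lockvar is elided; we run
       speculatively with lockvar in our transactional read set. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();     /* spin politely if we really do have to wait */
}

void hle_unlock(void)
{
    /* XRELEASE-prefixed store: the "Commit!" that makes our writes visible. */
    __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
</pre>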
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, as I understand it, if you have a hash table and a read
is under way, and a write to the same bucket happens, then both the read and the
write abort. One of those two gets the lock and does its thing, followed by the
other according to the original code. But other reads or writes that don’t have
conflicts go right through.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This seems like it will work, but I have to say I’d like
to see the data on real code. My gut tells me that anything which changes the
semantics of parallel locking, which HLE does, is going to have a weird effect
somewhere. My guess would be some fun, subtle, intermittent performance bugs.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
The Serial Skeleton in the Closet</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is all powerful stuff that will certainly aid
parallel efficiency in both MIC, with its 30-plus cores; and the Xeon line,
with fewer but faster cores. (Fewer faster cores need it too, since serialization
inefficiency gets proportionally worse with faster cores.) But don’t think for
a minute that it eliminates all serialization.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I see no issue with the part of this that monitors locations
read and written; I don’t know Intel’s exact implementation, but I feel sure it
re-uses the cache coherence mechanisms already present, which operate without
(too) much serialization.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
However, there’s a reason I used a deliberately disgusting
analogy when talking about pushing all the written data to memory on “Commit!”
(XEND, XRELEASE). Recall that the required semantics are “all or nothing”: <i>Every</i> entity in the system sees <i>all</i> of the changes, or <i>every</i> entity sees <i>none</i> of them. (I’ve been saying “entity” because GPUs are now prone
to directly access cache coherent memory, too.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If the code has changed multiple locations during a
transaction, probably on multiple cache lines, that means those changes have to
be made all at once. If locations A and B both change, nobody can possibly see location
A <i>after</i> it changed but location B <i>before</i> it changed. Nothing, anywhere,
can get between the write of A and the write of B (or the making of both changes
visible outside of cache).</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
As I said, I don’t know Intel’s exact implementation, so
could conceivably be wrong, but for me that implies that every “Commit!”
requires<i> a whole system serialization
event: Every processor and thread in the whole system has to be not just
halted, but pipelines drained.</i> Everything must come to a dead stop. Once
that stop is done, then all the changes can be made visible, and everything
restarted. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note that Intel’s TSX architecture spec says nothing
about these semantics being limited to one particular chip or region. This is
very good; software exploitation would be far harder otherwise. But it implies
that in a multi-chip, multi-socket system, this halt and drain applies to every
processor in every chip in every socket. It’s a dead stop of <i>everything</i>.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Well, OK, but lock acquire and release instructions
always did this dead stop anyway, so likely the aggregate amount of
serialization is reduced. (Wait a minute, they always did this anyway?! What
the… Yeah. Dirty little secret of the hardware dudes.) </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But lock acquire and release only involve one cache line
at a time. “Commit!” may involve many. Writes involve letting everybody else
know you’re writing a particular line, so they can invalidate it in their
cache(s). Those notifications all have to be sent out, serially, and
acknowledgements received. They can be pipelined, and probably are, but the
process is still serial, and must be done while at a dead stop.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, if your transactional segment of code modifies, say,
128KB spread over 2K cache lines (at 64 bytes per line), you can expect a noticeable bit of
serialization time when you “Commit!”. Don’t forget this issue now includes all
your old-style locking, thanks to HLE, where the original locking involved
updating just one cache line. This is another reason I want to see some real
running code with HLE. Who knows what evil lurks between the locks?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But as I said, I don’t know the implementation. Could
Intel folks have found a way around this? Maybe; I’m writing this, as I’ve
indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell
ships.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Enjoy your bright, shiny, new, non-blocking transactional
memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug.
It bites.</div>
<br />Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com8tag:blogger.com,1999:blog-3155908228127841862.post-3219092461982206082012-01-09T16:37:00.000-07:002012-01-09T16:37:25.648-07:0020 PFLOPS vs. 10s of MLOC: An Oak Ridge Conundrum<br />
<div class="MsoBodyText">
On The One Hand:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Oak Ridge National Laboratories (ORNL) is heading for a
20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to
18,000 GPUs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is, of course, neither a secret nor news. Look <a href="http://insidehpc.com/2011/10/11/20-petaflop-titan-super-at-ornl-to-be-accelerated-with-18000-nvidia-gpus/">here</a>,
or <a href="http://www.xbitlabs.com/news/other/display/20111012223953_ORNL_s_Titan_Supercomputer_to_Deliver_10_20_PetaFLOPS_Performance.html">here</a>,
or <a href="http://blogs.knoxnews.com/munger/2011/12/the-grand-move-toward-titan.html">here</a>
if you haven’t heard; it was particularly trumpeted at <a href="http://sc11.supercomputing.org/">SC11</a> last November. They’re
upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere
2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring
10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeVt2_3jOz0V2WQ6iFYKyCpPL9owoXojFZdyoYfjstT1R3nR3_SDF18eqBJb81qCgoOB9CmSl_Dkx-3Rl-Sd1LG74KtL3M8m2N32sGwHNu3myRoKvIo4DqEw9VqUCyXeGUU9Xa20q1k_5N/s1600/ORNL+jaguar.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeVt2_3jOz0V2WQ6iFYKyCpPL9owoXojFZdyoYfjstT1R3nR3_SDF18eqBJb81qCgoOB9CmSl_Dkx-3Rl-Sd1LG74KtL3M8m2N32sGwHNu3myRoKvIo4DqEw9VqUCyXeGUU9Xa20q1k_5N/s320/ORNL+jaguar.jpg" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgPXIlot6dEDRLFd8IlLQlqqv2bB-NktWBugSqxxa1kAH4fFUOSQbSr7_UTu4XdbOKjbbHA8cU2RRlZIt-iv_yPD_s_GB2b5CMDwUITpngcwOAxgj2l0BbDm2OuAr5WthDvQG9KeTdb0Gp/s1600/ORNL+Titan.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgPXIlot6dEDRLFd8IlLQlqqv2bB-NktWBugSqxxa1kAH4fFUOSQbSr7_UTu4XdbOKjbbHA8cU2RRlZIt-iv_yPD_s_GB2b5CMDwUITpngcwOAxgj2l0BbDm2OuAr5WthDvQG9KeTdb0Gp/s320/ORNL+Titan.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The upgrade of the Jaguar Cray XT5 system will introduce new Cray
XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big
performance numbers come from new <a href="http://www.theregister.co.uk/2011/05/24/cray_xk6_gpu_supercomputer/">XK6
nodes</a>, which replace two (half) of the AMDs with Nvidia Tesla 3000-series Kepler
compute accelerator GPUs, as shown in the diagram. (The blue blocks are Cray’s Gemini
inter-node communications.)</div>
<div class="MsoNoSpacing">
<o:p><br /></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMU85oF1nqufK82BgJr7EFGXVdMaYuNfs6TuO8Vz_XfAEPDEv18AL-Zidzf14AxA4uUuYq7gQD2Lge_Q_XJSnBmowuC2cFJPkjrPVVDQu45aWxwNiB1rXImdejQV044lVomJpeXhZYz0hi/s1600/cray_xe6_xk6_schematic.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMU85oF1nqufK82BgJr7EFGXVdMaYuNfs6TuO8Vz_XfAEPDEv18AL-Zidzf14AxA4uUuYq7gQD2Lge_Q_XJSnBmowuC2cFJPkjrPVVDQu45aWxwNiB1rXImdejQV044lVomJpeXhZYz0hi/s320/cray_xe6_xk6_schematic.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The actual performance is a range because it will “depend
on how many (GPUs) we can afford to buy," <a href="http://blogs.knoxnews.com/munger/2011/12/the-grand-move-toward-titan.html">according
to Jeff Nichols</a>, ORNL's associate lab director for scientific computing. 20
PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all
the nodes are XK6s with their GPUs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
All this seems like a straightforward march of progress
these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business
as usual. The only news, and it is significant, is that it’s actually being
done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs
are, for good reason, the way to go these days. Lots and lots of GPUs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On The Other Hand:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Oak Ridge has applications totaling at least 5 million
lines of code <b><i>most of which “does not run on GPGPUs and probably never will due to
cost and complexity”</i></b> [emphasis added by me]. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
That’s what was said at an Intel press briefing at SC11
by Robert Harrison, a corporate fellow at ORNL and director of the National
Institute of Computational Sciences hosted at ORNL. He is the person working to get codes ported to
Knight’s Ferry, a pre-product software development kit based on Intel’s MIC
(May Integrated Core) architecture. (See my prior post <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">MIC
and the Knights</a> for a short description of MIC and links to further
information.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<a href="http://insidehpc.com/2011/11/16/video-intels-knights-corner-does-1-teraflop-on-a-single-chip-at-sc11/">Video</a>
of that entire briefing is available, but the things I’m referring to are all
the way towards the end, starting at about the 50 minute mark. The money slide
out of the <a href="http://newsroom.intel.com/servlet/JiveServlet/download/38-6968/Intel_SC11_presentation.pdf">entire
set</a> is page 30:</div>
<div class="MsoNoSpacing">
<o:p><br /></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcHSanvZBnmNDbn4nT_FQB48ZFCTpfSi4R9tpbPjeBQmDWwmTQC8zrndzfHOk6U-1Yz4m9cnkseBfYa8gZD9i4_SGahRvzdrpVLR5z0qu4nIg7d_v1r2VJIcSYWpjzqPbo47ZNEN_p1yG0/s1600/Slide13.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcHSanvZBnmNDbn4nT_FQB48ZFCTpfSi4R9tpbPjeBQmDWwmTQC8zrndzfHOk6U-1Yz4m9cnkseBfYa8gZD9i4_SGahRvzdrpVLR5z0qu4nIg7d_v1r2VJIcSYWpjzqPbo47ZNEN_p1yG0/s400/Slide13.jpg" width="400" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
(And I really wish whoever was making the video didn’t
run out of memory, or run out of battery, or have to leave for a potty break, or whatever else
right after this page was presented; it's not the last.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The presenters said that they had actually ported “tens
of millions” of lines of code, most functioning within one day. That does not
mean they performed well in one day – see <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">MIC
and the Knights</a> for important issues there – but he did say that they had
decades of experience making vector codes work well, going all the way back to
the Cray 1.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What Harrison says in the video about the possibility of
GPU use is actually quite a bit more emphatic than the statement on the slide:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-bottom: .0001pt; margin-bottom: 0in; margin-left: .6in; margin-right: .6in; margin-top: 0in;">
“Most of this software, I can
confidently say since I'm working on them ... will not run on GPGPUs as we
understand them right now, in part because of the sheer volume of software,
millions of lines of code, and in large part because the algorithms,
structures, and so on associated with the applications are just simply don't
have the massive parallelism required for fine grain [execution]."</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
All this is, of course, right up Intel’s alley, since
their target for MIC is source compatibility: Change a command-line flag, recompile,
done.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I can’t be alone in seeing a disconnect between the Titan
hype and these statements. They make it sound like they’re busy building a
system they can’t use, and I have too much respect for the folks at ORNL to
think that could be true.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, how do we resolve this conundrum? I can think of
several ways, but they’re all speculation on my part. In no particular order:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>The 20 PFLOP
number is public relations hype.</b> The contract with Cray is apparently quite
flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they
like, presumably including zero. That’s highly unlikely, but it does allow a “try
some and see if you like it” approach which might result in rather few XK6 nodes
installed.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>Harrison is
being overly conservative.</b> When people really get down to it, perhaps
porting to GPGPUs won’t be all that painful -- particularly compared with the
vectorization required to really make MIC hum.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>Those MLOCs aren’t
important for Jaguar/Titan.</b> Unless you have a clearance a lot higher than
the one I used to have, you have no clue what they are really running on
Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or
what they run there may slip smoothly onto GPGPUs, or they may be so important a
GPGPU porting effort is deemed worthwhile.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>MIC doesn’t
arrive on time.</b> MIC is still vaporware, after all, and the Jaguar/Titan
upgrade is starting now. (It’s a bit delayed because AMD’s <a href="http://www.theregister.co.uk/2012/01/05/cray_q4_whacked_by_amd/">having
trouble delivering</a> those Interlagos Opterons, but the target start date is
already past.) The earliest firm deployment date I know of for MIC is at the Texas
Advanced Computing Center (TACC) at The University of Texas at Austin. Its new
Stampede system <a href="http://www.tacc.utexas.edu/news/press-releases/2011/stampede">uses MIC</a>
and deploys in 2013.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>Upgrading is a
lot simpler and cheaper</b> – in direct cost and in operational changes – than installing
something that could use MIC. After all, Cray likes AMD, and uses AMD’s
inter-CPU interconnect to attach their Gemini inter-node network. This may not
hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia
chips are attached by PCI-e links. PCI-e is what Knight’s Ferry and Knight’s
Corner (the product version) use, so one could conceivably plug them in.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>MIC is too
expensive.</b> </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
That last one requires a bit more explanation. Nvidia Teslas
are, in effect, subsidized by the volumes of their plain graphics GPUs. These
use the same architecture and can to a significant degree re-use chip designs.
As a result, the development cost to get Tesla products out the door is spread
across a vastly larger volume than the HPC market provides, allowing much lower
pricing than would otherwise be the case. Intel doesn’t have that volume
booster, and the price might turn out to reflect that. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
That Nvidia advantage won’t last forever. Every time AMD
sells a Fusion system with GPU built in, or Intel sells one of their chips with
graphics integrated onto the silicon, another nail goes into the coffin of low-end
GPU volume. (See my post <a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">Nvidia-based
Cheap Supercomputing Coming to an End</a>; the post turned out to be too
optimistic about Intel & AMD graphics performance, but the principle still
holds.) However, this volume advantage is still in force now, and may result in
a significantly higher cost for MIC-based units. We really have no idea how
Intel’s going to price MIC, though, so this is speculation until the MIC vapor
condenses into reality.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Some of the resolutions to this Tesla/MIC conflict may be
totally bogus, and reality may reflect a combination of reasons, but who knows?
As I said above, I’m speculating, a bit caught…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>I’m just a little bit caught
in the middle<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>MIC is a dream, and Tesla’s a
riddle<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>I don’t know what to say, can’t
believe it all, I tried<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>I’ve got to let it go<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>And just enjoy the show.<a href="file:///C:/Users/gpfister/Desktop/Knight%20at%20SC11/Perils%20ORNL%20MIC%20Tesla.docx#_ftn1" name="_ftnref1" title=""><span class="MsoFootnoteReference"><span class="MsoFootnoteReference"><b><span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">[1]</span></b></span></span></a><o:p></o:p></i></div>
<div class="MsoNoSpacing">
<br /></div>
<div>
<br />
<hr align="left" size="1" width="33%" />
<div id="ftn1">
<div class="MsoFootnoteText">
<a href="file:///C:/Users/gpfister/Desktop/Knight%20at%20SC11/Perils%20ORNL%20MIC%20Tesla.docx#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference"><span class="MsoFootnoteReference"><span style="font-family: Calibri, sans-serif; font-size: 10pt; line-height: 115%;">[1]</span></span></span></a>
With apologies to Lenka, the artist who actually <a href="http://starcasm.net/archives/121713">wrote the song</a> the girl sings in
Moneyball. Great movie, by the way.</div>
</div>
</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-4027866662241832522011-10-28T18:00:00.001-06:002011-10-28T21:05:57.574-06:00MIC and the KnightsIntel’s Many-Integrated-Core architecture (MIC) was on
wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s
Ferry (KF) software development kit. Well, I thought it was wide display, but I’m
an IDF Newbie. There was mention in two keynotes, a demo in the first booth on
the right in the exhibit hall, several sessions, etc. Some old hands at IDF
probably wouldn’t consider the display “wide” in IDF terms unless it’s in your
face on the banners, the escalators, the backpacks, and the bagels.<br />
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Also, there was much attempted discussion of the 2012
product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion
was much attempted by me, anyway, with decidedly limited success. There were
some hints, and some things can be deduced, but the real KC hasn’t stood up
yet. That reticence is probably a turn for the better, since KF is the direct
descendant of Intel’s Larrabee graphics engine, which was quite prematurely
trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only
to eventually be dropped – to become KF. A bit more circumspection is now
certainly called for.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This circumspection does, however, make it difficult to
separate what I learned into neat KF or KC buckets; KC is just too well hidden
so far. Here are my best guesses, answering questions I received from Twitter
and elsewhere as well as I can.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If you’re unfamiliar with MIC or KF or KC, you can call
up a plethora of resources on the web that will tell you about it; I won’t be
repeating that information here. Here’s a relatively recent one: <a href="http://www.brightsideofnews.com/news/2011/6/20/intel-larrabee-take-two-knights-corner-in-20122c-exascale-in-2018.aspx">Intel
Larrabee Take Two</a>. In short summary, MIC is the widest X86 shared-memory multicore
anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one
chip. KC has “50 or more.” In addition, and crucially for much of the
discussion below, each core has an enhanced and expanded vector / SIMD unit. You
can think of that as an extension of SSE or AVX, but 512 bits wide and with
many more operations available.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
An aside: Intel’s department of code names is fond of
using place names – towns, rivers, etc. – for the externally-visible names of
development projects. “Knight’s Ferry” follows that tradition; it’s a town up
in the Sierra Nevada Mountains in central California. The only “Knight’s
Corner” I could find, however, is a “populated area,” not even a real town,
probably a hamlet or development, in central Massachusetts. This is at best an
unlikely name source. I find this odd; I wish I’d remembered to ask about it.</div>
<div class="MsoNoSpacing">
<br /></div>
<h2>
Is It Real?</h2>
<div class="MsoNoSpacing">
The MIC architecture is apparently as real as it can be.
There are multiple generations of the MIC chip in roadmaps, and Intel has
committed to supply KC (product-level) parts to the University of Texas <a href="http://communities.intel.com/community/openportit/server/blog/2011/09/22/intel-mic-scores-1st-home-run-with-10-petaflop-stampede-supercomputer">TACC
by January 2013</a>, so at least the second generation is as guaranteed to be
real as a contract makes it. I was repeatedly told by Intel execs I interviewed
that it is as real as it gets, that the MIC architecture is a long-term
commitment by Intel, and it is not transitional – not a step to other,
different things. This is supposed to be <b><i>the</i></b> Intel highly-parallel technical
computing accelerator architecture, period, a point emphasized to me by several
people. (They still see a role for Xeon, of course, so they don't think of MIC as the only
technical computing architecture.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
More importantly, Joe Curley (Intel HPC marketing) gave
me a reason why MIC is real, and intended to be architecturally stable: HPC and
general technical computing are about a third of Intel’s server business. Further,
that business tends to be a very profitable third since those customers tend to
buy high-end parts. MIC is intended to slot directly into that business,
obviously taking the money that is now increasingly spent on other accelerators
(chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as
discussed below, Intel’s intention for MIC is to greatly widen the pool of
customers for accelerators.</div>
<div class="MsoNoSpacing">
<br /></div>
<h2>
The Big Feature: Source Compatibility</h2>
<div class="MsoNoSpacing">
There is absolutely no question that Intel regards source
compatibility as a primary, key feature of MIC: Take your existing programs,
recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag),
and they run on KF. I have zero doubt that this will also be true of KC and is
planned for every future release in their road map. I suspect it’s why there is
a MIC – why they did it, rather than just burying Larrabee six feet deep. No
binary compatibility, though; you need to recompile.</div>
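<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The whole pitch in one tiny (hedged) example: the same source builds for the host or for MIC with one extra flag. The __MIC__ macro is what I believe the Intel compiler defines when targeting MIC; that, and the exact build lines, are assumptions on my part.</div>
<pre>
/* hello_mic.c -- the "-mmic" recompile in miniature. Build lines and the
 * __MIC__ macro are my assumptions about the toolchain, not verified.
 *   icc -O2 hello_mic.c -o hello             (runs on the host Xeon)
 *   icc -O2 -mmic hello_mic.c -o hello.mic   (copy to the card's Linux and run)
 */
#include <stdio.h>

int main(void)
{
#ifdef __MIC__
    printf("Built for MIC: same source, one extra flag.\n");
#else
    printf("Built for a plain Xeon host.\n");
#endif
    return 0;
}
</pre>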
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
You do need to be on Linux; I heard no word about
Microsoft Windows. However, Microsoft Windows 8 has a <a href="http://blogs.msdn.com/b/b8/archive/2011/10/27/using-task-manager-with-64-logical-processors.aspx">new
task manager display</a> changed to be a better visualization of many more – up
to 640 – cores. So who knows; support is up to Microsoft.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Clearly, to get anywhere, you also need to be
parallelized in some form; KF has support for MPI (messaging), OpenMP (shared
memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading
Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s
a real Linux, by the way, that runs on a few of the MIC processors; I was told
“you can SSH to it.” The rest of the cores run some form of microkernel. I see
no reason they would want any of that to become more restrictive on KC.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If you can pull off source compatibility, you have something
that is wonderfully easy to sell to a whole lot of customers. For example, Sriram
Swaminarayan of LANL <a href="http://insidehpc.com/2011/09/16/video-future-programming-models/">has
noted</a> (really interesting video there) that over 80% of HPC codes have,
like him, a very large body of legacy codes they need to carry into the future.
“Just recompile” promises to bring back the good old days of clock speed
increases when you just compiled for a new architecture and went faster. At
least it does if you’ve already gone parallel on X86, which is far from
uncommon. No messing with newfangled, brain-bending languages (like CUDA or
OpenCL) unless you really want to. This collection of customers is large,
well-funded, and not very well-served by existing accelerator architectures.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Right. Now, for all those readers screaming at me “OK, it
<b><i>runs</i></b>,
but does it <b><i>perform?</i></b>” – </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Well, not necessarily.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The problem is that to get MIC – certainly KF, and it
might be more so for KC – to really perform, on many applications you must get its
512-bit-wide SIMD / vector unit cranking away. <a href="http://perilsofparallel.blogspot.com/2011/09/conversation-with-intels-james-reinders.html">Jim
Reinders regaled</a> me with a tale of a four-day port to MIC, where, surprised
it took that long (he said), he found that it took one day to make it run (just
recompile), and then three days to enable wider SIMD / vector execution. I
would not be at all surprised to find that this is pleasantly optimistic. After
all, Intel cherry-picked the recipients of KF, like CERN, which has one of the
world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications
in the known universe. (See my post <a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">Random
Things of Interest at IDF 2011</a>.)</div>
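<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For the curious, the kind of work that eats those extra three days is usually data-layout surgery, not exotic programming. Here is a toy illustration of my own (not Reinders’ code, so hedge accordingly): with 64-bit operands, the 512-bit units want 8 contiguous elements at a time, and an array-of-structures layout simply does not give them that.</div>
<pre>
/* Toy illustration (mine, not the ported code Reinders described) of the
 * data-layout surgery wide SIMD often demands.  With 64-bit doubles and
 * 512-bit vectors, the hardware wants 8 contiguous elements per operation. */

#define N 4096

/* Array-of-structures: the x fields are strided through memory, so the
 * compiler has to gather them; vectorization is poor or abandoned. */
struct body_aos { double x, y, z, m; };

void scale_aos(struct body_aos *b, double a) {
    for (int i = 0; i &lt; N; i++)
        b[i].x = a * b[i].x;
}

/* Structure-of-arrays: x[] is contiguous, so the same loop maps cleanly
 * onto 8-wide vector operations. */
struct bodies_soa { double x[N], y[N], z[N], m[N]; };

void scale_soa(struct bodies_soa *b, double a) {
    for (int i = 0; i &lt; N; i++)
        b->x[i] = a * b->x[i];
}
</pre>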
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Where, on this SIMD/vector issue, are the 80% of folks
with monster legacy codes? Well, Sriram (see above) commented that when LANL
tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes
with the horsepower coming from attached IBM Cell blades – they had a problem
because to perform well, the Cell SPUs needed to crank up their <b><i>two</i></b>-way
SIMD / vector units. Furthermore, they still have difficulty using earlier
Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s
8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the other hand, getting good performance on other accelerators,
like Nvidia’s, requires much wider SIMD; they need 100s of units cranking,
minimally. Full-bore SIMD may in some cases be simpler to exploit than
SIMD/vector instructions. But even going through gigabytes of grotty old
FORTRAN code just to insert notations saying “do this loop in parallel,” without
breaking the code, can be arduous. The programming language, by the way, is not
the issue. Sriram reminded me of the old saying that great FORTRAN coders, who
wrote the bulk of those old codes, can write FORTRAN in any language.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But wait! How can these guys be choking on 2-way
parallelism when they have obviously exploited thousands of cluster nodes in
parallel? The answer is that we have here two different forms of parallelism;
the node-level one is based on scaling the amount of data, while the SIMD-level
one isn’t. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In physical simulations, which many of these codes
perform, what happens in <i>this</i> simulated
galaxy, or <i>this</i> airplane wing, bomb,
or atmosphere column over <i>here</i> has a
relatively limited effect on what happens in <i>that</i> galaxy, wing, bomb or column <i>way over there</i>. The effects that do travel can be added as
perturbations, smoothed out with a few more global iterations. That’s the basis
of the node-level parallelism, with communication between nodes. It can also
readily be the basis of processor/core-level parallelism across the cores of a
single multiprocessor. (One basis of those kinds of parallelism, anyway; other
techniques are possible.) </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Inside any given galaxy, wing, bomb, or atmosphere column,
however, quantities tend to be much more tightly coupled to each other. (Consider,
for example, R<sup>2</sup> force laws; irrelevant when sufficiently far, dominant
when close.) Changing the way those tightly-coupled calculations are done can
often strongly affect the precision of the results, the mathematical properties
of the solution, or even whether you ever converge to any solution. That part
may not be simple at all to parallelize, even two-way, and exploiting SIMD /
vector forces you to work at that level. (For example, you can get into trouble
when going parallel and/or SIMD naively changes from Gauss-Seidel iteration to
Gauss-Jacobi iteration. I went into this in more detail way back in my book <i>In Search of Clusters</i> (Prentice-Hall), Chapter
9, “Basic Programming Models and Issues.”) To be sure, not all applications
have this problem; those that don’t often can easily spin up into thousands of
operations in parallel at all levels. (Also, multithreaded “real” SIMD, as
opposed to vector SIMD, <i>can in some cases</i>
avoid some of those problems. Note italicized words.)</div>
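<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If the Gauss-Seidel / Gauss-Jacobi remark seems abstract, here is a minimal 1-D sketch of what I mean. It’s a toy of my own, not drawn from any of the codes discussed: Gauss-Seidel reuses values it just updated in the same sweep, a serial dependence; parallelize the loop naively and you have silently switched to Jacobi, which is a different iteration with different convergence behavior.</div>
<pre>
/* 1-D relaxation toy: N interior points, fixed boundary values in u[0] and
 * u[N+1].  Gauss-Seidel uses neighbors already updated in this sweep (a
 * serial dependence).  Jacobi uses only last sweep's values, so every point
 * can be updated in parallel -- but it is a *different* iteration. */

#define N 1024
double u[N + 2], u_old[N + 2];

void gauss_seidel_sweep(void) {
    for (int i = 1; i &lt;= N; i++)       /* order matters: u[i-1] is already new */
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

void jacobi_sweep(void) {
    for (int i = 0; i &lt; N + 2; i++)
        u_old[i] = u[i];
    #pragma omp parallel for           /* safe: reads only last sweep's values */
    for (int i = 1; i &lt;= N; i++)
        u[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
}
</pre>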
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The difficulty of exploiting parallelism in tightly-coupled
local computations implies that those 80% are in deep horse puckey no matter
what. You have to carefully consider everything (even, in some cases,
parenthesization of expressions, forcing order of operations) when changing
that code. Needing to do this to exploit MIC’s SIMD suggests an opening for
rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually
necessary for Intel, too, and if you do it our way you get” tons more
performance / lower power / whatever.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Can compilers help here? Sure, they can always eliminate a
pile of gruntwork. Automatically vectorizing compilers have been working quite
well since the 80s, and progress continues to be made in disentangling the aliasing
problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or
semi-commercial) products from people like <a href="http://www.irisa.fr/caps/">CAPS</a>
and <a href="http://www.pgroup.com/">The Portland Group</a> get better results
if you tell them what’s what, with annotations. Those, of course, must be very
carefully applied across mountains of old codes. (They even emit CUDA and
OpenCL these days.)</div>
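<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Here is a small example of the aliasing problem and the kind of annotation that fixes it. The <i>restrict</i> qualifier is standard C99; the pragma spelling varies from vendor to vendor, so take that line as illustrative rather than as any particular compiler’s documented behavior.</div>
<pre>
/* Without more information the compiler must assume a, b, and c might
 * overlap -- the FORTRAN COMMON problem in C clothing -- and may refuse to
 * vectorize.  C99 'restrict' removes the doubt; vendor pragmas are another
 * route (the spelling below is one dialect, shown only as illustration). */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, int n) {
    #pragma ivdep                      /* "no loop-carried dependences here" */
    for (int i = 0; i &lt; n; i++)
        a[i] = b[i] + s * c[i];
}
</pre>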
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
By the way, at least some of the parallelism often exploited
by SIMD accelerators (as opposed to SIMD / vector) derives from what I called
node-level parallelism above.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Returning to the main discussion, Intel’s MIC has the
great advantage that you immediately get a simply ported, working program; and,
in the cases that don’t require SIMD operations to hum, that may be all you
need. Intel is pushing this notion hard. One IDF session presentation was
titled “Program the SAME Here and Over There” (caps were in the title). This is
a very big win, and can be sold easily because customers want to believe that
they need do little. Furthermore, you will probably always need less SIMD /
vector width with MIC than with GPGPU-style accelerators. Only experience over
time will tell whether that really matters in a practical sense, but I suspect
it does.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
Several Other Things</h3>
<div class="MsoNoSpacing">
Here are other MIC facts/factlets/opinions, each needing
far less discussion.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
How do you get from one MIC to another MIC? MIC, both KF
and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does
not have a PCIe root complex, so cannot source PCIe. It must be attached to a
standard compute node. So all anybody was talking about was going down PCIe to
node memory, then back up PCIe to a different MIC, all at least partially under
host control. Maybe one could use peer-to-peer PCIe device transfers, although
I didn’t hear that mentioned. I heard nothing about separate busses directly
connecting MICs, like the ones that can connect dual GPUs. This PCIe use is
known to be a bottleneck, um, I mean, “known to require using MIC on
appropriate applications.” Will MIC be that way for ever and ever? Well, “no
announcement of future plans”, but “typically what Intel has done with
accelerators is eventually integrate them onto a package or chip.” They are
“working with others” to better understand “the optimal arrangement” for
connecting multiple MICs.</div>
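<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For the record, here is roughly what that down-and-back-up pattern looks like in code, in the offload-pragma style Intel’s compilers provide. I’m writing the pragmas from memory, so consider the exact spelling an assumption; the point is only that the host and its PCIe links sit in the middle of every MIC-to-MIC transfer.</div>
<pre>
/* Hedged sketch of MIC-to-MIC data movement via host memory.  The offload
 * pragma spelling is from memory -- treat it as an assumption, not a
 * verified recipe.  The structure is the point: everything goes down PCIe
 * into host memory, then back up PCIe to the other card. */
#define N 1048576
static float buf[N];

void mic_to_mic_via_host(void) {
    /* Compute on card 0; the result comes down PCIe into host memory. */
    #pragma offload target(mic:0) out(buf : length(N))
    {
        for (int i = 0; i &lt; N; i++)
            buf[i] = 0.5f * i;
    }

    /* Ship the same buffer back up PCIe to card 1 and use it there. */
    #pragma offload target(mic:1) in(buf : length(N))
    {
        float sum = 0.0f;
        for (int i = 0; i &lt; N; i++)
            sum += buf[i];
        (void)sum;                     /* keep the compiler quiet */
    }
}
</pre>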
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What kind of memory semantics does MIC have? All I heard
was flat cache coherence across all cores, with ordering and synchronizing
semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple
nodes. Not 32-way Linux, though, just 4-way (16, including threads). (Now that
I think of it, did that count threads? I don’t know.) I asked whether the other
cores ran a micro-kernel and got a nod of assent. It is not the same Linux that
they run on Xeons. In some ways that’s obvious, since those microkernels on
other nodes have to be managed; whether other things changed I don’t know. Each
core has a private cache, and all memory is globally accessible. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Synchronization will likely change in KC. That’s how I
interpret <a href="http://perilsofparallel.blogspot.com/2011/09/conversation-with-intels-james-reinders.html">Jim
Reinders’ comment</a> that current synchronization is fine for 32-way, but over
40 will require some innovation. KC has been said to be 50 cores or more, so
there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100%
necessary for source code to <b>run</b> (as
opposed to <b>perform</b>), I think that
might be a candidate for the chopping block at some point.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Is there adequate memory bandwidth for apps that strongly
stream data? The answer was that they were definitely going to be competitive,
which I interpret as saying they aren’t going to break any records, but will be
good enough for less stressful cases. Some quite knowledgeable people I know
(non-Intel) have expressed the opinion that memory chips will be used in stacks
next to (not on top of) the MIC chip in the product, KC. Certainly that would
help a lot. (This kind of stacking also appears in a leaked picture of a “<a href="http://semiaccurate.com/2011/10/27/amd-far-future-prototype-gpu-pictured/">far
future prototype</a>” from Nvidia, as well as <a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">an
Intel Labs demo at IDF</a>.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Power control: Each core is individually controllable,
and you can run all cores flat out, in their highest power state, without
melting anything. That’s definitely true for KF; I couldn’t find out whether
it’s true for KC. Better power controls than used in KF are now present in
Sandy Bridge, so I would imagine that at least that better level of support
will be there in KC.</div>
<div class="MsoNoSpacing">
<br /></div>
<h2>
Concluding Thoughts</h2>
<div class="MsoNoSpacing">
Clearly, I feel the biggest point here is Intel’s
planned commitment over time to a stable architecture that is source code
compatible with Xeon. Stability and source code compatibility are clear selling
points to the large fraction of the HPC and technical computing market that
needs to move forward a large body of legacy applications; this fraction is not
now well-served by existing accelerators. Also important is the availability of
familiar tools, and more of them, compared with popular accelerators available
now. There’s also a potential win in being able to evolve existing programmer skills,
rather than replacing them. Things do change with the much wider core- and
SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that
required by current accelerator products, and it starts in a familiar place.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Will MIC win in the marketplace? Big honking SIMD units,
like Nvidia ships, will always produce more peak performance, which makes it
easy to grab more press. But Intel’s architectural disadvantage in peak juice
is countered by process advantage: They’re always two generations ahead of the
fabs others use; KC is a 22nm part, with those famous “3D” transistors. It
looks to me like there’s room for both approaches.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Finally, don’t forget that Nvidia in particular is here
now, steadily increasing its already massive momentum, while a product version
of MIC remains pie in the sky. What happens when the rubber meets the road with
real MIC products is unknown – and the track record of Larrabee should give
everybody pause until reality sets well into place, including SIMD issues,
memory coherence and power (neither discussed here, but not trivial), etc. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I think a lot of people would, or should, want MIC to
work. Nvidia is hard enough to deal with in reality that <b><i>two</i></b> best paper awards
were given at the recently concluded <a href="http://www.ipdps.org/">IPDPS</a>
2011 conference – the largest and most prestigious academic parallel computing conference
– for papers that may as well have been titled “How I actually managed to do
something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown <a href="http://techtalks.tv/events/54/">here</a>.) Granted, things like
a shortest-path graph algorithm (PHAST) are not exactly what one typically expects
to run well on a GPGPU. Nevertheless, this is not a good sign. People should
not have to do work at the level of intellectual academic accolades to get something
done – anything! – on a truly useful computer architecture. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Hope aside, a lot of very difficult hardware and software
still has to come together to make MIC work. And…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Larrabee was supposed to be real, too.<br />
<br />
**************************************************************<br />
<br />
Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com12tag:blogger.com,1999:blog-3155908228127841862.post-10393861514991502732011-10-05T11:08:00.000-06:002011-10-21T17:26:37.022-06:00Will Knight’s Corner Be Different? Talking to Intel’s Joe Curley at IDF 2011<br />
<div class="MsoNoSpacing">
At the recent Intel Developer Forum (IDF), I was given
the opportunity to interview Joe Curley, Director, Technical Computing Marketing
of Intel’s Datacenter & Connected Systems Group in Hillsboro.</div>
<div class="MsoNoSpacing">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA65XrzEQ0535UID9r4MMjMBos8hGKAd_7QcJWq9ER5IhqYo6NM8zv5if5LIyJJVt1qFIFlQPCJ2mQVVE_P_KKcy1MpoKmh8la76gG2FnmRMe3P_KdwsZ9ONdQcUm_VM2szAp922XFVo6d/s1600/joe-curley-230x142.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA65XrzEQ0535UID9r4MMjMBos8hGKAd_7QcJWq9ER5IhqYo6NM8zv5if5LIyJJVt1qFIFlQPCJ2mQVVE_P_KKcy1MpoKmh8la76gG2FnmRMe3P_KdwsZ9ONdQcUm_VM2szAp922XFVo6d/s1600/joe-curley-230x142.jpg" /></a></div>
<br /></div>
<div class="MsoNoSpacing">
Intel-provided information about Joe:</div>
<div style="border-bottom: solid #4F81BD 1.0pt; border: none; margin-left: .65in; margin-right: .65in; mso-border-bottom-alt: solid #4F81BD .5pt; mso-border-bottom-themecolor: accent1; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;">
<div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">
Joe Curley, serves Intel® Corporation as director of
marketing for technical computing in the Data Center Group. The technical
computing marketing team manages marketing for high-performance computing (HPC)
and workstation product lines as well as future Intel® Many Integrated Core
(Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities
that lead up to the announcement of the Intel® MIC Architecture in May of 2010.
Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng
Labs in a series of marketing and engineering leadership roles.</div>
</div>
<div class="MsoNoSpacing">
I recorded our conversation; what follows is a transcript.
Also, I used Twitter to crowd-source questions, and some of my comments refer
to picking questions out of the list that generated. (Thank you! to all who
responded.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is the last in a series of three such transcripts. Hallelujah!
Doing this has been a pain. I’ll have at least one additional post about IDF
2011, summarizing the things I learned about MIC and the Intel “Knight’s”
accelerator boards using them, since some important things learned were outside
the interviews. But some were in the interviews, including here.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Full disclosure: As I originally noted in a <a href="http://perilsofparallel.blogspot.com/2011/09/impressions-of-newbie-at-intel.html">prior
post</a>, Intel paid for me to attend IDF. Thanks, again. It was a great
experience, since I’d never before attended.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Occurrences of [] indicate words I added for
clarification or comment post-interview.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[We began by discovering we had similar deep backgrounds, both
starting in graphics hardware. I designed & built a display processor (a
prehistoric GPU), he built “the most efficient frame buffer controller you
could possibly make”. Guess which one of us is in marketing?]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: My experience in the [HPC] business really started relatively
recently, a little under five years ago, [when] I started working on many-core
processors. I won’t be able to go into history, but I can at least tell you
what we’re doing and why.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Why don’t we start there? At a high level, what are
you doing, and why? High level for what you are doing, and as much detail on “why”
as you can provide.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: We have to narrow the question. So, at Intel, what we’re
after first of all is in what we call our Technical Computing Marketing Group
inside Data Center Group. That has really three major objectives. The first one
is to specify the needs for high performance computing, how we can help our
customers and developers build the best high performance computing systems.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Let me stop you for a second right there. My
impression for high performance computing is that they are people whose needs
are that they want more. Just more.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Oh, yes, but more at what cost? What cost of power,
what cost of programmability, what cost of size. How are we going to build the
IO system to handle it affordably or use the fabric of the day.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Yes, they want more, but they want it at two
bytes/FLOPS of memory bandwidth and communication bandwidth.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: There’s an old thing called the Dilbert Spec, which is
“I want it all, and by the way, can it be free?” But that’s not really what
people tell us they want. People in HPC have actually been remarkably pragmatic
about what it takes to develop innovation. So they really want us to do some
things, and do them really well.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
By the way, to finish what we do, we also have the
workstation segment, and the MIC Many Integrated Core product line. The
marketing for that is also in our group.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
You asked “what are you doing and why.” It would probably
take forever to go across all domains, but we could go into any one of them a
little bit better.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Can you give me a general “why” for HPC, and a
specific “why” for MIC?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, HPC’s a really good business. I get stunned,
somebody must be asking really weird questions, asking “why are you doing HPC?”</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: What I’ve heard is that HPC is traditionally 12% of
the market.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Supercomputing is a relatively small percentage of the
market. <b>HPC and technical computing,
combined, is, not exactly, but roughly, a third of our data center business.</b>
<b><i>[emphasis added by me]</i></b> Our data center business is a pretty robust business.
And high performance computing is a business that requires very high end, high
performance processors. It’s actually a very desirable business to be in, if
you can do it, and if your systems work. It’s a business we spend a lot of time
working on because it’s a good business.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now, if you look at MIC, back in 2005 we made a tacit conclusion
that the performance of a system will come out of parallelism. Parallelism
could be expressed at Intel in a lot of different ways. You can look at it as
threads, we have this concept called hyperthreading. You can look at it as
cores. And we have the SSE instructions sitting around which are SIMD, that’s a
form of parallelism; people argue about the definition, but yes, it is. [I agree.]
So you take a look at the basic architectural constructs, ease of programming,
you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or
vectors, these common attributes have been adopted and well-used by a lot of
programmers. There are programs across the continuum of coarse- to fine-grained
parallel, embarrassingly parallel, pick your taxonomy. But there are
applications that developers would be willing to trade the performance of any
particular task or thread for the sum of what you can do inside the power
envelope at a given period of time. Lots of people have different ways of
defining that, you hear throughput, whatever, but this is the class of
applications, and over time they’re growing.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Growing relatively, or, say, compared to commercial
processing, or…? Is the segment getting larger?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: The number of people who have tasks they want to run
on that kind of hardware is clearly growing. One of the reasons we’re doing
MIC, maybe I should just cut it to the easiest answer, is developers and
customers asked us to.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Really?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: And they came to us with a really simple question. We
were struggling in the marketing group with how to position MIC, and one of our
developers got worked up, like “Look, you give me the parallel performance of
an accelerator, but you give me the ease of CPU programming!” Now, ease is a
funny word; you can get into religious arguments about ease. But I think what
he means is “I don’t have to re-think my algorithm, I don’t have to reorder my
data set, there are some things that I don’t have to do.” So they wanted to
have the idea of: give me this architecture and get it to scale to be wildly
parallel. And that is exactly what we’ve done with the MIC architecture. If you
think about what the Knight’s Ferry STP [? <span class="Apple-style-span" style="color: purple;">Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.</span>] is, a 32 core, coherent, on a chip,
teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the
programming model is, surprisingly, kind of like a bunch of processor cores on
a network, which a lot of people understand and can get a lot of utility out of
in a very well-understood way. So, in a sense, we’re giving people what they
want, and that, generally, is good business. And if you don’t give them what
they want, they’ll have to go find someone else. So we’re simply doing what our
marketplace asked us for.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Well, let me play a little bit of devil’s advocate
here, because MIC is very clearly derivative of Larrabee, and…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Knight’s Ferry is.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: … Knight’s Ferry is. Not MIC?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: No. I think you have to take a look at what Larrabee
was. Larrabee, by the way, was a really cool project, but what Larrabee was was
a tile rendering graphics device, which meant its design point, was first of
all the programming model was derived from what you do for graphics. It’s going
to be API-based, the answer it’s going to generate is going to be a pixel, the
pixel is going to have a defined level of sub-pixel accuracy. It’s a very
predictable output. The internal optimizations you would make for a graphics implementation
of a general many-core architecture is one very specific implementation. Let’s
talk about the needs of the high performance computing market. I need
bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have
a frame buffer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: It needed bandwidth to local memory [of which it didn’t
have enough; see my post <a href="http://perilsofparallel.blogspot.com/2010/01/problem-with-larrabee.html">The
Problem with Larrabee</a>]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Yes, but less than you think, because the cache was
the critical element in that architecture [again, see <a href="http://perilsofparallel.blogspot.com/2010/01/problem-with-larrabee.html">that
post</a>] if you look through the academic papers on that…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: OK, OK.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: So, they have a common heritage, they’re both derived
out of the thoughts that came out of the Intel Labs terascale research. They’re
both many-core. But Knight’s Ferry came out with a few, they’re only a few,
modifications. But the programming model is completely different. You don’t
program a graphics device like you do a computer, and MIC is a computer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: The higher-level programming model is different.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Correct.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: But it is a big, wide, cache-coherent SMP.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, yes, that’s what Knight’s Ferry is, but we haven’t
talked about what Knight’s Corner is yet, and unfortunately I won’t today, and we
haven’t talked about where the product line will go from there, either. But
there are many things that will remain the same, because there are things you
can take and embellish and work and things that will be really different.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: But can you at least give me a hint? Is there a chance
that Knight’s Corner will be a substantially different hardware model than
Knight’s Ferry?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: I’m going to <b><i>really</i></b> love to talk to you about
Knight’s Corner. [his emphasis]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: But not today.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: I’m going to duck it today.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Oh, man…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: The product is going to be in our 22 nm process, and
22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to
have the buzz generated, we’ll start generating buzz. Right now, the big thing is
that we’re making the investments in the Knight’s Ferry software development
platform, to see how codes scale across the many-core, to get the environment
and tools up, to let developers poke at it and find stuff, good stuff, bad stuff, in between stuff, that
allow us to adjust the product line for ongoing generations. We’ve done that
really well since we announced the architecture about 15 months ago.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I was wondering what else I was going to talk about
after having talked to both John Hengeveld and Jim Reinders. This is great.
Nobody talked about where it really came from, and even hinted that there were
changes to the MIC chip [architecture].</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Oh, no, no, many things will be the same, many things
will be different. If you’re targeting trying to do a pixel-renderer, go do a
pixel-renderer. If you’re trying to do a general-purpose computing device, do a
general-purpose computing device. You’ll see some things and say “well, it’s
all the same” and other things “wow, it’s completely different.” We’ll get
around to talking about the part when we’re a little closer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The most important thing that James and/or John should
have been talking about is that the key thing is the ability to not force the
developer to completely and utterly re-think their problem to use your hardware.
There are two models: In an accelerator model, which is something I spent a lot
of my life working with, accelerators have the advantage of optimization. You
can say “I want to do one thing really well.” So you can then describe a
programming model for the hardware. You can say “build your data this way,
write your program this way” and if you do it will work. The problem is that
not everything fits into the box. Oh, you have sparse data. Oh, you have
recursive code.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: And there’s madness in that direction, because if you
start supporting that you wind yourself around to a general-purpose machine. […usually,
a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel
of Reincarnation” in this blog, haven’t I? Oh, there it is: <a href="http://perilsofparallel.blogspot.com/2010/11/cloud-got-gpus.html">The
Cloud Got GPUs</a>, back in November 2010.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Then it’s not an accelerator any more. The thing that
you get in MIC is the performance of one of those accelerators. We’ve shown
this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision,
without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve
shown that. So you get performance, but you’re getting it out of the general
purpose programming model.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: That’s running LINPACK, or… ?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: That was an even more basic thing; I’m just talking
about SGEMM [<a href="http://en.wikipedia.org/wiki/General_Matrix_Multiply">single-precision
dense matrix multiply</a>]. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I just wanted to ground the number.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: For LU factorization, I think we showed hybrid LU,
really cool, one of the great things about this hybrid… </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: They’re demo-ing that downstairs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: … OK. When the matrix size is small, I keep it on the
host; when the matrix size is large, I move it. But it’s all the same code, the
same code either place. I’m just deciding where I want to run the code
intelligently, based on the size of the matrix. You can get the exact number,
but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is
actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]</div>
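<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[For readers who want the shape of that hybrid scheme, here is a toy dispatch sketch of my own devising, not Intel’s demo code. The crossover size and the “run it on the card” helper are placeholders; the point is just that the same factorization source runs either place, and the host merely decides where.]</div>
<pre>
/* Toy sketch (mine, not Intel's demo code) of size-based dispatch: one
 * factorization routine, same source either place; the host decides where
 * to run it.  Crossover size and the offload helper are placeholders. */
#define SMALL_MATRIX 512                /* hypothetical crossover point */

static void lu_factor(float *A, int n) { (void)A; (void)n; /* body elided */ }

/* Stand-in for "run this on the card" -- really an offload pragma or a
 * runtime call; here it simply runs locally so the sketch compiles. */
static void run_on_mic(void (*f)(float *, int), float *A, int n) { f(A, n); }

void lu_dispatch(float *A, int n) {
    if (n &lt; SMALL_MATRIX)
        lu_factor(A, n);                /* small: not worth the PCIe round trip */
    else
        run_on_mic(lu_factor, A, n);    /* large: ship it across PCIe */
}
</pre>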
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Yaahh, well, there are a lot of people who can deliver
something like that.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: We’ll keep working on it and making it better and
better. So, what are we proving today. All we’ve proven today is that the
architecture is capable of performance. We’ve got a lot of work to do before we
have a product, but the architecture has shown itself to be capable. The
programming model, we have people who will speak for us, like the quotes that
came from <a href="http://www.thinq.co.uk/2011/6/20/intel-pushes-hpc-space-knights-corner/">LRZ</a>
[data center for the universities of Munich and the Bavarian Academy of
Sciences], from Leibniz [same place], a code they couldn’t port to other
accelerators was running in two hours and optimized in two days. Now, actual
mileage may vary, see dealer for…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: So, there are things that just won’t run on a CUDA
model? Example?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, perhaps, again, the thing you try to get to is
whether there is evidence growing that what you say is real. So we’re having
people who are starting to be able to speak to that, and that gives people the
confidence that we’re going to be able to get there. The other thing it ends up
doing, it’s kind of an odd benefit, as people have started building their code,
trying to optimize it for MIC, they’re finding the parallelism, they’re doing
what we wanted them to do all along, they’re taking the same code on their
current cluster and they’re getting benefits right now.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: That’s got a long history. People would have some
grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler
couldn’t make crap out of it. So they cleaned it up, made it obvious what was
going on, and the vectorizer did its thing well. Then they put it back on the
original machine and it ran twice as fast.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: So, one of the nice things that’s happened is that as
people are looking at ways to scale power, performance, they’re finally getting
around to dealing with parallelism. The offer that we’re trying to provide is
portable, high level, standards-based, and you can use it now.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
You said “why.” That’s why. Our customers and developers
say “if you can do that, that’s really valuable.” Now. We’re four men and a
pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but
the thought and the promise and the early data is really good.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: OK. Well, great.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Was that a good use of the time?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: That’s a very good use of the time. Let me poke on one
thing a little bit. Conceptually, it ought to be simpler to write code to that
kind of a shared memory model and get parallelism out of the code that way.
Now, on the other hand, there was a talk – sorry, I forget his name, he was one
of the software guys working on Larrabee [it was Tom Forsyth; see my post <a href="http://perilsofparallel.blogspot.com/2010/01/problem-with-larrabee.html">The
Problem with Larrabee</a> again] said someone on the project had written four
renderers, and three of them were for Larrabee. He was having one hell of a
time trying to get something that performed well. His big issue, at least what
it came down to from what I remember of the talk, was memory bandwidth.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, first of all, we’ve said Larrabee’s not a
product. As I’ve said, one of the things that is critical, you’ve got the
compute-bound, you’ve got the memory-bound, and most people are somewhere in
between, but you have to be able to handle the two edge cases. We understand
that, and we intend to deliver a really good value across the spectrum. Now,
Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation
off the silicon we used, no one cares about that, but on Knight’s Ferry, the memory bus is 256
bits wide. Relatively narrow, and for a graphics processor, very narrow. There
are definitely design decisions in how that chip was made that would limit the
bandwidth. And the memory it was designed with is slower than the memory today,
you have all of the normal things. But if you went downstairs to the show
floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic
ray-tracer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[What follows is a bit confused. He didn’t mean the
Austrian Crown stochastic ray-tracing demo, but rather the real-time
ray-tracing demo. As I said in my immediately previous post (<a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">Random
Things of Interest at IDF 2011</a>), the real-time demo is on a set of Knight’s
Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t
seen the real-time demo, just the stochastic one; the latter is not on Knight’s
Ferry.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I’ve seen that one. The Austrian Crown?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Yes.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I thought that was on a cluster.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: In the little box behind there, he’s able to scale
from one to eight Knight’s Ferries.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: He never told me there was a Knight’s Ferry in there.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Yes, it’s all Knight’s Ferry.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Well, I’m going to go down there and beat on him a
little bit.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: I’m about to point you to a YouTube site, it got
compressed and thrown up on YouTube. You can’t get the impact of the complexity
of the rays, but you can at least get the superficial idea of the responsiveness
of the system from Knight’s Ferry. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[He didn’t point me to YouTube, or I lost it, but <a href="http://www.youtube.com/watch?v=4i3uc_SSQ9E">here’s one</a> I found.
Ignore the fact that the introduction is in Swedish or something <span class="Apple-style-span" style="color: #666666;"><i>[it's Dutch, actually]</i></span>; Daniel – and it’s
Daniel, not David – speaks English, and gives a good demo. Yes, everybody in
the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the <a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">Random Things of Interest</a> post to directly include it.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Well, if you believe that what we’re going to do in our
mainstream processors is roughly double the FLOPS every generation for the next
many generations, that’s our intent. What if we can do that on the MIC line as
well? By the time you get to where ray-tracing would be practical, you could
see multiple of those being integrated into a single device [added in transcription:
Multiple MICs in a single device? Hierarchical MIC?] becomes practical
computationally. That won’t be far from now. So, it’s a nice demo. David’s an
expert in his field, I didn’t hear what he said, but it you want to see the
device downstairs actually running a fairly strenuous graphics workload, take a
look at that.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: OK. I did go down there and I did see that, I just
didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.]
On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian
Crown. It is.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[At that point, Dave Patterson walked in, which
interrupted us. We said hello – I know Dave of old, a bit – thanks were
exchanged with Joe, and I departed.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[I can’t believe this is the end of the last one. I
really don’t like transcribing.]</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com7tag:blogger.com,1999:blog-3155908228127841862.post-39632508959792592292011-10-02T19:11:00.001-06:002011-10-08T12:04:27.223-06:00Random Things of Interest at IDF 2011 (Intel Developer Forum)<br />
<div class="MsoNoSpacing">
I still have one IDF interview to transcribe (Joe Curley),
but I’m sick of doing transcriptions. So here are a few other random things I
observed at the 2011 Intel Developers Forum. It is nothing like comprehensive.
It’s also not yet the promised MIC dump; that will still come.</div>
<h2>
Exhibit Hall</h2>
<div class="MsoNoSpacing">
I found very few products I had a direct interest in, but
then again I didn’t look very hard.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the right, immediately as you enter, was a demo of a
Xeon/MIC combination clocking 600-700 GFLOPS (quite assuredly single precision) doing LU factorization. Questions
to the guys running the demo indicated: (1) They did part on the Xeon, and
there may have been two of those, they weren’t sure (the diagram showed two).
(2) They really learned how to say “We don’t comment on competitors” and “We
don’t comment on unannounced products.”</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh39h5vzni2t9H9IV0bagTyqWLMIvH8GZb1ymQ-65NLHmaB5p1v87IS_0VmQR0D3DqEocShwWkQ5JtIAz74ya3BkS7YnI79i-skqHkLXFM1GYCFTgg47jd0r6caSL0pn20k5gbcHQ7U2ENR/s1600/0913011617.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh39h5vzni2t9H9IV0bagTyqWLMIvH8GZb1ymQ-65NLHmaB5p1v87IS_0VmQR0D3DqEocShwWkQ5JtIAz74ya3BkS7YnI79i-skqHkLXFM1GYCFTgg47jd0r6caSL0pn20k5gbcHQ7U2ENR/s200/0913011617.jpg" width="200" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A 6-legged robot controlled by Atom, controlled by a game
controller. I included this here only because it looked funky and I took a picture
(q. v.). Also, for some reason it was in constant slight motion, like it couldn’t
sit still, ever. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There were three things that were interesting to me in
the Intel Labs section:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ97F1Hj9u0dMqQRTXtTcmW7Yk_GgndjCUEICPeIg4WL1aof3kie-3_X8T6MJlACI5_LsKbJ2c84yLnKQb4EpSV1q8p_l-a2d3TKuJAtdyfCEDoriZ6eXiOd7c8qyEVOogAt2Wy00KvmZb/s1600/0913011654.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ97F1Hj9u0dMqQRTXtTcmW7Yk_GgndjCUEICPeIg4WL1aof3kie-3_X8T6MJlACI5_LsKbJ2c84yLnKQb4EpSV1q8p_l-a2d3TKuJAtdyfCEDoriZ6eXiOd7c8qyEVOogAt2Wy00KvmZb/s320/0913011654.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
One Tbit/sec memory stack: To understand why this is
interesting, you need to know that the semiconductor manufacturing processes
used to make DRAM and logic are quite different. Putting both on the same chip
requires compromises in one or the other. The logic that must exist on DRAM
chips isn’t quite as good as it could be, for example. In this project, they
separated the two onto separate chips in a stack: Logic is on one, the bottom
one, that interfaces with the outside world. On top of this are multiple pure
memory chips, multiple layers of pure DRAM, no logic. They connect by solder
bumps or something (I’m not sure), and there are many (thousands of) “through silicon
vias” that go all the way through the memory chips to allow connecting a whole
stack to the logic at the bottom with very high bandwidth. This whole idea eliminates
the need to compromise on semiconductor processes, so the DRAM can be dense
(and fast), and the logic can be fast (and low power). One result is that they
can suck 1 Tbit/sec of data out of one of these stacks. This just feels right
to me as a direction. Too bad they’re unlikely to use the new IBM/3M <a href="http://www.zurich.ibm.com/news/07/cooling.html">thermally conductive glue</a>
to suck heat out of the stack.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy9-4HNb1uaFa0GURxGUwikAYnBDii8qE8phA7pSzXCp3JpxG-WlmljW13KPOP65zcLGYYCuOjMlccpwBcmfgpUYJnGJNdCInunVXEnazuPWMdCvO2d_MPwxlPvglxIyLFx7m8DvhbYh5L/s1600/0913011635.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy9-4HNb1uaFa0GURxGUwikAYnBDii8qE8phA7pSzXCp3JpxG-WlmljW13KPOP65zcLGYYCuOjMlccpwBcmfgpUYJnGJNdCInunVXEnazuPWMdCvO2d_MPwxlPvglxIyLFx7m8DvhbYh5L/s320/0913011635.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Stochastic Ray-Tracing:
What it says: Ray-tracing, but allows light to be probabilistically
scattered off surfaces, so, for example, shiny matte surfaces have realistically
blurred reflections on them, and produce more realistic color effects on other
surfaces to which they reflect. Shiny matte surfaces like the surface of the
golden dome in the center of the Austrian crown, reflecting the jewels in the
outer band, which was their demo image. I have a picture here, but it comes
nowhere near doing this justice. The large, high dynamic range monitor they
had, though – wow. Just wow. Spectacular. A guy was explaining this to me pointing
to a normal monitor when I happened to glance up at the HDR one. I was like “shut
up already, I just want to look at that.” To run it they used a cluster of four Xeon-based nodes, each apparently about 4U high, and that was not in real time; several seconds were required per
update. But wow.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Real-Time Ray-Tracing: This has been done before; I saw
it a demo on a Cell processor back in about 2006. This, however, was a much
more complex scene than I’d previously viewed. It had the usual shiny classic
car, but that was now in the courtyard of a much larger old palace-like
building, with lots of columns and crenellations and the like. It ran on a MIC,
of course – actually, several of them, all attached to the same Xeon system. Each had a complete copy of the scene
data in its memory, which is unrealistic but does serve to make the problem “pleasantly
parallel” (which is what I’m told is now the PC way to describe what used to be
called “embarrassingly parallel”). However, the demo was still fun. Here's a video of it I found. It apparently was shot at a different event, but still the same technology demonstrated. The intro is in Swedish, or something, but it reverts to English at the demo. And yes, all the Intel Labs guys wore white lab coats. I teased them a bit on that.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<object class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://1.gvt0.com/vi/4i3uc_SSQ9E/0.jpg" height="266" width="320"><param name="movie" value="http://www.youtube.com/v/4i3uc_SSQ9E&fs=1&source=uds" />
<param name="bgcolor" value="#FFFFFF" />
<embed width="320" height="266" src="http://www.youtube.com/v/4i3uc_SSQ9E&fs=1&source=uds" type="application/x-shockwave-flash"></embed></object></div>
<br /></div>
<h2>
Keynotes</h2>
<div class="MsoNoSpacing">
Otellini (CEO): Intel is going hot and heavy into
supporting the venerable Trusted Platform technology, a collection of
technology which might well work, but upon which nobody has yet bitten. This security
emphasis clearly fits with the purchase of MacAfee (everybody got a free
MacAfee package at registration, good for 3 systems). “2011 may be the year the
industry got serious about security!” I remain unconvinced.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Mooley Eden (General Manager, Mobile Platforms): OK. Right
now, I have to say that this is the one time in the course of these IDF posts
that I am going to bow to Intel’s having paid for me to attend IDF, bite my
tongue rather than succumbing to my usual practice of biting the hand that
feeds me, and limit my comments to:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Mooley Eden must be an acquired taste.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To learn more of my personal opinions on this subject, you are going to
have to buy me a craft beer (dark & hoppy) in a very noisy bar. Since I don’t
like noisy bars, and that’s an unusual combination, I consider this unlikely.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Technically… Ultrabooks, ultrabooks, “beyond thin and
light.” More security. They had a lame Ninja-garbed guy on stage, trying to
hack into trusted-platform-protected system, and of course failing. (Please see
<a href="http://xkcd.com/538/">this</a>.) There was also a picture of a castle
with a moat, and a (deliberately) crude animation of knights trying to cross the moat and falling
in. (I mention this only because it’s relevant to something below.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
People never use hibernate, because it takes too long to
wake up. The solution is… to have the system wake up regularly. And run sync
operations. Eh what? Is this supposed to cause your wakeup to take less time
because the wakeup time is actually spent syncing? My own wakeup time is mostly wakeup. All I know is that
suspend/resume used to be really fast, reliable, and smart. Then it got transplanted to Windows from BIOS and has been unsatisfactory - slow and dumb - ever since.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This was my first time seeing Windows 8. It looks like
Mango phone interface. Is making phones & PCs look alike supposed to help in
some way? (Like boost Windows Phone sales?) I’m quite a bit less than intrigued. It
means I’m going to have to buy another laptop before Win 8 becomes the
standard.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Justin Rattner (CTO): Some of his stuff I covered in my
first post on IDF. One I didn’t cover was the massive deal made of CERN and the
LHC (Large Hadron Collider) (“the largest machine human
beings have ever created”) (everybody please now go “ooooohhh”) using MICs. Look,
folks, the major high energy physics apps are embarrassingly parallel: You get
a whole lot, like millions, billions, of particle collisions, gather each one’s
data, and do an astounding amount of floating-point computing on each completely independent
set of collision data. Separately. Hoping to find out that one is a Higgs boson or something. I saw people doing this in the late 1980s at
Fermilab on a homebrew parallel system. They even had a good software framework
for using it: Write your (serial) code for analyzing a collision your way, and
hand it to us; we run it many times in parallel, just handing out each event’s
data to an instance of your code. The only thing that would be interesting
about this would be if for some reason they actually <i>couldn’t</i> run HEP codes
very well indeed. But they can run them well. Which makes it a yawn for me. I’ve
no question that the LHC is ungodly impressive, of course. I just wish it were
in Texas and called something else.</div>
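<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For flavor, the framework pattern I’m describing fits in a dozen lines. This is my own sketch, with made-up names and a made-up Event type; the real Fermilab and CERN frameworks are vastly more elaborate, but the parallel structure is exactly this thin.</div>
<pre>
/* Sketch of the event-parallel framework pattern: the physicist writes a
 * serial analyze_event(); the framework hands each independent event to a
 * copy of it in parallel.  Names and the Event type are mine, purely
 * illustrative of the structure. */
typedef struct { long id; double data[16]; } Event;  /* stand-in for collision data */

/* User-supplied, strictly serial analysis of one event. */
static int analyze_event(const Event *e) {
    double s = 0.0;
    for (int i = 0; i &lt; 16; i++)
        s += e->data[i];
    return s > 42.0;                    /* "was this one interesting?" */
}

/* The framework: embarrassingly -- pardon, pleasantly -- parallel over
 * events; nothing is shared between iterations. */
long analyze_all(const Event *events, long n) {
    long hits = 0;
    #pragma omp parallel for reduction(+:hits)
    for (long i = 0; i &lt; n; i++)
        hits += analyze_event(events + i);
    return hits;
}
</pre>
<div class="MsoNoSpacing">
<br /></div>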
<h2>
Intel Fellows Panel</h2>
<div class="MsoNoSpacing">
Some interesting questions asked and answered, many
questions lame. Like: “Will high-end Xeons pass mainframes?” Silly question.
Depends on what “pass” means. In the sense in which most people may mean, they
already have, and it doesn’t matter. Here are some others:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Besides MIC, what else is needed for Exascale? A: We’re
having to go all the way down to device level. In particular, we’re looking at
subthreshold or near-threshold logic. We tried that before, but failed. Devices
turn out to be most efficient 20 mV above threshold. May have to run at 800 MHz.
[Implication: A <b><i>whole lot</i></b> of parallelism.] Funny how they talked about
near-threshold logic, and Justin Rattner just happened to have a demo of that
at the next day’s keynote.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Are you running out of rare metals? A: It’s a question
of cost. Yes, we always try to move off expensive materials. Rare earths
needed, but not much; we only use them in layers like five atoms thick.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Is Moore’s Law going to end? A: This was answered by Kelin
J. Kuhn, Fellow & Director of the Technology and Manufacturing Group – i.e.,
she <i>really</i> knows silicon. She noted that, by observation, at every given
generation it always looks like Moore’s Law ends in two generations. But it
never has. Every time we see a major impediment to the physics – several examples
given, going back to the 1980s and the end of Dennard scaling – something seems
to come along to avoid the problem. The exception seems to be right now: unlike prior eras, when it looked as though the end was two generations away, there don't seem to be any clouds on this particular horizon at all. (While I personally know of no reason to dispute this, keep in mind that this is from Intel, whose whole existence seems tied to Moore's Law, and it's said by the woman who probably has the biggest responsibility to make it all come about.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
An aside concerning the question-taking woman with the microphone
on my side of the hall: I apparently reminded her of something she hates. She kept going elsewhere, even after standing right beside me for several minutes while I had my
hand raised. What I was going to ask was: This morning in the keynote we saw a
castle with a moat, and several knights dropping into the moat. The last two
days we also heard a lot about a knight which appears to take a ferry across
the moat of PCIe. Why are you strangling a TFLOP of computation with PCIe? Other accelerator vendors
don’t have a choice with their accelerators, but you guys own the whole
architecture. Surely something better could be done. Does this, perhaps, indicate
a lack of integration or commitment to the new architecture across the
organization?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Maybe she was fitted with a wiseass detection system.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Anyway, I guess I won’t find out this year.</div>
Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com0tag:blogger.com,1999:blog-3155908228127841862.post-42637700415587000882011-09-29T16:45:00.000-06:002011-09-29T16:45:07.057-06:00A Conversation with Intel’s James Reinders at IDF 2011<br />
<div class="MsoBodyText">
At the recent Intel Developer Forum (IDF), I was given the
opportunity to interview James Reinders. James is a Director and Software
Evangelist in Intel’s Software and Services Group in Oregon, and the
conversation ranged far and wide, from programming languages, to frameworks, to
<a href="http://en.wikipedia.org/wiki/Transactional_memory">transactional
memory</a>, to the use of <a href="http://www.nvidia.com/object/cuda_home_new.html">CUDA</a>, to Matlab, to vectorizing
for execution on Intel’s <a href="http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html">MIC
(Many Integrated Core)</a> architecture.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmRyMT_o4yvWozYxdxc_SGJwQh9x5Up3dI-Zko6SHaUwm5fHtyQOiOds0cRhwGzBfvon_4NG7sI03u0-9NbqALQ8jd6sberN0mnPIuVmBwmEPlr18_uKuFnFF-TjxKU0vBc2DlMKIJlly1/s1600/james_reinders.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmRyMT_o4yvWozYxdxc_SGJwQh9x5Up3dI-Zko6SHaUwm5fHtyQOiOds0cRhwGzBfvon_4NG7sI03u0-9NbqALQ8jd6sberN0mnPIuVmBwmEPlr18_uKuFnFF-TjxKU0vBc2DlMKIJlly1/s1600/james_reinders.jpg" /></a></div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Intel-provided information about James:</div>
<div style="border-bottom: solid #4F81BD 1.0pt; border: none; margin-left: .65in; margin-right: .65in; mso-border-bottom-alt: solid #4F81BD .5pt; mso-border-bottom-themecolor: accent1; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;">
<div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">
James Reinders is an expert on parallel computing.
James is a senior engineer who joined Intel Corporation in 1989 and has
contributed to projects including the systolic array systems WARP and iWarp, and
the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and
architecture work for multiple Intel processors and parallel systems. James has
been a driver behind the development of Intel as a major provider of software
development products, and serves as their chief software evangelist. His most
recent book is “Intel Threading Building Blocks” from O'Reilly Media which has
been translated to Japanese, Chinese and Korean. James has published numerous
articles and contributed to several books, and one of his current projects is
co-authoring a new book on parallel programming to be released in 2012.</div>
</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
I recorded our conversation; what follows is a transcript.
Also, I used Twitter to crowd-source questions, and some of my comments refer
to picking questions out of the list that generated. (Thank you! To all who
responded.)</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
This is #2 in a series of three such transcripts. I’ll
have at least one additional post about IDF 2011, summarizing the things I
learned about MIC and the Intel “Knight’s” accelerator boards using them, since
some important things I learned came from outside the interviews.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Full disclosure: As I originally noted in a <a href="http://perilsofparallel.blogspot.com/2011/09/impressions-of-newbie-at-intel.html">prior
post</a>, Intel paid for me to attend IDF. Thanks, again. It was a great
experience, since I’d never before attended.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Occurrences of [] indicate words I added for clarification
or comment post-interview.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
[Discussing where I’m coming from, crowd-sourced question list, HPC & MIC
focus here.] So where would you like to start?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so
do your best. It has been for a long, long time, so hopefully I have a very
realistic view of what works and what doesn’t work. I think I surprise some
people with optimism about where we’re going, but I have some good reasons to
see there’s a lot of things we can do in the architecture and the software that
I think will surprise people to make that a lot more approachable than you
would expect. Amdahl’s law is still there, but some of the difficulties that we
have with the systems in terms of programming, the nondeterminism that gets
involved in the programming, which you know really destroys the paradigm of
thinking how to debug, those are solvable problems. That surprises people a
lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago,
computers are so much faster and it benefits the tools. Think about how much
more the tools can do. You know, your compiler still compiles in about the same
time it did 10 years ago, but now it’s doing a lot more, and now that multicore
has become very prevalent in our hardware architecture, there are some hooks
that we are going to get into the hardware that will solve some of the
debugging problems that debugging tools can’t do by themselves because we can
catch the attention of the architects and we understand enough that there’s
some give-and-take in areas that might surprise people, that they will suddenly
have a tool where people say “how’d you solve that problem?” and it’s over
there under the covers. So I’m excited about that.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[OK, so everybody forgive me for not jumping right away on
his fix for nondeterminism. What he meant by that was covered later.]</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> So,
you’re optimistic?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Optimistic that it’s not the end of the world.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> OK.
Let me tell you where I’m coming from on that. A while back, I spent an evening
doing a web survey of parallel programming languages, and made a spreadsheet of
101 parallel programming languages [see my much earlier post, <a href="http://perilsofparallel.blogspot.com/2008/09/101-parallel-languages-part-1.html">101
Parallel Programming Languages</a>].</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> [laughs]
You missed a few.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I’m
sure I did. It was just one night. But not one of those was being used. MPI and
OpenMP, that was it.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> And
Erlang has had some limited popularity, but is dying out. They’re a lot like AI
and some other things. They help solve some problems, and then if the idea is
really an advance, you’ll see something from that materialize in C or C++,
Java, or C#. Those languages teach us something that we then use where we
really want it.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
understand what you’re saying. It’s like MapReduce being a large-scale version
of the old LISP mapcar.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Which was around in the early 70s. A lot of people picked up on it, it’s not a
secret but it’s still, you know, on the edge.</div>
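<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[For anyone who hasn’t made the connection: LISP’s mapcar applies a function to every element of a list; MapReduce is the same map step scaled out over a cluster, followed by a reduction that folds the mapped results together. A toy single-machine sketch of that shape, with made-up data and in C++ rather than LISP, looks like this.]</div>
<pre>
// Toy illustration of the mapcar-to-MapReduce analogy: map a function
// over independent items, then reduce the mapped values to one answer.
// Real MapReduce spreads both steps over many machines; this is the same
// shape on one machine, with made-up numbers.
#include <algorithm>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> words_per_doc{12, 7, 30, 5};   // pretend inputs

    // "mapcar": apply a function to each element independently.
    std::vector<int> mapped(words_per_doc.size());
    std::transform(words_per_doc.begin(), words_per_doc.end(),
                   mapped.begin(), [](int n) { return 2 * n; });

    // "reduce": fold the mapped values into a single result.
    int total = std::accumulate(mapped.begin(), mapped.end(), 0);

    return total == 108 ? 0 : 1;   // 2 * (12 + 7 + 30 + 5) == 108
}
</pre>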
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
heard someone say recently that there was a programming crisis in the early
80s: How were you going to program all those PCs? It was solved not by
programming, but by having three or four frameworks, such as Excel or Word,
that some experts in a dark room wrote, everybody used, and it spread like
crazy. Is there anything now like that which we could hope for?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> You
see people talk about Python, you see Matlab. Python is powerful, but I think
it’s sort of trapped between general-purpose programming and the Matlab. It may
be a big enough area; it certainly has a lot of followers. Matlab is a good
example. We see a lot of people doing a lot in Matlab. And then they run up
against barriers. Excel has the same thing. You see Excel grow up and people
doing incredibly hairy things. We worked with Microsoft a few years ago, and they’ve
added parallelism to Excel, and it’s extremely important to some people. Some
people have spreadsheets out there that do unbelievable things. You change one
cell, and it would take a computer from just a couple of years ago and just
stall it for 30 minutes while it recomputes. [I know of people in the finance
industry who go out for coffee for a few hours if they accidentally hit F5.] Now
you can do that in parallel. I think people do gravitate towards those
frameworks, as you’re saying. So which ones will emerge? I think there’s hope.
I think Matlab is one; I don’t know that I’d put my money on that being the huge
one. But I do think there’s a lot of opportunity for that to hide this compute
power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did
that, they removed something that you would have programmed over and over
again, made it accessible without it looking like programming.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
People did spreadsheets without realizing they were programming, because it was
so obvious.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes, you’re absolutely right. I tend to think of it in terms of libraries,
because I’m a little bit more of an engineer. I do see development of important
libraries that use unbelievable amounts of compute power behind them and then
simply do something that anyone could understand. Obviously image processing is
one [area], but there are other manipulations that I think people will just
routinely be able to throw into an application, but what stands behind them is
an incredibly complex library that uses compute power to manipulate that data.
You see Apple use a lot of this in their user interface, just doing this
[swipes] or that to the screen, I mean the thing behind that uses parallelism
quite well.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But
this [swipes] [meaning the thing you do] is simple.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the
modern equivalent of using spreadsheets or Word. It’s the user interfaces, and
they are demanding a lot behind them. It’s unbelievable the compute power that
can use.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes,
it is. And I really wonder how many times you’re going to want to scan your
pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a
couple of days.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Right, but I think rather than that being an activity, it’s just something your
computer does for you. It disappears. Most of us don’t want to organize things,
we want it just done. And Google’s done that on the web. Instead of keeping a
million bookmarks to find something, you do a search.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
Right. I used to have this incredible tree of bookmarks, and could never find
anything there.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
remember when it was a major feature of Firefox that it provided searching of
your bookmarks.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
[Laughter] </div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> You
mentioned nondeterminism. Are there any things in the hardware that you’re
thinking of? IBM Blue Gene just said they have transactional memory, in
hardware, on a chip. I’m dubious.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long
time, we being the industry, Intel included. At first we hoped “Wow, get rid of
locks, we’ll all just use transactional memory, it’ll just work.” Well, the
shortest way I can say why it doesn’t work is that software people want
transactions to be arbitrarily large, and hardware needs it to be constrained,
so it can actually do what you’re asking it to do, like holding a buffer.
That’s a nonstarter. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
So now what’s happening? Rock [Sun’s Rock processor] was looking at this in Sun,
a hybrid technique, and unfortunately they didn’t bring that to market. Nobody
outside the team knows exactly what happened, but the project as a whole failed,
rather than saying transactional memory was the death. But they had a hard time
figuring out how you engineer that buffering. A lot of smart people are looking
at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a
single socket. It makes sense to me why a constraint like that would be
buildable. The hard part is then how do you wrap that into a programming model.
Blue Gene’s obviously a very high end machine, so those developers have more
familiarity with constraints and dealing with it. Making it general purpose is
a very hard problem, very attractive, but I think that at the end of the day, all
transactional memory will do is be another option, that may be less
error-prone, to use in frameworks or toolkits. I don’t see a big shift in
programming model where people say “Oh, I’m using transactional memory.” It’ll
be a piece of infrastructure that toolkits like Threading Building Blocks or
OpenMP or Cilk+ use. It’ll be important for us in that it gives better
guarantees.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
The things I more had in mind is you’re seeing a whole
class of tools. We’ve got a tool that can do deadlock and race detection
dynamically and find it; a very, very good tool. You see companies like
TotalView looking at what they would call replaying, or unwinding, going
backwards, with debuggers. The problem with debuggers if your program’s
nondeterministic is you run it to a breakpoint and say, whoa, I want to see
what happened back here, what we usually do is just pop out of the debugger and
run it with an earlier breakpoint, or re-run it. If the program is
nondeterministic, you don’t get what you want. So the question is, can the
debugger keep enough information to back up? Well, the thing that backing up
and debugging, deadlock detection, and race detection, all those things have in
common is that they tend to run two or three orders of magnitude slower when
you’re using those techniques. Well, that’s not compelling. But, the cool part
is, with the software, we’re showing how to detect those – just a thousand
times slower than real time. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Now we have the cool engineering problem: Can you make it
faster? Is there something you could do in the software or the hardware and
make that faster? I think there is, and a lot of people do. I get really
excited when you solve a hard problem, can you replay a debug, yeah, it’s too
slow. We use it to solve really hard problems, with customers that are really
important, where you hunker down for a week or two using a tool that’s a
thousand times slower to find the bug, and you’re so happy you found it – I
can’t stand out in a booth and market and have a million developers use it.
That won’t happen unless we get it closer to real time. I think that will
happen. We’re looking at ways to do that. It’s a cooperative thing between
hardware and software, and it’s not just an Intel thing; obviously the Blue
Gene team worries about these things, Sun’s team was worried about them. There’s
actually a lot of discussion between those small teams. There aren’t that many
people who understand what transactional memory is or how to implement it in
hardware, and the people who do talk to each other across companies. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[In retrospect, while transcribing this, I find the sudden
transition back to TM to be mysterious. Possibly James was veering away from
unannounced technology, or possibly there’s some link between TM and 1000x
speedups of playback. If there is such a link, it’s not exactly instantly obvious
to me.]</div>
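<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[For readers who haven’t been bitten by it, the nondeterminism under discussion is the garden-variety kind in the little sketch below: a hypothetical two-thread counter with no locking. The interleaving changes every run, so the final value changes every run, which is exactly why re-running to an earlier breakpoint doesn’t reproduce the bug, and why race detection and record/replay tools are worth their enormous overhead.]</div>
<pre>
// A deliberately racy sketch: two threads increment a shared counter with
// no synchronization. The read-modify-write operations interleave
// differently on every run, so the final value varies from run to run;
// re-running under a debugger will not reproduce what you saw the first
// time. This is the kind of bug race detectors and record/replay
// debuggers exist to catch.
#include <iostream>
#include <thread>

int counter = 0;                     // shared, unsynchronized on purpose

void hammer() {
    for (int i = 0; i < 1000000; ++i)
        ++counter;                   // racy read-modify-write
}

int main() {
    std::thread a(hammer), b(hammer);
    a.join();
    b.join();
    // Expected 2000000; typically prints something smaller, and something
    // different each run. (std::atomic<int> or a mutex would fix it.)
    std::cout << counter << '\n';
    return 0;
}
</pre>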
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> At a
minimum, at conferences.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes, yes, and they’d like to see the software stack on top of them come
together, so they know what hardware to build to give whatever the software
model is what it needs. One of the things we learned about transactional memory
is that the software model is really hard. We have a transactional memory
compiler that does it all in software. It’s really good. We found that when
people used it, they treated transactional memory like locks and created new
problems. They didn’t write a transactional memory program from scratch to use
transactional memory, they took code they wrote for locks and tried to use
transactional memory instead of locks, and that creates problems.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> The
one place I’ve seen where rumors showed someone actually using it was the
special-purpose Java machine Azul. 500 plus processors per rack, multiple
racks, point-to-point connections with a very pretty diagram like a rosette.
They got into a suit war with Sun. And some of the things they showed were
obvious applications of transactional memory.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Hmm.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Let’s
talk about support for things like MIC. One question I had was that things like
CUDA, which let you just annotate your code, well, do more than that. But I
think CUDA was really a big part of the success of Nvidia.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> Oh,
absolutely. Because how else are you going to get anything to go down your
shader pipeline for a computation if you don’t give a model? And by lining up
with one model, no matter the pros or cons, or how easy or hard it was, it gave
a mechanism, actually a low-level mechanism, that turns out to be predictable
because the low-level mechanism isn’t trying to do anything too fancy for you,
it’s basically giving you full control. That’s a problem to get a lot of people
to program that way, but when a programmer does program that way, they get what
the hardware can give them. We don’t need a fancy compiler that gets it right
half the time on top of that, right? Now everybody in the world would like a
fancy compiler that always got it right, and when you can build that, then CUDA
and that sort of thing just poof! Gone. I’m not sure that’s a tractable problem
on a device that’s not more general than that type of pipeline. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
So, the challenge I see with CUDA, and OpenCL, and even
C++ AMP is that they’re going down the road of saying look, there are going to
be several classes of devices, and we need you the programmer to write a
different version of your program for each class of device. Like in OpenCL, you
can take a function and write a version for a CPU, for a GPU, a version for an
accelerator. So in this terminology, OpenCL is proposing CPU is like a Xeon,
GPU is like a Tesla, an accelerator something like MIC. We have a hard enough
problem getting one version of an optimized program written. I think that’s a
fatal flaw in this thing being widely adopted. I think we can bring those together.
</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
What you really are trying to say is that part of your
program is going to be restrictive enough that it can be vectorized, done in
parallel. I think there are alternatives to this that will catch on and
mitigate the need to write much code in OpenCL and in CUDA. The other flaw with
those techniques is that in a world where you have a GPU and a CPU, the GPU’s
got a job to do on the user interface, and so far we’ve not described what
happens when applications mysteriously try to send some to the GPU, some to the
CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK,
<i>mea culpa</i> for not interrupting and
mentioning Tianhe-1A.] AMD has proposed in their future architectures that
they’re going to produce a meta-language that OpenCL targets, and then the
hardware can target some to the GPU, and some to the CPU. So I understand the
problem, and I don’t know if that solution’s the right one, but it highlights
that the problem’s understood if you write too much OpenCL code. I’m personally
more of a believer that we find higher-level programming interfaces like Cilk
Plus, array notations, add array notations to C that explicitly tell you to
vectorize, and the compiler can figure out whether that’s SSE, is it AVX, is it
the 512-bit wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But
don’t pollute the programming language by telling the programmer to write three
versions of your code. The good news is, though, if you do use OpenCL or CUDA
to do that, you have extreme control of the hardware and will get the best
hardware results you can, and we learn from that. I just think the learnings
are going to drive us to more abstract programming models. That’s why I’m a big
believer in the Cilk plus stuff that we’re doing.</div>
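<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[For readers who haven’t seen it, here is roughly what the array-notation style he’s describing looks like. This is only a sketch, using the Cilk Plus array-section syntax that Intel’s compilers of that era (and, for a while, gcc with -fcilkplus) accepted; the function names are mine. The point is that one width-agnostic statement leaves the choice of SSE, AVX, or the 512-bit MIC vectors to the compiler, instead of the programmer writing a version per vector width.]</div>
<pre>
// Sketch of Cilk Plus array notation, an Intel C/C++ extension of the
// time. An array section x[start:length] expresses the whole data-parallel
// operation in one statement, with no vector width baked in; the compiler
// picks SSE, AVX, or 512-bit MIC instructions as the target allows.

// Plain scalar loop: the form the compiler must rediscover as vectorizable.
void saxpy_loop(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Array-notation form: explicitly data-parallel and width-agnostic.
// (Needs a Cilk Plus-capable compiler, e.g. icc or gcc -fcilkplus.)
void saxpy_sections(int n, float a, const float* x, float* y) {
    y[0:n] = a * x[0:n] + y[0:n];
}
</pre>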
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But
how many users of HPC systems are interested in squeezing that last drop of
performance out?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> HPC
users are extremely interested in squeezing performance if they can keep a
single source code that they can run everywhere. I hear this all the time, you
know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it
on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever,
just don’t tell me it has to be rewritten in a strange language that’s only
supported on your machine. It’s pretty consistent. So the success of CUDA, to
be used on those machines, it’s limited in a way, but it’s been exciting. But
it’s been a strain on the people who have done that because
CUDA code’s not going to run on an Intel machine [Well, actually, the <a href="http://www.pgroup.com/">Portland Group</a> has a CUDA C/C++ compiler
targeting x86. I do not know how good the output code performance is.]. OpenCL
offers some opportunities to run everywhere, but then has problems of
abstraction. Nvidia will talk about 400X speedups, which aren’t real, well that
depends on your definition of “real”.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
Let’s not start on that.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> OK,
well, what we’re seeing constantly is that vectorization is a huge challenge.
You talk to people who have taken their cluster code and moved it to MIC
[Cluster? No shared memory?], very consistently they’ll tell us stories like,
oh, “We ported in three days.” The Intel
marketing people are like “That’s great! Three days!” I ask why the heck
did it take you three days? Everybody tells me the same thing: It ran right
away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few
days to vectorize because otherwise performance was terrible. They’re trying to
use the 512-bit-wide vectors, and their original code was written using SSE
[Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations].
They can’t automatically translate, you have to restructure the loop because
it’s 512 bits wide – that should be automated, and if we don’t get that
automated in the next decade we’ve made a huge mistake as an industry. So I’m
hopeful that we have solutions to that today, but I think a standardized
solution to that will have to come forward. </div>
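<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[To make the porting story concrete: the hypothetical kernel below is written twice. The first version hard-codes SSE’s four-floats-at-a-time width with intrinsics, which is exactly the kind of code that can’t be mechanically retargeted to a 16-float-wide MIC vector unit; the second is the restructured, width-neutral loop that a vectorizing compiler can map onto SSE, AVX, or the 512-bit units. The kernel itself is made up; the intrinsics are real SSE.]</div>
<pre>
// Two versions of the same toy kernel: scale an array in place.
#include <xmmintrin.h>   // SSE intrinsics

// Hand-vectorized with SSE intrinsics: the four-floats-per-iteration
// structure, the tail loop, and the intrinsic names all assume a 128-bit
// vector. Retargeting this to 512-bit MIC vectors means restructuring it.
void scale_sse(float* x, float a, int n) {
    __m128 va = _mm_set1_ps(a);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        _mm_storeu_ps(&x[i], _mm_mul_ps(vx, va));
    }
    for (; i < n; ++i)               // scalar tail
        x[i] *= a;
}

// Width-neutral version: a plain loop the compiler is free to vectorize
// for SSE, AVX, or the 512-bit MIC units, whichever the target provides.
void scale_portable(float* x, float a, int n) {
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}

int main() {
    float data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    scale_sse(data, 2.0f, 10);       // doubles every element
    scale_portable(data, 0.5f, 10);  // halves them back
    return data[9] == 10.0f ? 0 : 1;
}
</pre>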
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
really wonder about that, because wildly changing the degree of parallelism, at
least at a vector level – if it’s not there in the code today, you’ve just got
to rewrite it.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Right, so we’ve got low-hanging fruit, we’ve got codes that have the
parallelism today, we need to give them a better way of specifying it. And then
yes, over time, those need to migrate to that [way of specifying parallelism in
programs]. But migrating the code where you have to restructure it a lot, and
then you do it all in SSE intrinsics, that’s very painful. If it feels more
readable, more intuitive, like array extensions to the language, I give it
better odds. But it’s still algorithmic transformation. They have to teach
people where to find their data parallelism; that’s where all the scaling is in
an application. If you don’t know how to expose it or write programs that
expose it, you won’t take advantage of this shift in the industry.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> I’m
supposed to make sure you wander down at about 11:00.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes,
I’ve got to go to the required press briefing, so I guess we need to take off.
Thanks an awful lot.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Sure. If there are any other questions we need to follow up on, I’ll be happy
to talk to you. I hope I’ve knocked off a few of your questions.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> And
then some. Thanks.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[While walking down to the press briefing, I asked James
whether the synchronization features he had from the X86 architecture were
adequate for MIC. He said that they were OK for the 30 or so cores in Knight’s Ferry,
but when you got above 40, they would need to do something additional.
Interestingly, after the conference, there was an <a href="http://communities.intel.com/community/openportit/server/blog/2011/09/22/intel-mic-scores-1st-home-run-with-10-petaflop-stampede-supercomputer">Intel
press release</a> about the Intel/Dell “home run” win at TACC – using Knight’s
Corner, “an innovative design that includes more than 50 cores.” This dovetails
with what Joe Curley told me about Knight’s Corner not being the same as
Knight’s Ferry. Stay tuned for the next interview.]</div>
Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com7tag:blogger.com,1999:blog-3155908228127841862.post-27514569062531300322011-09-26T17:50:00.002-06:002011-09-26T18:56:43.753-06:00A Conversation with Intel’s John Hengeveld at IDF 2011<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5WP8ovvsWKuNCvU1uv_R6AI0gFQvob6gWobmzfgEhdCLEwcwBaQeD-jpo-sbprSY32V0cGGgabxF3biq7f-8zShye6DvSvL5Ox_qpkxnC95o6Pxg452XbbK0lWPT_Pl8uPVKcjxVwunQw/s1600/Hengeveld.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5WP8ovvsWKuNCvU1uv_R6AI0gFQvob6gWobmzfgEhdCLEwcwBaQeD-jpo-sbprSY32V0cGGgabxF3biq7f-8zShye6DvSvL5Ox_qpkxnC95o6Pxg452XbbK0lWPT_Pl8uPVKcjxVwunQw/s200/Hengeveld.jpg" width="133" /></a></div><div class="MsoNormal">At the recent Intel Developer Forum (IDF), I was given the opportunity to interview John Hengeveld. John is in the Datacenter and Connected Systems Group in Hillsboro.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Intel-provided information about John:</div><div style="border-bottom: solid #4F81BD 1.0pt; border: none; margin-left: .65in; margin-right: .65in; mso-border-bottom-alt: solid #4F81BD .5pt; mso-border-bottom-themecolor: accent1; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;"><div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">John is responsible for end user and OEM marketing for Intel’s Workstation and HPC businesses and leads an outstanding team of industry visionaries. John has been at Intel for 6 years and was previously the senior business strategist for Intel’s Digital Enterprise Group and the lead strategist for Intel’s Many Core development initiatives. John has 20 years of experience in general management, strategy and marketing leadership roles in high technology. <o:p></o:p></div><div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">John is dedicated to life-long learning, he has taught Corporate Strategy and Business Strategy and Policy; Technology Management; and Marketing Research and Strategy for Portland State University’s Master of Business Administration program. John is a graduate of the Massachusetts Institute of Technology and holds his MBA from the University of Oregon. </div></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I recorded our conversation. What follows is a transcript, rather than a summary, since our topics ranged fairly widely and in some cases information is conveyed by the style of the answer. Conditions weren’t optimal for recording; it was in a large open space with many other conversations going on and the “<a href="http://newsroom.intel.com/community/intel_newsroom/free_press/blog/2011/09/16/robotic-orchestra-hits-right-notes-for-industrial-control">Intel Robotic Orchestra</a>” playing in the background. Hopefully I got all the words right.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! To all who responded.)</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Full disclosure: As I noted in a <a href="http://perilsofparallel.blogspot.com/2011/09/impressions-of-newbie-at-intel.html">prior post</a>, Intel paid for me to attend IDF. Thanks, again.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Occurrences of [] indicate words I added for clarification. There aren’t many.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> What, overall, is HPC to Intel? Is it synonymous with MIC?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> No. Actually, HPC has a research effort, how to scale applications, how to deal with performance and power issues that are upcoming. That’s the labs portion of it. Then we have significant product activity around our mainstream Xeon products, how to support the software and infrastructure when those products are delivered in cluster form to supercomputing activities. In addition to those, products also get delivered into what we refer to as the volume HPC market, which is small and medium-sized clusters being used for product design, research activities, such as those in biomed, some in visualization. Then comes the MIC part. So, when we look at MIC, we try to manage and characterize the collection of workloads we create optimized performance for. About 20% of those, and we think these are representative of workloads in the industry, map to what MIC does really well. And the rest, most customers have…</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> What is the distinguishing characteristic?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> There are two distinguishing characteristics. One is what I would refer to as compute density – applications that have relatively small memory footprints but have a high number of compute operations per memory access, and that parallelize well. Then there’s a second set of applications, streaming applications, where size isn’t significant but memory bandwidth is the distinguishing factor. You see some portion of the workload space there.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Streaming is something I was specifically going to ask you about. It seems that with the accelerators being used today, there’s this bifurcation in HPC: Things that don’t need, or can’t use, memory streaming; and those that are limited by how fast you can move data to and from memory.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s right. I agree.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Is MIC designed for the streaming side?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> MIC will perform well for many streaming applications. Not all. There are some that require a memory access model MIC doesn’t map to particularly well. But a lot of the streaming applications will do very well on MIC in one of the generations. We have a collection of generations of MIC on the roadmap, but we’re not talking about anything beyond the next “Corner” generation [Knight’s Corner, 2012 product successor to the current limited-production Knight’s Ferry software development vehicle]. More beyond that, down the roadmap, you will see more and more effect for that class of application.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> So you expect that to be competitive in bandwidth and throughput with what comes out of Nvidia?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Very much so. We’re competing in this market space to be successful; and we understand that we need to be competitive on a performance density, performance per watt basis. The way I kind of think about it is that we have a roadmap with exceptional performance, but, in addition to that, we have a consistent programming model with the rest of the Xeon platforms. The things you do to create an optimized cluster will work in the MIC space pretty much straightforwardly. We’ve done a number of demonstrations of that here and at ISC. That’s the main difference. So we’ll see the performance; we’ll be ahead in the performance. But the real difference is the programming model.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But the application has to be amenable.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> The application has to be amenable. For many customers that do a wide range of applications – you know, if you are doing a few things, it’s likely possible that some of those few things will be these highly-parallel, many-core optimized kinds of things. But most customers are doing a range of things. The powerful general-purpose solution is still the mainstream Xeon architecture, which handles the widest range of workloads really robustly, and as we continue with our beat rate in the Xeon space, you know with Sandy Bridge coming out we moved significantly forward with floating-point performance, and you’ll see that again going forward. You see the charts going up and to the right 2X per release.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes, all marketing charts go up and to the right.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, all marketing charts go up and to the right, but the point is that there’s a continued investment to drive floating-point performance and effective parallelism and power efficiency in a way that will be useful to HPC customers and mainstream customers.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Is MIC going to be something that will continue over time? That you can write code for and expect it to continue to work in the future?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Absolutely. It’s a major investment on our part on a distinct architectural approach that we expect to continue on as far out as our roadmaps envision today.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Can you tell me anything about memory and connectivity? There was some indication at one point of memory being stacked on a MIC chip.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> A lot of research concepts are being explored for future products, and I can’t really talk about much of that kind of thing for things that are out in the roadmap. There’s a lot of work being done around innovative approaches about how to do the system work around this silicon.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> MIC vs. SCC – Single Chip Cluster.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> SCC! Got it! I thought you meant single chip computer.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> That would probably be SoC, System on a Chip. Is SCC part of your thinking on this?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> SCC was a research vehicle to try to explore extreme parallelism and some different instruction set architectures. It was a research vehicle. MIC is a series of products. It’s an architecture that underlies them. We always use “MIC” as an adjective: It’s a MIC architecture, MIC products, or something like that. It means Many Integrated Cores, Many Integrated Core architecture is an approach that underlies a collection of products, that are a product mix from Intel. As opposed to SCC, which is a research vehicle. It’s intended to get the academic community thinking about how to solve some of the major problems that remain in parallelism, using computer science to solve problems.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> One person noted that a big part of NVIDIA’s success in the space is CUDA…</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yep.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> …which people can use to get, without too much trouble, really optimized code running on their accelerators. I know there are a lot of other things that can be re-used from Intel architecture – Threaded Building Blocks, etc. – but will CUDA be supported?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s a question you have to ask NVIDIA. CUDA’s not my product. I have a collection of products that have an architectural approach.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> OpenCL is covered?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> OpenCL is part of our support roadmap, and we announced that previously. So, yes OpenCL.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Inside of a MIC, right now, it has dual counter-rotating rings. Are connections other than that being considered? I’m thinking of the SCC mesh and other stuff. Are they in your thinking at this point?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, so, further out in the roadmap. These are all part of the research concepts. That’s the reason we do SCC and things like that, to see if it makes sense to use that architecture in the longer term products. But that’s a long ways away. Right now we have a fairly reasonable architectural approach that takes us out a bit, and certainly into our first generation of products. We’re not discussing yet how we’re going to use these learnings in future MIC products. But you can imagine that’s part of the thinking.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> OK.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> So, here’s the key thing. There are problems in exascale that the industry doesn’t know how to solve yet, and we’re working with the industry very actively to try to figure out whether there are architectural breakthroughs, things like mesh architectures. Is that part of the solution to exascale conundrums? Are there workloads in exascale, sort of a wave processing model, that you might see in a mesh architecture, that might make sense. So working with research centers, working with the labs, in part, we’re trying to figure out how to crack some of these nuts. For us it’s about taking all the pieces people are thinking about and seeing what the whole is.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I’m glad to hear you express it that way, since the way it seemed to be portrayed at ISC was, from Intel, “Exascale, we’ve got that covered.”</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> So, at the very highest strategic level, we have it covered in that we are working closely with a collection of academic and industry partners to try and solve difficult problems. But exascale is a long way off yet. We’re committed to make it happen, committed to solve the problems. That’s the real meat of what Kirk declared at ISC. It’s not that we have the answer; it’s that we have a commitment to make it happen, and to make it happen in a relatively early time period, with a relatively sustainable product architectural approach. But there are many problems to solve in exascale; we can barely get our arms around it.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Do you agree with the DARPA targets for exascale, particularly low power, or would you relax those?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> The Intel commit, what we said in the declaration, was not inconsistent with the DARPA thing. It may be slightly relaxed. You can relax one of two things, you can relax time or you can relax DARPA targets. So I think you’re going to reach DARPA’s targets eventually – but when. So the target that Kirk raised is right in there, in the same ballpark. Exascale in 20MW is one set of rational numbers; I’ve heard 10 [MW], I’ve heard 40 [MW], somewhere between those, right? I think 40 [MW] is so easy it’s not worth thinking about. I don’t think it’s economically rational. </div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> As you move forward, what do you think are the primary barriers to performance? There are two different axes here, technical barriers, and market barriers.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> The technical barriers are cracking bandwidth and not violating the power budget; tracking how to manage the thread complexity of an exascale system – how many threads are you going to need? A whole lot. So how do you get your arms around that? There are business barriers: How do you get a return on investment through productizing things that apply in the exascale world? This is a John [?] quote, not an Intel quote, but I am far less interested in the first exascale system than I am in the 100<sup>th</sup>. I would like a proliferation of exascale applications and performance, and have it be accessible to a wide range of people and applications, some applications that don’t exist today. In any ecosystem-building task, you’ve got to create awareness of the need, and create economic momentum behind serving that need. Those problems are equally complex to solve [equal to the technical ones]. In my camp, I think that maybe in some ways the technical problems are more solvable, since you’re not training people in a new way of thinking and working and solving problems. It takes some time to do that.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes, in some ways the science is on a totally different time schedule.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, I agree. I agree entirely. A lot of what I’m talking about today is leaps forward in science as technical computing advances, but as the capability grows, the science will move to match it. How will that science be used? Interesting question. How will it be proliferated? Genome work is a great target for some of this stuff. You probably don’t need exascale for genome. You can make it faster, you can make it more cost-effective.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> From what I have heard from people working on this at CSU, they have a whole lot more problems with storage than with computing capability.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s exactly right.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> They throw data away because they have no place to put it.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s a fine example of the business problems you have to crack along with the compute problems that you have to crack. There’s a whole infrastructure around those applications that has to grow up.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Looking at other questions I had… You wouldn’t call MIC a transitional architecture, would you?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> No. Heavens no. It’s a design point for a set of workloads in HPC and other areas. We believe MIC fits more things than just HPC. We started with HPC. It’s a design point that has a persistence well beyond as far as we can see on the roadmap. It’s not a transitional product.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I have a lot of detailed technical questions which probably aren’t appropriate, like whether each of the MIC cores has equal latency to main memory.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, that’s a fine example of a question I probably shouldn’t answer.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Returning to ultimate limits of computing, there are two that stand out, power and bandwidth, both to memory and between chips. Does either of those stand out to you as <b>the</b> sore thumb?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Wow. So, the guts of that question gets to workload characterization. One of my favorite topics is “It’s the workload, stupid.” People say “it’s the economy, stupid,” well in this space it’s the workload. There aren’t general statements you can make about all workloads in this market.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes, HPC is not one market.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Right, it’s not one market, it’s not one class of usages, it’s not one architecture of solutions, it’s one reason why MIC is required, it’s not invisible. One size doesn’t fit all. Xeon does a great job of solving a lot of it really well, but there are individual workloads that are valuable that we want to dive into with more capability in a more targeted way. There are workloads in the industry where the interconnect bandwidth between processors in a node and nodes in a cluster is the dominant factor in performance. There are other workloads where the bandwidth to memory is the dominant factor in performance. All have to be solved. All have to be moved forward at a reasonable pace. I think the ones that are going to map to exascale best are ones where the memory bandwidth required can be solved well by local memory, and the problems that can be addressed well are those that have rational scaling of interconnect requirement between nodes. You’re not going to see problems that have a massive explosion of communication; the bandwidth won’t exist to keep up with that. You can actually see something I call “well-fed FLOPS,” which is how many FLOPS can you rationally support given the rest of this architecture. That’s something you have to know for each workload. You have to study it for each domain of HPC usage before you get to the answer about which is more important.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> You probably have to go now. I did want to say that I noticed the brass rat. Mine is somewhere in the Gulf of Mexico.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s terrible. Class of ’80.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Class of ’67.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Wow.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Stayed around for graduate school, too.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> When’d you leave?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> In ’74.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> We just missed overlapping, then. Have you been back recently?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Not too recently. But there have been a lot of changes.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s true, a lot of changes.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But East Campus is still the same?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> You were in East Campus? Where’d you live?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Munroe.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> I was in the black hall of fifth-floor Bemis.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> That doesn’t ring a bell with me.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Back in the early 70s, they painted the hall black, and put in red lights in 5<sup>th</sup>-floor Bemis.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Oh, OK. We covered all the lights with green gel.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, I heard of that. That’s something that they did even to my time period there.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Anyway, thank you.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> A pleasure. Nice talking to you, too.</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com0tag:blogger.com,1999:blog-3155908228127841862.post-67471192464628957092011-09-18T15:58:00.013-06:002011-09-25T12:40:03.081-06:00Impressions of a Newbie at Intel Developer Forum (IDF)<div class="MsoBodyText">Out of the blue (which in this case is a pun), I received an invitation from an Intel representative to attend the 2011 Intel Developer Forum (IDF), in San Francisco, at Intel’s expense. Yes, I accepted. Thank you, Intel in general; and thank you in particular to the very nice lady who invited me and shepherded me through the process.<br />
<br />
<span class="Apple-style-span" style="color: purple;">[There are some updates below, marked in this color.]</span></div><div class="MsoBodyText"><br />
</div><div style="border-bottom: solid #4F81BD 1.0pt; border: none; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;">I’d never attended an IDF before, so I thought I’d spend an initial post on my overall impressions, describing the things that stood out to this wide-eyed IDF newbie. It may be boring to long-time IDF attendees – and there are very long-timers; a friend of mine has been to every domestic IDF for the last 12 years. But what the heck, they impressed me.<br />
<br />
</div><div class="MsoBodyText">I do have some technical gee-whiz later in this post, but I’ll primarily go into more technical detail in subsequent posts. Those will including recountings of the three private interviews that were arranged for me with Intel HPC and MIC (Many-Integrated Core) executives (John Hengeveld, James Reinders, and Joe Curley), as well as other things I picked up along the way, primarily about MIC.</div><div class="MsoBodyText"><br />
Here are my summary impressions: (1) Big. Very Big. (2) Incredibly slick and polished. (3) A fine attempt at Borgilation.</div><div class="MsoBodyText"><br />
IDF is gigantic. It doesn’t surpass the mother of all trade shows, the Consumer Electronics show, but I wouldn’t be surprised to find that it is the largest single-company trade show. The Moscone Center West, filled by IDF on all three floors, is almost 300,000 sq. ft. Justin Rattner (Intel Fellow & CTO) said in his keynote that there were over 5,000 attendees, and that hauling in the gear and exhibits required 500 semis. I believe it.</div><div class="MsoBodyText"><br />
There was of course the usual massive collection of trade-show booths covering one huge exhibit area (see photo of the center aisle of the exhibit area, below). That alone filled 100,000 sq. ft of exhibit space, completely. <br />
<br />
</div><div class="MsoBodyText"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGVO-BJCB8ACOauDAqYRm7vJOkdwU70VQePxTSPLfGeQhZBgTxhRR6gw8U34QIYbQBqFcQexP8b3oIu1RSjBfq2jMWqzNnUfwkwBX1PPffPqM0-1XyOju74jgCuYHucpkWwMjAs9pfBe6o/s1600/0912011729.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGVO-BJCB8ACOauDAqYRm7vJOkdwU70VQePxTSPLfGeQhZBgTxhRR6gw8U34QIYbQBqFcQexP8b3oIu1RSjBfq2jMWqzNnUfwkwBX1PPffPqM0-1XyOju74jgCuYHucpkWwMjAs9pfBe6o/s320/0912011729.jpg" width="320" /></a></div><br />
<br />
In addition, all the large open areas each had their large well-manned pavilion dedicated to one thing or another: One had a bevy of ultrabooks (ultrabook = Intel’s push for a viable non-Apple MacBook Air) that you could play with. Another was an “Extreme Zone” with a battery of four high-end gaming systems (mostly playing what looked like a Wolfenstein-y game). Another was a multi-player racing game with several drivers’ seats with steering wheels, etc. Another demoed twenty or thirty or so different sizes and shapes of laptops (in addition to the displays in the exhibit area). Another was a contraption of pipes and random stuff spitting plastic balls onto pseudo-xylophones, cymbals, and so on, physically mimicking the <a href="http://www.youtube.com/watch?v=rANjsNG2n5o">famous YouTube video</a> of several years back, demonstrating industrial controllers run by Atom processors. It didn’t actually play the music, but the video’s a pure animation so it’s one up on that. <span class="Apple-style-span" style="color: purple;">[Intel has a <a href="http://newsroom.intel.com/community/intel_newsroom/free_press/blog/2011/09/16/robotic-orchestra-hits-right-notes-for-industrial-control">press release</a> on this which seems to indicate that it actually played the music. Didn't seem like it to me, but might be.]</span></div><div class="MsoBodyText"><br />
Everywhere could be found fanatic attention to detail and production values, extending down to even small details. </div><div class="MsoBodyText"><br />
The keynotes were marvels of production; I’ve been to many IBM affairs, and nothing I saw over the years compared with these in slick, polished execution. Movies were theatre-quality cinematic productions (despite typical marketing fluff plots with occasional cheesy humor), and every one cued up at exactly the right instant, no hiccups. Every on-stage demo went right on the money, and even when one crashed – a momentary screen showing a Windows driver crash – another was seamlessly switched in what seemed like less than 2 seconds; I strongly suspect a hot backup, since no way does Windows recover that fast.</div><div class="MsoBodyText"><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDYVrfZ5m3g0OThxOAssqID3Wy0KHmdejtFAM1uqnPvu4yZjVsRYhMSY-y-8mHr5vcRX2KmN-tirnblPijEX-8MBoeSuGSddg3CTwgrtJ15-zfeJixVg7uEMq69mnpVpDE41R2HuIDv42C/s1600/IMG_0977.JPG" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDYVrfZ5m3g0OThxOAssqID3Wy0KHmdejtFAM1uqnPvu4yZjVsRYhMSY-y-8mHr5vcRX2KmN-tirnblPijEX-8MBoeSuGSddg3CTwgrtJ15-zfeJixVg7uEMq69mnpVpDE41R2HuIDv42C/s200/IMG_0977.JPG" width="200" /></a>But smaller things had their share of attention, too. The technical sessions I attended all had fluent, personable speakers; meticulously designed slides; and perfect audio with nary a glitch in microphone use or (&deity. forbid) feedback. Even the backpacks handed out were high quality and custom-made. Simple customization is no big deal, but these came with Intel logos on the zipper pulls and a custom lining emblazoned with their chip-layout banner theme (see photos).<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhP2cahLSlcvfCTyRUw3lYyyRbUH09smyTt2yShVZspFalGkbBVgrb5ilEtyi-6Bh7THS0XuUKgAsYK9QXUtOPAmikBXGNhSbPZw6QwFz4mktDhlMnWDFGBzNKdfIakAi2_m6-A1vZ9-i_K/s1600/IMG_0978.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhP2cahLSlcvfCTyRUw3lYyyRbUH09smyTt2yShVZspFalGkbBVgrb5ilEtyi-6Bh7THS0XuUKgAsYK9QXUtOPAmikBXGNhSbPZw6QwFz4mktDhlMnWDFGBzNKdfIakAi2_m6-A1vZ9-i_K/s200/IMG_0978.JPG" width="200" /></a></div><br />
</div><div class="MsoBodyText">Speaking of that banner theme, it blared out at you over the entrance to each hall, on a photo at least 20 ft. high and 100 feet long (photo again), a huge illustration: You are a dull, chalky, dead, white – until Intel’s silicon brings you to vibrant, colored life. Not exactly subtle symbolism, but that’s marketing.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik-JStPQhx5iz7ojlUZoqHhOnbaMKENGkTzeTQKVPuvgLrF7NAFHPilQgGaxkCcbHnA2YHME406WogBtmgTsmUOVsi5FMYU0UvOBpzZfCrOA5UpAD-iWrCY-wPMM5QyR1PePyfI8MjztnJ/s1600/0913010832.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik-JStPQhx5iz7ojlUZoqHhOnbaMKENGkTzeTQKVPuvgLrF7NAFHPilQgGaxkCcbHnA2YHME406WogBtmgTsmUOVsi5FMYU0UvOBpzZfCrOA5UpAD-iWrCY-wPMM5QyR1PePyfI8MjztnJ/s640/0913010832.jpg" width="640" /></a></div><br />
</div><div class="MsoBodyText"><br />
And speaking of marketing, the unmistakable overall message was: We will dominate <b>everything</b>. Everything with a processor in it, that is. Servers, with volumes ever-increasing at huge rates? Check. High-end 10+ core major stompers? Check. Midrange? Check. Low end? Super checkety-check-check-check. Ultrabook (future) with 14-day standby. (Standby? Do we really care?) Even a cell phone, demoed, run by an Intel processor. It’s the little black rectangle at the center-right of this pic:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji4SZl7QEp9YgYDTuI6Xb1lj7BHmtJvLYIYeCQ0v9_u4XQ3a0T9tfy1f6LojzwiUQRkuV-tKunb8N3GOh8aUxWRtDqrljpe0v1tNwykh3kf9xg6mRFB6Bjll76ctHoUnEP91P78py_i2uF/s1600/0913010951.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji4SZl7QEp9YgYDTuI6Xb1lj7BHmtJvLYIYeCQ0v9_u4XQ3a0T9tfy1f6LojzwiUQRkuV-tKunb8N3GOh8aUxWRtDqrljpe0v1tNwykh3kf9xg6mRFB6Bjll76ctHoUnEP91P78py_i2uF/s320/0913010951.jpg" width="320" /></a></div><br />
</div><div class="MsoBodyText"><br />
(I couldn’t get a better picture, since after every keynote there was a “photo opportunity” that produced a paparazzi-dense melee/feeding frenzy on the stage. This is, I'm told, an IDF tradition. I’m not sufficiently a press-banger to elbow my way through that wall of bodies.)</div><div class="MsoBodyText"><br />
The low-power demo that impressed me, though, was of a two-watt processor in a system showing a squee-worthy kitty video (and something else, but who noticed?), powered by a small solar panel. This was a demo of the future potential of near-threshold voltage operation, also touted (not, I’m sure, by accident) (not at <i>all</i>) in the Intel Fellows’ panel the day before. They used an old Pentium to do it, undoubtedly for reasons I’m not enough of a circuit jockey to understand. There was even what appeared – horrors! – to be an on-stage <i>ad lib</i> (!!) about “dumpster diving” for it. (Hey, eBay! Did they just call you a dumpster? The perils of <i>ad libbing</i>.) Some blatant futurism followed this, talking about 100 GF in that same 2W envelope; no hint when, fantastic if it ever happens.</div><div class="MsoBodyText"><br />
There are chinks in the armor, though. You have to look seriously to find them, or have some comparisons on your side.</div><div class="MsoBodyText"><br />
A friend happened to note to me, for example, that this IDF was three keynotes short of the usual full house of six. There was Otellini’s (CEO) general keynote, and Mooley Eden’s laptop ultrabook keynote, and Justin Rattner’s “futures” presentation in which he laughs too much for my taste. Those are regulars at every IDF. However, there was no keynote specifically devoted to Servers; understandable, I suppose, because they’re between big releases and have nothing major to announce (but they said a whole lot about the next-gen Ivy Bridge and the future server market in a media-only briefing). There was also no keynote for Digital Home; they are wrapped up with Sony <span class="Apple-style-span" style="color: purple;">[and other partners]</span> on that one, and likely it hasn’t any splashes to make at this time (or else everybody’s figured out that connecting your TV to the Internet isn’t yet a world-shaking idea). And… dang, there was a third one historically, but I’ve lost it. Sorry. <span class="Apple-style-span" style="color: purple;">[The third missing keynote was on software and services, traditionally performed by Renee James.]</span> Takeaway: Ambitions seem a bit shrunken, but it may just be circumstances. </div><div class="MsoBodyText"><br />
A big deal was made in a media briefing about how they were going to improve Intel's Atom SoCs (Systems-On-a-Chip) at <i>double Moore’s Law</i>. (I think you’re supposed to gasp now.) That sounds sexy, but I interpret it as meaning they figured out that Atom really needs to be done in their latest and greatest silicon technology, as opposed to lagging a couple of generations (nodes) back the way it now does, particularly now that their highest-end technologies are focused on low power.<br />
<br />
So they’re going to catch up. Everybody, including Atom, will be using the same 14nm technology in 2014. (That’s an estimated, forward-looking 2014; see their prospectus for caveats, etc.) Until then, well, there are iterations. I take “double Moore’s Law” to mean that they can’t steer the massive ship of microprocessor development fast enough to catch up in a single release; and/or (likely) their existing Atom customer base can’t wait without any new Atom products for as long as a single leap would take.<br />
<br />
Will this put a dent in ARM's dominance of the low-power arena? Or MIPS's share? Maybe, in time.</div><div class="MsoBodyText"><br />
Then there was that graph, also in a media briefing, of future server shipments. (Wish I had a pic; can’t find the pdf on the Intel web site.) They extended it to show some trebling or quadrupling of server shipments in the next few years, but…<br />
<br />
Maybe they have some data I don’t have. To me, the actual past data on the graph seemed to say that the curve of shipment volumes recently started flattening out. Extrapolating based on the slope that existed a couple of quarters or years in the past doesn’t seem justified by what I saw purely based on that graph.</div><div class="MsoBodyText"><br />
Hey, did I mention that I wuz a medium? I got in with media credentials, which was another personal first. (Thanks again!) Talk about being a newbie – I didn’t even know there was a special “media corridor” until half-way through the first day. Dang. I could have had a much better breakfast on the first day.</div><div class="MsoBodyText"><br />
Now I have this itch to buy a fedora so I can put a press pass into the hatband.</div><div class="MsoBodyText"><br />
More will come, but I’ve got a trip to Mesa Verde for the next few days, so it won’t be immediate. Sorry. The wait won’t be anywhere near as long as it has been between other recent posts.</div><div class="MsoBodyText"><br />
</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-4565836467340590072011-08-15T18:39:00.001-06:002011-08-16T17:09:54.500-06:00IBM Dumps Blue Waters – Final Curtain on the Old Days<div style="border-bottom: solid #4F81BD 1.0pt; border: none; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;"><div class="MsoTitle">IBM has pulled out of the much-touted Blue Waters supercomputer project of IBM and National Center for Supercomputing Applications at the University of Illinois, an effort which was supposed to produce one petaflops of sustained performance by the end of 2012. Googling “IBM Blue Waters” and selecting “news” will give you a bevy of reports on this, (like <a href="http://www.theregister.co.uk/2011/08/08/ibm_kills_blue_waters_super/">this</a>, <a href="http://www.slashgear.com/ibm-ncsa-petascale-supercomputer-blue-waters-project-abandoned-08170381/">this</a>, <a href="http://www.hpcwire.com/hpcwire/2011-08-08/ibm_bails_on_blue_waters_supercomputer.html">this</a>, <a href="http://www.pcmag.com/article2/0,2817,2390697,00.asp">this</a>) so I’m going to refrain from reduplicating what everybody else has said.</div><div class="MsoTitle"><br />
</div><div class="MsoTitle">I don’t have any inside scoop on this, in the sense that I have no under-the-table secret contacts or communications channels back into IBM. However, I can make some connections between dots already out there, based on my experience leading one flashy HPC project (<a href="http://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV1005.html">RP3</a>) back in the 1980s (possibly the first IBM did), and being close to such projects after that. My conclusion: There has been a major change in IBM executive management’s attitude towards flashy HPC projects, a change that is probably the drop of the final shoe of the “good old days” of IT architecture research.</div></div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">I deduce the attitude change from HPCwire’s call to Herb Schultz, marketing manager for IBM's Deep Computing unit, in which he said a while ago that “There is really no appetite in IBM anymore -- with some of the leadership changes over the last few years – for revenue that has no profit with it”.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">So, IBM wants to make money on its high performance computing products. What’s wrong with that? Nothing. As every IBM manager is taught in their first management training – at least I was – the purpose of IBM isn’t to advance technology, or make the world a better place, or be a good corporate citizen; it’s to make money. (Those were the multiple choices in a quiz, by the way.) It’s perfectly obvious that any company that doesn’t make money, and thereby stay in business, can’t do anything. It’s like the first and most important rule of breathing I was taught in Tai Chi, which was: Breathe. If you don’t do that, you won’t be around long.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">But as everyone should also know, there’s a focus on making money now, directly, measurably; and there’s setting up to make more money in the future. The first is needed; but if done exclusively, without the second, your corporate lifetime is also being limited – rather like living on a tasty but unhealthy diet.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">I recall distinctly the response of <a href="http://en.wikipedia.org/wiki/Ralph_E._Gomory">Ralph Gomory</a>, then IBM Senior VP of Science and Technology, to a cadre of high-level development managers who were complaining about the cost of some HPC project, proposing to kill it. He told them “This will make you money in ways you can’t conceive of” (approximate quote). He was right. What they return isn’t money, directly; it’s column-inches on the front page of the New York Times and similar media.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">This works. I’ve recounted in a <a href="http://perilsofparallel.blogspot.com/2009/02/larrabee-in-ps4-so-whats-future-of-ibms.html">much earlier post</a> a case I was involved in where an IBM account rep absolutely <b>owned</b> the entire IT account of a large, conservative retailer in the Midwest – because an IBM RISC system was given the credit for beating Kasparov. (Winning Jeopardy! hardly has the same cachet.)</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">Also, while it may be hard to fathom now, there was a time when computer architecture and hardware development research was simply pursued for its own sake, primarily because we might find something out by doing it, without knowing what that might be.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">This also works. My personal example of that is tree saturation<a href="file:///C:/Users/gpfister/Desktop/Perils%20Blue%20Waters.docx#_ftn1" name="_ftnref1" title=""><sup><sup><span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">[1]</span></sup></sup></a> (a.k.a. congestion spreading, but in non-lossy networks), which I and Alan Norton serendipitously discovered in the RP3 project. I distinctly recall involuntarily standing and my whole body stiffening when I looked at the graphs revealing it, and realized what was happening. It was my own personal “eureka!” kind of moment. We’d no clue we’d find that, and it was the occasion of my only recursive award – an award from IBM research for getting an award for the paper on it. Gomory (who, coincidentally, was Research Division president at the time) said that was exactly the kind of thing he had hoped to get from RP3.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">However, two things have changed since then: There’s a much stronger focus on showing results today (which the IBM stock price rise duly reflects). And the cost of entry has become quite a bit higher, particularly entries like Blue Waters.</div><div class="MsoBodyText">Back when Gomory said what I recounted above, IBM was riding high on steady income from mainframes and their software. Those still bring in substantial money, particularly via drag of software along with them (which the hardware guys aren’t allowed to count… grrr…). Now, though, the software business has moved on to the much more competitive arena of stand-alone software products that run on a variety of platforms. Of course, there is also now the whole service business that practically didn’t exist back then.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">In addition, the cost of entry has skyrocketed. Back when I was involved in RP3, we had a contract with DARPA that brought in a whole $1M or so, which paid something like half the real bill. Compare that with <a href="http://www.theregister.co.uk/2011/07/15/power_775_super_pricing/">El Reg</a>’s estimate that a single Blue Waters rack is an $8M proposition, with over 200 racks needed for the final configuration and you’re over $1B. Those are all rough numbers, and they’re retail, not cost (an impossible number to pin down from outside), but you can see where the table stakes have gotten beyond many of the highest high rollers stash.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">So I’m going to label this pull out from Blue Waters as the final ringing down of the last curtain on an era of free-wheeling profit-unconstrained research into computer architecture and systems. </div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">It was fun while it lasted, but now, no matter what you do, the issue is where and when the profit comes out. That’s normal now, but I think we need to remember that it was not always so.</div><div><br />
<hr align="left" size="1" width="33%" /><div id="ftn1"><div class="MsoFootnoteText"><a href="file:///C:/Users/gpfister/Desktop/Perils%20Blue%20Waters.docx#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference"><span class="MsoFootnoteReference"><span style="font-family: Calibri, sans-serif; font-size: 10pt; line-height: 115%;">[1]</span></span></span></a> I’d like to give a URL for that, but it was back in the early 80s pre-web. There are lots of papers still out there about avoiding or fixing it (many wrong) that you can find by Googling “tree saturation”, though. Finally figured out how to fix it in InfiniBand. Complicated. Possibly not worth the effort. <span class="Apple-style-span" style="color: purple;">Added: Since someone asked, here's bibliographical information on the paper: "Hot spot" contention and combining in multistage interconnection networks. GF Pfister, V Norton IEEE TRANS. COMP. 34:1010, 943-948, 1985</span></div></div></div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com3tag:blogger.com,1999:blog-3155908228127841862.post-11873299207185491312011-05-17T16:40:00.013-06:002011-05-23T13:03:32.043-06:00Sandy Bridge Graphics Disappoints<span xmlns=""></span><br />
<span style="color: red;" xmlns=""><b>See update and the end of this post: New drivers released.</b></span><br />
<span xmlns="">Well, I'm bummed out.<br />
</span><br />
<span xmlns="">I was really looking forward to purchasing a new laptop that had one of Intel's new Sandy Bridge chips. That's the chip with integrated graphics which, while it wouldn't exactly rock, would at least be adequate for games at midrange settings. No more fussing around comparing added discrete graphics chips, fewer scorch marks on my lap, and other associated goodness would ensue.<br />
</span><br />
<span xmlns="">The pre-ship performance estimates and hands'-on trials said that would be possible, as I pointed out in <a href="http://perilsofparallel.blogspot.com/2010/09/intel-graphics-in-sandy-bridge-good.html">Intel Graphics in Sandy Bridge: Good Enough</a>. This would have had the side effect of pulling the rug out from under Nvidia's volumes for GPUs, causing the HPC market to have to pull its own weight, meaning have traditional HPC price tags (see <a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">Nvidia-based Cheap Supercomputing Coming to an End</a>). That would have been an earthquake, since most of the highest-end HPC systems now get their peak speeds from Nvidia CUDA accelerators, a situation not in small part due to their (relatively) low prices arising from high graphics volumes.<br />
</span><br />
<span xmlns="">Then TechSpot had to go and do a <a href="http://www.techspot.com/review/392-budget-gpu-comparison/page1.html">performance comparison of low-end graphics cards</a>, and later, just as a side addition, throw in measurements of Sandy Bridge graphics, too.<br />
</span><br />
<span xmlns="">Now, I'm sufficiently old-fashioned in my language that I really try to avoid even marginally obscene terms, even if they are in widespread everyday use, but in this case I have to make an exception: <br />
</span><br />
<span xmlns="">Damn, Sandy Bridge really sucks at graphics.<br />
</span><br />
<span xmlns="">It's the lowest of the low in every case. It's unusable for every game tested (and they tested quite a few), unless you're on some time-dilation drug that makes less than 15 frames per second seem zippy. Some frame rates – at medium settings – are in single digits.<br />
</span><br />
<span xmlns="">With Sandy Bridge, Intel has solidly maintained its historic lock on the worst graphics performance in the industry. This, by the way, is with the Intel i7 chips overclocked to 3.4GHz. That should also overclock the graphics (unless Intel is doing something I don't know about with the graphics clock).<br />
</span><br />
<span xmlns="">Ah, but possibly there is a "3D" fix for this coming soon? Ivy Bridge, the upcoming 22nm shrink of Sandy Bridge (the Intel "tock" following Sandy Bridge "tick"), has those wondrous new much-promoted transistors. Heh. Intel says Ivy Bridge will have – drum roll – <a href="http://vr-zone.com/articles/ivy-bridge-to-have-20-percent-performance-advantage-over-sandy-bridge/11061.html">30% faster graphics than Sandy Bridge</a>. <br />
</span><br />
<span xmlns="">See prior marginal obscenity. <br />
</span><br />
<span xmlns="">Intel does tend to sandbag future performance estimates, but not by enough to lift 30% up to 200-300%; that's what would be needed to produce what people were saying Sandy Bridge would do. Is that all we get from those "3D" transistors? The way the Intel media guys are going on about 3D, I expected Tri-Gate (which can be two- or five- or whatever-gate) to give me an Avatar-like mind meld or something.<br />
</span><br />
<span xmlns="">All that stuff about on-chip integrated graphics taking over the low-end high-volume market for discrete graphics just isn't going to happen this year with Sandy Bridge, or later with Ivy Bridge. As a further grain of salt in my wound, Nvidia is even <a href="http://www.theregister.co.uk/2011/05/13/nvidia_q1_f2012_numbers/">seeing a nice revenue uptick</a> from selling discrete graphics add-ons to new Sandy Bridge systems. It's not that I have anything against Nvidia. I just didn't think that uptick, of all things, was going to happen.<br />
</span><br />
<span xmlns="">This doesn't change my opinion that GPUs integrated on-chip won't ultimately take over the low-end graphics market. As the real Moore's Law – the law about transistor densities, not clock rates – continues to march on, it's inevitable that on-chip integrated graphics will be just fine for low- and medium-range games. It just won't happen soon with Intel products. <br />
</span><br />
<span xmlns="">Ah, but what about AMD? Their Fusion chips with integrated graphics, which they call APUs, are supposed to be rather good. Performance information <a href="http://cpuforever.com/showthread.php?tid=1265&pid=2253">leaked on message boards</a> about their upcoming A4-3400, A6-3650 and A8-3850 APUs make them sound as good as, well, um, as good as Sandy Bridge was supposed to be. Hm.<br />
</span><br />
<span xmlns="">Several years ago I heard a high-level AMD designer say that people looking for performance with Fusion were going to be disappointed; it was strictly a cost/performance product. That was several years ago, and things could have changed, but chip design lead times are still multi-year.<br />
</span><br />
<span xmlns="">In any event, this time I think I'll wait until shipped products are tested before declaring victory. <br />
</span><br />
<span xmlns="">Meanwhile, here I go again, flipping back and forth between laptop specs and GPU specs, as usual.<br />
</span><br />
<span xmlns=""><em>Sigh.</em><br />
</span><br />
<span xmlns=""></span><br />
<div class="MsoBodyText"><span xmlns=""><b><span style="color: red;">UPDATE May 23, 2011</span></b></span></div><span xmlns=""> </span><br />
<div class="MsoBodyText"><span xmlns="">Intel has just released new drivers for Sandy Bridge. The <a href="http://newsroom.intel.com/community/intel_newsroom/blog/2011/05/10/chip-shot-new-hd-graphics-driver-keeps-the-fun-going">press release</a> says they provide “up to 40% performance improvements on select games, support for the latest games like Valve’s Portal 2 and Stereoscopic 3D playback on DisplayPort monitors.”</span></div><div class="MsoBodyText"><span xmlns=""><br />
</span></div><div class="MsoBodyText"><span xmlns="">At this time I don't know of test results that would confirm whether this really makes a difference, but if it’s real, and applies broadly enough, it might be just barely enough to make the Ivy Bridge chip the beginning of the end for low-end discrete graphics.</span></div><br />
<span xmlns=""><em><br />
</em></span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com17tag:blogger.com,1999:blog-3155908228127841862.post-43812915820537583022011-02-22T20:25:00.000-07:002011-02-22T20:25:42.859-07:00I'm Also On SemiAccurate Now (and a bit about Apple)The good folks over at <a href="http://www.semiaccurate.com/">SemiAccurate</a> have invited me to contribute there, and I've accepted.<br />
<br />
I'm certainly not going to abandon this blog, although I'll admit there hasn't been much activity here recently. But posts on some of the types of topics I've covered here will appear there, instead; the theory is that this way they reach a wider audience. Posts with topics that are too far from S|A's area, as well as any longer than they're comfortable with, will still appear here.<br />
<br />
I've gone live there already. So, if you would like to better understand why Apple charges 30% of revenue to iPad app developers and subscription services, take a look over there - or, to be more exact, <a href="http://www.semiaccurate.com/2011/02/22/ipad-game-console/">right here</a>. The topic isn't my usual stomping ground, but that had nothing to do with S|A; it's something that just occurred to me while reading some things about Apple's shenanigans.<br />
<br />
I guess I can now add "web journalist" to my vita. I wonder when I get a press card to tuck jauntily into my hat band?Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com5tag:blogger.com,1999:blog-3155908228127841862.post-60348499275185225502011-01-11T15:00:00.000-07:002011-01-11T15:00:39.819-07:00Intel-Nvidia Agreement Does Not Portend a CUDABridge or Sandy CUDA<span xmlns=""></span><br />
<span xmlns="">Intel and Nvidia reached a legal agreement recently in which they cross-license patents, stop suing each other over chipset interfaces, and oh, yeah, Nvidia gets $1.5B from Intel in five easy payments of $300M each.<br />
</span><br />
<span xmlns="">This has been covered in many places, like <a href="http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/11/BUAS1H713D.DTL">here</a>, <a href="http://www.eweek.com/c/a/IT-Infrastructure/Intel-Nvidia-End-Litigation-Sign-New-Patent-Agreement-340551/">here</a>, and <a href="http://www.anandtech.com/show/4121/the-license-agreement-intel-to-pay-nvidia-15-billion">here</a>, but in particular <a href="http://arstechnica.com/business/news/2011/01/intelnvidia-bombshell-look-for-nvidia-gpu-on-intel-processor-die.ars?utm_source=rss&utm_medium=rss&utm_campaign=rss">Ars Technica</a> originally lead with a headline about a Sandy Bridge (Intel GPU integrated on-chip with CPUs; see <a href="http://perilsofparallel.blogspot.com/2010/09/intel-graphics-in-sandy-bridge-good.html">my post</a> if you like) using Nvidia GPUs as the graphics engine. Ars has since retracted that (see web page referenced above), replacing the original web page. (The URL still reads "bombshell-look-for-nvidia-gpu-on-intel-processor-die.")<br />
</span><br />
<span xmlns="">Since that's been retracted, maybe I shouldn't bother bringing it up, but let me be more specific about why this is wrong, based on my reading the <a href="http://download.intel.com/pressroom/legal/Intel_Nvidia_2011_Redacted.pdf">actual legal agreement</a> (redacted, meaning a confidential part was deleted). Note: I'm not a lawyer, although I've had to wade through lots of legalese over my career; so this is based on an "informed" layman's reading.<br />
</span><br />
<span xmlns="">Yes, they have cross-licensed each others' patents. So if Intel does something in its GPU that is covered by an Nvidia patent, no suits. Likewise, if Nvidia does something covered by Intel patents, no suits. This is the usual intention of cross-licensing deals: Each side has "freedom of action," meaning they don't have to worry about inadvertently (or not) stepping on someone else's intellectual property.<br />
</span><br />
<span xmlns="">It does mean that Intel could, in theory, build a whole dang Nvidia GPU and sell it. Such things have happened, historically, but usually without cross-licensing, and are uncommon (IBM mainframe clones, X86 clones), but as a practical matter, wholesale inclusion of one company's processor design into another company's products is a hard job. There is a lot to a large digital widget not covered by the patents – numbers of undocumented implementation-specific corner cases that can mess up full software compatibility, without which there's no point. Finding them all is massive undertaking. <br />
</span><br />
<span xmlns="">So switching to a CUDA GPU architecture would be a massive undertaking, and furthermore it's a job Intel apparently doesn't want to do. Intel has its own graphics designs, with years of the design / test / fabricate pipeline already in place; and between the ill-begotten Larrabee (now MICA) and its own specific GPUs and media processors Intel has demonstrated that they really want to do graphics in house.<br />
</span><br />
<span xmlns="">Remember, what this whole suit was originally all about was Nvidia's chipset business – building stuff that connects processors to memory and IO. Intel's interfaces to the chipset were patent protected, and Nvidia was complaining that Intel didn't let Nvidia get at the newer ones, even though they were allegedly covered by a legal agreement. It's still about that issue. <br />
</span><br />
<span xmlns="">This makes it surprising that, buried down in section 8.1, is this statement:<br />
</span><br />
<span xmlns="">"Notwithstanding anything else in this Agreement, NVIDIA Licensed Chipsets shall not include any Intel Chipsets that are capable of electrically interfacing directly (with or without buffering or pin, pad or bump reassignment) with an Intel Processor that has an integrated (whether on-die or in-package) main memory controller, such as, without limitation, the Intel Processor families that are code named 'Nehalem', 'Westmere' and 'Sandy Bridge.'"<br />
</span><br />
<span xmlns="">So all Nvidia gets is the old FSB (front side bus) interfaces. They can't directly connect into Intel's newer processors, since those interfaces are still patent protected, and those patents aren't covered. They have to use PCI, like any other IO device.<br />
</span><br />
<span xmlns="">So what did Nvidia really get? They get bupkis, that's what. Nada. Zilch. Access to an obsolete bus interface. Well, they get bupkis plus $1.5B, which is a pretty fair sweetener. Seems to me that it's probably compensation for the chipset business Nvidia lost when there was still a chipset business to have, which there isn't now.<br />
</span><br />
<span xmlns="">And both sides can stop paying lawyers. On this issue, anyway.<br />
</span><br />
<h2><span xmlns="">Postscript<br />
</span></h2><span xmlns="">Sorry, this blog hasn't been very active recently, and a legal dispute over obsolete busses isn't a particularly wonderful re-start. At least it's short. Nvidia's <a href="http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/">Project Denver</a> – sticking a general-purpose ARM processor in with a GPU – might be an interesting topic, but I'm going to hold off on that until I can find out what the architecture really looks like. I'm getting a little tired of just writing about GPUs, though. I'm not going to stop that, but I am looking for other topics on which I can provide some value-add.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com3tag:blogger.com,1999:blog-3155908228127841862.post-8540685754554144182010-12-06T17:26:00.000-07:002010-12-06T17:26:33.161-07:00The Varieties of Virtualization<span xmlns=""></span><br />
<span xmlns="">There appear to be many people for whom the term <em>virtualization</em> exclusively means the implementation of virtual machines à la VMware's products, Microsoft's Hyper-V, and so on. That's certainly a very important and common case, enough so that I covered various ways to do it in a <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">separate series of posts</a>; but it's scarcely the only form of virtualization in use. <br />
</span><br />
<span xmlns="">There's a hint that this is so in the gaggle of other situations where the word <em>virtualization</em> is used, such as desktop virtualization, application virtualization, user virtualization (I like that one; I wonder what it's like to be a virtual user), and, of course, Java Virtual Machine (JVM). Talking about the latter as a true case of virtualization may cause some head-scratching; I think most people consign it to a different plane of existence than things like VMware.<br />
</span><br />
<span xmlns="">This turns out not to be the case. They're not only all in the same (boringly mundane) plane, they relate to one another hierarchically. I see five levels to that hierarchy right now, anyway; I wouldn't claim this is the last word.<br />
</span><br />
<span xmlns="">A key to understanding this is to adopt an appropriate definition of virtualization. Mine is that virtualization is the creation of isolated, idealized platforms on which computing services are provided. Anything providing that, whether it's hardware, software, or a mixture, is virtualization. The adjectives in front of "platform" could have qualifiers: Maybe it's not quite idealized in all cases, and isolation is never total. But lack of qualification is the intent.<br />
</span><br />
<span xmlns="">Most types of virtualization allow hosting of several platforms on one physical or software resource, but that's not part of my definition because it's not universal; it could be just one, or a single platform could be created spanning multiple physical resources. It's also necessary to not always dwell all that heavily on boundaries between hardware and software. But that's starting to get ahead of the discussion. Let's go through the levels, starting at the bottom.<br />
</span><br />
<span xmlns="">I'll relate this to the cloud computing's IaaS/PaaS/SaaS levels later.<br />
</span><br />
<h2><span xmlns="">Level 1: Hardware Partitioning<br />
</span></h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMqzsI9p0fIisEJyoiH46g4flU4FSt6yZ6_mlSreJRhy-jQoRXhIZEaySMqvOVIf3h2Ysw-Q3SNNIoP5TOVJaIBDSmLEToW2FuWT1ubekBxUuDoHX1pS7qeAlFkzagSXoph8R9Q4osmavM/s1600/3509-milk-chocolate-breakup_500x500.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMqzsI9p0fIisEJyoiH46g4flU4FSt6yZ6_mlSreJRhy-jQoRXhIZEaySMqvOVIf3h2Ysw-Q3SNNIoP5TOVJaIBDSmLEToW2FuWT1ubekBxUuDoHX1pS7qeAlFkzagSXoph8R9Q4osmavM/s200/3509-milk-chocolate-breakup_500x500.jpg" width="200" /></a></div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgykOOrXgVvsRT8qSdmI_-pzd2tHtC_K4i4oGY_Of_ntN_2dQf_Ag-XKQ9cinlZ0pDg6Js2b4x7NMuIMnVhtgJSgEMaFxuqQ2oIY7hbzTaLBIdYM0mqGsmDvCeMOeNPzdeiYchCVCA6gtoC/s1600/Hardware+Partitioning.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><br />
</a><span xmlns="">Some hardware is designed like a brick of chocolate that can be broken apart along various predefined fault lines, each piece a fully functional computer. Sun Microsystems (Oracle, now) famously did this with its .com workhorse, the <a href="http://en.wikipedia.org/wiki/Sun_Enterprise_10000">Enterprise 10000</a> (UE10000). That system had multiple boards plugged into a memory-bus backplane, each board with processor(s), memory, and IO. Firmware let you set registers allowing or disallowing inter-board memory traffic, cache coherence and IO traffic, allowing you to create partitions of the whole machine built with any number of whole boards. The register setting, etc., is set up so that no code running on any of the processors can alter it or, usually, even tell it's there; a privileged console accesses them, under command of an operator, and that's it. HP, IBM and others have provided similar capabilities in large systems, often with the processors, memory, and IO in separate units, numbers of each assigned to different partitions.</span><br />
<span xmlns=""><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgykOOrXgVvsRT8qSdmI_-pzd2tHtC_K4i4oGY_Of_ntN_2dQf_Ag-XKQ9cinlZ0pDg6Js2b4x7NMuIMnVhtgJSgEMaFxuqQ2oIY7hbzTaLBIdYM0mqGsmDvCeMOeNPzdeiYchCVCA6gtoC/s1600/Hardware+Partitioning.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgykOOrXgVvsRT8qSdmI_-pzd2tHtC_K4i4oGY_Of_ntN_2dQf_Ag-XKQ9cinlZ0pDg6Js2b4x7NMuIMnVhtgJSgEMaFxuqQ2oIY7hbzTaLBIdYM0mqGsmDvCeMOeNPzdeiYchCVCA6gtoC/s400/Hardware+Partitioning.png" style="cursor: move;" width="400" /></a> </span><br />
<span xmlns="">Hardware partitioning has the big advantage that even hardware failures (for the most part) simply cannot propagate among partitions. With appropriate electrical design, you can even power-cycle one partition without affecting others. Software failures are of course also totally isolated within partitions (as long as one isn't performing a service for another, but that issue is on another plane of abstraction).<br />
</span><br />
<span xmlns="">The big negative of hardware partitioning is that you usually cannot have very many of them. Even a single chip now contains multiple processors, so partitioning even by separate chips is far less granularity than is generally desirable. In fact, it's common to assign just a fraction of one CPU, and that can't be done without bending the notion of a hardware-isolated, power-cycle-able partition to the breaking point. In addition, there is always some hardware in common across the partition. For example, power supplies are usually shared, and whatever interconnects all the parts is shared; failure of that shared hardware cause all partitions to fail. (For more complete high availability, you need multiple completely separate physical computers, not under the same sprinkler head, preferably located on different tectonic plates, etc. depending on your personal level of paranoia.)<br />
</span><br />
<span xmlns="">Despite its negatives, hardware partitioning is fairly simple to implement, useful, and still used. It or something like it, I speculate, is effectively what will be used for initial <a href="http://perilsofparallel.blogspot.com/2010/11/nvidia-past-future-and-circular.html">"virtualization" of GPUs</a> when that starts appearing.<br />
</span><br />
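<span xmlns="">To make that concrete, here is a toy model of console-controlled partitioning – purely a sketch of the idea, with made-up names; it is not any vendor's actual firmware or register interface. The point is that the partition table lives below anything the partitions' own software can reach, and the "hardware" consults it on every inter-board access.<br />
</span><br />
<pre>
/* Toy model of board-level hardware partitioning.  A console-controlled
 * table records which partition each board belongs to; the interconnect
 * only allows traffic between boards in the same partition.  Illustrative
 * only -- not any real machine's firmware interface. */
#include <stdio.h>

#define NBOARDS 16

static int partition_of[NBOARDS];   /* writable only via the privileged console */

static void console_assign(int board, int partition)
{
    partition_of[board] = partition;
}

static int traffic_allowed(int from_board, int to_board)
{
    return partition_of[from_board] == partition_of[to_board];
}

int main(void)
{
    for (int b = 0; b < NBOARDS; b++)
        console_assign(b, b < 4 ? 0 : 1);   /* a 4-board and a 12-board partition */

    printf("board 1 -> board 3: %s\n", traffic_allowed(1, 3) ? "ok" : "blocked");
    printf("board 1 -> board 9: %s\n", traffic_allowed(1, 9) ? "ok" : "blocked");
    return 0;
}
</pre>
<br />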
<h2><span xmlns="">Level 2: Virtual Machines<br />
</span></h2><span xmlns="">This is the level of VMware and its kissin' cousins. All the hardware is shared <em>en masse</em>, and a special layer of software, a hypervisor, creates the illusion of multiple completely separate hardware platforms. Each runs its own copy of an operating system and any applications above that, and (ideally) none even knows that the others exist. I've <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">previously written</a> about how this trick can be performed without degrading performance to any significant degree, so won't go into it here.<br />
</span><br />
<span xmlns="">The good news here is that you can create as many virtual machines as you like, independent of the number of physical processors and other physical resources – at least until you run out of resources. The hypervisor usually contains a scheduler that time-slices among processors, so sub-processor allocation is available. With the right hardware, IO can also fractionally allocated (again, see my <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">prior posts</a>).<br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEaDBxUddx7is9Wk9vINdLMDn5eEdQQYoRq4CDFH3JxwzbgKNVQeh_S150eHya2lnD4WnoXcZFA14_ZYLvET7DMV3-zgCwH579ipIzhsis3FvBDiDOyj09Mw870PZyLf2ML2MMkjxGhvY/s1600/Virtual+Machines.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="182" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEaDBxUddx7is9Wk9vINdLMDn5eEdQQYoRq4CDFH3JxwzbgKNVQeh_S150eHya2lnD4WnoXcZFA14_ZYLvET7DMV3-zgCwH579ipIzhsis3FvBDiDOyj09Mw870PZyLf2ML2MMkjxGhvY/s320/Virtual+Machines.png" width="320" /></a></div><span xmlns=""><br />
</span><br />
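To make the sub-processor allocation point concrete, here is a minimal toy scheduler – my own sketch, not how VMware or any real hypervisor is built – that hands out time slices to virtual machines in proportion to their weights, so four VMs can happily share two physical processors.<br />
<pre>
# Toy proportional-share scheduler (a sketch, not a real hypervisor): each
# tick, the virtual machines furthest behind their weighted fair share get
# to run on the available physical CPUs.

import heapq

def simulate(shares, num_cpus, ticks):
    """shares: {vm_name: weight}; returns time slices actually granted per VM."""
    granted = {vm: 0 for vm in shares}
    for _ in range(ticks):
        backlog = [(granted[vm] / shares[vm], vm) for vm in shares]
        for _, vm in heapq.nsmallest(num_cpus, backlog):
            granted[vm] += 1            # one time slice on one physical CPU
    return granted

# Four equally weighted VMs sharing two physical CPUs for 1000 ticks:
print(simulate({"a": 1, "b": 1, "c": 1, "d": 1}, num_cpus=2, ticks=1000))
# roughly 500 slices each, i.e. each VM sees about half a physical processor
</pre>
<br />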
<span xmlns="">The bad news is that you generally get much less hardware fault isolation than with hardware partitioning; if the hardware croaks, well, it's one basket and those eggs are scrambled. Very sophisticated hypervisors can help with that when there is appropriate hardware support (mainframe customers do get something for their money). In addition, and this is certainly obvious after it's stated: If you put N virtual machines on one physical machine, you are now faced with all the management pain of managing all N copies of the operating system and its applications.<br />
</span><br />
<span xmlns="">This is the level often used in so-called desktop virtualization. In that paradigm, individuals don't own hardware, their own PC. Instead, they "own" a block of bits back on a server farm that happens to be the description of a virtual machine, and can request that their virtual machine be run from whatever terminal device happens to be handy. It might actually run back on the server, or might run on a local machine after downloading. Many users absolutely loathe this; they want to own and control their own hardware. Administrators like it, a lot, since it lets them own, and control, the hardware.<br />
</span><br />
<h2><span xmlns="">Level 3: Containers<br />
</span></h2><span xmlns="">This level was, as far as I know, originally developed by Sun Microsystems (Oracle), so I'll use their name for it: Containers. IBM (in AIX) and probably others also provide it, under different names. <br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjABelWEuWfTf0qraAvcgGtfCxG7aE8qB_GgNW1AivGpO5fvld_-DG1mBqLFoxGW_5Rs0tCkV7qHFYNaefWSUJMp520u8PC7PZYVky5f8t_j25SF1kX7oNpIq_p6xw63irgES-TXwZpYBA7/s1600/Containers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="109" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjABelWEuWfTf0qraAvcgGtfCxG7aE8qB_GgNW1AivGpO5fvld_-DG1mBqLFoxGW_5Rs0tCkV7qHFYNaefWSUJMp520u8PC7PZYVky5f8t_j25SF1kX7oNpIq_p6xw63irgES-TXwZpYBA7/s320/Containers.png" width="320" /></a></div><span xmlns="">With containers, you have one copy of the operating system code, but it provides environments, containers, which act like separate copies of the OS. In Unix/Linux terms, each container has its own file system root (including IO), process tree, shared segment naming space, and so on. So applications run as if they were running on their own copy of the operating system – but they are actually sharing one copy of the OS code, with common but separate OS data structures, etc.; this provides significant resource sharing that helps the efficiency of this level.<br />
</span><br />
<span xmlns="">This is quite useful if you have applications or middleware that were written under the assumption that they were going to run on their own separate server, and as a result, for example, all use the same name for a temporary file. Were they run on the same OS, they would clobber each other in the common /tmp directory; in separate containers, they each have their own /tmp. More such applications exist than one would like to believe; the most quoted case is the Apache web server, but my information on that may be out of date and it may have been changed by now. Or not, since I'm not sure what the motivation to change would be. <br />
</span><br />
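Here is a toy illustration of that /tmp point – purely my own sketch, not how Solaris Containers or anyone else's implementation actually works – in which each "container" gets its own file system root, so two applications that both insist on the same absolute path stay out of each other's way.<br />
<pre>
# Toy per-container file system roots (an illustration, not a real container
# implementation): both apps write "/tmp/scratch.dat", but each container
# re-roots that path under its own private directory, so nothing is clobbered.

import os, tempfile

class Container:
    def __init__(self, name, base):
        self.root = os.path.join(base, name)            # this container's "/"
        os.makedirs(os.path.join(self.root, "tmp"), exist_ok=True)

    def open(self, path, mode="r"):
        # Every absolute path the application uses is re-rooted in the container.
        return open(self.root + path, mode)

base = tempfile.mkdtemp()
app1, app2 = Container("app1", base), Container("app2", base)

with app1.open("/tmp/scratch.dat", "w") as f:
    f.write("app1's temporary data")
with app2.open("/tmp/scratch.dat", "w") as f:
    f.write("app2's temporary data")                    # does NOT clobber app1's file

print(open(os.path.join(base, "app1", "tmp", "scratch.dat")).read())   # app1's data, intact
</pre>
<br />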
<span xmlns="">I suspect container technology was originally developed in the Full Moon cluster single-system-image project, which needs similar capabilities. See my <a href="http://perilsofparallel.blogspot.com/2009/01/multi-multicore-single-system-image.html">much earlier post</a> about single-system-image if you want more information on such things.<br />
</span><br />
<span xmlns="">In addition, there's just one real operating system to manage in this case, so management headaches are somewhat lessened. You do have to manage all those containers, so it isn't an N:1 advantage, but I've heard customers say this is a significant management savings.<br />
</span><br />
<span xmlns="">A perhaps less obvious example of containerization is the multiuser BASIC systems that flooded the computer education system several decades back. There was one copy of the BASIC interpreter, run on a small minicomputer and used simultaneously by many students, each of whom had their own logon ID and wrote their own code. And each of whom could botch things up for everybody else with the wrong code that soaked up the CPU. (This happened regularly in the "computer lab" I supervised for a while.) I locate this in the container level rather than higher in the stack because the BASIC interpreter really was the OS: It ran on the bare metal, with no supervisor code below it.<br />
</span><br />
<span xmlns="">Of course, fault isolation at this level is even less than in the prior cases. Now if the OS crashes, all the containers go down. (Or if the wrong thing is done in BASIC…) In comparison, an OS crash in a virtual machine is isolated to that virtual machine.<br />
</span><br />
<h2><span xmlns="">Level 4: Software Virtual Machines<br />
</span></h2><span xmlns="">We've reached the JVM level. It's also the .NET level, the Lisp level, the now more usual BASIC level, and even the CICS (and so on): the level of more-or-less programming-language based independent computing environments. Obviously, multiple of these can be run as applications under a single operating system image, each providing a separate environment for the execution of applications. At least this can be done in theory, and in many cases in practice; some environments were implemented as if they owned the computer they run on.<br />
</span><br />
<span xmlns="">What you get out of this is, of course, a more standard programming environment that can be portable – run on multiple computer architectures – as well as extensions to a machine environment that provide services simplifying application development. Those extensions are usually the key reason this level is used. There's also a bit of fault tolerance, since if one of those dies of a fault in its support or application code, it need not always affect others, assuming a competent operating system implementation.<br />
</span><br />
<span xmlns="">Fault isolation at this level is mostly software only; if one JVM (say) crashes, or the code running on it crashes, it usually doesn't affect others. Sophisticated hardware / firmware / OS can inject the ability to keep many of the software VMs up if a failure occurred that only affected one of them. (Mainframe again.)<br />
</span><br />
<h2><span xmlns="">Level 5: Multitenant / Multiuser Environment<br />
</span></h2><span xmlns="">Many applications allow multiple users to log in, all to the same application, with their own profiles, data collections, etc. They are legion. Examples include web-based email, Facebook, Salesforce.com, Worlds of Warcraft, and so on. Each user sees his or her own data, and thinks he / she is doing things isolated from others except at those points where interaction is expected. They see their own virtual system – a very specific, particularized system running just one application, but a system apparently isolated from all others in any event.<br />
</span><br />
<span xmlns="">The advantages here? Well, people pay to use them (or put up with advertising to use them). Aside from that, there is potentially massive sharing of resources, and, concomitantly, care must be taken in the software and system architecture to avoid massive sharing of faults.<br />
</span><br />
<h2><span xmlns="">All Together Now<br />
</span></h2><span xmlns="">Yes. You can have all of these levels of virtualization active simultaneously in one system: A hardware partition running a hypervisor creating a virtual machine that hosts an operating system with containers that each run several programming environments executing multi-user applications.<br />
</span><br />
<span xmlns="">It's possible. There may be circumstances where it appears warranted. I don't think I'd want to manage it, myself. Imagining a performance tuning on a 5-layer virtualization cake makes me shudder. I once had a television system that had two volume controls in series: A cable set-top box had its volume control, feeding an audio system with its own. Just those two levels drove me nuts until I hit upon a setting of one of them that let the other, alone, span the range I wanted.<br />
</span><br />
<h2><span xmlns="">Virtualization and Cloud Computing<br />
</span></h2><span xmlns="">These levels relate to the usual IaaS/PaaS/SaaS (Infrastructure / Platform / Software as a Service) distinctions discussed in cloud computing circles, but are at a finer granularity than those.<br />
</span><br />
<span xmlns="">IaaS relates to the bottom two layers: hardware partitioning and virtual machines. Those two levels, particularly virtual machines, make it possible to serve up raw computing infrastructure (machines) in a way that can utilize the underlying hardware far more efficiently than handing customers whole computers that they aren't going to use 100% of the time. As I've pointed out <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">elsewhere</a>, it is not a logical necessity that a cloud use this or some other form of virtualization; but in many situations, it is an economic necessity. <br />
</span><br />
<span xmlns="">Software virtual machines are what PaaS serves up. There's a fairly close correspondence between the two concepts.<br />
</span><br />
<span xmlns="">SaaS is, of course, a Multiuser environment. It may, however, be delivered by using software virtual machines under it.<br />
</span><br />
<span xmlns="">Containers are a mix of IaaS and PaaS. It's doesn't provide pure hardware, but a plain OS is made available, and that can certainly be considered a software platform. It is, however, a fairly barren environment compared with what software virtual machines provide..<br />
</span><br />
<h2><span xmlns="">Conclusion<br />
</span></h2><span xmlns="">This post has been brought to you by my poor head, which aches every time I encounter yet another discussion over whether and how various forms of cloud computing do or do not use virtualization. Hopefully it may help clear up some of that confusion.<br />
</span><br />
<span xmlns="">Oh, yes, and the obvious conclusion: There's more than one kind of virtualization, out there, folks.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-88522907202722116072010-11-15T17:33:00.000-07:002010-11-15T17:33:24.044-07:00The Cloud Got GPUs<span xmlns=""></span><br />
<span xmlns="">Amazon just announced, on the first full day of SC10 (SuperComputing 2010), the availability of Amazon EC2 (cloud) machine instances with dual Nvidia Fermi GPUs. According to Amazon's <a href="http://aws.amazon.com/ec2/hpc-applications/">specification</a> of instance types, this "Cluster GPU Quadruple Extra Large" instance contains:<br />
</span><br />
<ul><li><span xmlns="">22 GB of memory<br />
</span></li>
<span xmlns="">
<li>33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)<br />
</li>
<li>2 x NVIDIA Tesla "Fermi" M2050 GPUs<br />
</li>
<li>1690 GB of instance storage<br />
</li>
<li>64-bit platform<br />
</li>
<li>I/O Performance: Very High (10 Gigabit Ethernet)<br />
</li>
</span></ul><span xmlns="">So it looks like the future virtualization features of CUDA really are for purposes of using GPUs in the cloud, as I mentioned in <a href="http://perilsofparallel.blogspot.com/2010/11/nvidia-past-future-and-circular.html">my prior post</a>.<br />
</span><br />
<span xmlns="">One of these XXXXL instances costs $2.10 per hour for Linux; Windows users need not apply. Or, if you reserve an instance for a year – for $5630 – you then pay just $0.74 per hour during that year. (Prices quoted from Amazon's <a href="http://aws.amazon.com/ec2/">price list</a> as of 11/15/2010; no doubt it will decrease over time.)<br />
</span><br />
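Just to put those prices in perspective, here's the break-even arithmetic, using only the numbers quoted above:<br />
<pre>
# When does the 1-year reserved instance beat plain on-demand pricing?
# (Prices as quoted above, 11/15/2010.)

on_demand    = 2.10          # $/hour, Cluster GPU instance, Linux, on demand
reserved_fee = 5630.0        # $ up front for a 1-year reservation
reserved_hr  = 0.74          # $/hour while reserved

break_even_hours = reserved_fee / (on_demand - reserved_hr)
hours_in_year = 365 * 24

print(f"Break-even: {break_even_hours:.0f} hours "
      f"({100 * break_even_hours / hours_in_year:.0f}% of the year)")
# roughly 4,140 hours - you need to run it nearly half the year before
# reserving is cheaper than just paying $2.10/hour as you go.
</pre>
<br />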
<span xmlns="">This became such hot news that GPU was a trending topic on Twitter for a while.<br />
</span><br />
<span xmlns="">For those of you who don't watch such things, many of the <a href="http://www.top500.org/">Top500</a> HPC sites – the 500 supercomputers worldwide that are the fastest at the Linpack benchmark – have nodes featuring Nvidia Fermi GPUs. This year that list notoriously includes, in the top slot, the system causing the heaviest breathing at present: The Tianhe-1A at the National Supercomputer Center in Tianjin, in China.<br />
</span><br />
<span xmlns="">I wonder how well this will do in the market. Cloud elasticity – the ability to add or remove nodes on demand – is usually a big cloud selling point for commercial use (expand for holiday rush, drop nodes after). How much it will really be used in HPC applications isn't clear to me, since those are usually batch mode, not continuously operating, growing and shrinking, like commercial web services. So it has to live on price alone. The price above doesn't feel all that inexpensive to me, but I'm not calibrated well in HPC costs these days, and don't know how much it compares with, for example, the cost of running the same calculation on Teragrid. Ad hoc, extemporaneous use of HPC is another possible use, but, while I'm sure it exists, I'm not sure how much exists.<br />
</span><br />
<span xmlns="">Then again, how about services running games, including the rendering? I wonder if, for example, the communications secret sauce used by <a href="http://perilsofparallel.blogspot.com/2010/07/onlive-follow-up-bandwidth-and-cost.html">OnLive</a> to stream rendered game video fast enough for first-person shooters can operate out of Amazon instances. Even if it doesn't, games that can tolerate a tad more latency may work. Possibly games targeting small screens, requiring less rendering effort, are another possibility. That could crater startup costs for companies offering games over the web.<br />
</span><br />
<span xmlns="">Time will tell. For accelerators, we certainly are living in interesting times.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com11tag:blogger.com,1999:blog-3155908228127841862.post-24574547911828475562010-11-11T19:09:00.003-07:002010-11-12T19:55:40.625-07:00Nvidia Past, Future, and Circular<span xmlns=""></span><br />
<span xmlns="">I'm getting tired about writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one, getting it all out of the way.<br />
</span><br />
<h2><span xmlns=""><span class="Apple-style-span" style="font-size: medium;">Past Fermi Product Mix</span><br />
</span></h2>For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from <a href="http://investorvillage.com/smbd.asp?mb=476&mn=191714&pt=msg&mid=9688438">Investor Village</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs2hLWC7oAR98COgMP9eS_NTebcM7CeeLYhbcORxVFHo177BP_bn79CyyuYnLIf8WY5IqOcTr9tu-XGhkg_2shi0SQ4jpP_2Jr7W4aR0E0mGbewlP-yP1YxDSZP3SXbqrM_H8qM-xIJbhd/s1600/Nvidia+Normalized+Share.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="409" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs2hLWC7oAR98COgMP9eS_NTebcM7CeeLYhbcORxVFHo177BP_bn79CyyuYnLIf8WY5IqOcTr9tu-XGhkg_2shi0SQ4jpP_2Jr7W4aR0E0mGbewlP-yP1YxDSZP3SXbqrM_H8qM-xIJbhd/s640/Nvidia+Normalized+Share.gif" width="640" /></a></div><br />
<span xmlns="">Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've <a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">pointed out</a>, this will be a real problem as Intel's and AMD's <a href="http://arstechnica.com/business/news/2010/11/with-fusion-amds-devils-are-in-the-details.ars?comments=1">on-die GPUs</a> assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already <a href="http://www.crn.com/news/components-peripherals/228200687/amd-shows-off-new-fusion-chips.htm;jsessionid=VxmcxwYh1WgyGA2ibwok3w**.ecappj02">started shipping</a> its Zacate integrated-GPU chip to manufacturers.<br />
</span><br />
<h2><span xmlns=""><span class="Apple-style-span" style="font-size: medium;">Future Fermis</span><br />
</span></h2><span xmlns="">Recently Fermi's chief executive Jen-Hsun Huang gave an <a href="http://www.zdnet.co.uk/news/application-development/2010/10/26/nvidia-looks-to-the-future-of-gpu-computing-40090520/">interview</a> on what they are looking at for future features in the Fermi architecture. Things he mentioned were: (a) More development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in that order:<br />
</span><br />
<span xmlns=""><strong>More CUDA:</strong> When asked why not OpenCL, he said because other people are working on OpenCL and they're the only ones doing CUDA. This answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was why they don't work to make OpenCL, a standard, work as well as their proprietary CUDA on their gear? Of course the answer is that OpenCL doesn't get them lock-in, which one doesn't say in an interview.<br />
</span><br />
<span xmlns=""><strong>Virtual memory and pre-emption:</strong> A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There probably is some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: Cloud-based systems nearly all use <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">virtual machines</a> (for very good reason; see the link), splitting one each system node into N virtual machines. Virtual memory and pre-emption allows the GPU to participate in that virtualization. The virtual memory part is, I would guess, more intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [<b>UPDATE:</b> Just after this was published, John Carmak (of Id Software ) wrote a <a href="http://media.armadilloaerospace.com/misc/gpuDataPaging.htm">piece</a> laying out the case for paging into GPUs. So that may be useful in games and generally.]</span><br />
<span xmlns=""><br />
</span><br />
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"> </span><span xmlns=""><strong>Direct InfiniBand attachment:</strong> At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory in other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see if everybody's done. (g) If not, done, shove resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think that the copying to and from main memory could be avoided since the GPUs are the ones doing all the computing: Just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice, however, it's not quite that straightforward; but it is at least worth looking at.<br />
</span><br />
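If it helps, here's a runnable toy rendition of that cycle – numpy standing in for the GPU, and plain Python lists standing in for both node memories and the interconnect; none of this is a real GPU or MPI API – which at least makes it clear where the host-memory copies sit that a direct GPU-to-interconnect path would try to eliminate.<br />
<pre>
# Toy 2-node version of the (a)-(g) cycle above; numpy is the "GPU" and a
# Python list of per-node arrays is the "interconnect".  Purely illustrative.

import numpy as np

def gpu_compute(chunk, left_halo, right_halo):
    # (b) the "GPU" step: average each cell with its neighbors
    padded = np.concatenate(([left_halo], chunk, [right_halo]))
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

nodes = [np.array([0.0, 0.0, 0.0, 0.0]), np.array([12.0, 12.0, 12.0, 12.0])]
for step in range(50):                                     # (f)/(g) outer loop
    new_nodes = []
    for i, host_chunk in enumerate(nodes):
        dev_chunk = host_chunk.copy()                      # (a) host -> "GPU"
        # (d) halo values arrive "over the interconnect" from neighboring nodes
        left  = nodes[i - 1][-1] if i != 0 else host_chunk[0]
        right = nodes[i + 1][0] if i != len(nodes) - 1 else host_chunk[-1]
        dev_result = gpu_compute(dev_chunk, left, right)   # (b) compute
        new_nodes.append(dev_result)                       # (c)+(e) back into host memory
    nodes = new_nodes

print(nodes)   # both chunks relax toward 6.0 as values diffuse across the nodes
</pre>
<br />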
<span xmlns="">So, that's what may be new in Nvidia CUDA / Fermi land. Each of those are at least marginally justifiable, some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of dueling Nvidia / AMD (ATI) announcements of about a year ago.<br />
</span><br />
<span xmlns="">That was the time of the <a href="http://insidehpc.com/2009/09/30/nvidia-next-generation-gpu-fermi-targets-hpc-supercomputing/?utm_source=bitly&utm_medium=twitter&utm_campaign=socialmedia">Fermi announcement</a>, which compared with prior Nvidia hardware doubled everything, yada yada, and added… ECC. And support for C++ and the like, and good speed double-precision floating-point.<br />
</span><br />
<span xmlns="">At that time, <a href="http://techreport.com/articles.x/17618/1">Tech Report</a> said that the AMD Radeon HD 5870 doubled everything, yada again, and added … a fancy new anisotropic filtering algorithm for smoothing out texture applications at all angles, and supersampling to better avoid antialiasing.<br />
</span><br />
<span xmlns="">Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?<br />
</span><br />
<h2><span xmlns=""><span class="Apple-style-span" style="font-size: medium;">The Wheel of Reincarnation</span><br />
</span></h2><span xmlns="">The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by <a href="http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland-design-of-display-processors.pdf">T. H. Meyers and Ivan Sutherland</a>. There are probably hundreds of renditions of it floating around the web; here's mine.<br />
</span><br />
<span xmlns="">Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.<br />
</span><br />
<span xmlns="">So, you beef up your IO device, like by adding the ability to go through a whole list of X, Y locations and putting dots up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains. So you tell it to repeat. But that's really limiting. It would be much more convenient if you could tell the device to go do another list all by itself, like by embedding the next list's address in block of X,Y data. This takes a bit of thought, since it means adding a code to everything, so the device can tell X,Y pairs from next-list addresses; but it's clearly worth it, so in it goes.<br />
</span><br />
<span xmlns="">Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.<br />
</span><br />
<span xmlns="">Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.<br />
</span><br />
<span xmlns="">At some stage it looks really useful to add conditionals, too, so…<br />
</span><br />
<span xmlns="">Somewhere along the line, to make this a 21<sup>st</sup> century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.<br />
</span><br />
<span xmlns="">Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work. <br />
</span><br />
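If you want to see just how little it takes, here's a toy rendition of that accreted "instruction set" – my own sketch, not the Myer / Sutherland design – with dots, jumps to another list, and subroutine calls for reused shapes. By this point it is, of course, a small computer.<br />
<pre>
# Toy display-list "processor" (illustrative only): DOT puts a point in the
# frame buffer, JUMP chains to another list, CALL/RET reuse a shape such as
# a text character, HALT ends the frame.

def run(program, start, framebuffer, limit=10_000):
    pc, call_stack = start, []
    for _ in range(limit):                  # bounded, in lieu of a refresh loop
        op = program[pc]
        if op[0] == "DOT":                  # put a dot at (x, y)
            framebuffer.add((op[1], op[2]))
            pc += 1
        elif op[0] == "JUMP":               # go do another list
            pc = op[1]
        elif op[0] == "CALL":               # draw a reused shape
            call_stack.append(pc + 1)
            pc = op[1]
        elif op[0] == "RET":
            pc = call_stack.pop()
        elif op[0] == "HALT":
            return
        else:
            raise ValueError(f"unknown op {op}")

# A two-dot "shape" subroutine at address 0; a main list at address 3 draws
# it, adds a dot of its own, then draws the shape again.
program = {
    0: ("DOT", 1, 1), 1: ("DOT", 2, 1), 2: ("RET",),
    3: ("CALL", 0), 4: ("DOT", 5, 5), 5: ("CALL", 0), 6: ("HALT",),
}
fb = set()
run(program, start=3, framebuffer=fb)
print(sorted(fb))    # [(1, 1), (2, 1), (5, 5)]
</pre>
<br />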
<span xmlns="">And it's spending all its time doing nothing but putting silly dots on a screen. <br />
</span><br />
<span xmlns="">How about freeing it up to do something more useful by adding a separate device to it to do that?<br />
</span><br />
<span xmlns="">This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.<br />
</span><br />
<span xmlns="">Every incremental stage in this process was very well-justified, and Meyers and Sutherland say they saw examples (in 1968!) of systems that were more than twice around the wheel: A graphics processor hanging on a graphics processor hanging on a graphics processor. These multi-cycles are often justified if there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: It's got a graphics processor on a processor that uses a server somewhere else.<br />
</span><br />
<span xmlns="">I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans and Sutherland Computing Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system on their second planned graphics system (LDS-2). It was, as was asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.<br />
</span><br />
<span xmlns="">Just like Nvidia is talking about attaching InfiniBand directly to its cards.<br />
</span><br />
<span xmlns="">Also, in the mid-80s in IBM Research, after the successful completion of an effort to build special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.<br />
</span><br />
<span xmlns="">Just like Nvidia is adding virtualization to its systems.<br />
</span><br />
<span xmlns="">Each incremental step is justified – that's always the case with the wheel – just as in the discussion above, I showed a justification for every general-purpose additions to Nvidia architecture are justifiable.<br />
</span><br />
<span xmlns="">The issue here is not that this is all necessarily bad. It just <strong><em>is</em></strong>. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not. <br />
</span><br />
<span xmlns="">With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase after ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have, usually with far more resources than you have, been playing the general-purpose game for decades.<br />
</span><br />
<span xmlns="">It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com12tag:blogger.com,1999:blog-3155908228127841862.post-88691594080389343392010-10-17T19:08:00.000-06:002010-10-17T19:08:38.664-06:00RIP, Benoit Mandelbrot, father of fractal geometryBenoit Mandelbrot, father of fractal geometry, has died.<br />
<br />
See my post about him, and my interaction with him, in my mostly non-technical blog, Random Gorp: <a href="http://randomgorp.blogspot.com/2010/10/rip-benoit-mandelbrot-father-of-fractal.html">RIP, Benoit Mandelbrot, father of fractal geometry</a>.Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com0tag:blogger.com,1999:blog-3155908228127841862.post-31650721601721312822010-09-04T12:23:00.001-06:002010-09-08T20:26:24.093-06:00Intel Graphics in Sandy Bridge: Good Enough<span xmlns=""></span><br />
<span xmlns="">As I and others expected, Intel is gradually rolling out how much better the graphics in its next generation will be. Anandtech got an early demo part of Sandy Bridge and <a href="http://www.anandtech.com/show/3871/the-sandy-bridge-preview-three-wins-in-a-row/7">checked out</a> the graphics, among other things. The results show that the "good enough" performance I argued for in my prior post (<a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">Nvidia-based Cheap Supercomputing Coming to an End</a>) will be good enough to sink third party low-end graphics chip sets. So it's good enough to hurt Nvidia's business model, and make their HPC products fully carry their own development burden, raising prices notably.<br />
<br />
The net is that for this early chip, with early device drivers, at a low but usable resolution (1024x768) there's adequate performance on games like "Batman: Arkham Asylum," "Call of Duty MW2," and a bunch of others, significantly including "World of Warcraft." And <a href="http://hothardware.com/News/Intels-NextGeneration-GPU-Will-Play-Bluray-3D-/">it'll play Blu-ray 3D</a>, too.<br />
<br />
Anandtech's conclusion is "If this is the low end of what to expect, I'm not sure we'll need more than integrated graphics for non-gaming specific notebooks." I agree. I'd add desktops, too. Nvidia isn't standing still, of course; on the low end they are saying <a href="http://venturebeat.com/2010/09/02/nvidias-new-graphics-chips-will-give-you-laptops-with-long-battery-life-and-3d/">they'll do 3D, too, and will save power</a>. But integrated graphics are, effectively, free. They'll be there anyway. Everywhere. And as a result, everything will be tuned to work best on them among the PC platforms; that's where the volumes will be.<br />
<br />
Some comments I've received elsewhere on my prior post have been along the lines of "but Nvidia has such a good computing model and such good software support – Intel's rotten IGP can't match that." True. I agree. But.<br />
<br />
There's a long history of ugly architectures dominating clever, elegant architectures that are superior targets for coding and compiling. Where are the RISC-based CAD workstations of 15+ years ago? They turned into PCs with graphics cards. The DEC Alpha, MIPS, Sun SPARC, IBM POWER and others, all arguably far better exemplars of the computing art, have been trounced by X86, which nobody would call elegant. Oh, and the IBM zSeries, also high on the inelegant ISA scale, just keeps truckin' through the decades, <a href="http://gigaom.com/2010/09/01/holy-smokes-at-5-2-ghz-ibm-chip-is-super-fast/">most recently</a> at an astounding 5.2 GHz. <br />
<br />
So we're just repeating history here. Volume, silicon technology, and market will again trump elegance and computing model.</span><br />
<span xmlns=""><br />
</span><br />
<span xmlns=""><b><span class="Apple-style-span" style="color: red;">PostScript</span></b>: According to <a href="http://www.bloomberg.com/news/2010-09-08/intel-to-show-off-new-chip-with-graphics-tackle-amd-challenge.html">Bloomberg</a>, look for a demo at Intel Developer Forum next week.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com10tag:blogger.com,1999:blog-3155908228127841862.post-65193477543155498762010-08-11T22:05:00.000-06:002010-08-11T22:05:01.192-06:00Nvidia-based Cheap Supercomputing Coming to an EndNvidia's CUDA has been hailed as "<a href="http://www.drdobbs.com/high-performance-computing/207200659">Supercomputing for the Masses</a>," and with good reason. Amazing speedups on scientific / technical code have been reported, ranging from a mere 10X through hundreds. It's become a darling of academic computing and a <a href="http://www.militaryaerospace.com/index/display/article-display/3258028386/articles/military-aerospace-electronics/online-news-2/2010/8/darpa-looks_to_four.html">major player in DARPA's Exascale program</a>, but performance alone is not the reason; it's price. For that computing power, they're incredibly cheap. As Sharon Glotzer of UMich <a href="http://perilsofparallel.blogspot.com/2010/05/all-hail-gpu-tweetstream.html">noted</a>, "Today you can get 2GF for $500. That is <strong><em>ridiculous</em></strong>." It is indeed. And it's only possible because CUDA is subsidized by sinking the fixed costs of its development into the high volumes of Nvidia's mass market low-end GPUs.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd-7AMpS212MVhhIGW_3ZA5DEN_cge6leiktF6DHgBbdWdqpy5476uMXOpwkz67fNRKULCVlyrYCF6aQxmpRFz5_8nFjtIqU97XQ44R1Jleiy9K2Cb8xPy35kiLp1EuGKixVRD84tjsV2l/s1600/package-integrated+graphics.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd-7AMpS212MVhhIGW_3ZA5DEN_cge6leiktF6DHgBbdWdqpy5476uMXOpwkz67fNRKULCVlyrYCF6aQxmpRFz5_8nFjtIqU97XQ44R1Jleiy9K2Cb8xPy35kiLp1EuGKixVRD84tjsV2l/s200/package-integrated+graphics.png" width="190" /></a><span xmlns=""> <br />
Unfortunately, that subsidy won't last forever; its end is now visible. Here's why:<br />
<br />
Apparently ignored in the usual media fuss over Intel's next and greatest, Sandy Bridge, is the integration of Intel's graphics onto the same die as the processor chip.<br />
<br />
The current best integration is onto the same package, as illustrated by the current champion, Clarkdale (a.k.a. Westmere), shown in the photo on the right. As illustrated, the processor is in 32nm silicon technology, and the graphics, with memory controller, is in 45nm silicon technology. Yes, the graphics and memory controller is the larger chip.<br />
</span><br />
<span xmlns="">Intel has not been touting higher graphics performance from this tighter integration. In fact, Intel's press releasers for Clarkdale claimed that being on two die wouldn't reduce performance because they were in the same package. But unless someone has changed the laws of physics as I know them, that's simply false; at a minimum, eliminating off-chip drivers will reduce latency substantially. Also, being on the same die as the processor implies the same process, so graphics (and memory control) goes all the way from 45nm to 32nm, the same as the processor, in one jump; this certainly will also result in increased performance. For graphics, this is a very loud the Intel "Tock" in its "Tick-Tock" (architecture / silicon) alternation.<br />
<br />
So I'll semi-fearlessly predict some demos of midrange games out of Intel when Sandy Bridge is ready to hit the streets, which hasn't been announced in detail aside from being in 2011.<br />
<br />
Probably not coincidentally, mid-2011 is when AMD's Llano processor sees daylight. Also in 32nm silicon, it incorporates enough graphics-related processing to be an apparently decent DX11 GPU, although to my knowledge the architecture hasn't been disclosed in detail. <br />
<br />
Both of these are lower-end units, destined for laptops, and intent on keeping a tight power budget; so they're not going to run high-end games well or be a superior target for HPC. It seems that they will, however, provide at least adequate low-end, if not midrange, graphics.<br />
<br />
Result: All of Nvidia's low-end market disappears by the end of next year. <br />
<br />
As long as passable performance is provided, integration into the processor equates with "free," and you can't beat free. Actually, it equates with cheaper than free, since there's one less chip to socket onto the motherboard, eliminating socket space and wiring costs. The power supply will probably shrink slightly, too.<br />
<br />
This means the end of the low-end graphics subsidy of high-performance GPGPUs like Nvidia's CUDA. It will have to pay its own way, with two results: <br />
<br />
First, prices will rise. It will no longer have a huge advantage over purpose-built HPC gear. The market for that gear is certainly expanding. In a <a href="http://lecture2go.uni-hamburg.de/konferenzen/-/k/10940;jsessionid=27E645CA37F378B28913594781846FA8">long talk</a> at the 2010 ISC in Berlin, Intel's Kirk Skaugen (VP of Intel Architecture Group and GM, Data Center Group, USA) stated that HPC was now 25% of Intel's revenue – a number double the HPC market I last heard a few years ago. But larger doesn't mean it has anywhere near the volume of low-end graphics.<br />
<br />
DARPA has pumped more money in, with Nvidia leading a $25M chunk of DARPA's Exascale project. But that's not enough to stay alive. (Anybody remember Thinking Machines?)<br />
<br />
The second result will be that Nvidia becomes a much smaller company.<br />
<br />
But for users, it's the loss of that subsidy that will hurt the most. No more supercomputing for the masses, I'm afraid. Intel will have MIC (son of Larrabee); that will have a partial subsidy since it probably can re-use some X86 designs, but that's not the same as large low-end sales volumes.<br />
<br />
So enjoy your "supercomputing for the masses," while it lasts.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com10tag:blogger.com,1999:blog-3155908228127841862.post-80455006512079229612010-07-29T16:03:00.009-06:002010-07-31T15:47:45.492-06:00Standards Are About the Money<span xmlns=""></span><br />
<div class="separator" style="clear: both; text-align: center;"><a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn6SfogCr1is3VFR4mkTSygOeUy4JsRECKht1bh-XET71UjosVHdwpTM0ovB-Ud4wS-EYpQbtbIQgcLXW8twU63PUwwwOC0JtKKgpnAdULFc39445tihry23BuOeNeC8z2auG4iZ6_bnIn/s1600/weird+cloud.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><br />
</a></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a bitly="BITLY_PROCESSED" href="http://www.onlineweblibrary.com/blog/?p=588" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn6SfogCr1is3VFR4mkTSygOeUy4JsRECKht1bh-XET71UjosVHdwpTM0ovB-Ud4wS-EYpQbtbIQgcLXW8twU63PUwwwOC0JtKKgpnAdULFc39445tihry23BuOeNeC8z2auG4iZ6_bnIn/s320/weird+cloud.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Nonstandard Cloud</td></tr>
</tbody></table><div class="separator" style="clear: both; text-align: left;">Standards for cloud computing are a never-ending topic of cloud buzz ranging all over the map: <a bitly="BITLY_PROCESSED" href="http://www.informationweek.com/news/software/hosted/showArticle.jhtml?articleID=218500508">APIs</a> (programming interfaces), <a bitly="BITLY_PROCESSED" href="http://www.dmtf.org/about/cloud-incubator">system management</a>, <a bitly="BITLY_PROCESSED" href="http://www.hpcinthecloud.com/news/Major-US-Tech-Firms-Pressure-EU-for-Cloud-Standards-99546924.html?utm_source=twitterfeed&utm_medium=twitter">legal issues</a>, and so on.</div><span xmlns=""> <br />
With a few exceptions where the motivation is obvious (like some legal issues in the EU), most of these discussions miss a key point: Standards are implemented and used if and only if they make money for their implementers. <br />
<br />
Whether customers think they would like them is irrelevant – unless that liking is strong enough to clearly translate into increased sales, paying back the cost of defining and implementing appropriate standards. "Appropriate" always means "as close to my existing implementation as possible" to minimize implementation cost.<br />
<br />
That is my experience, anyway, having spent a number of years as a company representative to the InfiniBand Trade Association and the PCI-SIG, along with some interaction with the PowerPC standard and observation of DMTF and IETF standards processes.<br />
<br />
Right now there's an obvious tension, since cloud customers see clear benefits to having an industry-wide, stable implementation target that allows portability among cloud system vendors, a point well-detailed in the <a bitly="BITLY_PROCESSED" href="http://www.google.com/url?sa=t&source=web&cd=3&ved=0CCIQFjAC&url=http%3A%2F%2Fwww.eecs.berkeley.edu%2FPubs%2FTechRpts%2F2009%2FEECS-2009-28.pdf&ei=e95RTMXWDd7pnQeJoNDTAw&usg=AFQjCNFeMMBSnmai9JnaLW-5qXkVLtb3Dw&sig2=CizXsHGPK1GcDa3MJx1QeA">Berkeley report</a> on cloud computing.<br />
<br />
That's all very nice, but unless the cloud system vendors see where the money is coming from, standards aren't going to be implemented where they count. In particular, when there are major market leaders, like Amazon and Google right now, it has to be worth more to those leaders than the lock-in they get from proprietary interfaces. I've yet to see anything indicating that it will be, so I'm not very positive about cloud standards at the present time.<br />
<br />
But it could happen. The road to any given standard is very often devious, always political, regularly suffused with all kinds of nastiness, and of course ultimately driven throughout by good old capitalist greed. An example I'm rather familiar with is the way InfiniBand came to be, and semi-failed.<br />
<br />
The beginning was a presentation by Justin Rattner at the 1998 Intel Developer Forum, in which he declared Intel's desire for their micros to grow up to be mainframes (mmmm… really juicy profit margins!). He thought they had everything except for IO. Busses were bad. He actually showed a slide with a diagram that could have come right out of an IBM Parallel Sysplex white paper, complete with channels and channel directors (switches) connecting banks of storage with banks of computers. That was where we need to go, he said, at a commodity price point. </span><br />
<span xmlns=""><br />
</span><br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4nCCPLw47TSwlQfCw5yzqdszMnpIpqrHmvjH1OBglj5Z4vrxe26ztgUs-nbOvgd33hdcTgzGYJPnUpysSoY987COpsLRBN4ZBGxcp0tWtQTVB9_Ac1HUS0tsJqUvtOVEUdPiwgOcM9HUD/s1600/ngio+logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="71" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4nCCPLw47TSwlQfCw5yzqdszMnpIpqrHmvjH1OBglj5Z4vrxe26ztgUs-nbOvgd33hdcTgzGYJPnUpysSoY987COpsLRBN4ZBGxcp0tWtQTVB9_Ac1HUS0tsJqUvtOVEUdPiwgOcM9HUD/s200/ngio+logo.png" width="200" /></a><span xmlns=""><br />
<br />
Shortly thereafter, Intel founded the Next Generation IO Forum (NGIO), inviting</span> other companies to join in the creation of this new industry IO standard. That sounds fine, and rather a step better than IBM did when trying to foist Microchannel architecture on the world (a dismal failure), until you read the fine print in the membership agreement. There you find a few nasties. Intel had 51% of every vote. Oh, and if you had any intellectual property (IP) (patents) in the area, it now all belonged to Intel. Several companies did join, like Dell; they like to be "tightly integrated" with their suppliers.<br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80ct1yEB5XdPeSXFWjQKbWPbI-eeywySPCi8FM5WsuM5dH7d5QCaaDr5eGNp3IG9q5uLN4aOMsEn3AiNy6o80o-Ql7Nx9iuug7cN3j0lUAF69zbSDRt7y6d9qwuNz5lY-pcejgQJPpurj/s1600/FIO+logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80ct1yEB5XdPeSXFWjQKbWPbI-eeywySPCi8FM5WsuM5dH7d5QCaaDr5eGNp3IG9q5uLN4aOMsEn3AiNy6o80o-Ql7Nx9iuug7cN3j0lUAF69zbSDRt7y6d9qwuNz5lY-pcejgQJPpurj/s200/FIO+logo.png" width="115" /></a><br />
<span xmlns=""> <br />
A few folks with a tad of IP in the IO area, like IBM and Compaq (RIP), understandably declined to join. But they couldn't just let Intel go off and define something they would then have to license. So a collection of companies – initially Compaq, HP, and IBM – founded the rival Future IO Developer's Forum (FIO). Its membership agreement was much more palatable: One company, one vote; and if you had IP that was used, you had to promise to license it with terms that were "reasonable and nondiscriminatory," a phrase that apparently means something quite specific to IP lawyers.</span><br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGQmTIIXcpwoujgSSRn7KrVxKQ3jRy-s_RlabWFgNVU2kNoSlzogJvhpurzSeqNFtJI5g00bqcRL0cV8V_DzsP-JFjj71QP9FnK4Bd6kO_gazZlo6Lkq2En8mG7FmMrRN9cGheC750LSXE/s1600/ngio+fio+logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGQmTIIXcpwoujgSSRn7KrVxKQ3jRy-s_RlabWFgNVU2kNoSlzogJvhpurzSeqNFtJI5g00bqcRL0cV8V_DzsP-JFjj71QP9FnK4Bd6kO_gazZlo6Lkq2En8mG7FmMrRN9cGheC750LSXE/s200/ngio+fio+logo.png" width="170" /></a><span xmlns=""><br />
<br />
Over the next several months, there was a steady movement of companies out of NGIO and into FIO. When NGIO became only Intel and Dell (still tightly integrated), the two forums merged as the InfiniBand Trade Association (IBTA). They even had a logo for the merger itself! (See picture.) The name "InfiniBand" was dreamed up by a multi-company collection of marketing people, by the way; when a technical group member told them he thought it was a great name (a lie) they looked worried. The IBTA had, in a major victory for the FIO crowd, the same key terms and conditions as FIO. In addition, Robert's Rules of Order were to be used, and most issues were to be decided by a simple majority (of companies).</span><br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBwsPqhSzB3SP7awevwS9yI2E-XYACY6BaFvNOwUEvd1YNgha8h4PdxT5ri3oaTLsOUAtMDoVpVynv1o0gInOpQhPwcIARr-t90cjkwYRCwEElux-37tnUhnnLTHmw8lcOp2roqln73Z4W/s1600/IB+logo.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBwsPqhSzB3SP7awevwS9yI2E-XYACY6BaFvNOwUEvd1YNgha8h4PdxT5ri3oaTLsOUAtMDoVpVynv1o0gInOpQhPwcIARr-t90cjkwYRCwEElux-37tnUhnnLTHmw8lcOp2roqln73Z4W/s200/IB+logo.jpg" width="176" /></a><span xmlns=""><br />
<br />
Any more questions about where the politics comes in? Let's cover devious and nasty with a sub-story:<br />
<br />
While on one of the IBTA groups, during a contentious discussion I happened to be leading for one side, I mentioned I was going on vacation for the next two weeks. The first day I was on vacation a senior-level executive of a company on the other side in the dispute, an executive not at all directly involved in IBTA, sent an email to another senior-level executive in a completely different branch of IBM, a branch with which the other company did a very large amount of business. It complained that I "was not being cooperative" and I had said on the IBTA mailing lists that certain IBM products were bad in some way. The obvious intent was that it be forwarded to my management chain through layers of people who didn't understand (or care) what was really going on, just that I had made this key customer unhappy and had dissed IBM products. At the very least, it would have chewed up my time disentangling the mess left after it wandered around forwards for two weeks (I was on vacation, remember?); at worst, it could have resulted in orders to me to be more "cooperative," and otherwise undermined me within my own company. Fortunately, and to my wife's dismay, I had taken my laptop on vacation and watched email; and a staff guy in the different division intercepted that email, forwarded it directly to me, and asked what was going on. As a result, I could nip it all in the bud.<br />
<br />
It's sad and perhaps nearly unbelievable that precisely the same tactic – complain at a high level through an unrelated management chain – had been used by that same company against someone else who was being particularly effective against them.<br />
<br />
Another, shorter, story: A neighbor of mine who was also involved in a similar inter-company dispute told me that, while on a trip (and he took lots of trips; he was a regional sales manager) he happened to return to his hotel room after checking out and found people going through his trash, looking for anything incriminating.<br />
<br />
Standards can be nasty.<br />
<br />
Anyway, after a lot of the dust settled and IB had taken on a fairly firm shape, Intel dropped development of its IB product. Exactly why was never explicitly stated, but the consensus I heard was that compared with others' implementations in progress it was not competitive. Without the veto power of NGIO, Intel couldn't shape the standard to match what it was implementing. With Intel out, Microsoft followed suit, and the end result was InfiniBand as we see it today: A great interconnect for high-end systems that pervades HPC, but not the commodity-volume server part the founders hoped that it would be. I suspect there are folks at Intel who think they would have been more successful at achieving the original purpose if they had their veto, since then it would have matched their inexpensive parts. I tend to doubt that, since in the meantime PCI has turned into a hierarchical switched fabric (PCI Express), eliminating many of the original problems stemming from it being a bus.<br />
<br />
All this illustrates what standards are really about, from my perspective. Any relationship with pristine technical discussions or providing the "right thing" for customers is indirect, with all motivations leading through money – with side excursions through political, devious, and just plain nasty.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-52776843443315410332010-07-15T13:23:00.000-06:002010-07-15T13:23:18.111-06:00OnLive Follow-Up: Bandwidth and Cost<span xmlns=""></span><br />
<span xmlns="">As mentioned earlier in <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/07/onlive-works-first-use-impressions.html">OnLive Works! First Use Impressions</a>, I've tried <a bitly="BITLY_PROCESSED" href="http://www.onlive.com/">OnLive</a>, and it works quite well, with no noticeable lag and fine video quality. As I've discussed, this could affect GPU volumes, a lot, if it becomes a market force, since you can play high-end games with a low-end PC. However, additional testing has confirmed that users will run into bandwidth and data usage issues, and the cost is not what I'd like for continued use.<br />
<br />
To repeat some background, for completeness: OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. It lets you run the highest-end games on very inexpensive systems, avoiding the cost of a rip-roaring gamer system. I've noted previously that this could hurt the mass market for GPUs, since OnLive doesn't need much graphics on the client. But there were serious questions (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/04/twilight-of-gpu.html">Twilight of the GPU?</a>) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?<br />
<br />
As I said earlier, and can re-confirm: Video, check. I found no problems there; no artifacts, including in displayed text. Lag, hence gameplay, is perfectly adequate, at least for my level of skill. Those with sub-millisecond reflexes might feel otherwise; I can't tell. There's confirmation of the low lag from <a bitly="BITLY_PROCESSED" href="http://www.eurogamer.net/articles/digitalfoundry-onlive-lag-analysis">Eurogamer</a>, which measured it at "150ms - similar to playing … locally". <br />
<br />
</span><br />
<span xmlns=""><h2>Bandwidth<br />
</h2>Bandwidth, on the other hand, does not present a pretty picture.<br />
<br />
When I was playing or watching action, OnLive continuously ran at about 5.8% - 6.4% utilization of a 100 Mb/sec LAN card. (OnLive won't run on WiFi, only on a wired connection.) This rate is very consistent. Displayed image resolution didn't cause it to vary outside that range, whether it was full-screen on my 1600 x 900 laptop display, full-screen on my 1920 x 1080 monitor, or windowed to about half the laptop screen area (which was the window size OnLive picked without input from me). When looking at static text displays, like OnLive control panels, it dropped down to a much smaller amount, in the 0.01% range; but that's not what you want to spend time doing with a system like this. <br />
<br />
I observed these values playing (Borderlands) and watching game trailers for a collection of "coming soon" games like Deus Ex, Drive, Darksiders, Two Worlds, Driver, etc. If you stand still in a non-action situation, it does go down to about 3% (of 100 Mb/sec) for me, but with action games that isn't the point.<br />
<br />
6.4% of 100 Mb/sec is about 2.9 GB (bytes) per hour. That hurts. <br />
<br />
My ISP, Comcast, considers over 250 GB/month "excessive usage" and grounds for terminating your account if you keep doing it regularly. That limit and OnLive's bandwidth together mean that over a 30-day period, Comcast customers can't play more than 3 hours a day without being considered "excessive."<br />
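For the skeptical, here's the arithmetic, using only the numbers above:<br />
<pre>
# 6.4% of a 100 Mb/s link, converted to GB per hour, then held against
# Comcast's 250 GB/month "excessive use" threshold over a 30-day month.

link_bits_per_sec = 0.064 * 100e6             # 6.4% of 100 Mb/s = 6.4 Mb/s
gb_per_hour = link_bits_per_sec / 8 * 3600 / 1e9

cap_gb_per_month = 250.0
hours_per_day = cap_gb_per_month / gb_per_hour / 30

print(f"{gb_per_hour:.1f} GB/hour")           # about 2.9 GB per hour of play
print(f"{hours_per_day:.1f} hours/day")       # about 2.9 hours a day hits the cap
</pre>
<br />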
<br />
<br />
<h2>Prices<br />
</h2>I also found that prices are not a bargain, unless you're counting the money you save using a bargain PC – one that costs, say, what a game console costs.<br />
<br />
First, you pay for access to OnLive itself. For now that can be free, but after a year it's slated to be $4.95 a month. That's scarcely horrible. But you can't play anything with just access; you need to also buy a GamePass for each game you want to play.<br />
<br />
A Full GamePass, which lets you play it forever (or, presumably, as long as OnLive carries the game) is generally comparable to the price of the game itself, or more for the PC version. For example, the Borderlands Full GamePass is $29.99, and the game can be purchased for $30 or less (one site lists it for $3! (plus about $9 shipping)). F.E.A.R. 2 is $19.99 GamePass, and the purchase price is $19-$12. Assassin's Creed II was a loser, with GamePass for $39.99 and purchased game available for $24-$17. The standalone game prices are from online sources, and don't include shipping, so OnLive can net a somewhat smaller total. And you can play it on a cheap PC, right? Hmmm. Or a console.<br />
<br />
There are also, in many cases, 5-day and 3-day passes, typically $9-$7 for 5-day and $4-$6 for 3-day. As a try-before-you-buy, maybe those are OK, but 30-minute free demos are available, too, making a reasonably adequate try available for free.<br />
<br />
Not all the prices are that high. There's something called AAAAAAA, which seems to consist entirely of falling from tall buildings, with a full GamePass for $9.99; and Brain Challenge is $4.99. I'll bet Brain Challenge doesn't use much bandwidth, either.<br />
<br />
The correspondence between Full GamePass and the retail price is obviously no coincidence. I wouldn't be surprised at all to find that relationship to be wired into the deals OnLive has with game publishers. Speculation, since I just don't know: Do the 5 or 3 day pass prices correspond to normal rental rates? I'd guess yes.<br />
<br />
<br />
<h2>Simplicity & the Mac Factor<br />
</h2>A real plus for OnLive is simplicity. Installation is just pure dead simple, and so is starting to play. Not only do you not have to acquire the game, but there's also no installation and no patching; you just select the game, get a GamePass (zero time with a required pre-registered credit card), and go. Instant gratification.<br />
<br />
Then there's the Mac factor. If you have only Apple products – no console and no Windows PC – you are simply shut out of many games unless you pursue the major hassle of BootCamp, which also requires purchasing a copy of Windows and doing the Windows maintenance. But OnLive runs on Macs, so a wide game experience is available to you immediately, without a hassle.<br />
<br />
<br />
<h2>Conclusion<br />
</h2></span><span xmlns="">To sum up:<br />
<br />
Positive: great video quality, great playability, hassle-free instant gratification, and the Mac factor.<br />
<br />
Negative: Marginally competitive game prices (at best) and bandwidth, bandwidth, bandwidth. The cost can be argued, and may get better over time, but your ISP cutting you off for excessive data usage is pretty much a killer.<br />
<br />
So where does this leave OnLive and, as a consequence, the market for GPUs? I think the bandwidth issue says that OnLive will have little impact in the near future.<br />
<br />
However, this might change. Locally, Comcast TV ads showing off their "Xfinity" rebranding had a small notice indicating that 105 Mb/sec data rates would be available in the future. It seems those have disappeared, so maybe it won't happen. But a 10X data rate improvement wouldn't mean much if you also didn't increase the data usage cap, and a 10X usage cap increase would completely eliminate the bandwidth issue.<br />
<br />
Or maybe the Net Neutrality guys will pick this up and succeed. I'm not sure on that one. It seems like trying to get water from a stone if the backbone won't handle it, but who knows?<br />
<br />
The proof, however, will be in the playing – and in the resulting market share – so we can just watch to see how this works out. The threat is still there, just masked by bandwidth requirements.<br />
<br />
(And I still think virtual worlds should evaluate this technology closely. Installation difficulty is a key inhibitor to several markets there, forcing extreme measures – like shipping laptops already installed – in one documented case; see <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/05/living-in-it-tale-of-learning-in-second.html">Living In It: A Tale of Learning in Second Life</a>.)<br />
</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com3tag:blogger.com,1999:blog-3155908228127841862.post-76112796843348698332010-07-12T15:11:00.000-06:002010-07-12T15:11:32.506-06:00Who Does the Shoe Fit? Functionally Decomposed Hardware (GPGPUs) vs. Multicore.<span xmlns=""></span><br />
<span xmlns="">This post is a long reply to the thoughtful comments on my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/06/wnpots-and-conservatism-of-hardware.html">WNPoTs and the Conservatism of Hardware Development</a> that were made by <a bitly="BITLY_PROCESSED" href="http://www.blogger.com/profile/08451601432189062228">Curt Sampson</a> and <a bitly="BITLY_PROCESSED" href="http://www.blogger.com/profile/06349106495739189749">Andrew Richards</a>. The issue is: Is functionally decomposed hardware, like a GPU, much harder to deal with than a normal multicore (SMP) system? (It's delayed. Sorry. For some reason I ended up in a mental deadlock on this subject.)<br />
<br />
I agree fully with Andrew and Curt that using functionally decomposed hardware can be straightforward <strong><em>if</em></strong> the hardware performs exactly the function you need in the program. If it does not, massive amounts of ingenuity may have to be applied to use it. I've been there and done that, trying at one point to make some special-purpose highly-parallel hardware simulation boxes do things like chip wire routing or more general computing. It required much brain twisting and ultimately wasn't that successful.<br />
<br />
However, GPU designers have been particularly good at making this match. Andrew made this point very well in a videoed debate over on Charlie Demerjian's SemiAccurate blog: Last-minute changes that would be completely anathema to general-purpose (GP) processor designs are apparently par for the course with GPU designs.<br />
<br />
The embedded systems world has been dealing with functionally decomposed hardware for decades. In fact, a huge part of their methodology is devoted to figuring out where to put a hardware-software split to match their requirements. Again, though, the hardware does exactly what's needed, often through last-minute FPGA-based hardware modifications.<br />
<br />
However, there's also no denying that the mainstream of software development, all the guys who have been doing Software Engineering and programming system design for a long time, really don't have much use for anything that's not an obvious Turing Machine onto which they can spin off anything they want. Traditional schedulers have a rough time with even clock speed differences. So, for example, traditional programmers look at Cell SPUs, with their manually-loaded local memory, and think they're misbegotten spawn of the devil or something. (I know I did initially.)<br />
<br />
This train of thought made me wonder: Maybe traditional cache-coherent MP/multicore actually is hardware specifically designed for a purpose, like a GPU. That purpose is, I speculate, transaction processing. This is similar to a point I raised long ago in this blog (<a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2008/10/it-departments-should-not-fear.html">IT Departments Should NOT Fear Multicore</a>), but a bit more pointed.<br />
<br />
Don't forget that SMPs have been around for a very long time, and practically from their inception in the early 1970s were used transparently, with no explicit parallel programming and code very often written by less-than-average programmers. Strongly enabling that was a transaction monitor like IBM's CICS (and lots of others). All code is written as relatively small chunks (debit this account, and update the cash on hand and the bank's total cash…). Each chunk is automatically surrounded by all the locking it needs, is called by the monitor when a customer implicitly invokes it, and can be backed out as needed either by facilities built into the monitor or by a back-end database system.<br />
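<br />
To make the shape of that concrete – a toy sketch of the pattern only, with nothing to do with how CICS or any real monitor is actually implemented – the application programmer writes the little sequential chunk, and the monitor supplies the locking and the backout:<br />
<pre>
import threading

class ToyMonitor:
    """A cartoon transaction monitor: it wraps each chunk in locking and backout."""
    def __init__(self, datastore):
        self.datastore = datastore        # shared state, e.g. account balances
        self.lock = threading.Lock()      # the monitor owns all the locking

    def run(self, chunk, *args):
        with self.lock:                   # serialize access; real monitors lock far finer-grained
            snapshot = dict(self.datastore)
            try:
                return chunk(self.datastore, *args)
            except Exception:
                self.datastore.clear()    # the chunk failed: back it out
                self.datastore.update(snapshot)
                raise

# The application programmer's part: plain sequential code, no locks, no parallelism.
def debit(accounts, name, amount):
    if amount > accounts[name]:
        raise ValueError("insufficient funds")
    accounts[name] -= amount
    accounts["cash_on_hand"] += amount
    return accounts[name]

monitor = ToyMonitor({"alice": 100, "cash_on_hand": 0})
print(monitor.run(debit, "alice", 30))    # 70; the locking and backout were implicit
</pre>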
<br />
It works, and it works very well right up to the present, even with programmers so bad it's a wonder they don't make the covers fly off the servers. (OK, only a few are that bad, but the point is that genius is not required.)<br />
<br />
Of course, transaction monitors aren't a language or traditional programming construct, and also got zero academic notice except perhaps from Jim Gray. But they work, superbly well on SMP / multicore. They can even work well across clusters (clouds) as long as all data is kept in a separate backend store (perhaps logically separate) – a model which, by the way, is the basis of a whole lot of cloud computing.<br />
<br />
Attempts to make multicores/SMPs work in other realms, like HPC, have been fairly successful but have always produced cranky comments about memory bottlenecks, floating-point performance, how badly caches fit the requirements, etc., comments you don't hear from commercial programmers. Maybe this is because the hardware was designed for them? That question is, by the way, deeply sarcastic; performance on transactional benchmarks (like TPC's) is the guiding light and laser focus of most larger multicore / SMP designs.<br />
<br />
So, overall, this post makes a rather banal point: If the hardware matches your needs, it will be easy to use. If it doesn't, well, the shoe just doesn't fit, and will bring on blisters. However, the observation that multicore is actually a special purpose device, designed for a specific purpose, is arguably an interesting perspective.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-23743560417411179462010-07-06T01:17:00.002-06:002010-07-07T15:42:36.950-06:00OnLive Works! First Use Impressions<span xmlns=""></span><br />
<span xmlns="">I've tried <a bitly="BITLY_PROCESSED" href="http://www.onlive.com/">OnLive</a>, and it works. At least for the games I tried, it seems to work quite well, with no noticeable lag and fine video quality. But I'm not sure about the bandwidth issue yet, or the cost.<br />
<br />
OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. I've noted previously that this could hurt the mass market for GPUs, since it doesn't need much graphics on the client. But there were serious questions (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/04/twilight-of-gpu.html">Twilight of the GPU?</a>) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?<br />
<br />
As I said above: Lag, check. Video, check. I found no problems there. Bandwidth, inconclusive. Cost, ditto. More data will answer those, but I've not had the chance to gather it yet. Here's what I did:<br />
<br />
I somehow was "selected" from their wait-list as an OnLive founding member, getting me free access for a year – which doesn't mean I play totally free for a year; see below – and tried it out today, playing free 30-minute demos of Assassin's Creed II a little bit, and Borderlands enough for a good impression.<br />
<br />
Assassin's Creed II was fine through initial cutscenes and minor initial movement. But when I reached the point where I was reborn as a player in medieval times, I ran into a showstopper. As an introduction to the controls, the game wanted me to press <squiggle_icon> to move my legs. <squiggle_icon>, unfortunately, corresponds to no key on my laptop. I tried everything plus shift, control, and alt variations, and nothing worked. In the process I accidentally created a brag clip, went back to the OnLive dashboard, and did some other obscure things I never did figure out, but never did move my legs. I moved my arms with about four different key combinations, but the game wasn't satisfied with that. So I ditched it. For all I know there's something on the OnLive web site explaining this, but I didn't look enough to find it.<br />
<br />
I was much more successful with Borderlands, a post-apocalyptic first-person shooter. I completed the initial training mission, leveled up, and was enjoying myself when the demo time – 30 minutes, which I consider adequately generous – ran out. Targeting and firing seemed to be just as good as native games on my system. I played both in a window and in fullscreen mode, and at no time was there noticeable lag or any visual artifacts. It just played smoothly and nicely.<br />
<br />
I wanted to try Dragon Age – I'm more of an RPG guy – but while it shows up on the web site, I couldn't find it among the games available for play on the live system.<br />
<br />
This is not to say there weren't hassles and pains involved in getting going. Here are some details.<br />
<br />
First, my environment: The system I used is a Sony Vaio VGN-2670N, with Intel Core Duo @ 2.66 GHz, a 1600x900 pixel display, with 4GB RAM and an Nvidia GeForce 9300M; but the Nvidia display adapter wasn't being used. For those of you wondering about speed-of-light delays, my location is just North of Denver, CO, so this was all done more than 1000 miles from the closest server farm they have (Dallas, TX). My ISP is Comcast cable, nominally providing 10 Mb/sec; I have seen it peak as high as 15 Mb/sec in spurts during downloads. My OS is 32-bit Windows Vista. (I know…)<br />
<br />
There was a minor annoyance at the start, since their client installer refuses to even try using Google Chrome as the browser. IE, Firefox, and Safari are supported. But that only required me to use IE, which I shun, for the install; it's not used running the client.<br />
<br />
The much bigger pain is that OnLive adamantly refuses to run over Wifi. The launcher checks, gives you one option – exit – and points you to a FAQ, a pointer that gets a 404 (page not found). I did find the <a bitly="BITLY_PROCESSED" href="http://www.onlive.com/support/performance">relevant FAQ</a> manually on the web site. There they apologize and say it "does indeed work well with good quality Wi-Fi connections, and in the future OnLive will support wireless" but initially they're scared of bad packet-dropping low-signal-strength crud. I can understand this; they're fighting an uphill battle convincing people this works at all, and do not need a multitude complaining it doesn't work when the problem is crummy Wi-Fi. (Or WiFi in a coffee shop – a more serious issue; see bandwidth discussion below.)<br />
<br />
Nevertheless, this is a pain for me. I had to go down in the basement and set up a chair where my router is, next to my water heater, to get a wired connection. When I did go down there, after convincing Vista (I know!) to actually use the wired connection, things went as described above.<br />
<br />
That leaves one question: Bandwidth. My ISP, Comcast, has a 250 GB/month limit beyond which I am an "excessive user" and apparently get a stern talking-to, followed by account termination if I don't mend my excessive ways. Up to now, this has been far from an issue. With OnLive, it may be a significant limitation.<br />
<br />
Unfortunately, I didn't monitor my network use carefully when using OnLive, and ran out of time to go back and do better monitoring. I'll report more when I've done that. However, checking some numbers provided by Comcast after the fact, I can see the possibility that averaging four hours a day is all the OnLive I could do and not get terminated, since my hour of use <strong><em>may</em></strong> (just may) have sucked down 2 GB. This could be a significant issue, limiting OnLive to only very casual users, but I need better measurement to be sure.<br />
<br />
This also points to a reason for not initially allowing Wifi that they didn't mention: I doubt your local free Wifi hot spot in a Starbucks or McDonald's is really up to the task of serving several OnLive players all day.<br />
<br />
Finally, there's cost. What I have free is access to the OnLive system; after a year that's $4.95/month (which may be a "founding member" deal). But to play other than a free demo, I need to purchase a PlayPass for each game played. I didn't do that, and still need to check that cost. Sorry, time limitations again.<br />
<br />
So where does this leave the market for GPUs? With the information I have so far, all I can say is that the verdict is inconclusive. I think they really have the lag and display issues licked; those just aren't a problem. If I'm wrong about the bandwidth (entirely possible), and the PlayPasses don't cost too much, it could over time deal a large blow to the mass market for GPUs, which among other problems would sink the volumes that make them relatively inexpensive for HPC use.<br />
<br />
On the other hand, if the bandwidth and cost make OnLive suitable only for very casual gaming, there may actually be a positive effect on the GPU market, since OnLive could be used as a very good "try before you buy" facility. It worked for me; I've been avoiding first-person shooters in favor of RPGs, but found the Borderlands demo to be a lot more fun than I expected.</span><br />
<span xmlns=""><br />
</span><br />
<span xmlns="">Finally, I'll just note that Second Life recently changed direction and is saying they're going to move to a browser-based client. They, and other virtual world systems, might do well to consider instead a system using this type of technology. It would expand the range of client systems dramatically, and, even though there is a client, simplify use dramatically.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-49468890052369582712010-06-14T17:22:00.000-06:002010-06-14T17:22:37.332-06:00WNPoTs and the Conservatism of Hardware Development<span xmlns=""></span><br />
<span xmlns="">There are some things about which I am undoubtedly considered a crusty old fogey, the abominable NO man, an ostrich with its head in the sand, and so on. Oh frabjous day! I now have a word for such things, courtesy of <a bitly="BITLY_PROCESSED" href="http://www.antipope.org/charlie/blog-static/2010/05/cmap-9-ebooks.html">Charlie Stross, who wrote</a>:<br />
<br />
<blockquote>Just contemplate, for a moment, how you'd react to some guy from the IT sector walking into your place of work to evangelize a wonderful new piece of technology that will <em>revolutionize your job</em>, once everybody in the general population shells out £500 for a copy and <em>you</em> do a lot of hard work to teach them how to use it. And, on closer interrogation, you discover that he <em>doesn't actually know what you do for a living</em>; he's just certain that his WNPoT is going to revolutionize it. Now imagine that this happens (different IT marketing guy, different WNPoT, same pack drill) approximately once every two months for a five year period. You'd learn to tune him out, wouldn't you?<br />
</blockquote>I've been through that pack drill more times than I can recall, and yes, I tune them out. The WNPoTs in my case were all about technology for computing itself, of course. Here are a few examples; they are sure to step on a number of toes:<br />
<br />
<ul><li>Any new programming language existing only for parallel processing, or for any reason other than making programming itself simpler and more productive (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2008/09/101-parallel-languages-part-1.html">101 parallel languages</a>)<br />
</li>
<li>Multi-node single system image (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/01/multi-multicore-single-system-image.html">Multi-Multicore Single System Image</a>)<br />
</li>
<li><a bitly="BITLY_PROCESSED" href="http://www.memristor.org/">Memristors</a>, a new circuit type. A key point here is that exactly one company (HP) is working on it. Good technologies instantly crystallize consortia around themselves. Also, HP isn't a silicon technology company in the first place. <br />
</li>
<li><a bitly="BITLY_PROCESSED" href="http://en.wikipedia.org/wiki/Quantum_computer">Quantum computing</a>. Primarily good for just one thing: Cracking codes.<br />
</li>
<li>Brain simulation and strong artificial intelligence (really "thinking," whatever that means). Current efforts were beautifully characterized by John Horgan, in a <a bitly="BITLY_PROCESSED" href="http://www.scientificamerican.com/blog/post.cfm?id=artificial-brains-are-imminentnot-2010-05-14">SciAm guest blog</a>: 'Current brain simulations resemble the "planes" and "radios" that <a bitly="BITLY_PROCESSED" href="http://en.wikipedia.org/wiki/Cargo_cult">Melanesian cargo-cult tribes</a> built out of palm fronds, coral and coconut shells after being occupied by Japanese and American troops during World War II.'<br />
</li>
</ul>Of course, for the most part those aren't new. They get re-invented regularly, though, and drooled over by ahistorical evangelists who don't seem to understand that if something has already failed, you need to lay out what has changed sufficiently that it won't just fail again.<br />
<br />
The particular issue of retread ideas aside, genuinely new and different things have to face up to what Charlie Stross describes above, in particular the part about not understanding what you do for a living. That point, for processor and system design, is a lot more important than one might expect, due to a seldom-publicized social fact: Processor and system design organizations are incredibly, insanely, conservative. They have good reason to be. Consider: <br />
<br />
Those guys are building some of the most, if not the most, intricately complex structures ever created in the history of mankind. Furthermore, they can't be fixed in the field with an endless stream of patches. They have to just plain work – not exactly in the first run, although that is always sought, but in the second or, at most, third; beyond that money runs out.<br />
<br />
The result they produce must also please, not just a well-defined demographic, but a multitude of masters from manufacturing to a wide range of industries and geographies. And of course it has to be cost- and performance-competitive when released, which entails a lot of head-scratching and deep breathing when the multi-year process begins.<br />
<br />
Furthermore, each new design does it all over again. I'm talking about the "tock" phase for Intel; there's much less development work in the "tick" process shrink phase. Development organizations that aren't Intel don't get that breather. You don't "re-use" much silicon. (I don't think you ever re-use much code, either, with a few major exceptions; but that's a different issue.)<br />
<br />
This is a very high stress operation. A huge investment can blow up if one of thousands of factors is messed up.<br />
<br />
What they really do to accomplish all this is far from completely documented. I doubt it's even consciously fully understood. (What gets written down by someone paid from overhead to satisfy an ISO requirement is, of course, irrelevant.)<br />
<br />
In this situation, is it any wonder the organizations are almost insanely conservative? Their members cannot even conceive of something except as a delta from both the current product and the current process used to create it, <strong>because that's what worked</strong>. And it worked within the budget. And they have their total intellectual capital invested in it. Anything not presented as a delta of both the current product and process is rejected out of hand. The process and product are intertwined in this; what was done (product) was, with no exceptions, what you were able to do in the context (process).<br />
<br />
An implication is that they do not trust anyone who lacks the scars on their backs from having lived that long, high-stress process. You can't learn it from a book; if you haven't done it, you don't understand it. The introduction of anything new by anyone without the tribal scars is simply impossible. This is so true that I know of situations where taking a new approach to processor design required forming a new, separate organization. It began with a high-level corporate Act of God that created a new high-profile organization from scratch, dedicated to the new direction, staffed with a mix of outside talent and a few carefully-selected high-talent open-minded people pirated from the original organization. Then, very gradually, more talent from the old organization was siphoned off and blended into the new one until there was no old organization left other than a maintenance crew. The new organization had its own process, along with its own product.<br />
<br />
This is why I regard most WNPoT announcements from a company's "research" arm as essentially meaningless. Whatever it is, it won't get into products without an "Act of God" like that described above. WNPoTs from academia or other outside research? Fuggedaboudit. Anything from outside is rejected unless it was originally nurtured by someone with deep, respected tribal scars, sufficiently so that that person thinks they completely own it. Otherwise it doesn't stand a chance.<br />
<br />
Now I have a term to sum up all of this: WNPoT. Thanks, Charlie.<br />
<br />
Oh, by the way, if you want a good reason why the Moore's Law half-death that flattened clock speeds produced multi- / many-core as a response, look no further. They could only do more of what they already knew how to do. It also ties into how the very different computing designs that are the other reaction to flat clocks came not from CPU vendors but outsiders – GPU vendors (and other accelerator vendors; see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/07/why-accelerators-now.html">Why Accelerators Now?</a>). They, of course, were also doing more of what they knew how to do, with a bit of Sutherland's Wheel of Reincarnation and DARPA funding thrown in for Nvidia. None of this is a criticism, just an observation.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com5tag:blogger.com,1999:blog-3155908228127841862.post-71705921707242291772010-06-08T21:43:00.002-06:002010-06-09T19:52:54.897-06:00Ten Ways to Trash your Performance Credibility<span xmlns=""></span><br />
<span xmlns="">Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming. <br />
<br />
Danger: The showiest blooms are toxic to your credibility.<br />
<br />
(My wife is planting flowers these days. Can you tell?)<br />
<br />
There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:<br />
<br />
"I would have to be pretty drunk to believe that."<br />
<br />
Your great, customer-verified, most wonderful results have trashed your credibility.<br />
<br />
Here are some examples:<br />
<br />
In a <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/05/all-hail-gpu-tweetstream.html">recent talk</a>, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.<br />
<br />
In an <a bitly="BITLY_PROCESSED" href="http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&gid=95317&discussionID=18896209&split_page=2&goback=.anh_95317">online discussion</a> on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a Quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under non-disclosure agreement.<br />
<br />
In a <a bitly="BITLY_PROCESSED" href="http://www.accelereyes.com/resources/spectroscopy">video interview</a>, a customer doing cell pathology image analysis reports their task going from 400 <span class="Apple-style-span" style="color: red;">minutes</span> to 65 milliseconds, for a speedup of just under 370,000X. <span class="Apple-style-span" style="color: red;">(Update: Typo, he really does say "minutes" in the video.)</span><br />
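<br />
(For the record, the arithmetic behind those last two speedups – just my own back-of-the-envelope check – does work out as quoted:)<br />
<pre>
# Sanity-check the two quoted speedups above.
cray_speedup = (12 * 3600) / 1.2          # 12 hours down to 1.2 seconds
pathology_speedup = (400 * 60) / 0.065    # 400 minutes down to 65 milliseconds

print(cray_speedup)        # 36000.0
print(pathology_speedup)   # ~369,231 -- "just under 370,000X"
</pre>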
<br />
None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find him- or herself dismissed as, well, the polite version would be that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.<br />
<br />
What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.<br />
<br />
This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler <strong>without</strong> vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.<br />
<br />
The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related modernized version of this story is told in a recent paper <a bitly="BITLY_PROCESSED" href="http://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/9192e6536facfcef85257720005a0265!OpenDocument&Highlight=0,Bordawekar"><em>Believe It or Not! Multi-core CPUs can Match GPU Performance</em></a>, where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's using, great things can happen. <br />
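<br />
A toy modern analogue of that third step – my own example, nothing from the paper – is easy to run yourself: take a "grotty" element-at-a-time loop, clean it up, and time both on the same plain CPU, with no accelerator anywhere in sight:<br />
<pre>
import time
import numpy as np

def grotty_sum_of_squares(data):
    total = 0.0
    for value in data:             # interpreted, one element at a time
        total += value * value
    return total

data = np.random.rand(10_000_000)

t0 = time.perf_counter()
slow = grotty_sum_of_squares(data)
t1 = time.perf_counter()
fast = float(np.dot(data, data))   # the "cleaned up" version, same hardware
t2 = time.perf_counter()

print(f"grotty:  {t1 - t0:.3f} s")
print(f"cleaned: {t2 - t1:.3f} s") # typically far, far faster -- and no accelerator got the credit
</pre>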
<br />
This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that Power7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution and, using that information, present your results with the credit given where it's due.<br />
<br />
The situation we're in is identical to the one that led someone (wish I remembered who), decades ago, to write a short paper titled, approximately, <em>Ten Ways to Lie about Parallel Processing</em>. I thought I kept a copy, but if I did I can't find it. It was back at the dawn of whatever, and I can't find it now even with Google Scholar. (If anyone out there knows the paper I'm referencing, please let me know.) <span class="Apple-style-span" style="color: red;">Got it! It's </span><a bitly="BITLY_PROCESSED" href="http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf"><span class="Apple-style-span" style="color: red;">Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers</span></a><span class="Apple-style-span" style="color: red;">, by David H. Bailey. Thank you, </span><a bitly="BITLY_PROCESSED" href="http://www.blogger.com/profile/15007027365887236970"><span class="Apple-style-span" style="color: red;">Roland</span></a><span class="Apple-style-span" style="color: red;">!</span><br />
<br />
In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:<br />
<br />
</span><br />
<span xmlns=""></span><br />
<span xmlns=""></span><br />
<span xmlns=""></span><br />
<span xmlns=""></span><br />
<span xmlns=""><ol><li>Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task.<br />
</li>
<li>Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,<br />
</li>
<li>Pick a problem size that just barely fits into the local memory you have available. Why? See #4.<br />
</li>
<li>Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not.<br />
</li>
<li>Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.<br />
</li>
<li>Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.<br />
</li>
<li>Allow a <em>slightly</em> different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't (see the short example after this list).<br />
</li>
<li>Change the operating system. Pick the one that does IO to your device fastest.<br />
</li>
<li>Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!<br />
</li>
<li>Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.<br />
</li>
</ol></span><span xmlns="">This, of course, is just a start. I'm sure there are another ten or a hundred out there.<br />
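<br />
Since #7 is the one people most often wave away, here is a minimal sketch of what a "slightly different answer" looks like; the values are ones I picked to make the effect obvious, not anything taken from the cases above:<br />
<pre>
# Toy illustration of #7: floating-point arithmetic does not distribute
# the way the algebra says it should.  The values are chosen (by me, for
# this post) so the difference is easy to see on any IEEE-754 double.
a, x, y = 3.0, 1.0, 1e-16

lhs = a * (x + y)      # x + y rounds to exactly 1.0, so this is exactly 3.0
rhs = a * x + a * y    # a*y survives the rounding here, nudging the sum up one ulp

print(lhs)             # 3.0
print(rhs)             # 3.0000000000000004
print(lhs == rhs)      # False -- a "slightly different answer"
</pre>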
<br />
A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes. <br />
<br />
For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list <em>before</em> you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.<br />
<br />
*****************<br />
<br />
Thanks to John Melonakos of <a bitly="BITLY_PROCESSED" href="http://www.accelereyes.com/">Accelereyes</a> for discussion and sharing his thoughts on this topic.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com15