tag:blogger.com,1999:blog-31559082281278418622024-03-12T23:38:00.515-06:00The Perils of ParallelA blog about multicore, cloud computing, accelerators, Virtual Worlds, and likely other topics, loosely driven by the effective end of Moore’s Law. It’s also a place to try out topics for my next book.Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.comBlogger74125tag:blogger.com,1999:blog-3155908228127841862.post-39735521709791634602012-11-12T14:00:00.000-07:002012-11-15T11:59:34.514-07:00Intel Xeon Phi Announcement (& me)<br />
1. No, I’m not dead. Not even sick. Been a long time since a post. More on this at the end.<br />
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
2. So, Intel has finally announced a product ancestrally based on the long-ago Larrabee. The architecture became known as MIC (Many Integrated Cores), development vehicles were named after tiny towns (Knights Corner/Knights Ferry – one was to be the product, but I could never keep them straight), and the final product is to be known as the Xeon Phi.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Why Phi? I don’t know. Maybe it’s the start of a convention of naming High-Performance Computing products after Greek letters. After all, they’re used in equations.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A micro-synopsis (see my post <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html" target="_blank">MIC and the Knights</a> for a longer discussion): The Xeon Phi is a PCIe board containing 6GB of RAM and a chip with lots (I didn’t find out how many ahead of time) of X86 cores with wide (512-bit) vector units, able to produce over 1 TeraFLOP (more about that later). The X86 cores are programmed pretty much exactly like a traditional “big” single Xeon: All your favorite compilers can be used, and it runs Linux. Note that to accomplish that, the cores must be fully cache-coherent, just like a multi-core “big” Xeon chip. Compiler mods are clearly needed to target the wide vector units, and that Linux undoubtedly had a few tweaks to do anything useful on the 50+ cores there are per chip, but they look normal. Your old code will run on it. <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html" target="_blank">As I’ve pointed out</a>, modifications are needed to get anything like full performance, but you do not have to start out with a blank sheet of paper. This is potentially a very big deal.<o:p></o:p><br />
<br />
<i><span style="color: blue;">Since I originally published this, Intel has deluged me with links to their information. See the bottom of this post if you want to read them.</span></i></div>
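<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To make the “your old code will run” claim concrete, here’s the kind of thing I have in mind – a completely ordinary OpenMP loop in C, nothing Phi-specific in the source. The build lines in the comment are my assumption about the workflow Intel describes (a “-mmic”-style cross-compile flag); I haven’t run this on a Phi, so treat it as a sketch.</div>
<pre>
/* saxpy.c -- deliberately ordinary OpenMP C; nothing Phi-specific here.
 * Assumed build lines (my sketch of the workflow, not verified on hardware):
 *   host Xeon:  icc -openmp -O2 saxpy.c -o saxpy
 *   Xeon Phi:   icc -openmp -O2 -mmic saxpy.c -o saxpy.mic
 * Same source; on the Phi build the compiler targets the 512-bit vector units.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long n = 1000000;
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);

    for (long i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* OpenMP spreads iterations across the 50+ cores; the compiler is free
       to vectorize the loop body. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}
</pre>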
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, it’s here. Finally, some of us would say, but development processes vary and may have hangups that nobody outside ever hears about.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I found a few things interesting about the announcement.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Number one is their choice as to the first product. The one initially out of the blocks is not a lower-performance version, but rather the high end of the current generation: the one that costs more ($2649) and has high performance on double-precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see <i>significant</i> revenue <b><i>right now</i></b> out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while now, so it has seen some trial use. At national labs and (maybe?) large enterprises.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
It costs more than the clear competitor, Nvidia’s Tesla boards. $2649 vs. sub-$2000. For less peak performance. (Note: I've been told that AnandTech claims the new Nvidia K20s cost >$3000. I can't otherwise confirm that.) We can argue all day about whether the actual performance is better or worse on real applications, and how much the ability to start from existing code helps, but this pricing still stands out. Not that anybody will actually pay that much; the large customer targets are always highly-negotiated deals. But the Prof. Joes and the Kacklefoos don’t have negotiation leverage.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A second odd point came up in the Q & A period of the pre-announce concall. (I was invited, which is why I’ve come out of my hole to write this.) (Guilt.) Someone asked about memory bottlenecks; it has 310GB/s to its memory, which isn’t bad, but some apps are gluttons. This prompted me to ask about the PCIe bottleneck: Isn’t it also going to be starved for data delivered to it? I was told I was doing it wrong. I was thinking of the main program running on the host, foisting stuff off to the Phi. Wrong. The main program runs on the Phi itself, so the whole program runs on the card’s many (slower) cores.<o:p></o:p></div>
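<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Here’s the shape of what I was (wrongly) assuming, as a hedged sketch: the host runs main() and offloads hot loops to the card, shipping data across PCIe each way. The pragma syntax here is my reading of Intel’s offload compiler extensions, so don’t take the details as gospel; the point is just where the PCIe traffic sits in this model, and why Intel’s answer was, in effect, “run the whole program natively on the card instead.”</div>
<pre>
/* Offload-style sketch: host runs main(), "foists" work onto the Phi.
 * The #pragma offload syntax is my understanding of Intel's compiler
 * extensions -- illustrative only.
 */
#include <stdio.h>

#define N 1000000
static float in_data[N], out_data[N];

int main(void)
{
    for (long i = 0; i < N; i++)
        in_data[i] = (float)i;

    /* Data crosses the PCIe link here and back -- the bottleneck I asked about. */
    #pragma offload target(mic) in(in_data) out(out_data)
    {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            out_data[i] = 2.0f * in_data[i];
    }

    printf("%f\n", out_data[N - 1]);
    return 0;
}
</pre>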
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This means they are, at this time at least, explicitly not taking the path I’ve heard Nvidia evangelists talk about recently: Having lots and lots of tiny cores, along with a number of middle-size cores, and far fewer Great Big cores – and they all live together in a crooked little… Sorry! on the same chip, sharing the same memory subsystem so there are oodles of bandwidth amongst them. This could allow the parts of an application that are limited by single- or few-thread performance to go fast, while the parts that are many-way parallel also go fast, with little impedance mismatch between them. On Phi, if you have a single-thread limited part, it runs on just one of the CPUs, which haven’t been designed for peak single-thread performance. On the other hand, the Nvidia stuff is vaporware, and while this kind of multi-speed arrangement has been talked about for a long time, I don’t know of any compiler technology that supports it with any kind of transparency.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A third item, and this seems odd, is the small speedups claimed by the marketing guys: Just 2X-4X. Eh what? 50 CPUs and only 2-4X faster?<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is incredibly refreshing. The claims of accelerator-foisting companies can be outrageous to the point that they lose all credibility, <a href="http://perilsofparallel.blogspot.com/2010/06/ten-ways-to-trash-your-performance.html" target="_blank">as I’ve written about before</a>.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the other hand, it’s slightly bizarre, given that at the same conference Intel has people talking about applications that, when retargeted to Phi, get 6.6X (in figuring out graph connections on big graphs) or 4.8X (analyzing synthetic aperture radar images).<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the gripping hand, I really see the heavy hand of Strategic Marketing smacking people around here. Don’t cannibalize sales of the big Xeon E5s! They are known to make Big Money! Someone like me, coming from an IBM background, knows a whole lot about how The Powers That Be can influence how seemingly competitive products are portrayed – or developed. I’ve a sneaking suspicion this influence is why it took so long for something like Phi to reach the market. (Gee, Pete, you’re a really great engineer. Why are you wasting your time on that piddly little sideshow? We’ve got a great position and a raise for you up here in the big leagues…) (Really.)<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There are rationales presented: They are comparing apples to apples, meaning well-parallelized code on Xeon E5 Big Boys compared with the same code on Phi. This is to be commended. Also, Phi ain’t got no hardware ECC for memory. Doing ECC in software on the Phi saps its strength considerably. (Hmmm, why do you suppose it doesn’t have ECC? (Hey, Pete, got a great position for you…) (Or “Oh, we're not a threat. We don't even have ECC! Nobody will do serious stuff without ECC.”)) <b><span style="color: red;">Note: Since this pre-briefing, data sheets have emerged that indicate Phi has optional ECC. Which raises two questions: Why did they explicitly say otherwise in the pre-briefing? And: What does "optional" mean?</span></b><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Anyway, Larrabee/MIC/Phi has finally hit the streets. Let the benchmark and marketing wars commence.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now, about me not being dead after all:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I don’t do this blog-writing thing for a living. I’m on the dole, as it were – paid for continuing to breathe. I don’t earn anything from this blog; those Google-supplied ads on the sides haven’t put one dime in my pocket. My wife wants to know why I keep doing it. But note: having no deadlines is <i>wonderful</i>.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So if I feel like taking a year off to play Skyrim, well, I can do that. So I did. It wasn't supposed to be a year, but what the heck. It's a big game. I also participated in some pleasant Grandfatherly activities, paid very close attention to exactly when to exercise some near-expiration stock options, etc.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Finally, while I’ve occasionally poked my head up on Twitter or Facebook when something interesting happened, there hasn’t been much recently. X added N more processors to the same architecture. Yawn. Y went lower power with more cores. Yawn. If news outlets weren’t paid for how many eyeballs they attracted, they would have been yawning, too, but they are, so every minute twitch becomes an Epoch-Defining Quantum Leap!!! (Complete with ironic use of the word “quantum.”) No judgment here; they have to make a living.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.<br />
<br />
<br />
<div style="border-bottom: dotted windowtext 3.0pt; border: none; mso-element: para-border-div; padding: 0in 0in 1.0pt 0in;">
<div class="MsoNoSpacing" style="border: none; mso-border-bottom-alt: dotted windowtext 3.0pt; mso-padding-alt: 0in 0in 1.0pt 0in; padding: 0in;">
----------------------------------------------------------------------------------------</div>
</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel has deluged me with links. Here they are:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel® Xeon Phi™ product page: <a href="http://www.intel.com/xeonphi">http://www.intel.com/xeonphi</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel® Xeon Phi™ Coprocessor product brief: <a href="http://intel.ly/Q8fuR1">http://intel.ly/Q8fuR1</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Accelerate Discovery with Powerful New HPC Solutions
(Solution Brief) <a href="http://intel.ly/SHh0oQ">http://intel.ly/SHh0oQ</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
An Overview of Programming for Intel® Xeon® processors
and Intel® Xeon Phi™ coprocessors <a href="http://intel.ly/WYsJq9">http://intel.ly/WYsJq9</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
YouTube Animation Introducing the Intel® Xeon Phi™
Coprocessor <a href="http://intel.ly/RxfLtP">http://intel.ly/RxfLtP</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel® Xeon Phi™ Coprocessor Infographic: <a href="http://ow.ly/fe2SP">http://ow.ly/fe2SP</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
VIDEO: The History of Many Core, Episode 1: The Power
Wall. <a href="http://bit.ly/RSQI4g">http://bit.ly/RSQI4g</a><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Diane Bryant’s presentation, additional documents and
pictures will be available at <a href="http://newsroom.intel.com/community/intel_newsroom/blog/2012/11/12/intel-delivers-new-architecture-for-discovery-with-intel-xeon-phi-coprocessors">Intel
Newsroom</a><o:p></o:p></div>
</div>
Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com1tag:blogger.com,1999:blog-3155908228127841862.post-57639050252080748192012-02-13T19:49:00.000-07:002012-02-21T14:53:35.024-07:00Transactional Memory in Intel Haswell: The Good, and a Possible Ugly<b><span style="color: red;">Note: Part of this post is being updated based on new information received after it was published. Please check back tomorrow, 2/18/12, for an update. </span></b><br />
<b><span style="color: red;"><br /></span></b><br />
<b><span style="color: red;">Sorry... no update yet (2/21/12). Bad cold. Also seduction / travails of a new computer. Update soon.</span></b><br />
<br />
<div class="MsoNoSpacing">
</div>
<div class="MsoNoSpacing">
I asked James Reinders at the Fall 2011 IDF whether the
synchronization features he had from the X86 architecture were adequate for MIC.
(<a href="http://perilsofparallel.blogspot.com/2011/09/conversation-with-intels-james-reinders.html">Transcript</a>;
see the very end.) He said that they were good for the 30 or so cores in
Knight’s Ferry, but when you got above 40, they would need to do something different.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now Intel has <a href="http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/">announced</a>
support for transactional memory in Haswell, the chip which follows their Ivy
Bridge chip that is just starting shipment as I write this. So I think I’d now
be willing to take bets that this is what James was vaguely hinting at, and
will appear in <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">Intel’s
MIC HPC architecture</a> as it ships in the Knight’s Corner product. I prefer to
take bets on sure things.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There has been some light discussion of Intel’s
“Transactional Synchronization Extensions” (TSX), as this is formally called, and
a good <a href="http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/">example
of its use</a> from James Reinders. But now that an <a href="http://software.intel.com/en-us/avx/">architecture spec</a> has been
published for TSX, we can get a bit deeper into what, exactly, Intel is
providing, and where there might just be a skeleton in the closet.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
First, background: What is this “transactional memory”
stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I
believe is lurking. </div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
Transactional Memory</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The term “transaction” comes from contract law, was
picked up by banking, and from there went into database systems. It refers to a
collection of actions that all happen as a unit; they cannot be divided. If I give
you money and you give me a property deed, for example, that happens as if it
were one action – a transaction. The two parts can’t be (legally) separated;
both happen, or neither. Or, in the standard database example: when I transfer
money from my bank account to Aunt Sadie’s, the subtraction from my account and
the addition to hers must either both happen, or neither; otherwise money is
being either destroyed or created, which would be a bad thing. As you might
imagine, databases have evolved a robust technology to do transactions where
all the changes wind up on stable storage (disk, flash).</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The notion of transactional memory is much the same: a
collection of changes to memory is made all-or-nothing: Either all of them
happen, as seen by every thread, process, processor, or whatever; or none of
them happen. So, for example, when some program plays with the pointers in a
linked list to insert or delete some list member, nobody can get in there when
the update is partially done and follow some pointer to oblivion.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
It applies as much to a collection of accesses – reads –
as it does to changes – writes. The read side is necessary to ensure that a
consistent collection of information is read and acted upon by entities that
may be looking around while another is updating.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To do this, typically a program will issue something meaning
“Transaction On!” to start the ball rolling. Once that’s done, everything it
writes is withheld from view by all other entities in the system; and anything
it reads is put on special monitoring in case someone else mucks with it. The
cache coherence hardware is mostly re-used to make this monitoring work; cross-system
memory monitoring is what cache coherence does. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This continues, accumulating things read and written,
until the program issues something meaning “Transaction Off!”, typically called
“Commit!” Then, hraglblargarolfargahglug! All changes are vomited at once into
memory, and the locations read are forgotten about.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What happens if some other entity does poke its nose into
those saved and monitored locations, changing something the transactor was
depending on or modifying? Well, “Transaction On!” was really “Transaction On! <i>And, by the way, if anything screws up go <b><u>there</u></b></i>.” On reaching <b><i><u>there,</u></i></b>
all the recording of data read and changes made has been thrown away; and <b><i><u>there</u></i></b>
is a block of code that usually sets things up to start again, going back to
the “Transaction On!” point. (The code could also decide “forget it” and not
try over again.) Quitting like this in a controlled manner is called <b>aborting</b> a transaction. It is obviously
better if aborts don’t happen a lot, just like it’s better if a lock is not
subject to a lot of contention. However, note that nobody else has seen any of
the changes made since “On!”, so half-mangled data structures are never seen by
anyone.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
Why Transactions Are a Good Thing</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What makes transactional semantics potentially more
efficient than simple locking is that only those memory locations read or
referenced <i>at run time</i> are maintained
consistently during the transaction. The consistency does not apply to memory
locations that <b>could</b> be referenced,
only the ones that <b>actually are</b>
referenced.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There are situations where that’s a distinction without a
difference, since everybody who gets into some particular transaction-ized
section of code will bash on exactly the same data every time. Example: A global
counter of how many times some operation has been done by all the processors in
a system. Transactions aren’t any better than locks in those situations. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But there are cases where the dynamic nature of transactional
semantics can be a huge benefit. The standard example, also used by James
Reinders, is a multi-access hash table, with inserts, deletions, and lookups
done by many processes, etc.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I won’t go through this in detail – you can read <a href="http://software.intel.com/en-us/blogs/2012/02/07/coarse-grained-locks-and-transactional-synchronization-explained/">James’
version</a> if you like; he has a nice diagram of a hash table, which I don’t –
but consider: </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
With the usual lock semantics, you could simply have one
coarse lock around the whole table: Only one person, reader or writer, gets in at
any time. This works, and is simple, but all access to the table is now
serialized, so will cause a problem as you scale to more processors.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Alternatively, you could have a lock per hash bucket, for
fine-grained rather than coarse locking. That’s a lot of locks. They take up
storage, and maintaining them all correctly gets more complex.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Or you could do either of those – one lock, or many – but
also get out your old textbooks and try once again to understand those multiple
reader / single writer algorithms and their implications, and, by the way, did
you want reader or writer priority? Painful and error-prone.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the other hand, suppose everybody – readers and
writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”)
before starting a read or a write; then does a “Commit!” when they exit. This
is only as complicated as the single coarse lock (and sounds a lot like an “atomic”
keyword on a class, hint hint).</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Then what you can bank on is that the probability is tiny
that two simultaneous accesses will look at the same hash bucket; if that
probability is not tiny, you need a bigger hash table anyway. The most likely
thing to happen is that nobody – readers or writers – ever accesses the same hash
bucket, so everybody just sails right through, “Commit!”s, and continues, all
in parallel, with no serialization at all. (Not really. See the skeleton, later.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In the unlikely event that a reader and a writer are
working on the same bucket at the same time, whoever “Commit!”s first wins; the
other aborts and tries again. Since this is highly unlikely, overall the
transactional version of hashing is a big win: it’s both simple and very highly
parallel.</div>
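<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For concreteness, here’s the structure we’re talking about as a little pthreads sketch of my own (not anybody’s production hash table): the simple coarse-lock version, with comments marking where “Transaction On!” and “Commit!” conceptually replace the lock operations. The appeal of the transactional version is that the code stays exactly this simple, but two threads hitting different buckets no longer serialize on that one lock.</div>
<pre>
/* Coarse-lock hash table sketch (mine, purely illustrative).
 * Every reader and writer serializes on one lock, even though two different
 * keys almost never land in the same bucket. A transactional version keeps
 * this exact structure; the lock/unlock pair becomes begin/commit.
 */
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 4096

struct entry { long key; long val; struct entry *next; };

static struct entry *table[NBUCKETS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

void insert(long key, long val)
{
    struct entry *e = malloc(sizeof *e);
    e->key = key;
    e->val = val;

    pthread_mutex_lock(&table_lock);      /* "Transaction On!" goes here */
    long b = (unsigned long)key % NBUCKETS;
    e->next = table[b];
    table[b] = e;
    pthread_mutex_unlock(&table_lock);    /* "Commit!" goes here */
}

int lookup(long key, long *val)
{
    int found = 0;
    pthread_mutex_lock(&table_lock);      /* "Transaction On!" */
    for (struct entry *e = table[(unsigned long)key % NBUCKETS]; e; e = e->next)
        if (e->key == key) { *val = e->val; found = 1; break; }
    pthread_mutex_unlock(&table_lock);    /* "Commit!" */
    return found;
}
</pre>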
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Transactional memory is not, of course, the only way to skin
this particular cat. <a href="http://www.azulsystems.com/">Azul Systems</a> has
published a detailed presentation on a <a href="http://www.azulsystems.com/about_us/presentations/lock-free-hash?reg">Lock-Free
Wait-Free Hash Table</a> algorithm that uses only compare-and-swap
instructions. I got lost somewhere around the fourth state diagram. (Well, OK,
actually I saw the first one and kind of gave up.) Azul has need of such
things. Among other things, they sell massive Java compute appliances, going up
to the <a href="http://www.azulsystems.com/products/vega/specs">Azul Vega 3
7380D</a>, which has 864 processors sharing 768GB of RAM. Think investment
banks: take <i>that</i>, you massively
recomplicated proprietary version of a Black-Scholes option pricing model! In
Java! (Those guys don’t just buy GPUs.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
However, Azul only needs that algorithm on their port of
their software stack to X86-based products. Their Vega systems are based on
their own proprietary 54-core Vega processors, which have shipped with
transactional memory – which they call Speculative Multi-address Atomicity –
since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly
new news.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Anyway, if you want this wait-free super-parallel hash
table (and other things, obviously) without exploding your head, transactional
memory makes it possible rather simply.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
What Intel Has: RTM and HLE</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Intel’s Transactional Synchronization Extensions come in
two flavors: Restricted Transactional Memory (RTM) and Hardware Lock Elision
(HLE). </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
RTM is essentially what I described above: There’s XBEGIN
for “Transaction On!”, XEND for “Commit!” and XABORT if you want to manually
toss in the towel for some reason. XBEGIN must be given a <b><i><u>there</u></i></b> location to go
to in case of an abort. When an abort occurs, the processor state is restored
to what it was at XBEGIN, except that flags are set indicating the reason for
the abort (in EAX).</div>
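<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Since compilers supporting this haven’t shipped as I write, here’s a hedged sketch of how I expect it to look in C, using the intrinsic names from Intel’s documentation (_xbegin / _xend / _xabort in immintrin.h, built with an -mrtm-style flag) – treat the details as assumptions. Note the fallback lock: a transaction can always abort (conflict, capacity, interrupt), so there has to be a non-speculative path that is guaranteed to make progress, and the transaction has to check that lock so it doesn’t race anyone already on the slow path.</div>
<pre>
/* RTM sketch -- intrinsic names per Intel's documentation; compiler support
 * is assumed, not something I've built and run.
 */
#include <immintrin.h>

static volatile int fallback_lock = 0;    /* simple spinlock for the slow path */

static void lock_fallback(void)
{
    while (__sync_lock_test_and_set(&fallback_lock, 1))
        while (fallback_lock)
            ;                             /* spin */
}

static void unlock_fallback(void)
{
    __sync_lock_release(&fallback_lock);
}

void increment(long *counter)
{
    unsigned status = _xbegin();          /* "Transaction On!" plus the "there" address */
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock)                /* reading it adds it to our read set... */
            _xabort(0xff);                /* ...so a slow-path holder will abort us */
        *counter += 1;                    /* speculative; invisible to everyone else */
        _xend();                          /* "Commit!" */
    } else {
        /* We arrived "there": status (returned in EAX) says why we aborted.
           Give up on speculation and take the plain lock. */
        lock_fallback();
        *counter += 1;
        unlock_fallback();
    }
}
</pre>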
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
HLE is a bit different. All the documentation I’ve seen
so far always talks about it first, perhaps because it seems like it is more
familiar, or they want to brag (no question, it’s clever). I obviously think
that’s confusing, so didn’t do it in that order.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
HLE lets you take your existing, lock-based, code and
transactional-memory-ify it: Lock-based code now runs without blocking unless
required, as in the hash table example, with minimal, minuscule change that can
probably be done with a compiler and the right flag. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I feel like adding “And… at no time did their fingers
leave their hands!” It sounds like a magic trick.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In addition to being magical, it’s also clearly strategic
for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy
bet on people just wanting to recompile and run without the rewrites needed for
accelerators like GPGPUs. (See my post <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">MIC
and the Knights</a>.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
HLE works by setting a new instruction prefix – XACQUIRE –
on any instruction you use to try to acquire a lock. Doing so causes there to
be <b>no change</b> to the lock data: the
lock write is “elided.” Instead it (a) takes a checkpoint of the machine state;
(b) saves the address of the instruction that did this; (c) puts the lock
location in the set of data that is transactionally read; and (d) does a “Transaction
On!”</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So everybody goes charging right through the lock without
stopping, but now every location read is continually monitored, and every write
is saved, not appearing in memory.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If nobody steps on anybody else’s feet – writes someone
else’s monitored location – then when the instruction to release the lock is
done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug
flush of all the saved writes into memory, forgets everything monitored, and
turns off transaction mode.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If somebody does write a location someone else has read,
then we get an ABORT with its wayback machine: back to the location that tried
to acquire the lock, restoring the CPU state, so everything is like it was just
before the lock acquisition instruction was done. This time, though, the write
is <b>not</b> elided: The usual semantics
apply, and the code goes through exactly what it did without TSX, the way it
worked before.</div>
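<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Again hedged, since nothing supporting this has shipped: here’s how I’d expect HLE to surface to programmers – the same old spinlock, with elision hints on the acquire and release. The __ATOMIC_HLE_* flag names follow the GCC-style atomic builtins and are my assumption about eventual toolchain support, not something I’ve compiled for Haswell.</div>
<pre>
/* HLE sketch: a plain spinlock with elision hints. Flag names are assumed
 * (GCC-style atomic builtins); not verified on real hardware or compilers.
 */
#include <immintrin.h>   /* _mm_pause */

static int lockvar = 0;

void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the write to lockvar is elided; we run
       speculatively with lockvar in our transactional read set. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();     /* spin politely if we really do have to wait */
}

void hle_unlock(void)
{
    /* XRELEASE-prefixed store: the "Commit!" that makes our writes visible. */
    __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
</pre>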
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, as I understand it, if you have a hash table and a read
is under way, and a write to the same bucket happens, then both the read and the
write abort. One of those two gets the lock and does its thing, followed by the
other according to the original code. But other reads or writes that don’t have
conflicts go right through.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This seems like it will work, but I have to say I’d like
to see the data on real code. My gut tells me that anything which changes the
semantics of parallel locking, which HLE does, is going to have a weird effect
somewhere. My guess would be some fun, subtle, intermittent performance bugs.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
The Serial Skeleton in the Closet</h3>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is all powerful stuff that will certainly aid
parallel efficiency in both MIC, with its 30-plus cores; and the Xeon line,
with fewer but faster cores. (Fewer faster cores need it too, since serialization
inefficiency gets proportionally worse with faster cores.) But don’t think for
a minute that it eliminates all serialization.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I see no issue with the part of this that monitors locations
read and written; I don’t know Intel’s exact implementation, but I feel sure it
re-uses the cache coherence mechanisms already present, which operate without
(too) much serialization.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
However, there’s a reason I used a deliberately disgusting
analogy when talking about pushing all the written data to memory on “Commit!”
(XEND, XRELEASE). Recall that the required semantics are “all or nothing”: <i>Every</i> entity in the system sees <i>all</i> of the changes, or <i>every</i> entity sees <i>none</i> of them. (I’ve been saying “entity” because GPUs are now prone
to directly access cache coherent memory, too.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If the code has changed multiple locations during a
transaction, probably on multiple cache lines, that means those changes have to
be made all at once. If locations A and B both change, nobody can possibly see location
A <i>after</i> it changed but location B <i>before</i> it changed. Nothing, anywhere,
can get between the write of A and the write of B (or the making of both changes
visible outside of cache).</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
As I said, I don’t know Intel’s exact implementation, so
could conceivably be wrong, but for me that implies that every “Commit!”
requires<i> a whole system serialization
event: Every processor and thread in the whole system has to be not just
halted, but pipelines drained.</i> Everything must come to a dead stop. Once
that stop is done, then all the changes can be made visible, and everything
restarted. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note that Intel’s TSX architecture spec says nothing
about these semantics being limited to one particular chip or region. This is
very good; software exploitation would be far harder otherwise. But it implies
that in a multi-chip, multi-socket system, this halt and drain applies to every
processor in every chip in every socket. It’s a dead stop of <i>everything</i>.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Well, OK, but lock acquire and release instructions
always did this dead stop anyway, so likely the aggregate amount of
serialization is reduced. (Wait a minute, they always did this anyway?! What
the… Yeah. Dirty little secret of the hardware dudes.) </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But lock acquire and release only involve one cache line
at a time. “Commit!” may involve many. Writes involve letting everybody else
know you’re writing a particular line, so they can invalidate it in their
cache(s). Those notifications all have to be sent out, serially, and
acknowledgements received. They can be pipelined, and probably are, but the
process is still serial, and must be done while at a dead stop.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, if your transactional segment of code modifies, say,
128KB spread over 2K cache lines (at 64 bytes per line), you can expect a noticeable bit of
serialization time when you “Commit!”. Don’t forget this issue now includes all
your old-style locking, thanks to HLE, where the original locking involved
updating just one cache line. This is another reason I want to see some real
running code with HLE. Who knows what evil lurks between the locks?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But as I said, I don’t know the implementation. Could
Intel folks have found a way around this? Maybe; I’m writing this, as I’ve
indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell
ships.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Enjoy your bright, shiny, new, non-blocking transactional
memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug.
It bites.</div>
<br />Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com8tag:blogger.com,1999:blog-3155908228127841862.post-3219092461982206082012-01-09T16:37:00.000-07:002012-01-09T16:37:25.648-07:0020 PFLOPS vs. 10s of MLOC: An Oak Ridge Conundrum<br />
<div class="MsoBodyText">
On The One Hand:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Oak Ridge National Laboratories (ORNL) is heading for a
20 PFLOPS system, getting there by using Nvidia GPUs. Lots of GPUs. Up to
18,000 GPUs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is, of course, neither a secret nor news. Look <a href="http://insidehpc.com/2011/10/11/20-petaflop-titan-super-at-ornl-to-be-accelerated-with-18000-nvidia-gpus/">here</a>,
or <a href="http://www.xbitlabs.com/news/other/display/20111012223953_ORNL_s_Titan_Supercomputer_to_Deliver_10_20_PetaFLOPS_Performance.html">here</a>,
or <a href="http://blogs.knoxnews.com/munger/2011/12/the-grand-move-toward-titan.html">here</a>
if you haven’t heard; it was particularly trumpeted at <a href="http://sc11.supercomputing.org/">SC11</a> last November. They’re
upgrading the U.S. Department of Energy's largest computer, Jaguar, from a mere
2.3 petaflops. It will grow into a system to be known as Titan, boasting a roaring
10 to 20 petaflops. Jaguar and Titan are shown below. Presumably there will be more interesting panel art ultimately provided for Titan.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeVt2_3jOz0V2WQ6iFYKyCpPL9owoXojFZdyoYfjstT1R3nR3_SDF18eqBJb81qCgoOB9CmSl_Dkx-3Rl-Sd1LG74KtL3M8m2N32sGwHNu3myRoKvIo4DqEw9VqUCyXeGUU9Xa20q1k_5N/s1600/ORNL+jaguar.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeVt2_3jOz0V2WQ6iFYKyCpPL9owoXojFZdyoYfjstT1R3nR3_SDF18eqBJb81qCgoOB9CmSl_Dkx-3Rl-Sd1LG74KtL3M8m2N32sGwHNu3myRoKvIo4DqEw9VqUCyXeGUU9Xa20q1k_5N/s320/ORNL+jaguar.jpg" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgPXIlot6dEDRLFd8IlLQlqqv2bB-NktWBugSqxxa1kAH4fFUOSQbSr7_UTu4XdbOKjbbHA8cU2RRlZIt-iv_yPD_s_GB2b5CMDwUITpngcwOAxgj2l0BbDm2OuAr5WthDvQG9KeTdb0Gp/s1600/ORNL+Titan.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgPXIlot6dEDRLFd8IlLQlqqv2bB-NktWBugSqxxa1kAH4fFUOSQbSr7_UTu4XdbOKjbbHA8cU2RRlZIt-iv_yPD_s_GB2b5CMDwUITpngcwOAxgj2l0BbDm2OuAr5WthDvQG9KeTdb0Gp/s320/ORNL+Titan.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The upgrade of the Jaguar Cray XT5 system will introduce new Cray
XE6 nodes with AMD’s 16-core Interlagos Opteron 6200. However, the big
performance numbers come from new <a href="http://www.theregister.co.uk/2011/05/24/cray_xk6_gpu_supercomputer/">XK6
nodes</a>, which replace two (half) of the AMDs with Nvidia Tesla 3000-series Kepler
compute accelerator GPUs, as shown in the diagram. (The blue blocks are Cray’s Gemini
inter-node communications.)</div>
<div class="MsoNoSpacing">
<o:p><br /></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMU85oF1nqufK82BgJr7EFGXVdMaYuNfs6TuO8Vz_XfAEPDEv18AL-Zidzf14AxA4uUuYq7gQD2Lge_Q_XJSnBmowuC2cFJPkjrPVVDQu45aWxwNiB1rXImdejQV044lVomJpeXhZYz0hi/s1600/cray_xe6_xk6_schematic.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMU85oF1nqufK82BgJr7EFGXVdMaYuNfs6TuO8Vz_XfAEPDEv18AL-Zidzf14AxA4uUuYq7gQD2Lge_Q_XJSnBmowuC2cFJPkjrPVVDQu45aWxwNiB1rXImdejQV044lVomJpeXhZYz0hi/s320/cray_xe6_xk6_schematic.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The actual performance is a range because it will “depend
on how many (GPUs) we can afford to buy," <a href="http://blogs.knoxnews.com/munger/2011/12/the-grand-move-toward-titan.html">according
to Jeff Nichols</a>, ORNL's associate lab director for scientific computing. 20
PFLOPS is achieved if they reach 18,000 XK6 nodes, apparently meaning that all
the nodes are XK6s with their GPUs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
All this seems like a straightforward march of progress
these days: upgrade and throw in a bunch of Nvidia number-smunchers. Business
as usual. The only news, and it is significant, is that it’s actually being
done, sold, installed, accepted. Reality is a good thing. (Usually.) And GPUs
are, for good reason, the way to go these days. Lots and lots of GPUs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On The Other Hand:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Oak Ridge has applications totaling at least 5 million
lines of code <b><i>most of which “does not run on GPGPUs and probably never will due to
cost and complexity”</i></b> [emphasis added by me]. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
That’s what was said at an Intel press briefing at SC11
by Robert Harrison, a corporate fellow at ORNL and director of the National
Institute of Computational Sciences hosted at ORNL. He is the person working to get codes ported to
Knight’s Ferry, a pre-product software development kit based on Intel’s MIC
(May Integrated Core) architecture. (See my prior post <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">MIC
and the Knights</a> for a short description of MIC and links to further
information.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<a href="http://insidehpc.com/2011/11/16/video-intels-knights-corner-does-1-teraflop-on-a-single-chip-at-sc11/">Video</a>
of that entire briefing is available, but the things I’m referring to are all
the way towards the end, starting at about the 50 minute mark. The money slide
out of the <a href="http://newsroom.intel.com/servlet/JiveServlet/download/38-6968/Intel_SC11_presentation.pdf">entire
set</a> is page 30:</div>
<div class="MsoNoSpacing">
<o:p><br /></o:p></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcHSanvZBnmNDbn4nT_FQB48ZFCTpfSi4R9tpbPjeBQmDWwmTQC8zrndzfHOk6U-1Yz4m9cnkseBfYa8gZD9i4_SGahRvzdrpVLR5z0qu4nIg7d_v1r2VJIcSYWpjzqPbo47ZNEN_p1yG0/s1600/Slide13.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcHSanvZBnmNDbn4nT_FQB48ZFCTpfSi4R9tpbPjeBQmDWwmTQC8zrndzfHOk6U-1Yz4m9cnkseBfYa8gZD9i4_SGahRvzdrpVLR5z0qu4nIg7d_v1r2VJIcSYWpjzqPbo47ZNEN_p1yG0/s400/Slide13.jpg" width="400" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
(And I really wish whoever was making the video didn’t
run out of memory, or run out of battery, or have to leave for a potty break, or whatever else
right after this page was presented; it's not the last.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The presenters said that they had actually ported “tens
of millions” of lines of code, most functioning within one day. That does not
mean they performed well in one day – see <a href="http://perilsofparallel.blogspot.com/2011/10/mic-and-knights.html">MIC
and the Knights</a> for important issues there – but he did say that they had
decades of experience making vector codes work well, going all the way back to
the Cray 1.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What Harrison says in the video about the possibility of
GPU use is actually quite a bit more emphatic than the statement on the slide:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-bottom: .0001pt; margin-bottom: 0in; margin-left: .6in; margin-right: .6in; margin-top: 0in;">
“Most of this software, I can
confidently say since I'm working on them ... will not run on GPGPUs as we
understand them right now, in part because of the sheer volume of software,
millions of lines of code, and in large part because the algorithms,
structures, and so on associated with the applications are just simply don't
have the massive parallelism required for fine grain [execution]."</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
All this is, of course, right up Intel’s alley, since
their target for MIC is source compatibility: Change a command-line flag, recompile,
done.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I can’t be alone in seeing a disconnect between the Titan
hype and these statements. They make it sound like they’re busy building a
system they can’t use, and I have too much respect for the folks at ORNL to
think that could be true.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, how do we resolve this conundrum? I can think of
several ways, but they’re all speculation on my part. In no particular order:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>The 20 PFLOP
number is public relations hype.</b> The contract with Cray is apparently quite
flexible, allowing them to buy as many or as few XK6 Tesla-juiced nodes as they
like, presumably including zero. That’s highly unlikely, but it does allow a “try
some and see if you like it” approach which might result in rather few XK6 nodes
installed.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>Harrison is
being overly conservative.</b> When people really get down to it, perhaps
porting to GPGPUs won’t be all that painful -- particularly compared with the
vectorization required to really make MIC hum.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>Those MLOCs aren’t
important for Jaguar/Titan.</b> Unless you have a clearance a lot higher than
the one I used to have, you have no clue what they are really running on
Jaguar/Titan. The codes ported to MIC might not be the ones they need there, or
what they run there may slip smoothly onto GPGPUs, or they may be so important a
GPGPU porting effort is deemed worthwhile.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>MIC doesn’t
arrive on time.</b> MIC is still vaporware, after all, and the Jaguar/Titan
upgrade is starting now. (It’s a bit delayed because AMD’s <a href="http://www.theregister.co.uk/2012/01/05/cray_q4_whacked_by_amd/">having
trouble delivering</a> those Interlagos Opterons, but the target start date is
already past.) The earliest firm deployment date I know of for MIC is at the Texas
Advanced Computing Center (TACC) at The University of Texas at Austin. Its new
Stampede system <a href="http://www.tacc.utexas.edu/news/press-releases/2011/stampede">uses MIC</a>
and deploys in 2013.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>Upgrading is a
lot simpler and cheaper</b> – in direct cost and in operational changes – than installing
something that could use MIC. After all, Cray likes AMD, and uses AMD’s
inter-CPU interconnect to attach their Gemini inter-node network. This may not
hold water, though, since Nvidia isn’t well-liked by AMD anyway, and the Nvidia
chips are attached by PCI-e links. PCI-e is what Knight’s Ferry and Knight’s
Corner (the product version) use, so one could conceivably plug them in.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
- <b>MIC is too
expensive.</b> </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
That last one requires a bit more explanation. Nvidia Teslas
are, in effect, subsidized by the volumes of their plain graphics GPUs. These
use the same architecture and can to a significant degree re-use chip designs.
As a result, the development cost to get Tesla products out the door is spread
across a vastly larger volume than the HPC market provides, allowing much lower
pricing than would otherwise be the case. Intel doesn’t have that volume
booster, and the price might turn out to reflect that. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
That Nvidia advantage won’t last forever. Every time AMD
sells a Fusion system with GPU built in, or Intel sells one of their chips with
graphics integrated onto the silicon, another nail goes into the coffin of low-end
GPU volume. (See my post <a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">Nvidia-based
Cheap Supercomputing Coming to an End</a>; the post turned out to be too
optimistic about Intel & AMD graphics performance, but the principle still
holds.) However, this volume advantage is still in force now, and may result in
a significantly higher cost for MIC-based units. We really have no idea how
Intel’s going to price MIC, though, so this is speculation until the MIC vapor
condenses into reality.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Some of the resolutions to this Tesla/MIC conflict may be
totally bogus, and reality may reflect a combination of reasons, but who knows?
As I said above, I’m speculating, a bit caught…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>I’m just a little bit caught
in the middle<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>MIC is a dream, and Tesla’s a
riddle<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>I don’t know what to say, can’t
believe it all, I tried<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>I’ve got to let it go<o:p></o:p></i></div>
<div class="MsoNoSpacing" style="margin-left: .5in;">
<i>And just enjoy the show.<a href="file:///C:/Users/gpfister/Desktop/Knight%20at%20SC11/Perils%20ORNL%20MIC%20Tesla.docx#_ftn1" name="_ftnref1" title=""><span class="MsoFootnoteReference"><span class="MsoFootnoteReference"><b><span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">[1]</span></b></span></span></a><o:p></o:p></i></div>
<div class="MsoNoSpacing">
<br /></div>
<div>
<br />
<hr align="left" size="1" width="33%" />
<div id="ftn1">
<div class="MsoFootnoteText">
<a href="file:///C:/Users/gpfister/Desktop/Knight%20at%20SC11/Perils%20ORNL%20MIC%20Tesla.docx#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference"><span class="MsoFootnoteReference"><span style="font-family: Calibri, sans-serif; font-size: 10pt; line-height: 115%;">[1]</span></span></span></a>
With apologies to Lenka, the artist who actually <a href="http://starcasm.net/archives/121713">wrote the song</a> the girl sings in
Moneyball. Great movie, by the way.</div>
</div>
</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-4027866662241832522011-10-28T18:00:00.001-06:002011-10-28T21:05:57.574-06:00MIC and the KnightsIntel’s Many-Integrated-Core architecture (MIC) was on
wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s
Ferry (KF) software development kit. Well, I thought it was wide display, but I’m
an IDF Newbie. There was mention in two keynotes, a demo in the first booth on
the right in the exhibit hall, several sessions, etc. Some old hands at IDF
probably wouldn’t consider the display “wide” in IDF terms unless it’s in your
face on the banners, the escalators, the backpacks, and the bagels.<br />
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Also, there was much attempted discussion of the 2012
product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion
was much attempted by me, anyway, with decidedly limited success. There were
some hints, and some things can be deduced, but the real KC hasn’t stood up
yet. That reticence is probably a turn for the better, since KF is the direct
descendant of Intel’s Larrabee graphics engine, which was quite prematurely
trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only
to eventually be dropped – to become KF. A bit more circumspection is now
certainly called for.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This circumspection does, however, make it difficult to
separate what I learned into neat KF or KC buckets; KC is just too well hidden
so far. Here are my best guesses, answering questions I received from Twitter
and elsewhere as well as I can.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If you’re unfamiliar with MIC or KF or KC, you can call
up a plethora of resources on the web that will tell you about it; I won’t be
repeating that information here. Here’s a relatively recent one: <a href="http://www.brightsideofnews.com/news/2011/6/20/intel-larrabee-take-two-knights-corner-in-20122c-exascale-in-2018.aspx">Intel
Larrabee Take Two</a>. In short summary, MIC is the widest X86 shared-memory multicore
anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one
chip. KC has “50 or more.” In addition, and crucially for much of the
discussion below, each core has an enhanced and expanded vector / SIMD unit. You
can think of that as an extension of SSE or AVX, but 512 bits wide and with
many more operations available.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
An aside: Intel’s department of code names is fond of
using place names – towns, rivers, etc. – for the externally-visible names of
development projects. “Knight’s Ferry” follows that tradition; it’s a town up
in the Sierra Nevada Mountains in central California. The only “Knight’s
Corner” I could find, however, is a “populated area,” not even a real town,
probably a hamlet or development, in central Massachusetts. This is at best an
unlikely name source. I find this odd; I wish I’d remembered to ask about it.</div>
<div class="MsoNoSpacing">
<br /></div>
<h2>
Is It Real?</h2>
<div class="MsoNoSpacing">
The MIC architecture is apparently as real as it can be.
There are multiple generations of the MIC chip in roadmaps, and Intel has
committed to supply KC (product-level) parts to the University of Texas <a href="http://communities.intel.com/community/openportit/server/blog/2011/09/22/intel-mic-scores-1st-home-run-with-10-petaflop-stampede-supercomputer">TACC
by January 2013</a>, so at least the second generation is as guaranteed to be
real as a contract makes it. I was repeatedly told by Intel execs I interviewed
that it is as real as it gets, that the MIC architecture is a long-term
commitment by Intel, and it is not transitional – not a step to other,
different things. This is supposed to be <b><i>the</i></b> Intel highly-parallel technical
computing accelerator architecture, period, a point emphasized to me by several
people. (They still see a role for Xeon, of course, so they don't think of MIC as the only
technical computing architecture.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
More importantly, Joe Curley (Intel HPC marketing) gave
me a reason why MIC is real, and intended to be architecturally stable: HPC and
general technical computing are about a third of Intel’s server business. Further,
that business tends to be a very profitable third since those customers tend to
buy high-end parts. MIC is intended to slot directly into that business,
obviously taking the money that is now increasingly spent on other accelerators
(chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as
discussed below, Intel’s intention for MIC is to greatly widen the pool of
customers for accelerators.</div>
<div class="MsoNoSpacing">
<br /></div>
<h2>
The Big Feature: Source Compatibility</h2>
<div class="MsoNoSpacing">
There is absolutely no question that Intel regards source
compatibility as a primary, key feature of MIC: Take your existing programs,
recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag),
and they run on KF. I have zero doubt that this will also be true of KC and is
planned for every future release in their road map. I suspect it’s why there is
a MIC – why they did it, rather than just burying Larrabee six feet deep. No
binary compatibility, though; you need to recompile.</div>
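<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The whole pitch in one tiny (hedged) example: the same source builds for the host or for MIC with one extra flag. The __MIC__ macro is what I believe the Intel compiler defines when targeting MIC; that, and the exact build lines, are assumptions on my part.</div>
<pre>
/* hello_mic.c -- the "-mmic" recompile in miniature. Build lines and the
 * __MIC__ macro are my assumptions about the toolchain, not verified.
 *   icc -O2 hello_mic.c -o hello             (runs on the host Xeon)
 *   icc -O2 -mmic hello_mic.c -o hello.mic   (copy to the card's Linux and run)
 */
#include <stdio.h>

int main(void)
{
#ifdef __MIC__
    printf("Built for MIC: same source, one extra flag.\n");
#else
    printf("Built for a plain Xeon host.\n");
#endif
    return 0;
}
</pre>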
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
You do need to be on Linux; I heard no word about
Microsoft Windows. However, Microsoft Windows 8 has a <a href="http://blogs.msdn.com/b/b8/archive/2011/10/27/using-task-manager-with-64-logical-processors.aspx">new
task manager display</a> changed to be a better visualization of many more – up
to 640 – cores. So who knows; support is up to Microsoft.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Clearly, to get anywhere, you also need to be
parallelized in some form; KF has support for MPI (messaging), OpenMP (shared
memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading
Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s
a real Linux, by the way, that runs on a few of the MIC processors; I was told
“you can SSH to it.” The rest of the cores run some form of microkernel. I see
no reason they would want any of that to become more restrictive on KC.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If you can pull off source compatibility, you have something
that is wonderfully easy to sell to a whole lot of customers. For example, Sriram
Swaminarayan of LANL <a href="http://insidehpc.com/2011/09/16/video-future-programming-models/">has
noted</a> (really interesting video there) that over 80% of HPC codes have,
like him, a very large body of legacy codes they need to carry into the future.
“Just recompile” promises to bring back the good old days of clock speed
increases when you just compiled for a new architecture and went faster. At
least it does if you’ve already gone parallel on X86, which is far from
uncommon. No messing with newfangled, brain-bending languages (like CUDA or
OpenCL) unless you really want to. This collection of customers is large,
well-funded, and not very well-served by existing accelerator architectures.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Right. Now, for all those readers screaming at me “OK, it
<b><i>runs</i></b>,
but does it <b><i>perform?</i></b>” – </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Well, not necessarily.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The problem is that to get MIC – certainly KF, and it
might be more so for KC – to really perform, on many applications you must get its
512-bit-wide SIMD / vector unit cranking away. <a href="http://perilsofparallel.blogspot.com/2011/09/conversation-with-intels-james-reinders.html">Jim
Reinders regaled</a> me with a tale of a four-day port to MIC, where, surprised
it took that long (he said), he found that it took one day to make it run (just
recompile), and then three days to enable wider SIMD / vector execution. I
would not be at all surprised to find that this is pleasantly optimistic. After
all, Intel cherry-picked the recipients of KF, like CERN, which has one of the
world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications
in the known universe. (See my post <a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">Random
Things of Interest at IDF 2011</a>.)</div>
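<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For the curious, the kind of work that eats those extra three days is usually data-layout surgery, not exotic programming. Here is a toy illustration of my own (not Reinders’ code, so hedge accordingly): with 64-bit operands, the 512-bit units want 8 contiguous elements at a time, and an array-of-structures layout simply does not give them that.</div>
<pre>
/* Toy illustration (mine, not the ported code Reinders described) of the
 * data-layout surgery wide SIMD often demands.  With 64-bit doubles and
 * 512-bit vectors, the hardware wants 8 contiguous elements per operation. */

#define N 4096

/* Array-of-structures: the x fields are strided through memory, so the
 * compiler has to gather them; vectorization is poor or abandoned. */
struct body_aos { double x, y, z, m; };

void scale_aos(struct body_aos *b, double a) {
    for (int i = 0; i &lt; N; i++)
        b[i].x = a * b[i].x;
}

/* Structure-of-arrays: x[] is contiguous, so the same loop maps cleanly
 * onto 8-wide vector operations. */
struct bodies_soa { double x[N], y[N], z[N], m[N]; };

void scale_soa(struct bodies_soa *b, double a) {
    for (int i = 0; i &lt; N; i++)
        b->x[i] = a * b->x[i];
}
</pre>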
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Where, on this SIMD/vector issue, are the 80% of folks
with monster legacy codes? Well, Sriram (see above) commented that when LANL
tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes
with the horsepower coming from attached IBM Cell blades – they had a problem
because to perform well, the Cell SPUs needed to crank up their <b><i>two</i></b>-way
SIMD / vector units. Furthermore, they still have difficulty using earlier
Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s
8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the other hand, getting good performance on other accelerators,
like Nvidia’s, requires much wider SIMD; they need 100s of units cranking,
minimally. Full-bore SIMD may in some cases be simpler to exploit than
SIMD/vector instructions. But even going through gigabytes of grotty old
FORTRAN code just to insert notations saying “do this loop in parallel,” without
breaking the code, can be arduous. The programming language, by the way, is not
the issue. Sriram reminded me of the old saying that great FORTRAN coders, who
wrote the bulk of those old codes, can write FORTRAN in any language.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But wait! How can these guys be choking on 2-way
parallelism when they have obviously exploited thousands of cluster nodes in
parallel? The answer is that we have here two different forms of parallelism;
the node-level one is based on scaling the amount of data, while the SIMD-level
one isn’t. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In physical simulations, which many of these codes
perform, what happens in <i>this</i> simulated
galaxy, or <i>this</i> airplane wing, bomb,
or atmosphere column over <i>here</i> has a
relatively limited effect on what happens in <i>that</i> galaxy, wing, bomb or column <i>way over there</i>. The effects that do travel can be added as
perturbations, smoothed out with a few more global iterations. That’s the basis
of the node-level parallelism, with communication between nodes. It can also
readily be the basis of processor/core-level parallelism across the cores of a
single multiprocessor. (One basis of those kinds of parallelism, anyway; other
techniques are possible.) </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Inside any given galaxy, wing, bomb, or atmosphere column,
however, quantities tend to be much more tightly coupled to each other. (Consider,
for example, R<sup>2</sup> force laws; irrelevant when sufficiently far, dominant
when close.) Changing the way those tightly-coupled calculations are done can
often strongly affect the precision of the results, the mathematical properties
of the solution, or even whether you ever converge to any solution. That part
may not be simple at all to parallelize, even two-way, and exploiting SIMD /
vector forces you to work at that level. (For example, you can get into trouble
when going parallel and/or SIMD naively changes from Gauss-Seidel iteration to
Gauss-Jacobi iteration. I went into this in more detail way back in my book <i>In Search of Clusters</i> (Prentice-Hall), Chapter
9, “Basic Programming Models and Issues.”) To be sure, not all applications
have this problem; those that don’t often can easily spin up into thousands of
operations in parallel at all levels. (Also, multithreaded “real” SIMD, as
opposed to vector SIMD, <i>can in some cases</i>
avoid some of those problems. Note italicized words.)</div>
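<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
If the Gauss-Seidel / Gauss-Jacobi remark seems abstract, here is a minimal 1-D sketch of what I mean. It’s a toy of my own, not drawn from any of the codes discussed: Gauss-Seidel reuses values it just updated in the same sweep, a serial dependence; parallelize the loop naively and you have silently switched to Jacobi, which is a different iteration with different convergence behavior.</div>
<pre>
/* 1-D relaxation toy: N interior points, fixed boundary values in u[0] and
 * u[N+1].  Gauss-Seidel uses neighbors already updated in this sweep (a
 * serial dependence).  Jacobi uses only last sweep's values, so every point
 * can be updated in parallel -- but it is a *different* iteration. */

#define N 1024
double u[N + 2], u_old[N + 2];

void gauss_seidel_sweep(void) {
    for (int i = 1; i &lt;= N; i++)       /* order matters: u[i-1] is already new */
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

void jacobi_sweep(void) {
    for (int i = 0; i &lt; N + 2; i++)
        u_old[i] = u[i];
    #pragma omp parallel for           /* safe: reads only last sweep's values */
    for (int i = 1; i &lt;= N; i++)
        u[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
}
</pre>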
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The difficulty of exploiting parallelism in tightly-coupled
local computations implies that those 80% are in deep horse puckey no matter
what. You have to carefully consider everything (even, in some cases,
parenthesization of expressions, forcing order of operations) when changing
that code. Needing to do this to exploit MIC’s SIMD suggests an opening for
rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually
necessary for Intel, too, and if you do it our way you get” tons more
performance / lower power / whatever.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Can compilers help here? Sure, they can always eliminate a
pile of gruntwork. Automatically vectorizing compilers have been working quite
well since the 80s, and progress continues to be made in disentangling the aliasing
problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or
semi-commercial) products from people like <a href="http://www.irisa.fr/caps/">CAPS</a>
and <a href="http://www.pgroup.com/">The Portland Group</a> get better results
if you tell them what’s what, with annotations. Those, of course, must be very
carefully applied across mountains of old codes. (They even emit CUDA and
OpenCL these days.)</div>
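<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Here is a small example of the aliasing problem and the kind of annotation that fixes it. The <i>restrict</i> qualifier is standard C99; the pragma spelling varies from vendor to vendor, so take that line as illustrative rather than as any particular compiler’s documented behavior.</div>
<pre>
/* Without more information the compiler must assume a, b, and c might
 * overlap -- the FORTRAN COMMON problem in C clothing -- and may refuse to
 * vectorize.  C99 'restrict' removes the doubt; vendor pragmas are another
 * route (the spelling below is one dialect, shown only as illustration). */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, int n) {
    #pragma ivdep                      /* "no loop-carried dependences here" */
    for (int i = 0; i &lt; n; i++)
        a[i] = b[i] + s * c[i];
}
</pre>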
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
By the way, at least some of the parallelism often exploited
by SIMD accelerators (as opposed to SIMD / vector) derives from what I called
node-level parallelism above.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Returning to the main discussion, Intel’s MIC has the
great advantage that you immediately get a simply ported, working program; and,
in the cases that don’t require SIMD operations to hum, that may be all you
need. Intel is pushing this notion hard. One IDF session presentation was
titled “Program the SAME Here and Over There” (caps were in the title). This is
a very big win, and can be sold easily because customers want to believe that
they need do little. Furthermore, you will probably always need less SIMD /
vector width with MIC than with GPGPU-style accelerators. Only experience over
time will tell whether that really matters in a practical sense, but I suspect
it does.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3>
Several Other Things</h3>
<div class="MsoNoSpacing">
Here are other MIC facts/factlets/opinions, each needing
far less discussion.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
How do you get from one MIC to another MIC? MIC, both KF
and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does
not have a PCIe root complex, so cannot source PCIe. It must be attached to a
standard compute node. So all anybody was talking about was going down PCIe to
node memory, then back up PCIe to a different MIC, all at least partially under
host control. Maybe one could use peer-to-peer PCIe device transfers, although
I didn’t hear that mentioned. I heard nothing about separate busses directly
connecting MICs, like the ones that can connect dual GPUs. This PCIe use is
known to be a bottleneck, um, I mean, “known to require using MIC on
appropriate applications.” Will MIC be that way for ever and ever? Well, “no
announcement of future plans”, but “typically what Intel has done with
accelerators is eventually integrate them onto a package or chip.” They are
“working with others” to better understand “the optimal arrangement” for
connecting multiple MICs.</div>
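<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For the record, here is roughly what that down-and-back-up pattern looks like in code, in the offload-pragma style Intel’s compilers provide. I’m writing the pragmas from memory, so consider the exact spelling an assumption; the point is only that the host and its PCIe links sit in the middle of every MIC-to-MIC transfer.</div>
<pre>
/* Hedged sketch of MIC-to-MIC data movement via host memory.  The offload
 * pragma spelling is from memory -- treat it as an assumption, not a
 * verified recipe.  The structure is the point: everything goes down PCIe
 * into host memory, then back up PCIe to the other card. */
#define N 1048576
static float buf[N];

void mic_to_mic_via_host(void) {
    /* Compute on card 0; the result comes down PCIe into host memory. */
    #pragma offload target(mic:0) out(buf : length(N))
    {
        for (int i = 0; i &lt; N; i++)
            buf[i] = 0.5f * i;
    }

    /* Ship the same buffer back up PCIe to card 1 and use it there. */
    #pragma offload target(mic:1) in(buf : length(N))
    {
        float sum = 0.0f;
        for (int i = 0; i &lt; N; i++)
            sum += buf[i];
        (void)sum;                     /* keep the compiler quiet */
    }
}
</pre>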
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
What kind of memory semantics does MIC have? All I heard
was flat cache coherence across all cores, with ordering and synchronizing
semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple
nodes. Not 32-way Linux, though, just 4-way (16, including threads). (Now that
I think of it, did that count threads? I don’t know.) I asked whether the other
cores ran a micro-kernel and got a nod of assent. It is not the same Linux that
they run on Xeons. In some ways that’s obvious, since those microkernels on
other nodes have to be managed; whether other things changed I don’t know. Each
core has a private cache, and all memory is globally accessible. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Synchronization will likely change in KC. That’s how I
interpret <a href="http://perilsofparallel.blogspot.com/2011/09/conversation-with-intels-james-reinders.html">Jim
Reinders’ comment</a> that current synchronization is fine for 32-way, but over
40 will require some innovation. KC has been said to be 50 cores or more, so
there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100%
necessary for source code to <b>run</b> (as
opposed to <b>perform</b>), I think that
might be a candidate for the chopping block at some point.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Is there adequate memory bandwidth for apps that strongly
stream data? The answer was that they were definitely going to be competitive,
which I interpret as saying they aren’t going to break any records, but will be
good enough for less stressful cases. Some quite knowledgeable people I know
(non-Intel) have expressed the opinion that memory chips will be used in stacks
next to (not on top of) the MIC chip in the product, KC. Certainly that would
help a lot. (This kind of stacking also appears in a leaked picture of a “<a href="http://semiaccurate.com/2011/10/27/amd-far-future-prototype-gpu-pictured/">far
future prototype</a>” from Nvidia, as well as <a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">an
Intel Labs demo at IDF</a>.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Power control: Each core is individually controllable,
and you can run all cores flat out, in their highest power state, without
melting anything. That’s definitely true for KF; I couldn’t find out whether
it’s true for KC. Better power controls than used in KF are now present in
Sandy Bridge, so I would imagine that at least that better level of support
will be there in KC.</div>
<div class="MsoNoSpacing">
<br /></div>
<h2>
Concluding Thoughts</h2>
<div class="MsoNoSpacing">
Clearly, I feel the biggest point here is Intel’s
planned commitment over time to a stable architecture that is source code
compatible with Xeon. Stability and source code compatibility are clear selling
points to the large fraction of the HPC and technical computing market that
needs to move forward a large body of legacy applications; this fraction is not
now well-served by existing accelerators. Also important is the availability of
familiar tools, and more of them, compared with popular accelerators available
now. There’s also a potential win in being able to evolve existing programmer skills,
rather than replacing them. Things do change with the much wider core- and
SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that
required by current accelerator products, and it starts in a familiar place.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Will MIC win in the marketplace? Big honking SIMD units,
like Nvidia ships, will always produce more peak performance, which makes it
easy to grab more press. But Intel’s architectural disadvantage in peak juice
is countered by process advantage: They’re always two generations ahead of the
fabs others use; KC is a 22nm part, with those famous “3D” transistors. It
looks to me like there’s room for both approaches.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Finally, don’t forget that Nvidia in particular is here
now, steadily increasing its already massive momentum, while a product version
of MIC remains pie in the sky. What happens when the rubber meets the road with
real MIC products is unknown – and the track record of Larrabee should give
everybody pause until reality sets well into place, including SIMD issues,
memory coherence and power (neither discussed here, but not trivial), etc. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I think a lot of people would, or should, want MIC to
work. Nvidia is hard enough to deal with in reality that <b><i>two</i></b> best paper awards
were given at the recently concluded <a href="http://www.ipdps.org/">IPDPS</a>
2011 conference – the largest and most prestigious academic parallel computing conference
– for papers that may as well have been titled “How I actually managed to do
something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown <a href="http://techtalks.tv/events/54/">here</a>.) Granted, things like
a shortest-path graph algorithm (PHAST) are not exactly what one typically expects
to run well on a GPGPU. Nevertheless, this is not a good sign. People should
not have to do work at the level of intellectual academic accolades to get something
done – anything! – on a truly useful computer architecture. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Hope aside, a lot of very difficult hardware and software
still has to come together to make MIC work. And…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Larrabee was supposed to be real, too.<br />
<br />
**************************************************************<br />
<br />
Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com12tag:blogger.com,1999:blog-3155908228127841862.post-10393861514991502732011-10-05T11:08:00.000-06:002011-10-21T17:26:37.022-06:00Will Knight’s Corner Be Different? Talking to Intel’s Joe Curley at IDF 2011<br />
<div class="MsoNoSpacing">
At the recent Intel Developer Forum (IDF), I was given
the opportunity to interview Joe Curley, Director, Technical Computing Marketing
of Intel’s Datacenter & Connected Systems Group in Hillsboro.</div>
<div class="MsoNoSpacing">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA65XrzEQ0535UID9r4MMjMBos8hGKAd_7QcJWq9ER5IhqYo6NM8zv5if5LIyJJVt1qFIFlQPCJ2mQVVE_P_KKcy1MpoKmh8la76gG2FnmRMe3P_KdwsZ9ONdQcUm_VM2szAp922XFVo6d/s1600/joe-curley-230x142.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA65XrzEQ0535UID9r4MMjMBos8hGKAd_7QcJWq9ER5IhqYo6NM8zv5if5LIyJJVt1qFIFlQPCJ2mQVVE_P_KKcy1MpoKmh8la76gG2FnmRMe3P_KdwsZ9ONdQcUm_VM2szAp922XFVo6d/s1600/joe-curley-230x142.jpg" /></a></div>
<br /></div>
<div class="MsoNoSpacing">
Intel-provided information about Joe:</div>
<div style="border-bottom: solid #4F81BD 1.0pt; border: none; margin-left: .65in; margin-right: .65in; mso-border-bottom-alt: solid #4F81BD .5pt; mso-border-bottom-themecolor: accent1; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;">
<div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">
Joe Curley, serves Intel® Corporation as director of
marketing for technical computing in the Data Center Group. The technical
computing marketing team manages marketing for high-performance computing (HPC)
and workstation product lines as well as future Intel® Many Integrated Core
(Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities
that lead up to the announcement of the Intel® MIC Architecture in May of 2010.
Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng
Labs in a series of marketing and engineering leadership roles.</div>
</div>
<div class="MsoNoSpacing">
I recorded our conversation; what follows is a transcript.
Also, I used Twitter to crowd-source questions, and some of my comments refer
to picking questions out of the list that generated. (Thank you! to all who
responded.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This is the last in a series of three such transcripts. Hallelujah!
Doing this has been a pain. I’ll have at least one additional post about IDF
2011, summarizing the things I learned about MIC and the Intel “Knight’s”
accelerator boards using them, since some important things learned were outside
the interviews. But some were in the interviews, including here.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Full disclosure: As I originally noted in a <a href="http://perilsofparallel.blogspot.com/2011/09/impressions-of-newbie-at-intel.html">prior
post</a>, Intel paid for me to attend IDF. Thanks, again. It was a great
experience, since I’d never before attended.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Occurrences of [] indicate words I added for
clarification or comment post-interview.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[We began by discovering we had similar deep backgrounds, both
starting in graphics hardware. I designed & built a display processor (a
prehistoric GPU), he built “the most efficient frame buffer controller you
could possibly make”. Guess which one of us is in marketing?]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: My experience in the [HPC] business really started relatively
recently, a little under five years ago, [when] I started working on many-core
processors. I won’t be able to go into history, but I can at least tell you
what we’re doing and why.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Why don’t we start there? At a high level, what are
you doing, and why? High level for what you are doing, and as much detail on “why”
as you can provide.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: We have to narrow the question. So, at Intel, what we’re
after first of all is in what we call our Technical Computing Marketing Group
inside Data Center Group. That has really three major objectives. The first one
is to specify the needs for high performance computing, how we can help our
customers and developers build the best high performance computing systems.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Let me stop you for a second right there. My
impression for high performance computing is that they are people whose needs
are that they want more. Just more.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Oh, yes, but more at what cost? What cost of power,
what cost of programmability, what cost of size. How are we going to build the
IO system to handle it affordably or use the fabric of the day.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Yes, they want more, but they want it at two
bytes/FLOPS of memory bandwidth and communication bandwidth.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: There’s an old thing called the Dilbert Spec, which is
“I want it all, and by the way, can it be free?” But that’s not really what
people tell us they want. People in HPC have actually been remarkably pragmatic
about what it takes to develop innovation. So they really want us to do some
things, and do them really well.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
By the way, to finish what we do, we also have the
workstation segment, and the MIC Many Integrated Core product line. The
marketing for that is also in our group.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
You asked “what are you doing and why.” It would probably
take forever to go across all domains, but we could go into any one of them a
little bit better.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Can you give me a general “why” for HPC, and a
specific “why” for MIC?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, HPC’s a really good business. I get stunned,
somebody must be asking really weird questions, asking “why are you doing HPC?”</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: What I’ve heard is that HPC is traditionally 12% of
the market.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Supercomputing is a relatively small percentage of the
market. <b>HPC and technical computing,
combined, is, not exactly, but roughly, a third of our data center business.</b>
<b><i>[emphasis added by me]</i></b> Our data center business is a pretty robust business.
And high performance computing is a business that requires very high end, high
performance processors. It’s actually a very desirable business to be in, if
you can do it, and if your systems work. It’s a business we spend a lot of time
working on because it’s a good business.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now, if you look at MIC, back in 2005 we made a tacit conclusion
that the performance of a system will come out of parallelism. Parallelism
could be expressed at Intel in a lot of different ways. You can look at it as
threads, we have this concept called hyperthreading. You can look at it as
cores. And we have the SSE instructions sitting around which are SIMD, that’s a
form of parallelism; people argue about the definition, but yes, it is. [I agree.]
So you take a look at the basic architectural constructs, ease of programming,
you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or
vectors, these common attributes have been adopted and well-used by a lot of
programmers. There are programs across the continuum of coarse- to fine-grained
parallel, embarrassingly parallel, pick your taxonomy. But there are
applications that developers would be willing to trade the performance of any
particular task or thread for the sum of what you can do inside the power
envelope at a given period of time. Lots of people have different ways of
defining that, you hear throughput, whatever, but this is the class of
applications, and over time they’re growing.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Growing relatively, or, say, compared to commercial
processing, or…? Is the segment getting larger?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: The number of people who have tasks they want to run
on that kind of hardware is clearly growing. One of the reasons we’re doing
MIC, maybe I should just cut it to the easiest answer, is developers and
customers asked us to.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Really?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: And they came to us with a really simple question. We
were struggling in the marketing group with how to position MIC, and one of our
developers got worked up, like “Look, you give me the parallel performance of
an accelerator, but you give me the ease of CPU programming!” Now, ease is a
funny word; you can get into religious arguments about ease. But I think what
he means is “I don’t have to re-think my algorithm, I don’t have to reorder my
data set, there are some things that I don’t have to do.” So they wanted to
have the idea of: give me this architecture and get it to scale to be wildly
parallel. And that is exactly what we’ve done with the MIC architecture. If you
think about what the Knight’s Ferry STP [? <span class="Apple-style-span" style="color: purple;">Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.</span>] is, a 32 core, coherent, on a chip,
teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the
programming model is, surprisingly, kind of like a bunch of processor cores on
a network, which a lot of people understand and can get a lot of utility out of
in a very well-understood way. So, in a sense, we’re giving people what they
want, and that, generally, is good business. And if you don’t give them what
they want, they’ll have to go find someone else. So we’re simply doing what our
marketplace asked us for.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Well, let me play a little bit of devil’s advocate
here, because MIC is very clearly derivative of Larrabee, and…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Knight’s Ferry is.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: … Knight’s Ferry is. Not MIC?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: No. I think you have to take a look at what Larrabee
was. Larrabee, by the way, was a really cool project, but what Larrabee was was
a tile rendering graphics device, which meant its design point, was first of
all the programming model was derived from what you do for graphics. It’s going
to be API-based, the answer it’s going to generate is going to be a pixel, the
pixel is going to have a defined level of sub-pixel accuracy. It’s a very
predictable output. The internal optimizations you would make for a graphics implementation
of a general many-core architecture is one very specific implementation. Let’s
talk about the needs of the high performance computing market. I need
bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have
a frame buffer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: It needed bandwidth to local memory [of which it didn’t
have enough; see my post <a href="http://perilsofparallel.blogspot.com/2010/01/problem-with-larrabee.html">The
Problem with Larrabee</a>]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Yes, but less than you think, because the cache was
the critical element in that architecture [again, see <a href="http://perilsofparallel.blogspot.com/2010/01/problem-with-larrabee.html">that
post</a>] if you look through the academic papers on that…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: OK, OK.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: So, they have a common heritage, they’re both derived
out of the thoughts that came out of the Intel Labs terascale research. They’re
both many-core. But Knight’s Ferry came out with a few, they’re only a few,
modifications. But the programming model is completely different. You don’t
program a graphics device like you do a computer, and MIC is a computer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: The higher-level programming model is different.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Correct.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: But it is a big, wide, cache-coherent SMP.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, yes, that’s what Knight’s Ferry is, but we haven’t
talked about what Knight’s Corner is yet, and unfortunately I won’t today, and we
haven’t talked about where the product line will go from there, either. But
there are many things that will remain the same, because there are things you
can take and embellish and work and things that will be really different.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: But can you at least give me a hint? Is there a chance
that Knight’s Corner will be a substantially different hardware model than
Knight’s Ferry?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: I’m going to <b><i>really</i></b> love to talk to you about
Knight’s Corner. [his emphasis]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: But not today.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: I’m going to duck it today.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Oh, man…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: The product is going to be in our 22 nm process, and
22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to
have the buzz generated, we’ll start generating buzz. Right now, the big thing is
that we’re making the investments in the Knight’s Ferry software development
platform, to see how codes scale across the many-core, to get the environment
and tools up, to let developers poke at it and find stuff, good stuff, bad stuff, in between stuff, that
allow us to adjust the product line for ongoing generations. We’ve done that
really well since we announced the architecture about 15 months ago.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I was wondering what else I was going to talk about
after having talked to both John Hengeveld and Jim Reinders. This is great.
Nobody talked about where it really came from, and even hinted that there were
changes to the MIC chip [architecture].</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Oh, no, no, many things will be the same, many things
will be different. If you’re targeting trying to do a pixel-renderer, go do a
pixel-renderer. If you’re trying to do a general-purpose computing device, do a
general-purpose computing device. You’ll see some things and say “well, it’s
all the same” and other things “wow, it’s completely different.” We’ll get
around to talking about the part when we’re a little closer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The most important thing that James and/or John should
have been talking about is that the key thing is the ability to not force the
developer to completely and utterly re-think their problem to use your hardware.
There are two models: In an accelerator model, which is something I spent a lot
of my life working with, accelerators have the advantage of optimization. You
can say “I want to do one thing really well.” So you can then describe a
programming model for the hardware. You can say “build your data this way,
write your program this way” and if you do it will work. The problem is that
not everything fits into the box. Oh, you have sparse data. Oh, you have
recursive code.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: And there’s madness in that direction, because if you
start supporting that you wind yourself around to a general-purpose machine. […usually,
a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel
of Reincarnation” in this blog, haven’t I? Oh, there it is: <a href="http://perilsofparallel.blogspot.com/2010/11/cloud-got-gpus.html">The
Cloud Got GPUs</a>, back in November 2010.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Then it’s not an accelerator any more. The thing that
you get in MIC is the performance of one of those accelerators. We’ve shown
this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision,
without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve
shown that. So you get performance, but you’re getting it out of the general
purpose programming model.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: That’s running LINPACK, or… ?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: That was an even more basic thing; I’m just talking
about SGEMM [<a href="http://en.wikipedia.org/wiki/General_Matrix_Multiply">single-precision
dense matrix multiply</a>]. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I just wanted to ground the number.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: For LU factorization, I think we showed hybrid LU,
really cool, one of the great things about this hybrid… </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: They’re demo-ing that downstairs.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: … OK. When the matrix size is small, I keep it on the
host; when the matrix size is large, I move it. But it’s all the same code, the
same code either place. I’m just deciding where I want to run the code
intelligently, based on the size of the matrix. You can get the exact number,
but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is
actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]</div>
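<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[For readers who want the shape of that hybrid scheme, here is a toy dispatch sketch of my own devising, not Intel’s demo code. The crossover size and the “run it on the card” helper are placeholders; the point is just that the same factorization source runs either place, and the host merely decides where.]</div>
<pre>
/* Toy sketch (mine, not Intel's demo code) of size-based dispatch: one
 * factorization routine, same source either place; the host decides where
 * to run it.  Crossover size and the offload helper are placeholders. */
#define SMALL_MATRIX 512                /* hypothetical crossover point */

static void lu_factor(float *A, int n) { (void)A; (void)n; /* body elided */ }

/* Stand-in for "run this on the card" -- really an offload pragma or a
 * runtime call; here it simply runs locally so the sketch compiles. */
static void run_on_mic(void (*f)(float *, int), float *A, int n) { f(A, n); }

void lu_dispatch(float *A, int n) {
    if (n &lt; SMALL_MATRIX)
        lu_factor(A, n);                /* small: not worth the PCIe round trip */
    else
        run_on_mic(lu_factor, A, n);    /* large: ship it across PCIe */
}
</pre>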
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Yaahh, well, there are a lot of people who can deliver
something like that.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: We’ll keep working on it and making it better and
better. So, what are we proving today. All we’ve proven today is that the
architecture is capable of performance. We’ve got a lot of work to do before we
have a product, but the architecture has shown itself to be capable. The
programming model, we have people who will speak for us, like the quotes that
came from <a href="http://www.thinq.co.uk/2011/6/20/intel-pushes-hpc-space-knights-corner/">LRZ</a>
[data center for the universities of Munich and the Bavarian Academy of
Sciences], from Leibniz [same place], a code they couldn’t port to other
accelerators was running in two hours and optimized in two days. Now, actual
mileage may vary, see dealer for…</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: So, there are things that just won’t run on a CUDA
model? Example?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, perhaps, again, the thing you try to get to is
whether there is evidence growing that what you say is real. So we’re having
people who are starting to be able to speak to that, and that gives people the
confidence that we’re going to be able to get there. The other thing it ends up
doing, it’s kind of an odd benefit, as people have started building their code,
trying to optimize it for MIC, they’re finding the parallelism, they’re doing
what we wanted them to do all along, they’re taking the same code on their
current cluster and they’re getting benefits right now.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: That’s got a long history. People would have some
grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler
couldn’t make crap out of it. So they cleaned it up, made it obvious what was
going on, and the vectorizer did its thing well. Then they put it back on the
original machine and it ran twice as fast.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: So, one of the nice things that’s happened is that as
people are looking at ways to scale power, performance, they’re finally getting
around to dealing with parallelism. The offer that we’re trying to provide is
portable, high level, standards-based, and you can use it now.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
You said “why.” That’s why. Our customers and developers
say “if you can do that, that’s really valuable.” Now. We’re four men and a
pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but
the thought and the promise and the early data is really good.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: OK. Well, great.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Was that a good use of the time?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: That’s a very good use of the time. Let me poke on one
thing a little bit. Conceptually, it ought to be simpler to write code to that
kind of a shared memory model and get parallelism out of the code that way.
Now, on the other hand, there was a talk – sorry, I forget his name, he was one
of the software guys working on Larrabee [it was Tom Forsyth; see my post <a href="http://perilsofparallel.blogspot.com/2010/01/problem-with-larrabee.html">The
Problem with Larrabee</a> again] said someone on the project had written four
renderers, and three of them were for Larrabee. He was having one hell of a
time trying to get something that performed well. His big issue, at least what
it came down to from what I remember of the talk, was memory bandwidth.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Well, first of all, we’ve said Larrabee’s not a
product. As I’ve said, one of the things that is critical, you’ve got the
compute-bound, you’ve got the memory-bound, and most people are somewhere in
between, but you have to be able to handle the two edge cases. We understand
that, and we intend to deliver a really good value across the spectrum. Now,
Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation
off the silicon we used, no one cares about that, but on Knight’s Ferry, the memory bus is 256
bits wide. Relatively narrow, and for a graphics processor, very narrow. There
are definitely design decisions in how that chip was made that would limit the
bandwidth. And the memory it was designed with is slower than the memory today,
you have all of the normal things. But if you went downstairs to the show
floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic
ray-tracer.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[What follows is a bit confused. He didn’t mean the
Austrian Crown stochastic ray-tracing demo, but rather the real-time
ray-tracing demo. As I said in my immediately previous post (<a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">Random
Things of Interest at IDF 2011</a>), the real-time demo is on a set of Knight’s
Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t
seen the real-time demo, just the stochastic one; the latter is not on Knight’s
Ferry.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I’ve seen that one. The Austrian Crown?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Yes.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: I thought that was on a cluster.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: In the little box behind there, he’s able to scale
from one to eight Knight’s Ferries.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: He never told me there was a Knight’s Ferry in there.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: Yes, it’s all Knight’s Ferry.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Well, I’m going to go down there and beat on him a
little bit.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A: I’m about to point you to a YouTube site, it got
compressed and thrown up on YouTube. You can’t get the impact of the complexity
of the rays, but you can at least get the superficial idea of the responsiveness
of the system from Knight’s Ferry. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[He didn’t point me to YouTube, or I lost it, but <a href="http://www.youtube.com/watch?v=4i3uc_SSQ9E">here’s one</a> I found.
Ignore the fact that the introduction is in Swedish or something <span class="Apple-style-span" style="color: #666666;"><i>[it's Dutch, actually]</i></span>; Daniel – and it’s
Daniel, not David – speaks English, and gives a good demo. Yes, everybody in
the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the <a href="http://perilsofparallel.blogspot.com/2011/10/random-things-of-interest-at-idf-2011.html">Random Things of Interest</a> post to directly include it.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Well, if you believe that what we’re going to do in our
mainstream processors is roughly double the FLOPS every generation for the next
many generations, that’s our intent. What if we can do that on the MIC line as
well? By the time you get to where ray-tracing would be practical, you could
see multiple of those being integrated into a single device [added in transcription:
Multiple MICs in a single device? Hierarchical MIC?] becomes practical
computationally. That won’t be far from now. So, it’s a nice demo. David’s an
expert in his field, I didn’t hear what he said, but it you want to see the
device downstairs actually running a fairly strenuous graphics workload, take a
look at that.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: OK. I did go down there and I did see that, I just
didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.]
On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian
Crown. It is.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[At that point, Dave Patterson walked in, which
interrupted us. We said hello – I know Dave of old, a bit – thanks were
exchanged with Joe, and I departed.]</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
[I can’t believe this is the end of the last one. I
really don’t like transcribing.]</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com7tag:blogger.com,1999:blog-3155908228127841862.post-39632508959792592292011-10-02T19:11:00.001-06:002011-10-08T12:04:27.223-06:00Random Things of Interest at IDF 2011 (Intel Developer Forum)<br />
<div class="MsoNoSpacing">
I still have one IDF interview to transcribe (Joe Curley),
but I’m sick of doing transcriptions. So here are a few other random things I
observed at the 2011 Intel Developers Forum. It is nothing like comprehensive.
It’s also not yet the promised MIC dump; that will still come.</div>
<h2>
Exhibit Hall</h2>
<div class="MsoNoSpacing">
I found very few products I had a direct interest in, but
then again I didn’t look very hard.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
On the right, immediately as you enter, was a demo of a
Xeon/MIC combination clocking 600-700 GFLOPS (quite assuredly single precision) doing LU factorization. Questions
to the guys running the demo indicated: (1) They did part on the Xeon, and
there may have been two of those, they weren’t sure (the diagram showed two).
(2) They really learned how to say “We don’t comment on competitors” and “We
don’t comment on unannounced products.”</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh39h5vzni2t9H9IV0bagTyqWLMIvH8GZb1ymQ-65NLHmaB5p1v87IS_0VmQR0D3DqEocShwWkQ5JtIAz74ya3BkS7YnI79i-skqHkLXFM1GYCFTgg47jd0r6caSL0pn20k5gbcHQ7U2ENR/s1600/0913011617.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh39h5vzni2t9H9IV0bagTyqWLMIvH8GZb1ymQ-65NLHmaB5p1v87IS_0VmQR0D3DqEocShwWkQ5JtIAz74ya3BkS7YnI79i-skqHkLXFM1GYCFTgg47jd0r6caSL0pn20k5gbcHQ7U2ENR/s200/0913011617.jpg" width="200" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A 6-legged robot controlled by Atom, controlled by a game
controller. I included this here only because it looked funky and I took a picture
(q. v.). Also, for some reason it was in constant slight motion, like it couldn’t
sit still, ever. </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
There were three things that were interesting to me in
the Intel Labs section:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ97F1Hj9u0dMqQRTXtTcmW7Yk_GgndjCUEICPeIg4WL1aof3kie-3_X8T6MJlACI5_LsKbJ2c84yLnKQb4EpSV1q8p_l-a2d3TKuJAtdyfCEDoriZ6eXiOd7c8qyEVOogAt2Wy00KvmZb/s1600/0913011654.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ97F1Hj9u0dMqQRTXtTcmW7Yk_GgndjCUEICPeIg4WL1aof3kie-3_X8T6MJlACI5_LsKbJ2c84yLnKQb4EpSV1q8p_l-a2d3TKuJAtdyfCEDoriZ6eXiOd7c8qyEVOogAt2Wy00KvmZb/s320/0913011654.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
One Tbit/sec memory stack: To understand why this is
interesting, you need to know that the semiconductor manufacturing processes
used to make DRAM and logic are quite different. Putting both on the same chip
requires compromises in one or the other. The logic that must exist on DRAM
chips isn’t quite as good as it could be, for example. In this project, they
separated the two onto separate chips in a stack: Logic is on one, the bottom
one, that interfaces with the outside world. On top of this are multiple pure
memory chips, multiple layers of pure DRAM, no logic. They connect by solder
bumps or something (I’m not sure), and there are many (thousands of) “through silicon
vias” that go all the way through the memory chips to allow connecting a whole
stack to the logic at the bottom with very high bandwidth. This whole idea eliminates
the need to compromise on semiconductor processes, so the DRAM can be dense
(and fast), and the logic can be fast (and low power). One result is that they
can suck 1 Tbit/sec of data out of one of these stacks. This just feels right
to me as a direction. Too bad they’re unlikely to use the new IBM/3M <a href="http://www.zurich.ibm.com/news/07/cooling.html">thermally conductive glue</a>
to suck heat out of the stack.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy9-4HNb1uaFa0GURxGUwikAYnBDii8qE8phA7pSzXCp3JpxG-WlmljW13KPOP65zcLGYYCuOjMlccpwBcmfgpUYJnGJNdCInunVXEnazuPWMdCvO2d_MPwxlPvglxIyLFx7m8DvhbYh5L/s1600/0913011635.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy9-4HNb1uaFa0GURxGUwikAYnBDii8qE8phA7pSzXCp3JpxG-WlmljW13KPOP65zcLGYYCuOjMlccpwBcmfgpUYJnGJNdCInunVXEnazuPWMdCvO2d_MPwxlPvglxIyLFx7m8DvhbYh5L/s320/0913011635.jpg" width="320" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Stochastic Ray-Tracing:
What it says: Ray-tracing, but allows light to be probabilistically
scattered off surfaces, so, for example, shiny matte surfaces have realistically
blurred reflections on them, and produce more realistic color effects on other
surfaces to which they reflect. Shiny matte surfaces like the surface of the
golden dome in the center of the Austrian crown, reflecting the jewels in the
outer band, which was their demo image. I have a picture here, but it comes
nowhere near doing this justice. The large, high dynamic range monitor they
had, though – wow. Just wow. Spectacular. A guy was explaining this to me pointing
to a normal monitor when I happened to glance up at the HDR one. I was like “shut
up already, I just want to look at that.” To run it they used a cluster of four Xeon-based nodes, each apparently about 4U high, and that was not in real time; several seconds were required per
update. But wow.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Real-Time Ray-Tracing: This has been done before; I saw
it a demo on a Cell processor back in about 2006. This, however, was a much
more complex scene than I’d previously viewed. It had the usual shiny classic
car, but that was now in the courtyard of a much larger old palace-like
building, with lots of columns and crenellations and the like. It ran on a MIC,
of course – actually, several of them, all attached to the same Xeon system. Each had a complete copy of the scene
data in its memory, which is unrealistic but does serve to make the problem “pleasantly
parallel” (which is what I’m told is now the PC way to describe what used to be
called “embarrassingly parallel”). However, the demo was still fun. Here's a video of it I found. It apparently was shot at a different event, but still the same technology demonstrated. The intro is in Swedish, or something, but it reverts to English at the demo. And yes, all the Intel Labs guys wore white lab coats. I teased them a bit on that.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<object class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://1.gvt0.com/vi/4i3uc_SSQ9E/0.jpg" height="266" width="320"><param name="movie" value="http://www.youtube.com/v/4i3uc_SSQ9E&fs=1&source=uds" />
<param name="bgcolor" value="#FFFFFF" />
<embed width="320" height="266" src="http://www.youtube.com/v/4i3uc_SSQ9E&fs=1&source=uds" type="application/x-shockwave-flash"></embed></object></div>
<br /></div>
<h2>
Keynotes</h2>
<div class="MsoNoSpacing">
Otellini (CEO): Intel is going hot and heavy into
supporting the venerable Trusted Platform technology, a collection of
technology which might well work, but upon which nobody has yet bitten. This security
emphasis clearly fits with the purchase of MacAfee (everybody got a free
MacAfee package at registration, good for 3 systems). “2011 may be the year the
industry got serious about security!” I remain unconvinced.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Mooley Eden (General Manager, Mobile Platforms): OK. Right
now, I have to say that this is the one time in the course of these IDF posts
that I am going to bow to Intel’s having paid for me to attend IDF, bite my
tongue rather than succumbing to my usual practice of biting the hand that
feeds me, and limit my comments to:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Mooley Eden must be an acquired taste.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To learn more of my personal opinions on this subject, you are going to
have to buy me a craft beer (dark & hoppy) in a very noisy bar. Since I don’t
like noisy bars, and that’s an unusual combination, I consider this unlikely.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Technically… Ultrabooks, ultrabooks, “beyond thin and
light.” More security. They had a lame Ninja-garbed guy on stage, trying to
hack into trusted-platform-protected system, and of course failing. (Please see
<a href="http://xkcd.com/538/">this</a>.) There was also a picture of a castle
with a moat, and a (deliberately) crude animation of knights trying to cross the moat and falling
in. (I mention this only because it’s relevant to something below.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
People never use hibernate, because it takes too long to
wake up. The solution is… to have the system wake up regularly. And run sync
operations. Eh what? Is this supposed to cause your wakeup to take less time
because the wakeup time is actually spent syncing? My own wakeup time is mostly wakeup. All I know is that
suspend/resume used to be really fast, reliable, and smart. Then it got transplanted to Windows from BIOS and has been unsatisfactory - slow and dumb - ever since.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
This was my first time seeing Windows 8. It looks like
Mango phone interface. Is making phones & PCs look alike supposed to help in
some way? (Like boost Windows Phone sales?) I’m quite a bit less than intrigued. It
means I’m going to have to buy another laptop before Win 8 becomes the
standard.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Justin Rattner (CTO): Some of his stuff I covered in my
first post on IDF. One I didn’t cover was the massive deal made of CERN and the
LHC (Large Hadron Collider) (“the largest machine human
beings have ever created”) (everybody please now go “ooooohhh”) using MICs. Look,
folks, the major high energy physics apps are embarrassingly parallel: You get
a whole lot, like millions, billions, of particle collisions, gather each one’s
data, and do an astounding amount of floating-point computing on each completely independent
set of collision data. Separately. Hoping to find out that one is a Higgs boson or something. I saw people doing this in the late 1980s at
Fermilab on a homebrew parallel system. They even had a good software framework
for using it: Write your (serial) code for analyzing a collision your way, and
hand it to us; we run it many times in parallel, just handing out each event’s
data to an instance of your code. The only thing that would be interesting
about this would be if for some reason they actually <i>couldn’t</i> run HEP codes
very well indeed. But they can run them well. Which makes it a yawn for me. I’ve
no question that the LHC is ungodly impressive, of course. I just wish it were
in Texas and called something else.</div>
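<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For flavor, the framework pattern I’m describing fits in a dozen lines. This is my own sketch, with made-up names and a made-up Event type; the real Fermilab and CERN frameworks are vastly more elaborate, but the parallel structure is exactly this thin.</div>
<pre>
/* Sketch of the event-parallel framework pattern: the physicist writes a
 * serial analyze_event(); the framework hands each independent event to a
 * copy of it in parallel.  Names and the Event type are mine, purely
 * illustrative of the structure. */
typedef struct { long id; double data[16]; } Event;  /* stand-in for collision data */

/* User-supplied, strictly serial analysis of one event. */
static int analyze_event(const Event *e) {
    double s = 0.0;
    for (int i = 0; i &lt; 16; i++)
        s += e->data[i];
    return s > 42.0;                    /* "was this one interesting?" */
}

/* The framework: embarrassingly -- pardon, pleasantly -- parallel over
 * events; nothing is shared between iterations. */
long analyze_all(const Event *events, long n) {
    long hits = 0;
    #pragma omp parallel for reduction(+:hits)
    for (long i = 0; i &lt; n; i++)
        hits += analyze_event(events + i);
    return hits;
}
</pre>
<div class="MsoNoSpacing">
<br /></div>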
<h2>
Intel Fellows Panel</h2>
<div class="MsoNoSpacing">
Some interesting questions asked and answered, many
questions lame. Like: “Will high-end Xeons pass mainframes?” Silly question.
Depends on what “pass” means. In the sense in which most people may mean, they
already have, and it doesn’t matter. Here are some others:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Besides MIC, what else is needed for Exascale? A: We’re
having to go all the way down to device level. In particular, we’re looking at
subthreshold or near-threshold logic. We tried that before, but failed. Devices
turn out to be most efficient 20 mV above threshold. May have to run at 800 MHz.
[Implication: A <b><i>whole lot</i></b> of parallelism.] Funny how they talked about
near-threshold logic, and Justin Rattner just happened to have a demo of that
at the next day’s keynote.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Are you running out of rare metals? A: It’s a question
of cost. Yes, we always try to move off expensive materials. Rare earths
needed, but not much; we only use them in layers like five atoms thick.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Q: Is Moore’s Law going to end? A: This was answered by Kelin
J. Kuhn, Fellow & Director of the Technology and Manufacturing Group – i.e.,
she <i>really</i> knows silicon. She noted that, by observation, at every given
generation it always looks like Moore’s Law ends in two generations. But it
never has. Every time we see a major impediment to the physics – several examples
given, going back to the 1980s and the end of Dennard scaling – something seems
to come along to avoid the problem. The exception seems to be right now: unlike prior eras, when it looked as though the end was two generations away, there don't seem to be any clouds on this particular horizon at all. (While I personally know of no reason to dispute this, keep in mind that this is from Intel, whose whole existence seems tied to Moore's Law, and it's said by the woman who probably has the biggest responsibility to make it all come about.)</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
An aside concerning the question-taking woman with the microphone
on my side of the hall: I apparently reminded her of something she hates. She kept going elsewhere, even after standing right beside me for several minutes while I had my
hand raised. What I was going to ask was: This morning in the keynote we saw a
castle with a moat, and several knights dropping into the moat. The last two
days we also heard a lot about a knight which appears to take a ferry across
the moat of PCIe. Why are you strangling a TFLOP of computation with PCIe? Other accelerator vendors
don’t have a choice with their accelerators, but you guys own the whole
architecture. Surely something better could be done. Does this, perhaps, indicate
a lack of integration or commitment to the new architecture across the
organization?</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Maybe she was fitted with a wiseass detection system.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Anyway, I guess I won’t find out this year.</div>
Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com0tag:blogger.com,1999:blog-3155908228127841862.post-42637700415587000882011-09-29T16:45:00.000-06:002011-09-29T16:45:07.057-06:00A Conversation with Intel’s James Reinders at IDF 2011<br />
<div class="MsoBodyText">
At the recent Intel Developer Forum (IDF), I was given the
opportunity to interview James Reinders. James is a Director and Software
Evangelist in Intel’s Software and Services Group in Oregon, and the
conversation ranged far and wide, from programming languages, to frameworks, to
<a href="http://en.wikipedia.org/wiki/Transactional_memory">transactional
memory</a>, to the use of <a href="http://www.nvidia.com/object/cuda_home_new.html">CUDA</a>, to Matlab, to vectorizing
for execution on Intel’s <a href="http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html">MIC
(Many Integrated Core)</a> architecture.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmRyMT_o4yvWozYxdxc_SGJwQh9x5Up3dI-Zko6SHaUwm5fHtyQOiOds0cRhwGzBfvon_4NG7sI03u0-9NbqALQ8jd6sberN0mnPIuVmBwmEPlr18_uKuFnFF-TjxKU0vBc2DlMKIJlly1/s1600/james_reinders.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmRyMT_o4yvWozYxdxc_SGJwQh9x5Up3dI-Zko6SHaUwm5fHtyQOiOds0cRhwGzBfvon_4NG7sI03u0-9NbqALQ8jd6sberN0mnPIuVmBwmEPlr18_uKuFnFF-TjxKU0vBc2DlMKIJlly1/s1600/james_reinders.jpg" /></a></div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Intel-provided information about James:</div>
<div style="border-bottom: solid #4F81BD 1.0pt; border: none; margin-left: .65in; margin-right: .65in; mso-border-bottom-alt: solid #4F81BD .5pt; mso-border-bottom-themecolor: accent1; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;">
<div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">
James Reinders is an expert on parallel computing.
James is a senior engineer who joined Intel Corporation in 1989 and has
contributed to projects including the systolic array systems WARP and iWarp, and
the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and
architecture work for multiple Intel processors and parallel systems. James has
been a driver behind the development of Intel as a major provider of software
development products, and serves as their chief software evangelist. His most
recent book is “Intel Threading Building Blocks” from O'Reilly Media which has
been translated to Japanese, Chinese and Korean. James has published numerous
articles and contributed to several books, and one of his current projects is
co-authoring a new book on parallel programming to be released in 2012.</div>
</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
I recorded our conversation; what follows is a transcript.
Also, I used Twitter to crowd-source questions, and some of my comments refer
to picking questions out of the list that generated. (Thank you! To all who
responded.)</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
This is #2 in a series of three such transcripts. I’ll
have at least one additional post about IDF 2011, summarizing the things I
learned about MIC and the Intel “Knight’s” accelerator boards using them, since
some important things I learned came from outside the interviews.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Full disclosure: As I originally noted in a <a href="http://perilsofparallel.blogspot.com/2011/09/impressions-of-newbie-at-intel.html">prior
post</a>, Intel paid for me to attend IDF. Thanks, again. It was a great
experience, since I’d never before attended.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Occurrences of [] indicate words I added for clarification
or comment post-interview.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
[Discussing where I’m coming from, crowd-sourced question list, HPC & MIC
focus here.] So where would you like to start?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so
do your best. It has been for a long, long time, so hopefully I have a very
realistic view of what works and what doesn’t work. I think I surprise some
people with optimism about where we’re going, but I have some good reasons to
see there’s a lot of things we can do in the architecture and the software that
I think will surprise people to make that a lot more approachable than you
would expect. Amdahl’s law is still there, but some of the difficulties that we
have with the systems in terms of programming, the nondeterminism that gets
involved in the programming, which you know really destroys the paradigm of
thinking how to debug, those are solvable problems. That surprises people a
lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago,
computers are so much faster and it benefits the tools. Think about how much
more the tools can do. You know, your compiler still compiles in about the same
time it did 10 years ago, but now it’s doing a lot more, and now that multicore
has become very prevalent in our hardware architecture, there are some hooks
that we are going to get into the hardware that will solve some of the
debugging problems that debugging tools can’t do by themselves because we can
catch the attention of the architects and we understand enough that there’s
some give-and-take in areas that might surprise people, that they will suddenly
have a tool where people say “how’d you solve that problem?” and it’s over
there under the covers. So I’m excited about that.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[OK, so everybody forgive me for not jumping right away on
his fix for nondeterminism. What he meant by that was covered later.]</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> So,
you’re optimistic?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Optimistic that it’s not the end of the world.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> OK.
Let me tell you where I’m coming from on that. A while back, I spent an evening
doing a web survey of parallel programming languages, and made a spreadsheet of
101 parallel programming languages [see my much earlier post, <a href="http://perilsofparallel.blogspot.com/2008/09/101-parallel-languages-part-1.html">101
Parallel Programming Languages</a>].</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> [laughs]
You missed a few.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I’m
sure I did. It was just one night. But not one of those was being used. MPI and
OpenMP, that was it.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> And
Erlang has had some limited popularity, but is dying out. They’re a lot like AI
and some other things. They help solve some problems, and then if the idea is
really an advance, you’ll see something from that materialize in C or C++,
Java, or C#. Those languages teach us something that we then use where we
really want it.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
understand what you’re saying. It’s like MapReduce being a large-scale version
of the old LISP mapcar.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Which was around in the early 70s. A lot of people picked up on it, it’s not a
secret but it’s still, you know, on the edge.</div>
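<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[For anyone who hasn’t made the connection: LISP’s mapcar applies a function to every element of a list; MapReduce is the same map step scaled out over a cluster, followed by a reduction that folds the mapped results together. A toy single-machine sketch of that shape, with made-up data and in C++ rather than LISP, looks like this.]</div>
<pre>
// Toy illustration of the mapcar-to-MapReduce analogy: map a function
// over independent items, then reduce the mapped values to one answer.
// Real MapReduce spreads both steps over many machines; this is the same
// shape on one machine, with made-up numbers.
#include <algorithm>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> words_per_doc{12, 7, 30, 5};   // pretend inputs

    // "mapcar": apply a function to each element independently.
    std::vector<int> mapped(words_per_doc.size());
    std::transform(words_per_doc.begin(), words_per_doc.end(),
                   mapped.begin(), [](int n) { return 2 * n; });

    // "reduce": fold the mapped values into a single result.
    int total = std::accumulate(mapped.begin(), mapped.end(), 0);

    return total == 108 ? 0 : 1;   // 2 * (12 + 7 + 30 + 5) == 108
}
</pre>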
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
heard someone say recently that there was a programming crisis in the early
80s: How were you going to program all those PCs? It was solved not by
programming, but by having three or four frameworks, such as Excel or Word,
that some experts in a dark room wrote, everybody used, and it spread like
crazy. Is there anything now like that which we could hope for?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> You
see people talk about Python, you see Matlab. Python is powerful, but I think
it’s sort of trapped between general-purpose programming and the Matlab. It may
be a big enough area; it certainly has a lot of followers. Matlab is a good
example. We see a lot of people doing a lot in Matlab. And then they run up
against barriers. Excel has the same thing. You see Excel grow up and people
doing incredibly hairy things. We worked with Microsoft a few years ago, and they’ve
added parallelism to Excel, and it’s extremely important to some people. Some
people have spreadsheets out there that do unbelievable things. You change one
cell, and it would take a computer from just a couple of years ago and just
stall it for 30 minutes while it recomputes. [I know of people in the finance
industry who go out for coffee for a few hours if they accidentally hit F5.] Now
you can do that in parallel. I think people do gravitate towards those
frameworks, as you’re saying. So which ones will emerge? I think there’s hope.
I think Matlab is one; I don’t know that I’d put my money on that being the huge
one. But I do think there’s a lot of opportunity for that to hide this compute
power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did
that, they removed something that you would have programmed over and over
again, made it accessible without it looking like programming.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
People did spreadsheets without realizing they were programming, because it was
so obvious.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes, you’re absolutely right. I tend to think of it in terms of libraries,
because I’m a little bit more of an engineer. I do see development of important
libraries that use unbelievable amounts of compute power behind them and then
simply do something that anyone could understand. Obviously image processing is
one [area], but there are other manipulations that I think people will just
routinely be able to throw into an application, but what stands behind them is
an incredibly complex library that uses compute power to manipulate that data.
You see Apple use a lot of this in their user interface, just doing this
[swipes] or that to the screen, I mean the thing behind that uses parallelism
quite well.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But
this [swipes] [meaning the thing you do] is simple.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the
modern equivalent of using spreadsheets or Word. It’s the user interfaces, and
they are demanding a lot behind them. It’s unbelievable the compute power that
can use.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes,
it is. And I really wonder how many times you’re going to want to scan your
pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a
couple of days.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Right, but I think rather than that being an activity, it’s just something your
computer does for you. It disappears. Most of us don’t want to organize things,
we want it just done. And Google’s done that on the web. Instead of keeping a
million bookmarks to find something, you do a search.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
Right. I used to have this incredible tree of bookmarks, and could never find
anything there.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
remember when it was a major feature of Firefox that it provided searching of
your bookmarks.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
[Laughter] </div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> You
mentioned nondeterminism. Are there any things in the hardware that you’re
thinking of? IBM Blue Gene just said they have transactional memory, in
hardware, on a chip. I’m dubious.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long
time, we being the industry, Intel included. At first we hoped “Wow, get rid of
locks, we’ll all just use transactional memory, it’ll just work.” Well, the
shortest way I can say why it doesn’t work is that software people want
transactions to be arbitrarily large, and hardware needs it to be constrained,
so it can actually do what you’re asking it to do, like holding a buffer.
That’s a nonstarter. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
So now what’s happening? Rock [Sun’s Rock processor] was looking at this in Sun,
a hybrid technique, and unfortunately they didn’t bring that to market. Nobody
outside the team knows exactly what happened, but the project as a whole failed,
rather than saying transactional memory was the death. But they had a hard time
figuring out how you engineer that buffering. A lot of smart people are looking
at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a
single socket. It makes sense to me why a constraint like that would be
buildable. The hard part is then how do you wrap that into a programming model.
Blue Gene’s obviously a very high end machine, so those developers have more
familiarity with constraints and dealing with it. Making it general purpose is
a very hard problem, very attractive, but I think that at the end of the day, all
transactional memory will do is be another option, that may be less
error-prone, to use in frameworks or toolkits. I don’t see a big shift in
programming model where people say “Oh, I’m using transactional memory.” It’ll
be a piece of infrastructure that toolkits like Threading Building Blocks or
OpenMP or Cilk+ use. It’ll be important for us in that it gives better
guarantees.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
The things I more had in mind is you’re seeing a whole
class of tools. We’ve got a tool that can do deadlock and race detection
dynamically and find it; a very, very good tool. You see companies like
TotalView looking at what they would call replaying, or unwinding, going
backwards, with debuggers. The problem with debuggers if your program’s
nondeterministic is you run it to a breakpoint and say, whoa, I want to see
what happened back here, what we usually do is just pop out of the debugger and
run it with an earlier breakpoint, or re-run it. If the program is
nondeterministic, you don’t get what you want. So the question is, can the
debugger keep enough information to back up? Well, the thing that backing up
and debugging, deadlock detection, and race detection, all those things have in
common is that they tend to run two or three orders of magnitude slower when
you’re using those techniques. Well, that’s not compelling. But, the cool part
is, with the software, we’re showing how to detect those – just a thousand
times slower than real time. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
Now we have the cool engineering problem: Can you make it
faster? Is there something you could do in the software or the hardware and
make that faster? I think there is, and a lot of people do. I get really
excited when you solve a hard problem, can you replay a debug, yeah, it’s too
slow. We use it to solve really hard problems, with customers that are really
important, where you hunker down for a week or two using a tool that’s a
thousand times slower to find the bug, and you’re so happy you found it – I
can’t stand out in a booth and market and have a million developers use it.
That won’t happen unless we get it closer to real time. I think that will
happen. We’re looking at ways to do that. It’s a cooperative thing between
hardware and software, and it’s not just an Intel thing; obviously the Blue
Gene team worries about these things, Sun’s team was worried about them. There’s
actually a lot of discussion between those small teams. There aren’t that many
people who understand what transactional memory is or how to implement it in
hardware, and the people who do talk to each other across companies. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[In retrospect, while transcribing this, I find the sudden
transition back to TM to be mysterious. Possibly James was veering away from
unannounced technology, or possibly there’s some link between TM and 1000x
speedups of playback. If there is such a link, it’s not exactly instantly obvious
to me.]</div>
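<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[For readers who haven’t been bitten by it, the nondeterminism under discussion is the garden-variety kind in the little sketch below: a hypothetical two-thread counter with no locking. The interleaving changes every run, so the final value changes every run, which is exactly why re-running to an earlier breakpoint doesn’t reproduce the bug, and why race detection and record/replay tools are worth their enormous overhead.]</div>
<pre>
// A deliberately racy sketch: two threads increment a shared counter with
// no synchronization. The read-modify-write operations interleave
// differently on every run, so the final value varies from run to run;
// re-running under a debugger will not reproduce what you saw the first
// time. This is the kind of bug race detectors and record/replay
// debuggers exist to catch.
#include <iostream>
#include <thread>

int counter = 0;                     // shared, unsynchronized on purpose

void hammer() {
    for (int i = 0; i < 1000000; ++i)
        ++counter;                   // racy read-modify-write
}

int main() {
    std::thread a(hammer), b(hammer);
    a.join();
    b.join();
    // Expected 2000000; typically prints something smaller, and something
    // different each run. (std::atomic<int> or a mutex would fix it.)
    std::cout << counter << '\n';
    return 0;
}
</pre>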
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> At a
minimum, at conferences.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Yes, yes, and they’d like to see the software stack on top of them come
together, so they know what hardware to build to give whatever the software
model is what it needs. One of the things we learned about transactional memory
is that the software model is really hard. We have a transactional memory
compiler that does it all in software. It’s really good. We found that when
people used it, they treated transactional memory like locks and created new
problems. They didn’t write a transactional memory program from scratch to use
transactional memory, they took code they wrote for locks and tried to use
transactional memory instead of locks, and that creates problems.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> The
one place I’ve seen where rumors showed someone actually using it was the
special-purpose Java machine Azul. 500 plus processors per rack, multiple
racks, point-to-point connections with a very pretty diagram like a rosette.
They got into a suit war with Sun. And some of the things they showed were
obvious applications of transactional memory.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Hmm.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Let’s
talk about support for things like MIC. One question I had was that things like
CUDA, which let you just annotate your code, well, do more than that. But I
think CUDA was really a big part of the success of Nvidia.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> Oh,
absolutely. Because how else are you going to get anything to go down your
shader pipeline for a computation if you don’t give a model? And by lining up
with one model, no matter the pros or cons, or how easy or hard it was, it gave
a mechanism, actually a low-level mechanism, that turns out to be predictable
because the low-level mechanism isn’t trying to do anything too fancy for you,
it’s basically giving you full control. That’s a problem to get a lot of people
to program that way, but when a programmer does program that way, they get what
the hardware can give them. We don’t need a fancy compiler that gets it right
half the time on top of that, right? Now everybody in the world would like a
fancy compiler that always got it right, and when you can build that, then CUDA
and that sort of thing just poof! Gone. I’m not sure that’s a tractable problem
on a device that’s not more general than that type of pipeline. </div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
So, the challenge I see with CUDA, and OpenCL, and even
C++ AMP is that they’re going down the road of saying look, there are going to
be several classes of devices, and we need you the programmer to write a
different version of your program for each class of device. Like in OpenCL, you
can take a function and write a version for a CPU, for a GPU, a version for an
accelerator. So in this terminology, OpenCL is proposing CPU is like a Xeon,
GPU is like a Tesla, an accelerator something like MIC. We have a hard enough
problem getting one version of an optimized program written. I think that’s a
fatal flaw in this thing being widely adopted. I think we can bring those together.
</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
What you really are trying to say is that part of your
program is going to be restrictive enough that it can be vectorized, done in
parallel. I think there are alternatives to this that will catch on and
mitigate the need to write much code in OpenCL and in CUDA. The other flaw with
those techniques is that in a world where you have a GPU and a CPU, the GPU’s
got a job to do on the user interface, and so far we’ve not described what
happens when applications mysteriously try to send some to the GPU, some to the
CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK,
<i>mea culpa</i> for not interrupting and
mentioning Tianhe-1A.] AMD has proposed in their future architectures that
they’re going to produce a meta-language that OpenCL targets, and then the
hardware can target some to the GPU, and some to the CPU. So I understand the
problem, and I don’t know if that solution’s the right one, but it highlights
that the problem’s understood if you write too much OpenCL code. I’m personally
more of a believer that we find higher-level programming interfaces like Cilk
Plus, array notations, add array notations to C that explicitly tell you to
vectorize, and the compiler can figure out whether that’s SSE, is it AVX, is it
the 512-bit wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But
don’t pollute the programming language by telling the programmer to write three
versions of your code. The good news is, though, if you do use OpenCL or CUDA
to do that, you have extreme control of the hardware and will get the best
hardware results you can, and we learn from that. I just think the learnings
are going to drive us to more abstract programming models. That’s why I’m a big
believer in the Cilk plus stuff that we’re doing.</div>
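<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[For readers who haven’t seen it, here is roughly what the array-notation style he’s describing looks like. This is only a sketch, using the Cilk Plus array-section syntax that Intel’s compilers of that era (and, for a while, gcc with -fcilkplus) accepted; the function names are mine. The point is that one width-agnostic statement leaves the choice of SSE, AVX, or the 512-bit MIC vectors to the compiler, instead of the programmer writing a version per vector width.]</div>
<pre>
// Sketch of Cilk Plus array notation, an Intel C/C++ extension of the
// time. An array section x[start:length] expresses the whole data-parallel
// operation in one statement, with no vector width baked in; the compiler
// picks SSE, AVX, or 512-bit MIC instructions as the target allows.

// Plain scalar loop: the form the compiler must rediscover as vectorizable.
void saxpy_loop(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Array-notation form: explicitly data-parallel and width-agnostic.
// (Needs a Cilk Plus-capable compiler, e.g. icc or gcc -fcilkplus.)
void saxpy_sections(int n, float a, const float* x, float* y) {
    y[0:n] = a * x[0:n] + y[0:n];
}
</pre>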
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But
how many users of HPC systems are interested in squeezing that last drop of
performance out?</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> HPC
users are extremely interested in squeezing performance if they can keep a
single source code that they can run everywhere. I hear this all the time, you
know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it
on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever,
just don’t tell me it has to be rewritten in a strange language that’s only
supported on your machine. It’s pretty consistent. So the success of CUDA, to
be used on those machines, it’s limited in a way, but it’s been exciting. But
it’s been a strain on the people who have done that because
CUDA code’s not going to run on an Intel machine [Well, actually, the <a href="http://www.pgroup.com/">Portland Group</a> has a CUDA C/C++ compiler
targeting x86. I do not know how good the output code performance is.]. OpenCL
offers some opportunities to run everywhere, but then has problems of
abstraction. Nvidia will talk about 400X speedups, which aren’t real, well that
depends on your definition of “real”.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong>
Let’s not start on that.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> OK,
well, what we’re seeing constantly is that vectorization is a huge challenge.
You talk to people who have taken their cluster code and moved it to MIC
[Cluster? No shared memory?], very consistently they’ll tell us stories like,
oh, “We ported in three days.” The Intel
marketing people are like “That’s great! Three days!” I ask why the heck
did it take you three days? Everybody tells me the same thing: It ran right
away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few
days to vectorize because otherwise performance was terrible. They’re trying to
use the 512-bit-wide vectors, and their original code was written using SSE
[Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations].
They can’t automatically translate, you have to restructure the loop because
it’s 512 bits wide – that should be automated, and if we don’t get that
automated in the next decade we’ve made a huge mistake as an industry. So I’m
hopeful that we have solutions to that today, but I think a standardized
solution to that will have to come forward. </div>
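<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[To make the porting story concrete: the hypothetical kernel below is written twice. The first version hard-codes SSE’s four-floats-at-a-time width with intrinsics, which is exactly the kind of code that can’t be mechanically retargeted to a 16-float-wide MIC vector unit; the second is the restructured, width-neutral loop that a vectorizing compiler can map onto SSE, AVX, or the 512-bit units. The kernel itself is made up; the intrinsics are real SSE.]</div>
<pre>
// Two versions of the same toy kernel: scale an array in place.
#include <xmmintrin.h>   // SSE intrinsics

// Hand-vectorized with SSE intrinsics: the four-floats-per-iteration
// structure, the tail loop, and the intrinsic names all assume a 128-bit
// vector. Retargeting this to 512-bit MIC vectors means restructuring it.
void scale_sse(float* x, float a, int n) {
    __m128 va = _mm_set1_ps(a);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        _mm_storeu_ps(&x[i], _mm_mul_ps(vx, va));
    }
    for (; i < n; ++i)               // scalar tail
        x[i] *= a;
}

// Width-neutral version: a plain loop the compiler is free to vectorize
// for SSE, AVX, or the 512-bit MIC units, whichever the target provides.
void scale_portable(float* x, float a, int n) {
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}

int main() {
    float data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    scale_sse(data, 2.0f, 10);       // doubles every element
    scale_portable(data, 0.5f, 10);  // halves them back
    return data[9] == 10.0f ? 0 : 1;
}
</pre>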
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I
really wonder about that, because wildly changing the degree of parallelism, at
least at a vector level – if it’s not there in the code today, you’ve just got
to rewrite it.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Right, so we’ve got low-hanging fruit, we’ve got codes that have the
parallelism today, we need to give them a better way of specifying it. And then
yes, over time, those need to migrate to that [way of specifying parallelism in
programs]. But migrating the code where you have to restructure it a lot, and
then you do it all in SSE intrinsics, that’s very painful. If it feels more
readable, more intuitive, like array extensions to the language, I give it
better odds. But it’s still algorithmic transformation. They have to teach
people where to find their data parallelism; that’s where all the scaling is in
an application. If you don’t know how to expose it or write programs that
expose it, you won’t take advantage of this shift in the industry.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong> I’m
supposed to make sure you wander down at about 11:00.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes,
I’ve got to go to the required press briefing, so I guess we need to take off.
Thanks an awful lot.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Reinders:</span></strong>
Sure. If there are any other questions we need to follow up on, I’ll be happy
to talk to you. I hope I’ve knocked off a few of your questions.</div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;"><br /></span></strong></div>
<div class="MsoBodyText">
<strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> And
then some. Thanks.</div>
<div class="MsoBodyText">
<br /></div>
<div class="MsoBodyText">
[While walking down to the press briefing, I asked James
whether the synchronization features he had from the X86 architecture were
adequate for MIC. He said that they were OK for the 30 or so cores in Knight’s Ferry,
but when you got above 40, they would need to do something additional.
Interestingly, after the conference, there was an <a href="http://communities.intel.com/community/openportit/server/blog/2011/09/22/intel-mic-scores-1st-home-run-with-10-petaflop-stampede-supercomputer">Intel
press release</a> about the Intel/Dell “home run” win at TACC – using Knight’s
Corner, “an innovative design that includes more than 50 cores.” This dovetails
with what Joe Curley told me about Knight’s Corner not being the same as
Knight’s Ferry. Stay tuned for the next interview.]</div>
Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com7tag:blogger.com,1999:blog-3155908228127841862.post-27514569062531300322011-09-26T17:50:00.002-06:002011-09-26T18:56:43.753-06:00A Conversation with Intel’s John Hengeveld at IDF 2011<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5WP8ovvsWKuNCvU1uv_R6AI0gFQvob6gWobmzfgEhdCLEwcwBaQeD-jpo-sbprSY32V0cGGgabxF3biq7f-8zShye6DvSvL5Ox_qpkxnC95o6Pxg452XbbK0lWPT_Pl8uPVKcjxVwunQw/s1600/Hengeveld.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5WP8ovvsWKuNCvU1uv_R6AI0gFQvob6gWobmzfgEhdCLEwcwBaQeD-jpo-sbprSY32V0cGGgabxF3biq7f-8zShye6DvSvL5Ox_qpkxnC95o6Pxg452XbbK0lWPT_Pl8uPVKcjxVwunQw/s200/Hengeveld.jpg" width="133" /></a></div><div class="MsoNormal">At the recent Intel Developer Forum (IDF), I was given the opportunity to interview John Hengeveld. John is in the Datacenter and Connected Systems Group in Hillsboro.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Intel-provided information about John:</div><div style="border-bottom: solid #4F81BD 1.0pt; border: none; margin-left: .65in; margin-right: .65in; mso-border-bottom-alt: solid #4F81BD .5pt; mso-border-bottom-themecolor: accent1; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;"><div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">John is responsible for end user and OEM marketing for Intel’s Workstation and HPC businesses and leads an outstanding team of industry visionaries. John has been at Intel for 6 years and was previously the senior business strategist for Intel’s Digital Enterprise Group and the lead strategist for Intel’s Many Core development initiatives. John has 20 years of experience in general management, strategy and marketing leadership roles in high technology. <o:p></o:p></div><div class="MsoIntenseQuote" style="margin-bottom: 14.0pt; margin-left: 0in; margin-right: 0in; margin-top: 10.0pt;">John is dedicated to life-long learning, he has taught Corporate Strategy and Business Strategy and Policy; Technology Management; and Marketing Research and Strategy for Portland State University’s Master of Business Administration program. John is a graduate of the Massachusetts Institute of Technology and holds his MBA from the University of Oregon. </div></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I recorded our conversation. What follows is a transcript, rather than a summary, since our topics ranged fairly widely and in some cases information is conveyed by the style of the answer. Conditions weren’t optimal for recording; it was in a large open space with many other conversations going on and the “<a href="http://newsroom.intel.com/community/intel_newsroom/free_press/blog/2011/09/16/robotic-orchestra-hits-right-notes-for-industrial-control">Intel Robotic Orchestra</a>” playing in the background. Hopefully I got all the words right.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! To all who responded.)</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Full disclosure: As I noted in a <a href="http://perilsofparallel.blogspot.com/2011/09/impressions-of-newbie-at-intel.html">prior post</a>, Intel paid for me to attend IDF. Thanks, again.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Occurrences of [] indicate words I added for clarification. There aren’t many.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> What, overall, is HPC to Intel? Is it synonymous with MIC?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> No. Actually, HPC has a research effort, how to scale applications, how to deal with performance and power issues that are upcoming. That’s the labs portion of it. Then we have significant product activity around our mainstream Xeon products, how to support the software and infrastructure when those products are delivered in cluster form to supercomputing activities. In addition to those, products also get delivered into what we refer to as the volume HPC market, which is small and medium-sized clusters being used for product design, research activities, such as those in biomed, some in visualization. Then comes the MIC part. So, when we look at MIC, we try to manage and characterize the collection of workloads we create optimized performance for. About 20% of those, and we think these are representative of workloads in the industry, map to what MIC does really well. And the rest, most customers have…</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> What is the distinguishing characteristic?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> There are two distinguishing characteristics. One is what I would refer to as compute density – applications that have relatively small memory footprints but have a high number of compute operations per memory access, and that parallelize well. Then there’s a second set of applications, streaming applications, where size isn’t significant but memory bandwidth is the distinguishing factor. You see some portion of the workload space there.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Streaming is something I was specifically going to ask you about. It seems that with the accelerators being used today, there’s this bifurcation in HPC: Things that don’t need, or can’t use, memory streaming; and those that are limited by how fast you can move data to and from memory.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s right. I agree.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Is MIC designed for the streaming side?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> MIC will perform well for many streaming applications. Not all. There are some that require a memory access model MIC doesn’t map to particularly well. But a lot of the streaming applications will do very well on MIC in one of the generations. We have a collection of generations of MIC on the roadmap, but we’re not talking about anything beyond the next “Corner” generation [Knight’s Corner, 2012 product successor to the current limited-production Knight’s Ferry software development vehicle]. More beyond that, down the roadmap, you will see more and more effect for that class of application.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> So you expect that to be competitive in bandwidth and throughput with what comes out of Nvidia?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Very much so. We’re competing in this market space to be successful; and we understand that we need to be competitive on a performance density, performance per watt basis. The way I kind of think about it is that we have a roadmap with exceptional performance, but, in addition to that, we have a consistent programming model with the rest of the Xeon platforms. The things you do to create an optimized cluster will work in the MIC space pretty much straightforwardly. We’ve done a number of demonstrations of that here and at ISC. That’s the main difference. So we’ll see the performance; we’ll be ahead in the performance. But the real difference is the programming model.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But the application has to be amenable.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> The application has to be amenable. For many customers that do a wide range of applications – you know, if you are doing a few things, it’s likely possible that some of those few things will be these highly-parallel, many-core optimized kinds of things. But most customers are doing a range of things. The powerful general-purpose solution is still the mainstream Xeon architecture, which handles the widest range of workloads really robustly, and as we continue with our beat rate in the Xeon space, you know with Sandy Bridge coming out we moved significantly forward with floating-point performance, and you’ll see that again going forward. You see the charts going up and to the right 2X per release.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes, all marketing charts go up and to the right.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, all marketing charts go up and to the right, but the point is that there’s a continued investment to drive floating-point performance and effective parallelism and power efficiency in a way that will be useful to HPC customers and mainstream customers.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Is MIC going to be something that will continue over time? That you can write code for and expect it to continue to work in the future?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Absolutely. It’s a major investment on our part on a distinct architectural approach that we expect to continue on as far out as our roadmaps envision today.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Can you tell me anything about memory and connectivity? There was some indication at one point of memory being stacked on a MIC chip.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> A lot of research concepts are being explored for future products, and I can’t really talk about much of that kind of thing for things that are out in the roadmap. There’s a lot of work being done around innovative approaches about how to do the system work around this silicon.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> MIC vs. SCC – Single Chip Cluster.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> SCC! Got it! I thought you meant single chip computer.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> That would probably be SoC, System on a Chip. Is SCC part of your thinking on this?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> SCC was a research vehicle to try to explore extreme parallelism and some different instruction set architectures. It was a research vehicle. MIC is a series of products. It’s an architecture that underlies them. We always use “MIC” as an adjective: It’s a MIC architecture, MIC products, or something like that. It means Many Integrated Cores, Many Integrated Core architecture is an approach that underlies a collection of products, that are a product mix from Intel. As opposed to SCC, which is a research vehicle. It’s intended to get the academic community thinking about how to solve some of the major problems that remain in parallelism, using computer science to solve problems.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> One person noted that a big part of NVIDIA’s success in the space is CUDA…</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yep.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> …which people can use to get, without too much trouble, really optimized code running on their accelerators. I know there are a lot of other things that can be re-used from Intel architecture – Threaded Building Blocks, etc. – but will CUDA be supported?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s a question you have to ask NVIDIA. CUDA’s not my product. I have a collection of products that have an architectural approach.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> OpenCL is covered?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> OpenCL is part of our support roadmap, and we announced that previously. So, yes OpenCL.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Inside of a MIC, right now, it has dual counter-rotating rings. Are connections other than that being considered? I’m thinking of the SCC mesh and other stuff. Are they in your thinking at this point?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, so, further out in the roadmap. These are all part of the research concepts. That’s the reason we do SCC and things like that, to see if it makes sense to use that architecture in the longer term products. But that’s a long ways away. Right now we have a fairly reasonable architectural approach that takes us out a bit, and certainly into our first generation of products. We’re not discussing yet how we’re going to use these learnings in future MIC products. But you can imagine that’s part of the thinking.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> OK.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> So, here’s the key thing. There are problems in exascale that the industry doesn’t know how to solve yet, and we’re working with the industry very actively to try to figure out whether there are architectural breakthroughs, things like mesh architectures. Is that part of the solution to exascale conundrums? Are there workloads in exascale, sort of a wave processing model, that you might see in a mesh architecture, that might make sense. So working with research centers, working with the labs, in part, we’re trying to figure out how to crack some of these nuts. For us it’s about taking all the pieces people are thinking about and seeing what the whole is.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I’m glad to hear you express it that way, since the way it seemed to be portrayed at ISC was, from Intel, “Exascale, we’ve got that covered.”</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> So, at the very highest strategic level, we have it covered in that we are working closely with a collection of academic and industry partners to try and solve difficult problems. But exascale is a long way off yet. We’re committed to make it happen, committed to solve the problems. That’s the real meat of what Kirk declared at ISC. It’s not that we have the answer; it’s that we have a commitment to make it happen, and to make it happen in a relatively early time period, with a relatively sustainable product architectural approach. But there are many problems to solve in exascale; we can barely get our arms around it.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Do you agree with the DARPA targets for exascale, particularly low power, or would you relax those?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> The Intel commit, what we said in the declaration, was not inconsistent with the DARPA thing. It may be slightly relaxed. You can relax one of two things, you can relax time or you can relax DARPA targets. So I think you’re going to reach DARPA’s targets eventually – but when. So the target that Kirk raised is right in there, in the same ballpark. Exascale in 20MW is one set of rational numbers; I’ve heard 10 [MW], I’ve heard 40 [MW], somewhere between those, right? I think 40 [MW] is so easy it’s not worth thinking about. I don’t think it’s economically rational. </div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> As you move forward, what do you think are the primary barriers to performance? There are two different axes here, technical barriers, and market barriers.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> The technical barriers are cracking bandwidth and not violating the power budget; tracking how to manage the thread complexity of an exascale system – how many threads are you going to need? A whole lot. So how do you get your arms around that? There are business barriers: How do you get a return on investment through productizing things that apply in the exascale world? This is a John [?] quote, not an Intel quote, but I am far less interested in the first exascale system than I am in the 100<sup>th</sup>. I would like a proliferation of exascale applications and performance, and have it be accessible to a wide range of people and applications, some applications that don’t exist today. In any ecosystem-building task, you’ve got to create awareness of the need, and create economic momentum behind serving that need. Those problems are equally complex to solve [equal to the technical ones]. In my camp, I think that maybe in some ways the technical problems are more solvable, since you’re not training people in a new way of thinking and working and solving problems. It takes some time to do that.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes, in some ways the science is on a totally different time schedule.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, I agree. I agree entirely. A lot of what I’m talking about today is leaps forward in science as technical computing advances, but as the capability grows, the science will move to match it. How will that science be used? Interesting question. How will it be proliferated? Genome work is a great target for some of this stuff. You probably don’t need exascale for genome. You can make it faster, you can make it more cost-effective.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> From what I have heard from people working on this at CSU, they have a whole lot more problems with storage than with computing capability.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s exactly right.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> They throw data away because they have no place to put it.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s a fine example of the business problems you have to crack along with the compute problems that you have to crack. There’s a whole infrastructure around those applications that has to grow up.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Looking at other questions I had… You wouldn’t call MIC a transitional architecture, would you?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> No. Heavens no. It’s a design point for a set of workloads in HPC and other areas. We believe MIC fits more things than just HPC. We started with HPC. It’s a design point that has a persistence well beyond as far as we can see on the roadmap. It’s not a transitional product.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> I have a lot of detailed technical questions which probably aren’t appropriate, like whether each of the MIC cores has equal latency to main memory.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, that’s a fine example of a question I probably shouldn’t answer.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Returning to ultimate limits of computing, there are two that stand out, power and bandwidth, both to memory and between chips. Does either of those stand out to you as <b>the</b> sore thumb?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Wow. So, the guts of that question gets to workload characterization. One of my favorite topics is “It’s the workload, stupid.” People say “it’s the economy, stupid,” well in this space it’s the workload. There aren’t general statements you can make about all workloads in this market.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Yes, HPC is not one market.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Right, it’s not one market, it’s not one class of usages, it’s not one architecture of solutions, it’s one reason why MIC is required, it’s not invisible. One size doesn’t fit all. Xeon does a great job of solving a lot of it really well, but there are individual workloads that are valuable that we want to dive into with more capability in a more targeted way. There are workloads in the industry where the interconnect bandwidth between processors in a node and nodes in a cluster is the dominant factor in performance. There are other workloads where the bandwidth to memory is the dominant factor in performance. All have to be solved. All have to be moved forward at a reasonable pace. I think the ones that are going to map to exascale best are ones where the memory bandwidth required can be solved well by local memory, and the problems that can be addressed well are those that have rational scaling of interconnect requirement between nodes. You’re not going to see problems that have a massive explosion of communication; the bandwidth won’t exist to keep up with that. You can actually see something I call “well-fed FLOPS,” which is how many FLOPS can you rationally support given the rest of this architecture. That’s something you have to know for each workload. You have to study it for each domain of HPC usage before you get to the answer about which is more important.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> You probably have to go now. I did want to say that I noticed the brass rat. Mine is somewhere in the Gulf of Mexico.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s terrible. Class of ’80.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Class of ’67.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Wow.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Stayed around for graduate school, too.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> When’d you leave?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> In ’74.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> We just missed overlapping, then. Have you been back recently?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Not too recently. But there have been a lot of changes.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> That’s true, a lot of changes.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> But East Campus is still the same?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> You were in East Campus? Where’d you live?</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Munroe.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> I was in the black hall of fifth-floor Bemis.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> That doesn’t ring a bell with me.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Back in the early 70s, they painted the hall black, and put in red lights in 5<sup>th</sup>-floor Bemis.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Oh, OK. We covered all the lights with green gel.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> Yes, I heard of that. That’s something that they did even to my time period there.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Pfister:</span></strong> Anyway, thank you.</div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;"><br />
</span></strong></div><div class="MsoNormal"><strong><span style="font-family: Calibri, sans-serif;">Hengeveld:</span></strong> A pleasure. Nice talking to you, too.</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com0tag:blogger.com,1999:blog-3155908228127841862.post-67471192464628957092011-09-18T15:58:00.013-06:002011-09-25T12:40:03.081-06:00Impressions of a Newbie at Intel Developer Forum (IDF)<div class="MsoBodyText">Out of the blue (which in this case is a pun), I received an invitation from an Intel representative to attend the 2011 Intel Developer Forum (IDF), in San Francisco, at Intel’s expense. Yes, I accepted. Thank you, Intel in general; and thank you in particular to the very nice lady who invited me and shepherded me through the process.<br />
<br />
<span class="Apple-style-span" style="color: purple;">[There are some updates below, marked in this color.]</span></div><div class="MsoBodyText"><br />
</div><div style="border-bottom: solid #4F81BD 1.0pt; border: none; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;">I’d never attended an IDF before, so I thought I’d spend an initial post on my overall impressions, describing the things that stood out to this wide-eyed IDF newbie. It may be boring to long-time IDF attendees – and there are very long-timers; a friend of mine has been to every domestic IDF for the last 12 years. But what the heck, they impressed me.<br />
<br />
</div><div class="MsoBodyText">I do have some technical gee-whiz later in this post, but I’ll primarily go into more technical detail in subsequent posts. Those will including recountings of the three private interviews that were arranged for me with Intel HPC and MIC (Many-Integrated Core) executives (John Hengeveld, James Reinders, and Joe Curley), as well as other things I picked up along the way, primarily about MIC.</div><div class="MsoBodyText"><br />
Here are my summary impressions: (1) Big. Very Big. (2) Incredibly slick and polished. (3) A fine attempt at Borgilation.</div><div class="MsoBodyText"><br />
IDF is gigantic. It doesn’t surpass the mother of all trade shows, the Consumer Electronics show, but I wouldn’t be surprised to find that it is the largest single-company trade show. The Moscone Center West, filled by IDF on all three floors, is almost 300,000 sq. ft. Justin Rattner (Intel Fellow & CTO) said in his keynote that there were over 5,000 attendees, and that hauling in the gear and exhibits required 500 semis. I believe it.</div><div class="MsoBodyText"><br />
There was of course the usual massive collection of trade-show booths covering one huge exhibit area (see photo of the center aisle of the exhibit area, below). That alone filled 100,000 sq. ft of exhibit space, completely. <br />
<br />
</div><div class="MsoBodyText"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGVO-BJCB8ACOauDAqYRm7vJOkdwU70VQePxTSPLfGeQhZBgTxhRR6gw8U34QIYbQBqFcQexP8b3oIu1RSjBfq2jMWqzNnUfwkwBX1PPffPqM0-1XyOju74jgCuYHucpkWwMjAs9pfBe6o/s1600/0912011729.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGVO-BJCB8ACOauDAqYRm7vJOkdwU70VQePxTSPLfGeQhZBgTxhRR6gw8U34QIYbQBqFcQexP8b3oIu1RSjBfq2jMWqzNnUfwkwBX1PPffPqM0-1XyOju74jgCuYHucpkWwMjAs9pfBe6o/s320/0912011729.jpg" width="320" /></a></div><br />
<br />
In addition, all the large open areas each had their large well-manned pavilion dedicated to one thing or another: One had a bevy of ultrabooks (ultrabook = Intel’s push for a viable non-Apple MacBook Air) that you could play with. Another was an “Extreme Zone” with a battery of four high-end gaming systems (mostly playing what looked like a Wolfenstein-y game). Another was a multi-player racing game with several drivers’ seats with steering wheels, etc. Another demoed twenty or thirty or so different sizes and shapes of laptops (in addition to the displays in the exhibit area). Another was a contraption of pipes and random stuff spitting plastic balls onto pseudo-xylophones, cymbals, and so on, physically mimicking the <a href="http://www.youtube.com/watch?v=rANjsNG2n5o">famous YouTube video</a> of several years back, demonstrating industrial controllers run by Atom processors. It didn’t actually play the music, but the video’s a pure animation so it’s one up on that. <span class="Apple-style-span" style="color: purple;">[Intel has a <a href="http://newsroom.intel.com/community/intel_newsroom/free_press/blog/2011/09/16/robotic-orchestra-hits-right-notes-for-industrial-control">press release</a> on this which seems to indicate that it actually played the music. Didn't seem like it to me, but might be.]</span></div><div class="MsoBodyText"><br />
Everywhere could be found fanatic attention to detail and production values, extending down to even small details. </div><div class="MsoBodyText"><br />
The keynotes were marvels of production; I’ve been to many IBM affairs, and nothing I saw over the years compared with these in slick, polished execution. Movies were theatre-quality cinematic productions (despite typical marketing fluff plots with occasional cheesy humor), and every one cued up at exactly the right instant, no hiccups. Every on-stage demo went right on the money, and even when one crashed – a momentary screen showing a Windows driver crash – another was seamlessly switched in what seemed like less than 2 seconds; I strongly suspect a hot backup, since no way does Windows recover that fast.</div><div class="MsoBodyText"><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDYVrfZ5m3g0OThxOAssqID3Wy0KHmdejtFAM1uqnPvu4yZjVsRYhMSY-y-8mHr5vcRX2KmN-tirnblPijEX-8MBoeSuGSddg3CTwgrtJ15-zfeJixVg7uEMq69mnpVpDE41R2HuIDv42C/s1600/IMG_0977.JPG" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDYVrfZ5m3g0OThxOAssqID3Wy0KHmdejtFAM1uqnPvu4yZjVsRYhMSY-y-8mHr5vcRX2KmN-tirnblPijEX-8MBoeSuGSddg3CTwgrtJ15-zfeJixVg7uEMq69mnpVpDE41R2HuIDv42C/s200/IMG_0977.JPG" width="200" /></a>But smaller things had their share of attention, too. The technical sessions I attended all had fluent, personable speakers; meticulously designed slides; and perfect audio with nary a glitch in microphone use or (&deity. forbid) feedback. Even the backpacks handed out were high quality and custom-made. Simple customization is no big deal, but these came with Intel logos on the zipper pulls and a custom lining emblazoned with their chip-layout banner theme (see photos).<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhP2cahLSlcvfCTyRUw3lYyyRbUH09smyTt2yShVZspFalGkbBVgrb5ilEtyi-6Bh7THS0XuUKgAsYK9QXUtOPAmikBXGNhSbPZw6QwFz4mktDhlMnWDFGBzNKdfIakAi2_m6-A1vZ9-i_K/s1600/IMG_0978.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhP2cahLSlcvfCTyRUw3lYyyRbUH09smyTt2yShVZspFalGkbBVgrb5ilEtyi-6Bh7THS0XuUKgAsYK9QXUtOPAmikBXGNhSbPZw6QwFz4mktDhlMnWDFGBzNKdfIakAi2_m6-A1vZ9-i_K/s200/IMG_0978.JPG" width="200" /></a></div><br />
</div><div class="MsoBodyText">Speaking of that banner theme, it blared out at you over the entrance to each hall, on a photo at least 20 ft. high and 100 feet long (photo again), a huge illustration: You are a dull, chalky, dead, white – until Intel’s silicon brings you to vibrant, colored life. Not exactly subtle symbolism, but that’s marketing.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik-JStPQhx5iz7ojlUZoqHhOnbaMKENGkTzeTQKVPuvgLrF7NAFHPilQgGaxkCcbHnA2YHME406WogBtmgTsmUOVsi5FMYU0UvOBpzZfCrOA5UpAD-iWrCY-wPMM5QyR1PePyfI8MjztnJ/s1600/0913010832.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik-JStPQhx5iz7ojlUZoqHhOnbaMKENGkTzeTQKVPuvgLrF7NAFHPilQgGaxkCcbHnA2YHME406WogBtmgTsmUOVsi5FMYU0UvOBpzZfCrOA5UpAD-iWrCY-wPMM5QyR1PePyfI8MjztnJ/s640/0913010832.jpg" width="640" /></a></div><br />
</div><div class="MsoBodyText"><br />
And speaking of marketing, the unmistakable overall message was: We will dominate <b>everything</b>. Everything with a processor in it, that is. Servers, with volumes ever-increasing at huge rates? Check. High-end 10+ core major stompers? Check. Midrange? Check. Low end? Super checkety-check-check-check. Ultrabook (future) with 14-day standby. (Standby? Do we really care?) Even a cell phone, demoed, run by an Intel processor. It’s the little black rectangle at the center-right of this pic:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji4SZl7QEp9YgYDTuI6Xb1lj7BHmtJvLYIYeCQ0v9_u4XQ3a0T9tfy1f6LojzwiUQRkuV-tKunb8N3GOh8aUxWRtDqrljpe0v1tNwykh3kf9xg6mRFB6Bjll76ctHoUnEP91P78py_i2uF/s1600/0913010951.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji4SZl7QEp9YgYDTuI6Xb1lj7BHmtJvLYIYeCQ0v9_u4XQ3a0T9tfy1f6LojzwiUQRkuV-tKunb8N3GOh8aUxWRtDqrljpe0v1tNwykh3kf9xg6mRFB6Bjll76ctHoUnEP91P78py_i2uF/s320/0913010951.jpg" width="320" /></a></div><br />
</div><div class="MsoBodyText"><br />
(I couldn’t get a better picture, since after every keynote there was a “photo opportunity” that produced a paparazzi-dense melee/feeding frenzy on the stage. This is, I'm told, an IDF tradition. I’m not sufficiently a press-banger to elbow my way through that wall of bodies.)</div><div class="MsoBodyText"><br />
The low-power demo that impressed me, though, was of a two-watt processor in a system showing a squee-worthy kitty video (and something else, but who noticed?), powered by a small solar panel. This was a demo of the future potential of near-threshold voltage operation, also touted (not, I’m sure, by accident) (not at <i>all</i>) in the Intel Fellows’ panel the day before. They used an old Pentium to do it, undoubtedly for reasons I’m not enough of a circuit jockey to understand. There was even what appeared – horrors! – to be an on-stage <i>ad lib</i> (!!) about “dumpster diving” for it. (Hey, eBay! Did they just call you a dumpster? The perils of <i>ad libbing</i>.) Some blatant futurism followed this, talking about 100 GF in that same 2W envelope; no hint when, fantastic if it ever happens.</div><div class="MsoBodyText"><br />
There are chinks in the armor, though. You have to look seriously to find them, or have some comparisons on your side.</div><div class="MsoBodyText"><br />
A friend happened to note to me, for example, that this IDF was three keynotes short of the usual full house of six. There was Otellini’s (CEO) general keynote, and Mooley Eden’s laptop ultrabook keynote, and Justin Rattner’s “futures” presentation in which he laughs too much for my taste. Those are regulars at every IDF. However, there was no keynote specifically devoted to Servers; understandable, I suppose, because they’re between big releases and have nothing major to announce (but they said a whole lot about the next-gen Ivy Bridge and the future server market in a media-only briefing). There was also no keynote for Digital Home; they are wrapped up with Sony <span class="Apple-style-span" style="color: purple;">[and other partners]</span> on that one, and likely it hasn’t any splashes to make at this time (or else everybody’s figured out that connecting your TV to the Internet isn’t yet a world-shaking idea). And… dang, there was a third one historically, but I’ve lost it. Sorry. <span class="Apple-style-span" style="color: purple;">[The third missing keynote was on software and services, traditionally performed by Renee James.]</span> Takeaway: Ambitions seem a bit shrunken, but it may just be circumstances. </div><div class="MsoBodyText"><br />
A big deal was made in a media briefing about how they were going to improve Intel's Atom SoCs (Systems-On-a-Chip) at <i>double Moore’s Law</i>. (I think you’re supposed to gasp now.) That sounds sexy, but I interpret it as meaning they figured out that Atom really needs to be done in their latest and greatest silicon technology, as opposed to lagging a couple of generations (nodes) back the way it now does, particularly now that their highest-end technologies are focused on low power.<br />
<br />
So they’re going to catch up. Everybody, including Atom, will be using the same 14nm technology in 2014. (That’s an estimated, forward-looking 2014; see their prospectus for caveats, etc.) Until then, well, there are iterations. I take “double Moore’s Law” to mean that they can’t steer the massive ship of microprocessor development fast enough to catch up in a single release; and/or (likely) their existing Atom customer base can’t wait without any new Atom products for as long as a single leap would take.<br />
<br />
Will this put a dent in ARM's dominance of the low-power arena? Or MIPS's share? Maybe, in time.</div><div class="MsoBodyText"><br />
Then there was that graph, also in a media briefing, of future server shipments. (Wish I had a pic; can’t find the pdf on the Intel web site.) They extended it to show some trebling or quadrupling of server shipments in the next few years, but…<br />
<br />
Maybe they have some data I don’t have. To me, the actual past data on the graph seemed to say that the curve of shipment volumes recently started flattening out. Extrapolating based on the slope that existed a couple of quarters or years in the past doesn’t seem justified by what I saw purely based on that graph.</div><div class="MsoBodyText"><br />
Hey, did I mention that I wuz a medium? I got in with media credentials, which was another personal first. (Thanks again!) Talk about being a newbie – I didn’t even know there was a special “media corridor” until half-way through the first day. Dang. I could have had a much better breakfast on the first day.</div><div class="MsoBodyText"><br />
Now I have this itch to buy a fedora so I can put a press pass into the hatband.</div><div class="MsoBodyText"><br />
More will come, but I’ve got a trip to Mesa Verde for the next few days, so it won’t be immediate. Sorry. The wait won’t be anywhere near as long as it has been between other recent posts.</div><div class="MsoBodyText"><br />
</div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-4565836467340590072011-08-15T18:39:00.001-06:002011-08-16T17:09:54.500-06:00IBM Dumps Blue Waters – Final Curtain on the Old Days<div style="border-bottom: solid #4F81BD 1.0pt; border: none; mso-border-bottom-themecolor: accent1; mso-element: para-border-div; padding: 0in 0in 4.0pt 0in;"><div class="MsoTitle">IBM has pulled out of the much-touted Blue Waters supercomputer project of IBM and National Center for Supercomputing Applications at the University of Illinois, an effort which was supposed to produce one petaflops of sustained performance by the end of 2012. Googling “IBM Blue Waters” and selecting “news” will give you a bevy of reports on this, (like <a href="http://www.theregister.co.uk/2011/08/08/ibm_kills_blue_waters_super/">this</a>, <a href="http://www.slashgear.com/ibm-ncsa-petascale-supercomputer-blue-waters-project-abandoned-08170381/">this</a>, <a href="http://www.hpcwire.com/hpcwire/2011-08-08/ibm_bails_on_blue_waters_supercomputer.html">this</a>, <a href="http://www.pcmag.com/article2/0,2817,2390697,00.asp">this</a>) so I’m going to refrain from reduplicating what everybody else has said.</div><div class="MsoTitle"><br />
</div><div class="MsoTitle">I don’t have any inside scoop on this, in the sense that I have no under-the-table secret contacts or communications channels back into IBM. However, I can make some connections between dots already out there, based on my experience leading one flashy HPC project (<a href="http://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV1005.html">RP3</a>) back in the 1980s (possibly the first IBM did), and being close to such projects after that. My conclusion: There has been a major change in IBM executive management’s attitude towards flashy HPC projects, a change that is probably the drop of the final shoe of the “good old days” of IT architecture research.</div></div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">I deduce the attitude change from HPCwire’s call to Herb Schultz, marketing manager for IBM's Deep Computing unit, in which he said a while ago that “There is really no appetite in IBM anymore -- with some of the leadership changes over the last few years – for revenue that has no profit with it”.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">So, IBM wants to make money on its high performance computing products. What’s wrong with that? Nothing. As every IBM manager is taught in their first management training – at least I was – the purpose of IBM isn’t to advance technology, or make the world a better place, or be a good corporate citizen; it’s to make money. (Those were the multiple choices in a quiz, by the way.) It’s perfectly obvious that any company that doesn’t make money, and thereby stay in business, can’t do anything. It’s like the first and most important rule of breathing I was taught in Tai Chi, which was: Breathe. If you don’t do that, you won’t be around long.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">But as everyone should also know, there’s a focus on making money now, directly, measurably; and there’s setting up to make more money in the future. The first is needed; but if done exclusively, without the second, your corporate lifetime is also being limited – rather like living on a tasty but unhealthy diet.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">I recall distinctly the response of <a href="http://en.wikipedia.org/wiki/Ralph_E._Gomory">Ralph Gomory</a>, then IBM Senior VP of Science and Technology, to a cadre of high-level development managers who were complaining about the cost of some HPC project, proposing to kill it. He told them “This will make you money in ways you can’t conceive of” (approximate quote). He was right. What they return isn’t money, directly; it’s column-inches on the front page of the New York Times and similar media.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">This works. I’ve recounted in a <a href="http://perilsofparallel.blogspot.com/2009/02/larrabee-in-ps4-so-whats-future-of-ibms.html">much earlier post</a> a case I was involved in where an IBM account rep absolutely <b>owned</b> the entire IT account of a large, conservative retailer in the Midwest – because an IBM RISC system was given the credit for beating Kasparov. (Winning Jeopardy! hardly has the same cachet.)</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">Also, while it may be hard to fathom now, there was a time when computer architecture and hardware development research was simply pursued for its own sake, primarily because we might find something out by doing it, without knowing what that might be.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">This also works. My personal example of that is tree saturation<a href="file:///C:/Users/gpfister/Desktop/Perils%20Blue%20Waters.docx#_ftn1" name="_ftnref1" title=""><sup><sup><span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">[1]</span></sup></sup></a> (a.k.a. congestion spreading, but in non-lossy networks), which I and Alan Norton serendipitously discovered in the RP3 project. I distinctly recall involuntarily standing and my whole body stiffening when I looked at the graphs revealing it, and realized what was happening. It was my own personal “eureka!” kind of moment. We’d no clue we’d find that, and it was the occasion of my only recursive award – an award from IBM research for getting an award for the paper on it. Gomory (who, coincidentally, was Research Division president at the time) said that was exactly the kind of thing he had hoped to get from RP3.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">However, two things have changed since then: There’s a much stronger focus on showing results today (which the IBM stock price rise duly reflects). And the cost of entry has become quite a bit higher, particularly entries like Blue Waters.</div><div class="MsoBodyText">Back when Gomory said what I recounted above, IBM was riding high on steady income from mainframes and their software. Those still bring in substantial money, particularly via drag of software along with them (which the hardware guys aren’t allowed to count… grrr…). Now, though, the software business has moved on to the much more competitive arena of stand-alone software products that run on a variety of platforms. Of course, there is also now the whole service business that practically didn’t exist back then.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">In addition, the cost of entry has skyrocketed. Back when I was involved in RP3, we had a contract with DARPA that brought in a whole $1M or so, which paid something like half the real bill. Compare that with <a href="http://www.theregister.co.uk/2011/07/15/power_775_super_pricing/">El Reg</a>’s estimate that a single Blue Waters rack is an $8M proposition, with over 200 racks needed for the final configuration and you’re over $1B. Those are all rough numbers, and they’re retail, not cost (an impossible number to pin down from outside), but you can see where the table stakes have gotten beyond many of the highest high rollers stash.</div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">So I’m going to label this pull out from Blue Waters as the final ringing down of the last curtain on an era of free-wheeling profit-unconstrained research into computer architecture and systems. </div><div class="MsoBodyText"><br />
</div><div class="MsoBodyText">It was fun while it lasted, but now, no matter what you do, the issue is where and when the profit comes out. That’s normal now, but I think we need to remember that it was not always so.</div><div><br />
<hr align="left" size="1" width="33%" /><div id="ftn1"><div class="MsoFootnoteText"><a href="file:///C:/Users/gpfister/Desktop/Perils%20Blue%20Waters.docx#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference"><span class="MsoFootnoteReference"><span style="font-family: Calibri, sans-serif; font-size: 10pt; line-height: 115%;">[1]</span></span></span></a> I’d like to give a URL for that, but it was back in the early 80s pre-web. There are lots of papers still out there about avoiding or fixing it (many wrong) that you can find by Googling “tree saturation”, though. Finally figured out how to fix it in InfiniBand. Complicated. Possibly not worth the effort. <span class="Apple-style-span" style="color: purple;">Added: Since someone asked, here's bibliographical information on the paper: "Hot spot" contention and combining in multistage interconnection networks. GF Pfister, V Norton IEEE TRANS. COMP. 34:1010, 943-948, 1985</span></div></div></div>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com3tag:blogger.com,1999:blog-3155908228127841862.post-11873299207185491312011-05-17T16:40:00.013-06:002011-05-23T13:03:32.043-06:00Sandy Bridge Graphics Disappoints<span xmlns=""></span><br />
<span style="color: red;" xmlns=""><b>See update and the end of this post: New drivers released.</b></span><br />
<span xmlns="">Well, I'm bummed out.<br />
</span><br />
<span xmlns="">I was really looking forward to purchasing a new laptop that had one of Intel's new Sandy Bridge chips. That's the chip with integrated graphics which, while it wouldn't exactly rock, would at least be adequate for games at midrange settings. No more fussing around comparing added discrete graphics chips, fewer scorch marks on my lap, and other associated goodness would ensue.<br />
</span><br />
<span xmlns="">The pre-ship performance estimates and hands'-on trials said that would be possible, as I pointed out in <a href="http://perilsofparallel.blogspot.com/2010/09/intel-graphics-in-sandy-bridge-good.html">Intel Graphics in Sandy Bridge: Good Enough</a>. This would have had the side effect of pulling the rug out from under Nvidia's volumes for GPUs, causing the HPC market to have to pull its own weight, meaning have traditional HPC price tags (see <a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">Nvidia-based Cheap Supercomputing Coming to an End</a>). That would have been an earthquake, since most of the highest-end HPC systems now get their peak speeds from Nvidia CUDA accelerators, a situation not in small part due to their (relatively) low prices arising from high graphics volumes.<br />
</span><br />
<span xmlns="">Then TechSpot had to go and do a <a href="http://www.techspot.com/review/392-budget-gpu-comparison/page1.html">performance comparison of low-end graphics cards</a>, and later, just as a side addition, throw in measurements of Sandy Bridge graphics, too.<br />
</span><br />
<span xmlns="">Now, I'm sufficiently old-fashioned in my language that I really try to avoid even marginally obscene terms, even if they are in widespread everyday use, but in this case I have to make an exception: <br />
</span><br />
<span xmlns="">Damn, Sandy Bridge really sucks at graphics.<br />
</span><br />
<span xmlns="">It's the lowest of the low in every case. It's unusable for every game tested (and they tested quite a few), unless you're on some time-dilation drug that makes less than 15 frames per second seem zippy. Some frame rates – at medium settings – are in single digits.<br />
</span><br />
<span xmlns="">With Sandy Bridge, Intel has solidly maintained its historic lock on the worst graphics performance in the industry. This, by the way, is with the Intel i7 chips overclocked to 3.4GHz. That should also overclock the graphics (unless Intel is doing something I don't know about with the graphics clock).<br />
</span><br />
<span xmlns="">Ah, but possibly there is a "3D" fix for this coming soon? Ivy Bridge, the upcoming 22nm shrink of Sandy Bridge (the Intel "tock" following Sandy Bridge "tick"), has those wondrous new much-promoted transistors. Heh. Intel says Ivy Bridge will have – drum roll – <a href="http://vr-zone.com/articles/ivy-bridge-to-have-20-percent-performance-advantage-over-sandy-bridge/11061.html">30% faster graphics than Sandy Bridge</a>. <br />
</span><br />
<span xmlns="">See prior marginal obscenity. <br />
</span><br />
<span xmlns="">Intel does tend to sandbag future performance estimates, but not by enough to lift 30% up to 200-300%; that's what would be needed to produce what people were saying Sandy Bridge would do. Is that all we get from those "3D" transistors? The way the Intel media guys are going on about 3D, I expected Tri-Gate (which can be two- or five- or whatever-gate) to give me an Avatar-like mind meld or something.<br />
</span><br />
<span xmlns="">All that stuff about on-chip integrated graphics taking over the low-end high-volume market for discrete graphics just isn't going to happen this year with Sandy Bridge, or later with Ivy Bridge. As a further grain of salt in my wound, Nvidia is even <a href="http://www.theregister.co.uk/2011/05/13/nvidia_q1_f2012_numbers/">seeing a nice revenue uptick</a> from selling discrete graphics add-ons to new Sandy Bridge systems. It's not that I have anything against Nvidia. I just didn't think that uptick, of all things, was going to happen.<br />
</span><br />
<span xmlns="">This doesn't change my opinion that GPUs integrated on-chip won't ultimately take over the low-end graphics market. As the real Moore's Law – the law about transistor densities, not clock rates – continues to march on, it's inevitable that on-chip integrated graphics will be just fine for low- and medium-range games. It just won't happen soon with Intel products. <br />
</span><br />
<span xmlns="">Ah, but what about AMD? Their Fusion chips with integrated graphics, which they call APUs, are supposed to be rather good. Performance information <a href="http://cpuforever.com/showthread.php?tid=1265&pid=2253">leaked on message boards</a> about their upcoming A4-3400, A6-3650 and A8-3850 APUs make them sound as good as, well, um, as good as Sandy Bridge was supposed to be. Hm.<br />
</span><br />
<span xmlns="">Several years ago I heard a high-level AMD designer say that people looking for performance with Fusion were going to be disappointed; it was strictly a cost/performance product. That was several years ago, and things could have changed, but chip design lead times are still multi-year.<br />
</span><br />
<span xmlns="">In any event, this time I think I'll wait until shipped products are tested before declaring victory. <br />
</span><br />
<span xmlns="">Meanwhile, here I go again, flipping back and forth between laptop specs and GPU specs, as usual.<br />
</span><br />
<span xmlns=""><em>Sigh.</em><br />
</span><br />
<span xmlns=""></span><br />
<div class="MsoBodyText"><span xmlns=""><b><span style="color: red;">UPDATE May 23, 2011</span></b></span></div><span xmlns=""> </span><br />
<div class="MsoBodyText"><span xmlns="">Intel has just released new drivers for Sandy Bridge. The <a href="http://newsroom.intel.com/community/intel_newsroom/blog/2011/05/10/chip-shot-new-hd-graphics-driver-keeps-the-fun-going">press release</a> says they provide “up to 40% performance improvements on select games, support for the latest games like Valve’s Portal 2 and Stereoscopic 3D playback on DisplayPort monitors.”</span></div><div class="MsoBodyText"><span xmlns=""><br />
</span></div><div class="MsoBodyText"><span xmlns="">At this time I don't know of test results that would confirm whether this really makes a difference, but if it’s real, and applies broadly enough, it might be just barely enough to make the Ivy Bridge chip the beginning of the end for low-end discrete graphics.</span></div><br />
<span xmlns=""><em><br />
</em></span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com17tag:blogger.com,1999:blog-3155908228127841862.post-43812915820537583022011-02-22T20:25:00.000-07:002011-02-22T20:25:42.859-07:00I'm Also On SemiAccurate Now (and a bit about Apple)The good folks over at <a href="http://www.semiaccurate.com/">SemiAccurate</a> have invited me to contribute there, and I've accepted.<br />
<br />
I'm certainly not going to abandon this blog, although I'll admit there hasn't been much activity here recently. But posts on some of the types of topics I've covered here will appear there, instead; the theory is that this way they reach a wider audience. Posts with topics that are too far from S|A's area, as well as any longer than they're comfortable with, will still appear here.<br />
<br />
I've gone live there already. So, if you would like to better understand why Apple charges 30% of revenue to iPad app developers and subscription services, take a look over there - or, to be more exact, <a href="http://www.semiaccurate.com/2011/02/22/ipad-game-console/">right here</a>. The topic isn't my usual stomping ground, but that had nothing to do with S|A; it's something that just occurred to me while reading some things about Apple's shenanigans.<br />
<br />
I guess I can now add "web journalist" to my vita. I wonder when I get a press card to tuck jauntily into my hat band?Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com5tag:blogger.com,1999:blog-3155908228127841862.post-60348499275185225502011-01-11T15:00:00.000-07:002011-01-11T15:00:39.819-07:00Intel-Nvidia Agreement Does Not Portend a CUDABridge or Sandy CUDA<span xmlns=""></span><br />
<span xmlns="">Intel and Nvidia reached a legal agreement recently in which they cross-license patents, stop suing each other over chipset interfaces, and oh, yeah, Nvidia gets $1.5B from Intel in five easy payments of $300M each.<br />
</span><br />
<span xmlns="">This has been covered in many places, like <a href="http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/11/BUAS1H713D.DTL">here</a>, <a href="http://www.eweek.com/c/a/IT-Infrastructure/Intel-Nvidia-End-Litigation-Sign-New-Patent-Agreement-340551/">here</a>, and <a href="http://www.anandtech.com/show/4121/the-license-agreement-intel-to-pay-nvidia-15-billion">here</a>, but in particular <a href="http://arstechnica.com/business/news/2011/01/intelnvidia-bombshell-look-for-nvidia-gpu-on-intel-processor-die.ars?utm_source=rss&utm_medium=rss&utm_campaign=rss">Ars Technica</a> originally lead with a headline about a Sandy Bridge (Intel GPU integrated on-chip with CPUs; see <a href="http://perilsofparallel.blogspot.com/2010/09/intel-graphics-in-sandy-bridge-good.html">my post</a> if you like) using Nvidia GPUs as the graphics engine. Ars has since retracted that (see web page referenced above), replacing the original web page. (The URL still reads "bombshell-look-for-nvidia-gpu-on-intel-processor-die.")<br />
</span><br />
<span xmlns="">Since that's been retracted, maybe I shouldn't bother bringing it up, but let me be more specific about why this is wrong, based on my reading the <a href="http://download.intel.com/pressroom/legal/Intel_Nvidia_2011_Redacted.pdf">actual legal agreement</a> (redacted, meaning a confidential part was deleted). Note: I'm not a lawyer, although I've had to wade through lots of legalese over my career; so this is based on an "informed" layman's reading.<br />
</span><br />
<span xmlns="">Yes, they have cross-licensed each others' patents. So if Intel does something in its GPU that is covered by an Nvidia patent, no suits. Likewise, if Nvidia does something covered by Intel patents, no suits. This is the usual intention of cross-licensing deals: Each side has "freedom of action," meaning they don't have to worry about inadvertently (or not) stepping on someone else's intellectual property.<br />
</span><br />
<span xmlns="">It does mean that Intel could, in theory, build a whole dang Nvidia GPU and sell it. Such things have happened, historically, but usually without cross-licensing, and are uncommon (IBM mainframe clones, X86 clones), but as a practical matter, wholesale inclusion of one company's processor design into another company's products is a hard job. There is a lot to a large digital widget not covered by the patents – numbers of undocumented implementation-specific corner cases that can mess up full software compatibility, without which there's no point. Finding them all is massive undertaking. <br />
</span><br />
<span xmlns="">So switching to a CUDA GPU architecture would be a massive undertaking, and furthermore it's a job Intel apparently doesn't want to do. Intel has its own graphics designs, with years of the design / test / fabricate pipeline already in place; and between the ill-begotten Larrabee (now MICA) and its own specific GPUs and media processors Intel has demonstrated that they really want to do graphics in house.<br />
</span><br />
<span xmlns="">Remember, what this whole suit was originally all about was Nvidia's chipset business – building stuff that connects processors to memory and IO. Intel's interfaces to the chipset were patent protected, and Nvidia was complaining that Intel didn't let Nvidia get at the newer ones, even though they were allegedly covered by a legal agreement. It's still about that issue. <br />
</span><br />
<span xmlns="">This makes it surprising that, buried down in section 8.1, is this statement:<br />
</span><br />
<span xmlns="">"Notwithstanding anything else in this Agreement, NVIDIA Licensed Chipsets shall not include any Intel Chipsets that are capable of electrically interfacing directly (with or without buffering or pin, pad or bump reassignment) with an Intel Processor that has an integrated (whether on-die or in-package) main memory controller, such as, without limitation, the Intel Processor families that are code named 'Nehalem', 'Westmere' and 'Sandy Bridge.'"<br />
</span><br />
<span xmlns="">So all Nvidia gets is the old FSB (front side bus) interfaces. They can't directly connect into Intel's newer processors, since those interfaces are still patent protected, and those patents aren't covered. They have to use PCI, like any other IO device.<br />
</span><br />
<span xmlns="">So what did Nvidia really get? They get bupkis, that's what. Nada. Zilch. Access to an obsolete bus interface. Well, they get bupkis plus $1.5B, which is a pretty fair sweetener. Seems to me that it's probably compensation for the chipset business Nvidia lost when there was still a chipset business to have, which there isn't now.<br />
</span><br />
<span xmlns="">And both sides can stop paying lawyers. On this issue, anyway.<br />
</span><br />
<h2><span xmlns="">Postscript<br />
</span></h2><span xmlns="">Sorry, this blog hasn't been very active recently, and a legal dispute over obsolete busses isn't a particularly wonderful re-start. At least it's short. Nvidia's <a href="http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/">Project Denver</a> – sticking a general-purpose ARM processor in with a GPU – might be an interesting topic, but I'm going to hold off on that until I can find out what the architecture really looks like. I'm getting a little tired of just writing about GPUs, though. I'm not going to stop that, but I am looking for other topics on which I can provide some value-add.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com3tag:blogger.com,1999:blog-3155908228127841862.post-8540685754554144182010-12-06T17:26:00.000-07:002010-12-06T17:26:33.161-07:00The Varieties of Virtualization<span xmlns=""></span><br />
<span xmlns="">There appear to be many people for whom the term <em>virtualization</em> exclusively means the implementation of virtual machines à la VMware's products, Microsoft's Hyper-V, and so on. That's certainly a very important and common case, enough so that I covered various ways to do it in a <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">separate series of posts</a>; but it's scarcely the only form of virtualization in use. <br />
</span><br />
<span xmlns="">There's a hint that this is so in the gaggle of other situations where the word <em>virtualization</em> is used, such as desktop virtualization, application virtualization, user virtualization (I like that one; I wonder what it's like to be a virtual user), and, of course, Java Virtual Machine (JVM). Talking about the latter as a true case of virtualization may cause some head-scratching; I think most people consign it to a different plane of existence than things like VMware.<br />
</span><br />
<span xmlns="">This turns out not to be the case. They're not only all in the same (boringly mundane) plane, they relate to one another hierarchically. I see five levels to that hierarchy right now, anyway; I wouldn't claim this is the last word.<br />
</span><br />
<span xmlns="">A key to understanding this is to adopt an appropriate definition of virtualization. Mine is that virtualization is the creation of isolated, idealized platforms on which computing services are provided. Anything providing that, whether it's hardware, software, or a mixture, is virtualization. The adjectives in front of "platform" could have qualifiers: Maybe it's not quite idealized in all cases, and isolation is never total. But lack of qualification is the intent.<br />
</span><br />
<span xmlns="">Most types of virtualization allow hosting of several platforms on one physical or software resource, but that's not part of my definition because it's not universal; it could be just one, or a single platform could be created spanning multiple physical resources. It's also necessary to not always dwell all that heavily on boundaries between hardware and software. But that's starting to get ahead of the discussion. Let's go through the levels, starting at the bottom.<br />
</span><br />
<span xmlns="">I'll relate this to the cloud computing's IaaS/PaaS/SaaS levels later.<br />
</span><br />
<h2><span xmlns="">Level 1: Hardware Partitioning<br />
</span></h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMqzsI9p0fIisEJyoiH46g4flU4FSt6yZ6_mlSreJRhy-jQoRXhIZEaySMqvOVIf3h2Ysw-Q3SNNIoP5TOVJaIBDSmLEToW2FuWT1ubekBxUuDoHX1pS7qeAlFkzagSXoph8R9Q4osmavM/s1600/3509-milk-chocolate-breakup_500x500.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMqzsI9p0fIisEJyoiH46g4flU4FSt6yZ6_mlSreJRhy-jQoRXhIZEaySMqvOVIf3h2Ysw-Q3SNNIoP5TOVJaIBDSmLEToW2FuWT1ubekBxUuDoHX1pS7qeAlFkzagSXoph8R9Q4osmavM/s200/3509-milk-chocolate-breakup_500x500.jpg" width="200" /></a></div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgykOOrXgVvsRT8qSdmI_-pzd2tHtC_K4i4oGY_Of_ntN_2dQf_Ag-XKQ9cinlZ0pDg6Js2b4x7NMuIMnVhtgJSgEMaFxuqQ2oIY7hbzTaLBIdYM0mqGsmDvCeMOeNPzdeiYchCVCA6gtoC/s1600/Hardware+Partitioning.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><br />
</a><span xmlns="">Some hardware is designed like a brick of chocolate that can be broken apart along various predefined fault lines, each piece a fully functional computer. Sun Microsystems (Oracle, now) famously did this with its .com workhorse, the <a href="http://en.wikipedia.org/wiki/Sun_Enterprise_10000">Enterprise 10000</a> (UE10000). That system had multiple boards plugged into a memory-bus backplane, each board with processor(s), memory, and IO. Firmware let you set registers allowing or disallowing inter-board memory traffic, cache coherence and IO traffic, allowing you to create partitions of the whole machine built with any number of whole boards. The register setting, etc., is set up so that no code running on any of the processors can alter it or, usually, even tell it's there; a privileged console accesses them, under command of an operator, and that's it. HP, IBM and others have provided similar capabilities in large systems, often with the processors, memory, and IO in separate units, numbers of each assigned to different partitions.</span><br />
<span xmlns=""><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgykOOrXgVvsRT8qSdmI_-pzd2tHtC_K4i4oGY_Of_ntN_2dQf_Ag-XKQ9cinlZ0pDg6Js2b4x7NMuIMnVhtgJSgEMaFxuqQ2oIY7hbzTaLBIdYM0mqGsmDvCeMOeNPzdeiYchCVCA6gtoC/s1600/Hardware+Partitioning.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgykOOrXgVvsRT8qSdmI_-pzd2tHtC_K4i4oGY_Of_ntN_2dQf_Ag-XKQ9cinlZ0pDg6Js2b4x7NMuIMnVhtgJSgEMaFxuqQ2oIY7hbzTaLBIdYM0mqGsmDvCeMOeNPzdeiYchCVCA6gtoC/s400/Hardware+Partitioning.png" style="cursor: move;" width="400" /></a> </span><br />
<span xmlns="">Hardware partitioning has the big advantage that even hardware failures (for the most part) simply cannot propagate among partitions. With appropriate electrical design, you can even power-cycle one partition without affecting others. Software failures are of course also totally isolated within partitions (as long as one isn't performing a service for another, but that issue is on another plane of abstraction).<br />
</span><br />
<span xmlns="">The big negative of hardware partitioning is that you usually cannot have very many of them. Even a single chip now contains multiple processors, so partitioning even by separate chips is far less granularity than is generally desirable. In fact, it's common to assign just a fraction of one CPU, and that can't be done without bending the notion of a hardware-isolated, power-cycle-able partition to the breaking point. In addition, there is always some hardware in common across the partition. For example, power supplies are usually shared, and whatever interconnects all the parts is shared; failure of that shared hardware cause all partitions to fail. (For more complete high availability, you need multiple completely separate physical computers, not under the same sprinkler head, preferably located on different tectonic plates, etc. depending on your personal level of paranoia.)<br />
</span><br />
<span xmlns="">Despite its negatives, hardware partitioning is fairly simple to implement, useful, and still used. It or something like it, I speculate, is effectively what will be used for initial <a href="http://perilsofparallel.blogspot.com/2010/11/nvidia-past-future-and-circular.html">"virtualization" of GPUs</a> when that starts appearing.<br />
</span><br />
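<span xmlns="">To make that concrete, here is a toy model of console-controlled partitioning – purely a sketch of the idea, with made-up names; it is not any vendor's actual firmware or register interface. The point is that the partition table lives below anything the partitions' own software can reach, and the "hardware" consults it on every inter-board access.<br />
</span><br />
<pre>
/* Toy model of board-level hardware partitioning.  A console-controlled
 * table records which partition each board belongs to; the interconnect
 * only allows traffic between boards in the same partition.  Illustrative
 * only -- not any real machine's firmware interface. */
#include <stdio.h>

#define NBOARDS 16

static int partition_of[NBOARDS];   /* writable only via the privileged console */

static void console_assign(int board, int partition)
{
    partition_of[board] = partition;
}

static int traffic_allowed(int from_board, int to_board)
{
    return partition_of[from_board] == partition_of[to_board];
}

int main(void)
{
    for (int b = 0; b < NBOARDS; b++)
        console_assign(b, b < 4 ? 0 : 1);   /* a 4-board and a 12-board partition */

    printf("board 1 -> board 3: %s\n", traffic_allowed(1, 3) ? "ok" : "blocked");
    printf("board 1 -> board 9: %s\n", traffic_allowed(1, 9) ? "ok" : "blocked");
    return 0;
}
</pre>
<br />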
<h2><span xmlns="">Level 2: Virtual Machines<br />
</span></h2><span xmlns="">This is the level of VMware and its kissin' cousins. All the hardware is shared <em>en masse</em>, and a special layer of software, a hypervisor, creates the illusion of multiple completely separate hardware platforms. Each runs its own copy of an operating system and any applications above that, and (ideally) none even knows that the others exist. I've <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">previously written</a> about how this trick can be performed without degrading performance to any significant degree, so won't go into it here.<br />
</span><br />
<span xmlns="">The good news here is that you can create as many virtual machines as you like, independent of the number of physical processors and other physical resources – at least until you run out of resources. The hypervisor usually contains a scheduler that time-slices among processors, so sub-processor allocation is available. With the right hardware, IO can also fractionally allocated (again, see my <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">prior posts</a>).<br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEaDBxUddx7is9Wk9vINdLMDn5eEdQQYoRq4CDFH3JxwzbgKNVQeh_S150eHya2lnD4WnoXcZFA14_ZYLvET7DMV3-zgCwH579ipIzhsis3FvBDiDOyj09Mw870PZyLf2ML2MMkjxGhvY/s1600/Virtual+Machines.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="182" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEaDBxUddx7is9Wk9vINdLMDn5eEdQQYoRq4CDFH3JxwzbgKNVQeh_S150eHya2lnD4WnoXcZFA14_ZYLvET7DMV3-zgCwH579ipIzhsis3FvBDiDOyj09Mw870PZyLf2ML2MMkjxGhvY/s320/Virtual+Machines.png" width="320" /></a></div><span xmlns=""><br />
</span><br />
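To make the sub-processor allocation point concrete, here is a minimal toy scheduler – my own sketch, not how VMware or any real hypervisor is built – that hands out time slices to virtual machines in proportion to their weights, so four VMs can happily share two physical processors.<br />
<pre>
# Toy proportional-share scheduler (a sketch, not a real hypervisor): each
# tick, the virtual machines furthest behind their weighted fair share get
# to run on the available physical CPUs.

import heapq

def simulate(shares, num_cpus, ticks):
    """shares: {vm_name: weight}; returns time slices actually granted per VM."""
    granted = {vm: 0 for vm in shares}
    for _ in range(ticks):
        backlog = [(granted[vm] / shares[vm], vm) for vm in shares]
        for _, vm in heapq.nsmallest(num_cpus, backlog):
            granted[vm] += 1            # one time slice on one physical CPU
    return granted

# Four equally weighted VMs sharing two physical CPUs for 1000 ticks:
print(simulate({"a": 1, "b": 1, "c": 1, "d": 1}, num_cpus=2, ticks=1000))
# roughly 500 slices each, i.e. each VM sees about half a physical processor
</pre>
<br />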
<span xmlns="">The bad news is that you generally get much less hardware fault isolation than with hardware partitioning; if the hardware croaks, well, it's one basket and those eggs are scrambled. Very sophisticated hypervisors can help with that when there is appropriate hardware support (mainframe customers do get something for their money). In addition, and this is certainly obvious after it's stated: If you put N virtual machines on one physical machine, you are now faced with all the management pain of managing all N copies of the operating system and its applications.<br />
</span><br />
<span xmlns="">This is the level often used in so-called desktop virtualization. In that paradigm, individuals don't own hardware, their own PC. Instead, they "own" a block of bits back on a server farm that happens to be the description of a virtual machine, and can request that their virtual machine be run from whatever terminal device happens to be handy. It might actually run back on the server, or might run on a local machine after downloading. Many users absolutely loathe this; they want to own and control their own hardware. Administrators like it, a lot, since it lets them own, and control, the hardware.<br />
</span><br />
<h2><span xmlns="">Level 3: Containers<br />
</span></h2><span xmlns="">This level was, as far as I know, originally developed by Sun Microsystems (Oracle), so I'll use their name for it: Containers. IBM (in AIX) and probably others also provide it, under different names. <br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjABelWEuWfTf0qraAvcgGtfCxG7aE8qB_GgNW1AivGpO5fvld_-DG1mBqLFoxGW_5Rs0tCkV7qHFYNaefWSUJMp520u8PC7PZYVky5f8t_j25SF1kX7oNpIq_p6xw63irgES-TXwZpYBA7/s1600/Containers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="109" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjABelWEuWfTf0qraAvcgGtfCxG7aE8qB_GgNW1AivGpO5fvld_-DG1mBqLFoxGW_5Rs0tCkV7qHFYNaefWSUJMp520u8PC7PZYVky5f8t_j25SF1kX7oNpIq_p6xw63irgES-TXwZpYBA7/s320/Containers.png" width="320" /></a></div><span xmlns="">With containers, you have one copy of the operating system code, but it provides environments, containers, which act like separate copies of the OS. In Unix/Linux terms, each container has its own file system root (including IO), process tree, shared segment naming space, and so on. So applications run as if they were running on their own copy of the operating system – but they are actually sharing one copy of the OS code, with common but separate OS data structures, etc.; this provides significant resource sharing that helps the efficiency of this level.<br />
</span><br />
<span xmlns="">This is quite useful if you have applications or middleware that were written under the assumption that they were going to run on their own separate server, and as a result, for example, all use the same name for a temporary file. Were they run on the same OS, they would clobber each other in the common /tmp directory; in separate containers, they each have their own /tmp. More such applications exist than one would like to believe; the most quoted case is the Apache web server, but my information on that may be out of date and it may have been changed by now. Or not, since I'm not sure what the motivation to change would be. <br />
</span><br />
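Here is a toy illustration of that /tmp point – purely my own sketch, not how Solaris Containers or anyone else's implementation actually works – in which each "container" gets its own file system root, so two applications that both insist on the same absolute path stay out of each other's way.<br />
<pre>
# Toy per-container file system roots (an illustration, not a real container
# implementation): both apps write "/tmp/scratch.dat", but each container
# re-roots that path under its own private directory, so nothing is clobbered.

import os, tempfile

class Container:
    def __init__(self, name, base):
        self.root = os.path.join(base, name)            # this container's "/"
        os.makedirs(os.path.join(self.root, "tmp"), exist_ok=True)

    def open(self, path, mode="r"):
        # Every absolute path the application uses is re-rooted in the container.
        return open(self.root + path, mode)

base = tempfile.mkdtemp()
app1, app2 = Container("app1", base), Container("app2", base)

with app1.open("/tmp/scratch.dat", "w") as f:
    f.write("app1's temporary data")
with app2.open("/tmp/scratch.dat", "w") as f:
    f.write("app2's temporary data")                    # does NOT clobber app1's file

print(open(os.path.join(base, "app1", "tmp", "scratch.dat")).read())   # app1's data, intact
</pre>
<br />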
<span xmlns="">I suspect container technology was originally developed in the Full Moon cluster single-system-image project, which needs similar capabilities. See my <a href="http://perilsofparallel.blogspot.com/2009/01/multi-multicore-single-system-image.html">much earlier post</a> about single-system-image if you want more information on such things.<br />
</span><br />
<span xmlns="">In addition, there's just one real operating system to manage in this case, so management headaches are somewhat lessened. You do have to manage all those containers, so it isn't an N:1 advantage, but I've heard customers say this is a significant management savings.<br />
</span><br />
<span xmlns="">A perhaps less obvious example of containerization is the multiuser BASIC systems that flooded the computer education system several decades back. There was one copy of the BASIC interpreter, run on a small minicomputer and used simultaneously by many students, each of whom had their own logon ID and wrote their own code. And each of whom could botch things up for everybody else with the wrong code that soaked up the CPU. (This happened regularly in the "computer lab" I supervised for a while.) I locate this in the container level rather than higher in the stack because the BASIC interpreter really was the OS: It ran on the bare metal, with no supervisor code below it.<br />
</span><br />
<span xmlns="">Of course, fault isolation at this level is even less than in the prior cases. Now if the OS crashes, all the containers go down. (Or if the wrong thing is done in BASIC…) In comparison, an OS crash in a virtual machine is isolated to that virtual machine.<br />
</span><br />
<h2><span xmlns="">Level 4: Software Virtual Machines<br />
</span></h2><span xmlns="">We've reached the JVM level. It's also the .NET level, the Lisp level, the now more usual BASIC level, and even the CICS (and so on): the level of more-or-less programming-language based independent computing environments. Obviously, multiple of these can be run as applications under a single operating system image, each providing a separate environment for the execution of applications. At least this can be done in theory, and in many cases in practice; some environments were implemented as if they owned the computer they run on.<br />
</span><br />
<span xmlns="">What you get out of this is, of course, a more standard programming environment that can be portable – run on multiple computer architectures – as well as extensions to a machine environment that provide services simplifying application development. Those extensions are usually the key reason this level is used. There's also a bit of fault tolerance, since if one of those dies of a fault in its support or application code, it need not always affect others, assuming a competent operating system implementation.<br />
</span><br />
<span xmlns="">Fault isolation at this level is mostly software only; if one JVM (say) crashes, or the code running on it crashes, it usually doesn't affect others. Sophisticated hardware / firmware / OS can inject the ability to keep many of the software VMs up if a failure occurred that only affected one of them. (Mainframe again.)<br />
</span><br />
<h2><span xmlns="">Level 5: Multitenant / Multiuser Environment<br />
</span></h2><span xmlns="">Many applications allow multiple users to log in, all to the same application, with their own profiles, data collections, etc. They are legion. Examples include web-based email, Facebook, Salesforce.com, Worlds of Warcraft, and so on. Each user sees his or her own data, and thinks he / she is doing things isolated from others except at those points where interaction is expected. They see their own virtual system – a very specific, particularized system running just one application, but a system apparently isolated from all others in any event.<br />
</span><br />
<span xmlns="">The advantages here? Well, people pay to use them (or put up with advertising to use them). Aside from that, there is potentially massive sharing of resources, and, concomitantly, care must be taken in the software and system architecture to avoid massive sharing of faults.<br />
</span><br />
<h2><span xmlns="">All Together Now<br />
</span></h2><span xmlns="">Yes. You can have all of these levels of virtualization active simultaneously in one system: A hardware partition running a hypervisor creating a virtual machine that hosts an operating system with containers that each run several programming environments executing multi-user applications.<br />
</span><br />
<span xmlns="">It's possible. There may be circumstances where it appears warranted. I don't think I'd want to manage it, myself. Imagining a performance tuning on a 5-layer virtualization cake makes me shudder. I once had a television system that had two volume controls in series: A cable set-top box had its volume control, feeding an audio system with its own. Just those two levels drove me nuts until I hit upon a setting of one of them that let the other, alone, span the range I wanted.<br />
</span><br />
<h2><span xmlns="">Virtualization and Cloud Computing<br />
</span></h2><span xmlns="">These levels relate to the usual IaaS/PaaS/SaaS (Infrastructure / Platform / Software as a Service) distinctions discussed in cloud computing circles, but are at a finer granularity than those.<br />
</span><br />
<span xmlns="">IaaS relates to the bottom two layers: hardware partitioning and virtual machines. Those two levels, particularly virtual machines, make it possible to serve up raw computing infrastructure (machines) in a way that can utilize the underlying hardware far more efficiently than handing customers whole computers that they aren't going to use 100% of the time. As I've pointed out <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">elsewhere</a>, it is not a logical necessity that a cloud use this or some other form of virtualization; but in many situations, it is an economic necessity. <br />
</span><br />
<span xmlns="">Software virtual machines are what PaaS serves up. There's a fairly close correspondence between the two concepts.<br />
</span><br />
<span xmlns="">SaaS is, of course, a Multiuser environment. It may, however, be delivered by using software virtual machines under it.<br />
</span><br />
<span xmlns="">Containers are a mix of IaaS and PaaS. It's doesn't provide pure hardware, but a plain OS is made available, and that can certainly be considered a software platform. It is, however, a fairly barren environment compared with what software virtual machines provide..<br />
</span><br />
<h2><span xmlns="">Conclusion<br />
</span></h2><span xmlns="">This post has been brought to you by my poor head, which aches every time I encounter yet another discussion over whether and how various forms of cloud computing do or do not use virtualization. Hopefully it may help clear up some of that confusion.<br />
</span><br />
<span xmlns="">Oh, yes, and the obvious conclusion: There's more than one kind of virtualization, out there, folks.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-88522907202722116072010-11-15T17:33:00.000-07:002010-11-15T17:33:24.044-07:00The Cloud Got GPUs<span xmlns=""></span><br />
<span xmlns="">Amazon just announced, on the first full day of SC10 (SuperComputing 2010), the availability of Amazon EC2 (cloud) machine instances with dual Nvidia Fermi GPUs. According to Amazon's <a href="http://aws.amazon.com/ec2/hpc-applications/">specification</a> of instance types, this "Cluster GPU Quadruple Extra Large" instance contains:<br />
</span><br />
<ul><li><span xmlns="">22 GB of memory<br />
</span></li>
<span xmlns="">
<li>33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)<br />
</li>
<li>2 x NVIDIA Tesla "Fermi" M2050 GPUs<br />
</li>
<li>1690 GB of instance storage<br />
</li>
<li>64-bit platform<br />
</li>
<li>I/O Performance: Very High (10 Gigabit Ethernet)<br />
</li>
</span></ul><span xmlns="">So it looks like the future virtualization features of CUDA really are for purposes of using GPUs in the cloud, as I mentioned in <a href="http://perilsofparallel.blogspot.com/2010/11/nvidia-past-future-and-circular.html">my prior post</a>.<br />
</span><br />
<span xmlns="">One of these XXXXL instances costs $2.10 per hour for Linux; Windows users need not apply. Or, if you reserve an instance for a year – for $5630 – you then pay just $0.74 per hour during that year. (Prices quoted from Amazon's <a href="http://aws.amazon.com/ec2/">price list</a> as of 11/15/2010; no doubt it will decrease over time.)<br />
</span><br />
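Just to put those prices in perspective, here's the break-even arithmetic, using only the numbers quoted above:<br />
<pre>
# When does the 1-year reserved instance beat plain on-demand pricing?
# (Prices as quoted above, 11/15/2010.)

on_demand    = 2.10          # $/hour, Cluster GPU instance, Linux, on demand
reserved_fee = 5630.0        # $ up front for a 1-year reservation
reserved_hr  = 0.74          # $/hour while reserved

break_even_hours = reserved_fee / (on_demand - reserved_hr)
hours_in_year = 365 * 24

print(f"Break-even: {break_even_hours:.0f} hours "
      f"({100 * break_even_hours / hours_in_year:.0f}% of the year)")
# roughly 4,140 hours - you need to run it nearly half the year before
# reserving is cheaper than just paying $2.10/hour as you go.
</pre>
<br />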
<span xmlns="">This became such hot news that GPU was a trending topic on Twitter for a while.<br />
</span><br />
<span xmlns="">For those of you who don't watch such things, many of the <a href="http://www.top500.org/">Top500</a> HPC sites – the 500 supercomputers worldwide that are the fastest at the Linpack benchmark – have nodes featuring Nvidia Fermi GPUs. This year that list notoriously includes, in the top slot, the system causing the heaviest breathing at present: The Tianhe-1A at the National Supercomputer Center in Tianjin, in China.<br />
</span><br />
<span xmlns="">I wonder how well this will do in the market. Cloud elasticity – the ability to add or remove nodes on demand – is usually a big cloud selling point for commercial use (expand for holiday rush, drop nodes after). How much it will really be used in HPC applications isn't clear to me, since those are usually batch mode, not continuously operating, growing and shrinking, like commercial web services. So it has to live on price alone. The price above doesn't feel all that inexpensive to me, but I'm not calibrated well in HPC costs these days, and don't know how much it compares with, for example, the cost of running the same calculation on Teragrid. Ad hoc, extemporaneous use of HPC is another possible use, but, while I'm sure it exists, I'm not sure how much exists.<br />
</span><br />
<span xmlns="">Then again, how about services running games, including the rendering? I wonder if, for example, the communications secret sauce used by <a href="http://perilsofparallel.blogspot.com/2010/07/onlive-follow-up-bandwidth-and-cost.html">OnLive</a> to stream rendered game video fast enough for first-person shooters can operate out of Amazon instances. Even if it doesn't, games that can tolerate a tad more latency may work. Possibly games targeting small screens, requiring less rendering effort, are another possibility. That could crater startup costs for companies offering games over the web.<br />
</span><br />
<span xmlns="">Time will tell. For accelerators, we certainly are living in interesting times.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com11tag:blogger.com,1999:blog-3155908228127841862.post-24574547911828475562010-11-11T19:09:00.003-07:002010-11-12T19:55:40.625-07:00Nvidia Past, Future, and Circular<span xmlns=""></span><br />
<span xmlns="">I'm getting tired about writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one, getting it all out of the way.<br />
</span><br />
<h2><span xmlns=""><span class="Apple-style-span" style="font-size: medium;">Past Fermi Product Mix</span><br />
</span></h2>For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from <a href="http://investorvillage.com/smbd.asp?mb=476&mn=191714&pt=msg&mid=9688438">Investor Village</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs2hLWC7oAR98COgMP9eS_NTebcM7CeeLYhbcORxVFHo177BP_bn79CyyuYnLIf8WY5IqOcTr9tu-XGhkg_2shi0SQ4jpP_2Jr7W4aR0E0mGbewlP-yP1YxDSZP3SXbqrM_H8qM-xIJbhd/s1600/Nvidia+Normalized+Share.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="409" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs2hLWC7oAR98COgMP9eS_NTebcM7CeeLYhbcORxVFHo177BP_bn79CyyuYnLIf8WY5IqOcTr9tu-XGhkg_2shi0SQ4jpP_2Jr7W4aR0E0mGbewlP-yP1YxDSZP3SXbqrM_H8qM-xIJbhd/s640/Nvidia+Normalized+Share.gif" width="640" /></a></div><br />
<span xmlns="">Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've <a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">pointed out</a>, this will be a real problem as Intel's and AMD's <a href="http://arstechnica.com/business/news/2010/11/with-fusion-amds-devils-are-in-the-details.ars?comments=1">on-die GPUs</a> assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already <a href="http://www.crn.com/news/components-peripherals/228200687/amd-shows-off-new-fusion-chips.htm;jsessionid=VxmcxwYh1WgyGA2ibwok3w**.ecappj02">started shipping</a> its Zacate integrated-GPU chip to manufacturers.<br />
</span><br />
<h2><span xmlns=""><span class="Apple-style-span" style="font-size: medium;">Future Fermis</span><br />
</span></h2><span xmlns="">Recently Fermi's chief executive Jen-Hsun Huang gave an <a href="http://www.zdnet.co.uk/news/application-development/2010/10/26/nvidia-looks-to-the-future-of-gpu-computing-40090520/">interview</a> on what they are looking at for future features in the Fermi architecture. Things he mentioned were: (a) More development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in that order:<br />
</span><br />
<span xmlns=""><strong>More CUDA:</strong> When asked why not OpenCL, he said because other people are working on OpenCL and they're the only ones doing CUDA. This answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was why they don't work to make OpenCL, a standard, work as well as their proprietary CUDA on their gear? Of course the answer is that OpenCL doesn't get them lock-in, which one doesn't say in an interview.<br />
</span><br />
<span xmlns=""><strong>Virtual memory and pre-emption:</strong> A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There probably is some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: Cloud-based systems nearly all use <a href="http://perilsofparallel.blogspot.com/2010/05/how-hardware-virtualization-works-part.html">virtual machines</a> (for very good reason; see the link), splitting one each system node into N virtual machines. Virtual memory and pre-emption allows the GPU to participate in that virtualization. The virtual memory part is, I would guess, more intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [<b>UPDATE:</b> Just after this was published, John Carmak (of Id Software ) wrote a <a href="http://media.armadilloaerospace.com/misc/gpuDataPaging.htm">piece</a> laying out the case for paging into GPUs. So that may be useful in games and generally.]</span><br />
<span xmlns=""><br />
</span><br />
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"> </span><span xmlns=""><strong>Direct InfiniBand attachment:</strong> At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory in other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see if everybody's done. (g) If not, done, shove resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think that the copying to and from main memory could be avoided since the GPUs are the ones doing all the computing: Just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice, however, it's not quite that straightforward; but it is at least worth looking at.<br />
</span><br />
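If it helps, here's a runnable toy rendition of that cycle – numpy standing in for the GPU, and plain Python lists standing in for both node memories and the interconnect; none of this is a real GPU or MPI API – which at least makes it clear where the host-memory copies sit that a direct GPU-to-interconnect path would try to eliminate.<br />
<pre>
# Toy 2-node version of the (a)-(g) cycle above; numpy is the "GPU" and a
# Python list of per-node arrays is the "interconnect".  Purely illustrative.

import numpy as np

def gpu_compute(chunk, left_halo, right_halo):
    # (b) the "GPU" step: average each cell with its neighbors
    padded = np.concatenate(([left_halo], chunk, [right_halo]))
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

nodes = [np.array([0.0, 0.0, 0.0, 0.0]), np.array([12.0, 12.0, 12.0, 12.0])]
for step in range(50):                                     # (f)/(g) outer loop
    new_nodes = []
    for i, host_chunk in enumerate(nodes):
        dev_chunk = host_chunk.copy()                      # (a) host -> "GPU"
        # (d) halo values arrive "over the interconnect" from neighboring nodes
        left  = nodes[i - 1][-1] if i != 0 else host_chunk[0]
        right = nodes[i + 1][0] if i != len(nodes) - 1 else host_chunk[-1]
        dev_result = gpu_compute(dev_chunk, left, right)   # (b) compute
        new_nodes.append(dev_result)                       # (c)+(e) back into host memory
    nodes = new_nodes

print(nodes)   # both chunks relax toward 6.0 as values diffuse across the nodes
</pre>
<br />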
<span xmlns="">So, that's what may be new in Nvidia CUDA / Fermi land. Each of those are at least marginally justifiable, some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of dueling Nvidia / AMD (ATI) announcements of about a year ago.<br />
</span><br />
<span xmlns="">That was the time of the <a href="http://insidehpc.com/2009/09/30/nvidia-next-generation-gpu-fermi-targets-hpc-supercomputing/?utm_source=bitly&utm_medium=twitter&utm_campaign=socialmedia">Fermi announcement</a>, which compared with prior Nvidia hardware doubled everything, yada yada, and added… ECC. And support for C++ and the like, and good speed double-precision floating-point.<br />
</span><br />
<span xmlns="">At that time, <a href="http://techreport.com/articles.x/17618/1">Tech Report</a> said that the AMD Radeon HD 5870 doubled everything, yada again, and added … a fancy new anisotropic filtering algorithm for smoothing out texture applications at all angles, and supersampling to better avoid antialiasing.<br />
</span><br />
<span xmlns="">Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?<br />
</span><br />
<h2><span xmlns=""><span class="Apple-style-span" style="font-size: medium;">The Wheel of Reincarnation</span><br />
</span></h2><span xmlns="">The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by <a href="http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland-design-of-display-processors.pdf">T. H. Meyers and Ivan Sutherland</a>. There are probably hundreds of renditions of it floating around the web; here's mine.<br />
</span><br />
<span xmlns="">Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.<br />
</span><br />
<span xmlns="">So, you beef up your IO device, like by adding the ability to go through a whole list of X, Y locations and putting dots up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains. So you tell it to repeat. But that's really limiting. It would be much more convenient if you could tell the device to go do another list all by itself, like by embedding the next list's address in block of X,Y data. This takes a bit of thought, since it means adding a code to everything, so the device can tell X,Y pairs from next-list addresses; but it's clearly worth it, so in it goes.<br />
</span><br />
<span xmlns="">Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.<br />
</span><br />
<span xmlns="">Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.<br />
</span><br />
<span xmlns="">At some stage it looks really useful to add conditionals, too, so…<br />
</span><br />
<span xmlns="">Somewhere along the line, to make this a 21<sup>st</sup> century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.<br />
</span><br />
<span xmlns="">Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work. <br />
</span><br />
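If you want to see just how little it takes, here's a toy rendition of that accreted "instruction set" – my own sketch, not the Myer / Sutherland design – with dots, jumps to another list, and subroutine calls for reused shapes. By this point it is, of course, a small computer.<br />
<pre>
# Toy display-list "processor" (illustrative only): DOT puts a point in the
# frame buffer, JUMP chains to another list, CALL/RET reuse a shape such as
# a text character, HALT ends the frame.

def run(program, start, framebuffer, limit=10_000):
    pc, call_stack = start, []
    for _ in range(limit):                  # bounded, in lieu of a refresh loop
        op = program[pc]
        if op[0] == "DOT":                  # put a dot at (x, y)
            framebuffer.add((op[1], op[2]))
            pc += 1
        elif op[0] == "JUMP":               # go do another list
            pc = op[1]
        elif op[0] == "CALL":               # draw a reused shape
            call_stack.append(pc + 1)
            pc = op[1]
        elif op[0] == "RET":
            pc = call_stack.pop()
        elif op[0] == "HALT":
            return
        else:
            raise ValueError(f"unknown op {op}")

# A two-dot "shape" subroutine at address 0; a main list at address 3 draws
# it, adds a dot of its own, then draws the shape again.
program = {
    0: ("DOT", 1, 1), 1: ("DOT", 2, 1), 2: ("RET",),
    3: ("CALL", 0), 4: ("DOT", 5, 5), 5: ("CALL", 0), 6: ("HALT",),
}
fb = set()
run(program, start=3, framebuffer=fb)
print(sorted(fb))    # [(1, 1), (2, 1), (5, 5)]
</pre>
<br />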
<span xmlns="">And it's spending all its time doing nothing but putting silly dots on a screen. <br />
</span><br />
<span xmlns="">How about freeing it up to do something more useful by adding a separate device to it to do that?<br />
</span><br />
<span xmlns="">This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.<br />
</span><br />
<span xmlns="">Every incremental stage in this process was very well-justified, and Meyers and Sutherland say they saw examples (in 1968!) of systems that were more than twice around the wheel: A graphics processor hanging on a graphics processor hanging on a graphics processor. These multi-cycles are often justified if there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: It's got a graphics processor on a processor that uses a server somewhere else.<br />
</span><br />
<span xmlns="">I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans and Sutherland Computing Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system on their second planned graphics system (LDS-2). It was, as was asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.<br />
</span><br />
<span xmlns="">Just like Nvidia is talking about attaching InfiniBand directly to its cards.<br />
</span><br />
<span xmlns="">Also, in the mid-80s in IBM Research, after the successful completion of an effort to build special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.<br />
</span><br />
<span xmlns="">Just like Nvidia is adding virtualization to its systems.<br />
</span><br />
<span xmlns="">Each incremental step is justified – that's always the case with the wheel – just as in the discussion above, I showed a justification for every general-purpose additions to Nvidia architecture are justifiable.<br />
</span><br />
<span xmlns="">The issue here is not that this is all necessarily bad. It just <strong><em>is</em></strong>. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not. <br />
</span><br />
<span xmlns="">With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase after ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have, usually with far more resources than you have, been playing the general-purpose game for decades.<br />
</span><br />
<span xmlns="">It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com12tag:blogger.com,1999:blog-3155908228127841862.post-88691594080389343392010-10-17T19:08:00.000-06:002010-10-17T19:08:38.664-06:00RIP, Benoit Mandelbrot, father of fractal geometryBenoit Mandelbrot, father of fractal geometry, has died.<br />
<br />
See my post about him, and my interaction with him, in my mostly non-technical blog, Random Gorp: <a href="http://randomgorp.blogspot.com/2010/10/rip-benoit-mandelbrot-father-of-fractal.html">RIP, Benoit Mandelbrot, father of fractal geometry</a>.Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com0tag:blogger.com,1999:blog-3155908228127841862.post-31650721601721312822010-09-04T12:23:00.001-06:002010-09-08T20:26:24.093-06:00Intel Graphics in Sandy Bridge: Good Enough<span xmlns=""></span><br />
<span xmlns="">As I and others expected, Intel is gradually rolling out how much better the graphics in its next generation will be. Anandtech got an early demo part of Sandy Bridge and <a href="http://www.anandtech.com/show/3871/the-sandy-bridge-preview-three-wins-in-a-row/7">checked out</a> the graphics, among other things. The results show that the "good enough" performance I argued for in my prior post (<a href="http://perilsofparallel.blogspot.com/2010/08/nvidia-based-cheap-supercomputing.html">Nvidia-based Cheap Supercomputing Coming to an End</a>) will be good enough to sink third party low-end graphics chip sets. So it's good enough to hurt Nvidia's business model, and make their HPC products fully carry their own development burden, raising prices notably.<br />
<br />
The net is that for this early chip, with early device drivers, at a low but usable resolution (1024x768) there's adequate performance on games like "Batman: Arkham Asylum," "Call of Duty MW2," and a bunch of others, significantly including "World of Warcraft." And <a href="http://hothardware.com/News/Intels-NextGeneration-GPU-Will-Play-Bluray-3D-/">it'll play Blu-ray 3D</a>, too.<br />
<br />
Anandtech's conclusion is "If this is the low end of what to expect, I'm not sure we'll need more than integrated graphics for non-gaming specific notebooks." I agree. I'd add desktops, too. Nvidia isn't standing still, of course; on the low end they are saying <a href="http://venturebeat.com/2010/09/02/nvidias-new-graphics-chips-will-give-you-laptops-with-long-battery-life-and-3d/">they'll do 3D, too, and will save power</a>. But integrated graphics are, effectively, free. They'll be there anyway. Everywhere. And as a result, everything will be tuned to work best on them among the PC platforms; that's where the volumes will be.<br />
<br />
Some comments I've received elsewhere on my prior post have been along the lines of "but Nvidia has such a good computing model and such good software support – Intel's rotten IGP can't match that." True. I agree. But.<br />
<br />
There's a long history of ugly architectures dominating clever, elegant architectures that are superior targets for coding and compiling. Where are the RISC-based CAD workstations of 15+ years ago? They turned into PCs with graphics cards. The DEC Alpha, MIPS, Sun SPARC, IBM POWER and others, all arguably far better exemplars of the computing art, have been trounced by X86, which nobody would call elegant. Oh, and the IBM zSeries, also high on the inelegant ISA scale, just keeps truckin' through the decades, <a href="http://gigaom.com/2010/09/01/holy-smokes-at-5-2-ghz-ibm-chip-is-super-fast/">most recently</a> at an astounding 5.2 GHz. <br />
<br />
So we're just repeating history here. Volume, silicon technology, and market will again trump elegance and computing model.</span><br />
<span xmlns=""><br />
</span><br />
<span xmlns=""><b><span class="Apple-style-span" style="color: red;">PostScript</span></b>: According to <a href="http://www.bloomberg.com/news/2010-09-08/intel-to-show-off-new-chip-with-graphics-tackle-amd-challenge.html">Bloomberg</a>, look for a demo at Intel Developer Forum next week.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com10tag:blogger.com,1999:blog-3155908228127841862.post-65193477543155498762010-08-11T22:05:00.000-06:002010-08-11T22:05:01.192-06:00Nvidia-based Cheap Supercomputing Coming to an EndNvidia's CUDA has been hailed as "<a href="http://www.drdobbs.com/high-performance-computing/207200659">Supercomputing for the Masses</a>," and with good reason. Amazing speedups on scientific / technical code have been reported, ranging from a mere 10X through hundreds. It's become a darling of academic computing and a <a href="http://www.militaryaerospace.com/index/display/article-display/3258028386/articles/military-aerospace-electronics/online-news-2/2010/8/darpa-looks_to_four.html">major player in DARPA's Exascale program</a>, but performance alone is not the reason; it's price. For that computing power, they're incredibly cheap. As Sharon Glotzer of UMich <a href="http://perilsofparallel.blogspot.com/2010/05/all-hail-gpu-tweetstream.html">noted</a>, "Today you can get 2GF for $500. That is <strong><em>ridiculous</em></strong>." It is indeed. And it's only possible because CUDA is subsidized by sinking the fixed costs of its development into the high volumes of Nvidia's mass market low-end GPUs.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd-7AMpS212MVhhIGW_3ZA5DEN_cge6leiktF6DHgBbdWdqpy5476uMXOpwkz67fNRKULCVlyrYCF6aQxmpRFz5_8nFjtIqU97XQ44R1Jleiy9K2Cb8xPy35kiLp1EuGKixVRD84tjsV2l/s1600/package-integrated+graphics.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd-7AMpS212MVhhIGW_3ZA5DEN_cge6leiktF6DHgBbdWdqpy5476uMXOpwkz67fNRKULCVlyrYCF6aQxmpRFz5_8nFjtIqU97XQ44R1Jleiy9K2Cb8xPy35kiLp1EuGKixVRD84tjsV2l/s200/package-integrated+graphics.png" width="190" /></a><span xmlns=""> <br />
Unfortunately, that subsidy won't last forever; its end is now visible. Here's why:<br />
<br />
Apparently ignored in the usual media fuss over Intel's next and greatest, Sandy Bridge, is the integration of Intel's graphics onto the same die as the processor chip.<br />
<br />
The current best integration is onto the same package, as illustrated by the current champion, Clarkdale (a.k.a. Westmere), shown in the photo on the right. As illustrated, the processor is in 32nm silicon technology, and the graphics, with memory controller, is in 45nm silicon technology. Yes, the graphics and memory controller is the larger chip.<br />
</span><br />
<span xmlns="">Intel has not been touting higher graphics performance from this tighter integration. In fact, Intel's press releasers for Clarkdale claimed that being on two die wouldn't reduce performance because they were in the same package. But unless someone has changed the laws of physics as I know them, that's simply false; at a minimum, eliminating off-chip drivers will reduce latency substantially. Also, being on the same die as the processor implies the same process, so graphics (and memory control) goes all the way from 45nm to 32nm, the same as the processor, in one jump; this certainly will also result in increased performance. For graphics, this is a very loud the Intel "Tock" in its "Tick-Tock" (architecture / silicon) alternation.<br />
<br />
So I'll semi-fearlessly predict some demos of midrange games out of Intel when Sandy Bridge is ready to hit the streets, which hasn't been announced in detail aside from being in 2011.<br />
<br />
Probably not coincidentally, mid-2011 is when AMD's Llano processor sees daylight. Also in 32nm silicon, it incorporates enough graphics-related processing to be an apparently decent DX11 GPU, although to my knowledge the architecture hasn't been disclosed in detail. <br />
<br />
Both of these are lower-end units, destined for laptops, and intent on keeping a tight power budget; so they're not going to run high-end games well or be a superior target for HPC. It seems that they will, however, provide at least adequate low-end, if not midrange, graphics.<br />
<br />
Result: All of Nvidia's low-end market disappears by the end of next year. <br />
<br />
As long as passable performance is provided, integration into the processor equates with "free," and you can't beat free. Actually, it equates with cheaper than free, since there's one less chip to socket onto the motherboard, eliminating socket space and wiring costs. The power supply will probably shrink slightly, too.<br />
<br />
This means the end of the low-end graphics subsidy of high-performance GPGPUs like Nvidia's CUDA. It will have to pay its own way, with two results: <br />
<br />
First, prices will rise. It will no longer have a huge advantage over purpose-built HPC gear. The market for that gear is certainly expanding. In a <a href="http://lecture2go.uni-hamburg.de/konferenzen/-/k/10940;jsessionid=27E645CA37F378B28913594781846FA8">long talk</a> at the 2010 ISC in Berlin, Intel's Kirk Skaugen (VP of Intel Architecture Group and GM, Data Center Group, USA) stated that HPC was now 25% of Intel's revenue – a number double the HPC market I last heard a few years ago. But larger doesn't mean it has anywhere near the volume of low-end graphics.<br />
<br />
DARPA has pumped more money in, with Nvidia leading a $25M chunk of DARPA's Exascale project. But that's not enough to stay alive. (Anybody remember Thinking Machines?)<br />
<br />
The second result will be that Nvidia becomes a much smaller company.<br />
<br />
But for users, it's the loss of that subsidy that will hurt the most. No more supercomputing for the masses, I'm afraid. Intel will have MIC (son of Larrabee); that will have a partial subsidy since it probably can re-use some X86 designs, but that's not the same as large low-end sales volumes.<br />
<br />
So enjoy your "supercomputing for the masses," while it lasts.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com10tag:blogger.com,1999:blog-3155908228127841862.post-80455006512079229612010-07-29T16:03:00.009-06:002010-07-31T15:47:45.492-06:00Standards Are About the Money<span xmlns=""></span><br />
<div class="separator" style="clear: both; text-align: center;"><a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn6SfogCr1is3VFR4mkTSygOeUy4JsRECKht1bh-XET71UjosVHdwpTM0ovB-Ud4wS-EYpQbtbIQgcLXW8twU63PUwwwOC0JtKKgpnAdULFc39445tihry23BuOeNeC8z2auG4iZ6_bnIn/s1600/weird+cloud.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><br />
</a></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a bitly="BITLY_PROCESSED" href="http://www.onlineweblibrary.com/blog/?p=588" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn6SfogCr1is3VFR4mkTSygOeUy4JsRECKht1bh-XET71UjosVHdwpTM0ovB-Ud4wS-EYpQbtbIQgcLXW8twU63PUwwwOC0JtKKgpnAdULFc39445tihry23BuOeNeC8z2auG4iZ6_bnIn/s320/weird+cloud.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Nonstandard Cloud</td></tr>
</tbody></table><div class="separator" style="clear: both; text-align: left;">Standards for cloud computing are a never-ending topic of cloud buzz ranging all over the map: <a bitly="BITLY_PROCESSED" href="http://www.informationweek.com/news/software/hosted/showArticle.jhtml?articleID=218500508">APIs</a> (programming interfaces), <a bitly="BITLY_PROCESSED" href="http://www.dmtf.org/about/cloud-incubator">system management</a>, <a bitly="BITLY_PROCESSED" href="http://www.hpcinthecloud.com/news/Major-US-Tech-Firms-Pressure-EU-for-Cloud-Standards-99546924.html?utm_source=twitterfeed&utm_medium=twitter">legal issues</a>, and so on.</div><span xmlns=""> <br />
With a few exceptions where the motivation is obvious (like some legal issues in the EU), most of these discussions miss a key point: Standards are implemented and used if and only if they make money for their implementers. <br />
<br />
Whether customers think they would like them is irrelevant – unless that liking is strong enough to clearly translate into increased sales, paying back the cost of defining and implementing appropriate standards. "Appropriate" always means "as close to my existing implementation as possible" to minimize implementation cost.<br />
<br />
That is my experience, anyway, having spent a number of years as a company representative to the InfiniBand Trade Association and the PCI-SIG, along with some interaction with the PowerPC standard and observation of DMTF and IETF standards processes.<br />
<br />
Right now there's an obvious tension, since cloud customers see clear benefits to having an industry-wide, stable implementation target that allows portability among cloud system vendors, a point well-detailed in the <a bitly="BITLY_PROCESSED" href="http://www.google.com/url?sa=t&source=web&cd=3&ved=0CCIQFjAC&url=http%3A%2F%2Fwww.eecs.berkeley.edu%2FPubs%2FTechRpts%2F2009%2FEECS-2009-28.pdf&ei=e95RTMXWDd7pnQeJoNDTAw&usg=AFQjCNFeMMBSnmai9JnaLW-5qXkVLtb3Dw&sig2=CizXsHGPK1GcDa3MJx1QeA">Berkeley report</a> on cloud computing.<br />
<br />
That's all very nice, but unless the cloud system vendors see where the money is coming from, standards aren't going to be implemented where they count. In particular, when there are major market leaders, like Amazon and Google right now, it has to be worth more to those leaders than the lock-in they get from proprietary interfaces. I've yet to see anything indicating that it will be, so I'm not very positive about cloud standards at the present time.<br />
<br />
But it could happen. The road to any given standard is very often devious, always political, regularly suffused with all kinds of nastiness, and of course ultimately driven throughout by good old capitalist greed. An example I'm rather familiar with is the way InfiniBand came to be, and semi-failed.<br />
<br />
The beginning was a presentation by Justin Rattner at the 1998 Intel Developer Forum, in which he declared Intel's desire for their micros to grow up to be mainframes (mmmm… really juicy profit margins!). He thought they had everything except for IO. Busses were bad. He actually showed a slide with a diagram that could have come right out of an IBM Parallel Sysplex white paper, complete with channels and channel directors (switches) connecting banks of storage with banks of computers. That was where we need to go, he said, at a commodity price point. </span><br />
<span xmlns=""><br />
</span><br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4nCCPLw47TSwlQfCw5yzqdszMnpIpqrHmvjH1OBglj5Z4vrxe26ztgUs-nbOvgd33hdcTgzGYJPnUpysSoY987COpsLRBN4ZBGxcp0tWtQTVB9_Ac1HUS0tsJqUvtOVEUdPiwgOcM9HUD/s1600/ngio+logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="71" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4nCCPLw47TSwlQfCw5yzqdszMnpIpqrHmvjH1OBglj5Z4vrxe26ztgUs-nbOvgd33hdcTgzGYJPnUpysSoY987COpsLRBN4ZBGxcp0tWtQTVB9_Ac1HUS0tsJqUvtOVEUdPiwgOcM9HUD/s200/ngio+logo.png" width="200" /></a><span xmlns=""><br />
<br />
Shortly thereafter, Intel founded the Next Generation IO Forum (NGIO), inviting</span> other companies to join in the creation of this new industry IO standard. That sounds fine, and rather a step better than IBM did when trying to foist Microchannel architecture on the world (a dismal failure), until you read the fine print in the membership agreement. There you find a few nasties. Intel had 51% of every vote. Oh, and if you had any intellectual property (IP) (patents) in the area, it now all belonged to Intel. Several companies did join, like Dell; they like to be "tightly integrated" with their suppliers.<br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80ct1yEB5XdPeSXFWjQKbWPbI-eeywySPCi8FM5WsuM5dH7d5QCaaDr5eGNp3IG9q5uLN4aOMsEn3AiNy6o80o-Ql7Nx9iuug7cN3j0lUAF69zbSDRt7y6d9qwuNz5lY-pcejgQJPpurj/s1600/FIO+logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80ct1yEB5XdPeSXFWjQKbWPbI-eeywySPCi8FM5WsuM5dH7d5QCaaDr5eGNp3IG9q5uLN4aOMsEn3AiNy6o80o-Ql7Nx9iuug7cN3j0lUAF69zbSDRt7y6d9qwuNz5lY-pcejgQJPpurj/s200/FIO+logo.png" width="115" /></a><br />
<span xmlns=""> <br />
A few folks with a tad of IP in the IO area, like IBM and Compaq (RIP), understandably declined to join. But they couldn't just let Intel go off and define something they would then have to license. So a collection of companies – initially Compaq, HP, and IBM – founded the rival Future IO Developer's Forum (FIO). Its membership agreement was much more palatable: One company, one vote; and if you had IP that was used, you had to promise to license it with terms that were "reasonable and nondiscriminatory," a phrase that apparently means something quite specific to IP lawyers.</span><br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGQmTIIXcpwoujgSSRn7KrVxKQ3jRy-s_RlabWFgNVU2kNoSlzogJvhpurzSeqNFtJI5g00bqcRL0cV8V_DzsP-JFjj71QP9FnK4Bd6kO_gazZlo6Lkq2En8mG7FmMrRN9cGheC750LSXE/s1600/ngio+fio+logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGQmTIIXcpwoujgSSRn7KrVxKQ3jRy-s_RlabWFgNVU2kNoSlzogJvhpurzSeqNFtJI5g00bqcRL0cV8V_DzsP-JFjj71QP9FnK4Bd6kO_gazZlo6Lkq2En8mG7FmMrRN9cGheC750LSXE/s200/ngio+fio+logo.png" width="170" /></a><span xmlns=""><br />
<br />
Over the next several months, there was a steady movement of companies out of NGIO and into FIO. When NGIO became only Intel and Dell (still tightly integrated), the two forums merged as the InfiniBand Trade Association (IBTA). They even had a logo for the merger itself! (See picture.) The name "InfiniBand" was dreamed up by a multi-company collection of marketing people, by the way; when a technical group member told them he thought it was a great name (a lie) they looked worried. The IBTA had, in a major victory for the FIO crowd, the same key terms and conditions as FIO. In addition, Robert's Rules of Order were to be used, and most issues were to be decided by a simple majority (of companies).</span><br />
<a bitly="BITLY_PROCESSED" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBwsPqhSzB3SP7awevwS9yI2E-XYACY6BaFvNOwUEvd1YNgha8h4PdxT5ri3oaTLsOUAtMDoVpVynv1o0gInOpQhPwcIARr-t90cjkwYRCwEElux-37tnUhnnLTHmw8lcOp2roqln73Z4W/s1600/IB+logo.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBwsPqhSzB3SP7awevwS9yI2E-XYACY6BaFvNOwUEvd1YNgha8h4PdxT5ri3oaTLsOUAtMDoVpVynv1o0gInOpQhPwcIARr-t90cjkwYRCwEElux-37tnUhnnLTHmw8lcOp2roqln73Z4W/s200/IB+logo.jpg" width="176" /></a><span xmlns=""><br />
<br />
Any more questions about where the politics comes in? Let's cover devious and nasty with a sub-story:<br />
<br />
While on one of the IBTA groups, during a contentious discussion I happened to be leading for one side, I mentioned I was going on vacation for the next two weeks. The first day I was on vacation a senior-level executive of a company on the other side in the dispute, an executive not at all directly involved in IBTA, sent an email to another senior-level executive in a completely different branch of IBM, a branch with which the other company did a very large amount of business. It complained that I "was not being cooperative" and I had said on the IBTA mailing lists that certain IBM products were bad in some way. The obvious intent was that it be forwarded to my management chain through layers of people who didn't understand (or care) what was really going on, just that I had made this key customer unhappy and had dissed IBM products. At the very least, it would have chewed up my time disentangling the mess left after it wandered around forwards for two weeks (I was on vacation, remember?); at worst, it could have resulted in orders to me to be more "cooperative," and otherwise undermined me within my own company. Fortunately, and to my wife's dismay, I had taken my laptop on vacation and watched email; and a staff guy in the different division intercepted that email, forwarded it directly to me, and asked what was going on. As a result, I could nip it all in the bud.<br />
<br />
It's sad and perhaps nearly unbelievable that precisely the same tactic – complain at a high level through an unrelated management chain – had been used by that same company against someone else who was being particularly effective against them.<br />
<br />
Another, shorter, story: A neighbor of mine who was also involved in a similar inter-company dispute told me that, while on a trip (and he took lots of trips; he was a regional sales manager) he happened to return to his hotel room after checking out and found people going through his trash, looking for anything incriminating.<br />
<br />
Standards can be nasty.<br />
<br />
Anyway, after a lot of the dust settled and IB had taken on a fairly firm shape, Intel dropped development of its IB product. Exactly why was never explicitly stated, but the consensus I heard was that compared with others' implementations in progress it was not competitive. Without the veto power of NGIO, Intel couldn't shape the standard to match what it was implementing. With Intel out, Microsoft followed suit, and the end result was InfiniBand as we see it today: A great interconnect for high-end systems that pervades HPC, but not the commodity-volume server part the founders hoped that it would be. I suspect there are folks at Intel who think they would have been more successful at achieving the original purpose if they had their veto, since then it would have matched their inexpensive parts. I tend to doubt that, since in the meantime PCI has turned into a hierarchical switched fabric (PCI Express), eliminating many of the original problems stemming from it being a bus.<br />
<br />
All this illustrates what standards are really about, from my perspective. Any relationship with pristine technical discussions or providing the "right thing" for customers is indirect, with all motivations leading through money – with side excursions through political, devious, and just plain nasty.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-52776843443315410332010-07-15T13:23:00.000-06:002010-07-15T13:23:18.111-06:00OnLive Follow-Up: Bandwidth and Cost<span xmlns=""></span><br />
<span xmlns="">As mentioned earlier in <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/07/onlive-works-first-use-impressions.html">OnLive Works! First Use Impressions</a>, I've tried <a bitly="BITLY_PROCESSED" href="http://www.onlive.com/">OnLive</a>, and it works quite well, with no noticeable lag and fine video quality. As I've discussed, this could affect GPU volumes, a lot, if it becomes a market force, since you can play high-end games with a low-end PC. However, additional testing has confirmed that users will run into bandwidth and data usage issues, and the cost is not what I'd like for continued use.<br />
<br />
To repeat some background, for completeness: OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. It lets you run the highest-end games on very inexpensive systems, avoiding the cost of a rip-roaring gamer system. I've noted previously that this could hurt the mass market for GPUs, since OnLive doesn't need much graphics on the client. But there were serious questions (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/04/twilight-of-gpu.html">Twilight of the GPU?</a>) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?<br />
<br />
As I said earlier, and can re-confirm: Video, check. I found no problems there; no artifacts, including in displayed text. Lag, hence gameplay, is perfectly adequate, at least for my level of skill. Those with sub-millisecond reflexes might feel otherwise; I can't tell. There's confirmation of the low lag from <a bitly="BITLY_PROCESSED" href="http://www.eurogamer.net/articles/digitalfoundry-onlive-lag-analysis">Eurogamer</a>, which measured it at "150ms - similar to playing … locally". <br />
<br />
</span><br />
<span xmlns=""><h2>Bandwidth<br />
</h2>Bandwidth, on the other hand, does not present a pretty picture.<br />
<br />
When I was playing or watching action, OnLive continuously ran at about 5.8% - 6.4% utilization of a 100 Mb/sec LAN card. (OnLive won't run on WiFi, only on a wired connection.) This rate is very consistent. Displayed image resolution didn't cause it to vary outside that range, whether it was full-screen on my 1600 x 900 laptop display, full-screen on my 1920 x 1080 monitor, or windowed to about half the laptop screen area (which was the window size OnLive picked without input from me). When looking at static text displays, like OnLive control panels, it dropped down to a much smaller amount, in the 0.01% range; but that's not what you want to spend time doing with a system like this. <br />
<br />
I observed these values playing (Borderlands) and watching game trailers for a collection of "coming soon" games like Deus Ex, Drive, Darksiders, Two Worlds, Driver, etc. If you stand still in a non-action situation, it does go down to about 3% (of 100 Mb/sec) for me, but with action games that isn't the point.<br />
<br />
6.4% of 100 Mb/sec is about 2.9 GB (bytes) per hour. That hurts. <br />
<br />
My ISP, Comcast, considers over 250 GB/month "excessive usage" and grounds for terminating your account if you keep doing it regularly. That limit and OnLive's bandwidth together mean that over a 30-day period, Comcast customers can't play more than 3 hours a day without being considered "excessive."<br />
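For the skeptical, here's the arithmetic, using only the numbers above:<br />
<pre>
# 6.4% of a 100 Mb/s link, converted to GB per hour, then held against
# Comcast's 250 GB/month "excessive use" threshold over a 30-day month.

link_bits_per_sec = 0.064 * 100e6             # 6.4% of 100 Mb/s = 6.4 Mb/s
gb_per_hour = link_bits_per_sec / 8 * 3600 / 1e9

cap_gb_per_month = 250.0
hours_per_day = cap_gb_per_month / gb_per_hour / 30

print(f"{gb_per_hour:.1f} GB/hour")           # about 2.9 GB per hour of play
print(f"{hours_per_day:.1f} hours/day")       # about 2.9 hours a day hits the cap
</pre>
<br />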
<br />
<br />
<h2>Prices<br />
</h2>I also found that prices are not a bargain, unless you're counting the money you save using a bargain PC – one that costs, say, what a game console costs.<br />
<br />
First, you pay for access to OnLive itself. For now that can be free, but after a year it's slated to be $4.95 a month. That's scarcely horrible. But you can't play anything with just access; you need to also buy a GamePass for each game you want to play.<br />
<br />
A Full GamePass, which lets you play it forever (or, presumably, as long as OnLive carries the game) is generally comparable to the price of the game itself, or more for the PC version. For example, the Borderlands Full GamePass is $29.99, and the game can be purchased for $30 or less (one site lists it for $3! (plus about $9 shipping)). F.E.A.R. 2 is $19.99 GamePass, and the purchase price is $19-$12. Assassin's Creed II was a loser, with GamePass for $39.99 and purchased game available for $24-$17. The standalone game prices are from online sources, and don't include shipping, so OnLive can net a somewhat smaller total. And you can play it on a cheap PC, right? Hmmm. Or a console.<br />
<br />
There are also, in many cases, 5-day and 3-day passes, typically $9-$7 for 5-day and $4-$6 for 3-day. As a try-before-you-buy, maybe those are OK, but 30-minute free demos are available, too, making a reasonably adequate try available for free.<br />
<br />
Not all the prices are that high. There's something called AAAAAAA, which seems to consist entirely of falling from tall buildings, with a full GamePass for $9.99; and Brain Challenge is $4.99. I'll bet Brain Challenge doesn't use much bandwidth, either.<br />
<br />
The correspondence between Full GamePass and the retail price is obviously no coincidence. I wouldn't be surprised at all to find that relationship to be wired into the deals OnLive has with game publishers. Speculation, since I just don't know: Do the 5 or 3 day pass prices correspond to normal rental rates? I'd guess yes.<br />
<br />
<br />
<h2>Simplicity & the Mac Factor<br />
</h2>A real plus for OnLive is simplicity. Installation is just pure dead simple, and so is starting to play. Not only do you not have to acquire the game, but there's also no installation and no patching; you just select the game, get a GamePass (zero time with a required pre-registered credit card), and go. Instant gratification.<br />
<br />
Then there's the Mac factor. If you have only Apple products – no console and no Windows PC – you are simply shut out of many games unless you pursue the major hassle of BootCamp, which also requires purchasing a copy of Windows and doing the Windows maintenance. But OnLive runs on Macs, so a wide game experience is available to you immediately, without a hassle.<br />
<br />
<br />
<h2>Conclusion<br />
</h2></span><span xmlns="">To sum up:<br />
<br />
Positive: great video quality, great playability, hassle-free instant gratification, and the Mac factor.<br />
<br />
Negative: Marginally competitive game prices (at best) and bandwidth, bandwidth, bandwidth. The cost can be argued, and may get better over time, but your ISP cutting you off for excessive data usage is pretty much a killer.<br />
<br />
So where does this leave OnLive and, as a consequence, the market for GPUs? I think the bandwidth issue says that OnLive will have little impact in the near future.<br />
<br />
However, this might change. Locally, Comcast TV ads showing off their "Xfinity" rebranding had a small notice indicating that 105 Mb/sec data rates would be available in the future. It seems those have disappeared, so maybe it won't happen. But a 10X data rate improvement wouldn't mean much if you also didn't increase the data usage cap, and a 10X usage cap increase would completely eliminate the bandwidth issue.<br />
<br />
Or maybe the Net Neutrality guys will pick this up and succeed. I'm not sure on that one. It seems like trying to get water from a stone if the backbone won't handle it, but who knows?<br />
<br />
The proof, however, will be in the playing – and in the resulting market share – so we can just watch to see how this works out. The threat is still there, just masked by bandwidth requirements.<br />
<br />
(And I still think virtual worlds should evaluate this technology closely. Installation difficulty is a key inhibitor to several markets there, forcing extreme measures – like shipping laptops already installed – in one documented case; see <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/05/living-in-it-tale-of-learning-in-second.html">Living In It: A Tale of Learning in Second Life</a>.)<br />
</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com3tag:blogger.com,1999:blog-3155908228127841862.post-76112796843348698332010-07-12T15:11:00.000-06:002010-07-12T15:11:32.506-06:00Who Does the Shoe Fit? Functionally Decomposed Hardware (GPGPUs) vs. Multicore.<span xmlns=""></span><br />
<span xmlns="">This post is a long reply to the thoughtful comments on my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/06/wnpots-and-conservatism-of-hardware.html">WNPoTs and the Conservatism of Hardware Development</a> that were made by <a bitly="BITLY_PROCESSED" href="http://www.blogger.com/profile/08451601432189062228">Curt Sampson</a> and <a bitly="BITLY_PROCESSED" href="http://www.blogger.com/profile/06349106495739189749">Andrew Richards</a>. The issue is: Is functionally decomposed hardware, like a GPU, much harder to deal with than a normal multicore (SMP) system? (It's delayed. Sorry. For some reason I ended up in a mental deadlock on this subject.)<br />
<br />
I agree fully with Andrew and Curt that using functionally decomposed hardware can be straightforward <strong><em>if</em></strong> the hardware performs exactly the function you need in the program. If it does not, massive amounts of ingenuity may have to be applied to use it. I've been there and done that, trying at one point to make some special-purpose highly-parallel hardware simulation boxes do things like chip wire routing or more general computing. It required much brain twisting and ultimately wasn't that successful.<br />
<br />
However, GPU designers have been particularly good at making this match. Andrew made this point very well in a videoed debate over on Charlie Demerjian's SemiAccurate blog: Last-minute changes that would be completely anathema to general-purpose (GP) processor designs are apparently par for the course with GPU designs.<br />
<br />
The embedded systems world has been dealing with functionally decomposed hardware for decades. In fact, a huge part of their methodology is devoted to figuring out where to put a hardware-software split to match their requirements. Again, though, the hardware does exactly what's needed, often through last-minute FPGA-based hardware modifications.<br />
<br />
However, there's also no denying that the mainstream of software development, all the guys who have been doing Software Engineering and programming system design for a long time, really don't have much use for anything that's not an obvious Turing Machine onto which they can spin off anything they want. Traditional schedulers have a rough time with even clock speed differences. So, for example, traditional programmers look at Cell SPUs, with their manually-loaded local memory, and think they're misbegotten spawn of the devil or something. (I know I did initially.)<br />
<br />
This train of thought made me wonder: Maybe traditional cache-coherent MP/multicore actually is hardware specifically designed for a purpose, like a GPU. That purpose is, I speculate, transaction processing. This is similar to a point I raised long ago in this blog (<a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2008/10/it-departments-should-not-fear.html">IT Departments Should NOT Fear Multicore</a>), but a bit more pointed.<br />
<br />
Don't forget that SMPs have been around for a very long time, and practically from their inception in the early 1970s were used transparently, with no explicit parallel programming and code very often written by less-than-average programmers. Strongly enabling that was a transaction monitor like IBM's CICS (and lots of others). All code is written as relatively small chunks (debit this account, and update the cash on hand and the bank's total cash…). Each chunk is automatically surrounded by all the locking it needs, is called by the monitor when a customer implicitly invokes it, and can be backed out as needed either by facilities built into the monitor or by a back-end database system.<br />
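<br />
To make the shape of that concrete – a toy sketch of the pattern only, with nothing to do with how CICS or any real monitor is actually implemented – the application programmer writes the little sequential chunk, and the monitor supplies the locking and the backout:<br />
<pre>
import threading

class ToyMonitor:
    """A cartoon transaction monitor: it wraps each chunk in locking and backout."""
    def __init__(self, datastore):
        self.datastore = datastore        # shared state, e.g. account balances
        self.lock = threading.Lock()      # the monitor owns all the locking

    def run(self, chunk, *args):
        with self.lock:                   # serialize access; real monitors lock far finer-grained
            snapshot = dict(self.datastore)
            try:
                return chunk(self.datastore, *args)
            except Exception:
                self.datastore.clear()    # the chunk failed: back it out
                self.datastore.update(snapshot)
                raise

# The application programmer's part: plain sequential code, no locks, no parallelism.
def debit(accounts, name, amount):
    if amount > accounts[name]:
        raise ValueError("insufficient funds")
    accounts[name] -= amount
    accounts["cash_on_hand"] += amount
    return accounts[name]

monitor = ToyMonitor({"alice": 100, "cash_on_hand": 0})
print(monitor.run(debit, "alice", 30))    # 70; the locking and backout were implicit
</pre>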
<br />
It works, and it works very well right up to the present, even with programmers so bad it's a wonder they don't make the covers fly off the servers. (OK, only a few are that bad, but the point is that genius is not required.)<br />
<br />
Of course, transaction monitors aren't a language or traditional programming construct, and also got zero academic notice except perhaps from Jim Gray. But they work, superbly well on SMP / multicore. They can even work well across clusters (clouds) as long as all data is kept in a separate backend store (perhaps logically separate) – a model which, by the way, is the basis of a whole lot of cloud computing.<br />
<br />
Attempts to make multicores/SMPs work in other realms, like HPC, have been fairly successful but have always produced cranky comments about memory bottlenecks, floating-point performance, how badly caches fit the requirements, etc., comments you don't hear from commercial programmers. Maybe this is because the hardware was designed for them? That question is, by the way, deeply sarcastic; performance on transactional benchmarks (like TPC's) is the guiding light and laser focus of most larger multicore / SMP designs.<br />
<br />
So, overall, this post makes a rather banal point: If the hardware matches your needs, it will be easy to use. If it doesn't, well, the shoe just doesn't fit, and will bring on blisters. However, the observation that multicore is actually a special purpose device, designed for a specific purpose, is arguably an interesting perspective.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-23743560417411179462010-07-06T01:17:00.002-06:002010-07-07T15:42:36.950-06:00OnLive Works! First Use Impressions<span xmlns=""></span><br />
<span xmlns="">I've tried <a bitly="BITLY_PROCESSED" href="http://www.onlive.com/">OnLive</a>, and it works. At least for the games I tried, it seems to work quite well, with no noticeable lag and fine video quality. But I'm not sure about the bandwidth issue yet, or the cost.<br />
<br />
OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. I've noted previously that this could hurt the mass market for GPUs, since it doesn't need much graphics on the client. But there were serious questions (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/04/twilight-of-gpu.html">Twilight of the GPU?</a>) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?<br />
<br />
As I said above: Lag, check. Video, check. I found no problems there. Bandwidth, inconclusive. Cost, ditto. More data will answer those, but I've not had the chance to gather it yet. Here's what I did:<br />
<br />
I somehow was "selected" from their wait-list as an OnLive founding member, getting me free access for a year – which doesn't mean I play totally free for a year; see below – and tried it out today, playing free 30-minute demos of Assassin's Creed II a little bit, and Borderlands enough for a good impression.<br />
<br />
Assassin's Creed II was fine through initial cutscenes and minor initial movement. But when I reached the point where I was reborn as a player in medieval times, I ran into a showstopper. As an introduction to the controls, the game wanted me to press <squiggle_icon> to move my legs. <squiggle_icon>, unfortunately, corresponds to no key on my laptop. I tried everything plus shift, control, and alt variations, and nothing worked. In the process I accidentally created a brag clip, went back to the OnLive dashboard, and did some other obscure things I never did figure out, but never did move my legs. I moved my arms with about four different key combinations, but the game wasn't satisfied with that. So I ditched it. For all I know there's something on the OnLive web site explaining this, but I didn't look enough to find it.<br />
<br />
I was much more successful with Borderlands, a post-apocalyptic first-person shooter. I completed the initial training mission, leveled up, and was enjoying myself when the demo time – 30 minutes, which I consider adequately generous – ran out. Targeting and firing seemed to be just as good as native games on my system. I played both in a window and in fullscreen mode, and at no time was there noticeable lag or any visual artifacts. It just played smoothly and nicely.<br />
<br />
I wanted to try Dragon Age – I'm more of an RPG guy – but while it shows up on the web site, I couldn't find it among the games available for play on the live system.<br />
<br />
This is not to say there weren't hassles and pains involved in getting going. Here are some details.<br />
<br />
First, my environment: The system I used is a Sony Vaio VGN-2670N, with Intel Core Duo @ 2.66 GHz, a 1600x900 pixel display, with 4GB RAM and an Nvidia GeForce 9300M; but the Nvidia display adapter wasn't being used. For those of you wondering about speed-of-light delays, my location is just North of Denver, CO, so this was all done more than 1000 miles from the closest server farm they have (Dallas, TX). My ISP is Comcast cable, nominally providing 10 Mb/sec; I have seen it peak as high as 15 Mb/sec in spurts during downloads. My OS is 32-bit Windows Vista. (I know…)<br />
<br />
There was a minor annoyance at the start, since their client installer refuses to even try using Google Chrome as the browser. IE, Firefox, and Safari are supported. But that only required me to use IE, which I shun, for the install; it's not used running the client.<br />
<br />
The much bigger pain is that OnLive adamantly refuses to run over Wifi. The launcher checks, gives you one option – exit – and points you to a FAQ, a pointer that gets a 404 (page not found). I did find the <a bitly="BITLY_PROCESSED" href="http://www.onlive.com/support/performance">relevant FAQ</a> manually on the web site. There they apologize and say it "does indeed work well with good quality Wi-Fi connections, and in the future OnLive will support wireless" but initially they're scared of bad packet-dropping low-signal-strength crud. I can understand this; they're fighting an uphill battle convincing people this works at all, and do not need a multitude complaining it doesn't work when the problem is crummy Wi-Fi. (Or WiFi in a coffee shop – a more serious issue; see bandwidth discussion below.)<br />
<br />
Nevertheless, this is a pain for me. I had to go down in the basement and set up a chair where my router is, next to my water heater, to get a wired connection. When I did go down there, after convincing Vista (I know!) to actually use the wired connection, things went as described above.<br />
<br />
That leaves one question: Bandwidth. My ISP, Comcast, has a 250 GB/month limit beyond which I am an "excessive user" and apparently get a stern talking-to, followed by account termination if I don't mend my excessive ways. Up to now, this has been far from an issue. With OnLive, it may be a significant limitation.<br />
<br />
Unfortunately, I didn't monitor my network use carefully when using OnLive, and ran out of time to go back and do better monitoring. I'll report more when I've done that. However, checking some numbers provided by Comcast after the fact, I can see the possibility that averaging four hours a day is all the OnLive I could do and not get terminated, since my hour of use <strong><em>may</em></strong> (just may) have sucked down 2 GB. This could be a significant issue, limiting OnLive to only very casual users, but I need better measurement to be sure.<br />
<br />
This also points to a reason for not initially allowing Wifi that they didn't mention: I doubt your local free Wifi hot spot in a Starbucks or McDonald's is really up to the task of serving several OnLive players all day.<br />
<br />
Finally, there's cost. What I have free is access to the OnLive system; after a year that's $4.95/month (which may be a "founding member" deal). But to play other than a free demo, I need to purchase a PlayPass for each game played. I didn't do that, and still need to check that cost. Sorry, time limitations again.<br />
<br />
So where does this leave the market for GPUs? With the information I have so far, all I can say is that the verdict is inconclusive. I think they really have the lag and display issues licked; those just aren't a problem. If I'm wrong about the bandwidth (entirely possible), and the PlayPasses don't cost too much, it could over time deal a large blow to the mass market for GPUs, which among other problems would sink the volumes that make them relatively inexpensive for HPC use.<br />
<br />
On the other hand, if the bandwidth and cost make OnLive suitable only for very casual gaming, there may actually be a positive effect on the GPU market, since OnLive could be used as a very good "try before you buy" facility. It worked for me; I've been avoiding first-person shooters in favor of RPGs, but found the Borderlands demo to be a lot more fun than I expected.</span><br />
<span xmlns=""><br />
</span><br />
<span xmlns="">Finally, I'll just note that Second Life recently changed direction and is saying they're going to move to a browser-based client. They, and other virtual world systems, might do well to consider instead a system using this type of technology. It would expand the range of client systems dramatically, and, even though there is a client, simplify use dramatically.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com6tag:blogger.com,1999:blog-3155908228127841862.post-49468890052369582712010-06-14T17:22:00.000-06:002010-06-14T17:22:37.332-06:00WNPoTs and the Conservatism of Hardware Development<span xmlns=""></span><br />
<span xmlns="">There are some things about which I am undoubtedly considered a crusty old fogey, the abominable NO man, an ostrich with its head in the sand, and so on. Oh frabjous day! I now have a word for such things, courtesy of <a bitly="BITLY_PROCESSED" href="http://www.antipope.org/charlie/blog-static/2010/05/cmap-9-ebooks.html">Charlie Stross, who wrote</a>:<br />
<br />
<blockquote>Just contemplate, for a moment, how you'd react to some guy from the IT sector walking into your place of work to evangelize a wonderful new piece of technology that will <em>revolutionize your job</em>, once everybody in the general population shells out £500 for a copy and <em>you</em> do a lot of hard work to teach them how to use it. And, on closer interrogation, you discover that he <em>doesn't actually know what you do for a living</em>; he's just certain that his WNPoT is going to revolutionize it. Now imagine that this happens (different IT marketing guy, different WNPoT, same pack drill) approximately once every two months for a five year period. You'd learn to tune him out, wouldn't you?<br />
</blockquote>I've been through that pack drill more times than I can recall, and yes, I tune them out. The WNPoTs in my case were all about technology for computing itself, of course. Here are a few examples; they are sure to step on a number of toes:<br />
<br />
<ul><li>Any new programming language existing only for parallel processing, or for any reason other than making programming itself simpler and more productive (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2008/09/101-parallel-languages-part-1.html">101 parallel languages</a>)<br />
</li>
<li>Multi-node single system image (see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/01/multi-multicore-single-system-image.html">Multi-Multicore Single System Image</a>)<br />
</li>
<li><a bitly="BITLY_PROCESSED" href="http://www.memristor.org/">Memristors</a>, a new circuit type. A key point here is that exactly one company (HP) is working on it. Good technologies instantly crystallize consortia around themselves. Also, HP isn't a silicon technology company in the first place. <br />
</li>
<li><a bitly="BITLY_PROCESSED" href="http://en.wikipedia.org/wiki/Quantum_computer">Quantum computing</a>. Primarily good for just one thing: Cracking codes.<br />
</li>
<li>Brain simulation and strong artificial intelligence (really "thinking," whatever that means). Current efforts were beautifully characterized by John Horgan, in a <a bitly="BITLY_PROCESSED" href="http://www.scientificamerican.com/blog/post.cfm?id=artificial-brains-are-imminentnot-2010-05-14">SciAm guest blog</a>: 'Current brain simulations resemble the "planes" and "radios" that <a bitly="BITLY_PROCESSED" href="http://en.wikipedia.org/wiki/Cargo_cult">Melanesian cargo-cult tribes</a> built out of palm fronds, coral and coconut shells after being occupied by Japanese and American troops during World War II.'<br />
</li>
</ul>Of course, for the most part those aren't new. They get re-invented regularly, though, and drooled over by ahistorical evangelists who don't seem to understand that if something has already failed, you need to lay out what has changed sufficiently that it won't just fail again.<br />
<br />
The particular issue of retread ideas aside, genuinely new and different things have to face up to what Charlie Stross describes above, in particular the part about not understanding what you do for a living. That point, for processor and system design, is a lot more important than one might expect, due to a seldom-publicized social fact: Processor and system design organizations are incredibly, insanely, conservative. They have good reason to be. Consider: <br />
<br />
Those guys are building some of the most, if not the most, intricately complex structures ever created in the history of mankind. Furthermore, they can't be fixed in the field with an endless stream of patches. They have to just plain work – not exactly in the first run, although that is always sought, but in the second or, at most, third; beyond that money runs out.<br />
<br />
The result they produce must also please, not just a well-defined demographic, but a multitude of masters from manufacturing to a wide range of industries and geographies. And of course it has to be cost- and performance-competitive when released, which entails a lot of head-scratching and deep breathing when the multi-year process begins.<br />
<br />
Furthermore, each new design does it all over again. I'm talking about the "tock" phase for Intel; there's much less development work in the "tick" process shrink phase. Development organizations that aren't Intel don't get that breather. You don't "re-use" much silicon. (I don't think you ever re-use much code, either, with a few major exceptions; but that's a different issue.)<br />
<br />
This is a very high stress operation. A huge investment can blow up if one of thousands of factors is messed up.<br />
<br />
What they really do to accomplish all this is far from completely documented. I doubt it's even consciously fully understood. (What gets written down by someone paid from overhead to satisfy an ISO requirement is, of course, irrelevant.)<br />
<br />
In this situation, is it any wonder the organizations are almost insanely conservative? Their members cannot even conceive of something except as a delta from both the current product and the current process used to create it, <strong>because that's what worked</strong>. And it worked within the budget. And they have their total intellectual capital invested in it. Anything not presented as a delta of both the current product and process is rejected out of hand. The process and product are intertwined in this; what was done (product) was, with no exceptions, what you were able to do in the context (process).<br />
<br />
An implication is that they do not trust anyone who lacks the scars on their backs from having lived that long, high-stress process. You can't learn it from a book; if you haven't done it, you don't understand it. The introduction of anything new by anyone without the tribal scars is simply impossible. This is so true that I know of situations where taking a new approach to processor design required forming a new, separate organization. It began with a high-level corporate Act of God that created a new high-profile organization from scratch, dedicated to the new direction, staffed with a mix of outside talent and a few carefully-selected high-talent open-minded people pirated from the original organization. Then, very gradually, more talent from the old organization was siphoned off and blended into the new one until there was no old organization left other than a maintenance crew. The new organization had its own process, along with its own product.<br />
<br />
This is why I regard most WNPoT announcements from a company's "research" arm as essentially meaningless. Whatever it is, it won't get into products without an "Act of God" like that described above. WNPoTs from academia or other outside research? Fuggedaboudit. Anything from outside is rejected unless it was originally nurtured by someone with deep, respected tribal scars, sufficiently so that that person thinks they completely own it. Otherwise it doesn't stand a chance.<br />
<br />
Now I have a term to sum up all of this: WNPoT. Thanks, Charlie.<br />
<br />
Oh, by the way, if you want a good reason why the Moore's Law half-death that flattened clock speeds produced multi- / many-core as a response, look no further. They could only do more of what they already knew how to do. It also ties into how the very different computing designs that are the other reaction to flat clocks came not from CPU vendors but outsiders – GPU vendors (and other accelerator vendors; see my post <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2009/07/why-accelerators-now.html">Why Accelerators Now?</a>). They, of course, were also doing more of what they knew how to do, with a bit of Sutherland's Wheel of Reincarnation and DARPA funding thrown in for Nvidia. None of this is a criticism, just an observation.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com5tag:blogger.com,1999:blog-3155908228127841862.post-71705921707242291772010-06-08T21:43:00.002-06:002010-06-09T19:52:54.897-06:00Ten Ways to Trash your Performance Credibility<span xmlns=""></span><br />
<span xmlns="">Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming. <br />
<br />
Danger: The showiest blooms are toxic to your credibility.<br />
<br />
(My wife is planting flowers these days. Can you tell?)<br />
<br />
There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:<br />
<br />
"I would have to be pretty drunk to believe that."<br />
<br />
Your great, customer-verified, most wonderful results have trashed your credibility.<br />
<br />
Here are some examples:<br />
<br />
In a <a bitly="BITLY_PROCESSED" href="http://perilsofparallel.blogspot.com/2010/05/all-hail-gpu-tweetstream.html">recent talk</a>, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.<br />
<br />
In an <a bitly="BITLY_PROCESSED" href="http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&gid=95317&discussionID=18896209&split_page=2&goback=.anh_95317">online discussion</a> on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a Quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under non-disclosure agreement.<br />
<br />
In a <a bitly="BITLY_PROCESSED" href="http://www.accelereyes.com/resources/spectroscopy">video interview</a>, a customer doing cell pathology image analysis reports their task going from 400 <span class="Apple-style-span" style="color: red;">minutes</span> to 65 milliseconds, for a speedup of just under 370,000X. <span class="Apple-style-span" style="color: red;">(Update: Typo, he really does say "minutes" in the video.)</span><br />
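<br />
(For the record, the arithmetic behind those last two speedups – just my own back-of-the-envelope check – does work out as quoted:)<br />
<pre>
# Sanity-check the two quoted speedups above.
cray_speedup = (12 * 3600) / 1.2          # 12 hours down to 1.2 seconds
pathology_speedup = (400 * 60) / 0.065    # 400 minutes down to 65 milliseconds

print(cray_speedup)        # 36000.0
print(pathology_speedup)   # ~369,231 -- "just under 370,000X"
</pre>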
<br />
None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find him- or herself dismissed as, well, the polite version would be that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.<br />
<br />
What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.<br />
<br />
This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler <strong>without</strong> vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.<br />
<br />
The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related modernized version of this story is told in a recent paper <a bitly="BITLY_PROCESSED" href="http://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/9192e6536facfcef85257720005a0265!OpenDocument&Highlight=0,Bordawekar"><em>Believe It or Not! Multi-core CPUs can Match GPU Performance</em></a>, where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's using, great things can happen. <br />
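<br />
A toy modern analogue of that third step – my own example, nothing from the paper – is easy to run yourself: take a "grotty" element-at-a-time loop, clean it up, and time both on the same plain CPU, with no accelerator anywhere in sight:<br />
<pre>
import time
import numpy as np

def grotty_sum_of_squares(data):
    total = 0.0
    for value in data:             # interpreted, one element at a time
        total += value * value
    return total

data = np.random.rand(10_000_000)

t0 = time.perf_counter()
slow = grotty_sum_of_squares(data)
t1 = time.perf_counter()
fast = float(np.dot(data, data))   # the "cleaned up" version, same hardware
t2 = time.perf_counter()

print(f"grotty:  {t1 - t0:.3f} s")
print(f"cleaned: {t2 - t1:.3f} s") # typically far, far faster -- and no accelerator got the credit
</pre>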
<br />
This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that Power7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution and, using that information, present your results with the credit given where it's due.<br />
<br />
The situation we're in is identical to the one that led someone (wish I remembered who), decades ago, to write a short paper titled, approximately, <em>Ten Ways to Lie about Parallel Processing</em>. I thought I kept a copy, but if I did I can't find it. It was back at the dawn of whatever, and I can't find it now even with Google Scholar. (If anyone out there knows the paper I'm referencing, please let me know.) <span class="Apple-style-span" style="color: red;">Got it! It's </span><a bitly="BITLY_PROCESSED" href="http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf"><span class="Apple-style-span" style="color: red;">Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers</span></a><span class="Apple-style-span" style="color: red;">, by David H. Bailey. Thank you, </span><a bitly="BITLY_PROCESSED" href="http://www.blogger.com/profile/15007027365887236970"><span class="Apple-style-span" style="color: red;">Roland</span></a><span class="Apple-style-span" style="color: red;">!</span><br />
<br />
In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:<br />
<br />
</span><br />
<span xmlns=""></span><br />
<span xmlns=""></span><br />
<span xmlns=""></span><br />
<span xmlns=""></span><br />
<span xmlns=""><ol><li>Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task.<br />
</li>
<li>Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,<br />
</li>
<li>Pick a problem size that just barely fits into the local memory you have available. Why? See #4.<br />
</li>
<li>Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not.<br />
</li>
<li>Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.<br />
</li>
<li>Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.<br />
</li>
<li>Allow a <em>slightly</em> different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't (see the short example after this list).<br />
</li>
<li>Change the operating system. Pick the one that does IO to your device fastest.<br />
</li>
<li>Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!<br />
</li>
<li>Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.<br />
</li>
</ol></span><span xmlns="">This, of course, is just a start. I'm sure there are another ten or a hundred out there.<br />
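<br />
Since #7 is the one people most often wave away, here is a minimal sketch of what a "slightly different answer" looks like; the values are ones I picked to make the effect obvious, not anything taken from the cases above:<br />
<pre>
# Toy illustration of #7: floating-point arithmetic does not distribute
# the way the algebra says it should.  The values are chosen (by me, for
# this post) so the difference is easy to see on any IEEE-754 double.
a, x, y = 3.0, 1.0, 1e-16

lhs = a * (x + y)      # x + y rounds to exactly 1.0, so this is exactly 3.0
rhs = a * x + a * y    # a*y survives the rounding here, nudging the sum up one ulp

print(lhs)             # 3.0
print(rhs)             # 3.0000000000000004
print(lhs == rhs)      # False -- a "slightly different answer"
</pre>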
<br />
A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes. <br />
<br />
For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list <em>before</em> you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.<br />
<br />
*****************<br />
<br />
Thanks to John Melonakos of <a bitly="BITLY_PROCESSED" href="http://www.accelereyes.com/">Accelereyes</a> for discussion and sharing his thoughts on this topic.</span>Greg Pfisterhttp://www.blogger.com/profile/12651996181651540140noreply@blogger.com15