Comments on "The Perils of Parallel: MIC and the Knights" (Greg Pfister)

pip010 (2013-01-08 03:49):

It is developers! Hardware evolution matters less! The utopia of supporting highly parallel apps on legacy code will die!

"Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language."

HELL, again, HELL.

In a new-paradigm world we need a new way of thinking, and FOR loops are a thing of the past.

What Intel's customers will eventually have to live with is the fact that they are buying highly overpraised hardware that yields very little improvement on their legacy code. They will need to change the legacy code, not the hardware, duh...
niall (2011-10-31 12:06):

Debug is indeed a quagmire, both for the systems software stack on the host and device sides and for the user application code. Personally, I think a promising direction is technology such as Corensic's Jinx tool. It uses binary translation and virtualization techniques to slide a microvisor underneath your OS and intercept some or all of the running code. It then periodically interrupts execution to explore the state space of thread interactions in an attempt to trigger bugs, forcing the OS or application down a particular execution path. Pretty neat stuff (I hope I got the basic explanation correct; I'm not privy to their implementation).

This is the kind of thing I'd love to see as a standard facility on boxes going forward, for example on the MIC cards and hosts, alongside the more traditional debugging/debugger support.

An overall theme in my interactions with hardware systems designers, as someone who architects large commercial systems and the software for them, is that beyond raw silicon, I'm increasingly looking to them to provide the necessary enabling layers, such as the lowest levels of dynamic optimizers and better debug/test capabilities. I want to aggressively adopt new hardware, but they need to meet me halfway. Yes, there's always a toolchain, naturally focused on the traditional needs, but there is definitely still a lot that can be done to enable us to build new high-performance apps.

On the optimization side, I'm excited by the possibilities of the Intel ArBB virtual machine:

http://software.intel.com/sites/products/documentation/arbb/vm/index.htm

and hope to see it across the entire product line at some point. That would let my applications express the parallelism and have Intel's software do the final code generation and optimization for the cores and the exact SIMD family available.
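(The class of bug such interleaving-exploration tools hunt for is worth making concrete. Below is a minimal C++11 sketch, purely illustrative and unrelated to Jinx's actual interface: a textbook data race whose outcome depends entirely on thread scheduling, which is exactly why a tool that can force particular schedules is valuable.)

    // Two threads perform unsynchronized read-modify-write on a shared int.
    // Most runs lose some increments; which ones depends on the schedule.
    #include <iostream>
    #include <thread>

    int counter = 0;  // shared and unprotected: the bug

    void worker() {
        for (int i = 0; i < 100000; ++i)
            ++counter;  // load, increment, store: not atomic
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join();
        t2.join();
        // Expected 200000; anything less means two increments interleaved.
        std::cout << counter << '\n';
    }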
melonakos (2011-10-31 12:04):

I think the presumption that Intel can get "-mmic" to work in a way that provides enough benefit to justify the purchase of the new part is the biggest hole in MIC's plans. I have little faith in a compiler being able to exploit data parallelism to any real degree automatically. The community loves to talk about this, but never delivers anything close. Maybe this is different, but we're not counting any chickens until they hatch on any of this.

My biased disclaimer, of course, is that I come from building GPU libraries, which are painstakingly built by hand to produce top-notch performance. My experience is that no compiler can come close to automatically producing the benefits we put into our library.

All this said, we are very happy for Intel to come out with MIC and think it will find a nice customer base when it arrives. We look forward to supporting it with our library too. :)

-John at AccelerEyes
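(A small example of why that skepticism persists. The two C++ loops below are illustrative, not AccelerEyes code; on paper they carry the same data parallelism, but only the first can a compiler prove safe to vectorize.)

    // The first loop has independent iterations (modulo aliasing, which the
    // compiler can guard with a runtime check), so auto-vectorizers usually
    // handle it. The second scatters through idx[], so two iterations may
    // write the same element; the compiler must assume they conflict and
    // typically leaves the loop scalar.
    void scale(float* out, const float* in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];        // vectorizable
    }

    void scatter_scale(float* out, const float* in, const int* idx, int n) {
        for (int i = 0; i < n; ++i)
            out[idx[i]] = 2.0f * in[i];   // possible write conflicts: scalar
    }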
Greg Pfister (2011-10-31 10:59):

khb, yes, Intel has run into the debug quagmire. Take a look at my interview with Jim Reinders a few posts back (http://goo.gl/SdPeD). He has horror stories of weeks spent at customers' locations trying to find bugs. He also has some favorite new tools that make him "cautiously optimistic."

Greg Pfister (2011-10-31 10:54):

Anonymous, ispc may well be a good thing for compilers. But I consider the far harder part to be finding the (vector/SIMD) parallelism in the first place.

Greg Pfister (2011-10-31 10:52):

niall,

No problem with choirs; they're everywhere, all singing in different keys. :-)

Yes, GPUs must have fixed-function units, but few compared with cores. Intel now says >50 cores, which alone is a big multiplier. Which takes more aggregate area? It depends on the implementation, but I'd say fixed function is smaller.

A recent talk by Dally (http://goo.gl/aITIh) did show some conventional cores, but few compared with the >50 (64? 128?) of KC.

There may well be an advantage to not focusing on consumer markets. But that also eliminates volume, raising price even more. Price is a big deal in Nvidia's sales; see my post http://goo.gl/i2rZ (which was, of course, premature, particularly given Sandy Bridge's mediocre graphics performance; but that's another post).

khb (2011-10-31 10:33):

Wonderful summary, thanks for posting it.

Porting is often not nearly as hard as tuning (as you observed). Even harder is debugging. What comments, if any, did Intel make about support beyond the pain and suffering currently offered by GPU environments? (Michael Wolfe's lovely intro on the PGI website to their unified compilation environment makes that sadly clear as an Achilles heel.)

As to the interconnect, Thunderbolt would seem like a better approach than PCI, but perhaps there are niceties that escape me.
Anonymous (2011-10-31 08:03):

Greg -- are you familiar with ispc (http://ispc.github.com/)? It's an experimental compiler that presents a scalar, implicit-vectorization model of the x86 SSE/AVX vector units. It's similar to what Intel's OpenCL compiler does, but without the cruft of that programming model.

It seems like it would be an ideal way to express data parallelism for these wide-vector, multicore Knights' devices.
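(To make the "cruft" contrast concrete, here is a minimal sketch in plain C++ with SSE intrinsics; ispc's actual syntax differs, and the function names are invented for illustration. The first version is the width-agnostic scalar style an implicit-vectorization model compiles for you; the second is the same kernel hand-mapped to 4-wide SSE.)

    #include <xmmintrin.h>  // SSE intrinsics

    // Per-element logic written once, independent of any vector width:
    void saxpy_scalar(float* y, const float* x, float a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Hand-mapped to 4-wide SSE. Assumes n is a multiple of 4 and the
    // pointers are 16-byte aligned -- the bookkeeping "cruft" that an
    // implicit model hides, and that must be redone per SIMD family.
    void saxpy_sse(float* y, const float* x, float a, int n) {
        __m128 va = _mm_set1_ps(a);
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(x + i);
            __m128 vy = _mm_load_ps(y + i);
            _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
    }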
niall (2011-10-30 19:22):

Greg -- apologies for preaching to the choir here!

"In the GPU-SIMD vs vector peak FLOPS wars, I simply feel that GPUs' avoiding all the individual cores' instruction cache / fetch / decode / schedule logic, plus cache coherence, just leaves that much more space for FP units and lanes. So they win (maybe not usably). This is far from a new argument."

The GPUs also pay a price for being graphics monsters. For example, the Fermi chip has hardware fixed-function units for graphics workloads that a compute-dedicated MIC could avoid: a raster engine (edge setup, rasterization, z-cull) per four cores, a polymorph engine per core (i.e., vertex attribute operations and tessellation), along with the texture caches and texture units for fetch and filter. There are, if I remember correctly, eight ROP units on the chip. The logic and memory for thread scheduling and state are also not insignificant, nor is the implicit SIMD divergence-handling logic.

In Dally-style projections for future, more compute-oriented GPUs, I believe we'll still need those units, and he also projects adding latency-oriented cores that will suffer the costs you rightly highlight. Those features will have to pay their way, though; the volume graphics-oriented parts need not have all of them, I think.

It is interesting to compare die sizes between Nvidia and AMD for their high-end discrete GPU products. We can observe some of the extra costs in the Nvidia product line for better supporting compute workloads (such as ECC, coherent caches, and high DP performance) that are not as useful for graphics. Given that their volume products are graphics-oriented, with some lesser use of compute for physics and the like, it is interesting to ponder how long the graphics parts can be expected to carry these extra costs before the compute revenue is large enough to pay their way.

Given an emphasis on maximizing performance for a given power budget, looking at perf/W, and competing in fairly tight mass-market price segments, you really don't want to build too big a chip in the consumer markets, lest you bugger your gross margins and over-consume supply-constrained wafers on large dies with low yields.

Anonymous -- I agree that being able to get code up and running ASAP is huge. Getting initially decent performance via automatic parallelization/vectorization would be great. While I'm all for further work in the area, I don't foresee any dramatic breakthroughs in our ability to extract parallelism from code in existing languages. There is slack between what production compilers currently do and what research compilers can do, but beyond that I don't know of any particular big hammer we have. I'm on the side of higher-level languages designed to let the programmer express parallelism more easily, and I'm putting my money where my mouth is by working on that.

Anonymous (2011-10-30 18:42):

I agree that vectorization annotations will always have a margin for better results than automatic compiler analysis. The question is really whether that margin is getting small enough that the extra performance you can wring out by carefully annotating your source code is worth the time spent.

Any effort Intel puts into improving the vectorization algorithms in icc will be a payoff for MIC, at least in the "legacy codes" market.

Intel's salespeople should be saying: "If you go GPGPU, then you'll have to rework your source; you have no choice. If you go with MIC, then you can try just recompiling, and that might get you 90% of the way there with a lot less time spent. And if it doesn't, you can annotate your source, and it will perform as well on MIC as it would have on the GPGPUs, since our better process technology makes up for the GPGPUs' architectural advantages."
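(For readers who haven't seen the annotation route, a minimal C++ sketch of what it looks like in practice. The function name is invented; __restrict is a common compiler extension, and the pragma spelling varies by era: icc of that time offered pragmas such as #pragma simd and #pragma ivdep, while OpenMP later standardized #pragma omp simd.)

    // The annotations hand the compiler the guarantees it cannot infer on
    // its own: __restrict promises out[] and in[] never alias, and the
    // pragma asserts the iterations are safe to execute across SIMD lanes.
    void smooth(float* __restrict out, const float* __restrict in, int n) {
        #pragma omp simd
        for (int i = 1; i < n - 1; ++i)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) * (1.0f / 3.0f);
    }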
Greg Pfister (2011-10-30 15:32):

Niall, thanks for the highly informed and interesting comment.

Some responses:

In the GPU-SIMD vs vector peak FLOPS wars, I simply feel that GPUs' avoiding all the individual cores' instruction cache / fetch / decode / schedule logic, plus cache coherence, just leaves that much more space for FP units and lanes. So they win (maybe not usably). This is far from a new argument.

However, such a comparison assumes equivalent silicon technology, a situation that won't exist. What will be will be, and we just have to wait and see.

I agree that the Larrabee new instructions (LRBNI) look enticingly fit for compilers. I worry that they're too fit: every possible combination seems to be in silicon. This seems nice and general to software, but hardware isn't software, and I see a RISC/CISC issue here. Analysis of actual codes will probably show only a few ops intensively used and the rest wasting silicon (and power), better supported by software layers, with the silicon spent instead on better pipelining and operation overlap that benefits everything.

niall (2011-10-29 18:10):

"to be the Intel highly-parallel technical computing accelerator"

True, although trends suggest there could be a MIC that fits in a regular socket, or that MIC cores could be integrated with Xeon cores on a single chip. I think an advantage we shouldn't discount for Intel is their ability to mix and match well-designed cores on a chip.

"No binary compatibility, though; you need to recompile"

True, although looking years down the roadmap, it's not inconceivable that there will be some convergence. AVX is defined to allow wider revisions, and surely MIC experience will feed into ISA extensions. To me, however, we really need a standard low-level virtual machine layer. Expressing the parallelism and locality in my program is one thing, and I'm on the side of getting the programmer to help. However, prematurely binding my app to a specific version of the x86 ISA is a completely different thing. There are definite advantages to letting me express my data parallelism in some intermediate way and then specializing at runtime.

As you rightly point out, extracting good performance requires pervasive use of the SIMD extensions in addition to having all cores cranking away. I think there's a lot more work required to get us to the point where everyday application programmers have tools high-level enough to let us do that productively.

I can't really comment on national-lab big ol' honkin' Fortran codes, but in my application domain (finance and other commercial big-data apps), we write lots of new code and completely turn over our source base surprisingly rapidly.

Most of the applications haven't been written yet. IMHO, Intel really need to also deliver on toolchains, or support those creating them, that handle our kinds of big-data and compute-intensive processing. We have abundant task and data parallelism and can make excellent use of SIMD processing in our database kernels, but we're a different market than traditional HPC.

BTW, I'm not sure I buy the argument that GPUs will always have more peak performance due to wider SIMD width. E.g., Fermi is 16 cores with 32 SIMD lanes each, for 512 SP lanes total, half that for DP. At 50 cores with 512-bit vectors (eight DP lanes each), KNC would be 400 lanes, or 512 with 64 cores. Assuming the same ratio between SP and DP performance, the first MIC part could be in the ballpark, assuming 64 as the design target and binning down to SKUs with at least 50 cores active. (I'm staying away from counting FLOPS, since we can't yet say much about frequencies, issue restrictions, latencies, etc.)

Arguably, with better single-thread performance on the scalar part of the cores, it could be easier to get a higher percentage of peak out of many algorithms, with less programming effort.
For generations after that, Intel have the rather big hammer of their process advantage to win the DP peak-performance crown while keeping power under control.

While the Larrabee history may give pause for thought, it's worth keeping in mind that Larrabee was an attempt at a GPU in which software was used rather more extensively than is traditional. Arguably, what failed was the software and its mapping to the hardware (or the appropriate hardware support). But with MIC, Intel are building a compute machine for compute workloads. I have far more faith that they can rock on that. And, based on the public information, even in the KNF generation the SIMD ISA is far more friendly to a compiler than some of the alternatives.