Friday, October 28, 2011

MIC and the Knights

Intel’s Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s Ferry (KF) software development kit. Well, I thought it was on wide display, but I’m an IDF newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn’t consider the display “wide” in IDF terms unless it’s in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here are my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larrabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.

You do need to be on Linux; I heard no word about Microsoft Windows. However, Windows 8 has a new task manager display, redesigned to better visualize many more cores – up to 640. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC cores; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.
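
To make the “just recompile” pitch concrete, here is the sort of ordinary OpenMP code it targets. This is my sketch, not Intel sample code, and beyond the “-mmic” flag named above, the compile commands in the comment are my assumption about the toolchain:

```c
#include <stddef.h>

/* Plain OpenMP shared-memory code, the kind "just recompile" is about.
 * Hypothetically, building for the host vs. for MIC would look like:
 *   icc -openmp saxpy.c          (Xeon)
 *   icc -openmp -mmic saxpy.c    (MIC -- only the flag changes)
 * The source itself needs no MIC-specific changes at all. */
void saxpy(size_t n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Whether that recompiled loop then *performs* is, of course, the subject of the rest of this post.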

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC users have, like him, a very large body of legacy code they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, one he said he was surprised took that long: one day to make it run (just recompile), then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed to crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, inverse-square force laws: irrelevant when sufficiently far apart, dominant when close.) Changing the way those tightly-coupled calculations are done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes Gauss-Seidel iteration into Gauss-Jacobi iteration. I went into this in more detail way back in my book In Search of Clusters (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)
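
A tiny sketch may make the Gauss-Seidel / Gauss-Jacobi distinction concrete (a 1-D smoothing sweep; the setup and function names are mine, purely for illustration). Gauss-Seidel consumes values updated earlier in the same sweep – a loop-carried dependence that defeats naive SIMD – while Jacobi reads only the previous iterate, so every point is independent and vectorizes nicely, but the two compute different numbers and converge differently:

```c
#include <stddef.h>

/* Gauss-Seidel: u[i] depends on u[i-1] written moments ago in this same
 * sweep. That serial dependence is what blocks naive SIMD execution. */
void gauss_seidel_sweep(size_t n, double *u) {
    for (size_t i = 1; i + 1 < n; ++i)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

/* Jacobi: every point reads only the previous iterate u, so all points
 * are independent and SIMD-friendly -- but it is a different iteration
 * with different convergence behavior, not a free transformation. */
void jacobi_sweep(size_t n, const double *u, double *u_new) {
    for (size_t i = 1; i + 1 < n; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
```

Run both sweeps on the same data and the interior values already differ after one pass – which is exactly why "naively going SIMD" can silently change the mathematics.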

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get” tons more performance / lower power / whatever.
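
The parenthesization point can be made concrete with a floating-point sum: vectorizing a reduction reassociates the additions (e.g. into 8 partial sums for MIC's 8-way DP SIMD), and since floating-point addition is not associative, the answer can change. A sketch of mine, not anyone's production code:

```c
#include <stddef.h>

/* One fixed left-to-right association of the additions. */
double sum_serial(size_t n, const double *x) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* The association a vectorizer effectively imposes: 8 partial sums, one
 * per 64-bit SIMD lane, combined at the end. Same inputs, potentially a
 * different result, because FP addition is not associative. */
double sum_lanes8(size_t n, const double *x) {
    double part[8] = {0.0};
    for (size_t i = 0; i < n; ++i)
        part[i % 8] += x[i];
    double s = 0.0;
    for (int l = 0; l < 8; ++l)
        s += part[l];
    return s;
}
```

With nine inputs {1e17, 1.0, 0, …, 0, -1e17}, the serial order loses the 1.0 to rounding and returns 0, while the 8-lane association happens to preserve it and returns 1 – exactly the kind of result change that makes blind vectorization of old codes risky.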

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)
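
The aliasing problem those compilers fight is easy to show (my sketch; `restrict` is the standard C99 spelling of the annotation, a cousin of the vendor-specific directives the commercial tools accept):

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src might overlap
 * (the C analogue of the FORTRAN COMMON aliasing problem), so iteration
 * i could feed iteration i+1 and the loop often stays scalar. The
 * qualifiers are the programmer's promise that the arrays are disjoint,
 * which is what frees the vectorizer to go wide. */
void scale(size_t n, double * restrict dst,
           const double * restrict src, double s) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = s * src[i];
}
```

The catch, as noted above, is that each such promise must be verified by a human before it is applied across mountains of old code – a wrong one produces silently wrong answers, not a compile error.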

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs across multiple cores. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on the other cores have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here is Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skills, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work worthy of academic accolades to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!

12 comments:

niall said...

"to be the Intel highly-parallel technical computing
accelerator"

True, although trends suggest there could be a MIC that fits in a regular socket, or that MIC cores could be integrated with Xeon cores in a single chip. I think an advantage that we shouldn't discount for Intel is their ability to mix and match well-designed cores on a chip.

"No binary compatibility, though; you need to recompile"

True, although looking out years down the roadmap, it's not inconceivable that there'll be some convergence. AVX is defined to allow wider revisions and surely MIC experience will feed into ISA extensions. To me, however, we really need a standard low-level virtual machine layer. Expressing the parallelism and locality in my program is one thing, and I'm on the side of getting the programmer to help. However, prematurely binding my app to a specific version of the x86 ISA is a completely different thing. There are definite advantages to allowing me to express my data parallelism in some intermediate way, and then specializing at runtime.

As you rightly point out, extracting good performance requires pervasive use of the SIMD extensions in addition to having all cores cranking away. I think there's lots more work required to get us to the point where everyday application programmers have high-level enough tools to let us do that productively.

I can't really comment on national lab big ol' honkin' fortran codes, but in my application domain (finance and other commercial big data apps), we write lots of new code, and completely turn over our source base surprisingly rapidly.

Most of the applications haven't been written yet. IMHO, Intel really needs to also deliver on toolchains, or support those creating them, that support our types of big data and compute intensive processing. We have abundant task and data parallelism, and can make excellent use of SIMD processing in our database kernels, but we're a different market than traditional HPC.

BTW, not sure I buy the argument that GPUs will always have more peak performance due to wider SIMD width. E.g. Fermi is 16 cores with 32 SIMD lanes each, for 512 lanes total SP, half that for DP. At 50 cores, KNC would be 400 lanes, or 64 cores giving 512 total. Assuming the same ratio between SP and DP performance, the first MIC part could be in the ballpark, assuming 64 as the design target and binning down to SKUs with at least 50 cores active. (I'm staying away from counting flops etc as we can't yet say much about frequencies, issue restrictions, latencies, etc).

Arguably, with better single thread performance on the scalar part of the cores, it could be easier to get a higher % of peak out of many algorithms, with less programming effort. For generations after that, Intel have the rather big hammer of their process advantage to win DP peak perf crown while keeping power under control.

While the Larrabee history may give pause for thought, it's worth keeping in mind that Larrabee was an attempt at a GPU where software was used rather more extensively than traditional. Arguably what failed was the software and the mapping of it to the hardware (or appropriate hardware support). But with MIC, Intel are building a compute machine for compute workloads. I have far more faith that they can rock on that. And, based on the public information, even in the KNF generation, the SIMD ISA is far more friendly for a compiler than some of the alternatives.

Greg Pfister said...

Niall, thanks for the highly-informed and interesting comment.

Some responses:

In the GPU-SIMD vs vector peak FLOPS wars, I simply feel that GPUs' avoiding all the individual cores' instruction cache / fetch / decode / schedule / logic, plus cache coherence, just leaves that much more space for FP units and lanes. So they win (maybe not usably). This is far from a new argument.

However, such a comparison assumes equivalent silicon technology, a situation that won't exist. What will be will be, and we just have to wait and see.

I agree that the Larrabee new instructions (LRBNI) look enticingly fit to compilers. I worry that they're too fit: Every possible combination seems to be in silicon. This seems nice and general to software, but hardware isn't software, and I see a RISC/CISC issue here. Analysis of actual codes will probably show only a few ops intensively used, the rest wasting silicon (and power), better supported by software layers with the silicon used for better pipelining / operation overlap that benefits everything.

Anonymous said...

I agree that vectorization annotations will always have a margin for better results than automatic compiler analysis - but the question really is whether that margin is getting small enough that the extra performance you can wring out by carefully annotating your source code is worth the time spent.

Any effort Intel puts into improving the vectorization algorithms in icc will be a payoff for MIC, at least in the "legacy codes" market.

Intel's salespeople should be saying "If you go GPGPU then you'll have to rework your source - you have no choice. If you go with MIC, then you can try just recompiling, and that might get you 90% of the way there with a lot less time spent - and if it doesn't, then you can annotate your source and it'll perform as well on MIC as it would have on the GPGPUs, since our better process technology makes up for the GPGPUs architectural advantages".

niall said...

Greg -- apologies for preaching to the choir here!

"In the GPU-SIMD vs vector peak FLOPS wars, I simply feel that GPUs' avoiding all the individual cores' instruction cache / fetch / decode / schedule / logic, plus cache coherence, just leaves that much more space for FP units and lanes. So they win (maybe not usably). This is far from a new argument. "

The GPUs also pay a price for being graphics monsters. For example, in the Fermi chip, there are the hardware fixed function units for graphics workloads that a compute dedicated MIC could avoid. There's a raster engine (edge setup, rasterization, z-cull) per 4 cores, a polymorph engine per core (i.e. vertex attribute operations and tessellation) along with the texture caches and texture units for fetch+filter. There are, if I remember correctly, 8 ROP units on the chip. The logic and memory for thread scheduling and state is also not insignificant, and also the implicit simd divergence handling logic.

In Dally-style projections for future more compute-oriented GPUs, we'll still need those units I believe, and he also projects adding latency oriented cores that will suffer the costs you rightly highlight. Those features will have to pay their way though -- the volume graphics oriented parts need not have all those I think.

It is interesting to compare die sizes between Nvidia and AMD for their high-end discrete GPU products. We can observe some of the extra costs in the Nvidia product line to better support compute workloads (such as ECC, coherent caches, high DP perf) that are not as useful for graphics. Given that their volume products are graphics oriented, with some lesser use of compute for physics etc., it is interesting to ponder how long the graphics parts can be expected to suffer these extra costs, before the compute revenue needs to be large enough to pay their way.

Given an emphasis on maximizing performance for a given power budget, and looking at perf/W, and competing in fairly tight mass market $ segments, you really don't want to build too big a chip in the consumer markets... lest you bugger your gross margins, and over-consume a large amount of supply-constrained wafers due to large dies with low yields.

Anonymous -- I agree being able to get code up and running ASAP is huge. Getting initially decent perf via automatic parallelization/vectorization would be great. While I'm all for further work in the area, I don't foresee any dramatic breakthroughs in our ability to extract parallelism from code in existing languages. There is slack between what production compilers currently do and what the research compilers can do but beyond that I don't know of any particular big hammer we have. I'm on the side of higher level languages which are designed to allow easier expression by the programmer -- and putting my money where my mouth is by working on that one.

Anonymous said...

Greg -- are you familiar with ispc ( http://ispc.github.com/ ) ? It's an experimental compiler that presents a scalar implicit vectorisation model of x86 SSE/AVX vector units. It's similar to what Intel's OpenCL compiler does, but without the cruft of that programming model.

Seems like it would be an ideal way to express data parallelism for these wide-vector, multicore Knights' devices.

khb said...

Wonderful summary, thanks for posting it.

Porting is often not nearly as hard as tuning (as you observed). Even harder is debugging ... what (if any) comments did Intel make about support beyond the pain and suffering currently offered by GPU environments (Michael Wolfe's lovely intro on the PGI website to their unified compilation environment makes that sadly clear as an achilles heel).

As to interconnect, Thunderbolt would seem like a better approach than PCI but perhaps there's niceties that escape me.

Greg Pfister said...

niall,

No problem with choirs; they're everywhere, all singing in different keys. :-)

Yes, GPUs must have fixed-function units. But few compared with cores. Intel now says >50, which alone is a big multiplier. Which takes more aggregate area? Depends on implementation, but I'd say fixed function is smaller.

A recent talk by Dally (http://goo.gl/aITIh) did show some conventional cores, but few compared with the >50 (64? 128?) of KC.

There may well be an advantage to not focussing on consumer markets. But that also eliminates volumes, raising price even more. Price is a big deal with Nvidia's sales; see my post http://goo.gl/i2rZ (which was, of course, premature, particularly given Sandy Bridge's mediocre graphics performance (another post)).

Greg Pfister said...

Anonymous, ispc may well be a good thing for compilers. But I consider the far harder part to be finding the (vector/SIMD) parallelism in the first place.

Greg Pfister said...

khb, Yes, Intel has run into the debug quagmire. Take a look at my interview with Jim Reinders a few posts back (http://goo.gl/SdPeD). He has horror stories of weeks spent at customers' locations trying to find bugs. He also has some favorite new tools that make him "cautiously optimistic."

melonakos said...

I think the presumption that Intel can get "-mmic" to work in a way that provides enough benefit to justify the purchase of the new part is the biggest hole in MIC's plans. I have little faith in a compiler being able to exploit data-parallelism to any real degree, automatically. The community loves to talk about this, but never delivers anything close. Maybe this is different, but we're not counting any chickens until they hatch on any of this.

My biased disclaimer of course is that I come from building GPU libraries, which are painstakingly built by hand to produce top-notch performance. My experience is that no compiler can come close to automatically producing the benefits we put into our library.

All this said, we are very happy for Intel to come out with MIC and think it will find a nice customer base when it arrives. We look forward to supporting it with our library too :)

-John at AccelerEyes

niall said...

Debug is indeed a quagmire, both for the systems software stack on both host and device sides, and the user application code. Personally I think a promising direction is technology such as Corensic's Jinx tool. It uses binary translation and virtualization techniques to slide a microvisor underneath your OS, and intercepts some or all of the code running. Then it periodically interrupts execution to explore the state space of thread interactions in an attempt to trigger bugs, and then force the os or app down that particular execution path. Pretty neat stuff (I hope I got the basic explanation correct, I'm not privy to their implementation).

This is the kind of thing I'd love to see as a standard facility on boxes going forward. For example on the MIC cards and hosts alongside the more traditional debugging/debugger support.

An overall theme in my interactions with hardware systems designers, as someone who architects large commercial systems and the software for them, is that besides raw silicon, increasingly I'm looking to them to provide the necessary enabling layers such as the very lowest levels of dynamic optimizers and better debug/test capabilities. I want to aggressively adopt new hardware but they need to meet me halfway. Yes, there's always a toolchain and naturally focused on the traditional needs, but there is definitely still a lot can be done to enable us to build new high-performance apps.

On the optimization side, I'm excited by the possibilities of the Intel ArBB virtual machine:

http://software.intel.com/sites/products/documentation/arbb/vm/index.htm

and hope to see it across the entire product line at some point. That'd let my applications express the parallelism and have Intel's software do the final code generation & optimization for the cores and exact SIMD family available.

root@boy said...

it is developers!
hardware evolution matters less!
the UTOPIA of supporting highly parallel apps on legacy code will die!

" Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language."

HELL
again
HELL

in a new paradigm world we need new way of thinking and FOR loops are thing of the past.

what Intel's customers will eventually have to live with is the fact that:
they buy highly overpraised hardware with very little improvement on their legacy code!!
they will need to change the legacy code, not the hardware, duh...
