
Friday, October 28, 2011

MIC and the Knights

Intel's Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight's Ferry (KF) software development kit. Well, I thought it was on wide display, but I'm an IDF newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn't consider the display "wide" in IDF terms unless it's in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here are my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you're unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won't be repeating that information here. Here's a relatively recent one: Intel Larrabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has "50 or more." In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.

You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new task manager display designed to better visualize many more cores – up to 640 of them. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel's Threading Building Blocks, Cilk, and probably others. No CUDA; that's Nvidia's product. It's a real Linux, by the way, that runs on a few of the MIC cores; I was told "you can SSH to it." The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.
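To make the "just recompile" pitch concrete, here's a minimal sketch of the sort of plain OpenMP C code Intel is talking about. The build lines in the comment are my guess at the workflow, not anything Intel handed me:

```c
/* saxpy.c: ordinary OpenMP C code, nothing MIC-specific in the source.
 * Hypothetical build lines (my assumption of the workflow, not Intel's
 * documentation):
 *   host Xeon:  icc -openmp saxpy.c
 *   MIC target: icc -openmp -mmic saxpy.c
 */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* Shared-memory parallelism across however many cores are present. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```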

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC users have, like him, a very large body of legacy codes they need to carry into the future. "Just recompile" promises to bring back the good old days of clock speed increases, when you just compiled for a new architecture and went faster. At least it does if you've already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)
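To give a feel for what "enable wider SIMD / vector execution" actually means, here is a hedged sketch of my own (toy loops, illustrative names): the first loop is the kind the 512-bit unit can chew through, 16 single-precision lanes at a time; the second has a loop-carried dependence that stalls it cold.

```c
/* A toy contrast, in C. Names and loops are illustrative only. */
#define N 4096
float a[N], b[N], c[N];

/* Vector-friendly: unit stride, every iteration independent, so the
 * 512-bit unit can fill 16 single-precision lanes per instruction. */
void axpy(float alpha) {
    for (int i = 0; i < N; i++)
        c[i] = alpha * a[i] + b[i];
}

/* Vector-hostile: each iteration needs the previous result (a loop-carried
 * dependence), so the lanes cannot be filled without restructuring. */
void running_sum(void) {
    for (int i = 1; i < N; i++)
        c[i] = c[i] + c[i - 1];
}
```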

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world's first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because, to perform well, the Cell SPUs needed to crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons' two-way (128-bit) vector/SIMD units. This makes it sound like using MIC's 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, 1/R² force laws: irrelevant when sufficiently far away, dominant when close.) Changing the way those tightly-coupled calculations are done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes the method from Gauss-Seidel iteration to Gauss-Jacobi iteration. I went into this in more detail way back in my book In Search of Clusters (Prentice-Hall), Chapter 9, "Basic Programming Models and Issues.") To be sure, not all applications have this problem; those that don't often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded "real" SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note the italicized words.)
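Here's a minimal sketch of that Gauss-Seidel / Gauss-Jacobi trap, using a toy 1-D relaxation sweep of my own invention (not from any of the codes discussed):

```c
/* Toy 1-D relaxation sweeps; the update rule is illustrative only. */

/* Gauss-Seidel: each update uses the value just written earlier in this
 * same sweep (u[i-1] is already new). The loop-carried dependence blocks
 * naive SIMD or parallel execution. */
void sweep_gauss_seidel(double *u, int n) {
    for (int i = 1; i < n - 1; i++)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

/* Jacobi: every update reads only old values, so the loop is trivially
 * parallel and SIMD-friendly. But it is a different iteration, with
 * different convergence behavior, which is exactly the trap of going
 * parallel naively. */
void sweep_jacobi(const double *u_old, double *u_new, int n) {
    for (int i = 1; i < n - 1; i++)
        u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
}
```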

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing the order of operations) when changing that code. Needing to do this to exploit MIC's SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying "Sorry for the pain, but it's actually necessary for Intel, too, and if you do it our way you get tons more performance / lower power / whatever."

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)
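As a small, hedged illustration of the aliasing point (the function and names are mine, not from any vendor's examples):

```c
/* With bare pointers the compiler cannot prove dst and src never overlap,
 * so it may refuse to vectorize this loop. The C99 restrict qualifier is
 * the programmer's annotation promising no overlap; vendor pragmas such as
 * Intel's "#pragma ivdep" play a similar role for trickier cases. */
void smooth(float *restrict dst, const float *restrict src, int n) {
    for (int i = 1; i < n - 1; i++)
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
}
```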

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.
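To make that two-hop path concrete, here's a sketch. The mic_copy_* helpers are hypothetical placeholders I invented, stubbed with memcpy so the sketch compiles; the real host-side API wasn't disclosed to me, but the down-and-back-up structure is the point:

```c
/* Sketch of MIC-to-MIC data movement as described: down PCIe to host
 * memory, then back up PCIe to the second card. The mic_copy_* helpers
 * are hypothetical placeholders (the real host-side API was not
 * disclosed); they are stubbed with memcpy here so the sketch compiles. */
#include <string.h>

static int mic_copy_to_host(int mic_id, const void *mic_src,
                            void *host_dst, size_t len) {
    (void)mic_id;                 /* real version: PCIe DMA from the card */
    memcpy(host_dst, mic_src, len);
    return 0;
}

static int mic_copy_from_host(int mic_id, const void *host_src,
                              void *mic_dst, size_t len) {
    (void)mic_id;                 /* real version: PCIe DMA to the card */
    memcpy(mic_dst, host_src, len);
    return 0;
}

/* Two hops, host memory as the bounce buffer, host code in control. */
int transfer_mic_to_mic(int src_mic, const void *src_buf,
                        int dst_mic, void *dst_buf,
                        void *bounce, size_t len) {
    if (mic_copy_to_host(src_mic, src_buf, bounce, len) != 0)
        return -1;
    return mic_copy_from_host(dst_mic, bounce, dst_buf, len);
}
```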

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics "standard" enough (= Xeon) that multi-core Linux runs across multiple cores. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don't know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that's obvious, since those microkernels on the other cores have to be managed; whether other things changed I don't know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here is Intel's planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There's also a potential win in being able to evolve existing programmer skills, rather than replacing them. Things do change with the much wider core- and SIMD-level parallelism in MIC, but it's a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled "How I actually managed to do something interesting on an Nvidia GPGPU." (I'm referring to the "PHAST" and "Profiling" papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do award-worthy work just to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!

Thursday, November 26, 2009

Oh, for the Good Old Days to Come

I recently had a glorious flashback to 2004.

Remember how, back then, when you got a new computer you would be just slightly grinning for a few weeks because all your programs were suddenly so crisp and responsive? You hadn't realized your old machine had a rubbery-feeling delay responding to your clicks and key-presses until zip! You booted the new machine for the first time, and wow. It just felt good.

I hadn't realized how much I'd missed that. My last couple of upgrades have been OK. I've gotten a brighter screen, better graphics, lighter weight, and so on. They were worth it, intellectually at least. But the new system zip, the new system crispness of response – it just wasn't there.

I have to say I hadn't consciously noticed that lack because, basically, I mostly didn't need it. How much faster do you want a word processor to be, anyway? So I muddled along like everyone else, all our lives just a tad more drab than they used to be.

Of course, the culprit denying us this small pleasure has been the flattening of single-thread performance wrought by the half-death of Moore's Law. It used to be that, after a couple of years' delay, you would naturally get a system that ran 150% or 200% faster, so everything just went faster. All your programs were rejuvenated, and you noticed, instantly. A few weeks or so later you were of course used to it. But for a while, life was just a little bit better.

That hasn't happened for nigh unto five years now. Sure, we have more cores. I personally didn't get much use out of them. None of my regular programs perked up. But as I said, I really didn't notice, consciously.

So what happened to make me realize how deprived I – and everybody else – have been? The Second Life client.

I'd always been less than totally satisfied with how well SL ran on my system. It was usable. But it was rubbery. Click to walk or turn and it took just a little … time before responding. It wasn't enough to make things truly unpleasant (except when lots of folks were together, but that's another issue). But it was enough to be noticeably less than great. I just told myself, what the heck, it's not Quake but who cares, that's not what SL is about.

Then for reasons I'll explain in another post, I was motivated to reanimate my SL avatar. It hadn't seen any use for at least six months, so I was not at all surprised to find a new SL client required when I connected. I downloaded, installed, and cranked it up.

Ho. Ly. Crap.

The rubber was gone.

There were immediate, direct responses to everything I told it to do. I proceeded to spend much more time in SL than I originally intended, wandering around and visiting old haunts just because it was so pleasant. It was a major difference, on the order of the difference I used to encounter when using a brand-new system. It was like those good old days of CPU clock-cranking madness. The grin was back.

So was this "just" a new, better, software release? Well, of course it was that. But I wouldn't have bothered writing this post if I hadn't noticed two other things:

First, my CPU utilization meter was often pegged. Pegged, as in 100% utilization, where flooring just one of my two cores reads only 50%. When I looked a little deeper, I saw that the one, single SL process was regularly over 50%. I've not looked at any of the SL documentation on this, but from that data I can pretty confidently say that this release of the SL client can make effective use of both cores simultaneously. It's the only program I've got with that property.

Second, my thighs started burning. Not literally. But that heat tells me when my discrete GPU gets cranking. So, this client was also exercising the GPU, to good effect.

Apparently, this SL client actually does exploit the theoretical performance improvements from graphics units and multiple cores that had been lying around unused in my system. I was, in effect, pole-vaulted about two system generations down the road – that's how long it's been since there was a discernible difference. The SL client is my first post-Moore client program.

All of this resonates for me with the recent SC09 (Supercomputing Conference 2009) keynote of Intel's Justin Rattner. Unfortunately it wasn't recorded by conference rules (boo!), but reports are that he told the crowd they were in a stagnant, let us not say decaying, business unless they got their butts behind pushing the 3D web. (UPDATE: Intel has posted video of Rattner's talk.)

Say What? No. For me, particularly following the SL experience above, this is not a "Say What?" moment. It makes perfect sense. Without a killer application, the chip volumes won't be there to keep down the costs of the higher-end chips used in non-boutique supercomputers. Asking that audience for a killer app, though, is like asking an industrial assembly-line designer for next year's toy fashion trends. Killer apps have to be client-side and used by the masses, or the volumes aren't there.

Hence, the 3D Web. This would take the kind of processing in the SL client, which can take advantage of multicore and great graphics processing, and put it in something that everybody uses every day: the browser. Get a new system, crank up the browser, and bang! you feel the difference immediately.

Only problem: Why does anybody need the web to be 3D? This is the same basic problem with virtual worlds: OK, here's a virtual world. You can run around and bump into people. What, exactly, do you do there? Chat? Bogus. That's more easily done, with more easily achieved breadth of interaction, on regular (2D) social networking sites. (Hence Google's virtual world failure.)

There are things that virtual worlds and a "3D web" can, potentially, excel at; but that's a topic for a later post.

In the meantime, I'll note that in a great crawl-first development, there are real plans to use graphics accelerators to speed up the regular old 2D web, by speeding up page rendering. Both Microsoft and Mozilla (IE & Firefox) are saying they'll bring accelerator-based speedups to browsers (see CNET and Bas Schouten's Mozilla blog) using Direct2D and DirectWrite to exploit specialized graphics hardware.

One could ask what good it is to render a Twitter page twice as fast. (That really was one of the quoted results.) What's the point? Asking that, however, would only prove that One doesn't Get It. You boot your new system, crank up the browser and bam! Everything you do there, and you do more and more there, has more zip. The web itself – the plain, old 2D web – feels more directly connected to your inputs, to your wishes; it feels more alive. Result?

The grin will be back. That's the point.

Friday, October 31, 2008

Microsoft Azure Just Says NO to Multicore Apps in the Cloud


At their recent PDC’08, Microsoft unveiled their Azure Services Platform, Microsoft’s full-throttle venture into Cloud Computing. Apparently you shouldn’t bother with multithreading, since Azure doesn’t support multicore applications. It scales only “out,” using virtualization, as I said server code would generally do in IT Departments Should NOT Fear Multicore. I’ll give more details about that below; first, an aside about why this is important.

“Cloud Computing” is the most hyped buzzword in IT these days, all the way up to a multi-article special report in The Economist (recommended). So naturally its definition has been raped by the many people who apparently reason “Cloud computing is good. My stuff is good. Therefore my stuff is cloud computing.”

My definition, which I’m sure conflicts with many agendas: Cloud computing is hiring someone out on the web to host your computing, where “host your computing” can range across a wide spectrum: from providing raw iron (hardware provisioning), through providing building blocks of varying complexity, through providing standard commercial infrastructure, to providing the whole application. (See my cloud panel presentation at HPDC 2008.)

Clouds are good because they’re easy, cheap, fast, and can easily scale up and down. They’re easy because you don’t have to purchase anything to get started; just upload your stuff and go. They’re fast because you don’t have to go through your own procurement cycle, get the hardware, hire a sysadmin, upgrade your HVAC and power, etc. They’re cheap because you don’t have to shell out up front for hardware and licenses before getting going; you pay for what you use, when you use it. Finally, they scale because they’re on some monster compute center somewhere (think Google, Amazon, Microsoft – all cloud providers with acres of systems, and IBM’s getting in there too) that can populate servers and remove them very quickly – it’s their job, they’re good at it (or should be) – so if your app takes off and suddenly has huge requirements, you’re golden; and if your app tanks, all those servers can be given back. (This implicitly assumes “scale out,” not multicore, but that’s what everybody means by scale, anyway.)

It is possible, if you’re into such things, to have an interminable discussion with Grid Computing people about whether a cloud is a grid, grid includes cloud, a cloud is a grid with a simpler user interface, and so on. Foo. Similar discussions can revolve around terms like utility computing, SaaS (Software as a Service), PaaS (Platform as …), IaaS (Infrastructure …) and so on. Double foo – but with a nod to the antiquity of “utility computing.” Late 60s. Project MAC. Triassic computing.

Microsoft Azure Services slots directly into the spectrum of my definition at the “provide standard commercial infrastructure” point: Write your code using Microsoft .NET, Windows Live, and similar services; upload it to a Microsoft data center; and off you go. Its presentation is replete with assurances that people used to Microsoft’s development environment (.NET and related) can write the same kind of things for the Microsoft cloud. Code doesn’t port without change, since it will have to use different services – Azure’s storage services in particular look new, although SQL Services are there – but it’s the same kind of development process and code structure many people know and are comfortable with.

That sweeps up a tremendous number of potential cloud developers, and so in my estimation bodes very well for Microsoft doing a great hosting business over time. Microsoft definitely got out in front of the curve on this one. This assumes, of course, that the implementation works well enough. It’s all slideware right now, but a beta-ish Community Technology Preview platform is supposed to be available this fall (2008).

For more details, see Microsoft’s web site discussion and a rather well-written white paper.

So this is important, and big, and is likely to be widely used. Let’s get back to the multicore scaling issues.

That issue leaps out of the white paper with an illustration on page 13 (Figure 6) and a paragraph following. That wasn't the intent of the passage, which was actually meant to show why you don't have to manage or build your own Windows systems. But it suffices. Here's the figure:


[Figure explanation: IIS is Microsoft’s web server (Internet Information Services) that receives web HTTP requests. The Web Role Instance is user code that initially processes that, and passes it off to the Worker Role Instance through queues via the Agents. This is all apparently standard .NET stuff (“apparently” because I can’t claim to be a .NET expert). So the two sets of VM boxes roughly correspond to the web tier (1st tier), with IIS instead of Apache, and application tier (2nd tier) in non-Microsoft lingo.]

Here’s the paragraph:

While this might change over time, Windows Azure’s initial release maintains a one-to-one relationship between a VM [virtual machine] and a physical processor core. Because of this, the performance of each application can be guaranteed—each Web role instance and Worker role instance has its own dedicated processor core. To increase an application’s performance, its owner can increase the number of running instances specified in the application’s configuration file. The Windows Azure fabric will then spin up new VMs, assign them to cores, and start running more instances of this application. The fabric also detects when a Web role or Worker role instance has failed, then starts a new one.

The scaling point is this: There's a one-to-one relationship between a physical processor core and each of these VMs, so each role instance you write runs on one core. Period. "To increase an application's performance, its owner can increase the number of running instances," each of which is a separate single-core virtual computer. This is classic scale out. It simply does not use multiple cores on any given piece of code.

There are weasel words up front about things possibly changing, but given the statement about how you increase performance, it’s clear that this refers to initially not sharing a single core among multiple VMs. That would be significantly cheaper, since most apps don’t use anywhere near 100% of even a single core’s performance; it’s more like 12%. Azure doesn’t share cores, at least initially, because they want to ensure performance isolation.

That’s very reasonable; performance isolation is a big reason people use separate servers (there are 5 or so other reasons). In a cloud megacenter, you don’t want your “instances” to be affected by another company’s stuff, possibly your competitor, suddenly pegging the meter. Sharing a core means relying on scheduler code to ensure that isolation, and, well, my experience of Windows systems doing that is somewhat spotty. The biggest benefit I’ve gotten out of dual core on my laptop is that when some application goes nuts and sucks up the CPU, I can still mouse around and kill it because the second core is not being used.

Why do this, when multicore is a known fact of life? I have a couple of speculations:

First, application developers in general shouldn’t have anything to do with parallelism, since it’s difficult, error-prone, and increases cost; developers who can do it don’t come cheap. That’s a lesson from multiple decades of commercial code development. Application developers haven’t dealt with parallelism since the early 1970s, with SMPs, where they wrote single-thread applications that ran under transaction monitors, instantiated just like Azure is planning (but not on VMs).

Second, it’s not just the applications that would have to be robustly multithreaded; there’s also the entire .NET and Azure services framework. That’s got to be multiple millions of lines of code. Making it all really rock for multicore – not just work right, but work fast – would be insanely expensive, get in the way of adding function that can be more easily charged for, and is likely unnecessary given the application developer point above.

Whatever the ultimate reasons, what this all means is that one of the largest providers of what will surely be one of the most used future programming platforms has just said NO! to multicore.

Bootnote: I’ve seen people on discussion lists pick up on the fact that “Azure” is the color of a cloudless sky. Cloudless. Hmmm. I’m more intrigued by its being a shade of blue: Awesome Azure replacing Big Blue?