A blog about multicore, cloud computing, accelerators, Virtual Worlds, and likely other topics, loosely driven by the effective end of Moore’s Law. It’s also a place to try out topics for my next book.
Friday, December 4, 2009
Intel’s Single-Chip Clus… (sorry) Cloud
Intel's recent announcement of a 48-core "single-chip cloud" (SCC) is now rattling around several news sources, with varying degrees of boneheaded-ness and/or willful suspension of disbelief in the hype. Gotta set this record straight, and also raise a few questions I didn't find answered in the sources now available (the presentation, the software paper, the developers' video).
Let me emphasize from the start that I do not think this is a bad idea. In fact, it's an idea good enough that I've led or been associated with rather similar architectures twice, although not on a single chip (RP3, POWER 4 (no, not the POWER4 chip; this one was four original POWER processors in one box, apparently lost on the internet) (but I've still got a button…)). Neither was, in hindsight, a good idea at the time.
So, some facts:
SCC is not a product. It's an experimental implementation of which about a hundred will be made, given to various labs for software research. It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now. Think concept car. That software research will surely be necessary, since:
SCC is neither a multiprocessor nor a multicore system in the usual sense. They call it a "single chip cloud" because the term "cluster" is déclassé. Those 48 cores have caches (two levels), but cache coherence is not implemented in hardware. So, it's best thought of as a 48-node cluster of uniprocessors on a single chip. Except that those 48 cluster nodes all access the same memory. And if one processor changes what's in memory, the others… don't find out. Until the cache at random happens to kick the changed line out. Sometime. Who knows when. (Unless something else is done; but what? See below.)
Doing this certainly does save hardware complexity, but one might note that quite adequately scalable cache coherence does exist. It's sold today by Rackable (the part that was SGI); Fujitsu made big cache-coherent multi-x86 systems for quite a while; and there are ex-Sequent folks out there who remember it well. There's even an IEEE standard for it (SCI, Scalable Coherent Interface). So let's not pretend the idea is impossible and assume your audience will be ignorant. Mostly they will be, but that just makes misleading them more reprehensible.
To leave cache coherence out of an experimental chip like this is quite reasonable; I've no objection there. I do object to things like the presentation's calling this "New Data-Sharing Options." That's some serious lipstick being applied.
It also leads to several questions that are so far unanswered:
How do you keep the processors out of each other's pants? Ungodly uncontrolled race-like uglies must happen unless… what? Someone says "software," but what hardware does that software exercise? Do they, perhaps, keep the 48 separate from one another by virtualization techniques? (Among other things, virtualization hardware has to keep virtual machine A out of virtual machine B's memory.) That would actually be kind of cool, in my opinion, but I doubt it; hypervisor implementations require cache coherence, among other issues. Do you just rely on instructions that push individual cache lines out to memory? Ugh. Is there a way to decree whole swaths of memory to be non-cacheable? Sounds kind of inefficient, but whatever. There must be something, since they have demoed some real applications, so this problem must have been solved somehow. How?
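To make the flush-things-yourself option concrete, here's a minimal sketch of what software-managed sharing can look like, written against ordinary x86 with SSE2's clflush purely as an illustration. It is emphatically not SCC's actual mechanism (the public material doesn't spell that out), and the setup that maps the same physical memory into both cores is omitted.

/* Sketch only: sharing one datum through memory with no hardware
 * cache coherence, on ordinary x86 (SSE2), as an illustration of the
 * idea. Assumes writer and reader run on different cores that map the
 * same physical page, and that a cache line is 64 bytes. */
#include <stdint.h>
#include <immintrin.h>            /* _mm_clflush, _mm_mfence */

#define LINE_SIZE 64

typedef struct {
    volatile uint64_t value;      /* the datum being shared          */
    volatile uint64_t sequence;   /* bumped after each update        */
} __attribute__((aligned(LINE_SIZE))) shared_line_t;

/* Writer: update the datum, then push the dirty line out to DRAM. */
void publish(shared_line_t *s, uint64_t v)
{
    s->value = v;
    s->sequence++;
    _mm_mfence();                 /* order the stores                */
    _mm_clflush((const void *)s); /* write the line back to memory   */
    _mm_mfence();
}

/* Reader: discard any stale cached copy, then load from memory. */
uint64_t observe(shared_line_t *s, uint64_t *seq_out)
{
    _mm_clflush((const void *)s); /* drop our possibly stale copy    */
    _mm_mfence();
    *seq_out = s->sequence;
    return s->value;
}

On a normal coherent x86 none of that dance would be needed; that's exactly the hardware SCC leaves out, so every piece of shared data pays some such tax in software instead.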
What's going on with the operating system? Is there a separate kernel for each core? My guess: Yes. That's part of being a clus… sorry, a cloud. One news article said it ran "Rock Creek Linux." Never heard of it? Hint: The chip was called Rock Creek prior to PR.
One iteration of non-coherent hardware I dealt with used cluster single system image to make it look like one machine for management and some other purposes. I'll bet that becomes one of the software experiments. (If you don't know what SSI is, I've got five posts for you to read, starting here.)
Appropriately, there's mention of message-passing as the means of communicating among the cores. That's potentially fast message passing, since you're using memory-to-memory transfers in the same machine. (Until you saturate the memory interface – only four ports shared by all 48.) (Or until you start counting usual software layers. Not everybody loves MPI.) Is there any hardware included to support that, like DMA engines? Or protocol offload engines?
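For what it's worth, here's a rough sketch of how such a memory-to-memory message channel might look: a single-producer, single-consumer ring of cache-line-sized slots. The flush_line()/invalidate_line() primitives are stand-ins I've assumed for whatever the hardware actually provides to push a line to memory or discard a stale copy; SCC's real message-passing support (on-die buffers, any DMA or offload assist) isn't described in the material I've seen.

/* Sketch only: message passing over shared but non-coherent memory.
 * flush_line()/invalidate_line() are assumed stand-ins for whatever
 * the hardware provides to write a cache line back to memory or to
 * discard a possibly stale cached copy. One producer, one consumer. */
#include <stdint.h>
#include <string.h>

#define SLOTS     64
#define MSG_BYTES 64                        /* one cache line per message */

extern void flush_line(const void *p);      /* assumed primitive */
extern void invalidate_line(const void *p); /* assumed primitive */

typedef struct { uint8_t payload[MSG_BYTES]; } slot_t;

typedef struct {
    volatile uint32_t head __attribute__((aligned(64))); /* producer-owned */
    volatile uint32_t tail __attribute__((aligned(64))); /* consumer-owned */
    slot_t ring[SLOTS]     __attribute__((aligned(64)));
} channel_t;

/* Producer core: copy the message into a slot, push the slot's line to
 * memory, and only then publish the new head index. len <= MSG_BYTES.  */
int send_msg(channel_t *ch, const void *msg, size_t len)
{
    invalidate_line((const void *)&ch->tail);   /* get a fresh tail      */
    uint32_t head = ch->head, tail = ch->tail;
    if (head - tail == SLOTS)
        return -1;                              /* ring is full          */
    memcpy(ch->ring[head % SLOTS].payload, msg, len);
    flush_line(&ch->ring[head % SLOTS]);        /* payload out to DRAM   */
    ch->head = head + 1;
    flush_line((const void *)&ch->head);        /* then the index        */
    return 0;
}

/* Consumer core: check for a new head, pull the slot in, advance tail. */
int recv_msg(channel_t *ch, void *msg, size_t len)
{
    invalidate_line((const void *)&ch->head);   /* get a fresh head      */
    uint32_t head = ch->head, tail = ch->tail;
    if (head == tail)
        return -1;                              /* nothing waiting       */
    invalidate_line(&ch->ring[tail % SLOTS]);   /* discard stale payload */
    memcpy(msg, ch->ring[tail % SLOTS].payload, len);
    ch->tail = tail + 1;
    flush_line((const void *)&ch->tail);        /* tell the producer     */
    return 0;
}

Whether SCC gives you anything beyond plain loads, stores, and flushes to make that faster is exactly the question.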
Finally, why does every Intel announcement of gee-whiz hardware always imply it will solve the same set of problems? I'm really tired of those flying cars. No, I don't expect to ever see an answer to that one.
I'll end by mentioning something in Intel's SCC (née Rock Creek) that I think is really good and useful: multiple separate power regions. Voltage and frequency can be varied separately in different areas of the chip, so if you aren't using a bunch of cores, they can go slower and/or draw less power. That's something that will be "jacks or better" in future multicore designs, and spending the effort to figure out how to build and use it is very worthwhile.
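For a sense of what software does with such knobs today, here's a tiny sketch that uses stock Linux's cpufreq sysfs files to slow down cores you aren't using. This is only an analogue: SCC's voltage and frequency regions have their own interface (not described in the public material) and cover areas of the chip rather than individual cores, and the core numbers below are made up.

/* Sketch: per-core frequency policy on stock Linux via cpufreq sysfs,
 * just to show the sort of knob software gets. Needs root. */
#include <stdio.h>

/* Set the scaling governor (e.g., "powersave" or "performance") for one
 * core. Returns 0 on success, -1 on failure. */
int set_governor(int cpu, const char *governor)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%s\n", governor) > 0;
    fclose(f);
    return ok ? 0 : -1;
}

/* Example: let cores 0..3 run flat out, drop idle cores 4..7 to powersave. */
int main(void)
{
    for (int cpu = 0; cpu < 4; cpu++) set_governor(cpu, "performance");
    for (int cpu = 4; cpu < 8; cpu++) set_governor(cpu, "powersave");
    return 0;
}

The interesting part in SCC's case is that the granularity is a region of the chip rather than a single core, which the software will have to plan around.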
Heck, the whole thing is worthwhile, as an experiment. On its own. Without inflated hype about solving all the world's problems.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(This has been an unusually topical post, brought to you courtesy of the author's head-banging-wall level of annoyance at boneheaded news stories. Soon we will resume our more usual programming. Not instantly. First I have to close on and move into a house. In holiday and snow season.)
Friday, June 19, 2009
Why Virtualize? A Primer for HPC Guys
Some folks who primarily know High-Performance Computing (HPC) think virtualization is a rather silly thing to do. Why add another layer of gorp above a perfectly good CPU? Typical question, which I happened to get in email:
I'm perplexed by the stuff going on the app layer -
first came the chip + programs
then came - chip+OS+ applications
then came - Chip+Hypervisor+OS+applications
So for a single unit of compute, the capability keeps decreasing while extra layers are added over again and again.. How does this help?
I mean Virtualization came for consolidation and this reducing the prices of the H/W footprint...now it's being associated with something else?
In answering that question, I have two comments:
First Comment:
Though this really doesn't matter to the root question, you're missing a large bunch of layers. After your second stage should come
chip+OS+middleware+applications
Where middleware expands to many things: Messaging, databases, transaction managers, Java Virtual Machines, .NET framework, etc. How you order the many layers within middleware, well, that can be argued forever; but in any given instance they obviously do have an order (often with bypasses).
So there are many more layers than you were considering.
How does this help in general? The usual way infrastructure software helps: It lets many people avoid writing everything they need from scratch. But that's not what you're really asking, which was why you would need virtualization in the first place. See below.
Second Comment:
What hypervisors (really, virtual machines; hypervisors are one implementation of that notion) do is more than consolidation. Consolidation is, to be sure, the killer app of virtualization; it's what put virtualization on the map.
But hypervisors, in particular, do something else: They turn a whole system configuration into a bag of bits, a software abstraction decoupled from the hardware on which it runs. A whole system, ready to run, becomes a file. You can store it, copy it, send it somewhere else, publish it, and so on.
For example, you can do any of the following (a small code sketch of the first two follows the list):
- Stop it for a while (like hibernation: a snapshot of the running state, not of the disk contents).
- Restart on the same machine, for example after hardware maintenance.
- Restart on a different machine (e.g., VMware's VMotion; others have it under different names).
- Copy it – deploy additional instances. This is a core technology of cloud computing that enables "elasticity." (That, and apps structured so this can work.)
- By adding an additional layer, run it on a system with a different architecture from the original.
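As promised above, here's a small sketch of the first two items using libvirt's C API, one common way of driving KVM/QEMU and other hypervisors (VMware's tooling differs, but the operations are the same in spirit). The domain name "webserver" and the file path are invented for the example; compile with -lvirt.

/* Sketch: suspend a running VM to a file, then bring it back later.
 * Uses libvirt against a local QEMU/KVM hypervisor; error handling
 * is minimal, and the domain name and path are invented. */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "no hypervisor connection\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, "webserver");
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    /* "Stop it for a while": the whole running system becomes a file. */
    if (virDomainSave(dom, "/var/tmp/webserver.img") == 0)
        printf("saved\n");
    virDomainFree(dom);

    /* Later, perhaps after hardware maintenance, perhaps on another
     * machine that can read the same file: pick up where it left off. */
    if (virDomainRestore(conn, "/var/tmp/webserver.img") == 0)
        printf("restored\n");

    virConnectClose(conn);
    return 0;
}

Live migration (the VMotion-style item) is another call in the same API; the point is simply that all of these treat a running system as data.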
Most of these things have their primary value in commercial computing. The classic HPC app is a batch job: Start it, run it, it's done. Commercial computing's focus nowadays tends to be: Start it, run it, run it, run it, keep it running even though nodes go down, keep it running through power outages, earthquakes, terrorist strikes, … Think web sites, or, before them, transaction systems to which bank ATMs connect. Not that commercial batch doesn't still exist, nor is it less important; payrolls are still met, although there's less physical check printing now. But continuously operating application systems have been a focus of commercial computing for quite a while now.
Of course, some HPC apps are more continuous, too. I'm thinking of analyzing continuous streams of data, e.g., environmental or astronomical data that is continuously collected. Things like SETI@home or Folding@home could use it, too, running in a separate virtual machine beside your interactive use, continually, unhampered by your kids following dubious links and picking up viruses and trojans. But those are in the minority, so far. Grossly enormous-scale HPC with 10s or 100s of thousands of nodes will soon have to think in these terms, as the mean time between failures (MTBF) of the machine as a whole drops below the run time of jobs (100,000 nodes that each fail on average once in five years give you a whole-machine failure roughly every half hour), but there it's an annoyance, a problem, not an intentional positive aspect of the application as it is for commercial.
Grid application deployment could do it à la clouds, but except for hibernate-style checkpoint/restart, I don't see that any time soon. Grids effectively have a kind of virtualization already, at a higher level (much as Microsoft does its cloud virtualization in Azure, above and assuming .NET).
The traditional performance cost of virtualization is anathema to HPC, too. But that's trending to near zero. With appropriate hardware support (Intel VT, AMD-V, similar things from others) that's gone away for processor & memory. It's still there for IO, but can be fixed; IBM zSeries native IO has had essentially no overhead for years. The rest of the world will have to wait for PCIe to finish finalizing its virtualization story and IO vendors to implement it in their devices; that will come about in a patchy manner, I'd predict, with high-end communication devices (like InfiniBand adapters) leading the way.
So that's what virtualization gets you: isolation (for consolidation) and abstraction from hardware into a manipulable software object (for lots else). Also, security, which I didn't get into here.
Not much for most HPC today, as I read it, but a lot of value for commercial.
Tuesday, February 3, 2009
Is Cloud Computing Just Everything aaS?
This realization came in a discussion on the Google Cloud Interoperability Forum, in a thread discussing the recently published paper "Toward a Unified Ontology of Cloud Computing" by Lamia Youseff (UCSB), Maria Butrico, and Dilma Da Silva (IBM Research) (here's the paper), discussed in John Willis' blog post. To this was added another, more detailed ontology / taxonomy of Cloud Computing by Christopher Hoff in his blog, which has attracted enough comments to now be at version 1.3.
Here are the key figures from Youseff's presentation and from Hoff, in that order (I used the figure in Willis' blog):
When I look at these diagrams, I think there's something strange going on here. Nothing leaps out at me taxonomically / ontologically / structurally, in either of the two organizations, that causes either one of them to specifically describe a cloud.
They look like a generic ontology / taxonomy / structure / whatever attempting to cover all the conceivable contents of any IT shop, cloud or not.
Does "cloud" just mean you do such a generic description, and then at every useful point of entry, just add "aaS"? ("aaS" = "as a Service," as in: Software as a Service, Platform as a Service, Infrastructure as a Service, etc. ad nauseam.)
Maybe... that actually is what "cloud" means: *aaS.
I don't recall anybody trying to restrict what their cloud notions cover, in general -- everyone wants to be able to do it all. This means that the natural thing to do is try to define "it all," and then aaS it.
If the cloud community is unwilling or unable to accept any restrictions on what Cloud Computing may be or may do, and I can't imagine anything that could conceivably enforce any such restrictions, I think that *aaS definition may be inevitable.
Anything-you-can-think-of-aaS as a definition isn't very much help to anybody wanting to find out what all the Cloud Computing hype may actually mean, of course, whether it is or isn't the same as a Grid, etc. I'm working up to a future post on that, but I have to find a particular philosophico-logical term first. (That search has nothing to do with "ontology," in case you were wondering. It's a name for a type of definition that comes out of Thomas Aquinas, I think.)
(P.S.: Yeah, still doing software / cloud postings. Gotta get some hardware in here.)
Friday, October 31, 2008
Microsoft Azure Just Says NO to Multicore Apps in the Cloud
At their recent PDC’08, Microsoft unveiled their Azure Services Platform, Microsoft’s full-throttle venture into Cloud Computing. Apparently you shouldn’t bother with multithreading, since Azure doesn’t support multicore applications. It scales only “out,” using virtualization, as I said server code would generally do in IT Departments Should NOT Fear Multicore. I’ll give more details about that below; first, an aside about why this is important.
“Cloud Computing” is the most hyped buzzword in IT these days, all the way up to a multi-article special report in The Economist (recommended). So naturally its definition has been raped by the many people who apparently reason “Cloud computing is good. My stuff is good. Therefore my stuff is cloud computing.”
My definition, which I’m sure conflicts with many agendas: Cloud computing is hiring someone out on the web to host your computing, where “host your computing” can range across a wide spectrum: from providing raw iron (hardware provisioning), through providing building blocks of varying complexity, through providing standard commercial infrastructure, to providing the whole application. (See my cloud panel presentation at HPDC 2008.)
Clouds are good because they’re easy, cheap, fast, and can easily scale up and down. They’re easy because you don’t have to purchase anything to get started; just upload your stuff and go. They’re fast because you don’t have to go through your own procurement cycle, get the hardware, hire a sysadmin, upgrade your HVAC and power, etc. They’re cheap because you don’t have to shell out up front for hardware and licenses before getting going; you pay for what you use, when you use it. Finally, they scale because they’re on some monster compute center somewhere (think Google, Amazon, Microsoft – all cloud providers with acres of systems, and IBM’s getting in there too) that can populate servers and remove them very quickly – it’s their job, they’re good at it (or should be) – so if your app takes off and suddenly has huge requirements, you’re golden; and if your app tanks, all those servers can be given back. (This implicitly assumes “scale out,” not multicore, but that’s what everybody means by scale, anyway.)
It is possible, if you’re into such things, to have an interminable discussion with Grid Computing people about whether a cloud is a grid, grid includes cloud, a cloud is a grid with a simpler user interface, and so on. Foo. Similar discussions can revolve around terms like utility computing, SaaS (Software as a Service), PaaS (Platform as …), IaaS (Infrastructure …) and so on. Double foo – but with a nod to the antiquity of “utility computing.” Late 60s. Project MAC. Triassic computing.
Microsoft Azure Services slots directly into the spectrum of my definition at the “provide standard commercial infrastructure” point: Write your code using Microsoft .NET, Windows Live, and similar services; upload it to a Microsoft data center; and off you go. Its presentation is replete with assurances that people used to Microsoft’s development environment (.NET and related) can write the same kind of things for the Microsoft cloud. Code doesn’t port without change, since it will have to use different services – Azure’s storage services in particular look new, although SQL Services are there – but it’s the same kind of development process and code structure many people know and are comfortable with.
That sweeps up a tremendous number of potential cloud developers, and so in my estimation bodes very well for Microsoft doing a great hosting business over time. Microsoft definitely got out in front of the curve on this one. This assumes, of course, that the implementation works well enough. It’s all slideware right now, but a beta-ish Community Technology Preview platform is supposed to be available this fall (2008).
For more details, see Microsoft’s web site discussion and a rather well-written white paper.
So this is important, and big, and is likely to be widely used. Let’s get back to the multicore scaling issues.
That issue leaps out of the white paper with an illustration on page 13 (Figure 6) and a paragraph following. That wasn’t the intent of what was written, which was actually intended to show why you don’t have to manage or build your own Windows systems. But it suffices. Here’s the figure:
[Figure explanation: IIS is Microsoft’s web server (Internet Information Services) that receives web HTTP requests. The Web Role Instance is user code that initially processes that, and passes it off to the Worker Role Instance through queues via the Agents. This is all apparently standard .NET stuff (“apparently” because I can’t claim to be a .NET expert). So the two sets of VM boxes roughly correspond to the web tier (1st tier), with IIS instead of Apache, and application tier (2nd tier) in non-Microsoft lingo.]
Here’s the paragraph:
While this might change over time, Windows Azure’s initial release maintains a one-to-one relationship between a VM [virtual machine] and a physical processor core. Because of this, the performance of each application can be guaranteed—each Web role instance and Worker role instance has its own dedicated processor core. To increase an application’s performance, its owner can increase the number of running instances specified in the application’s configuration file. The Windows Azure fabric will then spin up new VMs, assign them to cores, and start running more instances of this application. The fabric also detects when a Web role or Worker role instance has failed, then starts a new one.
The scaling point is this: There’s a one-to-one relationship between a physical processor core and each of these VMs, therefore each role instance you write runs on one core. Period. “To increase an application’s performance, its owner can increase the number of running instances” each of which is a separate single-core virtual computer. This is classic scale out. It simply does not use multiple cores on any given piece of code.
There are weasel words up front about things possibly changing, but given the statement about how you increase performance, it’s clear that this refers to initially not sharing a single core among multiple VMs. That would be significantly cheaper, since most apps don’t use anywhere near 100% of even a single core’s performance; it’s more like 12%. Azure doesn’t share cores, at least initially, because they want to ensure performance isolation.
That’s very reasonable; performance isolation is a big reason people use separate servers (there are 5 or so other reasons). In a cloud megacenter, you don’t want your “instances” to be affected by another company’s stuff, possibly your competitor, suddenly pegging the meter. Sharing a core means relying on scheduler code to ensure that isolation, and, well, my experience of Windows systems doing that is somewhat spotty. The biggest benefit I’ve gotten out of dual core on my laptop is that when some application goes nuts and sucks up the CPU, I can still mouse around and kill it because the second core is not being used.
Why do this, when multicore is a known fact of life? I have a couple of speculations:
First, application developers in general shouldn't have anything to do with parallelism, since it's difficult, error-prone, and increases cost; developers who can do it don't come cheap. That's a lesson from multiple decades of commercial code development. Application developers haven't had to deal with parallelism since the early 1970s, when SMPs arrived and they wrote single-thread applications that ran under transaction monitors, instantiated much as Azure is planning (but not on VMs).
Second, it’s not just the applications that would have to be robustly multithreaded; there’s also the entire .NET and Azure services framework. That’s got to be multiple millions of lines of code. Making it all really rock for multicore – not just work right, but work fast – would be insanely expensive, get in the way of adding function that can be more easily charged for, and is likely unnecessary given the application developer point above.
Whatever the ultimate reasons, what this all means is that one of the largest providers of what will surely be one of the most used future programming platforms has just said NO! to multicore.
Bootnote: I’ve seen people on discussion lists pick up on the fact that “Azure” is the color of a cloudless sky. Cloudless. Hmmm. I’m more intrigued by its being a shade of blue: Awesome Azure replacing Big Blue?