
Friday, October 28, 2011

MIC and the Knights

Intel’s Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s Ferry (KF) software development kit. Well, I thought it was on wide display, but I’m an IDF newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn’t consider the display “wide” in IDF terms unless it’s in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here are my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larrabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.
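
To make that concrete, here is a minimal sketch of what “just recompile” means in practice. The -mmic flag is the one Intel quoted; the compiler name (icc) and the OpenMP flag spelling are my assumptions about the toolchain, not something I verified on a KF system.

    /* saxpy.c -- the same trivial OpenMP source, host or MIC.
     * Hypothetical build lines (compiler name and OpenMP flag are my
     * assumptions; the -mmic flag itself is the one Intel quoted):
     *   host:  icc -openmp saxpy.c -o saxpy
     *   MIC:   icc -mmic -openmp saxpy.c -o saxpy.mic
     */
    #include <stdio.h>

    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for      /* spread the loop across cores/threads */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(N, 3.0f, x, y);
        printf("y[0] = %f\n", y[0]);  /* expect 5.0 */
        return 0;
    }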

You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new Task Manager display, changed to better visualize many more cores – up to 640 of them. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC processors; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC users have, like him, a very large body of legacy code they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed to crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.
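
As a purely illustrative example of that annotation step – my sketch, not anyone’s real code – here is the shape of it for a legacy-style kernel. The pragma spreads rows across cores; whether the inner loop then actually runs 8-wide on the 512-bit unit is up to the compiler and the data layout, and checking that, loop by loop, across gigabytes of code is where the time goes.

    /* Legacy-style kernel with the "do this loop in parallel" annotation
     * added (illustrative only). The outer pragma is the easy part; the
     * inner dot product is what has to vectorize for the 512-bit unit. */
    void matvec(int n, const double *a, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)   /* candidate for SIMD / vector */
                sum += a[i * n + j] * x[j];
            y[i] = sum;
        }
    }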

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, 1/R² force laws; irrelevant when sufficiently far, dominant when close.) Changing the way those tightly-coupled calculations are done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes the method from Gauss-Seidel iteration to Gauss-Jacobi iteration. I went into this in more detail way back in my book In Search of Clusters (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)
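
Here is a toy version of that Gauss-Seidel / Jacobi trap – my sketch, not from the book – for a 1D relaxation sweep with unit mesh spacing:

    /* Gauss-Seidel: updates in place, so each point uses neighbors already
     * updated in this sweep. The loop-carried dependence means running the
     * iterations in parallel, or in SIMD lanes, silently changes the method. */
    void sweep_gauss_seidel(int n, double *u, const double *f)
    {
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i-1] + u[i+1] - f[i]);
    }

    /* Jacobi: reads only old values and writes a separate array. Trivially
     * parallel and vectorizable -- but it converges differently (typically
     * more slowly), which is exactly the change you have to think about. */
    void sweep_jacobi(int n, const double *u_old, double *u_new, const double *f)
    {
        for (int i = 1; i < n - 1; i++)
            u_new[i] = 0.5 * (u_old[i-1] + u_old[i+1] - f[i]);
    }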

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get” tons more performance / lower power / whatever.

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)
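
For the curious, here is the C version of the aliasing problem in miniature (FORTRAN COMMON causes the same grief); the restrict qualifier is exactly the kind of “tell the compiler what’s what” annotation I mean, and applying it safely across mountains of old code is the careful part.

    /* Without restrict, the compiler must allow for a and b overlapping,
     * which at best forces run-time overlap checks and at worst blocks
     * vectorization of the loop entirely. */
    void scale_maybe_aliased(int n, double *a, const double *b)
    {
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }

    /* With restrict (C99), the programmer promises no overlap, and the
     * vectorizer is free to go full width. The promise had better be true. */
    void scale_no_alias(int n, double * restrict a, const double * restrict b)
    {
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }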

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs across multiple cores. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on the other cores have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here is Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skills, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work worthy of academic accolades to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!

Wednesday, October 5, 2011

Will Knight’s Corner Be Different? Talking to Intel’s Joe Curley at IDF 2011


At the recent Intel Developer Forum (IDF), I was given the opportunity to interview Joe Curley, Director, Technical Computing Marketing of Intel’s Datacenter & Connected Systems Group in Hillsboro.

Intel-provided information about Joe:
Joe Curley serves Intel® Corporation as director of marketing for technical computing in the Data Center Group. The technical computing marketing team manages marketing for high-performance computing (HPC) and workstation product lines as well as future Intel® Many Integrated Core (Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities that led up to the announcement of the Intel® MIC Architecture in May of 2010. Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng Labs in a series of marketing and engineering leadership roles.
I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! to all who responded.)

This is the last in a series of three such transcripts. Hallelujah! Doing this has been a pain. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews. But some were in the interviews, including here.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

[We began by discovering we had similar deep backgrounds, both starting in graphics hardware. I designed & built a display processor (a prehistoric GPU), he built “the most efficient frame buffer controller you could possibly make”. Guess which one of us is in marketing?]

A: My experience in the [HPC] business really started relatively recently, a little under five years ago, [when] I started working on many-core processors. I won’t be able to go into history, but I can at least tell you what we’re doing and why.

Q: Why don’t we start there? At a high level, what are you doing, and why? High level for what you are doing, and as much detail on “why” as you can provide.

A: We have to narrow the question. So, at Intel, what we’re after first of all is in what we call our Technical Computing Marketing Group inside Data Center Group. That has really three major objectives. The first one is to specify the needs for high performance computing, how we can help our customers and developers build the best high performance computing systems.

Q: Let me stop you for a second right there. My impression for high performance computing is that they are people whose needs are that they want more. Just more.

A: Oh, yes, but more at what cost? What cost of power, what cost of programmability, what cost of size. How are we going to build the IO system to handle it affordably or use the fabric of the day.

Q: Yes, they want more, but they want it at two bytes/FLOPS of memory bandwidth and communication bandwidth.

A: There’s an old thing called the Dilbert Spec, which is “I want it all, and by the way, can it be free?” But that’s not really what people tell us they want. People in HPC have actually been remarkably pragmatic about what it takes to develop innovation. So they really want us to do some things, and do them really well.

By the way, to finish what we do, we also have the workstation segment, and the MIC Many Integrated Core product line. The marketing for that is also in our group.

You asked “what are you doing and why.” It would probably take forever to go across all domains, but we could go into any one of them a little bit better.

Q: Can you give me a general “why” for HPC, and a specific “why” for MIC?

A: Well, HPC’s a really good business. I get stunned, somebody must be asking really weird questions, asking “why are you doing HPC?”

Q: What I’ve heard is that HPC is traditionally 12% of the market.

A: Supercomputing is a relatively small percentage of the market. HPC and technical computing, combined, is, not exactly, but roughly, a third of our data center business. [emphasis added by me] Our data center business is a pretty robust business. And high performance computing is a business that requires very high end, high performance processors. It’s actually a very desirable business to be in, if you can do it, and if your systems work. It’s a business we spend a lot of time working on because it’s a good business.

Now, if you look at MIC, back in 2005 we made a tacit conclusion that the performance of a system will come out of parallelism. Parallelism could be expressed at Intel in a lot of different ways. You can look at it as threads, we have this concept called hyperthreading. You can look at it as cores. And we have the SSE instructions sitting around which are SIMD, that’s a form of parallelism; people argue about the definition, but yes, it is. [I agree.] So you take a look at the basic architectural constructs, ease of programming, you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or vectors, these common attributes have been adopted and well-used by a lot of programmers. There are programs across the continuum of coarse- to fine-grained parallel, embarrassingly parallel, pick your taxonomy. But there are applications that developers would be willing to trade the performance of any particular task or thread for the sum of what you can do inside the power envelope at a given period of time. Lots of people have different ways of defining that, you hear throughput, whatever, but this is the class of applications, and over time they’re growing.

Q: Growing relatively, or, say, compared to commercial processing, or…? Is the segment getting larger?

A: The number of people who have tasks they want to run on that kind of hardware is clearly growing. One of the reasons we’re doing MIC, maybe I should just cut it to the easiest answer, is developers and customers asked us to.

Q: Really?

A: And they came to us with a really simple question. We were struggling in the marketing group with how to position MIC, and one of our developers got worked up, like “Look, you give me the parallel performance of an accelerator, but you give me the ease of CPU programming!” Now, ease is a funny word; you can get into religious arguments about ease. But I think what he means is “I don’t have to re-think my algorithm, I don’t have to reorder my data set, there are some things that I don’t have to do.” So they wanted to have the idea of give me this architecture and get it to scale to be wildly parallel. And that is exactly what we’ve done with the MIC architecture. If you think about what the Knight’s Ferry STP [? Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.] is, a 32 core, coherent, on a chip, teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the programming model is, surprisingly, kind of like a bunch of processor cores on a network, which a lot of people understand and can get a lot of utility out of in a very well-understood way. So, in a sense, we’re giving people what they want, and that, generally, is good business. And if you don’t give them what they want, they’ll have to go find someone else. So we’re simply doing what our marketplace asked us for.

Q: Well, let me play a little bit of devil’s advocate here, because MIC is very clearly derivative of Larrabee, and…

A: Knight’s Ferry is.

Q: … Knight’s Ferry is. Not MIC?

A: No. I think you have to take a look at what Larrabee was. Larrabee, by the way, was a really cool project, but what Larrabee was was a tile rendering graphics device, which meant its design point, was first of all the programming model was derived from what you do for graphics. It’s going to be API-based, the answer it’s going to generate is going to be a pixel, the pixel is going to have a defined level of sub-pixel accuracy. It’s a very predictable output. The internal optimizations you would make for a graphics implementation of a general many-core architecture is one very specific implementation. Let’s talk about the needs of the high performance computing market. I need bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have a frame buffer.

Q: It needed bandwidth to local memory [of which it didn’t have enough; see my post The Problem with Larrabee]

A: Yes, but less than you think, because the cache was the critical element in that architecture [again, see that post] if you look through the academic papers on that…

Q: OK, OK.

A: So, they have a common heritage, they’re both derived out of the thoughts that came out of the Intel Labs terascale research. They’re both many-core. But Knight’s Ferry came out with a few, they’re only a few, modifications. But the programming model is completely different. You don’t program a graphics device like you do a computer, and MIC is a computer.

Q: The higher-level programming model is different.

A: Correct.

Q: But it is a big, wide, cache-coherent SMP.

A: Well, yes, that’s what Knight’s Ferry is, but we haven’t talked about what Knight’s Corner yet, and unfortunately I won’t today, and we haven’t talked about where the product line will go from there, either. But there are many things that will remain the same, because there are things you can take and embellish and work and things that will be really different.

Q: But can you at least give me a hint? Is there a chance that Knight’s Corner will be a substantially different hardware model than Knight’s Ferry?

A: I’m going to really love to talk to you about Knight’s Corner. [his emphasis]

Q: But not today.

A: I’m going to duck it today.

Q: Oh, man…

A: The product is going to be in our 22 nm process, and 22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to have the buzz generated, we’ll start generating buzz. Right now, the big thing is that we’re making the investments in the Knight’s Ferry software development platform, to see how codes scale across the many-core, to get the environment and tools up, to let developers poke at it and find stuff,  good stuff, bad stuff, in between stuff, that allow us to adjust the product line for ongoing generations. We’ve done that really well since we announced the architecture about 15 months ago.

Q: I was wondering what else I was going to talk about after having talked to both John Hengeveld and Jim Reinders. This is great. Nobody talked about where it really came from, and even hinted that there were changes to the MIC chip [architecture].

A: Oh, no, no, many things will be the same, many things will be different. If you’re targeting trying to do a pixel-renderer, go do a pixel-renderer. If you’re trying to do a general-purpose computing device, do a general-purpose computing device. You’ll see some things and say “well, it’s all the same” and other things “wow, it’s completely different.” We’ll get around to talking about the part when we’re a little closer.

The most important thing that James and/or John should have been talking about is that the key thing is the ability to not force the developer to completely and utterly re-think their problem to use your hardware. There are two models: In an accelerator model, which is something I spent a lot of my life working with, accelerators have the advantage of optimization. You can say “I want to do one thing really well.” So you can then describe a programming model for the hardware. You can say “build your data this way, write your program this way” and if you do it will work. The problem is that not everything fits into the box. Oh, you have sparse data. Oh, you have recursive code.

Q: And there’s madness in that direction, because if you start supporting that you wind yourself around to a general-purpose machine. […usually, a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel of Reincarnation” in this blog, haven’t I? Oh, there it is: The Cloud Got GPUs, back in November 2010.]

A: Then it’s not an accelerator any more. The thing that you get in MIC is the performance of one of those accelerators. We’ve shown this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision, without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve shown that. So you get performance, but you’re getting it out of the general purpose programming model.

Q: That’s running LINPACK, or… ?

A: That was an even more basic thing; I’m just talking about SGEMM [single-precision dense matrix multiply].

Q: I just wanted to ground the number.

A: For LU factorization, I think we showed hybrid LU, really cool, one of the great things about this hybrid…

Q: They’re demo-ing that downstairs.

A: … OK. When the matrix size is small, I keep it on the host; when the matrix size is large, I move it. But it’s all the same code, the same code either place. I’m just deciding where I want to run the code intelligently, based on the size of the matrix. You can get the exact number, but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]

Q: Yaahh, well, there are a lot of people who can deliver something like that.

A: We’ll keep working on it and making it better and better. So, what are we proving today. All we’ve proven today is that the architecture is capable of performance. We’ve got a lot of work to do before we have a product, but the architecture has shown itself to be capable. The programming model, we have people who will speak for us, like the quotes that came from LRZ [data center for the universities of Munich and the Bavarian Academy of Sciences], from Leibniz [same place], a code they couldn’t port to other accelerators was running in two hours and optimized in two days. Now, actual mileage may vary, see dealer for…

Q: So, there are things that just won’t run on a CUDA model? Example?

A: Well, perhaps, again, the thing you try to get to is whether there is evidence growing that what you say is real. So we’re having people who are starting to be able to speak to that, and that gives people the confidence that we’re going to be able to get there. The other thing it ends up doing, it’s kind of an odd benefit, as people have started building their code, trying to optimize it for MIC, they’re finding the parallelism, they’re doing what we wanted them to do all along, they’re taking the same code on their current cluster and they’re getting benefits right now.

Q: That’s got a long history. People would have some grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler couldn’t make crap out of it. So they cleaned it up, made it obvious what was going on, and the vectorizer did its thing well. Then they put it back on the original machine and it ran twice as fast.

A: So, one of the nice things that’s happened is that as people are looking at ways to scale power, performance, they’re finally getting around to dealing with parallelism. The offer that we’re trying to provide is portable, high level, standards-based, and you can use it now.

You said “why.” That’s why. Our customers and developers say “if you can do that, that’s really valuable.” Now. We’re four men and a pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but the thought and the promise and the early data is really good.

Q: OK. Well, great.

A: Was that a good use of the time?

Q: That’s a very good use of the time. Let me poke on one thing a little bit. Conceptually, it ought to be simpler to write code to that kind of a shared memory model and get parallelism out of the code that way. Now, on the other hand, there was a talk – sorry, I forget his name, he was one of the software guys working on Larrabee [it was Tom Forsyth; see my post The Problem with Larrabee again] said someone on the project had written four renderers, and three of them were for Larrabee. He was having one hell of a time trying to get something that performed well. His big issue, at least what it came down to from what I remember of the talk, was memory bandwidth.

A: Well, first of all, we’ve said Larrabee’s not a product. As I’ve said, one of the things that is critical, you’ve got the compute-bound, you’ve got the memory-bound, and most people are somewhere in between, but you have to be able to handle the two edge cases. We understand that, and we intend to deliver a really good value across the spectrum. Now, Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation off the silicon we used, no one cares about that,  but on Knight’s Ferry, the memory bus is 256 bits wide. Relatively narrow, and for a graphics processor, very narrow. There are definitely design decisions in how that chip was made that would limit the bandwidth. And the memory it was designed with is slower than the memory today, you have all of the normal things. But if you went downstairs to the show floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic ray-tracer.

[What follows is a bit confused. He didn’t mean the Austrian Crown stochastic ray-tracing demo, but rather the real-time ray-tracing demo. As I said in my immediately previous post (Random Things of Interest at IDF 2011), the real-time demo is on a set of Knight’s Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t seen the real-time demo, just the stochastic one; the latter is not on Knight’s Ferry.]

Q: I’ve seen that one. The Austrian Crown?

A: Yes.

Q: I thought that was on a cluster.

A: In the little box behind there, he’s able to scale from one to eight Knight’s Ferries.

Q: He never told me there was a Knight’s Ferry in there.

A: Yes, it’s all Knight’s Ferry.

Q: Well, I’m going to go down there and beat on him a little bit.

A: I’m about to point you to a YouTube site, it got compressed and thrown up on YouTube. You can’t get the impact of the complexity of the rays, but you can at least get the superficial idea of the responsiveness of the system from Knight’s Ferry.

[He didn’t point me to YouTube, or I lost it, but here’s one I found. Ignore the fact that the introduction is in Swedish or something [it's Dutch, actually]; Daniel – and it’s Daniel, not David – speaks English, and gives a good demo. Yes, everybody in the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the Random Things of Interest post to directly include it.]

Well, if you believe that what we’re going to do in our mainstream processors is roughly double the FLOPS every generation for the next many generations, that’s our intent. What if we can do that on the MIC line as well? By the time you get to where ray-tracing would be practical, you could see multiple of those being integrated into a single device [added in transcription: Multiple MICs in a single device? Hierarchical MIC?] becomes practical computationally. That won’t be far from now. So, it’s a nice demo. David’s an expert in his field, I didn’t hear what he said, but if you want to see the device downstairs actually running a fairly strenuous graphics workload, take a look at that.

Q: OK. I did go down there and I did see that, I just didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.] On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian Crown. It is.]

[At that point, Dave Patterson walked in, which interrupted us. We said hello – I know Dave of old, a bit – thanks were exchanged with Joe, and I departed.]

[I can’t believe this is the end of the last one. I really don’t like transcribing.]

Tuesday, January 11, 2011

Intel-Nvidia Agreement Does Not Portend a CUDABridge or Sandy CUDA


Intel and Nvidia reached a legal agreement recently in which they cross-license patents, stop suing each other over chipset interfaces, and oh, yeah, Nvidia gets $1.5B from Intel in five easy payments of $300M each.

This has been covered in many places, like here, here, and here, but in particular Ars Technica originally led with a headline about a Sandy Bridge (Intel GPU integrated on-chip with CPUs; see my post if you like) using Nvidia GPUs as the graphics engine. Ars has since retracted that (see web page referenced above), replacing the original web page. (The URL still reads "bombshell-look-for-nvidia-gpu-on-intel-processor-die.")

Since that's been retracted, maybe I shouldn't bother bringing it up, but let me be more specific about why this is wrong, based on my reading the actual legal agreement (redacted, meaning a confidential part was deleted). Note: I'm not a lawyer, although I've had to wade through lots of legalese over my career; so this is based on an "informed" layman's reading.

Yes, they have cross-licensed each others' patents. So if Intel does something in its GPU that is covered by an Nvidia patent, no suits. Likewise, if Nvidia does something covered by Intel patents, no suits. This is the usual intention of cross-licensing deals: Each side has "freedom of action," meaning they don't have to worry about inadvertently (or not) stepping on someone else's intellectual property.

It does mean that Intel could, in theory, build a whole dang Nvidia GPU and sell it. Such things have happened, historically, but usually without cross-licensing, and are uncommon (IBM mainframe clones, X86 clones), but as a practical matter, wholesale inclusion of one company's processor design into another company's products is a hard job. There is a lot to a large digital widget not covered by the patents – any number of undocumented, implementation-specific corner cases that can mess up full software compatibility, without which there's no point. Finding them all is a massive undertaking.

So switching to a CUDA GPU architecture would be a massive undertaking, and furthermore it's a job Intel apparently doesn't want to do. Intel has its own graphics designs, with years of the design / test / fabricate pipeline already in place; and between the ill-begotten Larrabee (now MIC) and its own specific GPUs and media processors Intel has demonstrated that they really want to do graphics in house.

Remember, what this whole suit was originally all about was Nvidia's chipset business – building stuff that connects processors to memory and IO. Intel's interfaces to the chipset were patent protected, and Nvidia was complaining that Intel didn't let Nvidia get at the newer ones, even though they were allegedly covered by a legal agreement. It's still about that issue.

This makes it surprising that, buried down in section 8.1, is this statement:

"Notwithstanding anything else in this Agreement, NVIDIA Licensed Chipsets shall not include any Intel Chipsets that are capable of electrically interfacing directly (with or without buffering or pin, pad or bump reassignment) with an Intel Processor that has an integrated (whether on-die or in-package) main memory controller, such as, without limitation, the Intel Processor families that are code named 'Nehalem', 'Westmere' and 'Sandy Bridge.'"

So all Nvidia gets is the old FSB (front side bus) interfaces. They can't directly connect into Intel's newer processors, since those interfaces are still patent protected, and those patents aren't covered. They have to use PCI, like any other IO device.

So what did Nvidia really get? They get bupkis, that's what. Nada. Zilch. Access to an obsolete bus interface. Well, they get bupkis plus $1.5B, which is a pretty fair sweetener. Seems to me that it's probably compensation for the chipset business Nvidia lost when there was still a chipset business to have, which there isn't now.

And both sides can stop paying lawyers. On this issue, anyway.

Postscript

Sorry, this blog hasn't been very active recently, and a legal dispute over obsolete busses isn't a particularly wonderful re-start. At least it's short. Nvidia's Project Denver – sticking a general-purpose ARM processor in with a GPU – might be an interesting topic, but I'm going to hold off on that until I can find out what the architecture really looks like. I'm getting a little tired of just writing about GPUs, though. I'm not going to stop that, but I am looking for other topics on which I can provide some value-add.

Monday, June 14, 2010

WNPoTs and the Conservatism of Hardware Development


There are some things about which I am undoubtedly considered a crusty old fogey, the abominable NO man, an ostrich with its head in the sand, and so on. Oh frabjous day! I now have a word for such things, courtesy of Charlie Stross, who wrote:

Just contemplate, for a moment, how you'd react to some guy from the IT sector walking into your place of work to evangelize a wonderful new piece of technology that will revolutionize your job, once everybody in the general population shells out £500 for a copy and you do a lot of hard work to teach them how to use it, And, on closer interrogation, you discover that he doesn't actually know what you do for a living; he's just certain that his WNPoT is going to revolutionize it. Now imagine that this happens (different IT marketing guy, different WNPoT, same pack drill) approximately once every two months for a five year period. You'd learn to tune him out, wouldn't you?
I've been through that pack drill more times than I can recall, and yes, I tune them out. The WNPoTs in my case were all about technology for computing itself, of course. Here are a few examples; they are sure to step on a number of toes:

  • Any new programming language existing only for parallel processing, or any reason other than making programming itself simpler and more productive (see my post 101 parallel languages)
  • Multi-node single system image (see my post Multi-Multicore Single System Image)
  • Memristors, a new circuit type. A key point here is that exactly one company (HP) is working on it. Good technologies instantly crystallize consortia around themselves. Also, HP isn't a silicon technology company in the first place.
  • Quantum computing. Primarily good for just one thing: Cracking codes.
  • Brain simulation and strong artificial intelligence (really "thinking," whatever that means). Current efforts were beautifully characterized by John Horgan, in a SciAm guest blog: 'Current brain simulations resemble the "planes" and "radios" that Melanesian cargo-cult tribes built out of palm fronds, coral and coconut shells after being occupied by Japanese and American troops during World War II.'
Of course, for the most part those aren't new. They get re-invented regularly, though, and drooled over by ahistorical evangelists who don't seem to understand that if something has already failed, you need to lay out what has changed sufficiently that it won't just fail again.

The particular issue of retread ideas aside, genuinely new and different things have to face up to what Charlie Stross describes above, in particular the part about not understanding what you do for a living. That point, for processor and system design, is a lot more important than one might expect, due to a seldom-publicized social fact: Processor and system design organizations are incredibly, insanely, conservative. They have good reason to be. Consider:

Those guys are building some of the most, if not the most, intricately complex structures ever created in the history of mankind. Furthermore, they can't be fixed in the field with an endless stream of patches. They have to just plain work – not exactly in the first run, although that is always sought, but in the second or, at most, third; beyond that money runs out.

The result they produce must also please, not just a well-defined demographic, but a multitude of masters from manufacturing to a wide range of industries and geographies. And of course it has to be cost- and performance-competitive when released, which entails a lot of head-scratching and deep breathing when the multi-year process begins.

Furthermore, each new design does it all over again. I'm talking about the "tock" phase for Intel; there's much less development work in the "tick" process shrink phase. Development organizations that aren't Intel don't get that breather. You don't "re-use" much silicon. (I don't think you ever re-use much code, either, with a few major exceptions; but that's a different issue.)

This is a very high stress operation. A huge investment can blow up if one of thousands of factors is messed up.

What they really do to accomplish all this is far from completely documented. I doubt it's even consciously fully understood. (What gets written down by someone paid from overhead to satisfy an ISO requirement is, of course, irrelevant.)

In this situation, is it any wonder the organizations are almost insanely conservative? Their members cannot even conceive of something except as a delta from both the current product and the current process used to create it, because that's what worked. And it worked within the budget. And they have their total intellectual capital invested in it. Anything not presented as a delta of both the current product and process is rejected out of hand. The process and product are intertwined in this; what was done (product) was, with no exceptions, what you were able to do in the context (process).

An implication is that they do not trust anyone who lacks the scars on their backs from having lived that long, high-stress process. You can't learn it from a book; if you haven't done it, you don't understand it. The introduction of anything new by anyone without the tribal scars is simply impossible. This is so true that I know of situations where taking a new approach to processor design required forming a new, separate organization. It began with a high-level corporate Act of God that created a new high-profile organization from scratch, dedicated to the new direction, staffed with a mix of outside talent and a few carefully-selected high-talent open-minded people pirated from the original organization. Then, very gradually, more talent from the old organization was siphoned off and blended into the new one until there was no old organization left other than a maintenance crew. The new organization had its own process, along with its own product.

This is why I regard most WNPoT announcements from a company's "research" arm as essentially meaningless. Whatever it is, it won't get into products without an "Act of God" like that described above. WNPoTs from academia or other outside research? Fuggedaboudit. Anything from outside is rejected unless it was originally nurtured by someone with deep, respected tribal scars, sufficiently so that that person thinks they completely own it. Otherwise it doesn't stand a chance.

Now I have a term to sum up all of this: WNPoT. Thanks, Charlie.

Oh, by the way, if you want a good reason why the Moore's Law half-death that flattened clock speeds produced multi- / many-core as a response, look no further. They could only do more of what they already knew how to do. It also ties into how the very different computing designs that are the other reaction to flat clocks came not from CPU vendors but outsiders – GPU vendors (and other accelerator vendors; see my post Why Accelerators Now?). They, of course, were also doing more of what they knew how to do, with a bit of Sutherland's Wheel of Reincarnation and DARPA funding thrown in for Nvidia. None of this is a criticism, just an observation.

Friday, May 28, 2010

No, Larrabee is Not Dead


We interrupt our series of posts on virtualization for a public service announcement: The numerous reports of Larrabee being dead are, at a minimum, greatly exaggerated. (Note, a significant addition after posting was made below, marked in red.)

Larrabee is the highly-publicized erstwhile mega discrete graphics chip from Intel, the subject of flame wars with Nvidia's CEO, whose initial product introduction was cancelled last December because its performance wasn't yet competitive.

Now, a recent Technology @ Intel blog post about Intel graphics (An Update On Our Graphics-related Programs) has resulted in a flurry of "Larrabee is Dead!" postings. There's Anandtech (Intel Kills Larrabee GPU), Device Magazine (Intel Cancels Larrabee Project), PCWorld (Intel Cancels Larrabee), ZDNet (Intel officially (again) kills off Larrabee), the Inquirer (Larrabee will not be), and … I'll stop there, since a quick Google will find you many more.

In the minority is eWeek (Intel Clarifies Graphics Plans, Hints at HPC Project), taking a balanced "this is what the blog actually said" approach, with SemiAccurate (Larrabee alive and well) on the other side, considering those "dead Larrabee" posts a case of mass flakiness. I agree.

All the doom-sayers should actually listen to what Paul Otellini, Intel's CEO, said at Intel's Investor Meeting of May 11 2010 – not to what someone else interpreted what he heard his cousin's dog say Otellini said, but to the words he actually said. The full webcasts are archived and publicly available. In particular, listen to the last segment, Q&A, starting at time 1:39, when someone named Hans asked about Larrabee with the comment that it hadn't appeared in the prior presentations.

Here's my partial transcript of the response, and I urge you not to believe me. It's a pain to transcribe, and I'm sure I got something wrong. Go listen to the webcast. Some words of mine, and a comment or two, are inserted in brackets [like this].

"Everything you saw in the roadmap today does not have Larrabee built into it. … [our mainstream product will be] based on evolving our mainstream integrated graphics products… [but there will be a] sea change in the architecture with Sandy Bridge… by going onto the chip and moving from two generations behind silicon to [current] silicon you get… best of class [best for integrated graphics, which isn't saying much, but even that will be a huge change for Intel if it happens].
"… In terms of Larrabee, we did not stop the project. If we made any mistake with Larrabee, we probably should not have talked about something that was high risk and long term. We have not stopped the project. We have shipped STVs out. We're looking at how and when to bring it to market. It still has very very high promise in areas of throughput computing and in terms of a general reprogrammable graphics engine using small IA cores. We still like the idea. But we've taken the risk associated with a new architecture out of our roadmap over the next few years so we have the flexibility to stay competitive while still working on it."
Nothing in the tech blog entry says anything more than is said above. It's not in the roadmap, so when the blog says "We will not bring a discrete graphics product to market, at least in the short-term" – the statement that fired up the nay-sayers' posts – the blog is simply re-stating the official party line, an action which is doubtless far from an accident.

Also, please note that Otellini did say, twice, "we did not stop the project" and that they are looking at how and when to bring it to market. All of this is completely consistent with the position taken last January that I reported in another post (The Problem with Larrabee), relating things I picked up in a talk by Tom Forsyth about "the future of Intel graphics despite what the press says." There has been no change.



**ADDED**

Contrast this with another case: InfiniBand. Intel realized, late in the game and after publicity, that its initial IB product would also be noncompetitive. There, the response was to just fold up the shop, completely. People were reassigned, and the organization ceased to be. (I was there when this all happened, as a significant, working, designing, writing committee-leading IBM rep on the IB standard committee.) The situation is totally different for Larrabee; the shop is open and working, and high-level statements saying so have been repeated. This indicates a very different attitude, a continuing commitment to the technology.

With that kind of consistent high-level corporate backing, to say nothing of the large number of very talented individuals working on Larrabee, were they to kill it there would be a bit more of a ruckus than a sentence or two in a tech blog post.

It must have been a very slow news day.

We now return you to our previously scheduled blog posts, which will resume after Memorial Day.

Wednesday, March 10, 2010

OnLive Goes Live June 17 2010


Games in the cloud in June. Let's see if it works.

OnLive runs games – they list Borderlands, Mass Effect 2, and others – on their server farms, and streams the video to your PC or TV. I've written about this before, in some detail, as the Twilight of the GPU; it's a possible "twilight" since obviously OnLive users need neither consoles nor high-end gaming PCs.

Ah, but will it work? Everybody, everywhere, is questioning that.

The most obvious issue is lag: the time between twitching a control and seeing the response. OnLive inserts a time-of-flight delay in there, which clearly can't help. Whether it will be enough to seriously affect gameplay is anybody's guess right now; the live demo at the original announcement seemed to play well, and they claim it'll be OK (but then, they would). The ugly fact is that console game developers struggle with lag now; that doesn't leave much slack for a long-distance connection.

There's another issue, one which most people are ignoring, that I brought up in my earlier post: bandwidth. The bandwidth chewed up in hours of gaming is large enough (see that post for the math) that it seems like it must force a response from ISPs.

We shall see. I'm pre-registered to be a beta tester when they do an external beta, or so their web site tells me. If they decide I'm worthy, I'll try it and post my reactions here. That will be interesting because of my location: I'm just north of Denver, rather far from all of their five centers: San Francisco Bay Area, Chicago, Washington, D.C., Atlanta, and Dallas. I'm still within the 1000-mile radius they say is needed (600 mi. by air to Dallas), but hardly in their pocket like the usual Silicon Valley coterie of likely beta testers.

See that earlier post for more details about OnLive, or Google "onlive" and check out the many stories on this announcement. It's pointless for me to try to link to them here, since it went everywhere. Even AP. But here's CNET, and Reuters, and PCWorld if you want to read more about the announcement itself.

 (This may be the shortest post ever to appear in this blog. I don't usually do "repeat the news" posts, but this one is both major and highly relevant to my general subject area.)

Wednesday, February 17, 2010

The Parallel Power Law


Do a job with 2X the parallelism and use 4X less power -- if the hardware is designed right. Yes, it's a square law. And that "right" hardware design has nothing to do with simplifying the processor design.

No, nobody's going to achieve that in practice, for reasons I'll get into below; it's a limit. But it's very hard to argue with a quadratic in the long run.

That's why I would have thought this would be in the forefront of everybody's mind these days. After all, it affects both software and hardware design of parallel systems. While it seems particularly to affect low-power applications, in reality everything's a low-power issue these days. Instead, I finally tracked it down, not anywhere on the Interwebs, but in physical university library stacks, down in the basement, buried on page 504 of a 1000-page CMOS circuit design text.

This is an issue that has been bugging me for about five months, ever since my post Japan Inc. vs. Intel – Not Really. Something odd was going on there. When they used more processors, reducing per-processor performance to keep aggregate performance constant, they appeared to noticeably reduce the aggregate power consumption of the whole chip. For example, see this graph of the energy needed to execute tomcatv in a fixed amount of time (meeting a deadline); it's Figure 12 from this paper:



In particular, note the dark bars, the energy with power saving on. They decrease as the number of processors increases. The effect is most noticeable in the 1% case I've marked in red above, where nearly all the energy is spent on actively switching the logic (the dynamic power). (Tomcatv is a short mesh-generation program from the SPEC92 benchmark suite.)

More processors, less power. What's going on here?

One thing definitely not going on is a change in the micro-architecture of the processors. In particular, the approach is not one of using simpler processors – casting aside superscalar, VLIW, out-of-order execution – and thereby cramming more of them onto the same silicon area. The simplification route is a popular direction taken by Oracle (Sun) Niagara, IBM's Blue Genes, Intel's Larrabee, and others, and people kept pointing me to it when I brought up this issue; it's a modern, slightly milder variation of what legions of I'm-an-engineer-I-don't-write-software guys have been saying for decades: armies of ants outperform a few strong horses. They always have, and they always will – but only on the things ants can do. Nevertheless, that's not what's happening here, since the same silicon is used in all the cases.

Instead, what's happening appears to have to do with the power supply voltage (VDD) and clock frequency. That leaves me at a disadvantage, since I'm definitely not a circuits guy. Once upon a time I was an EE, but that was more decades ago than I want to admit. But my digging did come up with something, so even if you don't know an inductor from a capacitor (or don't know why EE types call sqrt(-1) "j", not "i"), just hang in there and I'll walk you through it.

The standard formula for the power used by a circuit is P = C·V²·f, where:

C is the capacitance being driven. This is basically the size of the bucket you're trying to fill up with electrons; a longer, fatter wire is a bigger bucket, for example, and it's more work to fill up a bigger bucket. However, this doesn't change per circuit when you add more processors of the same kind, so the source of the effect can't be here.

f is the clock frequency. Lowering this does reduce power, but only linearly: Halve the speed and you only halve the power, so two processors each running at half speed use the same amount of power as one running at full speed. You stay in the same place. No help here.

V is the power supply voltage, more properly called VDD to distinguish it from other voltages hanging around circuits. (No, I don't know why it's a sagging double-D.) This has promise, since it is related to the power by a quadratic term: Halve the voltage and the power goes down 4X.
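
To make those three levers concrete, here's a trivial numeric sketch of the formula – nothing but the arithmetic above, with normalized, made-up values:

```c
/* Dynamic power P = C * V^2 * f: compare the effect of halving the
 * clock versus halving the supply voltage. Values are normalized;
 * only the ratios matter. */
#include <stdio.h>

static double power(double c, double v, double f) {
    return c * v * v * f;
}

int main(void) {
    double C = 1.0, V = 1.0, F = 1.0;   /* normalized units */

    printf("baseline:   %.3f\n", power(C, V, F));
    printf("halve f:    %.3f (2x less)\n", power(C, V, F / 2));
    printf("halve V:    %.3f (4x less)\n", power(C, V / 2, F));
    printf("halve both: %.3f (8x less)\n", power(C, V / 2, F / 2));
    return 0;
}
```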

Unfortunately, V doesn't have an obvious numerical relationship to processor speed. Clearly it does have some relationship: If you have less oomph, it takes longer to fill up that bucket, so you can't run the clock as fast when you lower the supply voltage. But how much lower does f go as you reduce V? I heard several answers to that, all of which boiled down to "I don't know, but it's probably complicated." One circuit design text just completely punted and recommended using a feedback circuit – test the clock rate, and lower/raise the voltage if the clock is too fast/slow.

But we can ignore that for a little bit, and just look at the power used by an N-way parallel implementation. Kang and Leblebici, the text referenced above, do just that. They consider what happens to power when you parallelize any circuit, reducing the clock to 1/Nth of the original and combining the results in one final register of capacitance C_reg that runs at the original clock speed, f_CLK:


If the original circuit is really big, like a whole core, its capacitance C_total overwhelms the place where you deposit the result, C_reg, so the "1+" term at the front is essentially 1. So, OK, that means the power ratio is proportional to the square of the voltage ratio. What does that have to do with the clock speed?
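
Here's a toy version of that argument in code – assuming, purely for illustration, that the required supply voltage can be scaled down in proportion to the clock, and lumping the register overhead into a single made-up fraction. It's the shape of the textbook analysis, not its exact equation:

```c
/* Idealized N-way parallel power estimate: N copies of a circuit, each
 * clocked at f/N, with supply voltage assumed (simplistically) to scale
 * down in proportion to the clock. The overhead fraction stands in for
 * the C_reg "1+" term in the text; the exact form in Kang & Leblebici
 * differs, this just shows the shape of the argument. */
#include <stdio.h>

int main(void) {
    double overhead = 0.02;   /* assumed C_reg/C_total-style overhead */

    for (int n = 1; n <= 8; n *= 2) {
        double v      = 1.0 / n;   /* assume required voltage scales ~ clock */
        double f_copy = 1.0 / n;   /* each of the N copies runs at f/N       */

        double p_copies = n * 1.0 * v * v * f_copy;  /* N copies, C = 1 each  */
        double p_reg    = overhead * v * v * 1.0;    /* result register at f  */

        printf("N = %d: relative power %.4f (ideal 1/N^2 = %.4f)\n",
               n, p_copies + p_reg, 1.0 / (n * (double)n));
    }
    return 0;
}
```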


They suggest you solve the equations giving the propagation time through an inverter to figure that out. Here they are, purely for your amusement:




Solve them? No thank you. I'll rely on Kang and Leblebici's analysis, which states, after verbally staring at a graph of those equations for half a page: "The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship,…" (emphasis added).


So, there you go. More parallel means less power. By a lot. Actually stronger than quadratic. And it doesn't depend on micro-architecture: It will work just as well on full-tilt, single-thread-optimized, out-of-order superscalar cores or whatever – so you can get through serial portions with those cores faster, then spread out and cool down on nicely parallel parts.

You can't get that full quadratic effect in practice, of course, as I mentioned earlier. Some things this does not count:


  • Less than perfect software scaling.
  • Domination by static, not switching, power dissipation: The power that's expended just sitting there doing nothing, not even running the clock. This can be an issue, since leakage current is a significant concern nowadays.
  • Hardware that doesn't participate in this, like memory, communications, and other IO gear. (Update: Bjoern Knafia pointed out on Twitter that caches need this treatment, too. I've no clue how SRAM works in this regard, but it is pretty power-hungry. eDRAM, not so much.)
  • Power supply overheads.
  • And… a lack of hardware that actually lets you do this.
The latter kind of makes this all irrelevant in practice, today. Nobody I could find, except for the "super CPU" discussed in that earlier post, allows software control of voltage/frequency settings. Embedded systems today, like IBM/Freescale PowerPC and ARM licensees, typically just provide on and off, with several variants of "off" using less power the longer it takes to turn on again. Starting with Nehalem (Core i7), Intel has Turbo Boost, which changes the frequency but not the voltage (as far as I can tell), thereby missing the crucial ingredient to make this work. And there were hints that Larrabee will (can?) automatically reduce frequency if its internal temperature monitoring indicates things are getting too hot.

But the right stuff just isn't there to exploit this circuit-design law of physics. It should be.


And, clearly, software should be coding to exploit it when it becomes available. Nah. Never happen.


Ultimately, though, it's hard to ignore Mother Nature when she's beating you over the head with a square law. It'll happen when people realize what they have to gain.

Thursday, January 28, 2010

The Problem with Larrabee


Memory bandwidth. And, most likely, software cost. Now that I've given you the punch lines, here's the rest of the story.

Larrabee, Intel's venture into high-performance graphics (and accelerated HPC), the root of months of trash talk between Intel and Nvidia, is well-known to have been delayed sine die: The pre-announced 2010 product won't happen, although some number will be available for software development, and no new date has been given for a product. It's also well-known for being an architecture that's clearly programmable with standard thinking and tools, as opposed to other GPGPUs (Nvidia, ATI/AMD), which look like something from another planet to anybody but a graphics wizard. In that context, a talk at Stanford earlier this month by Tom Forsyth, one of the Larrabee architects at Intel, is an interesting event.

Tom's topic was The Challenges of Larrabee as a GPU, and it began with Tom carefully stating the official word on Larrabee (delay) and then interpreting it: "Essentially, the first one isn't as cool as we hoped, and so there's no point in trying to sell it, because no one would buy it." Fair enough. He promised they'd get it right on the 2nd, 3rd, 4th, or whatever try. Everybody's doing what they were doing before the announcement; they'll just keep on truckin'.

But, among many other interesting things, he also said the target memory bandwidth – presumably required for adequate performance on the software renderer being written (and rewritten, and rewritten…) – was to be able to read 2 bytes / thread / cycle.

He said this explicitly, vehemently, and quite repeatedly, further asserting that they were going to try to maintain that number in the future. And he's clearly designing code to that assertion. Here's a quote I copied: "Every 10 instructions, dual-issue means 5 clocks, that's 10 bytes. That's it. Starve." And most good code will be memory-limited.

The thing is: 2 bytes / cycle / thread is a lot. It's so big that a mere whiff of it would get HPC people, die-hard old-school HPC people, salivating. Let's do some math:

Let's say there are 100 processors (high end of numbers I've heard). 4 threads / processor. 2 GHz (he said the clock was measured in GHz).

That's 100 cores x 4 threads x 2 GHz x 2 bytes = 1600 GB/s.

Let's put that number in perspective:



  • It's moving more than the entire contents of a 1.5 TB disk drive every second.
  • It's more than 100 times the bandwidth of Intel's shiny new QuickPath system interconnect (12.8 GB/s per direction).
  • It would soak up the output of 33 banks of DDR3-SDRAM, each bank using all three channels of 64 bits apiece (192 bits total), 48 GB/s aggregate per bank.
In other words, it's impossible. Today. It might be that Intel is pausing Larrabee to wait for product shipment of some futuristic memory technology, like the 3D stacked chips with direct vias (vertical wires) passing all the way through the RAM chip to the processor stacked on it (Exascale Ambitions talk at Salishan 20 by Bill Camp, Intel’s Chief Architect/CTO of HPC, p. 21). Tom referred to the memory system designers as wizards beyond his comprehension; but even so, such exotica seems a flaky assumption to me.

What are the levers we have to reduce it? Processor count, clock rate, and that seems to be it. They need those 4 threads / processor (it's at the low end of really keeping their 4-stage pipe busy). He said the clock rate was "measured in GHz," so 1 GHz is a floor there. That's still 800 GB/s. Down to 25 processors we go; I don't know about you, but much lower than 24 cores starts moving out of the realm I was led to expect. But 25 processors still gives 200 GB/s. This is still probably impossible, but starting to get in the realm of feasibility. Nvidia's Fermi, for example, is estimated as having in excess of 96 GB/s.
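
If you want to fiddle with those levers yourself, here's the arithmetic as a quick sweep – the specific core counts and clock rates are just points worth eyeballing, not anything Intel has stated:

```c
/* Aggregate memory bandwidth implied by 2 bytes/thread/cycle and
 * 4 threads/core, over a few core counts and clock rates. The core
 * counts and clocks below are guesses at interesting points only. */
#include <stdio.h>

int main(void) {
    const double bytes_per_thread_cycle = 2.0;
    const double threads_per_core       = 4.0;
    const int    cores[] = { 25, 32, 50, 100 };
    const double ghz[]   = { 1.0, 1.5, 2.0 };

    for (int c = 0; c < 4; c++) {
        for (int g = 0; g < 3; g++) {
            double gb_per_s = cores[c] * threads_per_core
                            * ghz[g] * bytes_per_thread_cycle;
            printf("%3d cores @ %.1f GHz -> %6.0f GB/s\n",
                   cores[c], ghz[g], gb_per_s);
        }
    }
    return 0;
}
```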

So maybe I'm being a dummy: He's not talking about main memory bandwidth, he's talking about bandwidth to cache memory. But then the number is too low. Take his 10-instruction, dual-issue, 5-clock, 10-byte example: You can get a whole lot more than 10 bytes out of an L1 cache in 5 clocks, not even counting the fact that it's probably multi-ported (able to do multiple accesses in a single cycle).

So why don't other GPU vendors have the same problem? I suspect it's at least partly because they have several separate, specialized memories, all explicitly controlled. The OpenCL memory model, for example, includes four separate memory spaces: private, local, constant, and global (cached). These are all explicitly managed, and if you align your stars correctly, they can all be running simultaneously. (See OpenCL Fundamentals, Episode 2). In contrast, Larrabee has to choke it all through one general-purpose memory path.
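
For a reminder of what that explicit management looks like at the source level, here's a skeletal OpenCL C kernel that touches all four address spaces – a contrived example of mine, not anything from a real renderer:

```c
/* OpenCL C kernel sketch touching all four address spaces; contrived,
 * just to show where each qualifier goes. */
__kernel void blend(__global  const float *in,      /* global: main device memory (cached) */
                    __constant      float *weights, /* constant: small read-only table     */
                    __local         float *scratch, /* local: per-work-group scratchpad    */
                    __global        float *out,
                    int n)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float x = 0.0f;                   /* private: per-work-item registers */
    if (gid < n)
        x = in[gid] * weights[lid % 16];

    scratch[lid] = x;                 /* stage through the local scratchpad */
    barrier(CLK_LOCAL_MEM_FENCE);     /* make it visible to the work-group  */

    if (gid < n)
        out[gid] = scratch[lid];
}
```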

Now, switching gears:

Tom also said that the good news is that they can do 20 different rendering pipelines, all on the same hardware, since it's a software renderer; and the bad news is that they have to. He spoke of shipping a new renderer optimized for a new hot game six months after the game shipped.

Doesn't this imply that they expect their software rendering pipeline to be fairly marginal – so they are forced to make that massive continuing investment? When asked why others didn't do that, he indicated that they didn't have a choice; the pipeline's in hardware, so that one size fits all. Well, in the first place that's far less true with newer architectures; both Nvidia and ATI (AMD) are fairly programmable these days (they'd say "very programmable," I'm sure). In the second place, if it works adequately, who cares if you don't have a choice? In the third place, there's a feedback loop between applications and the hardware: Application developers work to match what they do to the hardware that's most generally available. This is the case in general, but is particularly true with fancy graphics. So the games will be designed to what the hardware does well, anyway.

And I don't know about you, but in principle I wouldn't be really excited about having to wait 6 months to play the latest and greatest game at adequate performance. (In practice, I'm never the first one out of the blocks for a new game.)

I have to confess that I've really got a certain amount of fondness for Larrabee. Its architecture seems so much more sane and programmable than the Nvidia/ATI gradual mutation away from fixed-function units. But these bandwidth and programming issues really bother me, and shake out some uncomfortable memories: The last time I recall Intel attempting, as in Larrabee, to build whatever hardware the software wanted (or appeared to want on the surface), what came out was the ill-fated and nearly forgotten iAPX 432, renowned for hardware support of multitasking, object-oriented programming, and even garbage collection – and for being 4x slower than an 80286 of the same frequency.

Different situation, different era, different kind of hardware design, I know. But it still makes me worry.



(Acknowledgement: The graphic comparison to a 1.5 TB disk transfer was suggested by my still-wishing-to-remain-anonymous colleague, who also pointed me to the referenced video. This post generally benefited from email discussion with him.)

Friday, December 4, 2009

Intel’s Single-Chip Clus… (sorry) Cloud

Intel's recent announcement of a 48-core "single-chip cloud" (SCC) is now rattling around several news sources, with varying degrees of boneheaded-ness and/or willful suspension of disbelief in the hype. Gotta set this record straight, and also raise a few questions I didn't find answered in the sources now available (the presentation, the software paper, the developers' video).

Let me emphasize from the start that I do not think this is a bad idea. In fact, it's an idea good enough that I've led or been associated with rather similar architectures twice, although not on a single chip (RP3, POWER 4 (no, not the POWER4 chip; this one was four original POWER processors in one box, apparently lost on the internet) (but I've still got a button…)). Neither was, in hindsight, a good idea at the time.

So, some facts:

SCC is not a product. It's an experimental implementation of which about a hundred will be made, given to various labs for software research. It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now. Think concept car. That software research will surely be necessary, since:

SCC is neither a multiprocessor nor a multicore system in the usual sense. They call it a "single chip cloud" because the term "cluster" is déclassé. Those 48 cores have caches (two levels), but cache coherence is not implemented in hardware. So, it's best thought of as a 48-node cluster of uniprocessors on a single chip. Except that those 48 cluster nodes all access the same memory. And if one processor changes what's in memory, the others… don't find out. Until the cache at random happens to kick the changed line out. Sometime. Who knows when. (Unless something else is done; but what? See below.)

Doing this certainly does save hardware complexity, but one might note that quite adequately scalable cache coherence does exist. It's sold today by Rackable (the part that was SGI); Fujitsu made big cache-coherent multi-x86 systems for quite a while; and there are ex-Sequent folks out there who remember it well. There's even an IEEE standard for it (SCI, Scalable Coherent Interface). So let's not pretend the idea is impossible and assume your audience will be ignorant. Mostly they will be, but that just makes misleading them more reprehensible.

To leave cache coherence out of an experimental chip like this is quite reasonable; I've no objection there. I do object to things like the presentation's calling this "New Data-Sharing Options." That's some serious lipstick being applied.

It also leads to several questions that are so far unanswered:

How do you keep the processors out of each others' pants? Ungodly uncontrolled race-like uglies must happen unless… what? Someone says "software," but what hardware does that software exercise? Do they, perhaps, keep the 48 separate from one another by virtualization techniques? (Among other things, virtualization hardware has to keep virtual machine A out of virtual machine B's memory.) That would actually be kind of cool, in my opinion, but I doubt it; hypervisor implementations require cache coherence, among other issues. Do you just rely on instructions that push individual cache lines out to memory? Ugh. Is there a way to decree whole swaths of memory to be non-cacheable? Sounds kind of inefficient, but whatever. There must be something, since they have demoed some real applications and so this problem must have been solved somehow. How?
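
For the flavor of the "push individual cache lines out" option, here's a generic x86-style sketch of software-managed sharing – my own illustration of the general technique, not a description of how SCC actually does it:

```c
/* Sketch of software-managed sharing on hardware without cache coherence,
 * using explicit cache-line flushes. A generic illustration of the
 * "push the lines out yourself" idea, not SCC's actual mechanism. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>
#include <string.h>

#define LINE 64

static void flush_range(const void *p, size_t bytes) {
    const char *c = (const char *)p;
    for (size_t off = 0; off < bytes; off += LINE)
        _mm_clflush(c + off);        /* write back and invalidate each line */
    _mm_mfence();                    /* make sure the flushes complete      */
}

/* Producer side: write the data, force it to memory, then publish a flag. */
void publish(volatile uint32_t *flag, void *shared, const void *src, size_t n) {
    memcpy(shared, src, n);
    flush_range(shared, n);                          /* data reaches memory first */
    *flag = 1;
    flush_range((const void *)flag, sizeof *flag);   /* then the flag             */
}

/* Consumer side: poll the flag, discarding any stale cached copy each time,
 * then invalidate the data region before reading it. */
void consume(volatile uint32_t *flag, const void *shared, void *dst, size_t n) {
    do {
        flush_range((const void *)flag, sizeof *flag);  /* drop stale flag line  */
    } while (*flag == 0);
    flush_range(shared, n);                             /* drop stale data lines */
    memcpy(dst, shared, n);
}
```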

What's going on with the operating system? Is there a separate kernel for each core? My guess: Yes. That's part of being a clus… sorry, a cloud. One news article said it ran "Rock Creek Linux." Never heard of it? Hint: The chip was called Rock Creek prior to PR.

One iteration of non-coherent hardware I dealt with used cluster single system image to make it look like one machine for management and some other purposes. I'll bet that becomes one of the software experiments. (If you don't know what SSI is, I've got five posts for you to read, starting here.)

Appropriately, there's mention of message-passing as the means of communicating among the cores. That's potentially fast message passing, since you're using memory-to-memory transfers in the same machine. (Until you saturate the memory interface – only four ports shared by all 48.) (Or until you start counting usual software layers. Not everybody loves MPI.) Is there any hardware included to support that, like DMA engines? Or protocol offload engines?

Finally, why does every Intel announcement of gee-whiz hardware always imply it will solve the same set of problems? I'm really tired of those flying cars. No, I don't expect to ever see an answer to that one.

I'll end by mentioning something in Intel's SCC (née Rock Creek) that I think is really good and useful: multiple separate power regions. Voltage and frequency can be varied separately in different areas of the chip, so if you aren't using a bunch of cores, they can go slower and/or draw less power. That's something that will be "jacks or better" in future multicore designs, and spending the effort to figure out how to build and use it is very worthwhile.

Heck, the whole thing is worthwhile, as an experiment. On its own. Without inflated hype about solving all the world's problems.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(This has been an unusually topical post, brought to you courtesy of the author's head-banging-wall level of annoyance at boneheaded news stories. Soon we will resume our more usual programming. Not instantly. First I have to close on and move into a house. In Holiday and snow season.)