Tuesday, January 11, 2011

Intel-Nvidia Agreement Does Not Portend a CUDABridge or Sandy CUDA


Intel and Nvidia reached a legal agreement recently in which they cross-license patents, stop suing each other over chipset interfaces, and oh, yeah, Nvidia gets $1.5B from Intel in five easy payments of $300M each.

This has been covered in many places, like here, here, and here, but in particular Ars Technica originally led with a headline about a Sandy Bridge (Intel GPU integrated on-chip with CPUs; see my post if you like) using Nvidia GPUs as the graphics engine. Ars has since retracted that (see the web page referenced above), replacing the original web page. (The URL still reads "bombshell-look-for-nvidia-gpu-on-intel-processor-die.")

Since that's been retracted, maybe I shouldn't bother bringing it up, but let me be more specific about why this is wrong, based on my reading of the actual legal agreement (redacted, meaning a confidential part was deleted). Note: I'm not a lawyer, although I've had to wade through lots of legalese over my career, so this is an "informed" layman's reading.

Yes, they have cross-licensed each others' patents. So if Intel does something in its GPU that is covered by an Nvidia patent, no suits. Likewise, if Nvidia does something covered by Intel patents, no suits. This is the usual intention of cross-licensing deals: Each side has "freedom of action," meaning they don't have to worry about inadvertently (or not) stepping on someone else's intellectual property.

It does mean that Intel could, in theory, build a whole dang Nvidia GPU and sell it. Such things have happened historically (IBM mainframe clones, X86 clones), but usually without cross-licensing, and they're uncommon. As a practical matter, wholesale inclusion of one company's processor design into another company's products is a hard job. There is a lot to a large digital widget that isn't covered by the patents – numbers of undocumented, implementation-specific corner cases that can mess up full software compatibility, without which there's no point. Finding them all is a massive undertaking.

So switching to a CUDA GPU architecture would be a massive undertaking, and furthermore it's a job Intel apparently doesn't want to do. Intel has its own graphics designs, with years of the design / test / fabricate pipeline already in place; and between the ill-begotten Larrabee (now MIC) and its own integrated GPUs and media processors, Intel has demonstrated that it really wants to do graphics in house.

Remember, what this whole suit was originally about was Nvidia's chipset business – building the stuff that connects processors to memory and IO. Intel's interfaces to the chipset were patent protected, and Nvidia complained that Intel wouldn't let it at the newer ones, even though they were allegedly covered by a legal agreement. It's still about that issue.

That makes it surprising to find this statement buried down in section 8.1:

"Notwithstanding anything else in this Agreement, NVIDIA Licensed Chipsets shall not include any Intel Chipsets that are capable of electrically interfacing directly (with or without buffering or pin, pad or bump reassignment) with an Intel Processor that has an integrated (whether on-die or in-package) main memory controller, such as, without limitation, the Intel Processor families that are code named 'Nehalem', 'Westmere' and 'Sandy Bridge.'"

So all Nvidia gets is the old FSB (front side bus) interfaces. They can't directly connect into Intel's newer processors, since those interfaces are still patent protected, and those patents aren't covered. They have to use PCI Express, like any other IO device.

So what did Nvidia really get? They get bupkis, that's what. Nada. Zilch. Access to an obsolete bus interface. Well, they get bupkis plus $1.5B, which is a pretty fair sweetener. Seems to me that it's probably compensation for the chipset business Nvidia lost when there was still a chipset business to have, which there isn't now.

And both sides can stop paying lawyers. On this issue, anyway.

Postscript

Sorry, this blog hasn't been very active recently, and a legal dispute over obsolete busses isn't a particularly wonderful re-start. At least it's short. Nvidia's Project Denver – sticking a general-purpose ARM processor in with a GPU – might be an interesting topic, but I'm going to hold off on that until I can find out what the architecture really looks like. I'm getting a little tired of just writing about GPUs, though. I'm not going to stop that, but I am looking for other topics on which I can provide some value-add.

Monday, December 6, 2010

The Varieties of Virtualization


There appear to be many people for whom the term virtualization exclusively means the implementation of virtual machines à la VMware's products, Microsoft's Hyper-V, and so on. That's certainly a very important and common case, enough so that I covered various ways to do it in a separate series of posts; but it's scarcely the only form of virtualization in use.

There's a hint that this is so in the gaggle of other situations where the word virtualization is used, such as desktop virtualization, application virtualization, user virtualization (I like that one; I wonder what it's like to be a virtual user), and, of course, Java Virtual Machine (JVM). Talking about the latter as a true case of virtualization may cause some head-scratching; I think most people consign it to a different plane of existence than things like VMware.

This turns out not to be the case. They're not only all in the same (boringly mundane) plane, they relate to one another hierarchically. I see five levels to that hierarchy right now, anyway; I wouldn't claim this is the last word.

A key to understanding this is to adopt an appropriate definition of virtualization. Mine is that virtualization is the creation of isolated, idealized platforms on which computing services are provided. Anything providing that, whether it's hardware, software, or a mixture, is virtualization. The adjectives in front of "platform" could have qualifiers: Maybe it's not quite idealized in all cases, and isolation is never total. But lack of qualification is the intent.

Most types of virtualization allow hosting several platforms on one physical or software resource, but that's not part of my definition because it's not universal; there could be just one, or a single platform could be created spanning multiple physical resources. It also helps not to dwell too heavily on the boundary between hardware and software. But that's starting to get ahead of the discussion. Let's go through the levels, starting at the bottom.

I'll relate this to cloud computing's IaaS/PaaS/SaaS levels later.

Level 1: Hardware Partitioning


Some hardware is designed like a brick of chocolate that can be broken apart along various predefined fault lines, each piece a fully functional computer. Sun Microsystems (Oracle, now) famously did this with its .com workhorse, the Enterprise 10000 (UE10000). That system had multiple boards plugged into a memory-bus backplane, each board with processor(s), memory, and IO. Firmware let you set registers allowing or disallowing inter-board memory traffic, cache coherence, and IO traffic, so you could carve the whole machine into partitions built from any number of whole boards. The register settings are arranged so that no code running on any of the processors can alter them or, usually, even tell they're there; a privileged console accesses them, under command of an operator, and that's it. HP, IBM and others have provided similar capabilities in large systems, often with the processors, memory, and IO in separate units, numbers of each assigned to different partitions.
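Just to make that mechanism concrete, here's a little sketch (Python, purely illustrative; real partitioning lives in firmware registers, and every name and number here is made up) of the bookkeeping the privileged console does:

```python
# Illustrative sketch only: the console-controlled board-to-partition bookkeeping
# described above. Board names, sizes, and partition ids are all made up.

class Board:
    def __init__(self, name, cpus, mem_gb):
        self.name, self.cpus, self.mem_gb = name, cpus, mem_gb

class PartitionConsole:
    """The privileged console: the only agent allowed to change the partition map."""
    def __init__(self, boards):
        self.boards = {b.name: b for b in boards}
        self.partition_of = {}              # board name -> partition id

    def assign(self, board_name, partition_id):
        # In the real machine this sets routing / coherence registers so that
        # memory, cache-coherence, and IO traffic stay within the partition.
        self.partition_of[board_name] = partition_id

    def may_communicate(self, board_a, board_b):
        # Inter-board traffic is allowed only within one partition.
        return self.partition_of.get(board_a) == self.partition_of.get(board_b)

console = PartitionConsole([Board("B0", 4, 32), Board("B1", 4, 32), Board("B2", 4, 32)])
console.assign("B0", "prod")
console.assign("B1", "prod")
console.assign("B2", "test")
print(console.may_communicate("B0", "B1"))   # True: same partition
print(console.may_communicate("B0", "B2"))   # False: traffic blocked by the registers
```

The point of the sketch is only who holds the map: code inside a partition never sees or touches it.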

Hardware partitioning has the big advantage that even hardware failures (for the most part) simply cannot propagate among partitions. With appropriate electrical design, you can even power-cycle one partition without affecting others. Software failures are of course also totally isolated within partitions (as long as one isn't performing a service for another, but that issue is on another plane of abstraction).

The big negative of hardware partitioning is that you usually cannot have very many partitions. Even a single chip now contains multiple processors, so partitioning by separate chips is far coarser granularity than is generally desirable. In fact, it's common to assign just a fraction of one CPU, and that can't be done without bending the notion of a hardware-isolated, power-cycle-able partition to the breaking point. In addition, there is always some hardware in common across the partitions. For example, power supplies are usually shared, and whatever interconnects all the parts is shared; failure of that shared hardware causes all partitions to fail. (For more complete high availability, you need multiple completely separate physical computers, not under the same sprinkler head, preferably located on different tectonic plates, etc., depending on your personal level of paranoia.)

Despite its negatives, hardware partitioning is fairly simple to implement, useful, and still used. It or something like it, I speculate, is effectively what will be used for initial "virtualization" of GPUs when that starts appearing.

Level 2: Virtual Machines

This is the level of VMware and its kissin' cousins. All the hardware is shared en masse, and a special layer of software, a hypervisor, creates the illusion of multiple completely separate hardware platforms. Each runs its own copy of an operating system and any applications above that, and (ideally) none even knows that the others exist. I've previously written about how this trick can be performed without degrading performance to any significant degree, so won't go into it here.

The good news here is that you can create as many virtual machines as you like, independent of the number of physical processors and other physical resources – at least until you run out of resources. The hypervisor usually contains a scheduler that time-slices among processors, so sub-processor allocation is available. With the right hardware, IO can also be fractionally allocated (again, see my prior posts).
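As a toy illustration of that sub-processor allocation – not a sketch of how any real hypervisor scheduler is built – here's round-robin time-slicing of more virtual CPUs than physical ones; the VM names and slice length are arbitrary:

```python
# Toy round-robin scheduler: five virtual CPUs sharing two physical CPUs, each
# vCPU getting a time slice in turn. Purely illustrative; real hypervisor
# schedulers also weigh priorities, pinning, pending IO, and much else.
from collections import deque

PHYSICAL_CPUS = 2
SLICE_MS = 10

vcpus = deque(f"vm{i}-vcpu0" for i in range(5))   # 5 vCPUs on 2 pCPUs

def run_for_slice(vcpu, ms):
    # Stand-in for "dispatch the guest context until the timer interrupt fires".
    return f"{vcpu} ran {ms} ms"

def schedule_round(vcpus):
    log = []
    for pcpu in range(PHYSICAL_CPUS):
        vcpu = vcpus.popleft()              # pick the next runnable vCPU
        log.append(f"pCPU{pcpu}: " + run_for_slice(vcpu, SLICE_MS))
        vcpus.append(vcpu)                  # back to the end of the queue
    return log

for _ in range(3):
    print(schedule_round(vcpus))
```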



The bad news is that you generally get much less hardware fault isolation than with hardware partitioning; if the hardware croaks, well, it's one basket and those eggs are scrambled. Very sophisticated hypervisors can help with that when there is appropriate hardware support (mainframe customers do get something for their money). In addition, and this is certainly obvious after it's stated: If you put N virtual machines on one physical machine, you are now faced with all the management pain of managing all N copies of the operating system and its applications.

This is the level often used in so-called desktop virtualization. In that paradigm, individuals don't own hardware, their own PC. Instead, they "own" a block of bits back on a server farm that happens to be the description of a virtual machine, and can request that their virtual machine be run from whatever terminal device happens to be handy. It might actually run back on the server, or might run on a local machine after downloading. Many users absolutely loathe this; they want to own and control their own hardware. Administrators like it, a lot, since it lets them own, and control, the hardware.

Level 3: Containers

This level was, as far as I know, originally developed by Sun Microsystems (Oracle), so I'll use their name for it: Containers. IBM (in AIX) and probably others also provide it, under different names.

With containers, you have one copy of the operating system code, but it provides environments, containers, which act like separate copies of the OS. In Unix/Linux terms, each container has its own file system root (including IO), process tree, shared segment naming space, and so on. So applications run as if they were running on their own copy of the operating system – but they are actually sharing one copy of the OS code, with common but separate OS data structures, etc.; this provides significant resource sharing that helps the efficiency of this level.

This is quite useful if you have applications or middleware that were written under the assumption that they were going to run on their own separate server, and as a result, for example, all use the same name for a temporary file. Were they run on the same OS, they would clobber each other in the common /tmp directory; in separate containers, they each have their own /tmp. More such applications exist than one would like to believe; the most quoted case is the Apache web server, but my information on that may be out of date and it may have been changed by now. Or not, since I'm not sure what the motivation to change would be.
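Here's a minimal sketch of that /tmp point using Linux mount namespaces, the closest everyday Linux analogue for this particular behavior (Solaris Containers and AIX WPARs do it their own way). It assumes a Linux box, root privileges, and the util-linux unshare tool; the file names are just for illustration.

```python
# Minimal sketch of the /tmp-collision point using Linux mount namespaces.
# Requires root and the util-linux "unshare" tool; paths and names illustrative.
import subprocess

def run_in_private_tmp(tag):
    # Each invocation gets its own mount namespace with a private tmpfs on /tmp,
    # so both "applications" can create /tmp/app.lock without clobbering each other.
    script = f"mount -t tmpfs tmpfs /tmp && touch /tmp/app.lock && echo '{tag}: created /tmp/app.lock' && ls /tmp"
    subprocess.run(["unshare", "--mount", "--fork", "sh", "-c", script], check=True)

if __name__ == "__main__":
    run_in_private_tmp("container-A")
    run_in_private_tmp("container-B")   # no collision; each saw an empty /tmp
```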

I suspect container technology was originally developed in the Full Moon cluster single-system-image project, which needs similar capabilities. See my much earlier post about single-system-image if you want more information on such things.

In addition, there's just one real operating system to manage in this case, so management headaches are somewhat lessened. You do have to manage all those containers, so it isn't an N:1 advantage, but I've heard customers say this is a significant management savings.

A perhaps less obvious example of containerization is the multiuser BASIC systems that flooded the computer education system several decades back. There was one copy of the BASIC interpreter, run on a small minicomputer and used simultaneously by many students, each of whom had their own logon ID and wrote their own code. And each of whom could botch things up for everybody else with the wrong code that soaked up the CPU. (This happened regularly in the "computer lab" I supervised for a while.) I locate this in the container level rather than higher in the stack because the BASIC interpreter really was the OS: It ran on the bare metal, with no supervisor code below it.

Of course, fault isolation at this level is even less than in the prior cases. Now if the OS crashes, all the containers go down. (Or if the wrong thing is done in BASIC…) In comparison, an OS crash in a virtual machine is isolated to that virtual machine.

Level 4: Software Virtual Machines

We've reached the JVM level. It's also the .NET level, the Lisp level, the now more usual BASIC level, and even the CICS level (and so on): the level of more-or-less programming-language-based independent computing environments. Obviously, multiple of these can be run as applications under a single operating system image, each providing a separate environment for the execution of applications. At least this can be done in theory, and in many cases in practice; some environments were implemented as if they owned the computer they ran on.

What you get out of this is, of course, a more standard programming environment that can be portable – run on multiple computer architectures – as well as extensions to a machine environment that provide services simplifying application development. Those extensions are usually the key reason this level is used. There's also a bit of fault tolerance, since if one of those dies of a fault in its support or application code, it need not always affect others, assuming a competent operating system implementation.

Fault isolation at this level is mostly software only; if one JVM (say) crashes, or the code running on it crashes, it usually doesn't affect others. Sophisticated hardware / firmware / OS support can add the ability to keep many of the software VMs up when a failure affects only one of them. (Mainframe again.)
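A trivial sketch of that software-only isolation, with two interpreter processes standing in for two JVMs (illustrative only; the "bug" is deliberate):

```python
# Two "software VMs" as separate interpreter processes: a fault in one does not
# disturb the other, because the OS keeps their address spaces separate.
import subprocess, sys

crasher = 'raise RuntimeError("bug in this VM\'s application code")'
worker  = 'print("other VM: finished its work normally")'

for name, body in [("vm-A", crasher), ("vm-B", worker)]:
    result = subprocess.run([sys.executable, "-c", body],
                            capture_output=True, text=True)
    status = "crashed" if result.returncode != 0 else "ok"
    print(f"{name}: {status}  {result.stdout.strip()}")
```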

Level 5: Multitenant / Multiuser Environment

Many applications allow multiple users to log in, all to the same application, with their own profiles, data collections, etc. They are legion. Examples include web-based email, Facebook, Salesforce.com, World of Warcraft, and so on. Each user sees his or her own data, and thinks he / she is doing things isolated from others except at those points where interaction is expected. They see their own virtual system – a very specific, particularized system running just one application, but a system apparently isolated from all others in any event.
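The isolation at this level is pure application logic: every piece of data is tagged with its owner, and every query is scoped to the requesting user or tenant. A minimal sketch, with an invented schema that doesn't come from any particular service:

```python
# Minimal multitenant sketch: one shared table, every row tagged with its
# tenant, every query scoped by tenant_id. Schema and names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (tenant_id TEXT, body TEXT)")
db.executemany("INSERT INTO messages VALUES (?, ?)",
               [("alice", "alice's draft"), ("bob", "bob's draft")])

def inbox(tenant_id):
    # The WHERE clause is the whole isolation story: forget it once,
    # and one tenant sees another's data.
    rows = db.execute("SELECT body FROM messages WHERE tenant_id = ?",
                      (tenant_id,)).fetchall()
    return [body for (body,) in rows]

print(inbox("alice"))   # ["alice's draft"] - bob's data never appears
```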

The advantages here? Well, people pay to use them (or put up with advertising to use them). Aside from that, there is potentially massive sharing of resources, and, concomitantly, care must be taken in the software and system architecture to avoid massive sharing of faults.

All Together Now

Yes. You can have all of these levels of virtualization active simultaneously in one system: A hardware partition running a hypervisor creating a virtual machine that hosts an operating system with containers that each run several programming environments executing multi-user applications.

It's possible. There may be circumstances where it appears warranted. I don't think I'd want to manage it, myself. Imagining performance tuning on a 5-layer virtualization cake makes me shudder. I once had a television setup with two volume controls in series: a cable set-top box had its volume control, feeding an audio system with its own. Just those two levels drove me nuts until I hit upon a setting of one of them that let the other, alone, span the range I wanted.

Virtualization and Cloud Computing

These levels relate to the usual IaaS/PaaS/SaaS (Infrastructure / Platform / Software as a Service) distinctions discussed in cloud computing circles, but are at a finer granularity than those.

IaaS relates to the bottom two layers: hardware partitioning and virtual machines. Those two levels, particularly virtual machines, make it possible to serve up raw computing infrastructure (machines) in a way that can utilize the underlying hardware far more efficiently than handing customers whole computers that they aren't going to use 100% of the time. As I've pointed out elsewhere, it is not a logical necessity that a cloud use this or some other form of virtualization; but in many situations, it is an economic necessity.

Software virtual machines are what PaaS serves up. There's a fairly close correspondence between the two concepts.

SaaS is, of course, a Multiuser environment. It may, however, be delivered by using software virtual machines under it.

Containers are a mix of IaaS and PaaS. They don't provide bare hardware, but a plain OS is made available, and that can certainly be considered a software platform. It is, however, a fairly barren environment compared with what software virtual machines provide.

Conclusion

This post has been brought to you by my poor head, which aches every time I encounter yet another discussion over whether and how various forms of cloud computing do or do not use virtualization. Hopefully it may help clear up some of that confusion.

Oh, yes, and the obvious conclusion: There's more than one kind of virtualization out there, folks.

Monday, November 15, 2010

The Cloud Got GPUs


Amazon just announced, on the first full day of SC10 (SuperComputing 2010), the availability of Amazon EC2 (cloud) machine instances with dual Nvidia Fermi GPUs. According to Amazon's specification of instance types, this "Cluster GPU Quadruple Extra Large" instance contains:

  • 22 GB of memory
  • 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
  • 2 x NVIDIA Tesla "Fermi" M2050 GPUs
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: Very High (10 Gigabit Ethernet)
So it looks like the future virtualization features of CUDA really are for purposes of using GPUs in the cloud, as I mentioned in my prior post.

One of these XXXXL instances costs $2.10 per hour for Linux; Windows users need not apply. Or, if you reserve an instance for a year – for $5630 – you then pay just $0.74 per hour during that year. (Prices quoted from Amazon's price list as of 11/15/2010; no doubt they will decrease over time.)
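If you're wondering when the reservation pays off, the break-even arithmetic is simple; a quick sketch using the prices quoted above:

```python
# Break-even between on-demand and one-year reserved pricing, using the
# 11/15/2010 prices quoted above.
on_demand = 2.10           # $/hour, on demand
reserved_upfront = 5630.0  # $ for a one-year reservation
reserved_hourly = 0.74     # $/hour while reserved

break_even_hours = reserved_upfront / (on_demand - reserved_hourly)
print(round(break_even_hours))                # ~4140 hours
print(round(100 * break_even_hours / 8760))   # ~47% of the year
```

So the reservation wins once you expect to run the instance more than roughly 47% of the year.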

This became such hot news that GPU was a trending topic on Twitter for a while.

For those of you who don't watch such things, many of the Top500 HPC sites – the 500 supercomputers worldwide that are the fastest at the Linpack benchmark – have nodes featuring Nvidia Fermi GPUs. This year that list notoriously includes, in the top slot, the system causing the heaviest breathing at present: The Tianhe-1A at the National Supercomputer Center in Tianjin, in China.

I wonder how well this will do in the market. Cloud elasticity – the ability to add or remove nodes on demand – is usually a big cloud selling point for commercial use (expand for holiday rush, drop nodes after). How much it will really be used in HPC applications isn't clear to me, since those are usually batch mode, not continuously operating, growing and shrinking, like commercial web services. So it has to live on price alone. The price above doesn't feel all that inexpensive to me, but I'm not calibrated well in HPC costs these days, and don't know how much it compares with, for example, the cost of running the same calculation on Teragrid. Ad hoc, extemporaneous use of HPC is another possible use, but, while I'm sure it exists, I'm not sure how much exists.

Then again, how about services running games, including the rendering? I wonder if, for example, the communications secret sauce used by OnLive to stream rendered game video fast enough for first-person shooters can operate out of Amazon instances. Even if it doesn't, games that can tolerate a tad more latency may work. Possibly games targeting small screens, requiring less rendering effort, are another possibility. That could crater startup costs for companies offering games over the web.

Time will tell. For accelerators, we certainly are living in interesting times.

Thursday, November 11, 2010

Nvidia Past, Future, and Circular


I'm getting tired of writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one, getting it all out of the way.

Past Fermi Product Mix

For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from Investor Village:


Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've pointed out, this will be a real problem as Intel's and AMD's on-die GPUs assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already started shipping its Zacate integrated-GPU chip to manufacturers.

Future Fermis

Recently Nvidia's chief executive Jen-Hsun Huang gave an interview on what they are looking at for future features in the Fermi architecture. The things he mentioned were: (a) more development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in order:

More CUDA: When asked why not OpenCL, he said because other people are working on OpenCL and they're the only ones doing CUDA. This answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was: why don't they work to make OpenCL, a standard, run as well on their gear as their proprietary CUDA does? Of course the answer is that OpenCL doesn't get them lock-in, which one doesn't say in an interview.

Virtual memory and pre-emption: A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There probably is some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: Cloud-based systems nearly all use virtual machines (for very good reason; see the link), splitting each system node into N virtual machines. Virtual memory and pre-emption allow the GPU to participate in that virtualization. The virtual memory part is, I would guess, more intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [UPDATE: Just after this was published, John Carmack (of id Software) wrote a piece laying out the case for paging into GPUs. So that may be useful in games and generally.]


Direct InfiniBand attachment: At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from the GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory in other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see if everybody's done. (g) If not done, shove the resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think that the copying to and from main memory could be avoided since the GPUs are the ones doing all the computing: Just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice, however, it's not quite that straightforward; but it is at least worth looking at.
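To make the staging overhead explicit, here's that (a)-(g) cycle as a sketch; every helper is a placeholder (standing in for a CUDA memcpy, a GPU kernel, and an MPI-style exchange – none of this is a real API), and the two marked copies are exactly what a direct GPU-to-interconnect path would try to eliminate.

```python
# Sketch of the per-iteration cycle described above. All helpers are placeholders
# standing in for device memcpy, a GPU kernel, and a node-to-node halo exchange.
import numpy as np

def gpu_compute(data):                 # (b) stand-in for the device kernel
    return data * 0.5

def exchange_with_neighbors(data):     # (d) stand-in for the interconnect step
    return np.roll(data, 1)            # pretend this arrived from another node

def converged(old, new):               # (f) global convergence test
    return np.allclose(old, new, atol=1e-6)

host = np.random.rand(1_000_000)
while True:
    device = host.copy()                       # (a) host -> GPU copy   <-- extra hop
    device = gpu_compute(device)               # (b) compute on the GPU
    result = device.copy()                     # (c) GPU -> host copy   <-- extra hop
    incoming = exchange_with_neighbors(result)  # (d) node-to-node exchange
    merged = 0.5 * (result + incoming)         # (e) merge into main memory
    if converged(host, merged):                # (f) everybody done?
        break
    host = merged                              # (g) go around again
print("done; copies (a) and (c) happened every single iteration")
```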

So, that's what may be new in Nvidia CUDA / Fermi land. Each of those is at least marginally justifiable, some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of the dueling Nvidia / AMD (ATI) announcements of about a year ago.

That was the time of the Fermi announcement, which, compared with prior Nvidia hardware, doubled everything, yada yada, and added… ECC. And support for C++ and the like, and good double-precision floating-point speed.

At that time, Tech Report said that the AMD Radeon HD 5870 doubled everything, yada again, and added… a fancy new anisotropic filtering algorithm for smoothing out texture applications at all angles, and supersampling for better antialiasing.

Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?

The Wheel of Reincarnation

The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by T. H. Myer and Ivan Sutherland. There are probably hundreds of renditions of it floating around the web; here's mine.

Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.

So, you beef up your IO device, say by adding the ability to go through a whole list of X, Y locations, putting a dot up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains, so you tell the device to repeat the list on its own. But that's really limiting. It would be much more convenient if you could tell the device to go do another list all by itself, like by embedding the next list's address in a block of X,Y data. This takes a bit of thought, since it means adding a code to everything so the device can tell X,Y pairs from next-list addresses; but it's clearly worth it, so in it goes.

Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.

Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.

At some stage it looks really useful to add conditionals, too, so…

Somewhere along the line, to make this a 21st century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.
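By this point, the "display list" has quietly become a little program. Here's a toy rendition of the accumulated instruction set – op names and encoding entirely invented – just to make the next observation concrete:

```python
# Toy display-list interpreter: the "instruction set" that accumulates from the
# wheel-of-reincarnation story. Op names and encoding are invented.

def run(lists, start):
    dots, stack = [], []
    pc_list, pc, scale = start, 0, 1.0
    while True:
        op, *args = lists[pc_list][pc]
        pc += 1
        if op == "DOT":                         # put a dot at (x, y)
            x, y = args
            dots.append((x * scale, y * scale))
        elif op == "SCALE":                     # hello, arithmetic
            scale = args[0]
        elif op == "JUMP":                      # next-list address embedded in the data
            pc_list, pc = args[0], 0
        elif op == "CALL":                      # reusable objects: text characters, etc.
            stack.append((pc_list, pc))
            pc_list, pc = args[0], 0
        elif op == "RET":
            if not stack:
                return dots                     # outermost list done
            pc_list, pc = stack.pop()
        elif op == "SKIP_IF":                   # a conditional, too
            if scale > args[0]:
                pc += 1

lists = {
    "frame": [("SCALE", 2.0), ("CALL", "square"), ("RET",)],
    "square": [("DOT", 0, 0), ("DOT", 1, 0), ("DOT", 1, 1), ("DOT", 0, 1), ("RET",)],
}
print(run(lists, "frame"))   # four scaled dots
```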

Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work.

And it's spending all its time doing nothing but putting silly dots on a screen.

How about freeing it up to do something more useful by adding a separate device to it to do that?

This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.

Every incremental stage in this process was very well-justified, and Myer and Sutherland say they saw examples (in 1968!) of systems that were more than twice around the wheel: a graphics processor hanging on a graphics processor hanging on a graphics processor. These multi-cycles are often justified if there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: It's got a graphics processor attached to a processor that itself uses a server somewhere else.

I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans and Sutherland Computer Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system on their second planned graphics system (LDS-2). It was, as asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.

Just like Nvidia is talking about attaching InfiniBand directly to its cards.

Also, in the mid-80s in IBM Research, after the successful completion of an effort to build a special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.

Just like Nvidia is adding virtualization to its systems.

Each incremental step is justified – that's always the case with the wheel – just as, in the discussion above, I showed a justification for every one of the general-purpose additions to Nvidia's architecture.

The issue here is not that this is all necessarily bad. It just is. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not.

With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase after ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have, usually with far more resources than you have, been playing the general-purpose game for decades.

It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.

Sunday, October 17, 2010

RIP, Benoit Mandelbrot, father of fractal geometry

Benoit Mandelbrot, father of fractal geometry, has died.

See my post about him, and my interaction with him, in my mostly non-technical blog, Random Gorp: RIP, Benoit Mandelbrot, father of fractal geometry.

Saturday, September 4, 2010

Intel Graphics in Sandy Bridge: Good Enough


As I and others expected, Intel is gradually rolling out how much better the graphics in its next generation will be. Anandtech got an early demo part of Sandy Bridge and checked out the graphics, among other things. The results show that the "good enough" performance I argued for in my prior post (Nvidia-based Cheap Supercomputing Coming to an End) will be good enough to sink third party low-end graphics chip sets. So it's good enough to hurt Nvidia's business model, and make their HPC products fully carry their own development burden, raising prices notably.

The net is that for this early chip, with early device drivers, at a low but usable resolution (1024x768), there's adequate performance on games like "Batman: Arkham Asylum," "Call of Duty MW2," and a bunch of others, significantly including "World of Warcraft." And it'll play Blu-ray 3D, too.

Anandtech's conclusion is "If this is the low end of what to expect, I'm not sure we'll need more than integrated graphics for non-gaming specific notebooks." I agree. I'd add desktops, too. Nvidia isn't standing still, of course; on the low end they are saying they'll do 3D, too, and will save power. But integrated graphics are, effectively, free. It'll be there anyway. Everywhere. And as a result, everything will be tuned to work best on that among the PC platforms; that's where the volumes will be.

Some comments I've received elsewhere on my prior post have been along the lines of "but Nvidia has such a good computing model and such good software support – Intel's rotten IGP can't match that." True. I agree. But.

There's a long history of ugly architectures dominating clever, elegant architectures that are superior targets for coding and compiling. Where are the RISC-based CAD workstations of 15+ years ago? They turned into PCs with graphics cards. The DEC Alpha, MIPS, Sun SPARC, IBM POWER and others, all arguably far better exemplars of the computing art, have been trounced by X86, which nobody would call elegant. Oh, and the IBM zSeries, also high on the inelegant ISA scale, just keeps truckin' through the decades, most recently at an astounding 5.2 GHz.

So we're just repeating history here. Volume, silicon technology, and market will again trump elegance and computing model.



PostScript: According to Bloomberg, look for a demo at Intel Developer Forum next week.

Wednesday, August 11, 2010

Nvidia-based Cheap Supercomputing Coming to an End

Nvidia's CUDA has been hailed as "Supercomputing for the Masses," and with good reason. Amazing speedups on scientific / technical code have been reported, ranging from a mere 10X through hundreds. It's become a darling of academic computing and a major player in DARPA's Exascale program, but performance alone is not the reason; it's price. For that computing power, they're incredibly cheap. As Sharon Glotzer of UMich noted, "Today you can get 2GF for $500. That is ridiculous." It is indeed. And it's only possible because CUDA is subsidized by sinking the fixed costs of its development into the high volumes of Nvidia's mass market low-end GPUs.

Unfortunately, that subsidy won't last forever; its end is now visible. Here's why:

Apparently ignored in the usual media fuss over Intel's next and greatest, Sandy Bridge, is the integration of Intel's graphics onto the same die as the processor chip.

The current best integration puts them in the same package, as in Clarkdale (a.k.a. Westmere), shown in the photo on the right. As illustrated, the processor is in 32nm silicon technology, and the graphics, with the memory controller, is in 45nm silicon technology. Yes, the graphics and memory controller is the larger chip.

Intel has not been touting higher graphics performance from this tighter integration. In fact, Intel's press releases for Clarkdale claimed that being on two dies wouldn't reduce performance because they were in the same package. But unless someone has changed the laws of physics as I know them, that's simply false; at a minimum, eliminating off-chip drivers will reduce latency substantially. Also, being on the same die as the processor implies the same process, so graphics (and memory control) goes all the way from 45nm to 32nm, the same as the processor, in one jump; this certainly will also result in increased performance. For graphics, this is a very loud Intel "Tock" in its "Tick-Tock" (silicon / architecture) alternation.

So I'll semi-fearlessly predict some demos of midrange games out of Intel when Sandy Bridge is ready to hit the streets, which hasn't been announced in detail aside from being in 2011.

Probably not coincidentally, mid-2011 is when AMD's Llano processor sees daylight. Also in 32nm silicon, it incorporates enough graphics-related processing to be an apparently decent DX11 GPU, although to my knowledge the architecture hasn't been disclosed in detail.

Both of these are lower-end units, destined for laptops, and intent on keeping a tight power budget; so they're not going to run high-end games well or be a superior target for HPC. It seems that they will, however, provide at least adequate low-end, if not midrange, graphics.

Result: All of Nvidia's low-end market disappears by the end of next year.

As long as passable performance is provided, integration into the processor equates with "free," and you can't beat free. Actually, it equates with cheaper than free, since there's one less chip to socket onto the motherboard, eliminating socket space and wiring costs. The power supply will probably shrink slightly, too.

This means the end of the low-end graphics subsidy of high-performance GPGPUs like Nvidia's CUDA. It will have to pay its own way, with two results:

First, prices will rise. It will no longer have a huge advantage over purpose-built HPC gear. The market for that gear is certainly expanding. In a long talk at the 2010 ISC in Berlin, Intel's Kirk Skaugen (VP of the Intel Architecture Group and GM, Data Center Group, USA) stated that HPC was now 25% of Intel's revenue – a number double the HPC market I last heard a few years ago. But larger doesn't mean it has anywhere near the volume of low-end graphics.

DARPA has pumped more money in, with Nvidia leading a $25M chunk of DARPA's Exascale project. But that's not enough to stay alive. (Anybody remember Thinking Machines?)

The second result will be that Nvidia becomes a much smaller company.

But for users, it's the loss of that subsidy that will hurt the most. No more supercomputing for the masses, I'm afraid. Intel will have MIC (son of Larrabee); that will have a partial subsidy since it probably can re-use some X86 designs, but that's not the same as large low-end sales volumes.

So enjoy your "supercomputing for the masses," while it lasts.