Monday, July 12, 2010
Who Does the Shoe Fit? Functionally Decomposed Hardware (GPGPUs) vs. Multicore.
This post is a long reply to the thoughtful comments on my post WNPoTs and the Conservatism of Hardware Development that were made by Curt Sampson and Andrew Richards. The issue is: Is functionally decomposed hardware, like a GPU, much harder to deal with than a normal multicore (SMP) system? (It's delayed. Sorry. For some reason I ended up in a mental deadlock on this subject.)
I agree fully with Andrew and Curt that using functionally decomposed hardware can be straightforward if the hardware performs exactly the function you need in the program. If it does not, massive amounts of ingenuity may have to be applied to use it. I've been there and done that, trying at one point to make some special-purpose highly-parallel hardware simulation boxes do things like chip wire routing or more general computing. It required much brain twisting and ultimately wasn't that successful.
However, GPU designers have been particularly good at making this match. Andrew made this point very well in a video'd debate over on Charlie Demerjian's SemiAccurate blog: Last minute changes that would be completely anathema to GP designs are apparently par for the course with GPU designs.
The embedded systems world has been dealing with functionally decomposed hardware for decades. In fact, a huge part of their methodology is devoted to figuring out where to put a hardware-software split to match their requirements. Again, though, the hardware does exactly what's needed, often through last-minute FPGA-based hardware modifications.
However, there's also no denying that the mainstream of software development, all the guys who have been doing Software Engineering and programming system design for a long time, really don't have much use for anything that's not an obvious Turing Machine onto which they can spin off anything they want. Traditional schedulers have a rough time with even clock speed differences. So, for example, traditional programmers look at Cell SPUs, with their manually-loaded local memory, and think they're misbegotten spawn on the devil or something. (I know I did initially.)
This train of thought made me wonder: Maybe traditional cache-coherent MP/multicore actually is hardware specifically designed for a purpose, like a GPU. That purpose is, I speculate, transaction processing. This is similar to a point I raised long ago in this blog (IT Departments Should NOT Fear Multicore), but a bit more pointed.
Don't forget that SMPs have been around for a very long time, and practically from their inception in the early 1970s were used transparently, with no explicit parallel programming and code very often written by less-than-average programmers. Strongly enabling that was a transaction monitor like IBM's CICS (and lots of others). All code is written as a relatively small chunk (debit this account) (and the cash on hand, and total cash in a bank…). That chunk is automatically surrounded by all locking it needs, called by the monitor when a customer implicitly invokes it, and can be backed out as needed either by facilities built into the monitor or by a back-end database system.
It works, and it works very well right up to the present, even with programmers so bad it's a wonder they don't make the covers fly off the servers. (OK, only a few are that bad, but the point is that genius is not required.)
Of course, transaction monitors aren't a language or traditional programming construct, and also got zero academic notice except perhaps for Jim Gray. But they work, superbly well on SMP / multicore. They can even work well across clusters (clouds) as long as all data is kept in a separate backend store (perhaps logically separate), which model, by the way, is the basis of a whole lot of cloud computing.
Attempts to make multicores/SMPs work in other realms, like HPC, have been fairly successful but have always produced cranky comments about memory bottlenecks, floating-point performance, how badly caches fit the requirements, etc., comments you don't hear from commercial programmers. Maybe this is because it was designed for them? That question is, by the way, deeply sarcastic; performance on transactional benchmarks (like TPC's) are the guiding light and laser focus of most larger multicore / SMP designs.
So, overall, this post makes a rather banal point: If the hardware matches your needs, it will be easy to use. If it doesn't, well, the shoe just doesn't fit, and will bring on blisters. However, the observation that multicore is actually a special purpose device, designed for a specific purpose, is arguably an interesting perspective.