Monday, July 12, 2010

Who Does the Shoe Fit? Functionally Decomposed Hardware (GPGPUs) vs. Multicore.

This post is a long reply to the thoughtful comments on my post WNPoTs and the Conservatism of Hardware Development that were made by Curt Sampson and Andrew Richards. The issue is: Is functionally decomposed hardware, like a GPU, much harder to deal with than a normal multicore (SMP) system? (It's delayed. Sorry. For some reason I ended up in a mental deadlock on this subject.)

I agree fully with Andrew and Curt that using functionally decomposed hardware can be straightforward if the hardware performs exactly the function you need in the program. If it does not, massive amounts of ingenuity may have to be applied to use it. I've been there and done that, trying at one point to make some special-purpose highly-parallel hardware simulation boxes do things like chip wire routing or more general computing. It required much brain twisting and ultimately wasn't that successful.

However, GPU designers have been particularly good at making this match. Andrew made this point very well in a videoed debate over on Charlie Demerjian's SemiAccurate blog: Last-minute changes that would be completely anathema to general-purpose designs are apparently par for the course with GPU designs.

The embedded systems world has been dealing with functionally decomposed hardware for decades. In fact, a huge part of their methodology is devoted to figuring out where to put a hardware-software split to match their requirements. Again, though, the hardware does exactly what's needed, often through last-minute FPGA-based hardware modifications.

However, there's also no denying that the mainstream of software development, all the guys who have been doing Software Engineering and programming system design for a long time, really don't have much use for anything that's not an obvious Turing Machine onto which they can spin off anything they want. Traditional schedulers have a rough time with even clock speed differences. So, for example, traditional programmers look at Cell SPUs, with their manually-loaded local memory, and think they're misbegotten spawn of the devil or something. (I know I did initially.)

This train of thought made me wonder: Maybe traditional cache-coherent MP/multicore actually is hardware specifically designed for a purpose, like a GPU. That purpose is, I speculate, transaction processing. This is similar to a point I raised long ago in this blog (IT Departments Should NOT Fear Multicore), but a bit more pointed.

Don't forget that SMPs have been around for a very long time, and practically from their inception in the early 1970s were used transparently, with no explicit parallel programming and with code very often written by less-than-average programmers. What strongly enabled that was a transaction monitor like IBM's CICS (and lots of others). All code is written as a relatively small chunk (debit this account, and the cash on hand, and total cash in a bank…). That chunk is automatically surrounded by all the locking it needs, is called by the monitor when a customer implicitly invokes it, and can be backed out as needed either by facilities built into the monitor or by a back-end database system.
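
As a rough illustration of that pattern, here is a toy sketch in Python. It is not how CICS or any real transaction monitor is implemented, and every name in it (`TransactionMonitor`, `run`, `debit`, the account data) is made up for the example. It assumes one coarse lock and an in-memory snapshot for backing out. The point is only that the application chunk contains no locking, threading, or recovery code; the monitor supplies all of that.

```python
import copy
import threading


class TransactionMonitor:
    """Toy monitor: runs each handler under a lock and undoes it on failure."""

    def __init__(self, state):
        self.state = state
        self._lock = threading.Lock()  # one coarse lock, for simplicity

    def run(self, handler, *args):
        with self._lock:  # locking is supplied by the monitor
            snapshot = copy.deepcopy(self.state)
            try:
                handler(self.state, *args)  # the application "chunk"
                return True
            except Exception:
                # Back out the partial update by restoring the snapshot.
                self.state.clear()
                self.state.update(snapshot)
                return False


def debit(state, account, amount):
    """Application code: no locks, no threads, no recovery logic."""
    state["cash_on_hand"] -= amount
    if state["accounts"][account] < amount:
        raise ValueError("insufficient funds")
    state["accounts"][account] -= amount


monitor = TransactionMonitor({"accounts": {"alice": 100}, "cash_on_hand": 1000})

# Many "customers" invoke the same chunk concurrently.
threads = [
    threading.Thread(target=monitor.run, args=(debit, "alice", 30))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Of the eight attempted debits only three can succeed. The other five fail partway through, after touching the cash on hand, and the monitor backs them out, so the two totals stay consistent with each other no matter how the threads interleave.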

It works, and it works very well right up to the present, even with programmers so bad it's a wonder they don't make the covers fly off the servers. (OK, only a few are that bad, but the point is that genius is not required.)

Of course, transaction monitors aren't a language or traditional programming construct, and also got zero academic notice except perhaps from Jim Gray. But they work superbly well on SMP / multicore. They can even work well across clusters (clouds) as long as all data is kept in a separate backend store (perhaps logically separate), which model, by the way, is the basis of a whole lot of cloud computing.

Attempts to make multicores/SMPs work in other realms, like HPC, have been fairly successful but have always produced cranky comments about memory bottlenecks, floating-point performance, how badly caches fit the requirements, etc., comments you don't hear from commercial programmers. Maybe this is because the hardware was designed for them? That question is, by the way, deeply sarcastic; performance on transactional benchmarks (like TPC's) is the guiding light and laser focus of most larger multicore / SMP designs.

So, overall, this post makes a rather banal point: If the hardware matches your needs, it will be easy to use. If it doesn't, well, the shoe just doesn't fit, and will bring on blisters. However, the observation that multicore is actually a special purpose device, designed for a specific purpose, is arguably an interesting perspective.


Anonymous said...

That's certainly a perspective I like, and serves as a nice way of finding the common purpose amongst a number of different processor technologies: not only cache coherency, but out of order execution, branch prediction, and similar things. They all provide a sort of "transactional image" to the programmer, as it were.

I think that from the point of view of getting things right on various sorts of concurrent systems, most programmers are really bad most of the time. This is why we have someone smart and having a good day write the transaction monitor once, and the rest of us just use it. I don't see the hardware designers as doing anything different: they're dealing with the same sort of issues, and providing an interface to the application developer that deals with some of the hard stuff; that it's "hardware" rather than "software" is fairly irrelevant.

Andrew Richards said...

I agree totally. I also found it an unexpected conclusion. It's a big problem for multicore: we haven't found a general-purpose solution, whereas von Neumann single-core designs were a great generic solution that solved a lot of problems.

I think we need to study the way we access data. One of those ways is via transactions, and having a general-purpose transaction model in hardware (or hardware+software+OS) would be useful, I think. There seems to be a lack of research in this area that takes into account real problems. Graphics rasterization can be seen as a transactional model, I think. But the memory systems required for these different transactions may be different. But not all memory accesses are transactions.

You also bring up the other issue (and so did Tim Sweeney) that without a more generic solution, it's hard for software developers to innovate. That's dangerous for our industry.

Anonymous said...

I disagree.

My experience is that programmers often say that the problem they are trying to solve is 'impossible' to do with many cores or a GPU and just stop there, without actually thinking about what they are trying to achieve.

Sure, if you always want to have everything in C++ with an OOP hierarchy of objects that can modify global state, then it will be hard or impossible.

Instead, try looking at it with questions like: "What data do I actually need to come out from this code?" "Can I reorganize my data so it fits the hardware better?" "Can I divide my problem into small steps that modify smaller bits of data each time?"

I don't see many taking the steps above, but when I have tried this myself I have seen that many problems can be divided into subproblems that come down to just having code that transforms data from one state to another (which is what all code really does anyway). The code often becomes simpler and easier to test (as it does one thing instead of lots of things at the same time).

I'm not saying that everything will fit on SMP/GPU, but without rethinking what you are actually trying to achieve you are throwing away a big chance to both improve the performance of your code and make it easier to understand.
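
A minimal sketch of what those questions can lead to, in Python, under stated assumptions: the "particles" data, the `advance` and `split` helpers, and the choice of four workers are all invented for the illustration. The data is a flat list rather than an object graph, the step is a pure function from one state to the next, and the work divides into independent pieces. A thread pool is used only to show the structure; in CPython, threads will not by themselves speed up CPU-bound code like this.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def advance(chunk, dt):
    """Pure transform: (position, velocity) pairs in, new positions out."""
    return [pos + vel * dt for pos, vel in chunk]


def split(items, parts):
    """Divide the data into contiguous, independent pieces."""
    size = (len(items) + parts - 1) // parts
    return [items[i:i + size] for i in range(0, len(items), size)]


particles = [(float(i), 2.0) for i in range(10)]  # flat data, no global state

# The same function serves the sequential and the decomposed version.
sequential = advance(particles, 0.5)

with ThreadPoolExecutor(max_workers=4) as pool:
    pieces = pool.map(partial(advance, dt=0.5), split(particles, 4))
    parallel = [p for piece in pieces for p in piece]
```

Because `advance` reads only its inputs and returns new data, each piece can be tested on its own, and the decomposed result can be checked directly against the sequential one.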

Greg Pfister said...

Hi, Anon.

I don't think you're so much disagreeing as bringing up a different viewpoint not discussed in this post.

I certainly agree with you that it's necessary to analyze what's going on in terms of independent operations and parallelism, and necessary to do so first, so it's the basis of the structure of the code. Doing it in reverse is at best not good practice.

In fact, if I'd thought of it, I'd have added that to the post, since it actually dovetails very nicely, with this connection:

There are a huge number of ways of doing that parallel decomposition, some much more suited to the problem itself than others are. If the hardware most naturally supports decomposition X (like, say, SIMD on many GPUs or transactions on SMPs), and X really doesn't fit the problem "well" (whatever that means), you can be beating your head against a wall trying to find a way to use X. The shoe just doesn't fit well.

On the other hand, I don't think it's possible to come up with a parallel decomposition in a vacuum; you're always thinking of some data flow or independence of some kind. So there's a two-edged sword here -- you have to be concrete, but it's not wise to be religiously married to one specific type of decomposition.

Anonymous said...

If designing a chip for a specific purpose is a good idea, I was wondering what you thought of the DOE idea of co-design: asking vendors to tailor hardware for specific scientific applications?

Maybe this will stop the complaints when an SMP is used for HPC?

- A different anon

Greg Pfister said...

Particularly for their exascale target, which is a major challenge in multiple dimensions, I think it's a great idea.

It seems like they're asking for the right kind of mix, too -- many more software / application people than hardware guys. I hope the actual teams funded keep those proportions, which are, I've heard, the proportions that exist in GPU companies (something Intel is discovering).
