Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming.
Danger: The showiest blooms are toxic to your credibility.
(My wife is planting flowers these days. Can you tell?)
There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:
"I would have to be pretty drunk to believe that."
Your great, customer-verified, most wonderful results have trashed your credibility.
Here are some examples:
In a recent talk, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.
In an online discussion on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under a non-disclosure agreement.
In a video interview, a customer doing cell pathology image analysis reports their task going from 400 minutes to 65 milliseconds, for a speedup of just under 370,000X. (Update: Typo, he really does say "minutes" in the video.)
None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find themselves dismissed; the polite version is that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.
What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.
This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler without vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.
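To make that concrete, here's a minimal C analogue of that cleanup step; it's my own sketch, not code from any of the systems mentioned here. The first loop may never vectorize because the compiler must assume the three arrays could overlap; the "cleaned up" second version uses C99's restrict to promise that they don't:

```c
#include <stdio.h>

/* The grotty version: the compiler has to assume out, a, and b
 * might overlap in memory, which can block auto-vectorization. */
void scale_add(float *out, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + 2.0f * b[i];
}

/* The cleaned-up version: restrict promises no overlap, so the
 * compiler is free to use the vector (SIMD) hardware. */
void scale_add_clean(float *restrict out, const float *restrict a,
                     const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + 2.0f * b[i];
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float out[8];

    scale_add(out, a, b, 8);
    scale_add_clean(out, a, b, 8);
    printf("out[0] = %g\n", out[0]);   /* 1 + 2*8 = 17 */
    return 0;
}
```

And the third step from the story applies here too: time the cleaned-up version with vectorization turned off (for example, GCC's -fno-tree-vectorize) before crediting the special hardware with the whole win.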
The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related, modernized version of this story is told in a recent paper, Believe It or Not! Multi-core CPUs can Match GPU Performance, where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's running on, great things can happen.
This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that Power7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution and, when you present your results, give each of those factors its due credit.
The situation we're in is identical to the one that led someone (wish I remembered who), decades ago, to write a short paper titled, approximately, Ten Ways to Lie about Parallel Processing. I thought I kept a copy, but if I did I can't find it, even with Google Scholar. (If anyone out there knows the paper I'm referencing, please let me know.) Got it! It's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, by David H. Bailey. Thank you, Roland!
In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:
1. Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task. (See the first sketch after this list for the arithmetic.)
2. Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,
3. Pick a problem size that just barely fits into the local memory you have available. Why? See #4.
4. Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not. (See the second sketch below.)
5. Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.
6. Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.
7. Allow a slightly different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't. (See the third sketch below.)
8. Change the operating system. Pick the one that does IO to your device fastest.
9. Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!
10. Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.
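On item #1, the arithmetic is brutal and worth running once. Here's a quick C sketch, using the hypothetical 5%-kernel figure from the list rather than any real measurement, of what Amdahl's Law says a 100X kernel speedup buys you overall:

```c
#include <stdio.h>

/* Amdahl's Law: overall = 1 / ((1 - f) + f / s), where f is the
 * kernel's fraction of total time and s is the kernel speedup. */
int main(void)
{
    double f = 0.05;    /* kernel is 5% of the whole task */
    double s = 100.0;   /* the shiny number on the slide  */

    double overall = 1.0 / ((1.0 - f) + f / s);
    printf("overall speedup: %.3fX\n", overall);   /* ~1.052X */
    return 0;
}
```

A 100X kernel speedup nets you about 1.05X on the whole task. That's the number that belongs in the talk.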
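On item #4, the honest timing envelope includes device setup and both transfers. The sketch below is mine, with no-op stub functions whose names (init_device, copy_to_device, and so on) are made up, standing in for whatever accelerator API you actually use; the point is only where the clock starts and stops:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* No-op stubs standing in for a real accelerator API; in real
 * code each of these costs real wall-clock time. */
static void init_device(void)                         { }
static void copy_to_device(const double *d, size_t n) { (void)d; (void)n; }
static void run_kernel(size_t n)                      { (void)n; }
static void copy_from_device(double *d, size_t n)     { (void)d; (void)n; }

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t n = 1 << 20;
    double *data = malloc(n * sizeof *data);

    double t0 = seconds();
    init_device();               /* setup cost, often left off the slide */
    copy_to_device(data, n);     /* PCI Express transfer in...           */
    double t1 = seconds();
    run_kernel(n);               /* the only part usually timed          */
    double t2 = seconds();
    copy_from_device(data, n);   /* ...and PCI Express transfer back out */
    double t3 = seconds();

    printf("kernel only:      %g s\n", t2 - t1);
    printf("whole round trip: %g s\n", t3 - t0);
    free(data);
    return 0;
}
```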
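And on item #7, you don't need exotic code to watch floating point refuse to distribute. This little C program, with values I picked purely to force the rounding, evaluates both sides of A*(X+Y) = A*X + A*Y in IEEE double precision:

```c
#include <stdio.h>

/* Distributivity fails in floating point: the two "equal" forms
 * round differently with these values. */
int main(void)
{
    double a = 1e16, x = 1.0, y = -1e16;

    double left  = a * (x + y);     /* x + y rounds the 1.0 away entirely */
    double right = a * x + a * y;   /* a*x survives and shifts the sum    */

    printf("A*(X+Y)    = %.17g\n", left);
    printf("A*X + A*Y  = %.17g\n", right);
    printf("difference = %.17g\n", right - left);
    return 0;
}
```

The two results differ by one unit in the last place of 1e32, which is about 1.8e16 in absolute terms. "Slightly different answer" indeed.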
A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes.
For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list before you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.
Thanks to John Melonakos of Accelereyes for discussion and sharing his thoughts on this topic.