Tuesday, June 8, 2010

Ten Ways to Trash your Performance Credibility


Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming.

Danger: The showiest blooms are toxic to your credibility.

(My wife is planting flowers these days. Can you tell?)

There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:

"I would have to be pretty drunk to believe that."

Your great, customer-verified, most wonderful results have trashed your credibility.

Here are some examples:

In a recent talk, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.

In an online discussion on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a Quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under non-disclosure agreement.

In a video interview, a customer doing cell pathology image analysis reports their task going from 400 minutes to 65 milliseconds, for a speedup of just under 370,000X. (Update: I originally typed "seconds" here; he really does say "minutes" in the video.)

None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find him- or herself dismissed as, well, the polite version would be that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.

What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.

This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler without vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.

The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related, modernized version of this story is told in a recent paper, "Believe It or Not! Multi-core CPUs can Match GPU Performance," where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's using, great things can happen.

This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that Power7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution and, using that information, present your results giving credit where it's actually due.

The situation we're in is identical to the one that led someone (I wish I remembered who), decades ago, to write a short paper titled, approximately, Ten Ways to Lie about Parallel Processing. I thought I kept a copy, but if I did I can't find it. It was back at the dawn of whatever, and even Google Scholar can't find it now. (If anyone out there knows the paper I'm referencing, please let me know.) Update: Got it! It's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, by David H. Bailey. Thank you, Roland!

In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:

  1. Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task. (See the sketch after this list.)
  2. Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,
  3. Pick a problem size that just barely fits into the local memory you have available. Why? See #4.
  4. Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not.
  5. Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.
  6. Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.
  7. Allow a slightly different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't. (Also shown in the sketch below.)
  8. Change the operating system. Pick the one that does IO to your device fastest.
  9. Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!
  10. Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.
This, of course, is just a start. I'm sure there are another ten or a hundred out there.
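
To make items 1 and 7 concrete, here's a minimal Python sketch. The 5% / 100X figures are purely hypothetical, not taken from any of the cases above, and the floating-point values are chosen only to make the rounding visible:

    # Item 1: a big kernel speedup is not a big whole-task speedup (Amdahl's Law).
    kernel_fraction = 0.05   # hypothetical: the kernel is 5% of the whole-task run time
    kernel_speedup = 100.0   # hypothetical: the accelerator makes that kernel 100X faster

    whole_task_speedup = 1.0 / ((1.0 - kernel_fraction)
                                + kernel_fraction / kernel_speedup)
    print(f"{kernel_speedup:.0f}X on the kernel is only "
          f"{whole_task_speedup:.2f}X on the whole task")   # about 1.05X

    # Item 7: A*(X+Y) is not always A*X + A*Y in floating point.
    a, x, y = 3.0, 2.0**53, 1.0            # 2**53 + 1 is not representable as a double
    print(a * (x + y))                     # the +1 is lost when x + y rounds
    print(a * x + a * y)                   # the +3 survives the final addition
    print(a * (x + y) == a * x + a * y)    # False

The glorious 100X on the kernel becomes roughly 1.05X end to end, and the two "equal" expressions print different numbers.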

A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes.

For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list before you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.
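
For what it's worth, here's a minimal sketch of the kind of stage-by-stage accounting I mean, with entirely made-up numbers; the point is simply that the total gets factored into the changes that produced it:

    # Entirely made-up numbers, just to show the shape of an honest breakdown.
    changes = [
        ("Cleaned up the original code (still serial, still on the old CPU)", 20.0),
        ("Restructured the data layout and vectorized",                        3.0),
        ("Moved the hot loop onto the accelerator",                            5.0),
    ]

    total = 1.0
    for description, factor in changes:
        total *= factor
        print(f"{factor:5.1f}X  {description}")
    print(f"{total:5.1f}X  total -- and only the last line is the new hardware")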

*****************

Thanks to John Melonakos of Accelereyes for discussion and sharing his thoughts on this topic.

15 comments:

bank kus said...

Hi Greg, to better clarify the above 10 points, would you be willing to provide links to talks/presentations that claim large speedups using CUDA but make the above mistakes?

I think there are genuine 40x speedup stories (not inner-kernel speedups, but work/second speedups) using CUDA, and the $1200 price is also quite real :-) The #2 system in Top500 uses GPUs, btw.

Greg Pfister said...

bank -

Links: I know providing them would make things more concrete, but I really don't want, any more than really necessary, to directly stick my finger in people's eyes by listing them in what would amount to a hall of shame.

For example, I saw a case of #1 internally in IBM (not a GPGPU, but an accelerator), and when the folks working on it realized they were grinding at only 5%, some really hurt feelings were expressed rather vehemently.

I really am trying to be constructive here. And somewhat amusing.

(On the other hand, if you happen to have a few, let me know for future reference. :-) )

I'm not disputing that there are genuine, really good-sized speedups with GPUs or other hardware accelerators. They happen, they're real, and they are directly due to a good match between the hardware and the problem, not to extraneous factors.

I'm just trying to point out that some people are making fools of themselves without realizing it -- by saying silly things, and on the other side: believing silly things. Maybe a little education and listing of ways to avoid that can help.

Greg Pfister said...

Oh, and by the way, thanks for the comment!

Anonymous said...

Greg - I suspect you are thinking of David Bailey's classic:

http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf

niall

Greg Pfister said...

Thanks, niall. Someone else just found it too; I've updated the text. I agree, it's a classic. Well worth reading.

Anonymous said...

Hi Greg,

I've noticed cases where parallelising something on N (multicore) processors gives a speedup of greater than N - and nothing else was changed. When this happens it means that a faster sequential algorithm is available; the interleaving of the parallel tasks happened to be better than the ordering of the tasks in the sequential version.

Thanks.

melonakos said...

Great post as always!

Noticed a math error, which adds to the humor of this post, because math errors are probably more problematic than we realize. In the cell pathology case, the speedup is 400s / 0.065s = ~6,100X (not 370,000X - I think you had an extraneous multiply by 60 in there).

Also note that in the cell pathology case, the authors explicitly list that vectorization played a huge role in the speedup, to remove doubt as to whether or not the accelerator was solely responsible. And the final 65ms equates to roughly 15 fps, which is in the realm of feasibility for GPUs.

Anonymous said...

I took an optimization class once where we had to take an existing piece of open source software, optimize it, and show off our results. We thought we had it made. Simply by tweaking the compiler flags, we could show great improvement.

Halfway through the project, we were told: "Oh, btw, you will be judged by a couple of compiler engineers I know. They will judge your improvements against the original code compiled with the best compiler optimizations they know, on the very latest hardware."

Crap.

Most improvements came from using OpenMP, clarifying some code, etc. In most cases, the gains were decent, but not dramatic. One problem, of course, was that we had all picked popular open source packages, which were likely already at least not crap.

-Dave

Anonymous said...

The typo @melonakos mentions wasn't numerical - the interview did say 400 minutes to 0.065 seconds.

However, as Greg says, there was a lot more than just moving from CPU to GPU going on. The speedup from redoing the code from serial to parallel was 1,200 times (400 minutes to 20 seconds), then there was a further 300x increase from the GPU to get to sub-second times.

The "pathologist in a box" thing they mention is a bit much, though.

Greg Pfister said...

paulbone.au -

Yes, legitimate cases of superlinear speedup do happen. The ones I've heard of are explained by the increase in local/cache memory associated with each processor. The problem data all fit in N local memories, but didn't fit in 1/Nth that much. The faster SRAM in local memory beat out the references to DRAM otherwise needed. I'm sure there are other cases than that, but it's one I know.

Thanks for contributing!

Greg Pfister said...

John and anon -

I re-checked the video, and indeed the original run time was 400 minutes, not seconds. So I did have a typo, but not a math error.

About the parallelization, yes, it certainly did help, and the additional GPGPU speedup was not exactly tiny.

But listen to the interview: What they are understandably floored by is the total change -- 400 minutes to 65 ms. Or nearly *7* *hours* to 65 ms. Yow. That's the short summary they will forever talk about.

melonakos said...

Ah yes, makes sense now. The video says the vectorization dropped it from 400min to 20s and the GPU dropped it from 20s to 35ms. Certainly a lot to swallow, especially from such a short video clip without any background. Wonderful to see someone so giddy though about parallelism!

Greg Pfister said...

A friend of mine who wishes to stay anonymous sent me the following, which takes the whole thing to previously unscaled heights of silliness, and probably truth:

*********************

Consider Phil, the trusted employee. To do his run, Phil used to travel from his office to the computer's location. In the case of Cray's claim, Phil had to fly with connections and not begin work until the next morning. For the cell pathology work, Phil could catch a nonstop flight.

The new hardware for his work is significantly faster, but far more important is that it was installed in Phil's office.

No more trips.

Management, measuring the time it takes Phil to produce his results, now sees answers immediately. They measure this change in time. They also see a significant cost savings in the department's budget.

They conclude that buying thousands of these computers would result in cost savings exceeding the company's total revenues in just a few days, and reward themselves for being such brilliant leaders with additional stock options.

But Phil is no longer able to visit Sally (Cray trips) and Karen (cell pathology trips) anywhere near as often. Thus the new hardware has significantly decreased Phil's job satisfaction.

****************

I'll add that the energy savings are also superior, so the company gets to put a big green leaf on their web pages, impressing everybody and raising the stock price.

Paul Bone said...

Greg,

Yes, I didn't think of that situation. The situation I've seen is where a garbage collector is involved but only uses one core, while the 'mutator' uses 4 cores. When the garbage collector stops the world it trashes a single cache (1/4th of the mutator's cache). Therefore when the mutator resumes it incurs a number of cache misses, but only on one of the 4 cores. When not compiled for parallel execution the mutator and GC run on the same core, so when the GC stops the world all of the mutator's cached data is trashed.

zorba said...

Thanks for the article. I was involved in performance engineering of a file system. I agree it's difficult to come up with a convincing performance gain/loss along with the cause! Thanks for the classic paper too.
