Comments on "The Perils of Parallel: Ten Ways to Trash your Performance Credibility" by Greg Pfister (15 comments)

zorba (2010-06-28 22:43)
Thanks for the article. I was involved in performance engineering of an FS, and I agree it's difficult to come up with a convincing performance gain/loss along with its cause! Thanks for the classic paper too.

Paul Bone (2010-06-24 17:53)
Greg,

Yes, I didn't think of that situation. The situation I've seen is where a garbage collector is involved but only uses one core, while the "mutator" uses 4 cores. When the garbage collector stops the world, it trashes a single cache (1/4 of the mutator's cache). Therefore when the mutator resumes it incurs a number of cache misses, but only on one of the 4 cores. When not compiled for parallel execution, the mutator and GC run on the same core, so when the GC stops the world, all of the mutator's cached data is trashed.

Greg Pfister (2010-06-10 16:04)
A friend of mine who wishes to stay anonymous sent me the following, which takes the whole thing to previously unscaled heights of silliness, and probably truth:

*********************

Consider Phil, the trusted employee. To do his run, Phil used to travel from his office to the computer's location. In the case of Cray's claim, Phil had to fly with connections and not begin work until the next morning. For the cell pathology work, Phil could catch a nonstop flight.

The new hardware for his work is significantly faster, but far more important is that it was installed in Phil's office.

No more trips.

Management, measuring the time it takes Phil to produce his results, now sees answers immediately. They measure this change in time. They also see a significant cost savings in the department's budget.

They conclude that buying thousands of these computers would result in cost savings exceeding the company's total revenues in just a few days, and reward themselves for being such brilliant leaders with additional stock options.

But Phil is no longer able to visit Sally (Cray trips) and Karen (cell pathology trips) anywhere near as often. Thus the new hardware has significantly decreased Phil's job satisfaction.

****************

I'll add that the energy savings are also superior, so the company gets to put a big green leaf on their web pages, impressing everybody and raising the stock price.

melonakos (2010-06-10 05:50)
Ah yes, makes sense now. The video says the vectorization dropped it from 400 min to 20 s, and the GPU dropped it from 20 s to 35 ms. Certainly a lot to swallow, especially from such a short video clip without any background. Wonderful to see someone so giddy about parallelism, though!

Greg Pfister (2010-06-09 20:01)
John and anon -

I re-checked the video, and indeed the original run time was 400 minutes, not seconds. So I did have a typo, but not a math error.

About the parallelization, yes, it certainly did help, and the additional GPGPU speedup was not exactly tiny.

But listen to the interview: what they are understandably floored by is the total change -- 400 minutes to 65 ms. Or nearly *7* *hours* to 65 ms. Yow. That's the short summary they will forever talk about.

Greg Pfister (2010-06-09 19:48)
paulbone.au -

Yes, legitimate cases of superlinear speedup do happen. The ones I've heard of are explained by the increase in local/cache memory associated with each processor: the problem data all fit in N local memories, but didn't fit in 1/Nth that much. The faster SRAM in local memory beat out the references to DRAM otherwise needed. I'm sure there are other cases than that, but it's one I know.

Thanks for contributing!

Anonymous (2010-06-09 17:19)
The typo @melonakos mentions wasn't numerical - the interview did say 400 minutes to 0.065 seconds.

However, as Greg says, there was a lot more than just moving from CPU to GPU going on. The speedup from redoing the code from serial to parallel was 1200x (400 minutes to 20 seconds), then roughly another 300x to sub-second times with the GPU.

The "pathologist in a box" thing they mention is a bit much, though.

Anonymous (2010-06-09 15:17)
I took an optimization class once where we had to take an existing piece of open-source software, optimize it, and show off our results. We thought we had it made: simply by tweaking the compiler flags, we could show great improvement.

Halfway through the project, we were told, "Oh, by the way, you will be judged by a couple of compiler engineers I know. They will judge your improvements against the original code compiled with the best compiler optimizations they know, on the very latest hardware."

Crap.

Most improvements came from using OpenMP, clarifying some code, etc. In most cases, the gains were decent, but not dramatic. One problem, of course, was that we all picked popular open-source packages, which were likely to already at least not be crap.

-Dave

melonakos (2010-06-09 06:34)
Great post as always!

Noticed a math error, which adds to the humor of this post, because math errors are probably more problematic than we realize. In the cell pathology case, the speedup is 400 s / 0.065 s = ~6,100X (not 370,000X - I think you had an extraneous multiply by 60 in there).

Also note that in the cell pathology case, the authors explicitly list that vectorization played a huge role in the speedup, to remove doubt as to whether or not the accelerator was solely responsible. And the final 65 ms equates to roughly 15 fps, which is in the realm of feasibility for GPUs.

Anonymous (2010-06-09 00:46)
Hi Greg,

I've noticed cases where parallelising something on N (multicore) processors gives a speedup of greater than N - and nothing else was changed. When this happens it means that a faster sequential algorithm is available; the interleaving of the parallel tasks was better than the ordering of tasks in the sequential version.

Thanks.

Greg Pfister (2010-06-08 23:55)
Thanks, niall. Someone else just found it too; I've updated the text. I agree, it's a classic. Well worth reading.

Anonymous (2010-06-08 23:53)
Greg - I suspect you are thinking of David Bailey's classic:

http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf

niall

Greg Pfister (2010-06-08 22:55)
Oh, and by the way, thanks for the comment!

Greg Pfister (2010-06-08 22:54)
bank -

Links: I know providing them would make things more concrete, but I really don't want, any more than necessary, to stick my finger directly in people's eyes by listing them in what would amount to a hall of shame.

For example, I saw a case of #1 internally at IBM (not a GPGPU, but an accelerator), and when the folks working on it realized they were grinding at only 5%, some really hurt feelings were expressed rather vehemently.

I really am trying to be constructive here. And somewhat amusing.

(On the other hand, if you happen to have a few, let me know for future reference. :-) )

I'm not disputing that there are genuine, really good-sized speedups with GPUs or other hardware accelerators. They happen, they're real, and they are directly due to good hardware-problem match-ups, not extraneous factors.

I'm just trying to point out that some people are making fools of themselves without realizing it -- by saying silly things, and, on the other side, believing silly things. Maybe a little education and a list of ways to avoid that can help.

bank kus (2010-06-08 22:35)
Hi Greg, to better clarify the above 10 points, would you be willing to provide links to talks/presentations that claim large speedups using CUDA but make the above mistakes?

I think there are genuine 40x speedup stories (not inner-kernel speedups but work/second speedups) using CUDA, and the price being $1200 is also quite real :-) The #2 system in the Top500 uses GPUs, btw.
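[Editor's note: the speedup arithmetic debated in the thread above can be checked with a short sketch. This is mine, not from any commenter; it assumes the figures quoted in the thread -- an original run time of 400 minutes (vs. the misread 400 seconds), a 20-second vectorized time, and a final time of 65 ms.]

```python
# Check the speedup figures quoted in the comment thread.
# All times are converted to seconds before dividing.

def speedup(before_s, after_s):
    """Ratio of old run time to new run time."""
    return before_s / after_s

total    = speedup(400 * 60, 0.065)  # 400 minutes -> 65 ms (Greg's reading)
misread  = speedup(400, 0.065)       # 400 seconds -> 65 ms (the misreading)
vector   = speedup(400 * 60, 20)     # 400 minutes -> 20 s (vectorization)
gpu      = speedup(20, 0.065)        # 20 s -> 65 ms (GPU step)

print(round(total))    # ~369,231, i.e. roughly the 370,000x figure
print(round(misread))  # ~6,154, i.e. the "~6,100X" figure
print(round(vector))   # 1200, the 1200x from vectorization
print(round(gpu))      # ~308, the "roughly 300x" GPU step
```

Note that the two stage speedups multiply back to the total (1200 x ~308 = ~369,231), so "400 minutes to 65 ms = ~370,000x" is internally consistent; the ~6,100X number only follows if the original time was 400 seconds.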