The Perils of Parallel

Nice recap. The Intel Developer Forum is a really ...

2014-04-15T12:30:44.163-06:00

Nice recap. The Intel Developer Forum is a really great annual trade show. There is so much to learn.

Thanks for sharing very informative information. W...

2014-03-07T05:00:13.366-07:00

Thanks for sharing such informative material. We all appreciate this blog post, and it is very helpful for those who want to make a career in information technology.

Keep sharing...

Cloud Computing training in Mexico

It was great reading your article and subsequent c...

2013-05-14T03:32:42.444-06:00

It was great reading your article and the subsequent comments and counter-comments. I am new to CUDA programming and am looking to apply it to fractal image compression, which involves extensive pattern matching of ranges against domains. I never thought that MIMD could be another option for image compression, as it involves lots of data-independent and function-independent processing.
Many thanks for a robust comparison of SIMD vs. MIMD.

2- why the lock-based solution need extra copy. in...

2013-01-08T03:57:35.290-07:00

2- Why does the lock-based solution need an extra copy? In fact, I suspect the transactional one is the one with the extra copy. All you need is a 20-bit mask marking what is locked and what is not!?
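
The 20-bit mask idea can be sketched roughly as follows. This is only an illustration in Python, not anything taken from the post: `FieldLockMask`, its methods, and the field numbering are hypothetical names. Bit i of the mask is set while field i is locked, and a caller waits until none of the bits it wants are held.

```python
import threading

class FieldLockMask:
    """Hypothetical per-field lock: bit i set means field i is locked."""

    def __init__(self, n_fields=20):
        self._n_fields = n_fields
        self._mask = 0                      # all fields start unlocked
        self._cond = threading.Condition()  # guards _mask and wakes waiters

    def acquire(self, fields):
        """Lock all the given field indices together; returns the bits taken."""
        want = 0
        for i in fields:
            if not 0 <= i < self._n_fields:
                raise ValueError(f"no such field: {i}")
            want |= 1 << i
        with self._cond:
            # Block while any wanted field is currently held by someone else.
            while self._mask & want:
                self._cond.wait()
            self._mask |= want
        return want

    def release(self, want):
        """Clear the given bits and wake anyone waiting on them."""
        with self._cond:
            self._mask &= ~want
            self._cond.notify_all()

    def locked_bits(self):
        with self._cond:
            return self._mask

locks = FieldLockMask()
held = locks.acquire([2, 7])     # lock only the two fields being modified
during = locks.locked_bits()
locks.release(held)
print(bin(during))
```

Note that this only coordinates code that goes through the mask. A reader that wants to see several fields in a mutually consistent state would have to acquire their bits as well, which is the case the copy-and-swap approach is meant to avoid.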

it is developers! hardware evolution matters less!...

2013-01-08T03:49:35.858-07:00

It is the developers!
Hardware evolution matters less!
The UTOPIA of supporting highly parallel apps on legacy code will die!

"Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language."

HELL
again
HELL

In a new-paradigm world we need a new way of thinking, and FOR loops are a thing of the past.

What Intel's customers will eventually have to live with is the fact that:
they buy highly overpraised hardware that gives very little improvement on their legacy code!!
They will need to change the legacy code, not the hardware, duh...

I have a few guesses. Intel is jockeying to get i...

2012-11-18T18:04:25.867-07:00

I have a few guesses.

Intel is jockeying to get into (or back into) the Supercomputing 500.

People finally realized you can't do much without independent threads of execution.

Unified Parallel C (UPC) needs a home.

The non-embarrassingly parallel applications are jealous of their friends.

1. Reads outside locked region would show inconsis...

2012-09-02T02:42:39.117-06:00

1. Reads outside a locked region could already see inconsistent state before. I think code that reads values outside "Transaction On!" is not guaranteed to NOT see B inconsistent with A. Too many nots? OK: only transactional code is required to see A and B consistent.

2. If I wanted to modify 2 out of 20 fields in an object, the lock-based solution would need to have a single thread create a copy of the object with the 2 fields modified, then "swap in" the mutated object. The big win of transactional memory is that we no longer need to create new objects to represent the whole state of the system; we can mutate the state in place.
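
The copy-then-swap pattern in point 2 might look like the following minimal sketch. It assumes Python, where rebinding a reference is atomic in CPython; `State` and `CopySwapHolder` are hypothetical names, and only three of the "20 fields" are shown.

```python
import threading
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    """Hypothetical immutable object standing in for the 20-field one."""
    a: int = 0
    b: int = 0
    c: int = 0

class CopySwapHolder:
    """Readers take no lock; they just grab the current snapshot."""

    def __init__(self):
        self._state = State()
        self._write_lock = threading.Lock()  # serializes writers only

    def read(self):
        # One reference read, so the caller always gets a whole snapshot.
        return self._state

    def update(self, **changes):
        with self._write_lock:
            # A full new object is built even though few fields change,
            # then published with a single reference assignment (the swap).
            self._state = replace(self._state, **changes)

holder = CopySwapHolder()
before = holder.read()
holder.update(a=1, b=2)      # modify 2 fields, copy the rest
after = holder.read()
print(before, after)
```

The cost the comment points at is visible in `update`: every change allocates a complete new `State`, however few fields differ. Mutating in place under a transaction would avoid that allocation.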

@Anonymous, who posted the long rant Some of the ...

2012-04-06T11:00:54.836-06:00

@Anonymous, who posted the long rant

Some of the stuff you write is spot-on, some is bullshit. It is okay to bash old ideas and old ways of thinking, but not old people. For example, I had a CS professor as my thesis supervisor who was over 80 years old. Many thought he was a dinosaur who didn't know shit, just because he was old. Fact is, every day he cranked out more truly original ideas than any of the 20-something, drivelling idiots he had to teach in class. Just one example out of millions.

I might be overlooking something obvious, but it s...

2012-03-19T16:12:21.379-06:00

I might be overlooking something obvious, but it seems to me that you could solve this problem with another state in your coherence protocol. There's no need for a pipeline flush: anything already brought in from DRAM is still good to compute with, until it points to a memory location not yet brought into the processor, at which point it sees an invalidation and rolls out to the abort block. It seems that what you *can't* have is actual instruction retirement from a transactional region when another has acquired the xaction? Regardless, the cache coherence protocol, as you note at the beginning, ought to provide most of the mechanics for this.

Thanks for writing such good articles

2012-03-09T13:01:16.725-07:00

Thanks for writing such good articles

Not sure why you assume that cache invalidates to t...

2012-02-23T10:48:35.876-07:00

Not sure why you assume that the cache invalidates needed to take ownership have to happen at "commit" rather than when the write happens. Doing them at commit might cause excessive serialization at release, but it is not necessary.

BG/Q was designed to have only one chip per node, ...

2012-02-18T18:41:37.140-07:00

BG/Q was designed to have only one chip per node, so I don't see how it's a problem that TM only works in this context. Similarly, the question of cache coherence between multiple chips in a node is irrelevant because this configuration will never exist.

nice post, spectacular blog

2012-02-15T06:46:59.689-07:00

nice post, spectacular blog

BG/Q - I've not read any architecture document...

2012-02-14T20:45:32.649-07:00

BG/Q - I've not read any architecture documents on that, but my understanding is that it does, except that in BG/Q it's limited to working within a single chip, not across multiple chips in a node. This makes software exploitation more difficult.

However, I'm not certain that BG/Q has cache coherence across the multiple chips in a node, so transactional memory across the chips wouldn't be as useful anyway.

Does this work the same way as the transactional m...

2012-02-14T20:11:58.566-07:00

Does this work the same way as the transactional memory on the IBM BG/Q processor?

why does Intel waste money with this crap? build ...

2012-01-27T22:24:05.211-07:00

why does Intel waste money with this crap? build a decent gpu why don't you

IMHO AMD's APU or an Atom/ARM chip packaged wi...

2012-01-16T15:01:29.441-07:00

IMHO AMD's APU or an Atom/ARM chip packaged with a commodity GPU is the way to go. Who really cares about running a petaflop's worth of particles, given the extreme latency between iterations, which is at minimum the speed-of-light delay over the distance between compute nodes? 99% of particle simulation computations fit on a single high-end GPU. Big data problems are all about caching and striping across multiple machines to reduce bottlenecks and to fail over if a node dies, which will start happening with probability one in the very near future.
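
For a sense of scale, the speed-of-light floor mentioned above can be worked out directly. This is a back-of-the-envelope sketch in Python; the distances are made-up examples, not measurements from any real machine, and signals in real cables and switches are slower than light in vacuum.

```python
C = 299_792_458.0  # speed of light in vacuum, metres per second

def min_one_way_latency_ns(distance_m):
    """Lower bound on one-way signal time over distance_m, in nanoseconds."""
    return distance_m / C * 1e9

# Hypothetical node-to-node distances within a machine room.
for d in (1.0, 30.0, 100.0):
    print(f"{d:6.1f} m -> at least {min_one_way_latency_ns(d):6.1f} ns one way")
```

At these distances the bound is tens to hundreds of nanoseconds per hop, so this is only a floor; it says nothing about the software and network stack overhead that sits on top of it.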

No, I hadn't considered that because, frankly,...

2012-01-16T12:00:10.249-07:00

No, I hadn't considered that because, frankly, I didn't really know. Certainly Roadrunner has a similar set of issues.

Thanks for the information!

Greg, with regard to the number of codes that can ...

2012-01-15T20:43:14.622-07:00

Greg, with regard to the number of codes that can be ported to run on a GPU-based Titan system: Have you considered how many codes were ever actually run on Roadrunner at LANL? From everything I've read and heard, it is extremely challenging to program Roadrunner and get good performance out of the Cell processors (Roadrunner is a hybrid system, with AMD hosts and Cell accelerators). I'd be willing to bet that the only codes successfully ported to or written for Roadrunner involved little more than one or two simple but CPU intensive kernels; I didn't hear of any "multiphysics" codes being ported to Roadrunner. From a production computing standpoint, Roadrunner is largely a failure. From a "pushing the limits of technology" standpoint, though, it might be considered a success.

That's a pretty bad precedent for a GPU-based Titan, since Titan is intended for production computing. However, there are far more developers with experience programming Nvidia GPUs than there are programming Cell, so perhaps the situation won't be so bad.

My bet is that a handful of codes will be able to use the GPUs in Titan effectively, with at least a few of them having a kernel that can be tuned well enough to pull a big number that ORNL can tout. The rest of the codes, which can't effectively be ported to use GPUs, will just run on the CPUs; they'll get work done but won't make good use of the system. Oddly enough, that sounds like what I've heard about Roadrunner...

Currently I work for Dell. I had been waiting for ...

2012-01-12T03:13:50.759-07:00

Currently I work for Dell. I had been waiting for a decent synopsis on cloud computing, so it was really great to find your article. I like your way of explaining things. Thanks for sharing.

As a Dell employee I think IT departments that ant...

2012-01-12T03:10:22.640-07:00

As a Dell employee, I think that with cloud computing, IT departments that anticipate an enormous uptick in user load need not scramble to secure additional hardware and software.

@Daniel - thanks! There's no question that Nv...

2012-01-09T19:22:24.693-07:00

@Daniel - thanks!

There's no question that Nvidia certainly has a good market in ARM chips, including Tegra, and that won't go away any time soon.

Unfortunately, volumes on ARM chips don't really help Tesla and Tesla-like GPGPUs. That's a different design point, and doesn't share much if any silicon.

They surely will re-use their ARM cores in their new Project Denver HPC offerings. However, the big SIMD/SIMT whomper also on that chip won't, in the future, have a high-volume counterpart to juice its volumes, and that's a significant design effort.

I happened to speak with Bill Dally, Nvidia CTO, a few months ago and brought up this point. He basically hoped the low end didn't go away too soon.

@Jeff - Aargh! Can't believe I didn't see ...

2012-01-09T19:10:48.470-07:00

@Jeff - Aargh! Can't believe I didn't see that one. Thanks!

The Intel slide about "Experience with Knight...

2012-01-09T18:53:11.736-07:00

The Intel slide about "Experience with Knights Ferry... unparalleled productivity". I find the wording choice quite amusing. :-)

My name is Matt and I work for Dell. There are a l...

2012-01-03T02:35:18.900-07:00

My name is Matt and I work for Dell. There are a lot of great comments happening on this blog. Thank you so much for the information.