The Perils of Parallel: February 2010

The notion of "parallel PowerPoint" is a poster child for the uselessness of multicore on client computers. Rendering a slide two, four, eight times faster is a joke. Nobody needs it.

Running PowerPoint 4, 16, 64 times longer on my laptop battery, though, that's useful. I am purely sick of carrying around a 3 lb. power supply with my expensively light, carbide case, 3 lb. laptop.

This is why the Parallel Power Law – if the hardware is designed right, the processor power can drop with the square of the number of processors – is important. I'd like to suggest that it is a, possibly the, killer app for parallelism.

There are two keys here, both of which I discussed in the previous post, The Parallel Power Law. One key is reduced power, of course: Double the processors and halve the clock rate, with the associated reduction in voltage, and the power drops by 4X (see that post). That decrease doesn't happen, though, unless the other key is present: You don't increase performance, as parallel mavens have been attempting for eons; instead, you keep it constant. Many systems, particularly laptops and graphics units, now reduce clock speed under heat or power constraints, but they don't maintain performance.

The issue is this: All the important everyday programs out there run with perfectly adequate performance today, and have done so for nigh unto a decade. Email, browsers, spreadsheets, word processors, and yes, PowerPoint, all those simply don't have a performance issue; they're fast enough now. Multicore in those systems is a pure waste: It's not needed, and they don't use it. Yes, multiple cores get exercised by the multitasking transparently provided by operating systems, but it really doesn't do anything useful; with a few exceptions, primarily games, programs are nearly always performance-gated by disk, network, or other functions rather than by processing speed. One can point to exceptions – spreadsheets in financial circles where you press F5 (recalculate) and go get coffee, or lunch, or a night's sleep – but they are scarcely used in volume.

And volume is what defines a killer app: It must be something used widely enough to soak up the output of fabs. Client applications are the only ones with that characteristic.

What I'm trying to point to here appears to be a new paradigm for the use of parallelism: Use it not to achieve more performance, but to drastically lower power consumption without reducing performance. It applies to both traditional clients, and to the entire new class of mobile apps running on smart phones / iPads / iGoggles that is now the fastest-expanding frontier.

For sure, the pure quadratic gains (or more; see the comments on that post) won't be realized, because there are many other uses of power this does not affect, like displays, memory, disks, etc. But a substantial fraction of the power used is still in the processor, so dropping its contribution by a lot will certainly help.
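To put rough numbers on that caveat, here is a minimal sketch. It assumes, purely for illustration, that the processor accounts for some fraction of total system power, that only that fraction shrinks as 1/N², and that battery life is inversely proportional to total power. The function names and the 50% figure are my own placeholders, not measurements.

```python
def system_power_ratio(cpu_fraction: float, n: int) -> float:
    """Whole-system power after going n-way parallel, relative to before.

    Assumption: only the processor's share of power scales as 1/n**2;
    display, memory, disk, and everything else stay fixed.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    other = 1.0 - cpu_fraction  # power this technique does not touch
    return other + cpu_fraction / n**2


def battery_life_gain(cpu_fraction: float, n: int) -> float:
    """Battery-life multiplier, assuming runtime is inversely proportional to power."""
    return 1.0 / system_power_ratio(cpu_fraction, n)


if __name__ == "__main__":
    # 0.5 is a hypothetical processor share, chosen only for illustration.
    for n in (1, 2, 4, 8):
        print(n, round(battery_life_gain(0.5, n), 2))
```

Under these assumptions the gain can never exceed 1 / (1 - cpu_fraction), however many processors you add, which is the same shape of limit Amdahl's Law puts on speedup.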

Can this become a new killer-app bandwagon? It's physically possible. But I doubt it, because the people with the expertise to do it, the parallel establishment, are too heavily invested in the paradigm of parallel for higher performance, busily working out how to get to exascale computation, with conjoined petascale department systems, etc.

Some areas definitely need that level of computation; weather and climate simulation, for example. But cases like that, while useful and even laudable, are increasingly remote from the volumes needed to keep the industry running as it has in the past. Parallelism for lower power is the only case I've seen that truly addresses that necessarily broad market.

Do a job with 2X the parallelism and use 4X less power -- if the hardware is designed right. Yes, it's a square law. And that "right" hardware design has nothing to do with simplifying the processor design.

No, nobody's going to achieve that in practice, for reasons I'll get into below; it's a limit. But it's very hard to argue with a quadratic in the long run.

That's why I would have thought this would be in the forefront of everybody's mind these days. After all, it affects both software and hardware design of parallel systems. While it seems particularly to affect low-power applications, in reality everything's a low-power issue these days. Instead, I finally tracked it down, not anywhere on the Interwebs, but in physical university library stacks, down in the basement, buried on page 504 of a 1000-page CMOS circuit design text.

This is an issue that has been bugging me for about five months, ever since my post Japan Inc., vs. Intel – Not Really. Something odd was going on there. When they used more processors, reducing per-processor performance to keep aggregate performance constant, they appeared to noticeably reduce the aggregate power consumption of the whole chip. For example, see this graph of the energy needed to execute tomcatv in a fixed amount of time (meeting a deadline); it's Figure 12 from this paper:

In particular, note the dark bars, the energy with power saving on. They decrease as the number of processors increases. The effect is most noticeable in the 1% case I've marked in red above, where nearly all the energy is spent on actively switching the logic (the dynamic power). (Tomcatv is a short mesh-generation program from the SPEC92 benchmark suite.)

More processors, less power. What's going on here?

One thing definitely not going on is a change in the micro-architecture of the processors. In particular, the approach is not one of using simpler processors – casting aside superscalar, VLIW, out-of-order execution – and thereby cramming more of them onto the same silicon area. The simplification route is a popular direction taken by Oracle (Sun) Niagara, IBM's Blue Genes, Intel's Larrabee, and others, and people kept pointing me to it when I brought up this issue; it's a modern slightly milder variation of what legions of I'm-an-engineer-I-don't-write-software guys have been saying for decades: armies of ants outperform a few strong horses. They always have, they always will, but on the things ants can do. Nevertheless, that's not what's happening here, since the same silicon is used in all the cases.

Instead, what's happening appears to have to do with the power supply voltage (V_DD) and clock frequency. That leaves me at a disadvantage, since I'm definitely not a circuits guy. Once upon a time I was EE, but that was more decades ago than I want to admit. But my digging did come up with something, so even if you don't know an inductor from a capacitor (or don't know why EE types call sqrt(-1) "j", not "i"), just hang in there and I'll walk you through it.

The standard formula for the power used by a circuit is P = CV²f, where:

C is the capacitance being driven. This is basically the size of the bucket you're trying to fill up with electrons; a longer, fatter wire is a bigger bucket, for example, and it's more work to fill up a bigger bucket. However, this doesn't change per circuit when you add more processors of the same kind, so the source of the effect can't be here.

f is the clock frequency. Lowering this does reduce power, but only linearly: Halve the speed, you only halve the power, so two processors each half the speed use the same amount of power as one running full speed. You stay in the same place. No help here.

V is the power supply voltage, more properly called V_DD to distinguish it from other voltages hanging around circuits. (No, I don't know why it's a sagging double-D.) This has promise, since it is related to the power by a quadratic term: Halve the voltage and the power goes down 4X.
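The three terms can be played with in a few lines. This is only a sketch in normalized units (C = V = f = 1 for the single full-speed processor). The third case simply assumes the voltage can be halved along with the clock, which nothing so far has established.

```python
def dynamic_power(c: float, v: float, f: float, n: int = 1) -> float:
    """Aggregate switching power of n identical circuits: n * C * V^2 * f."""
    return n * c * v**2 * f


# Normalized units: one processor at full voltage and full clock.
C, V, F = 1.0, 1.0, 1.0

baseline = dynamic_power(C, V, F)

# Two processors, each at half the clock, same voltage: no savings.
half_clock_two = dynamic_power(C, V, F / 2, n=2)

# Two processors, half the clock AND half the voltage (assumed possible):
# the quadratic term in V does the work.
half_both_two = dynamic_power(C, V / 2, F / 2, n=2)

if __name__ == "__main__":
    print(baseline, half_clock_two, half_both_two)
```

The middle case is the "you stay in the same place" result from the f term; only the last case, with its unproven voltage assumption, gets the 4X.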

Unfortunately, V doesn't have an obvious numerical relationship to processor speed. Clearly it does have some relationship: If you have less oomph, it takes longer to fill up that bucket, so you can't run as fast a frequency when you lower the supply voltage. But how much lower does f go as you reduce V? I heard several answers to that which boiled down to "I don't know, but it's probably complicated." One circuit design text just completely punted and recommended using a feedback circuit – test the clock rate, and lower/raise the voltage if the clock is too fast/slow.

But we can ignore that for a little bit, and just look at the power used by an N-way parallel implementation. Kang and Leblebici, the text referenced above, do just that. They consider what happens to power when you parallelize any circuit, reducing the clock to 1/Nth of the original, and combine the results in one final register of capacitance C_reg that runs at the original speed (f_CLK is that original clock speed):

If the original circuit is really big, like a whole core, its capacitance C_total overwhelms the place where you deposit the result, C_reg, so that "1+" term at the front is essentially 1. So, OK, that means the power ratio is proportional to the square of the voltage ratio. What does that have to do with the clock speed?
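The equation itself did not survive on this page. What follows is a reconstruction from the description above, not a quotation from the text. It assumes N copies of the circuit, each of capacitance C_total and clocked at f_CLK/N, all running at a reduced supply voltage that I'll call V_DD,new (my name for it), plus the one output register at full clock speed:

```latex
\begin{align*}
P_{\text{reference}} &= C_{\text{total}}\, V_{DD}^{2}\, f_{CLK} \\
P_{\text{parallel}}  &= N\, C_{\text{total}}\, V_{DD,\text{new}}^{2}\, \frac{f_{CLK}}{N}
                        + C_{\text{reg}}\, V_{DD,\text{new}}^{2}\, f_{CLK} \\
                     &= \left(1 + \frac{C_{\text{reg}}}{C_{\text{total}}}\right)
                        C_{\text{total}}\, V_{DD,\text{new}}^{2}\, f_{CLK} \\
\frac{P_{\text{parallel}}}{P_{\text{reference}}}
                     &= \left(1 + \frac{C_{\text{reg}}}{C_{\text{total}}}\right)
                        \frac{V_{DD,\text{new}}^{2}}{V_{DD}^{2}}
\end{align*}
```

If you further assume, as an idealization, that the supply voltage can be scaled down in proportion to the clock (V_DD,new = V_DD/N) and that C_reg is negligible next to C_total, the ratio becomes 1/N². For N = 2 that is the "2X the parallelism, 4X less power" figure.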

They suggest you solve the equations giving the propagation time through an inverter to figure that out. Here they are, purely for your amusement:

Solve them? No thank you. I'll rely on Kang and Leblebici's analysis, which states, after verbally staring at a graph of those equations for half a page: "The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship,…" (emphasis added).

So, there you go. More parallel means less power. By a lot. Actually stronger than quadratic. And it doesn't depend on micro-architecture: It will work just as well on full-tilt single-thread-optimized out-of-order superscalar or whatever cores – so you can get through serial portions with those cores faster, then spread out and cool down on nicely parallel parts.

You can't get that full quadratic effect in practice, of course, as I mentioned earlier. Some things this does not count:

Less than perfect software scaling.
Domination by static, not switching, power dissipation: The power that's expended just sitting there doing nothing, not even running the clock. This can be an issue, since leakage current is a significant concern nowadays.
Hardware that doesn't participate in this, like memory, communications, and other IO gear. (Update: Bjoern Knafla pointed out on Twitter that caches need this treatment, too. I've no clue how SRAM works in this regard, but it is pretty power-hungry. eDRAM, not so much.)
Power supply overheads.
And… a lack of hardware that actually lets you do this.

The last of these kind of makes this all irrelevant in practice, today. Nobody I could find, except for the "super CPU" discussed in that earlier post, allows software control of voltage/frequency settings. Embedded systems today, like IBM/Freescale PowerPC and ARM licensees, typically just provide on and off, with several variants of "off" using less power the longer it takes to turn on again. Starting with Nehalem (Core i7) Intel has Turbo Boost, which changes the frequency but not the voltage (as far as I can tell), thereby missing the crucial ingredient to make this work. And there were hints that Larrabee will (can?) automatically reduce frequency if its internal temperature monitoring indicates things are getting too hot.

But the right stuff just isn't there to exploit this circuit-design law of physics. It should be.

And, clearly, software should be coding to exploit it when it becomes available. Nah. Never happen.

Ultimately, though, it's hard to ignore Mother Nature when she's beating you over the head with a square law. It'll happen when people realize what they have to gain.

Sunday, February 21, 2010

Parallel PowerPoint: Why the Power Law is Important

Wednesday, February 17, 2010

The Parallel Power Law