Do a job with 2X the parallelism and use 4X less power -- if the hardware is designed right. Yes, it's a square law. And that "right" hardware design has nothing to do with simplifying the processor design.
No, nobody's going to achieve that in practice, for reasons I'll get into below; it's a limit. But it's very hard to argue with a quadratic in the long run.
That's why I would have thought this would be in the forefront of everybody's mind these days. After all, it affects both software and hardware design of parallel systems. While it seems particularly to affect low-power applications, in reality everything's a low-power issue these days. Instead, I finally tracked it down, not anywhere on the Interwebs, but in physical university library stacks, down in the basement, buried on page 504 of a 1000-page CMOS circuit design text.
This is an issue that has been bugging me for about five months, ever since my post Japan Inc., vs. Intel – Not Really. Something odd was going on there. When they used more processors, reducing per-processor performance to keep aggregate performance constant, they appeared to noticeably reduce the aggregate power consumption of the whole chip. For example, see this graph of the energy needed to execute tomcatv in a fixed amount of time (meeting a deadline); it's Figure 12 from this paper:
In particular, note the dark bars, the energy with power saving on. They decrease as the number of processors increases. The effect is most noticeable in the 1% case I've marked in red above, where nearly all the energy is spent on actively switching the logic (the dynamic power). (Tomcatv is a short mesh-generation program from the SPEC92 benchmark suite.)
More processors, less power. What's going on here?
One thing definitely not going on is a change in the micro-architecture of the processors. In particular, the approach is not one of using simpler processors – casting aside superscalar, VLIW, out-of-order execution – and thereby cramming more of them onto the same silicon area. The simplification route is a popular direction taken by Oracle (Sun) Niagara, IBM's Blue Genes, Intel's Larrabee, and others, and people kept pointing me to it when I brought up this issue. It's a modern, slightly milder variation of what legions of I'm-an-engineer-I-don't-write-software guys have been saying for decades: armies of ants outperform a few strong horses. They always have, they always will, but only on the things ants can do. Nevertheless, that's not what's happening here, since the same silicon is used in all the cases.
Instead, what's happening appears to have to do with the power supply voltage (VDD) and clock frequency. That leaves me at a disadvantage, since I'm definitely not a circuits guy. Once upon a time I was an EE, but that was more decades ago than I want to admit. But my digging did come up with something, so even if you don't know an inductor from a capacitor (or don't know why EE types call sqrt(-1) "j", not "i"), just hang in there and I'll walk you through it.
The standard formula for the power used by a circuit is P = CV²f, where:
C is the capacitance being driven. This is basically the size of the bucket you're trying to fill up with electrons; a longer, fatter wire is a bigger bucket, for example, and it's more work to fill up a bigger bucket. However, this doesn't change per circuit when you add more processors of the same kind, so the source of the effect can't be here.
f is the clock frequency. Lowering this does reduce power, but only linearly: Halve the speed, you only halve the power, so two processors each half the speed use the same amount of power as one running full speed. You stay in the same place. No help here.
V is the power supply voltage, more properly called VDD to distinguish it from other voltages hanging around circuits. (No, I don't know why it's a sagging double-D.) This has promise, since it is related to the power by a quadratic term: Halve the voltage and the power goes down 4X.
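To make the three terms concrete, here is a minimal sketch of the formula in Python. The function name and the component values are made up for illustration and don't come from any real chip; only the ratios matter.

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Switching power P = C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hertz

# Illustrative values only (assumptions, not measurements).
C = 1e-9   # switched capacitance, farads
V = 1.0    # supply voltage (VDD), volts
F = 1e9    # clock frequency, hertz

base = dynamic_power(C, V, F)
half_clock = dynamic_power(C, V, F / 2)   # same voltage, half the clock
half_volts = dynamic_power(C, V / 2, F)   # same clock, half the voltage

# Two processors at half speed draw what one draws at full speed.
two_slow_cores = 2 * half_clock

print(half_clock / base, half_volts / base, two_slow_cores / base)
```

Halving f gives half the power, halving V gives a quarter, and two half-speed processors at the original voltage land back where they started.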
Unfortunately, V doesn't have an obvious numerical relationship to processor speed. Clearly it does have some relationship: If you have less oomph, it takes longer to fill up that bucket, so you can't run at as high a frequency when you lower the supply voltage. But how much lower does f go as you reduce V? I heard several answers to that which boiled down to "I don't know, but it's probably complicated." One circuit design text just completely punted and recommended using a feedback circuit – test the clock rate, and lower/raise the voltage if the clock is too fast/slow.
But we can ignore that for a little bit, and just look at the power used by an N-way parallel implementation. Kang and Leblebici, the text referenced above, do just that. They consider what happens to power when you parallelize any circuit, reducing the clock to 1/Nth of the original, and combine the results in one final register of capacitance Creg that runs at the original speed (fCLK is that original clock speed):
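Going only by that description, the relationship presumably has the following shape. This is a reconstruction from the definitions in the text, not a copy of the book's equation; the labels P_ref, P_par, V_DD and V_DD,new for the original and parallel power and supply voltage are chosen here and may differ from the book's notation.

```latex
\begin{align}
P_{\text{ref}} &= C_{\text{total}}\, V_{DD}^{2}\, f_{CLK} \\
P_{\text{par}} &= N\, C_{\text{total}}\, V_{DD,\text{new}}^{2}\, \frac{f_{CLK}}{N}
               + C_{\text{reg}}\, V_{DD,\text{new}}^{2}\, f_{CLK} \\
\frac{P_{\text{par}}}{P_{\text{ref}}} &=
  \left(1 + \frac{C_{\text{reg}}}{C_{\text{total}}}\right)
  \frac{V_{DD,\text{new}}^{2}}{V_{DD}^{2}}
\end{align}
```

The N copies each run at 1/N of the clock, so N cancels in the first term; only the shared output register still pays for the full clock rate.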
If the original circuit is really big, like a whole core, its capacitance Ctotal overwhelms the place where you deposit the result, Creg, so that "1+" term at the front is essentially 1. So, OK, that means the power ratio is proportional to the square of the voltage ratio. What does that have to do with the clock speed?
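As a rough check on how little that "1+" term matters for a big circuit, here is a small sketch. It assumes the ratio has the form (1 + Creg/Ctotal) times the square of the voltage ratio, as described above; the function name and the capacitance numbers are hypothetical.

```python
def parallel_power_ratio(v_new, v_ref, c_reg, c_total):
    """Parallel power divided by original power.

    Assumed form, taken from the description in the text:
    (1 + Creg/Ctotal) * (Vnew/Vref)^2. N does not appear because
    N copies each run at 1/N of the original clock.
    """
    return (1.0 + c_reg / c_total) * (v_new / v_ref) ** 2

# Hypothetical core: the output register is 0.1% of the total capacitance.
whole_core = parallel_power_ratio(v_new=0.5, v_ref=1.0, c_reg=1.0, c_total=1000.0)

# Hypothetical tiny circuit: the register is as big as the logic itself.
tiny_circuit = parallel_power_ratio(v_new=0.5, v_ref=1.0, c_reg=1.0, c_total=1.0)

print(whole_core, tiny_circuit)
```

For the core-sized case the result is within a fraction of a percent of the pure voltage-squared ratio of 0.25; for the tiny circuit the register overhead doubles it.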
They suggest you solve the equations giving the propagation time through an inverter to figure that out. Here they are, purely for your amusement:
Solve them? No thank you. I'll rely on Kang and Leblebici's analysis, which states, after verbally staring at a graph of those equations for half a page: "The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship,…" (emphasis added).
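For a feel of what "stronger than quadratic" can look like, here is a sketch that swaps in a much simpler delay model than the book's inverter equations: the alpha-power law, where gate delay goes roughly as V / (V - Vt)^alpha. That model, the threshold voltage of 0.3 V, and alpha = 2 are all assumptions picked for illustration, not values from the text.

```python
def max_clock(v, v_t=0.3, alpha=2.0):
    """Relative maximum clock at supply voltage v (arbitrary units).

    Assumed alpha-power delay model, NOT the book's equations:
    delay ~ v / (v - v_t)**alpha, so frequency ~ (v - v_t)**alpha / v.
    """
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (v - v_t) ** alpha / v

def switching_power(v, v_t=0.3, alpha=2.0):
    """Relative switching power V^2 * f, clocked as fast as v allows."""
    return v ** 2 * max_clock(v, v_t, alpha)

V_HIGH, V_LOW = 1.0, 0.5   # illustrative supply voltages, volts

quadratic_only = (V_LOW / V_HIGH) ** 2                          # V^2 term alone
with_clock = switching_power(V_LOW) / switching_power(V_HIGH)   # V^2 and f(V)

print(quadratic_only, with_clock)
```

With these made-up parameters, halving the voltage cuts switching power by far more than the factor of four the V² term alone would give, because the achievable clock drops along with the voltage.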
So, there you go. More parallel means less power. By a lot. Actually stronger than quadratic. And it doesn't depend on micro-architecture: It will work just as well on full-tilt single-thread-optimized out-of-order superscalar or whatever cores – so you can get through serial portions with those cores faster, then spread out and cool down on nicely parallel parts.
You can't get that full quadratic effect in practice, of course, as I mentioned earlier. Some things this does not count:
- Less than perfect software scaling.
- Domination by static, not switching, power dissipation: The power that's expended just sitting there doing nothing, not even running the clock. This can be an issue, since leakage current is a significant concern nowadays.
- Hardware that doesn't participate in this, like memory, communications, and other IO gear. (Update: Bjoern Knafia pointed out on Twitter that caches need this treatment, too. I've no clue how SRAM works in this regard, but it is pretty power-hungry. eDRAM, not so much.)
- Power supply overheads
- And… a lack of hardware that actually lets you do this.
But the right stuff just isn't there to exploit this circuit-design law of physics. It should be.
And, clearly, software should be coding to exploit it when it becomes available. Nah. Never happen.
Ultimately, though, it's hard to ignore Mother Nature when she's beating you over the head with a square law. It'll happen when people realize what they have to gain.