Wednesday, February 17, 2010

The Parallel Power Law


Do a job with 2X the parallelism and use 4X less power -- if the hardware is designed right. Yes, it's a square law. And that "right" hardware design has nothing to do with simplifying the processor design.

No, nobody's going to achieve that in practice, for reasons I'll get into below; it's a limit. But it's very hard to argue with a quadratic in the long run.
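Here's the back-of-the-envelope version of that claim, as a tiny sketch. It leans on the dynamic-power formula discussed below, plus the optimistic assumption that halving the clock lets you halve the supply voltage too:

```python
# Back-of-the-envelope check of the headline claim, using the dynamic-power
# formula P = C * V^2 * f discussed below. Optimistic assumption: halving
# the clock lets you halve the supply voltage as well.

C, V, f = 1.0, 1.0, 1.0                   # normalized capacitance, voltage, clock

one_core = C * V**2 * f                   # one core at full speed

# Two cores, each at half the clock and half the voltage: same total throughput.
two_cores = 2 * C * (V / 2)**2 * (f / 2)

print(one_core / two_cores)               # -> 4.0: same work per second, 1/4 the power
```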

That's why I would have thought this would be in the forefront of everybody's mind these days. After all, it affects both software and hardware design of parallel systems. While it seems particularly to affect low-power applications, in reality everything's a low-power issue these days. Instead, I finally tracked it down, not anywhere on the Interwebs, but in physical university library stacks, down in the basement, buried on page 504 of a 1000-page CMOS circuit design text.

This is an issue that has been bugging me for about five months, ever since my post Japan Inc., vs. Intel – Not Really. Something odd was going on there. When they used more processors, reducing per-processor performance to keep aggregate performance constant, they appeared to noticeably reduce the aggregate power consumption of the whole chip. For example, see this graph of the energy needed to execute tomcatv in a fixed amount of time (meeting a deadline); it's Figure 12 from this paper:

[Figure 12 from the paper: energy to execute tomcatv within a fixed deadline, versus number of processors, with and without power saving.]

In particular, note the dark bars, the energy with power saving on. They decrease as the number of processors increases. The effect is most noticeable in the 1% case I've marked in red above, where nearly all the energy is spent on actively switching the logic (the dynamic power). (Tomcatv is a short mesh-generation program from the SPEC92 benchmark suite.)

More processors, less power. What's going on here?

One thing definitely not going on is a change in the micro-architecture of the processors. In particular, the approach is not one of using simpler processors – casting aside superscalar, VLIW, and out-of-order execution – and thereby cramming more of them onto the same silicon area. That simplification route is a popular direction taken by Oracle (Sun) Niagara, IBM's Blue Genes, Intel's Larrabee, and others, and people kept pointing me to it when I brought up this issue; it's a modern, slightly milder variation of what legions of I'm-an-engineer-I-don't-write-software guys have been saying for decades: armies of ants outperform a few strong horses. They always have and they always will, but only on the things ants can do. Nevertheless, that's not what's happening here, since the same silicon is used in all the cases.

Instead, what's happening appears to have to do with the power supply voltage (VDD) and clock frequency. That leaves me at a disadvantage, since I'm definitely not a circuits guy. Once upon a time I was an EE, but that was more decades ago than I want to admit. But my digging did come up with something, so even if you don't know an inductor from a capacitor (or don't know why EE types call sqrt(-1) "j", not "i"), just hang in there and I'll walk you through it.

The standard formula for the power used by a circuit is P = C·V²·f, where:

C is the capacitance being driven. This is basically the size of the bucket you're trying to fill up with electrons; a longer, fatter wire is a bigger bucket, for example, and it's more work to fill up a bigger bucket. However, this doesn't change per circuit when you add more processors of the same kind, so the source of the effect can't be here.

f is the clock frequency. Lowering this does reduce power, but only linearly: Halve the speed and you only halve the power, so two processors each running at half the speed use the same amount of power as one running at full speed. You stay in the same place. No help here.

V is the power supply voltage, more properly called VDD to distinguish it from other voltages hanging around circuits. (No, I don't know why it's a sagging double-D.) This has promise, since it is related to the power by a quadratic term: Halve the voltage and the power goes down 4X.

Unfortunately, V doesn't have an obvious numerical relationship to processor speed. Clearly it does have some relationship: If you have less oomph, it takes longer to fill up that bucket, so you can't run the clock as fast when you lower the supply voltage. But how much lower does f go as you reduce V? I heard several answers to that, all of which boiled down to "I don't know, but it's probably complicated." One circuit design text just completely punted and recommended using a feedback circuit – test the clock rate, and lower/raise the voltage if the clock is too fast/slow.
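For what it's worth, one common rule of thumb (not from that text, and strictly an assumption here) is the "alpha-power" model: f goes roughly as (VDD − VT)^α / VDD, with α somewhere between 1 and 2. Here's a little sketch that uses it, with a bisection search standing in for that feedback circuit, to guess the supply voltage needed for a target clock; the VT, α, and nominal-VDD numbers are made up purely for illustration:

```python
# Sketch: guess the supply voltage needed for a target clock frequency,
# assuming the "alpha-power" approximation f ~ (VDD - VT)**ALPHA / VDD.
# VT, ALPHA, and VDD_NOM below are illustrative guesses, not real process data.

VT = 0.35       # threshold voltage, volts (assumed)
ALPHA = 1.3     # velocity-saturation exponent (assumed)
VDD_NOM = 1.2   # nominal supply voltage, volts (assumed)

def rel_freq(vdd):
    """Clock frequency relative to what the nominal supply voltage gives."""
    return ((vdd - VT) ** ALPHA / vdd) / ((VDD_NOM - VT) ** ALPHA / VDD_NOM)

def vdd_for(target_rel_freq):
    """Bisection search for VDD, playing the role of that feedback circuit."""
    lo, hi = VT + 1e-3, VDD_NOM
    for _ in range(60):
        mid = (lo + hi) / 2
        if rel_freq(mid) < target_rel_freq:
            lo = mid
        else:
            hi = mid
    return hi

for n in (1, 2, 4, 8):
    print(f"1/{n} of the clock needs roughly VDD = {vdd_for(1.0 / n):.2f} V")
```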

But we can ignore that for a little bit, and just look at the power used by an N-way parallel implementation. Kang and Leblebici, the text referenced above, do just that. They consider what happens to power when you parallelize any circuit, reducing the clock to 1/Nth of the original and combining the results in one final register of capacitance Creg that runs at the original speed (fCLK is that original clock speed):

Pparallel = (Ctotal + Creg) × (VDDnew)² × fCLK = (1 + Creg/Ctotal) × Ctotal × (VDDnew)² × fCLK

so that

Pparallel / Poriginal = (1 + Creg/Ctotal) × (VDDnew / VDD)²
 If the original circuit is really big, like a whole core, its capacitance Ctotal overwhelms the place where you deposit the result, Creg, so that "1+" term at the front is essentially 1. So, OK, that means the power ratio is proportional to the square of the voltage ratio. What does that have to do with the clock speed?
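(Hold that question for a moment.) Just to get a feel for the sizes involved, here's the ratio with numbers plugged in; it assumes, as the headline claim does, that the voltage can drop in proportion to the clock, and uses a made-up Creg/Ctotal of 1% to stand in for "the result register is tiny compared to a whole core":

```python
# Evaluate the parallel-power ratio at constant aggregate throughput:
#   P_parallel / P_original = (1 + Creg/Ctotal) * (VDDnew / VDD)**2
# Assumes the voltage can scale in proportion to the clock; the 1% value
# for Creg/Ctotal is made up, standing in for "the result register is tiny
# compared to a whole core".

CREG_OVER_CTOTAL = 0.01

def power_ratio(n):
    v_ratio = 1.0 / n                        # assumed: VDD scales with the clock
    return (1 + CREG_OVER_CTOTAL) * v_ratio ** 2

for n in (1, 2, 4, 8):
    print(f"{n}-way parallel -> {power_ratio(n):.3f} of the original power")
```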


They suggest you solve the equations giving the propagation time through an inverter to figure that out. Here they are, purely for your amusement:
tPHL = [ Cload / ( kn × (VDD − VT,n) ) ] × [ 2VT,n/(VDD − VT,n) + ln( 4(VDD − VT,n)/VDD − 1 ) ]

tPLH = [ Cload / ( kp × (VDD − |VT,p|) ) ] × [ 2|VT,p|/(VDD − |VT,p|) + ln( 4(VDD − |VT,p|)/VDD − 1 ) ]

Solve them? No thank you. I'll rely on Kang and Leblebici's analysis, which states, after verbally staring at a graph of those equations for half a page: "The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship,…" (emphasis added).
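To get a feel for what "stronger than a simple quadratic" means, here's an illustration using the same alpha-power approximation as the sketch above (again, the VT and α values are just assumptions): if lowering VDD also drags the clock down, then P = C·V²·f(V) falls off faster than V² alone would suggest.

```python
# Illustration of "stronger than a simple quadratic": if lowering VDD also
# drags the clock down (modeled with the same alpha-power approximation as
# the earlier sketch; VT, ALPHA, VDD_NOM are assumed, illustrative values),
# then switching power P ~ VDD**2 * f(VDD) falls faster than VDD**2 alone.

VT, ALPHA, VDD_NOM = 0.35, 1.3, 1.2

def f_rel(vdd):
    return ((vdd - VT) ** ALPHA / vdd) / ((VDD_NOM - VT) ** ALPHA / VDD_NOM)

for vdd in (1.2, 1.0, 0.8, 0.6):
    v_squared_only = (vdd / VDD_NOM) ** 2
    with_clock = v_squared_only * f_rel(vdd)
    print(f"VDD = {vdd:.1f} V: V^2 alone -> {v_squared_only:.2f}, "
          f"V^2 * f(V) -> {with_clock:.2f}")
```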


So, there you go. More parallel means less power. By a lot. Actually stronger than quadratic. And it doesn't depend on micro-architecture: It will work just as well on full-tilt, single-thread-optimized, out-of-order superscalar cores (or whatever), so you can get through serial portions with those cores faster, then spread out and cool down on the nicely parallel parts.

You can't get that full quadratic effect in practice, of course, as I mentioned earlier. Some things this does not count:


  • Less than perfect software scaling.
  • Domination by static, not switching, power dissipation: The power that's expended just sitting there doing nothing, not even running the clock. This can be an issue, since leakage current is a significant concern nowadays. (There's a quick sketch of this effect right after this list.)
  • Hardware that doesn't participate in this, like memory, communications, and other IO gear. (Update: Bjoern Knafia pointed out on Twitter that caches need this treatment, too. I've no clue how SRAM works in this regard, but it is pretty power-hungry. eDRAM, not so much.)
  • Power supply overheads
  • And… a lack of hardware that actually lets you do this.
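Here's the promised quick sketch of the static-power item, with an assumed static share of 30% of full-speed power (an illustrative number, nothing more) treated as unchanged by voltage and frequency scaling:

```python
# Quick look at the static-power caveat: if some fraction of full-speed power
# is static (leakage and friends) and, to keep things simple, doesn't shrink
# when VDD and the clock go down, the quadratic gain only applies to the
# dynamic part. The 30% static fraction and the voltage-scales-with-clock
# assumption are both purely illustrative.

STATIC_FRACTION = 0.30

def total_power_ratio(n):
    dynamic = (1 - STATIC_FRACTION) * (1.0 / n) ** 2   # ideal quadratic gain
    static = STATIC_FRACTION                           # treated as fixed here
    return dynamic + static

for n in (1, 2, 4, 8):
    print(f"{n}-way parallel -> {total_power_ratio(n):.2f} of the original power")
```

The gains bottom out near the static fraction, which is one more reason reducing leakage matters so much.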
That last item, the lack of hardware, kind of makes this all irrelevant in practice, today. Nobody I could find, except for the "super CPU" discussed in that earlier post, allows software control of voltage/frequency settings. Embedded systems today, like IBM/Freescale PowerPC and ARM licensees, typically just provide on and off, with several variants of "off" using less power the longer it takes to turn on again. Starting with Nehalem (Core i7), Intel has Turbo Boost, which changes the frequency but not the voltage (as far as I can tell), thereby missing the crucial ingredient to make this work. And there were hints that Larrabee will (can?) automatically reduce frequency if its internal temperature monitoring indicates things are getting too hot.

But the right stuff just isn't there to exploit this circuit-design law of physics. It should be.


And, clearly, software should be coding to exploit it when it becomes available. Nah. Never happen.
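If it ever does happen, the code itself wouldn't be complicated. Here's a sketch of the shape it might take, built around an entirely imaginary set_core_voltage_frequency() knob; nothing like it is generally exposed to software as I write this:

```python
# Purely hypothetical sketch of software exploiting the effect: meet a fixed
# deadline by spreading work over more cores, each running slower and at lower
# voltage. set_core_voltage_frequency() is an imaginary interface; no
# mainstream hardware exposed such a knob to software when this was written.
from concurrent.futures import ThreadPoolExecutor

def set_core_voltage_frequency(core_id, clock_fraction):
    """Imaginary DVFS hook: ask core `core_id` to run at clock_fraction of
    full speed, with the supply voltage lowered to match."""
    pass  # on real hardware this would be a driver or firmware call

def process(item):
    return item * item          # stand-in for the real per-item work

def run_within_deadline(work_items, n_cores):
    # Each core only needs 1/n_cores of the throughput to hit the same
    # deadline, so ask every core for a proportionally lower clock (and voltage).
    for core in range(n_cores):
        set_core_voltage_frequency(core, 1.0 / n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        return list(pool.map(process, work_items))

print(run_within_deadline(range(8), n_cores=4))
```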


Ultimately, though, it's hard to ignore Mother Nature when she's beating you over the head with a square law. It'll happen when people realize what they have to gain.

15 comments:

crander said...

I believe Intel says its SCC research chip allows software voltage and frequency control.

Quote: "The novel many-core architecture includes innovations for scalability in terms of energy-efficiency including improved core-core communication and techniques that enable software to dynamically configure voltage and frequency to attain power consumptions from 125W to as low as 25W"

Link: http://techresearch.intel.com/articles/Tera-Scale/1826.htm

Matt Reilly said...

As a chip designer and kinda circuits-guy, I can say that for practical and contemporary CMOS processes, over the narrow range of permissible VDD values (permissible defined by the foundry technology files) circuit delays are more or less inversely proportional (i.e. nearly linear) to VDD.
Thus, the familiar P = k*C*f*V^2 becomes P = k' * C * f^3
I pointed this out in my workshop presentation on power aware architecture development at ISCC back in 2002.

And, in fact, it is worse than this. Circuit centric models ignore the fact that very fast pipelines tend to have more storage flops per logic element than long-tic pipelines. Pipeline flops don't actually do any work, but they consume lots of power. (And in fact, as the pipe stages get shorter, the flops must get faster, which burns more power....)

You are right, properly designed multicore widgets can be more power efficient than lower processor count widgets. Ants waste less food than horses.

David Kanter said...

Matt's response was probably the most elegant way of explaining the relationship. Dropping frequency and voltage to reduce power is always a winning combination - a 15% reduction in V and F can easily halve your power.

The tricky part is that CMOS circuits don't work well much under 1V. If you look at some of Intel's designs, they spend a lot of extra transistors to enable operation around 0.7V - using register files instead of SRAM for the L1D in Atom and Nehalem.

One of the trends in CPU design is DVFS - Dynamic Voltage and Frequency Scaling. That is, adjusting V and F to the optimum point for your design. The more aggressive systems do this solely under hardware control, without letting software get in the way...although it seems like an optimal system should enable SW to provide hints.

Wes Felter said...

People have been changing frequency and voltage for ten years in laptops and five years in servers, so you're a little behind there. Also, this thinking was explored to some extent in "Energy-Efficient Server Clusters" from PACS 2002. The EPI throttling paper also hints at this from a different direction by showing that for a fixed power budget it is faster to run parallel than serial.

Besides the various technical obstacles, I think there's also an economic factor: if you buy a multi-core you'll naturally want to use it to run faster, not lower power at the same speed.

Tomas said...

Awesome post. Well put together technical view, especially for a non circuits guy.

-Tom
http://ravingsoftom.blogspot.com/

Sassa said...
This comment has been removed by the author.
Sassa said...

@Matt, @Greg,

I am not a circuit guy, but you are using a Power formula for circuits that work at top frequency, and then mix together two different frequencies.

Rewind.

P(t)=V(t)*I(t).

I(t) doesn't know the frequency of the generator, doesn't know when it will change the phase.

Stop.

I(t)=V(t)/R_c

P(t)=V(t)^2/R_c

Play from here.

Greg Pfister said...

Wow, interesting comments. Thanks!

@Wes - I think it's possible you're missing a key point: Going parallel for lower power at the *same* performance. Yes, laptops have had save-the-battery modes for a long time, but it reduces performance.

@David - Yes, in figuring this out, I came across lots of references to DVFS. Pure hardware does "stay out of the way," but then how can it tell the SW is deliberately using parallelism to maintain response time while allowing power to decrease?

@Sassa - First, for those reading here, he first posted and then sent me offline a long explanation, with lots of equations and a very non-trivial graph, of what he thinks is going on. Wow. Thanks! Unfortunately, (a) it would take me a huge amount of time to really understand what you're saying; (b) the net conclusion seems to be that no power is saved, which I don't think is right. See the first graph in this post; aggregate power is definitely decreasing.

Unfortunately, in this post I left out my real net point as to why this is spectacularly important, *&^%! See upcoming additional blog post.

Sassa said...

No, it wasn't what I meant; I'd still get a square of frequency, but not the cube and not f_CLK in the power equation.

Initially I considered resistance depending only on resonance frequency of the circuit, and it didn't add up with the power equation using generator frequency. Now I recall the bigger picture, and the resonance frequency of the circuit dies out after long enough.

Which basically means I got it.

Gary Lauterbach said...

I'll add my 2 cents worth:

Riding the CV^2F curve downward is very compelling but as Greg mentioned there are real obstacles that make it difficult. As shown in the graph, with 30+% static power the gains are minimal. Watch Glenn Hinton's recent Stanford EE380 talk for a data point on Nehalem's static power percentage (30%).

http://www.stanford.edu/class/ee380/

Matt's observation regarding F ~ Vdd for mainstream processes is a result of Vdd >> Vt. As the equations of Kang and Leblebici show, the dependence is on Vdd - Vt; intuitively, as Vdd approaches Vt, frequency approaches zero. This non-linear F dependence limits the gains from scaling Vdd downward on a constant design.

In the past Vt was scaled along with Vdd but this is now precluded due to subthreshold leakage (a major contributor to static power). Subthreshold leakage goes as ~e^(-Vt/(n*kT/q)), so static power grows exponentially as Vt decreases. CMOS scaling has no more silver bullets to enable riding the CV^2F curve down for better power efficiency.

Micro-architecture can and will play an important role in enabling maximum CV^2F scaling. Minimizing static power is essential. The other way of looking at this is maximizing work done per unit of static dissipation is essential. A micro-architecture that produces more work per unit of static dissipation (work per gate) will achieve better downward Vdd scaling. The current high-end micro-architectures do not optimize for work done per gate.

For an expert's insights into the limitations and possibilities, try Googling:

"Scaling, Power, the future of CMOS"

and read what Mark Horowitz and co-authors have to say about this issue.

Greg Pfister said...

More great comments --

Gary, absolutely true. Static power, particularly leakage, is a lot of what got us into the heat issue that collapsed frequency increases in the first place. Reducing it is, I assume, a high priority for everybody. This post points to yet another reason to do so.

Sassa and I had more off-line exchanges, out of which I got this net (which will make sense if once upon a time you were an RF hacker):

Any circuit has a resonant frequency; clocking at that frequency works best. Moving off that frequency can reduce performance, or have other bad effects.

How quickly the bad effects happen depends on what RF guys called the Q of a circuit -- a measure of how bad things get when you move off resonance. (Or how well you reject adjacent frequency signals, the original issue.)

So an added constraint is that to get the "right" effect here, the circuits need to be designed to have a low Q. Don't know if anybody is thinking of this or thinking of it in those terms, but it makes sense to me, anyway.

(Motivation post coming soon, I promise!)

kme said...

Hobbyist overclockers have known for a long time that you need to bump up Vdd to maintain stability at higher clock rates.

Anonymous said...

Wes: "...if you buy a multi-core you'll naturally want to use it to run faster, not lower power at the same speed."

Speaking as a software guy: not always, or even most of the time. There are two ends to the spectrum: wanting lower latency on large amounts of calculation (as in high-frequency trading applications), and wanting more throughput (as when I have to go from handling a few thousand web requests per second to a few tens or hundreds of thousands).

For the former I move to multicore, at much greater software development expense, only because I can't get enough CPU cycles quick enough out of a single-core processor. But in this situation, given the choice between more cores and faster cores, I'll go for the latter any day, even if it costs me a bit more.

For the latter, if it adds a few milliseconds to my latency in delivering a page, I'm happy to live with that if it means I can fairly cheaply deliver a lot more pages in the same wall-clock time. And if that cheapness comes in part through reduced power usage at my ever-growing data centre, that can be worthwhile.

Yale Zhang said...

Gary Lauterbach, you're completely right.

Good old Dennard scaling (frequency ~= Vdd) has been over for 10 years, because it's simply impractical to lower transistor threshold voltage, due to leakage.

For all practical purposes, supply voltage Vdd has remained at a constant 1V these days.

Greg Pfister said...

Yale, thanks for your comment here and on the next post.

About your assertions, two comments:

1. Existence proof: Intel SpeedStep. It's aimed in the opposite direction from what I'm after here, but it does vary both voltage and frequency, and at the high end it sucks down more power.

2. While I did talk here and in the next post about this being a dynamic feature, it's still true and usable as a static feature: Just design for more cores at lower voltage and frequency. Of course, then you have to assume all your software is parallel, which is not all that good an assumption. It can probably be asserted that this is actually what the industry as a whole has done in response to the excessive heat issues of ever-increasing clocks.
