When I posted earlier on Qualcomm presenting with ANSYS on differential energy analysis, I assumed this was just the usual story on RTL power estimation being more accurate for relative estimation between different implementations. I sold them short. This turned out to be a much more interesting methodology for optimizing total energy using ANSYS PowerArtist.
Yadong Wang of Qualcomm presented; he owns power modeling and analysis for Adreno GPUs. Before that he was a hardware power engineer at NVIDIA, so he’s pretty experienced in this domain. He started by noting that the impact of power on heating is a big challenge in mobile GPUs. As you play a game on your phone, the temperature rises. Eventually thermal mitigation kicks in and clock speed drops; the game runs slower. The longer you play, the slower the game runs (down to some limit), which doesn’t make for great customer satisfaction. This is why thermal-constrained performance is becoming one of the most important KPIs in mobile design.
This is a dynamic power problem. Assuming you’ve done all you can to minimize leakage (through process selection and power islands), and you accept you want to avoid switching to lower voltage/frequency options for the reason cited above, you really have to direct most of your attention to minimizing redundant activity, which you pretty much have to do at RTL. This is the low-cost place to make design changes: you can iterate quickly on different options, and the impact of changes is generally much higher than for any fixes that are practically possible at implementation. Yadong uses ANSYS PowerArtist in his work.
The common approach to optimizing power in these cases is to run an analysis with some workload, look at the hierarchical breakdown of dynamic power components (switching power and internal power) through the design, then look for cases where there might be redundant activity, such as a clock toggling on a register when the data input to that register isn’t changing. This process works but it doesn’t necessarily feel optimal. Power savings may not even be possible in a given function, but you might not know that until you’ve done quite a bit of searching. Wouldn’t it be better to know at the outset whether there is opportunity to reduce power on this function and whether the potential for reduction is significant? That’s where Qualcomm’s approach is really clever.
The core of the method looks at energy (power integrated over time) rather than power. And instead of hunting for redundant toggles, the method tweaks the workload (my view) by inserting bubbles in the path of incoming transitions or outgoing responses, to mimic starvation or stalls. This draws out the simulated time for that workload and therefore the time over which power is integrated to yield total energy.
Now they compare that energy report with the same report from a bubble-free run. The bubble-free case runs for less time with a higher average power, while the bubbled case runs for a longer time with lower average power. Ideally, total energy for these cases should be identical. But if there is power inefficiency in the design, the longer run-time in the bubbled case will amplify that inefficiency. So you know up-front whether there is opportunity to reduce total energy and you also have an idea of how much reduction may be possible.
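To make the comparison concrete, here is a minimal Python sketch of the arithmetic. It is my own illustration, not anything taken from the Qualcomm flow or from PowerArtist. It assumes each run has been reduced to a list of (time, power) samples; the function names and the numbers are hypothetical.

```python
def total_energy(samples):
    """Integrate power over time with the trapezoidal rule.

    samples: list of (time_s, power_w) tuples, sorted by time.
    Returns energy in joules.
    """
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def excess_energy_fraction(baseline, bubbled):
    """Fraction by which the bubbled run's energy exceeds the baseline's.

    Near zero suggests little wasted activity; a clearly positive value
    suggests activity that continues while no useful work is being done.
    """
    e_base = total_energy(baseline)
    e_bub = total_energy(bubbled)
    return (e_bub - e_base) / e_base


# Made-up numbers, constant power in each run for simplicity.
baseline = [(0.0, 2.0), (1.0, 2.0)]  # 1 s at 2.0 W
bubbled = [(0.0, 1.2), (2.0, 1.2)]   # 2 s at 1.2 W

print(total_energy(baseline), total_energy(bubbled))
print(excess_energy_fraction(baseline, bubbled))
```

With these invented numbers the bubbled run has lower average power but spends roughly 20% more energy on the same work, which is the kind of gap that flags an opportunity.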
Yadong took this further. In the experiments he described, he looked particularly at register-related dynamic power. Power estimation tools report switching and internal power separately. He noted that redundant D/Q toggles on a register will, in the bubbled case, cause an increase in both switching and internal energy, whereas redundant toggles on the clock input will increase only internal energy. Thus, in comparison with the un-bubbled analysis, there are four possibilities:
- No change in switching or internal energy – no improvements are possible
- Internal energy increases but switching energy is the same – there are redundant toggles on clock pins
- Switching energy increases but internal energy is the same – there are redundant toggles on D/Q pins when the clock is disabled
- Both switching energy and internal energy increase – there are redundant toggles on both D/Q and clock pins
They can drill down through detailed reports to find where they can make improvements to reduce redundant toggles.
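The four cases above amount to a small decision table, which can be sketched in Python as follows. This is my own rendering of the list, not Qualcomm's script. The dictionary layout, the function name, and especially the `tolerance` threshold are assumptions; the presentation did not say how "no change" is judged in practice.

```python
def classify_register_energy(base, bubbled, tolerance=0.02):
    """Map a baseline-vs-bubbled energy comparison to one of the four cases.

    base, bubbled: dicts with 'switching' and 'internal' energy values
                   (same units) for the registers under study.
    tolerance:     relative increase treated as noise (assumed value).
    """
    def increased(key):
        return bubbled[key] > base[key] * (1.0 + tolerance)

    switching_up = increased("switching")
    internal_up = increased("internal")

    if switching_up and internal_up:
        return "redundant toggles on both D/Q and clock pins"
    if internal_up:
        return "redundant toggles on clock pins"
    if switching_up:
        return "redundant toggles on D/Q pins while the clock is disabled"
    return "no improvement possible"


# Hypothetical example: only internal energy grows in the bubbled run.
base = {"switching": 1.0, "internal": 1.0}
bubbled = {"switching": 1.0, "internal": 1.3}
print(classify_register_energy(base, bubbled))
```

Run per block or per hierarchy level, a table like this would tell you where the drill-down is worth the effort before you start hunting for individual toggles.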
What is especially startling is that Yadong said they were able to reduce dynamic power by 10% as a result of this analysis. This is in a company (and a market) where reducing power is pretty close to a religion. But I’m not surprised the approach is so effective. This feels like a more scientific technique to measure power inefficiency overall and to isolate root causes. By comparison, traditional methods look rather ad hoc.
Yadong mentioned at the end that a similar approach could be used to look at inefficiencies in memory, combinational logic and clock tree dynamic power. Analysis could pull similar data from the power estimation reports, though discriminating on differences in switching versus internal power might look different in each case. The webinar is well worth watching; you can register to view it.