I don’t look at the RTL power estimation topic too often these days, so I was interested to see that ANSYS still has a very strong position in this area. Qualcomm is using PowerArtist on one of the most demanding modern applications – mobile GPU power in gaming. Mobile gaming heavily loads the GPU, so any optimization in that area will affect battery life. This is a world-class test because it’s not just ‘more of the same but bigger’. Gaming benchmarks are really going to stretch the range for that ever-present challenge in power estimation: bridging the gap between system-level use-cases and RTL-level power calculations.
There’s so much complexity in modern GPUs that averaged power estimates across relatively simple directed tests fall short. These are simply not going to be good enough to drive intelligent optimization choices in RTL design. Jiaze Li from Qualcomm presented a paper at a recent ANSYS Simulation World on their more realistic approach.
First, Qualcomm starts with realistic gaming loads. Jiaze mentioned Manhattan and Aztec Ruins as two popular games used for GPU benchmarking today. They extract multi-millisecond sequences from these games as their basis for testing. These are still long enough that simulation must run on an emulator, streaming data out to drive power analysis in ANSYS PowerArtist. Qualcomm uses this flow to track how power is changing as the design evolves and to optimize RTL for power reduction.
Jiaze added that the emulation flow is too cumbersome for detailed power debug. Instead they use a parallel simulation-based power flow. The tests they use here are derived from the same large gaming benchmarks. However, these tests are greatly reduced in size, capturing the essentials of the graphics features while still running in reasonable time on the simulator. This reduction is very much a manual task, something into which Jiaze and the team put a lot of work, but they’ve figured out a process to efficiently build these reduced tests.
The second important point is that they divide the analysis time, by graphics features, into multiple windows. The systems team defines the windows, which are not generally equal in size. PowerArtist then calculates power estimates per window. This gives them a chunked timeline view of averages, in which they can see variations in average power as a function of feature. That, he says, gives them a lot of insight into contributors to power in any given window. It also suggests how they might best optimize not only for average power but also for some sense of peak power.
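To make the windowing idea concrete, here is a minimal sketch in plain Python. It is not PowerArtist's interface or Qualcomm's flow: the trace values, the feature names and the `Window` and `window_averages` helpers are all made up for illustration. It assumes a uniformly sampled power trace and simply averages it over unequal, feature-defined windows.

```python
from dataclasses import dataclass


@dataclass
class Window:
    name: str      # hypothetical graphics-feature label
    start_us: int  # window start, inclusive
    end_us: int    # window end, exclusive


def window_averages(power_mw, step_us, windows):
    """Average a uniformly sampled power trace over each window."""
    results = {}
    for w in windows:
        first = w.start_us // step_us
        last = w.end_us // step_us
        chunk = power_mw[first:last]
        if not chunk:
            raise ValueError(f"window {w.name!r} contains no samples")
        results[w.name] = sum(chunk) / len(chunk)
    return results


# Made-up trace: one power sample (mW) every 10 us.
trace = [100, 100, 300, 300, 300, 300, 200, 200]

# Windows of unequal length, one per (hypothetical) feature.
windows = [
    Window("vertex", 0, 20),
    Window("shading", 20, 60),
    Window("blend", 60, 80),
]

averages = window_averages(trace, 10, windows)
overall = sum(trace) / len(trace)

# The window with the highest average is a rough proxy for peak power.
hottest = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} mW")
print(f"overall: {overall:.1f} mW, hottest window: {hottest}")
```

In this toy data the single overall average hides the fact that one window runs well above the others, which is the kind of variation a per-window view exposes.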
Jiaze said that the flow is running in bi-weekly production regressions at Qualcomm. They have used the flow to drive a 5% reduction in power on their most recent design. Most of the improvements were through adding clock gating and eliminating redundant data toggling. He added a very nice bonus in their use of this method. They are able to very concretely justify the power reductions they are able to find. Much better than a more general ‘we suggested a bunch of improvements and see – it got better!’