There has been a lot of huffing and puffing lately about 64-bit cores making it into the Apple A7 and other mobile SoCs, and we could probably dedicate a post to that discussion. However, there are a couple other wrinkles to the Apple A7 that should be getting a lot more attention.
There are two primary causes of user frustration in multimedia applications. The first is effects-based lag: that nasty symptom when your game starts out with pristine imagery but chokes as the action progresses and more and more polygons, each with textures and effects, fly around in the carnage. The solution is to turn down the model-fidelity options, or to get a newer GPU.
The other cause is why desktop graphics cards bring along a lot of their own fast DRAM: memory bandwidth is critical. Even the fastest GPU engine will burp if it runs out of incoming data, which is possible in an application like streaming video. Users will wait for a relatively short buffering period, but if the video stutters too much, attention is gone.
Neither of those approaches works very well for a mobile GPU. Designers of mobile SoCs are all about getting as much memory bandwidth as possible for the least amount of power. In a smartphone scenario, with a typical LPDDR3 configuration, running memory flat out can account for half the total power consumption of the device.
As Henny Youngman would have said if he’d had a smartphone, “Doctor, it draws a lot of power when I watch video.” The correct response is not to advise the patient to stop watching video. There are several techniques to cut down mobile GPU memory bandwidth without completely sacrificing performance. The first we explored previously in a series on Imagination’s PVRTC: compress textures efficiently. Features like non-power-of-two (NPOT) texture map sizes and flexible bit-rate encoding mean less bandwidth.
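To see why bits per pixel matter so much, here is a back-of-the-envelope sketch comparing an uncompressed RGBA8888 texture against a 4-bits-per-pixel compressed format such as PVRTC. The texture size and the helper function are illustrative assumptions, not figures from Imagination or Apple:

```python
# Rough bandwidth comparison: uncompressed 32-bpp RGBA vs. a 4-bpp
# compressed format like PVRTC. Every texel fetched from DRAM costs
# bandwidth, so fewer bits per pixel means proportionally less traffic.

def texture_bytes(width, height, bits_per_pixel):
    """Storage (and worst-case fetch traffic) for one texture level."""
    return width * height * bits_per_pixel // 8

uncompressed = texture_bytes(1024, 1024, 32)  # RGBA8888
pvrtc_4bpp = texture_bytes(1024, 1024, 4)     # PVRTC at 4 bpp

print(f"uncompressed: {uncompressed / 2**20:.1f} MiB")
print(f"PVRTC 4bpp:   {pvrtc_4bpp / 2**20:.1f} MiB")
print(f"savings:      {1 - pvrtc_4bpp / uncompressed:.0%}")
```

An 87.5% reduction in texture traffic per fetch is why texture compression is the first lever mobile GPU designers reach for.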
The second is a lot smarter rendering. A modern GPU like the Imagination PowerVR Series6 “Rogue” G6430 in the A7 uses something called Tile-Based Deferred Rendering (TBDR). In layman’s terms, the scene is rendered in chunks based on what is visible; if a piece of the scene is not visible or not changing, it is not re-rendered until needed. TBDR has another advantage in that rendering can be easily dispatched to GPU cores in parallel.
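As a rough illustration of the tiling idea (a toy sketch, not the actual PowerVR pipeline), the screen can be split into small tiles, primitives binned to the tiles they overlap, and each tile shaded independently in fast on-chip memory; tiles containing no geometry never touch external memory at all. The tile size and primitive representation here are assumptions for illustration:

```python
# Minimal tile-binning sketch: primitives (as screen-space bounding
# boxes) are sorted into the tiles they overlap. Only tiles with work
# in them get rendered; empty tiles cost no external memory traffic.

TILE = 32  # tile edge in pixels; real hardware tile sizes vary

def bin_primitives(prims):
    """Map each primitive's bounding box to the tiles it touches."""
    tiles = {}
    for prim_id, (x0, y0, x1, y1) in prims.items():
        for ty in range(y0 // TILE, y1 // TILE + 1):
            for tx in range(x0 // TILE, x1 // TILE + 1):
                tiles.setdefault((tx, ty), []).append(prim_id)
    return tiles

# Two small triangles (bounding boxes) on a 128x128 screen.
prims = {"tri_a": (0, 0, 40, 40), "tri_b": (90, 90, 120, 120)}
bins = bin_primitives(prims)

print(f"{len(bins)} of {(128 // TILE) ** 2} tiles need rendering")
```

Because each tile is self-contained, the per-tile work also parallelizes cleanly across GPU cores, which is the second advantage mentioned above.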
The third piece is a lot more compression. Intermediate geometry data for TBDR is compressed using a lossless algorithm that produces a 3:2 improvement. The G6430 also offers optional lossless frame buffer compression, where render targets are written out compressed, typically achieving 2:1 but reaching as much as 30:1 in the optimum scenario of constant color.
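To make the constant-color best case concrete, here is a toy lossless tile compressor (not Imagination’s actual algorithm): a render-target tile that is a single uniform color collapses to one stored value, which is where the extreme ratios come from.

```python
# Toy lossless tile compression: uniform tiles collapse to a single
# stored value; anything else is kept raw. Real frame buffer
# compression is far more sophisticated, but the best case is similar.

def compress_tile(pixels):
    """Return ('const', value) for uniform tiles, else ('raw', pixels)."""
    if all(p == pixels[0] for p in pixels):
        return ("const", pixels[0])
    return ("raw", list(pixels))

def ratio(pixels):
    tag, data = compress_tile(pixels)
    stored = 1 if tag == "const" else len(data)
    return len(pixels) / stored

sky_tile = [0xFF87CEEB] * 64            # 8x8 tile, one color
mixed_tile = [0xFF87CEEB] * 63 + [0]    # one differing pixel

print(f"constant tile: {ratio(sky_tile):.0f}:1")
print(f"mixed tile:    {ratio(mixed_tile):.0f}:1")
```

Since a compressed tile is both written and later read back, the bandwidth saving applies twice per frame for intermediate render targets.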
For more on these techniques, read the recent Imagination blog on “Building efficient multimedia architectures for consumer electronics and mobile computing”.
So, why would we focus on the Apple implementation using the G6430? Doesn’t all this just come along for the ride by grabbing advanced mobile GPU IP and tossing it into the SoC mix? Let’s return to the A7 for a second and see what else was done to make that solution world-class.
image courtesy ChipWorks
If you look at the specs of the A7 compared to the NVIDIA Tegra 4 and Qualcomm Snapdragon 800, Apple actually has less available memory bandwidth to the main SoC memory. The A7 memory bus is clocked at 800 MHz, compared to 933 MHz in other solutions. Recall what we said earlier, however: as much memory bandwidth for the least amount of power possible.
The A7 iPhone 5s variant uses a stacked package-on-package solution for its 1GB of dual-channel memory. This isn’t a new Apple technique, but it is becoming clear this is about more than reducing physical size. Shorter memory runs plus lower clock speeds mean more efficient drivers and less power consumption. There is also some serious latency tuning going on in the A7 memory controller, as the AnandTech iPhone 5s review illustrates.
The ChipWorks teardown of the A7 reveals something else: a block of 4MB of SRAM strategically placed near the GPU cores. According to their guess and some analysis by TabShowdown, that is a GPU cache – not nearly as big as the dedicated DRAM found on desktop GPU cards, but on-die and power efficient. That 12.8 GB/s main memory bandwidth figure suddenly looks a lot better for many GPU operations, and combined with PVRTC2, TBDR, and the compression features in the G6430, it is a force multiplier for users.
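The peak-bandwidth numbers fall out of simple arithmetic. This sketch assumes an LPDDR3-style interface with a 64-bit (dual 32-bit channel) bus and two transfers per clock cycle; the bus width is an assumption for illustration, chosen because it matches the commonly cited 12.8 GB/s figure at 800 MHz:

```python
# Peak memory bandwidth = clock * transfers-per-clock * bus-width.
# DDR-type memories transfer data on both clock edges, hence the
# factor of two.

def peak_bandwidth_gbs(clock_mhz, bus_bits, transfers_per_clock=2):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return clock_mhz * 1e6 * transfers_per_clock * bus_bits / 8 / 1e9

a7 = peak_bandwidth_gbs(800, 64)      # A7 at 800 MHz
rival = peak_bandwidth_gbs(933, 64)   # 933 MHz competitors

print(f"A7 @ 800 MHz:    {a7:.1f} GB/s")
print(f"rival @ 933 MHz: {rival:.1f} GB/s")
```

The roughly 2 GB/s the A7 gives up at the interface is what the on-die SRAM, texture compression, TBDR, and frame buffer compression have to win back, and then some.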
I’ll go out on a limb and say that the long-standing relationship between Apple and Imagination goes a lot deeper than just grabbing IP and slapping it down. Making these kinds of performance/power optimizations takes mutual insight and cooperation. It wouldn’t surprise me at all to learn that there is a tight feedback loop from Apple to Imagination that directly resulted in improvements in the “Rogue” architecture, and in turn resulted in the A7 architecture.