A mobile GPU is an expensive piece of SoC real estate in terms of footprint and power consumption, but critical to meeting user experience demands. GPU IP tuned for OpenGL ES is now a staple in high-performance mobile devices, rendering polygons with shading and texture compression at impressive speeds.
Creative minds in the desktop space long ago figured out that GPUs can be viewed as vector engines, and can be put to work accelerating computational tasks that general-purpose CPUs grapple with. This looked like an ideal fit for the mobile space, with tasks like facial recognition, computational photography, and embedded vision growing in popularity.
There are a few problems, however. The first is the programming model; CPUs and GPUs are radically different. To attack that, Apple and others got behind OpenCL, providing parallel constructs with the hope of getting heterogeneous processing units to work together. OpenCL has made significant progress on many data-parallel tasks.
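To make the programming-model gap concrete: real OpenCL kernels are written in a C dialect and launched through the OpenCL host API, but the data-parallel idea can be sketched in a few lines of Python. The function names below (`vector_add_kernel`, `enqueue_nd_range`) are illustrative stand-ins, not OpenCL API calls; the point is that the programmer writes the body for one "work-item" and the runtime fans it out across the hardware.

```python
# Illustrative sketch only -- real OpenCL kernels are a C dialect; this
# emulates the data-parallel model of one work-item per output element.

def vector_add_kernel(global_id, a, b, out):
    """Body of a hypothetical vector-add kernel, run once per work-item."""
    out[global_id] = a[global_id] + b[global_id]

def enqueue_nd_range(kernel, global_size, *args):
    """Stand-in for the runtime: invoke the kernel for each work-item.
    On a GPU these iterations execute in parallel across shader cores."""
    for gid in range(global_size):
        kernel(gid, *args)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
enqueue_nd_range(vector_add_kernel, 4, a, b, out)
print(out)  # [11.0, 22.0, 33.0, 44.0]
```

On a CPU the loop runs serially; on a GPU the same kernel body is replicated across hundreds of lanes, which is exactly the mismatch OpenCL's abstractions paper over.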
The second problem is memory space – CPUs have theirs, GPUs have theirs, and betwixt is a performance problem usually solved by copying data between the two spaces. AMD and others brought HSA (Heterogeneous System Architecture) to the table, redefining the interface between CPU and GPU (or other execution units) around a shared memory space.
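A back-of-the-envelope model shows why those copies hurt. The numbers below are assumptions for illustration (a hypothetical 12 MP RGBA frame), not ARM or HSA figures: with split memory spaces each dispatch stages the buffer into GPU memory and copies the result back, while in an HSA-style shared space the GPU dereferences the same buffer in place.

```python
# Assumed workload for illustration: a 12 MP, 4-bytes-per-pixel frame.
IMAGE_BYTES = 12_000_000 * 4

def bytes_copied_discrete(frames):
    # split memory spaces: copy-in to GPU memory, then copy the result back
    return frames * IMAGE_BYTES * 2

def bytes_copied_shared(frames):
    # shared memory space: CPU and GPU touch the same buffer, no staging copies
    return 0

print(bytes_copied_discrete(30))  # 2880000000 -- ~2.9 GB/s of pure copy traffic at 30 fps
print(bytes_copied_shared(30))    # 0
```

Nearly 3 GB/s of memory traffic that does no useful work is the overhead HSA's shared memory space is designed to eliminate.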
Which brings us to the third problem. Shared memory is fantastic, but real performance in a multicore CPU architecture means lots of cache, and with it cache coherence. Cache miss penalties can be brutal, especially on large data sets like images. Tossing GPUs into the processing mix without cache coherence may produce gains on very particular benchmarks. For greater gains and consistent performance, we need new IP that maintains coherence.
ARM has rethought their interconnect architecture, introducing two new IP blocks to bring in a new crop of fully coherent GPUs. We should mention here that ARM has a three-tiered product strategy for interconnect: a low-end CoreLink NIC for basic SoCs, a high-end CoreLink CCN for the AMBA 5 CHI server-class multicore crowd, and the mid-range where this announcement lives.
The new CoreLink CCI-550 is shown with six ACE interfaces: two for a big.LITTLE cluster and four for the GPU. This is the scaled-up configuration, offering up to 60% more peak interconnect bandwidth than the CCI-500. The CCI-550 also scales down, with fewer ACE and memory interfaces, for more optimized solutions. The key feature of the CCI-550 is the integrated snoop filter, which forgoes sending all snoops to all processors in favor of a single central directory lookup. This lowers snoop latency, relieves what would otherwise be quadratic scaling of snoop traffic, and removes speculative DRAM accesses.
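The scaling argument behind the snoop filter can be sketched with an idealized model (this is the generic directory-versus-broadcast trade-off, not CCI-550 internals): without a filter, every cache miss broadcasts a snoop to every other coherent agent, so total snoop traffic grows roughly quadratically with agent count, while a central filter answers each miss with one lookup.

```python
# Idealized coherence-traffic model, not ARM's implementation.

def broadcast_snoops(agents, misses_per_agent=1):
    # each miss snoops all (agents - 1) peer caches
    return agents * misses_per_agent * (agents - 1)

def filtered_snoops(agents, misses_per_agent=1):
    # each miss is a single lookup in the central snoop filter
    return agents * misses_per_agent

for n in (2, 4, 8):
    print(n, broadcast_snoops(n), filtered_snoops(n))
# 2 agents: 2 vs 2; 4 agents: 12 vs 4; 8 agents: 56 vs 8
```

At two agents the schemes are a wash, which is why broadcast snooping was tolerable in small clusters; add a multi-interface GPU as four more coherent agents and the broadcast cost balloons, which is exactly the pressure the integrated filter relieves.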
Those DRAM accesses come through a new DRAM controller, the CoreLink DMC-500. Tuned for up to LPDDR4-4267, the DMC-500 ups memory bandwidth by 27% and drops CPU latency by 25%. These solutions have been qualified to work together, reducing integration issues.
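For a sense of scale on the LPDDR4-4267 figure, a quick arithmetic sketch (the 32-bit channel width below is an assumption for illustration; actual SoC memory configurations vary and may gang multiple channels):

```python
# Raw channel bandwidth for LPDDR4-4267 on an assumed x32 channel.
transfers_per_sec = 4267e6   # LPDDR4-4267: 4267 mega-transfers per second
bytes_per_transfer = 4       # assumed 32-bit channel width

peak_gb_s = transfers_per_sec * bytes_per_transfer / 1e9
print(round(peak_gb_s, 1))  # 17.1
```

Roughly 17 GB/s of raw bandwidth per 32-bit channel, before controller efficiency, is the pool the CPU clusters and a coherent GPU would be drawing from together.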
There is also some intrigue over the GPU itself in this diagram. During our pre-briefing, ARM declined to provide details on the Mali “Mimir” GPU other than confirming new IP is in the works. My guess: stay tuned for details at ARM TechCon, coming up in a few weeks. I also asked if other GPU vendors are working on coherent IP; ARM said only that information is being shared under the auspices of the HSA Foundation.
Fully coherent ARM CPU/GPU combinations could get interesting, although the chances of something like the Quake-Catcher Network emerging on distributed mobile devices run into an expensive, metered 4G pipe. Still, removing the coherence barrier in mobile SoCs puts new algorithms that take full advantage of GPU compute power up for grabs. It could also create an interesting dynamic for HSA and for alternative CPU and GPU core combinations beyond an all-ARM offering.