Cache coherency is the kind of concept you think you understand until you try to explain it. It is wise to go back to fundamentals and ask what coherency means to an expert. I have searched the web and found several white papers on the ARM site, and I can now share these freshly learned lessons (or you may prefer to download the white papers directly!).
Starting from the fundamentals is always a good idea, so I suggest you first read the Cache Coherency Fundamentals Part1 white paper from Neil Parris. There you will find the definition of coherency: “Coherency is about ensuring all processors, or bus masters in the system see the same view of memory”. The problem quickly arises: take a processor that creates a data structure and then passes it to a DMA engine to move. If that data is cached in the CPU and the DMA reads from external DDR, the DMA will read old, stale data.
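To make the stale-data problem concrete, here is a minimal toy model in Python. It is only a sketch: the class `WriteBackCache`, the function `dma_read` and the address `0x1000` are hypothetical names chosen for illustration, not anything taken from the white paper or from an ARM API. It assumes a write-back cache, where a CPU store stays in the cache until the line is written back.

```python
# Toy model of a write-back CPU cache in front of external memory (DDR).
# All names are hypothetical; this is an illustration, not an ARM API.

class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory   # backing store, standing in for DDR
        self.lines = {}        # address -> value currently held in the cache
        self.dirty = set()     # addresses modified but not yet written back

    def write(self, addr, value):
        # CPU store: the new value lands in the cache only.
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        # CPU load: hit in the cache if present, else fetch from memory.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]


def dma_read(memory, addr):
    # A non-coherent DMA engine reads DDR directly, bypassing the CPU cache.
    return memory[addr]


ddr = {0x1000: "old"}
cache = WriteBackCache(ddr)
cache.write(0x1000, "new")          # the CPU creates its data structure

cpu_view = cache.read(0x1000)       # the CPU sees its own update
dma_view = dma_read(ddr, 0x1000)    # the DMA still sees what is in DDR
print(cpu_view, dma_view)
```

The two bus masters end up with different views of the same address, which is exactly what the definition of coherency rules out.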
The author describes three mechanisms to maintain coherency: disabling caching, software-managed coherency and hardware-managed coherency. The first two clearly hurt performance and power, whereas hardware-managed coherency through a Cache Coherent Interconnect (CCI) is power and performance friendly and simplifies software.
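The three mechanisms can be compared in a small Python sketch. This is a toy model under stated assumptions, not how a CCI is implemented: the `Cache` class, its `clean` and `snoop` methods, the `run` helper and the mechanism labels are all hypothetical. "Software-managed" is modelled as an explicit cache clean before the DMA transfer, and "hardware-managed" as the DMA read snooping the CPU cache.

```python
# Toy comparison of coherency mechanisms. All names are hypothetical.

class Cache:
    def __init__(self, memory, enabled=True):
        self.memory = memory
        self.enabled = enabled
        self.dirty = {}    # address -> value not yet written back to memory

    def write(self, addr, value):
        if self.enabled:
            self.dirty[addr] = value     # write-back: value stays in the cache
        else:
            self.memory[addr] = value    # caching disabled: store goes to DDR

    def clean(self, addr):
        # Software-managed coherency: explicitly write a dirty line back.
        if addr in self.dirty:
            self.memory[addr] = self.dirty.pop(addr)

    def snoop(self, addr):
        # Hardware-managed coherency: let another master look into the cache.
        return self.dirty.get(addr)


def dma_read(memory, addr, snoop_cache=None):
    # With a coherent interconnect, the read checks the CPU cache first.
    if snoop_cache is not None:
        hit = snoop_cache.snoop(addr)
        if hit is not None:
            return hit
    return memory[addr]


def run(mechanism):
    ddr = {0x1000: "old"}
    cache = Cache(ddr, enabled=(mechanism != "no_cache"))
    cache.write(0x1000, "new")
    if mechanism == "software":
        cache.clean(0x1000)              # the driver must remember this step
    snoop = cache if mechanism == "hardware" else None
    return dma_read(ddr, 0x1000, snoop_cache=snoop)


results = {m: run(m) for m in ("none", "no_cache", "software", "hardware")}
print(results)
```

In this model only the "none" case returns stale data. The sketch does not capture the cost side: disabling the cache sends every store to DDR, and the software approach spends CPU cycles on cache maintenance and relies on the programmer never forgetting the clean.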
Extending hardware coherency to the system requires a coherent bus protocol, and in 2011 ARM released the AMBA 4 ACE specification, which introduces the “AXI Coherency Extensions” on top of the popular AXI protocol. The ACE interface allows hardware coherency between processor clusters (remember the processor and DMA engine example), and it was also the key enabler for a Symmetric Multi-Processor (SMP) operating system to extend to more cores. Chip makers like Samsung or Qualcomm, designing quad-core if not octa-core application processors, have taken full benefit of CCI. These products have opened the gate to big.LITTLE designs as well as GPU compute in mobile applications. GPU compute includes computational photography, computer vision, modern multimedia codecs targeting Ultra HD resolutions such as HEVC and VP9, complex image processing and gesture recognition.
Now that you have read this first white paper, it may be time to download Cache Coherency Fundamentals Part2 and discover the products developed to support massive cache coherency:
- CoreLink CCI-400 Cache Coherent Interconnect
  - Up to 2 clusters, 8 cores
- CoreLink CCN-504 Cache Coherent Network
  - Up to 4 clusters, 16 cores
  - Integrated L3 cache, 2-channel 72-bit DDR
- CoreLink CCN-508 Cache Coherent Network
  - Up to 8 clusters, 32 cores
  - Integrated L3 cache, 4-channel 72-bit DDR
CoreLink products have allowed ARM to target enterprise applications such as networking and servers, supporting high-performance serial interfaces such as PCI Express, Serial ATA and Ethernet. In most applications all of this data will be marked as shared, as there will be many cases where the CPU needs to access data from these serial interfaces.
This second white paper gives an exhaustive description and feature list of these CoreLink products. If you want to go deeper into enterprise-dedicated solutions, I would recommend reading this blog from Ian Forsyth, Coherent Interconnect Technology supports Exponential Data Flow Growth.
CoreLink CCN-508, described in this paper, has been designed to support the performance requirements of up to 32 cores, and also includes the following low-power features:
- Extensive clock gating
- Leakage mitigation hooks
- Granular DVFS (Dynamic Voltage and Frequency Scaling) and CPU shutdown support
- Partial or full L3 (level-3) cache shutdown and retention modes.
In fact, low power, or better, power efficiency, has been ARM’s differentiator, explaining the incredible success of the company in mobile applications, with probably more than 95% penetration in the billions of phones and smartphones shipped every year. Power efficiency will be the key to enterprise market penetration. Lower power consumption is no longer a “nice to have” feature in such a power-hungry market; it is becoming a “must have” and could be the Trojan horse for ARM to penetrate this high-performance market. Just take a look at the 32-core architecture described above: it is highly complex and high performing, but imagine that you have to integrate, package and cool thousands of such ICs in the same space. And pay the electricity bill, with about two-thirds of it dedicated to the cooling system alone!
Eric Esteve from IPNEST