All the hubbub about FPGA-accelerated servers prompts a big question about cache coherency. Performance gains from external acceleration hardware can be wiped out if the system CPU cluster frequently takes cache-miss hits after data is worked on by an accelerator.
ARM’s announcement this week of its third-generation CoreLink CMN-600 Coherent Mesh Network interconnect brought plenty of good news for higher-performance ARMv8 designs. The interconnect runs at up to 2.5 GHz and halves latency, delivering five times the throughput of the prior CoreLink generation. It scales from 1 to 128 Cortex-A cores and uses AMBA 5 CHI interfacing.
Also part of the latency/throughput equation is the new CoreLink DMC-620 Dynamic Memory Controller, with integrated TrustZone security and support for up to 8 channels of DDR4-3200, including 3D stacked DRAM. The combination definitely helps CPU-centric performance. However, from other news I’ve covered this week, I’ll reiterate: viewing cache from only a CPU-centric perspective is an outdated idea in a world of heterogeneous SoCs.
ARM is backing a different approach to solve the system-centric coherency issue. A few months ago, seven companies – AMD, ARM, Huawei, IBM, Mellanox, Qualcomm, and Xilinx – announced a rather cryptic initiative called Cache Coherent Interconnect for Accelerators, or CCIX. Until now, everything we’ve known about that initiative has fit on a single web page.
A big differentiator in Xilinx Zynq versus other attempts at FPGA SoCs so far is its cache coherency – not perfect, an early proprietary implementation, but it still beats the heck out of having an FPGA hanging off a CPU on an external bus. (We can only assume Intel and Altera have learned the ‘Stellarton’ lesson, and we’ll see how in future products.)
So far, CCIX says only that the solution is a “driver-less and interrupt-less” usage model offering orders-of-magnitude improvement in application latency. Presumably, the improvement comes from connecting coherent agents and sharing cache updates more efficiently. In the ARM CMN-600 scheme, that’s part of what the Agile System Cache is supposed to do. ARM is also putting a lot of energy into cache-coherent GPU IP, and has its NIC-450 IP to help I/O subsystems.
But the sleeper in all this is CCIX support. What is likely developing here are three circles of influence: ARM and its ecosystem partners, IBM (see a new SemiWiki article on POWER9), and Xilinx in one camp; NVIDIA with NVLink, which also connects to IBM POWER8 and POWER9, in another; and Intel and Altera in the third. CCIX is open, and presumably someone else could jump in on that side. Intel and Altera could proceed with a proprietary solution, or surprise me and a lot of other folks by joining in. There’s also one other processor architecture out there – RISC-V – that could weigh in on the CCIX side soon. (I’ll have some more thoughts on that in an upcoming piece on my conversation this week with SiFive.)
A full CCIX spec is supposedly due before year-end, at which point I’d expect ARM to be a lot more specific about what it is doing in the CMN-600 IP. It’s interesting that only AMD, Huawei, and Qualcomm are onboard with CCIX so far, leaving one to wonder what the other ARM-based server players like Broadcom and Cavium are up to on cache coherency. As for other NoC vendors, Arteris has its proprietary NCore, and NetSpeed Systems has alluded to CCIX on slides but nothing official yet.
For more on what little ARM did say about the CMN-600, here’s the full press release:
ARM System IP boosts SoC performance from edge to cloud
As with any specification still awaiting final approval, what constitutes CCIX support right now may be subject to change down the road. Given the intensity with which open servers and other applications are being explored, betting on an open specification for connecting FPGAs coherently to SoCs is smart money.