Like most technology firms, Apple has been home to many successes, and some spectacular defeats. One failure was Project Aquarius. At the dawn of the RISC era, before ARM architecture was “discovered” in Cupertino, engineers were hunkered over a Cray X-MP/48. The objective was to design Apple’s own quad core RISC processor to speed up the Macintosh.
As if designing an instruction set, execution units, and pipeline is not hard enough, getting four cores to work together is more than simply a matter of cloning and connecting. Aquarius never got close to silicon. I’m guessing Apple ran head on into the pitfalls of bus arbitration and cache coherency in multiprocessor scenarios. After three years of effort, Aquarius was scuttled, with Apple soon thereafter turning to IBM and Motorola for help in designing PowerPC.
The dream of an Apple homegrown quad core processor didn’t die, but it did have to wait for technology to catch up. Fortunately for Apple and all SoC designers, ARM and others have since made tremendous progress on processor cores, bus interconnect, and cache coherency.
However, entering 2015 we are far from having all the issues around cache conveniently solved.
Why is cache coherency so hard to get right? I asked Bill Neifert of Carbon Design Systems that question, and he pointed me to an article he co-wrote recently with Adnan Hamid of Breker Verification Systems over in EETimes.
Fast, Thorough Verification of Multiprocessor SoC Cache Coherency
The good news is on the CPU side. ARMv8 IP has migrated toward a cluster strategy, building a quad-core complex with cache built in. Each processor core has its own L1 cache, and the cluster shares an L2 cache. Carbon has fast models for these clusters, and everything is great for verification using virtual prototypes.
Until designers stick that ARM cluster in an actual SoC design. Three things happen:
1) To differentiate SoC designs, folks are modifying the ARM CPU clusters.
2) To make SoC designs do actual work, folks are adding other types of IP cores.
3) To help performance, folks are adding L3 cache in the interconnect fabric.
As Neifert puts it, if you change it, by definition you break it. The first of these is manageable; changes in the CPU cluster are known, the model is updated, verification is run. Perhaps not simple, but straightforward, and the virtual prototyping tools for solving this are solid.
The second issue is also manageable, even for homegrown IP. Let’s assume each IP block in the design has an accurate model and flies through verification – again, non-trivial but achievable.
Now comes the third step. We put those blocks into a modeled interconnect, and … hey, why didn’t that work? The block-level functional verification effort was fine, but the nuances of system-level interaction and timing kick in. What was a perfectly accurate model at the block-level may be inadequate at system-level. If the interconnect has L3 cache – and ARM CoreLink certainly does – system-level caching can quickly turn into an unbounded issue if there are any tweaks in IP.
Cache is a funny thing. IBM has a lot of experience with multicore caching, and they use the term “cache pressure.” If there were only one thread of execution using cache, things might behave as expected. As more tasks are added, at some point cache contention slows all the threads using the cache, not just the one experiencing a cache miss.
Expand that thought across a bunch of heterogeneous cores – CPU, GPU, DSP, PCIe, SATA, USB, Ethernet – each with their own L1/L2 and all using a sea of L3 cache in the interconnect. Cache implementations vary wildly; there are different line widths, update policies, and address maps to deal with. “Some problems only happen when mixing blocks with accurate models,” says Neifert, which qualifies for the understatement of the year.
Exposing these problems by hand is excruciating. Breker and Carbon have teamed up for one solution, using automatically generated test cases against a virtual prototype with 100% accurate models. This allows a robust set of test cases to execute cache stress tests against known-good IP block models in a system-level configuration. Leveraging the fast models in the Carbon CPAK also many tests to run in a reasonable amount of time.
If all cache were created equal, we wouldn’t be having this discussion. Cache coherency in SoCs with many heterogeneous cores and a fabric interconnect is the frontier for SoC design. Neifert suggests there is a lot of “special sauce” being used right now, and teams are reluctant to share solutions – partly because the solution depends on their exact configuration. The ARM out-of-the-box IP is a good starting point, but given modifications and incorporation of other IP from third party and homegrown development, help is needed.
Don’t be like the John Sculley Apple. The autogenerated verification test bench described in the EETimes article is worth looking at, exploring the issues in system-level multicore SoC cache coherency and an approach to uncovering them using fast models and virtual prototypes.