We know about formal methods for cache coherence state machines. What sorts of tests are possible using dynamic coherence verification? Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is McVerSi: A Test Generation Framework for Fast Memory Consistency Verification in Simulation. The paper was published in the 2016 IEEE HPCA The authors are from the University of Edinburgh.
This is a slightly dated paper but is well cited in an important area of verification not widely covered. The authors’ goal is to automatically generate tests for multicore systems which will preferentially trigger consistency errors. Their focus is on homogenous designs using Gem5 as the simulator. Modeling is cycle accurate; in similar work CPU models may be fast/virtual and coherent network/caches and pipelines are modeled in RTL running on an emulator.
The method generates tests as software threads, one per CPU, each a sequence of loads, modifies, stores and barrier instructions. Initial generation is random. Tests aim to find races between threads where values differ between iterations, i.e. exhibit non-determinism. The authors argue that such cases are more likely to trigger consistency errors.
The authors then use genetic programming to combine such non-deterministic test components from sequences. From these they build new sequences to strengthen the likelihood of races which are likely to fail consistency tests. Where they find inconsistencies, they run a check to classify these as valid or invalid per the memory consistency model. They have a mechanism to measure coverage to guide the genetic algorithm and to determine when they should stop testing.
The authors describe bugs that such a system should find in a cache coherent network and their method performs well. The authors note limitations of formal methods to early and significantly abstracted models. In contrast this method is suitable for full system dynamic coherence verification.
Memory consistency verification is hard, especially in pre-silicon. But it is a very important topic that we find increasingly center stage in our discussions with many customers.
The heart of this paper is a genetic algorithm for mutating randomized CPU instruction tests to make them have more and more race conditions on memory writes and reads. The authors achieve this using a clever scoring system to rank memory addresses based on how many read or write race conditions there are on that address in a given test. Read or write instructions on addresses scoring high (which they call “racey”) are targeted by the genetic algorithm for splicing and recombination with other tests to “evolve” more and more racey tests. It’s a neat idea, and it really works, which is probably why this paper is so well cited!
The authors benchmark their algorithm on a popular open-source architecture simulator, Gem5, using the Ruby memory subsystem and GARNET interconnect fabric. They are able to identify two previously unreported corner case bugs in Gem5, and also show that their clever scoring system is necessary to create tests racy enough to catch these previously unreported bugs. Also, the authors show that their algorithm is able to find all other previously reported bugs much faster than other methods.
Overall, I found this a thought-provoking paper with a lot of detail. I had to read it a few times to fully appreciate the depth of its contributions, but it was worth it!
Our October 2021 article, RTLCheck: Verifying the Memory Consistency of RTL Designs addressed memory consistency by generating assertions and then checking them. This is a different approach which modifies the well-known constrained random test pattern generation by generating the tests using a genetic algorithm with a particularly clever “selective crossover”.
They run experiments for the x86-64 ISA running Linux across 8 cores. The cores they model as simple out-of-order processors, with L1 and L2 caches. Tests run in 512MB of memory with either 1KB or 8KB of address range. Each test runs 10 times. Three experiments run: pseudo randomly generated tests, genetic without selective crossover, and genetic with selective crossover. The evaluation detected 11 bugs, 9 already known and 2 newly discovered. Within 24 hours only the full algorithm (at 8KB) finds all 11 bugs, the other approaches find 5-9 bugs. The full algorithm also beats other approaches in terms of coverage for MESI and TSO-CC (cache-coherence protocols). However this is by a small margin.
The paper is highly instructive although a challenging read unless you’re an expert in MCM. The authors provide their software in GitHub, which no doubt encouraged subsequent papers which cite this work 😀. Given enough expertise, this is certainly one arrow in the quiver to tackle full system memory consistency verification!
As Raúl says, this is a challenging read. I have trimmed out many important details such as memory range and stride. Also I skipped runtime optimization, simply to keep this short summary easily digestible. Methods like this have an objective to automatically concentrate coherency suspects in a relatively bounded testplan. I believe this will be essential to dynamic coherence verification methods to catch long-cycle coherence problems.
Also read:Share this post via: