Covering configurable systems is a challenge. What’s a good strategy to pick a small subset of settings and still get high confidence in coverage? Paul Cunningham (GM, Verification at Cadence), Jim Hogan and I continue our series on research ideas, here an idea from software testing which should also apply to hardware. Feel free to comment.
This month’s pick is Combinatorial Methods for Testing and Analysis of Critical Software and Security Systems. The link is to a long but readable tutorial generated by NIST and SBA Research (Austria), on orthogonal array testing.
Testing coverage for highly configurable systems is a constant concern. Testing one configuration is hard work; multiplying that load by all possible configurations becomes a nightmare. The authors explain a rationale for reducing the set of configurations to test in a way that should not significantly compromise coverage. They analyzed a broad range of software, looking at the cumulative percent of faults versus the number of contributing configuration variables, call this “t”. They find a strong correlation between those fault distributions and branch conditions with t variables, with very high coverage for t<=6, and 90%+ for t<=3.
This observation is central to how they build tests. Covering all possible permutations of small subsets of the variables can be done very efficiently using orthogonal arrays. Exhaustively testing all permutations of 10 binary variables takes 210 tests. Taken 3 at time, this reduces to 120 3-bit combinations, 120*23 = 960 sequences to test, not a big reduction. Instead they suggest using a covering array, which can cover all of those 960 possibilities in just 13 tests. This concept is not new – it was developed in the 1990’s. However figuring out good algorithms to find covering arrays is a difficult optimization problem. The authors say such algorithms have appeared only recently, and NIST has implemented these in their ACTS tool.
As an example, they consider a case with 34 switches requiring 17 billion combinations to fully test. They can cover all 3-way combinations of switches in just 33 tests, for 70% or better coverage, and all 4-way in 85 tests for 90% or better coverage.
Run-times on one test-case, an air traffic collision avoidance system, are impressive. For 12 variables they show run-times to generate tests up to little more than 60 seconds for t=6. The authors show examples of application to a subway control system and a smart home assistant,. Also to network deadlock detection, security protocol testing and hardware trojan detection.
As a computer scientist by training, I have to begin by appreciating the pure intellectual beauty of this problem. From reading up on it a bit, looks like the concept originated in Latin squares analysis from the 17th century, and was then picked up in the 1950’s and promoted by Taguchi for statistical methods in experimental design and manufacturing, e.g. in applications to optimize spot welding and semiconductor processes.
From the perspective of chip verification, most of the examples the authors consider have a very small number of parameters – O(10) to O(100) boolean variables. A typical constrained random testbench for a modern SOC IP block might have 1,000-10,000 parameters. Many of these will be 16- or 32-bit variables, so a much larger space to consider. But otherwise the approach is absolutely applicable to chip verification and could be very powerful at catching corner case bugs.
This technique relies on the premise that most bugs are excited by a specific combination of values on only a small number of control knobs. This premise is based on empirical data from software programs. It is possible that it doesn’t apply to digital circuits, especially since hardware is a massively concurrent computing paradigm. As the authors acknowledge in their presentation, orthogonal array-finding complexity rises exponentially with the number of control knobs. Increased parallelism in digital circuits may require an increased number of control knobs to catch most bugs. If so, the approach could quickly become unscalable for chip verification.
I definitely see application to extending current verification flows, as a tool feature, but there’s another intriguing possibility. RISC-V IP vendors have a unique challenge. They verify their IP as carefully as possible, across as many knob settings as possible. Then they deliver it to customers who may further extend the IP. They then need to reverify in their applications. This is already a big challenge. How are they going to ensure their testing covers enough of this parameter space for their purposes? Would it be valuable to the IP vendors to publish a tool like this along with their IP. As an aid to their customers to ensure good configuration coverage?
I also think this has interesting parallels with our first blog, BTW.
Expanding on Paul’s last point, further study is essential. I think that as long as the parameters are independent (orthogonal), most faults will be triggered by a small number. This seems reasonable for external parameters (as for our first blog), less certain for parameters used in CR testing.
You can read the previous blog HERE.