Mutation testing is an intriguing idea, but is it useful? Paul Cunningham (GM of Verification at Cadence), Jim Hogan and I continue our series on novel research ideas, here looking at a paper examining the pros and cons of this topic. Feel free to comment if you agree or disagree.
This month’s pick is Which Software Faults Are Tests Not Detecting? The paper was presented at the 2020 Evaluation and Assessment in Software Engineering conference. The authors are all from Lancaster University in the UK.
The contribution in this paper is analysis of testing efficiency in software, to find methods to improve the ability of tests to uncover more bugs. The authors measure efficiency through a combination of code coverage and mutation analyses. In mutation testing functional errors are inserted in the code, testing is re-run and test efficiency is determined by ability to detect the mutation. They apply their analysis to 10 open-source systems with associated unit tests, using a tool to automatically insert faults. From this they analyze efficiency of the tests by fault type.
They report that in 6 of the systems, less than 50% of the injected faults are detected and some fault types are detected more frequently than others, particularly conditional boundary checks. They also find that the lowest performing tests are 10X less efficient in detecting boundary faults.
The authors also discuss challenges in mutation testing. One study finds that most post-release faults are complex and can only be fixed through modifications in several locations. Attempting to model these through mutation would explode rapidly. Also, a study at Google confirms that even simple mutation testing is very expensive. Many mutants are unproductive, being either redundant or equivalent, yet are not easily weeded out.
This is something we’re looking at closely as a natural area of interest in our metric driven verification (MDV) strategy. We’re always interested in ways to help improve test effectiveness; this paper adds to our understanding.
Testing mutated code is computationally expensive, whether it’s software or hardware, since you have to run all your tests not only on the original code but also on each mutated version. In hardware verification, testing the non-mutated design is already swamping verification resources. If we are going to do mutation testing in hardware, we need to focus on high ROI mutations. A second concern is that mutation testing exposes limitations in tests, not bugs in the design. Which is still valuable but not a first-order concern, making it a tougher sell for schedule-constrained projects.
Nevertheless, selective use of high ROI mutation coverage could still be helpful in hardware, especially for modules where there is no good functional coverage model available. The paper cites boundary condition mutation, for example, mutating “<=” with “<” as more likely to find useful gaps in tests than mutating “+” with “-“. Buffer overflow security attacks are given as a good example where boundary condition mutation can catch gaps in test suites. This example applies equally to both software and hardware test.
The observation I want to make here, as an investor is that I have seen a decline in research in functional verification at the RTL level, at least judging by the number of papers we see. Not application-level stuff, how to better use the tools we’ve already got, that’s common. I’m talking about original research, from universities or outfits like Google.
This isn’t because all the problems are solved – they definitely aren’t. I think it’s more for universities because grants are directed to problems in other areas, and in the hyperscalars because software is their biggest driver for innovation. What then should we do in functional verification for hardware? Learn from research in software verification! The two domains are very closely related, not identical but the overlap is significant. I want to see more of these software parallels.
On this topic specifically, I want to better understand the associated costs. That’s a huge factor in ROI; the “R” will have to be equally impressive.
Security seems like a good application for mutation testing. Here there may be more willingness to accept the added overhead, also the recently released Mitre list of common weaknesses in hardware should provide inspiration for more security-related high-value mutations beyond boundary conditions.