After we detect a bug, can we use AI to locate the fault, or at least get close? Paul Cunningham (GM of Verification at Cadence), Jim Hogan and I continue our series on novel research ideas, through a paper in software verification we find equally relevant to hardware. Feel free to comment.
This month’s pick is Precise Learn-to-Rank Fault Localization Using Dynamic and Static Features of Target Programs. You can find the paper in the 2019 ACM Transactions on Software Engineering and Methodology. The authors are from KAIST, South Korea.
There’s an apparent paradox emerging in the papers we have reviewed. Our verification targets are huge designs – 2^N states, 2^2N transitions. How can anything useful be deduced from such complex state machines using quite small machine learning engines, here a 9-level genetic programming tree? We believe the answer is approximation. If you want to find the exact location of a fault, the mismatch is fatal. If you want to get close, the method works very well. Automating that zoom-in is a high-value contribution, even if we have to finish the job manually.
This paper is very detailed; again we’ll summarize only the takeaways. A key innovation is to use multiple techniques to score the suspiciousness of program elements. Two are dynamic: spectrum-based (the number of passing and failing tests that execute an element) and mutation-based (the number of failing tests which pass when an element is mutated). Three are static: dependency and complexity metrics on files, functions and statements. The method uses these features to learn a ranking of the most suspicious statements from program examples with known faults. At inference time, the method takes a program failing at least one test and generates a ranked list of probable fault locations.
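To make the spectrum-based idea concrete, here is a minimal sketch (not the authors’ code) using the well-known Ochiai suspiciousness formula; the element names and coverage counts are hypothetical:

```python
import math

def ochiai(failed_cover, passed_cover, total_failed):
    """Spectrum-based suspiciousness (Ochiai): elements executed by many
    failing tests and few passing tests score highest."""
    if failed_cover == 0 or total_failed == 0:
        return 0.0
    return failed_cover / math.sqrt(total_failed * (failed_cover + passed_cover))

# Hypothetical coverage data: element -> (failing tests covering it,
# passing tests covering it), with 3 failing tests in total.
coverage = {"stmt_17": (3, 1), "stmt_42": (1, 9), "stmt_88": (0, 5)}

# Rank elements most-suspicious first, as a fault-localization tool would.
ranked = sorted(coverage, key=lambda e: ochiai(*coverage[e], 3), reverse=True)
print(ranked)  # stmt_17 first: executed by all failing tests, few passing ones
```

In the paper this spectrum score is only one feature among several; the learned ranking model combines it with mutation-based and static features.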
The learning method the authors use is a genetic programming tree. Think of arbitrary expressions over these features, each represented as a number. Each expression can be represented as a tree. They train expressions through random mutation and selection to find those that most closely fit the training examples. A genetic programming tree is mathematically not so different from a neural net, doing a conceptually similar job in a somewhat different way. It’s a nice change to see a paper highlighting how powerful genetic programming can be, as a contrast to the sea of neural net papers we more commonly find.
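A tiny sketch of what an expression tree over features looks like, with a simple leaf-replacement mutation. This is an illustration under assumed feature names, not the paper’s implementation:

```python
import random

def evaluate(tree, feats):
    """Recursively evaluate an expression tree over a feature dict.
    Leaves are feature names; internal nodes are (operator, lhs, rhs)."""
    if isinstance(tree, str):
        return feats[tree]
    op, left, right = tree
    a, b = evaluate(left, feats), evaluate(right, feats)
    return a + b if op == "+" else a * b

def mutate(tree, feature_names):
    """Simplest GP mutation: randomly swap one leaf for another feature."""
    if isinstance(tree, str):
        return random.choice(feature_names)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left, feature_names), right)
    return (op, left, mutate(right, feature_names))

# A hypothetical candidate ranking formula: spectrum + mutation * complexity.
formula = ("+", "spectrum", ("*", "mutation", "complexity"))
feats = {"spectrum": 0.8, "mutation": 0.5, "complexity": 2.0}
print(evaluate(formula, feats))  # 0.8 + 0.5 * 2.0 = 1.8
```

A GP training loop would evaluate many such formulas against programs with known faults, keep the ones that rank the faulty statements highest, and mutate or cross them over to produce the next generation.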
This overall concept is growing on me, that a program could look at code and observed behaviors, and draw conclusions with the same insight as an experienced programmer. We code in quite restricted ways, but we don’t know how to capture that intuition in rules. However, we do know where to look as experienced programmers when we see suspicious characteristics, in behaviors in testing, in complexities and interdependencies in files and functions. This isn’t insight based on how each state toggles over time. We have a more intuitive, experience-based view, drawing on those higher-level features. I see this innovation capturing that intuition.
I think this is a very practical paper. The authors do a very systematic analysis of how each feature contributes, for example taking away one feature at a time to see how accuracy of detection is affected. Slicing and dicing their own analysis to understand where it might be soft, why it’s robust. For example, they only had to use 20% of the mutants that a mutation-only analysis needed, an important consideration because the mutation analysis is by far the biggest contributor to run-time. Reducing the number of mutants they consider reduces run-time by 4-5X. Yet their overall accuracy when combining this reduced mutation coverage with other features is way better than the full mutation-based approach alone.
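The mutant-sampling point can be sketched in a few lines; this is an illustrative stand-in (names and fraction chosen to match the 20% figure above), not the authors’ pipeline:

```python
import random

def sample_mutants(mutants, fraction=0.2, seed=0):
    """Pick a random subset of mutants to execute, trading a little
    mutation-based signal for a large cut in runtime."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    k = max(1, int(len(mutants) * fraction))
    return rng.sample(mutants, k)

mutants = [f"m{i}" for i in range(100)]  # hypothetical mutant identifiers
subset = sample_mutants(mutants)
print(len(subset))  # 20 of 100 mutants, roughly the 4-5X runtime reduction
```

Because the learned model blends the thinned-out mutation feature with spectrum and static features, accuracy holds up even with the reduced mutant set.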
The big revelation for me is that while the methods they use are not individually new (spectrum-based and mutation-based fault localization, static complexity metrics, genetic programming), they show that in putting these together, the whole is much greater than the sum of the parts. This is a very well thought-through paper. I could easily see how this idea could be turned into a commercial tool.
I like Paul’s point about the pieces individually not being that exciting, but when you pull them together, you’re getting results that are much more interesting. Our regular engineering expectation is that each component in what we build has to make an important contribution. In that way, when you add them together you expect a sum of important contributions. Maybe in these intuitive approaches it doesn’t always work that way, a few small components can add up to a much bigger outcome.
I’m starting to feel there’s a real pony in here. I have a sense that this is the biggest of the ideas we’ve seen so far. I’m hearing from my team-mates that the authors have done well on analyzing their method from all possible angles. There are indications that there’s a significant new idea here, built on well-established principles, showing a whole new level of results. I’d say this is investable.
I’m going to steal a point that Paul made in our off-line discussion. This feels similar to sensor fusion. One sensor can detect certain things, another can detect different things. Fused together you have a much better read than you could have got from each sensor signal separately. Perhaps we should explore this idea more widely in machine learning applications in verification.
You can read the previous Innovation in Verification blog HERE.