Can we predict where bugs are most likely to be found, to better direct testing? Paul Cunningham (GM of Verification at Cadence), Jim Hogan and I continue our series on novel research ideas, again through a paper in software verification we find equally relevant to hardware. Feel free to comment if you agree or disagree.
This month’s pick is Software Defect Prediction via Attention-Based Recurrent Neural Network. The paper is published by Hindawi, an open-access scientific publisher. The authors are from several technical institutes in Shanghai.
This is a deeply technical paper. In the interests of space and time we’re going to abstract our high-level takeaways. Predicting defects in software has a history in complexity analysis, which measures how difficult it would be to fully test a function. This paper is one of many taking a modern look at prediction, comparing complexity and other static metrics with machine-learning-guided prediction.
In this paper, the “image” in which ML will recognize features is derived from abstract syntax trees extracted from the code, a novel feature the authors say provides a richer base for recognizing code semantics than do other methods. From this they build a recurrent neural network to learn syntactic and semantic features of the code. That is fed into a stage to capture local contextual information for code segments and next into an attention layer, a technique commonly used in machine translation to highlight critical features in the network. From there layers are run through activation functions to generate probabilities, in their case measured per input source file.
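To make the attention step a little more concrete, here is a minimal sketch in plain Python. It is not the authors' network: the hidden states are made-up numbers standing in for the recurrent layer's outputs, the scoring is a simple dot product against a hypothetical `query` vector rather than anything learned, and `out_weights` and `bias` are invented for illustration. It only shows the shape of the idea: score each position, normalize the scores into weights, take a weighted sum, then squash that into a per-file probability.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each hidden state against `query` (dot product), then return
    the attention weights and the weighted-sum context vector."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

def defect_probability(context, out_weights, bias):
    """Logistic output: squash a linear score into a probability in (0, 1)."""
    z = sum(c * w for c, w in zip(context, out_weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy stand-ins for the recurrent layer's per-node outputs (made-up numbers).
hidden_states = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
query = [1.0, 1.0]  # hypothetical attention vector; learned in a real model
weights, context = attention_pool(hidden_states, query)
prob = defect_probability(context, out_weights=[1.5, 1.5], bias=-1.0)
```

In this toy run the second hidden state gets the largest weight because it lines up best with `query`, which is the "highlighting" effect attention is meant to provide.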
In their testing they trained on Java projects with labeled bugs and compared against a range of other approaches, from random forests to deep belief networks and CNNs. They find that their method predicts more accurately than the comparison methods by 7%-14%, depending on test case. It should be noted, however, that their accuracy on average is ~50%. This is currently a technique for general guidance, not pinpoint accuracy.
I see an analogy with a chess grandmaster, looking at the board to assess intuitively whether he sees a good position for white or a good position for black. Here you’re effectively looking at code to intuitively spot weak points. Definitely an interesting idea.
Generally, for a commercial tool I want to see accuracy well into the 90s, but I have to admit I’m impressed that they can get to 50% accuracy just on eyeballing the code. On that alone I give it a thumbs up. I also noticed this is a very active area of research – the paper cites a lot of related work. I see nothing in this method restricted to software. Similar methods on Verilog and VHDL should be equally fruitful. I’d very much like to see research start in that direction.
There’s a lot to learn from the paper, worth a longer discussion, such as how they use an LSTM to combine understanding of other nearby lines in the code with learnings from similar-looking lines of code in other files or projects.
Here I’ll mention a more application-centric point. Should 50% accuracy mean this is only of academic interest today? I don’t think so. If, based on my intuitive eyeball, I decide some code looks a bit dodgy, it may not be worth investing my own time to figure out exactly why, but I could use that information to prioritize DV runs and run those tests first. I could use my “intuitive lint” solely for that guidance. Think of it as a smart scheduler, yet another way to squeeze more efficiency into the overall verification flow.
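The smart scheduler idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the file names, the test names, the scores, and the notion that we already know which files each test exercises. The ranking rule (a test is as urgent as the riskiest file it touches) is just one plausible choice, not something taken from the paper.

```python
def prioritize_tests(test_coverage, defect_scores):
    """Order tests so those touching the riskiest files run first.

    test_coverage: test name -> list of source files it exercises
    defect_scores: source file -> predicted defect probability (0..1)
    """
    def risk(test):
        files = test_coverage[test]
        # A test is as urgent as the riskiest file it touches;
        # files with no prediction count as 0.
        return max((defect_scores.get(f, 0.0) for f in files), default=0.0)

    # sorted() is stable, so tests with equal risk keep their original order.
    return sorted(test_coverage, key=risk, reverse=True)

# Hypothetical predictions and coverage map.
defect_scores = {"fifo.v": 0.72, "arbiter.v": 0.35, "decoder.v": 0.10}
test_coverage = {
    "smoke_decode": ["decoder.v"],
    "arb_stress": ["arbiter.v", "fifo.v"],
    "arb_basic": ["arbiter.v"],
}
order = prioritize_tests(test_coverage, defect_scores)
# order is ["arb_stress", "arb_basic", "smoke_decode"]
```

Even a noisy predictor can be useful here, because a wrong guess only changes the order in which tests run, not whether they run.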
You know, anything to do with AI or ML is hot. We’re still in that golden age for investment where the core technology is pretty solid and we’re still scratching the surface of possible applications. They’re popping up all over the place. Anywhere you can get something more than you can through traditional statistical methods, that’s interesting. I’m not too worried about the 50% accuracy. They’ll be able to tune the accuracy by application, I’m guessing. Some applications may allow for more accuracy. I think this is an area that definitely warrants more investment. Someone should do a proof of concept fairly quickly with a handful of engineers, to see if they can increase the confidence level.
I’ll also add that this is extra interesting because Lint is getting hot again. Look at SpyGlass. Look at Real Intent. Put ML together with a verification method finding a second wind, two hot domains coming together? That is always going to look like a good bet.
This is my home turf, so of course I’m going to like it. We did some work on McCabe complexity metrics back at Atrenta. I don’t know how many customers used that capability, but the concept remains intuitively reasonable. Combine that general principle with ML and training to fold some level of experience into the mix and it really starts to sound interesting. Now I’m wondering if a similar line of reasoning could apply to grading testbenches.
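For anyone who hasn’t met the McCabe metric, it is roughly one plus the number of decision points in a piece of code. Here is a rough sketch using Python’s standard `ast` module on a small Python function. This is a stand-in for illustration only: it is not how the Atrenta capability worked, it analyzes Python rather than RTL, and the set of node types counted is a simplification. The `grant` function is a made-up sample.

```python
import ast

def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.IfExp, ast.ExceptHandler)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra branches beyond the first operand
            decisions += len(node.values) - 1
    return decisions + 1

# Made-up sample: a tiny arbiter-like function.
sample = '''
def grant(req_a, req_b, locked):
    if locked:
        return None
    if req_a and req_b:
        return "a"
    for name, req in (("a", req_a), ("b", req_b)):
        if req:
            return name
    return None
'''
score = cyclomatic_complexity(sample)
# Three ifs, one `and`, one for loop: 5 decision points, so score is 6.
```

A static count like this is the kind of hand-built feature that the ML approach aims to improve on by learning from labeled bugs.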