This is another look at refining the accuracy of fault localization. Once a bug has been detected, such techniques aim to pin down the most likely code locations for a root cause. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is DeepFL: Integrating Multiple Fault Diagnosis Dimensions for Deep Fault Localization. The paper was published in the 2019 ACM International Symposium on Software Testing and Analysis. The authors are from UT Dallas and the Southern University of Science and Technology, China.
This is an active area of research in software development. The authors build on the widely adopted technique of spectrum-based fault localization (SBFL). Pass and fail outcomes per test are correlated with coverage statistics per code “element”. An element could be a method, a statement, or another component recognized in coverage. Elements which correlate with failures are considered suspicious and can be ranked by strength of correlation. The method is intuitively reasonable, though it is susceptible to false negatives and false positives.
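To make the SBFL idea concrete, here is a minimal sketch in Python. It assumes the Ochiai formula, one commonly used SBFL suspiciousness score; the post does not say which formulas the paper relies on, and the function names, method names and test data below are hypothetical.

```python
import math

def ochiai(failed_cov, passed_cov, total_failed):
    """Ochiai suspiciousness for one code element (assumed formula).

    failed_cov:   number of failing tests that execute the element
    passed_cov:   number of passing tests that execute the element
    total_failed: total number of failing tests in the suite
    """
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

def rank_elements(coverage, outcomes):
    """coverage: {element: set of test names that execute it}
    outcomes: {test name: True if the test passed, False if it failed}
    Returns (element, score) pairs, most suspicious first."""
    failing = {t for t, ok in outcomes.items() if not ok}
    scores = {}
    for element, tests in coverage.items():
        ef = len(tests & failing)   # failing tests covering the element
        ep = len(tests - failing)   # passing tests covering the element
        scores[element] = ochiai(ef, ep, len(failing))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical coverage data: three methods, three tests, two failures.
coverage = {
    "parse()": {"t1", "t2", "t3"},
    "validate()": {"t2", "t3"},
    "emit()": {"t1"},
}
outcomes = {"t1": True, "t2": False, "t3": False}
ranking = rank_elements(coverage, outcomes)
```

In this toy data `validate()` is executed only by failing tests, so it ranks first; `parse()` is also executed by a passing test and ranks lower. The weakness noted above shows up here too: an innocent method that happens to run only in failing tests would look just as suspicious.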
DeepFL uses learning based on a variety of features to refine localization. Feature sources include SBFL, mutation-based fault localization (MBFL), code complexity and textual similarity comparisons between code elements and tests. In this last case, the intuition is that related text in an element and a test may suggest a closer relationship. The method shows a significant improvement in correct localization over SBFL alone. Further, the method shows promising value between projects, so that learning on one project can benefit others.
I really appreciate the detail in this paper. It serves as a great literature survey, providing extensive citations on all the work in the ML and software debug communities to apply deep learning to fault localization. I can see why this paper is itself so heavily cited!
The key contribution is a new kind of neural network topology that, for 77% of the bugs in the Defects4J benchmark, ranks the Java class method containing the bug as one of the top 5 most suspicious looking methods for that bug. This compares to 71% from prior work, a significant improvement.
The authors’ neural network topology is based on the observation that different suspiciousness features (code coverage based, mutation based, code complexity based) are best processed first by their own independent hidden network layer, before combining into a final output function. Using a traditional topology, where every node in the hidden layer is a flat convolution of all suspiciousness features, is less effective.
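The feature-grouped idea can be sketched as a forward pass in plain Python. This is an illustration under assumptions, not the authors' network: the group sizes are the feature counts quoted in this post, while the hidden width, ReLU activation, random initialization and all function names are hypothetical, and the model is untrained.

```python
import random

# Feature counts per group as quoted in this post; the rest is assumed.
GROUP_SIZES = {"sbfl": 34, "mbfl": 140, "complexity": 37, "text": 15}
HIDDEN_PER_GROUP = 8  # hypothetical hidden width for each group

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense_relu(inputs, weights, biases):
    """Fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def build_model(seed=0):
    rng = random.Random(seed)
    # One independent hidden layer per feature group.
    groups = {name: init_layer(n, HIDDEN_PER_GROUP, rng)
              for name, n in GROUP_SIZES.items()}
    # A single output unit combines the concatenated group outputs.
    output = init_layer(HIDDEN_PER_GROUP * len(GROUP_SIZES), 1, rng)
    return groups, output

def suspiciousness(model, features):
    """features: {group name: list of floats of that group's size}."""
    groups, (out_w, out_b) = model
    merged = []
    for name, (w, b) in groups.items():
        # Each group sees only its own features at this stage.
        merged.extend(dense_relu(features[name], w, b))
    # Linear combination of all group outputs gives one score.
    return sum(w * x for w, x in zip(out_w[0], merged)) + out_b[0]

model = build_model()
features = {name: [0.5] * n for name, n in GROUP_SIZES.items()}
score = suspiciousness(model, features)
```

A traditional topology would instead feed all 226 features into one shared hidden layer. The grouped version has far fewer first-layer weights and cannot mix, say, coverage and textual features until the combining step.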
I couldn’t help but notice that keeping a traditional topology and just adding a second hidden layer also improved the results significantly – not to the same level as the authors’ feature-grouped topology, but close. I wonder if adding a third hidden layer with a traditional topology would have further narrowed the gap?
Overall, this is a great paper, well written, with clear results and a clear contribution on an important topic. It is definitely applicable to chip verification as well as software verification. If anything, chip verification could benefit more since RTL code coverage is more sophisticated than software code coverage, for example in expression coverage.
This is a nice follow-on to earlier reviews on using ML to increase test coverage, predict test coverage and for power estimation. These authors use ML for fault localization in SW as an extension of learning-to-rank fault localization. The latter uses multiple suspiciousness values as learning features for supervised machine learning. The paper has 95 citations.
DeepFL uses four dimensions in suspiciousness ranking: SBFL, a statistical analysis of the coverage data of failed and passed tests, contributing 34 suspiciousness values for code elements; MBFL, which uses mutants (140 variants) to check the impact of code elements on test outcomes for precise fault localization; 37 fault-proneness-based features (e.g., code complexity); and finally, 15 textual similarity-based features from the information retrieval area. The authors use all of these to drive multiple deep learning models.
The authors run experiments on the Defects4J benchmark with 395 known bugs, widely used in software testing research. They compare DeepFL to SBFL/MBFL and to various learning-to-rank approaches. They also look at how often the bug location is ranked most suspicious (Top-1), or falls within the Top-3 or Top-5. The authors’ method outperforms other methods within a project and between projects.
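For readers unfamiliar with the Top-N metric, here is a minimal Python sketch of how such counts can be computed from ranked lists. The function name, data layout and toy data are hypothetical; the paper's own evaluation scripts may handle ties and multi-location bugs differently.

```python
def top_n_counts(rankings, faulty, ns=(1, 3, 5)):
    """rankings: {bug id: list of methods, most suspicious first}
    faulty:   {bug id: set of methods that actually contain the bug}
    Returns {n: number of bugs with a faulty method in the first n entries}."""
    counts = {n: 0 for n in ns}
    for bug, ranked in rankings.items():
        for n in ns:
            if any(method in faulty[bug] for method in ranked[:n]):
                counts[n] += 1
    return counts

# Hypothetical rankings for three bugs.
rankings = {
    "bug-1": ["a", "b", "c", "d", "e", "f"],  # faulty method ranked 1st
    "bug-2": ["x", "y", "z"],                 # faulty method ranked 3rd
    "bug-3": ["p", "q", "r", "s", "t", "u"],  # faulty method ranked 6th
}
faulty = {"bug-1": {"a"}, "bug-2": {"z"}, "bug-3": {"u"}}
counts = top_n_counts(rankings, faulty)
```

Here `counts` is `{1: 1, 3: 2, 5: 2}`: one bug is localized at Top-1, two within Top-3, and the third bug falls outside Top-5 entirely.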
They note that the biggest contributor to performance is the MBFL technique. Comparing runtime to a learning-to-rank approach, their method takes ~10X as long to train but is up to 1000X faster in test (runtimes in the range of 0.04s – 400s).
A very interesting part of their result analysis clarifies how DeepFL models perform for fault localization and whether deep learning for fault localization is necessary at all. Even though the authors conclude that “MLPDFL can significantly outperform LIBSVM” (Support Vector Machines have just one layer), the difference in Top-5 is just 309 vs. 299, a comparatively small gain.
I wish they had written more about cross-project prediction and had compared runtimes to traditional methods. Still, this is a very nice paper showing a SW debugging technique which seems applicable to RTL and higher level HW descriptions and once again highlights an application of ML to EDA.
There are several interesting papers in this area, some experimenting primarily with the features used in the learning method. Some look at the most recent code changes, for example. Some also play with the ML approach (e.g., graph-based methods). Each shows incremental improvement in localization accuracy. This domain feels to me like a rich vein to mine for further improvements.