We have talked about fault localization (root cause analysis) in several reviews. This early-release paper looks at applying LLM technology to the task. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
The Innovation
This month’s pick is A Preliminary Evaluation of LLM-Based Fault Localization. The paper was published on arXiv.org in August 2023. The authors are from KAIST in South Korea.
It had to happen. LLMs are being applied everywhere, so why not to fault localization? More seriously, there is an intriguing spin in this paper, enabled by the LLM approach: explainability. Not only does this paper produce a root cause; it also explains why it chose that root cause. For me this could add a real jump in success rates for localization, not because the top candidate will necessarily be more accurate, but because a human verifier (or designer) can judge whether the explanation is worth following further. If an explanation came with each of the top 3 or 5 candidates, perhaps augmented by spectrum-based localization scores, intuitively that could increase localization accuracy significantly.
Paul’s view
A very timely blog this month: using LLMs to root-cause bugs. No question we’re going to see a lot more innovation published here over the next few years!
In 2022 we reviewed DeepFL, which used an RNN to rank methods based on suspiciousness features (complexity, mutation, spectrum, text). In 2023 we reviewed TRANSFER-FL, which used another RNN to improve ranking by pre-classifying bugs into one of 15 types, based on training across a much larger dataset of bugs from GitHub.
This paper implements the entire “fault localization” problem using prompt engineering on OpenAI’s GPT-3.5. Two cutting-edge LLM techniques are leveraged: chain-of-thought prompting and function calling. The former is where the question to the LLM includes an example not only of how to answer the question but also of the thought process the LLM should follow to reach the answer. The latter is where the LLM can ask for additional information automatically by calling user-provided functions.
The authors’ LLM prompt includes the error message and a few relevant lines of source code referenced by the error message. The LLM is given functions that enable it to query whether the failing test covered a particular method and to retrieve the source code or comments for a method.
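To make the mechanics concrete, here is a minimal sketch of such a chain-of-thought plus function-calling loop using the OpenAI Python client. The tool names, schemas, prompt wording, and the toy failing test are illustrative assumptions, not the authors’ exact setup.

```python
# Sketch of chain-of-thought prompting plus function calling for fault
# localization. Tool names, schemas, and wording are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

# Tools the model may call to navigate the code base (hypothetical signatures).
TOOLS = [
    {"type": "function", "function": {
        "name": "get_method_covered",
        "description": "Return whether the failing test covers the given method.",
        "parameters": {"type": "object",
                       "properties": {"method": {"type": "string"}},
                       "required": ["method"]}}},
    {"type": "function", "function": {
        "name": "get_code_snippet",
        "description": "Return the source code of the given method.",
        "parameters": {"type": "object",
                       "properties": {"method": {"type": "string"}},
                       "required": ["method"]}}},
]

def run_tool(name, args):
    # Dispatch to local helpers; real versions would read the failing test's
    # coverage report and the project's source tree.
    handlers = {"get_method_covered": lambda a: {"covered": True},
                "get_code_snippet": lambda a: {"code": "int year = Integer.parseInt(parts[2]);"}}
    return handlers[name](args)

messages = [
    {"role": "system",
     "content": "You are a debugging assistant. Reason step by step about how "
                "the failure could have happened before naming a culprit method."},
    {"role": "user",
     "content": "Failing test: testParseDate\n"
                "Error: NumberFormatException at DateParser.parse(DateParser.java:42)\n"
                "Relevant snippet:\nint year = Integer.parseInt(parts[2]);"},
]

# Let the model alternate between reasoning and tool calls until it answers.
while True:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)  # step-by-step explanation of the likely root cause
        break
    messages.append(msg)
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
```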
As is typical for fault localization papers, results are benchmarked on Defects4J, an open-source database of Java code bugs. Somewhat amazingly, despite no pre-training on the code being debugged and no prior history of passing and failing test results, the buggy method is ranked in the top 5 by the LLM in 55% of the cases benchmarked! This compares to 77% for DeepFL, but DeepFL required extensive pre-training using Defects4J data (i.e. leave-one-out cross-validation). TRANSFER-FL is hard to compare since it is a more precise ranker (statement-level rather than method-level). Most likely, a combination of LLM-based and non-LLM-based methods will be the long-term optimal approach here.
Raúl’s view
This paper is the first to use LLMs for fault localization (FL) and was published in August 2023. A search for “Use of LLM in Fault Localization” reveals another paper from CMU, published in April 2024, but it employs a different methodology.
The main idea in this paper is to overcome the LLM’s token limit (32,000 tokens for ChatGPT in this case), which is insufficient if the prompt were to include, for example, 96,000 lines of code. Instead, to navigate the source code, the LLM can call functions, in particular (the names are self-explanatory) get_class_covered, get_method_covered, get_code_snippet and get_comments.
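As a rough illustration only (not the paper’s implementation), these navigation functions could be thin wrappers over the failing test’s coverage report and an index of the project’s source; the argument shapes below are assumptions.

```python
# Illustrative stubs for the navigation functions listed above. In the
# tool-calling loop they would be bound to the project under test
# (e.g. via functools.partial) so the LLM only supplies class/method names.

def get_class_covered(coverage: dict) -> list[str]:
    """Classes executed by the failing test."""
    return sorted(coverage.keys())

def get_method_covered(coverage: dict, class_name: str) -> list[str]:
    """Methods of the given class executed by the failing test."""
    return coverage.get(class_name, [])

def get_code_snippet(source_index: dict, method_signature: str) -> str:
    """Source code of the requested method."""
    return source_index[method_signature]["code"]

def get_comments(source_index: dict, method_signature: str) -> str:
    """JavaDoc / comments attached to the requested method."""
    return source_index[method_signature].get("comments", "")
```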
The actual technique, called AutoFL, requires only a single failing test. It works by first prompting the LLM to provide a step-by-step explanation of how the bug occurred, with some prompt engineering required (Listing 1). The LLM explores the code through the functions and produces an explanation. Using this, AutoFL then prompts ChatGPT to find the fault location (Listing 3), assuming the LLM has implicitly identified it in the previous phase. To improve the technique, the authors restrict the number of function calls to 9 and run the whole process 5 times, using all 5 results to rank the possible locations.
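Putting the pieces together, the overall flow might look roughly like the sketch below. The helper names are hypothetical, the two stage functions stand in for LLM calls not shown, and the simple vote count is a simplification of the paper’s scoring.

```python
# Simplified sketch of the two-stage, repeated-run process: each run has a
# small function-call budget, and the answers from all runs are merged into
# a ranked list of suspicious methods.
from collections import Counter

MAX_FUNCTION_CALLS = 9   # per-run budget, as described above
NUM_RUNS = 5             # independent runs whose answers are aggregated

def explain_failure(test_failure, max_calls):
    """Stage 1: ask the LLM for a step-by-step explanation of the bug,
    letting it call the navigation functions at most `max_calls` times
    (stands in for the function-calling loop sketched earlier)."""
    ...

def ask_fault_location(explanation):
    """Stage 2: given the explanation, ask the LLM which method is at fault."""
    ...

def localize(test_failure):
    votes = Counter()
    for _ in range(NUM_RUNS):
        explanation = explain_failure(test_failure, MAX_FUNCTION_CALLS)
        method = ask_fault_location(explanation)
        if method:
            votes[method] += 1
    # Methods named most often across runs are ranked as most suspicious.
    return [m for m, _ in votes.most_common()]
```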
The paper compares AutoFL with seven other methods (reference [41]) on a benchmark of 353 cases. AutoFL finds the right bug location more often than the next best (spectrum-based FL) when using one suggestion: 149 vs. 125. But it does worse when using 3 or 5 suggestions: 180 vs. 195 and 194 vs. 218. The authors also note that 1) AutoFL needs to call the functions to explore the code, otherwise the results get much worse; 2) more than 5 runs improves the results further; and 3) “One possible threat to validity is that the Defects4J bug data was part of the LLM training data by OpenAI”.
The approach is experimental but described in sufficient detail to be replicated and enhanced. The method is simple to apply and use for debugging. The main idea of letting the LLM explore the code with a few basic functions seems to work well.