A little thinking outside the box this time. Microsoft is adding automation to its (and LinkedIn’s) code reviews; maybe we should consider this option too? Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Automating Code Review Activities by Large-Scale Pre-training. The paper was published at the 2022 European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). The authors are from Microsoft, LinkedIn, and Peking University.
This paper is interesting on two counts: first, it is a method to automate code change review, and second, it uses a transformer model, which is well suited to text analysis. A CodeReviewer model based on work by the same authors is available on Hugging Face. Training is based on (past) real-world code change fragments, together with reviewer comments where available.
Changes are measured first on quality as judged by reviewer comments. Changes without comments are judged to be minor and of sufficient quality. Changes with comments suggest suspect quality. In training, comments are interpreted through natural language processing, looking for common patterns which can then be used to suggest comments on new code changes. Finally, they combine this learning with observed changes from the training set to suggest potential code changes that would satisfy review comments.
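The labeling rule described above (no reviewer comment means acceptable quality, any comment means suspect quality) can be sketched in a few lines. This is only an illustration: the `CodeChange` record and the `needs_review` helper are hypothetical names, not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeChange:
    """One diff hunk plus any reviewer comments attached to it (hypothetical record)."""
    diff: str
    comments: List[str] = field(default_factory=list)


def needs_review(change: CodeChange) -> bool:
    """Label rule from the text: any non-empty comment marks the change as suspect."""
    return any(comment.strip() for comment in change.comments)


# Made-up examples: one change with no comments, one with a reviewer comment.
changes = [
    CodeChange(diff="+ return x + 1"),
    CodeChange(diff="- i = 0\n+ i = 1", comments=["Why start at 1 here?"]),
]
labels = [needs_review(change) for change in changes]
```

Here `labels` marks only the second change as needing attention, which is the kind of binary target a quality-estimation model would be trained on.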
Our first blog on generative AI in verification and wow does it pack some punch! A global team of authors from Microsoft/LinkedIn and a few universities in China look at automatically generating code reviews. The paper was published late last year and describes a generative AI model called CodeReviewer that is based on a Transformer Large Language Model of similar complexity to OpenAI’s GPT-1.
Like any AI system, good training data is vital, and quite a bit of the paper is devoted to how the authors mine GitHub to create an impressive dataset covering 9 different programming languages and over 7.9M code review tickets.
I still find the whole process of training a Transformer super cool: you basically teach it different skills to build up to the desired generative capability. The paper eloquently walks us through the training steps used for CodeReviewer, teaching it first to understand the “+” and “-” line prefix syntax for source code change diffs, then to “understand” code changes, then to “speak” the English language used to write a code review, and then finally to do the actual job of generating a code review in plain English from a code diff.
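To make the “+” and “-” prefix syntax concrete, here is a minimal sketch that turns each diff line into a (tag, code) pair, roughly the kind of input a model must learn to read. The tag strings and the `tag_diff_lines` helper are illustrative assumptions, not the authors’ actual preprocessing, and file headers and hunk headers (`+++`, `---`, `@@`) are deliberately not handled.

```python
from typing import List, Tuple

# Illustrative tag names (assumption); a real tokenizer would define its own special tokens.
ADD, DEL, KEEP = "[ADD]", "[DEL]", "[KEEP]"


def tag_diff_lines(diff: str) -> List[Tuple[str, str]]:
    """Map each diff body line to (tag, code) based on its one-character prefix."""
    tagged = []
    for line in diff.splitlines():
        if line.startswith("+"):
            tagged.append((ADD, line[1:]))
        elif line.startswith("-"):
            tagged.append((DEL, line[1:]))
        else:
            # Context lines in a unified diff start with a single space.
            tagged.append((KEEP, line[1:] if line.startswith(" ") else line))
    return tagged


# Made-up diff hunk body for illustration.
diff = "\n".join([
    " def area(r):",
    "-    return 3.14 * r * r",
    "+    return math.pi * r * r",
])
tagged = tag_diff_lines(diff)
for tag, code in tagged:
    print(tag, code)
```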
To benchmark CodeReviewer the authors split their dataset into two buckets: projects with 2.5k or more code reviews are used as training data and the remaining projects for benchmarking. Results are rock solid: about 8 percentage points more accurate (72-74% vs. 64-66%) than the best of prior works at determining if a code change is good quality (meaning no review comments needed, it can be committed as is). For code review benchmarking the authors ask 6 expert programmers to personally inspect 100 randomly selected reviews and score them 1-5 for both relevance and informativeness. The average score for CodeReviewer is 3.2 compared to 2.5 for the best of prior works. Nice. And for a bit of fun the authors also do some qualitative comparisons of CodeReviewer with GitHub Copilot, showing a few examples where CodeReviewer generates much better reviews than Copilot.
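Splitting by project rather than by individual review keeps code from one repository out of both buckets. A minimal sketch of such a split, assuming a simple mapping from project name to review count (the project names and counts below are invented, and `split_projects` is a hypothetical helper):

```python
from typing import Dict, List, Tuple


def split_projects(review_counts: Dict[str, int],
                   threshold: int = 2500) -> Tuple[List[str], List[str]]:
    """Put projects with at least `threshold` reviews in training, the rest in benchmarking."""
    train = sorted(name for name, n in review_counts.items() if n >= threshold)
    benchmark = sorted(name for name, n in review_counts.items() if n < threshold)
    return train, benchmark


# Invented project names and counts, for illustration only.
review_counts = {"proj-a": 4100, "proj-b": 1800, "proj-c": 2500, "proj-d": 1600}
train_projects, benchmark_projects = split_projects(review_counts)
```

With these made-up counts, `proj-a` and `proj-c` land in training and `proj-b` and `proj-d` are held out.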
Wonderful paper, well written and easy to read. Expect more from us on generative AI in future blogs – it’s going to transform (no pun intended) verification as well as so many other things in our daily lives!
The code review process as modeled in this paper consists of proposing a code change (the code diff) to an original code C0, resulting in a code C1, and then (1) estimating the quality of the code change, (2) generating a review comment R_nl in natural language, and finally (3) code refinement, in which a new version of the code is generated taking C1 and R_nl as inputs. The authors construct a model called CodeReviewer for tasks 1, 2 and 3: an encoder-decoder Transformer with 12 encoder layers and 12 decoder layers, 12 attention heads per layer and a hidden size of 768. The total parameter size of the model is 223M. The paper goes into great detail on how to get the data to pre-train and fine-tune the model. The dataset is collected from GitHub, and the pre-training set consists of 1,161 projects with a total of 7,933,000 pull requests.
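As a sanity check on the 223M figure, the parameter count of an encoder-decoder Transformer with these dimensions can be estimated on the back of an envelope. The feed-forward width (3072) and vocabulary size (32,100) below are assumptions borrowed from typical T5-base-style configurations, not numbers stated above, and biases, layer norms and position parameters are ignored.

```python
def estimate_params(d_model: int = 768, n_enc: int = 12, n_dec: int = 12,
                    d_ff: int = 3072, vocab: int = 32_100) -> int:
    """Rough weight count for an encoder-decoder Transformer (matrices only)."""
    attn = 4 * d_model * d_model      # Q, K, V and output projections
    ffn = 2 * d_model * d_ff          # two feed-forward matrices (d_ff is an assumption)
    enc_layer = attn + ffn            # self-attention + feed-forward
    dec_layer = 2 * attn + ffn        # self-attention + cross-attention + feed-forward
    embeddings = vocab * d_model      # one shared embedding table (vocab is an assumption)
    return n_enc * enc_layer + n_dec * dec_layer + embeddings


total = estimate_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at about 222.8M, close to the reported 223M, which suggests the model sits in the same size class as other base-sized encoder-decoder models.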
Results are compared with three baselines: a state-of-the-art (SOTA) Transformer architecture trained from scratch and two pre-trained models, T5 for code review and CodeT5. Table 4 shows that CodeReviewer is superior to all 3 networks for quality estimation (1) in terms of precision (true positive / (true positive + false positive)), recall (true positive / (true positive + false negative)), F1 (harmonic mean of precision and recall) and accuracy ((true positive + true negative) / total). Performance on review generation (2) is also better in terms of BLEU scores (bilingual evaluation understudy, which evaluates the quality of machine translation on a scale of 0-100) and human evaluations. The BLEU score is still lower than 10, indicating it is a hard task. In terms of code refinement (3), CodeReviewer generates repaired code that exactly matches the ground truth in more than 30% of cases, which is twice the result of T5 and about 25% more than CodeT5 in relative terms. Interestingly, Table 8 gives results for the influence of the multilingual dataset, showing that for Java, C# and Ruby training with all languages improves the accuracy by 2.32% and the F1 score by 1.10% on average.
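For readers who want the quality-estimation metrics spelled out, here is a small sketch computing them from confusion-matrix counts, with F1 taken as the harmonic mean of precision and recall. The counts in the example are invented for illustration and are not results from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy from true/false positive/negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Invented counts for 200 code changes; not taken from the paper.
metrics = classification_metrics(tp=70, fp=30, tn=75, fn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```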
The presented results are better than the state of the art. They hinge on collecting and organizing a large-scale dataset from GitHub. Unfortunately, to my knowledge, there are no comparable data collections for hardware designs written in Verilog, VHDL, SystemC, etc., so it is an open question whether CodeReviewer can be used for hardware design. Perhaps closer to home, whether a code review of EDA SW would yield results similar to the ones reported, given that CodeReviewer was trained so carefully with different kinds of SW, is an interesting question which EDA companies can try to answer. Given that “multilingual dataset benefits the CodeReviewer for understanding specific languages significantly… It also proves the broad applicability of CodeReviewer in different programming languages”, there is reason to speculate about broad applicability to different kinds of SW.