Can we order regression tests for continuous integration (CI) flows, minimizing time between code commits and feedback on failures? Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration. The paper published in the 2017 International Symposium on Software Testing and Analysis, with 96 citations to date. The authors are from the Simula Research Lab and the University of Stavanger, both in Norway.
Efficiently ordering tests in a regression suite can meaningfully impact CI cycle times. The method reduces run-times further by truncating sequences for reasonably ordered tests. This is a natural application for learning, but the investment cannot outweigh the time saved, either in training or in runtime. The authors contend that their adaptive approach through reinforcement learning is an ideal compromise. Training is on the fly, requires no prior knowledge/model and surpasses other methods within 60 days of use.
Ranking is very simple on a binary pass/fail per test, run duration and historical data of the same type, accumulated through successive CI passes. The method applies this information to define different types of reward, driving prioritization through either tableau or neural net models. The paper presents several comparisons to judge effectiveness against multiple factors.
This was a great choice of paper – another example of a topic that is widely discussed in the software design community but which lacks a similar level of attention in the hardware design community. For a given set of RTL code check-ins what tests are best to run and in what priority order?
The authors have structured the paper very well and is an easy read. It outlines a method to train a neural network to decide which tests to run and in which priority. The training uses only test pass/fail data from previous RTL code check-ins. It does not look at coverage or even what RTL code has been changed at each check-in. The authors’ method is therefore very lightweight and fast but somewhat primitive. They compare the performance of their neural network to a table-lookup based “tableau” ranking method and some basic sorting/weighting methods which essentially just prioritize tests that have historically failed the most often. The neural network does better, but not by much. I would be really interested to see what happens if some simple diff data on the RTL code check-ins was included in their model.
By the way, if you are interested in test case prioritization, the related work section in this paper contains a wonderful executive summary of other works on the topic. I’m having some fun gradually reading through them all.
This is a relatively short, self-contained paper which is a delight to read. It further connects us to the world of testing software using ML, something we already explored in our May blog (fault localization based on deep learning). The problem it tackles is test case selection and prioritization in Continuous Integration (CI) software development. The goal is to select and prioritize tests which are likely to fail and expose bugs, and to minimize the time it takes to run these tests. Context: the kind of SW development they are targeting uses hundreds to thousands of test cases which yield tens of thousands to millions of “verdicts” (a passing or failing of a piece of code); the number of CI cycles considered is about 300, that is a year if integration happens daily as in two of their examples, in one case it represents 16 days of hourly integration.
The method used, RETECS (reinforced test case selection) is reinforcement learning (RL). In RL, “an agent interacts with its environment by perceiving its state (previous tests and outcomes) and selecting an appropriate action (return test for current CI), either from a learned policy or by random exploration of possible actions. As a result, the agent receives feedback in terms of rewards, which rate the performance of its previous action”. They explore a tableau and an artificial neural network (ANN) implementation of the agent, and consider 3 reward functions. These are overall failure count, individual test case failures and ranked test cases (the order in which analysis executes test cases; failing test cases should execute early).
The analysis applies this to three industrial datasets, yielding 18 result tables. They measure results through a “normalized average percentage of faults detected” (NAPFD). They conclude that tableau with ranked test cases, and ANN with individual test case failures are “suitable combinations”. A second comparison with existing methods (sorting, weighting and random), shows that RETECS compares well after approximately 60 integration cycles.
The results don’t seem that impressive. For one of the datasets (GSDTSR) there is no improvement, perhaps even a slight degradation of results as RETECS learns. The comparison with existing methods only yields substantial improvements in one out of 9 cases. However, the method is lightweight, model-free, language-agnostic and requires no source code access. A “promising path for future research”, it would be interesting to see this applied to agile hardware design. All this in a well explained, self-contained, nice to read paper.
I confess I like this paper for the idea, despite the weak results. Perhaps with some small extensions in input to the reward function, the method could show more conclusive results.Share this post via: