This blog is the next in a series in which Paul Cunningham (GM of the Verification Group at Cadence), Jim Hogan and I pick a paper on a novel idea we appreciated and suggest opportunities to further build on that idea.
We’re getting a lot of hits on these blogs but would really like to get feedback too.
Our next pick is Metamorphic Relations for Detection of Performance Anomalies. The paper was presented at the 2019 IEEE/ACM International Workshop on Metamorphic Testing. The authors are from Adobe and, in Australia, the University of Wollongong and Swinburne University of Technology.
Metamorphic testing (MT) is a broad principle for getting around the oracle problem – not having a golden reference against which to check correctness. Instead, it checks relationships expected to hold between related tests – perhaps between distributions of runtimes, or between two software runs before and after a code change, among many other examples.
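As a minimal illustration of the principle (my example, not from the paper), consider testing `math.sin` without an oracle: we never check any single value against a reference, only that the identity sin(x) = sin(π − x) holds between related calls:

```python
import math
import random

# Metamorphic relation for sin: sin(x) == sin(pi - x).
# No golden reference needed -- we never ask what the "true" value of
# sin(x) is, only that related calls agree with each other.
random.seed(0)
for _ in range(1000):
    x = random.uniform(-10.0, 10.0)
    assert abs(math.sin(x) - math.sin(math.pi - x)) < 1e-9

print("relation held for 1000 random inputs")
```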
The authors tested load times for an Adobe tag manager. Since multiple factors influence load time, they expected a distribution rather than a single value. The metamorphic relation they chose was that load times with tagging should be shifted (by the tag-support overhead) relative to load times without tagging, but that the distributions should otherwise be similar.
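A toy version of this check – my sketch with simulated load times, not the paper's real measurements – estimates the overhead as the difference in means, shifts the tagged distribution back, and compares spreads:

```python
import random
import statistics

random.seed(42)

# Simulated page-load times in ms (hypothetical numbers, not the paper's
# data): tagged runs add a roughly constant overhead on top of the base load.
base_loads = [random.gauss(200, 15) for _ in range(500)]
tagged_loads = [t + random.gauss(40, 5) for t in base_loads]

# Metamorphic relation: after subtracting the estimated overhead,
# the tagged distribution should look like the base distribution.
overhead = statistics.mean(tagged_loads) - statistics.mean(base_loads)
shifted = [t - overhead for t in tagged_loads]

sd_base = statistics.stdev(base_loads)
sd_shifted = statistics.stdev(shifted)

# Crude similarity check on spread alone; a real harness would compare the
# full distributions (e.g. with a two-sample test), not just one moment.
assert abs(sd_base - sd_shifted) / sd_base < 0.25, "metamorphic relation violated"
print(f"estimated overhead: {overhead:.1f} ms")
```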
The relationship held in most cases, except one in which the managed distribution became bimodal. They traced this to a race condition between different elements of the code: depending on execution order, a certain function would or would not run, producing the bimodal distribution. This was a bug; the function should have run in either case. Once fixed, the distribution became unimodal again. The authors also describe how they automated this testing.
I like this. I see it as a way to do statistical anomaly-based QA: you compare a lot of runs, looking at distributions to spot bugs. I see a lot of applications: anything performance-related, heuristic-based, or machine-learning-based will be naturally statistical. Distribution analyses can then reveal more complex issues than pass/fail analyses, and MT gives us tools to find those kinds of problems.
For functional verification, this is a new class of coverage we can plan and track alongside traditional static and dynamic coverage metrics. I’m excited by the idea that a whole new family of chip verification tools could be envisioned around MT, and I welcome any startups in this space who want to reach out to me.
The main contribution of this paper starts from the observation that, given some performance metric with random noise, you’re going to have a distribution. Mu and sigma alone don’t fully characterize that distribution. If it’s multimodal, maybe there’s a race? Now I’m looking at distribution modality to detect things like race conditions. That’s great, and it got me thinking about how we might use this in our QA.
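A crude way to go beyond mu and sigma is to histogram the samples and count the substantial peaks. This sketch is my own, not the paper's method; it flags a set of runs as suspicious when more than one sizeable mode appears:

```python
def count_modes(counts, min_frac=0.05):
    """Count substantial peaks in a histogram of (say) load-time samples.

    counts   -- per-bin sample counts from equal-width binning
    min_frac -- ignore peaks holding less than this fraction of all
                samples, so small noise bumps don't count as modes
    """
    total = sum(counts)
    padded = [0] + list(counts) + [0]
    return sum(
        1
        for i in range(1, len(counts) + 1)
        if padded[i] > padded[i - 1]
        and padded[i] > padded[i + 1]
        and padded[i] >= min_frac * total
    )

# Hand-written histograms standing in for two sets of runs: a healthy
# unimodal distribution, and one split by a race into two clusters.
healthy = [2, 10, 40, 90, 40, 10, 2]
racy = [5, 60, 15, 4, 18, 70, 6]
print(count_modes(healthy), count_modes(racy))  # → 1 2
```

In practice the binning itself needs care (bin width, smoothing); a production check might use a formal test of unimodality instead of peak counting.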
They discuss mechanics for automatically detecting bimodality, then raise another possibility – using machine learning to check for changes between distributions. A mathematical characterization may not be as general as training a neural network to detect anomalies between different sets of runs – similar to what credit card companies do in analyzing your spending patterns: if an anomaly is detected, maybe you’ve been hacked.
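Before reaching for a neural network, a classical two-sample statistic already gives an automatable answer to "did the distribution change?". Here is a pure-Python sketch (my baseline for illustration, not what the paper proposes) of the two-sample Kolmogorov–Smirnov distance between two sets of runs:

```python
def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past all values tied at x in both samples.
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

baseline = [i / 10 for i in range(100)]       # runs from the old build
regressed = [i / 10 + 5 for i in range(100)]  # same runs shifted by 5 units

print(ks_distance(baseline, baseline))  # → 0.0
print(round(ks_distance(baseline, regressed), 3))  # → 0.5
```

A large distance between a reference set of runs and a new set would flag the kind of distribution change an ML detector might also learn to spot.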
MT could find problems sooner and at finer levels than traditional software testing. The latter will find obvious memory leaks or race conditions, but MT plus statistical analysis may probe more sensitively for problems that might otherwise be missed.
Finally, the authors discuss outliers in the distribution, suggesting these should also remain similar between distributions. I’m excited to see how they develop this further – how they might detect differences in outliers and what bugs those differences might uncover.
Generally, I see significant opportunity in exploiting these ideas.
This is the first of the papers we’ve looked at in this series that, to me, is more than just a feature. This paper would definitely be worth putting money behind, trying to get to production. It looks like a product, perhaps a new class of verification tool. It might even work as a startup.
It reminds me of Solido and SPICE. We used similar techniques to get beyond the regular statistical distributions – they were at six sigma already, very hard to improve on. They had to start doing stuff like this to go further. I heard “no one’s going to buy more, they already have SPICE.” Well, they did buy a lot more. There is appetite out there for innovation of this kind.
I’m also very interested in the security potential, especially for the DoD. Another worthy investment area.
As Paul says, MT is a rich vein, too rich to address in one blog. I’ll add one thought I found in this paper. We invest huge amounts of time and money in testing. For passing tests, the only value we get is that they didn’t fail. Can we extract more? Maybe we can through MT.
To see the previous paper click HERE.