Technology in and around the LHC can sometimes be a useful exemplar for how technologies may evolve in the more mundane world of IoT devices, clouds and intelligent systems. I wrote recently on how LHC teams manage Big Data; here I want to look at how they use machine learning to study and reduce that data.
The reason high-energy physics needs this kind of help is to manage the signal-to-noise problem. Of O(10^12) events/hour only ~300 produce Higgs bosons. Real-time pre-filtering reduces this torrent of data to O(10^6) events/hour, but that’s still a very high noise level for a 300-event signal. Despite this, the existence of the Higgs has been confirmed with a significance of 5σ, but the physics doesn’t end there. Now we want to study the properties of the particle (there are actually multiple types), but the signal-to-noise problems appeared so daunting that CERN launched a challenge in 2014 to propose machine-learning methods to further reduce candidate interactions.
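To put those rates in perspective, here is a back-of-the-envelope sketch in Python. The three rates are the order-of-magnitude figures quoted above; the assumption that the pre-filter keeps every Higgs event is mine, made only to keep the arithmetic simple.

```python
# Order-of-magnitude event rates quoted in the text (per hour).
raw_events = 1e12        # before real-time pre-filtering
filtered_events = 1e6    # after real-time pre-filtering
higgs_events = 300       # events that actually produce a Higgs boson


def signal_fraction(signal, total):
    """Fraction of all events that are signal."""
    return signal / total


# Assumption (not from the text): the pre-filter loses no Higgs events.
rejection_factor = raw_events / filtered_events
frac_before = signal_fraction(higgs_events, raw_events)
frac_after = signal_fraction(higgs_events, filtered_events)

print(f"pre-filter rejection factor: {rejection_factor:.0e}")
print(f"signal fraction before filtering: {frac_before:.1e}")
print(f"signal fraction after filtering:  {frac_after:.1e}")
```

Even after a million-fold reduction, only about 3 events in 10,000 are signal, which is why a further classification step is attractive.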
The tricky part here is that you don’t want to rush to publish your solution to quantum gravitation or dark matter only to find a systematic error in the machine learning-based data analysis. So standards for accuracy and lack of bias/systematic errors are very high, suggesting that the LHC may also be beating a path for the rest of us in machine learning.
The CERN machine-learning challenge required no understanding of high-energy physics. The winning method, provided by Gabor Melis, used an ensemble of neural nets. There’s a lot of detail to the method but one topic is especially interesting – the careful methods and intensive effort put into avoiding over-fitting data (aka false positives). I recently commented on a potential weakness in neural net methods. If you train to see X, you will have a bias to see X, even in random data. So how do you minimize that bias?
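The bias is easy to reproduce in a toy setting. The sketch below is not the winning method or anything used at CERN; it is a deliberately naive "classifier" (pick whichever single feature best matches the labels) trained on pure noise. All names and sample sizes are hypothetical choices for illustration.

```python
import random


def noise_dataset(rng, n_events, n_features):
    """Gaussian features and coin-flip labels: there is nothing to learn."""
    features = [[rng.gauss(0, 1) for _ in range(n_features)]
                for _ in range(n_events)]
    labels = [rng.randint(0, 1) for _ in range(n_events)]
    return features, labels


def accuracy(index, polarity, features, labels):
    """Accuracy of the rule 'predict 1 when polarity * feature[index] > 0'."""
    hits = sum((polarity * row[index] > 0) == bool(y)
               for row, y in zip(features, labels))
    return hits / len(labels)


def best_single_feature_rule(features, labels):
    """Search every feature and both polarities for the best training fit."""
    best = (0, 1, 0.0)  # (feature index, polarity, training accuracy)
    for j in range(len(features[0])):
        for polarity in (1, -1):
            acc = accuracy(j, polarity, features, labels)
            if acc > best[2]:
                best = (j, polarity, acc)
    return best


rng = random.Random(0)
train_x, train_y = noise_dataset(rng, 40, 200)    # few events, many features
test_x, test_y = noise_dataset(rng, 2000, 200)    # independent noise sample

index, polarity, train_acc = best_single_feature_rule(train_x, train_y)
test_acc = accuracy(index, polarity, test_x, test_y)

print(f"training accuracy on noise: {train_acc:.2f}")
print(f"held-out accuracy on noise: {test_acc:.2f}")
```

With many features and few events, some feature will line up with the labels by chance, so the training score looks like a discovery. On an independent sample the same rule falls back to coin-flip accuracy, which is why held-out validation matters so much.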
The method used both to generate training data and to test significance of “discoveries” in that data is Monte Carlo simulation, a technique which has been in use for many decades in high-energy physics (my starting point many years ago). The simulation models not only event dynamics but also detector efficiency. Out of this come many-dimensional representations of each event which form the input to training for each of the challenge participants’ methods. Since the data is simulated, it is easy to inject events of special interactions with any desired probability to test metrics for classification.
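As a rough illustration of the idea, here is a toy Monte Carlo generator in Python. It is nothing like the real simulation chain: the mass distributions, the resolution, the 90% efficiency and the function name `simulate_events` are all invented for this sketch. It only shows the three ingredients described above: event generation, a detector model, and signal injected at a chosen probability.

```python
import random


def simulate_events(n_events, signal_prob, efficiency=0.9, seed=0):
    """Return (measured_mass, is_signal) pairs from a toy simulation.

    All distributions and parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    events = []
    for _ in range(n_events):
        # Inject signal with the requested probability.
        is_signal = rng.random() < signal_prob
        if is_signal:
            true_mass = rng.gauss(125.0, 2.0)            # toy resonance peak
        else:
            true_mass = 100.0 + rng.expovariate(1 / 50)  # toy falling background
        # Toy detector model: some events are simply not recorded...
        if rng.random() > efficiency:
            continue
        # ...and recorded masses are smeared by finite resolution.
        measured_mass = rng.gauss(true_mass, 1.5)
        events.append((measured_mass, int(is_signal)))
    return events


events = simulate_events(20000, signal_prob=0.1)
signal_share = sum(label for _, label in events) / len(events)
recorded_share = len(events) / 20000
print(f"recorded {len(events)} events, {signal_share:.1%} signal")
```

Because every simulated event carries its true label, a classifier's output can be scored exactly, something that is impossible with real collision data.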
Deep neural nets and boosted tree algorithms dominated successful entries. The challenge was also important in enabling cross-validation and comparison between techniques. To ensure objectivity between entries, statistical likelihood measures were defined by CERN and used to grade the solutions from each competitor. The competition together with these measures is a large part of how CERN was able to have confidence in minimized bias in the algorithms. But they also commented that the statistical metrics used are still very much a work in progress.
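The measures are not spelled out here. To the best of my knowledge the headline score for the 2014 challenge was the approximate median significance (AMS), computed from the weighted counts of true signal `s` and background `b` in the events a classifier selects, with a regularization constant of 10. Treat the sketch below as my reading of that metric rather than CERN's reference implementation.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance for selected signal s and background b.

    AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))
    b_reg damps fluctuations when very little background is selected.
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    if b + b_reg <= 0:
        raise ValueError("b + b_reg must be positive")
    inner = (s + b + b_reg) * math.log1p(s / (b + b_reg)) - s
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(2.0 * max(inner, 0.0))


# Selecting more true signal at fixed background raises the score.
print(ams(10, 100), ams(20, 100))
```

When `s` is much smaller than `b`, this reduces to the familiar `s / sqrt(b)`, so the score rewards a selection for keeping signal while rejecting background.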
I should also stress that these methods are not yet being used to detect particles. They are only being used to reduce the data set, based on classification, to a set that can be analyzed using more traditional methods. And in practice a wide variety of techniques are being used on the ATLAS and CMS experiments (two of the detectors at the LHC), including neural nets and boosted decision trees, plus pattern recognition on events, energy and momentum regressions, individual component identification in events and others.
And yet even with all this care, machine learning methods are not out of the woods yet. One of the event types of interest is the decay of a Higgs boson to 2 photons, a so-called di-photon event. The existence of the Higgs is in no doubt, but recent di-photon searches in a different mass range found (with 3σ significance) an apparent resonance at 750 GeV, which might have heralded a major new physics discovery.
But subsequent experiments this year made it much less likely that a new particle had been detected. Whether the initial false detection points back to weaknesses in the machine learning algorithms or to human error, this should serve as a reminder that when you’re trying to see very weak signals in significant background, eliminating systematic errors is very, very hard. I think it also points to the power of multiple independent viewpoints or, if you like, the power of the crowd. This underpins a core strength of the scientific method: independent and repeatable validation.