Big Data Lessons from the LHC

by Bernard Murphy on 07-20-2016 at 7:00 am

Big Data techniques have become important in many domains, not just to drive marketing strategies but also for semiconductor design, as evidenced by Ansys’ recent announcements around their use of Big Data analytics. And they should become even more important in the brave new world of the IoT. So it makes sense to look at an organization that is managing bigger data than anyone else in order to understand approaches we may need as we scale.

Before we think about measurement data, consider that CERN (the organization that hosts the LHC) uses Big Data analytics for control of the accelerator and instrumentation, independent of data gathering. Why? Because running an accelerator of this class is very complicated. You are accelerating charged particles to very close to the speed of light around a very-high-vacuum tube 27km in circumference, which takes many vacuum pumps, many cryogenic systems, many power controls and many sensors. And that’s just the main accelerator. Add to that the ion source that feeds the accelerator and control for multiple complex detectors, and you have a system more complex than any other I can imagine.

Managing all of that is first a giant sensor / actuator / feedback problem (like using IoT devices for maintenance on a massive scale) and second a Big Data problem, because the data gathered from those systems is necessarily massive (in one example, the cryo data alone runs to a billion records). Complexity is high enough that the system as a whole is in a fault state 37% of the available time. CERN decided that preventive maintenance is not enough to get maximum value out of the LHC, and since they want to plan for the next generation, which will be even bigger and more complex, they have worked with multiple partners to build a Big Data analytics system to better forecast potential problems before they happen.
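To make the predictive-analytics idea concrete, here is a minimal sketch of one common approach: flagging sensor readings that drift outside a rolling baseline before a hard fault occurs. The sensor values, window size and threshold below are all made up for illustration; nothing here reflects CERN's actual tooling.

```python
import statistics

def rolling_anomalies(readings, window=5, sigma=3.0):
    """Return indices of readings that deviate more than `sigma`
    standard deviations from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev > 0 and abs(readings[i] - mean) > sigma * stdev:
            flagged.append(i)
    return flagged

# Hypothetical cryogenic temperature trace with one developing excursion.
trace = [4.2, 4.21, 4.19, 4.2, 4.22, 4.21, 4.2, 5.1, 4.2, 4.21]
print(rolling_anomalies(trace))  # [7] — the spike at index 7 is flagged
```

A real system would of course use far richer models (trend forecasting, correlations across sensors), but the shape is the same: a continuous stream of measurements compared against an expected envelope, with alerts raised before a component actually fails.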

This is where IoT for maintenance is already moving – not just knowing when something is broken or scheduled for repair, but being able to do predictive analytics. Perhaps there will be synergies between the work being done at CERN and in other enterprises. Hopefully Oracle (which plays a big role in the CERN control systems) can exploit some of these synergies.

The control aspect is critically important, but when most of us think about Big Data and the LHC, we’re probably thinking about managing measurement data – the information that leads to new physics. The largest detector (ATLAS, pictured above) generates ~1 petabyte per second of data, far beyond levels you could consider storing. And the vast majority of that data is uninteresting anyway, because it contains only known collision events and the goal is to find new physics.

Filtering has to reduce an O(10⁹) event rate to O(10²) with a low probability of rejecting interesting events, which they accomplish using a series of specialized and massively pipelined triggers (traditional compute would be far too slow for the first stage of triggers). Only after this filtering is data sent on for further processing and storage. The parallel for the IoT world is that no, you can’t just ship all data to the cloud. You have to pre-filter and, depending on how much data your devices produce, you may have to pre-filter very aggressively using very sophisticated logic.
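The cascade idea can be sketched very simply: a cheap first-level cut rejects most events, and a costlier, more selective cut runs only on the survivors. The event structure, field names and thresholds below are invented for illustration and bear no relation to the actual LHC trigger logic.

```python
# Illustrative two-stage trigger cascade (all values hypothetical).

def level1(event):
    # Fast, coarse cut: keep only events with enough total energy.
    return event["energy"] > 50.0

def level2(event):
    # Slower, more selective cut: require at least two high-energy tracks.
    return sum(1 for t in event["tracks"] if t > 20.0) >= 2

def trigger_pipeline(events):
    # level2 only ever runs on events that survived level1.
    return [e for e in events if level1(e) and level2(e)]

events = [
    {"energy": 10.0,  "tracks": [5.0]},          # rejected at level 1
    {"energy": 80.0,  "tracks": [25.0, 3.0]},    # rejected at level 2
    {"energy": 120.0, "tracks": [30.0, 45.0]},   # kept
]
print(len(trigger_pipeline(events)))  # 1
```

In the real accelerator the first stage is custom hardware precisely because even this trivial per-event test, done in software, could not keep up with a 10⁹/s event rate.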

The data that survives filtering still amounts to ~30PB/year. This data falls in the Big Data class of “never throw it away”, since you don’t know in advance how it may be used in different analyses. So you want permanent storage, but what you may find interesting is that this is not on disk – it goes to a tape archive (who knew we still had tape?). In fact, they have ~100k processors writing at peaks of 20GB/s to 80 tape drives. The rationale for tape is that cost is still a lot lower than for disk, and power requirements are zero when a tape is not being accessed. And since users of the data generally don’t require instantaneous access across the whole dataset, performance is not an issue.
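A quick back-of-envelope check on the figures quoted above (illustrative arithmetic only) shows why this workload suits tape: the peak rate spread across the drive pool is well within what a single modern tape drive can stream sequentially, and the sustained average rate is far below the peak.

```python
# Back-of-envelope arithmetic from the numbers in the text.
PEAK_GBPS = 20   # peak write rate, GB/s
DRIVES = 80      # tape drives
YEARLY_PB = 30   # data retained per year, PB

per_drive_mb_s = PEAK_GBPS * 1000 / DRIVES      # MB/s per drive at peak
avg_gb_s = YEARLY_PB * 1e6 / (365 * 24 * 3600)  # average GB/s over a year

print(round(per_drive_mb_s))   # 250 MB/s per drive at peak
print(round(avg_gb_s, 2))      # ~0.95 GB/s sustained average
```

Sequential streaming at a few hundred MB/s per drive is exactly the access pattern tape handles well; it is random access where tape loses badly, which is why the metadata catalog described next lives online instead.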

On the other hand, you lose a lot of random-access flexibility with tape, so a catalog of metadata is stored online. Once you’ve found what you need, a tape robot will load the appropriate tapes. Could we ever see this for IoT cloud data (or the cloud in general)? There’s arguably a security issue in tapes you can carry away, but since the whole thing is managed by a robot, you might actually have better physical security around a tape vault than we see in conventional systems. Then again, maybe we’ll eventually see higher-density read-only storage advances (an upcoming blog) that will replace both disk and tape.

CERN Big Data is definitely far bigger and far more challenging than we are likely to see in the IoT for some time. Still, I find it interesting to look at how they handle data to get some idea of where we may eventually find ourselves. You can learn more about Big Data for control at the LHC HERE and Big Data for measurement HERE. For the truly dedicated, you can learn about how CERN does real-time filtering of measurement data HERE.
