Watching a spirited debate on Twitter this morning between Tom Peters and some of his followers reminded me of the plot of many spy movies: silently killing an opponent with a lethal injection of some exotic, undetectable poison. We are building in enormous risks in more and more big data systems.
Peters described data as “putty”, and “opinions in an attractive wrapper.” Of course the data scientists charged in with their opinions that data is only as good as the analytics and insight applied to it. If the data in the lake is any good, useful knowledge can be gleaned from it with the right algorithm. Numbers never lie, right?
It’s exactly that assumption that could lead to huge trouble. Computer programmers invented the phrase “garbage in, garbage out”. In those days, once a program was validated, most of the trouble came from miskeyed information – an error in data entry would go undetected until somebody said something didn’t look right on a printed report or billing statement.
Now, with data coming from hundreds, thousands, or millions of sources, every stream is an opportunity for false data. A failure in a sensor – accidental or intentional – could corrupt a single data stream, a problem in a gateway could corrupt many of them, and a compromised predictive model could ruin everything, very quickly. Bad data could permeate an entire system, invalidating all results. Hopefully, someone would be able to shut things down before damage was done. But as scale and interconnectedness grow, the odds of successful human intervention diminish.
One would think security would be the biggest concern. We should take measures to ensure that nobody can break in and steal our valuable data. However, we may be working on the wrong thing:
Talk cybersecurity guru.. Shit WILL hit fan. Biggest concern: Giant bank data cleaned out , false data substituted. Total fin service chaos.
— Tom Peters (@tom_peters) August 8, 2016
This kinda sounds like the plot of Mr. Robot, but it isn’t that far-fetched. Every IoT system could be subject to the same type of attack, and it would be extremely hard to spot because nobody is watching and the things themselves are making the decisions in real time. Humans would blithely plow along, putting trust in their data, until it becomes bloody obvious that the system is all fouled up.
It doesn’t take much to melt an airline, or a bank, or a retailer. Currently, IT teams like to trace blame to some single point of failure: a power outage, a software patch, a network failure, a virus. Teams fix the problem, go in with a dustpan and sweep up the mess, and go back to business as usual assuming the data is all good when things restart.
Transactional data usually has some kind of checkpoint that can be audited, but IoT data is constantly streaming. People say all the time how valuable all this IoT data is. What if someone managed to corrupt it, and everything in the data lake is now corrupt – or at least suspect, because the bad data got fed into a predictive model? It could take person-years to sort something like that out, especially if the snapshots that were supposedly backed up were corrupt as well. These aren’t new problems, but the scale and speed of big data systems increase both the probability and the severity.
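One way to give a stream something like the auditable checkpoint that transactional data has is a hash chain, where each record's digest depends on every record before it. The following is only a minimal sketch of that idea, not a description of any particular product: the record layout and the names `chain_digest`, `build_chain`, and `first_mismatch` are all hypothetical, and it assumes the digests are kept somewhere an attacker who can rewrite the lake cannot also rewrite.

```python
import hashlib
import json

# Hypothetical starting value for the chain (an assumption of this sketch).
GENESIS = "0" * 64


def chain_digest(prev_digest, record):
    """Hash a record together with the digest of everything before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_digest.encode("ascii") + payload).hexdigest()


def build_chain(records):
    """Return one digest per record; each digest covers the whole prefix."""
    digests = []
    prev = GENESIS
    for record in records:
        prev = chain_digest(prev, record)
        digests.append(prev)
    return digests


def first_mismatch(records, digests):
    """Return the index of the first record that no longer matches, or None."""
    prev = GENESIS
    for index, (record, expected) in enumerate(zip(records, digests)):
        prev = chain_digest(prev, record)
        if prev != expected:
            return index
    return None


# Made-up readings, purely for illustration.
stream = [
    {"sensor": "pump-1", "value": 20.1},
    {"sensor": "pump-1", "value": 20.3},
    {"sensor": "pump-1", "value": 20.2},
]
trusted_digests = build_chain(stream)
```

Auditing then means recomputing the chain and comparing it against the trusted digests. A check like this can only show where stored data diverged from what was originally recorded; it says nothing about whether the original reading was honest in the first place.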
Poisoning a data lake with a lethal data injection could undo all the effort put into building it. Most embedded design teams haven’t imagined a problem this large, with the potential for a full-scale meltdown. It builds the case for end-to-end security, provisioning, reprogrammable network keys, and many other steps to isolate and contain breaches, where the threat is not someone stealing data but someone inserting it.
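To make the "inserting, not stealing" threat a little more concrete, here is a minimal sketch of one common defense: each device signs its readings with a per-device key, and the ingest side rejects anything whose signature does not verify. It uses HMAC-SHA256 from the Python standard library. The device ID, the key, the message layout, and the function names `sign_reading` and `accept` are all assumptions made up for illustration, and a real deployment would also need key provisioning, key rotation, and replay protection, none of which is shown here.

```python
import hashlib
import hmac
import json

# Hypothetical per-device keys; in practice these would be provisioned securely.
DEVICE_KEYS = {"sensor-001": b"demo-key-not-for-production"}


def sign_reading(device_id, reading, key):
    """Device side: serialize the reading and attach an HMAC-SHA256 tag."""
    body = json.dumps(
        {"device_id": device_id, "reading": reading}, sort_keys=True
    )
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def accept(message, keys):
    """Ingest side: accept only messages whose tag verifies for a known device."""
    try:
        body = message["body"].encode("utf-8")
        device_id = json.loads(body)["device_id"]
        key = keys[device_id]
        expected = hmac.new(key, body, hashlib.sha256).hexdigest()
        # compare_digest avoids leaking information through timing differences.
        return hmac.compare_digest(expected, message["tag"])
    except (KeyError, ValueError, TypeError, AttributeError):
        # Malformed message or unknown device: reject rather than guess.
        return False


genuine = sign_reading("sensor-001", 21.5, DEVICE_KEYS["sensor-001"])
forged = dict(genuine, body=genuine["body"].replace("21.5", "99.9"))
```

Here `accept(genuine, DEVICE_KEYS)` passes, while the altered copy fails because its tag no longer matches its body. This only stops data injected by someone who lacks the key; a compromised device holding a valid key can still sign false readings, which is why isolation and containment matter too.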