The embarrassing outage of Amazon Web Services this week should open our eyes to a growing problem. Complex systems are difficult to manage on their own, and when they are connected in dependent ways, the result is fragile. Such structures are subject to unexpected malfunctions that can spread quickly. One of the most knowledgeable technology companies on the planet learned just such a lesson this week. Amazon’s star child, its cloud services, suffered a major disruption. It was not caused by a nation-state attack, a sophisticated team of hackers, or even a malicious insider bent on destruction. Nonetheless, the lessons are telling, and their ramifications will be important to all of us.
Summary of the Amazon S3 Service Disruption: We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended…
It was one employee, mistyping an input to a single command, who caused a significant outage across major portions of the Internet. Amazon worked furiously to contain and recover from the incident. It will have to rebuild trust with customers who were sold on the resiliency of ‘cloud’ services as a way to avoid such events. Amazon has already stated that it will learn from the event and apply some compartmentalization controls to lessen potential damage in the future. But there is a more significant realization to be made.
The greater lesson for us all is that when hugely sophisticated systems interconnect with each other, there is an exponential increase in complexity. Because these systems rely on, trust, and grant authority to one another, they can fail in spectacular fashion. The AWS example shows how such a situation allows a series of cascading, unintended effects, which could not easily have been predicted, to cause widespread impact. As bad as it may have appeared, it was not too severe. Had it been an intentional attack from a capable, motivated, and sophisticated attacker, I believe the results would have been catastrophic.
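To make the idea of a cascade concrete, here is a minimal sketch in Python. It assumes a deliberately simplified model in which a service fails as soon as anything it depends on fails. The service names and the dependency map are hypothetical, invented purely for illustration, and do not describe Amazon's actual architecture.

```python
from collections import deque


def cascade(dependencies, initial_failures):
    """Return every service that ends up failed.

    `dependencies` maps a service name to the list of services it relies on.
    Simplifying assumption: a service fails if ANY of its dependencies fails.
    """
    # Invert the map: for each service, which services depend on it?
    dependents = {}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    failed = set(initial_failures)
    queue = deque(initial_failures)
    # Breadth-first walk outward from the initial failures.
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in failed:
                failed.add(service)
                queue.append(service)
    return failed


# Hypothetical topology, for illustration only.
DEPENDENCIES = {
    "core_subsystem": [],
    "storage_api": ["core_subsystem"],
    "status_dashboard": ["storage_api"],
    "website_a": ["storage_api"],
    "website_b": ["storage_api"],
    "unrelated_service": [],
}

if __name__ == "__main__":
    for name in sorted(cascade(DEPENDENCIES, ["core_subsystem"])):
        print(name)
```

In this toy model, losing the single `core_subsystem` node takes down everything that transitively depends on it, while `unrelated_service` survives. Real systems have retries, redundancy, and partial degradation that this sketch ignores, but the shape of the problem is the same: the damage is determined by the dependency graph, not by the size of the original mistake.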
With the AWS outage we can see the impact of an unintentional accident and the difficulty of recovering even when everyone is working together to resolve the issue. Now imagine what a malicious and focused cyber-threat could do while being stealthy, striving for maximum damage, and actively undermining the countermeasures and recovery actions of response teams.
Had this been the work of a malicious insider or a professional hacker, the damage would have been a thousand times worse. We would still be picking up the shattered pieces. There would be tears falling from the AWS cloud.
This week it was cloud storage services making websites unavailable. What happens when it is a fleet of autonomous vehicles or the complex national power grid, where a failure puts lives at risk?
We must take a fresh look at how we understand threats, risks, countermeasures, and protection practices, because the individual pieces of the computing world are growing much more complex and increasingly interconnected. Traditional methods are not sufficient for understanding how chain reactions can occur in the next generation of technologies and services.