Amazon AWS S3 Outage Root Cause: Human Error Led Websites To Crash
Amazon is changing the administration console for the AWS Service Health Dashboard so that it runs across multiple regions. The maintenance behind the outage would have been routine, except that the technician made a typo. The failure caused many major websites that rely on Amazon’s AWS to experience serious problems. The list of affected sites and apps included GitHub, Venmo, Quora, Medium, and Giphy. Amazon is also auditing its systems to ensure similar checks are in place for all services. “The issue has been resolved and the service is operating normally.”
Amazon has now explained exactly what happened on Tuesday, when an S3 failure took down much of the web.
An S3 team member was attempting to execute a command that would remove a small set of servers for one of the S3 subsystems used by the billing system.
The engineer meant to take a small subset of servers offline for debugging, but the command instead took down a much larger group of servers. People with internet-of-things hardware such as thermostats and lightbulbs were also unable to control their devices. But the numbers showing just how big an impact the outage had are even more terrifying.
The AWS outage cost companies in the S&P 500 index $150 million, according to Cyence Inc., a startup that specializes in estimating cyber risks. The disruption lasted for several hours, although technically it was not a full outage, since only subsets of AWS’s complex architecture went down. More than half of the top 100 online retailers saw their websites slow by 20% or more. While most of the sites didn’t go down, many had broken links and were only partly functional. The outage affected many sites, apps, and utilities that rely on the service.
“Yet Apple, Walmart, Newegg, Best Buy, Costco, and surprisingly Amazon/Zappos were not affected by the outage,” an Apica spokesperson told Business Insider. Removing so much server capacity required a full system restart, which took longer than expected, AWS said. The incorrect keystroke knocked far more servers offline than intended and sent many people into a tizzy on Tuesday as websites failed to load. In a statement, Amazon disclosed that a technician’s error led to the service disruptions.