# AWS Troubleshooting — Trivia & Interesting Facts
Surprising, historical, and little-known facts about AWS outages, debugging, and operational surprises.
## The 2017 S3 outage was caused by a typo
On February 28, 2017, during routine maintenance on the S3 billing subsystem, an engineer mistyped a command parameter and removed far more server capacity than intended, taking down a large portion of S3 in us-east-1 for about four hours and breaking thousands of dependent sites. The irony: AWS's own status dashboard depended on S3, so it couldn't display the outage. AWS later redesigned the dashboard to be multi-region.
## us-east-1 has more services and more outages than any other region
AWS's us-east-1 (Northern Virginia) was the first AWS region, launched in 2006. Because new services launch there first, it runs the most complex infrastructure and historically has the most incidents. Experienced AWS engineers call it "the region of broken dreams" and actively architect around it.
## IAM is a global service that can't go down — except when it did
AWS IAM is designed as a globally replicated service with no single point of failure. On June 15, 2021, a problem with IAM's control plane affected the ability to authenticate API calls across multiple regions simultaneously, effectively demonstrating that "global" services can still have correlated failures.
## CloudTrail logs arrive with up to 15 minutes of delay
When troubleshooting AWS issues, engineers are often surprised that CloudTrail management events can take up to 15 minutes to appear, and data events can take even longer. This delay has caused countless hours of confusion during incident response when the logs haven't caught up to real-time events.
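A practical consequence: an empty CloudTrail query during an incident may just mean the logs haven't arrived yet. One defensive pattern is to poll with a deadline rather than treat the first empty result as ground truth. A minimal sketch, assuming a `fetch` callable that wraps something like boto3 CloudTrail's `lookup_events` (the helper name, interval, and injectable `sleep` are illustrative, not an AWS API):

```python
import time

def wait_for_events(fetch, timeout_s=900, interval_s=30, sleep=time.sleep):
    """Poll a CloudTrail-style fetch callable until it returns events or
    the deadline expires. timeout_s defaults to 900 (15 minutes), matching
    CloudTrail's typical worst-case delivery delay for management events."""
    deadline = time.monotonic() + timeout_s
    while True:
        events = fetch()
        if events:
            return events
        if time.monotonic() >= deadline:
            return []  # logs may simply not have been delivered yet
        sleep(interval_s)
```

The injected `sleep` also makes the helper trivially testable without real waits.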
## The "AZ mapping" gotcha has surprised thousands of teams
AWS maps Availability Zone names (like us-east-1a) differently for each account. Your us-east-1a might be a different physical datacenter than your colleague's us-east-1a. This was designed to prevent everyone from clustering in the "first" AZ, but it confuses cross-account troubleshooting regularly.
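The cross-account confusion is avoidable: `ec2:DescribeAvailabilityZones` returns both a `ZoneName` and a `ZoneId` (e.g. `use1-az4`), and zone IDs refer to the same physical AZ in every account. A small sketch of building the name-to-ID translation table (the field names match the EC2 API response; the sample values are made up):

```python
def zone_name_to_id(az_descriptions):
    """Map ZoneName -> ZoneId from an ec2:DescribeAvailabilityZones
    response. Zone IDs are stable across accounts; zone names are not."""
    return {az["ZoneName"]: az["ZoneId"] for az in az_descriptions}

# Illustrative response fragment; in practice this comes from
# boto3: ec2.describe_availability_zones()["AvailabilityZones"]
sample = [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az4"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az6"},
]
mapping = zone_name_to_id(sample)
```

Comparing zone IDs, not zone names, is the reliable way to check whether two accounts' resources actually share a datacenter.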
## Lambda cold starts were originally measured in seconds
When AWS Lambda launched in 2014, cold starts for Java functions could exceed 10 seconds. By 2023, with SnapStart and Firecracker optimizations, cold starts were reduced to under 200 milliseconds for most runtimes. This 50x improvement happened incrementally over nine years of engineering.
## The EBS outage of 2011 taught everyone about AZ independence
In April 2011, an EBS outage in a single AZ in us-east-1 cascaded and took down numerous major websites including Reddit, Foursquare, and Quora. This incident single-handedly drove the adoption of multi-AZ architectures and made "design for failure" a mainstream engineering principle rather than an academic one.
## VPC Flow Logs were not available for the first 9 years of AWS
VPC Flow Logs weren't launched until June 2015, meaning for the first nine years of AWS's existence, there was no native way to log network traffic in a VPC. Network troubleshooting before Flow Logs involved packet captures on individual instances or expensive third-party tools.
## The default service quotas are deliberately low to prevent accidents
AWS sets deliberately low default quotas (formerly called "limits") on most services. For example, the default Lambda concurrent execution limit is 1,000. These aren't technical limitations — they're guardrails. Many production outages have been caused by hitting unexpected quotas during traffic spikes because teams never requested increases.
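One habit that prevents these surprises is checking quota headroom before a spike rather than during one. A minimal sketch, assuming the numbers come from the Service Quotas API or, for Lambda concurrency, `GetAccountSettings` (the 80% warning threshold is an illustrative choice, not an AWS recommendation):

```python
def quota_headroom(used, quota, warn_at=0.8):
    """Return (fraction_used, warning) for a service quota.
    warning is True once usage crosses warn_at, leaving time to
    request an increase before traffic hits the ceiling."""
    frac = used / quota
    return frac, frac >= warn_at

# e.g. 850 concurrent Lambda executions against the default quota of 1,000
frac, warn = quota_headroom(850, 1000)  # frac = 0.85, warn = True
```

Wired into a scheduled check, this turns "we hit a quota mid-incident" into an ordinary capacity-planning ticket.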
## AWS Support response times can exceed the outage duration
Under the Business support plan, AWS guarantees a response to "System impaired" cases within 12 hours. Many teams have discovered during actual outages that by the time AWS Support responds, the incident has already been resolved — or the business has already lost significant revenue. Enterprise and some Business plans offer faster response but at significant cost.