
Ops War Stories — Trivia & Interesting Facts

Surprising, historical, and little-known facts from real operations incidents and war stories.


A janitor unplugging a server to plug in a vacuum cleaner has happened more than once

The "janitor unplugged the server" story is so common in ops that it's become a cliche — but it really happens. In 2007, a cleaning crew at a Fisher & Paykel appliances datacenter in New Zealand accidentally unplugged critical servers while vacuuming. The incident caused a multi-day outage. This is why production racks have locking power cables and why datacenter access controls exist.


Amazon's S3 outage in February 2017 was caused by a typo in a command

An Amazon engineer ran a command to remove a small number of S3 subsystem servers during a debugging exercise, but a typo caused far more servers to be removed than intended. The cascading failure in the us-east-1 region broke or degraded a huge number of websites and services, including the AWS status page itself, which depended on S3 and couldn't be updated. The irony of the status page being hosted on the system it's supposed to report on became an industry parable about dependencies.


The 2013 Google outage that briefly cut global internet traffic by 40%

On August 16, 2013, Google experienced a brief outage that was so widespread it caused global internet traffic to drop by 40%. The outage lasted only 1-5 minutes depending on the service, but it revealed just how much of the internet's traffic flows through Google's infrastructure (Search, YouTube, Gmail, Android services, Google DNS). The incident was caused by a bug in a network management system.


A squirrel has caused more power outages than most hackers

The "CyberSquirrel1" project (cybersquirrel1.com) tracked power infrastructure outages caused by animals. As of its last update, squirrels were responsible for hundreds of documented power outages affecting critical infrastructure, including at least one that took down a NASDAQ data feed. Birds, snakes, and raccoons also appear frequently. The project was created to put cybersecurity threat assessments in perspective.


Knight Capital lost $440 million in 45 minutes due to a deployment error

On August 1, 2012, Knight Capital Group deployed new trading software that contained old, dead code that had been accidentally reactivated. The code sent millions of erroneous orders to the market, buying high and selling low. In 45 minutes, the firm lost $440 million — more than the company's entire market capitalization. Knight Capital never recovered and was acquired by Getco the following year.


The GitLab database deletion was live-streamed on YouTube

When GitLab accidentally deleted their production database on January 31, 2017, they made the unprecedented decision to live-stream the recovery effort on YouTube. Thousands of people watched engineers scramble to recover data. The transparency was widely praised, and GitLab published a detailed post-mortem that became one of the most-read incident reports in the industry. The live stream peaked at 5,000+ concurrent viewers.


British Airways' 2017 outage was caused by a contractor accidentally turning off a UPS

On May 27, 2017, a power failure at British Airways' Heathrow datacenter caused a catastrophic outage that grounded BA flights from Heathrow and Gatwick for two days. The root cause was reportedly a contractor who disconnected an uninterruptible power supply (UPS); the power surge when it was reconnected damaged servers and storage. The outage cost BA an estimated £80 million and stranded 75,000 passengers.


Facebook's 6-hour outage in October 2021 was caused by a BGP withdrawal

On October 4, 2021, Facebook, Instagram, and WhatsApp went down for approximately 6 hours — the longest outage in the company's history. A routine BGP maintenance command accidentally withdrew the routes that allowed the outside world to reach Facebook's DNS servers and datacenters. Engineers couldn't fix it remotely because their remote access tools depended on the same infrastructure. They had to physically go to datacenters to fix it, but their badge access systems were also down.


A leap second bug crashed thousands of Linux servers in 2012

On June 30, 2012, when the leap second was added (23:59:60 UTC), a bug in the Linux kernel's time-keeping code caused high CPU usage and hangs on servers running certain kernel versions. Reddit, Mozilla, Gawker, LinkedIn, and many other services experienced outages. The incident accelerated the adoption of "leap second smearing" — Google's approach of spreading the extra second across a longer period — and renewed calls to abolish leap seconds entirely.
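
For a sense of how smearing works, here is a minimal sketch in Python. It is not Google's implementation; the 24-hour linear window and the helper name smeared_offset are assumptions for illustration. Instead of inserting an extra 23:59:60 second, a smeared clock runs each second slightly long across the window, so that by the moment the leap second lands it has absorbed the whole extra second and simply carries on.

from datetime import datetime, timedelta, timezone

# Minimal leap-smear sketch. The 24-hour linear window is an assumption for
# illustration; real smear schedules (including Google's) have varied.
LEAP = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)  # leap second inserted at 2012-06-30 23:59:60 UTC
WINDOW = timedelta(hours=24)
START = LEAP - WINDOW

def smeared_offset(t: datetime) -> float:
    """Seconds the smeared clock has slowed, relative to a clock that ignores the leap second."""
    if t <= START:
        return 0.0
    if t >= LEAP:
        return 1.0
    return (t - START) / WINDOW  # linear ramp from 0 to 1 second over the window

for hours in (0, 6, 12, 18, 24):
    t = START + timedelta(hours=hours)
    print(t.isoformat(), f"slowed by {smeared_offset(t):.3f}s")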


The Cloudflare outage of July 2019 was caused by a single regex

On July 2, 2019, a regular expression deployed in a Cloudflare WAF rule caused catastrophic backtracking, consuming 100% CPU on every server in Cloudflare's global network simultaneously. The outage lasted 27 minutes and affected millions of websites. The offending rule contained the pattern .*.*=.*, which backtracks catastrophically on inputs it cannot match. Cloudflare responded by moving the WAF to a regex engine that doesn't support backtracking.
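
The failure mode is easy to reproduce with any backtracking engine. The sketch below uses Python's re module, not Cloudflare's WAF, and the input sizes are arbitrary: the simplified pattern is anchored at the start of a string that contains no '=', so the match must fail and the engine tries every way of splitting the string between the two '.*' groups. The time roughly quadruples each time the input doubles, whereas a non-backtracking engine such as RE2 answers in time linear in the input.

import re
import time

# Simplified version of the offending pattern; matching it against input with
# no '=' forces the engine to backtrack through every split of the two '.*'s.
pattern = re.compile(r".*.*=.*")

for n in (1_000, 2_000, 4_000, 8_000):
    payload = "x" * n                 # never contains '=', so the match must fail
    start = time.perf_counter()
    pattern.match(payload)            # anchored match; backtracks ~n^2/2 times
    print(f"n={n:>5}  {time.perf_counter() - start:.3f}s")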


The Mars Pathfinder experienced a priority inversion bug that nearly ended the mission

In 1997, the Mars Pathfinder lander began randomly resetting itself on Mars. Engineers at JPL eventually diagnosed a priority inversion bug in the VxWorks real-time operating system: a low-priority task held a mutex needed by a high-priority task, while a medium-priority task kept preempting the low-priority task, so the mutex was never released; when the high-priority task failed to complete in time, a watchdog reset the system. The fix was uploaded from Earth: enabling the "priority inheritance" flag on the mutex in VxWorks. The bug is now a classic computer science case study.
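
The mechanism is easier to see in a toy scheduler. The sketch below is not the VxWorks or JPL code; the task names, priorities, release times, and step counts are invented. It reproduces the shape of the bug: without priority inheritance, HIGH sits blocked behind MEDIUM's CPU-bound work, while with inheritance LOW is temporarily boosted, releases the mutex quickly, and HIGH finishes first.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    base_prio: int          # higher number = higher priority
    release: int            # tick at which the task becomes runnable
    steps: list             # each step is "lock", "unlock", or "cpu"
    prio: int = 0
    pc: int = 0             # index of the next step to run
    blocked: bool = False

def simulate(priority_inheritance: bool) -> None:
    low  = Task("LOW",    1, release=0, steps=["lock"] + ["cpu"] * 4 + ["unlock"])
    med  = Task("MEDIUM", 2, release=2, steps=["cpu"] * 6)
    high = Task("HIGH",   3, release=1, steps=["lock", "cpu", "unlock"])
    tasks = [low, med, high]
    for t in tasks:
        t.prio = t.base_prio
    owner = None            # task currently holding the shared mutex
    for tick in range(30):
        runnable = [t for t in tasks
                    if t.release <= tick and t.pc < len(t.steps) and not t.blocked]
        if not runnable:
            break
        current = max(runnable, key=lambda t: t.prio)   # strict-priority scheduler
        step = current.steps[current.pc]
        if step == "lock" and owner not in (None, current):
            current.blocked = True
            if priority_inheritance:
                owner.prio = max(owner.prio, current.prio)   # boost the mutex holder
            print(f"t={tick:2}  {current.name} blocks on the mutex")
            continue
        if step == "lock":
            owner = current
        elif step == "unlock":
            owner = None
            current.prio = current.base_prio    # drop any inherited boost
            for t in tasks:
                t.blocked = False               # wake whoever was waiting
        current.pc += 1
        print(f"t={tick:2}  {current.name} {step}")

print("--- without priority inheritance (the original configuration) ---")
simulate(False)
print("--- with priority inheritance (the uploaded fix) ---")
simulate(True)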