Chaos Engineering — Trivia & Interesting Facts

Surprising, historical, and little-known facts about chaos engineering.


Netflix invented Chaos Monkey because they migrated to AWS and didn't trust it

In 2010, Netflix engineers built Chaos Monkey to randomly kill EC2 instances during business hours. The reasoning was brutally pragmatic: if individual instances could die at any time in the cloud, services had better handle it gracefully. The tool was open-sourced in 2012 and spawned an entire discipline.
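The core mechanism is simple: periodically pick one instance at random from a group and terminate it. A minimal sketch of that selection logic in plain Python (the names `pick_victim` and `terminate` are illustrative, not Netflix's code; the real tool operated on live EC2 instances in Auto Scaling groups):

```python
import random

def pick_victim(instances, seed=None):
    """Chaos Monkey-style selection: choose one instance at random.

    `instances` is a list of instance IDs. Illustrative sketch only --
    the real tool queried AWS for live instances.
    """
    rng = random.Random(seed)
    return rng.choice(instances)

def terminate(instance_id, fleet):
    """Simulated termination: drop the instance from the fleet in place."""
    fleet.remove(instance_id)
    return fleet

# Simulated fleet of three instances.
fleet = ["i-0a1", "i-0b2", "i-0c3"]
victim = pick_victim(fleet, seed=42)
surviving = terminate(victim, fleet)
assert len(surviving) == 2  # the service must cope with one fewer node
```

Running this "during business hours" was the point: failures happened while engineers were at their desks to observe and fix the fallout.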


The Simian Army had far more destructive members than Chaos Monkey

Netflix's full "Simian Army" included Chaos Gorilla (kills an entire AWS Availability Zone), Chaos Kong (kills an entire region), Latency Monkey (injects artificial delays), Conformity Monkey (shuts down non-conforming instances), and Janitor Monkey (cleans up unused resources). Chaos Kong was considered so dangerous it was run only a few times per year.
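Latency Monkey's style of fault, injecting delay rather than killing anything, is easy to sketch. Here is a toy decorator in that spirit (the `latency_monkey` name and parameters are our own invention for illustration, not Netflix's API):

```python
import functools
import random
import time

def latency_monkey(max_delay_s=0.05, probability=0.5, seed=None):
    """Decorator that injects random latency before a call.

    Mimics the idea behind Latency Monkey: with the given probability,
    sleep for up to `max_delay_s` seconds before running the wrapped
    function. Illustrative sketch only.
    """
    rng = random.Random(seed)

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                time.sleep(rng.uniform(0, max_delay_s))
            return fn(*args, **kwargs)
        return wrapper
    return decorate

@latency_monkey(max_delay_s=0.01, probability=1.0)
def fetch_profile(user_id):
    return {"user": user_id}

assert fetch_profile(7) == {"user": 7}  # result unchanged, just slower
```

Delay injection is in some ways nastier than outright termination: a dead dependency fails fast, while a slow one ties up threads and connection pools upstream.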


The term "chaos engineering" was coined in 2014

Casey Rosenthal, then at Netflix, coined the formal term "chaos engineering" around 2014 and co-authored the defining book "Chaos Engineering: System Resiliency in Practice" (O'Reilly, 2020). Before this, people called it "failure injection testing" or just "breaking things in production." The rebranding to "engineering" was deliberate — it emphasized the scientific, hypothesis-driven approach.


Game Days predate chaos engineering by decades

The concept of "Game Day" — a planned exercise where teams simulate failures — was practiced by NASA, the US military, and nuclear power plants long before tech adopted it. Amazon formalized Game Days for their infrastructure in the mid-2000s, years before Netflix built Chaos Monkey. Jesse Robbins, Amazon's "Master of Disaster," brought fire department incident command practices to tech.


Chaos engineering found a major AWS bug that Amazon itself hadn't caught

In one famous incident, Netflix's chaos experiments discovered that certain AWS API calls would cascade-fail under specific conditions that Amazon's own testing hadn't triggered. By running continuous chaos experiments against real production infrastructure, Netflix effectively became an unpaid (and extremely thorough) QA team for AWS.


Gremlin was the first chaos engineering startup and raised $26M

Gremlin, founded in 2016 by former Netflix and Amazon engineers, was the first company built entirely around chaos engineering as a service. The company raised $26.4 million in Series B funding in 2020. The existence of a funded startup validated that chaos engineering had moved from "Netflix curiosity" to "industry practice."


LitmusChaos brought chaos engineering to Kubernetes and became a CNCF project

LitmusChaos, originally developed at MayaData and later stewarded by ChaosNative (acquired by Harness in 2022), became a CNCF Incubating project. It introduced the concept of "ChaosHub" — a public marketplace of chaos experiments that teams can share. The Kubernetes ecosystem's embrace of chaos engineering demonstrated how container orchestration made failure injection both easier and more necessary.


The Principles of Chaos Engineering has only five principles

The formal discipline is governed by just five principles, published at principlesofchaos.org: (1) build a hypothesis around steady state, (2) vary real-world events, (3) run experiments in production, (4) automate experiments to run continuously, and (5) minimize blast radius. The simplicity is intentional — the hard part is cultural, not technical.
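The five principles map directly onto a control loop: measure steady state, inject an event, compare against the hypothesis, and always clean up. A minimal skeleton of that loop (all names here are our own, assumed for illustration; real tools add scheduling, observability, and automated abort conditions):

```python
import statistics

def run_chaos_experiment(steady_state_probe, inject_fault, revert_fault,
                         tolerance=0.05, samples=5):
    """Hypothesis-driven chaos experiment skeleton (illustrative sketch).

    1. Establish steady state from baseline probes.
    2. Inject a real-world event (the fault).
    3. Measure steady state again and compare against the hypothesis.
    4. Always revert the fault, keeping the blast radius bounded.
    Returns True if the deviation stayed within tolerance.
    """
    baseline = statistics.mean(steady_state_probe() for _ in range(samples))
    inject_fault()
    try:
        during = statistics.mean(steady_state_probe() for _ in range(samples))
    finally:
        revert_fault()  # minimize blast radius: clean up no matter what
    deviation = abs(during - baseline) / baseline
    return deviation <= tolerance

# Toy system: success rate dips from 99% to 95% while the fault is active.
state = {"faulty": False}
probe = lambda: 0.95 if state["faulty"] else 0.99
survived = run_chaos_experiment(
    probe,
    inject_fault=lambda: state.update(faulty=True),
    revert_fault=lambda: state.update(faulty=False),
)
print(survived)  # True here: the ~4% dip is within the 5% tolerance
```

The fourth principle (automate to run continuously) is just this loop on a scheduler; the cultural difficulty is agreeing on what "steady state" and "tolerance" mean for a real service.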


Facebook's Storm project tested what happens when a datacenter literally loses power

Facebook (now Meta) built an internal tool called Storm that could simulate the complete loss of an entire datacenter. In their tests, they discovered that many services had hidden dependencies on specific datacenters even though they were supposedly geo-redundant. These discoveries led to fundamental architectural changes across Facebook's infrastructure.


Chaos engineering almost didn't survive Netflix's internal politics

Early Netflix chaos experiments caused real customer-facing outages, and there was significant internal pressure to shut down the program. The team survived because they could demonstrate that the controlled failures they caused were far less damaging than the uncontrolled failures they prevented. This political dynamic still plays out at every company adopting chaos engineering today.


Slack's "Disasterpiece Theater" is one of the best-named chaos programs

Slack runs chaos engineering exercises under the name "Disasterpiece Theater," a play on "Masterpiece Theatre." The program simulates failures across Slack's infrastructure and has become a model for how mid-size companies can adopt chaos engineering without Netflix's scale. The name helped with internal adoption — engineers actually wanted to participate.


The blast radius concept comes from explosive ordnance engineering

The chaos engineering term "blast radius" — meaning the scope of impact from a failure — is borrowed directly from military explosive ordnance terminology. The metaphor is apt: just as ordnance engineers calculate minimum safe distances, chaos engineers calculate minimum safe failure scopes. The term is often said to have entered software through Site Reliability Engineering practice at Google, though its adoption was gradual and hard to attribute precisely.