Incident Command: Trivia & Interesting Facts

Surprising, historical, and little-known facts about incident command systems and practices.

The Incident Command System was created after a catastrophic California wildfire season

ICS was developed in the 1970s after a series of California wildfires in 1970 burned more than 500,000 acres and killed 16 people. The post-incident review found that most problems weren't firefighting failures: they were communication, coordination, and command failures. The FIRESCOPE program (Firefighting Resources of Southern California Organized for Potential Emergencies) developed ICS to solve these problems. Tech adopted it decades later.

Jesse Robbins brought ICS to tech from his experience as a volunteer firefighter

Jesse Robbins, who held the title "Master of Disaster" at Amazon, was a volunteer firefighter who realized that the ICS framework used by fire departments could solve the same coordination chaos he saw during Amazon outages. He introduced GameDay exercises and formal incident command at Amazon in the mid-2000s. Robbins later co-founded Opscode (the configuration management company since renamed Chef) and Orion Labs.

The Incident Commander's primary job is NOT to fix the problem

A common misconception is that the Incident Commander (IC) should be the most technically skilled person. In practice, the IC's job is coordination: maintaining situational awareness, delegating technical work, managing communication, and making decisions. The best ICs often contribute zero technical troubleshooting. Google's SRE book describes the incident commander as holding the high-level state of the incident and delegating the hands-on work to a separate operations role, which leaves little room for "keyboard time" during an incident.

PagerDuty processes over 15 billion events per year

PagerDuty, founded in 2009, processes over 15 billion events annually and triggers millions of incidents. The company's data shows that the median acknowledgment time for pages is 3.5 minutes, and the median resolution time is 30 minutes. Organizations with mature incident management practices resolve incidents 5x faster than those without.

The "blameless postmortem" concept was popularized by John Allspaw at Etsy

John Allspaw, then Etsy's SVP of Technical Operations and later its CTO, championed blameless postmortems in the early 2010s, arguing that blaming individuals for system failures was both unjust and counterproductive. His 2012 blog post "Blameless PostMortems and a Just Culture" became one of the most influential pieces in the DevOps movement. The concept was borrowed from the "just culture" framework used in aviation and other safety-critical fields, which had been improving airline safety since the 1990s.

Statuspage was built because companies kept improvising status communication during outages

Statuspage.io, acquired by Atlassian in 2016 for approximately $25 million, was created because the founders noticed that every company built ad-hoc status pages during major outages — usually a hastily deployed static HTML page. The product formalized the communication workflow that every team was reinventing during their worst moments. Over 40,000 companies now use it.

The "5 Whys" technique used in postmortems originated at Toyota

The "5 Whys" technique is attributed to Sakichi Toyoda, founder of the Toyoda family of companies, and was later built into the Toyota Production System, where Taiichi Ohno popularized it. The idea, asking "why" five times to reach the root cause, was adopted by the lean manufacturing movement, then by Agile, and finally by incident management. However, many incident management experts now caution against 5 Whys, arguing that complex system failures rarely have a single linear causal chain.

Major cloud provider outages often cascade because of shared control planes

Several notable cloud outages, including AWS us-east-1 in December 2021 and Azure Active Directory in March 2021, involved failures in shared control plane or identity services that impaired the ability to manage or recover other services. The December 2021 AWS outage was particularly ironic: the monitoring systems used to detect the outage were themselves affected by it, delaying detection and response.

The average cost of a major incident for a Fortune 1000 company exceeds $500,000

A widely cited Gartner estimate puts the average cost of IT downtime at $5,600 per minute. At that rate, a 90-minute incident at a Fortune 1000 company costs about $504,000, a figure meant to cover lost revenue, productivity, recovery costs, and reputation damage. This calculation is why mature organizations invest heavily in incident management tooling and training: even expensive programs can pay for themselves by preventing a single major incident.
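The arithmetic behind the $500,000 figure is a single multiplication, sketched below. The function name `downtime_cost` and the constant name are hypothetical, and the per-minute rate is only the estimate quoted above, not a measured value for any particular company.

```python
# Hypothetical helper: names and the default rate are illustrative only.
GARTNER_COST_PER_MINUTE = 5_600  # dollars per minute, the estimate quoted above


def downtime_cost(minutes: float, cost_per_minute: float = GARTNER_COST_PER_MINUTE) -> float:
    """Estimate the cost in dollars of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # A 90-minute incident at the quoted rate: 90 * 5,600 = 504,000.
    print(f"${downtime_cost(90):,.0f}")  # prints $504,000
```

Passing a different `cost_per_minute` shows how sensitive the total is to the assumed rate, which varies widely by business.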

Slack channels during incidents generate a median of 200+ messages per hour

Analysis of incident Slack channels shows that during active major incidents, communication volume typically exceeds 200 messages per hour, with peaks of 500+ during the most chaotic phases. This volume is why the "scribe" role exists in ICS-style incident response: someone must maintain a coherent timeline because nobody can follow a real-time Slack channel that produces 3+ messages per minute, and 8+ at peak.
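To put the hourly figures above in reading-speed terms, the sketch below converts them to messages per minute and seconds between messages. The function names are hypothetical, and the inputs are just the 200 and 500 messages-per-hour figures quoted above.

```python
def messages_per_minute(messages_per_hour: float) -> float:
    """Convert an hourly message rate to a per-minute rate."""
    return messages_per_hour / 60


def seconds_between_messages(messages_per_hour: float) -> float:
    """Average gap in seconds between messages at the given hourly rate."""
    if messages_per_hour <= 0:
        raise ValueError("messages_per_hour must be positive")
    return 3600 / messages_per_hour


if __name__ == "__main__":
    for rate in (200, 500):  # typical and peak volumes quoted above
        print(
            f"{rate}/hour = {messages_per_minute(rate):.1f}/min, "
            f"one message every {seconds_between_messages(rate):.1f} s"
        )
```

At 200 messages per hour a new message arrives roughly every 18 seconds. At 500 per hour the gap drops to about 7 seconds, which leaves little time to read and act on each one.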

Fire departments practice "hot wash" debriefs, the model for tech postmortems

After every significant fire response, crews conduct a "hot wash" — an immediate, informal debrief while memories are fresh. This practice, typically done standing in the parking lot before leaving the scene, directly inspired the "learning review" format used by progressive tech companies. The key principle borrowed from fire services: debrief while the details are still vivid, not three days later.