Incident Triage — Trivia & Interesting Facts

Surprising, historical, and little-known facts about incident triage in operations.


Triage was invented by Napoleon's surgeon on the battlefield

Baron Dominique Jean Larrey, Napoleon's chief surgeon, developed the concept of triage during the Napoleonic Wars in the early 1800s. He categorized wounded soldiers into three groups: those who would survive without treatment, those who would die regardless, and those who could be saved with immediate care. This same three-tier logic — ignore, defer, act now — underpins every modern incident severity classification system.


Most severity classification systems have too many levels to be useful under stress

Research on decision-making under stress shows that humans can reliably distinguish between 3-4 categories, but performance degrades rapidly beyond that. Despite this, many organizations use 5-level severity systems (SEV1-SEV5). In practice, engineers under stress collapse these into three categories anyway: "everything is fine," "something is wrong," and "everything is on fire." The most effective systems acknowledge this reality.
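In code, that collapse might look like the following minimal sketch. The mapping and bucket names are illustrative assumptions, not any particular company's scheme:

```python
# Hypothetical mapping: collapsing a five-level severity scheme into the
# three categories responders actually use under stress.
TRIAGE_BUCKETS = {
    "SEV1": "act now",
    "SEV2": "act now",
    "SEV3": "something is wrong",
    "SEV4": "something is wrong",
    "SEV5": "everything is fine",
}

def triage_bucket(severity: str) -> str:
    """Return the coarse action bucket for a formal severity level.

    Unknown labels default to "something is wrong" rather than silence.
    """
    return TRIAGE_BUCKETS.get(severity.upper(), "something is wrong")

print(triage_bucket("sev2"))  # act now
```

Defaulting unknown labels to the middle bucket is a deliberate choice: under stress, an unrecognized severity should prompt a look, not be dropped.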


The first five minutes of triage determine the trajectory of the entire incident

Analysis of thousands of incidents by companies like PagerDuty and Datadog shows that incidents where correct triage happens in the first five minutes resolve 3-5x faster than those where initial triage is wrong. Misclassifying an infrastructure problem as an application problem (or vice versa) sends the wrong team scrambling, wasting the most critical minutes. This is why runbooks emphasize quick diagnostic checks over deep analysis.


"SEV1" means completely different things at different companies

There is no industry standard for severity levels. At Google, a "P0" might mean "multiple products are down for all users." At a startup, a "SEV1" might mean "the CEO noticed something looks wrong." This inconsistency becomes acutely painful during multi-company incidents or vendor escalations, where both sides are using the same words to mean different things.


The "20-minute rule" for escalation is widely used but rarely documented

Many experienced incident responders follow an informal rule: if you haven't made measurable progress in 20 minutes, escalate. This heuristic exists because humans are terrible at recognizing when they're stuck — the sunk cost fallacy keeps them debugging a dead end. Google's SRE practices formalize this as "if you're not making progress, you're not the right person for this problem."
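The rule is easy to make mechanical. Here is a small sketch of a progress timer; the class and method names are invented for illustration:

```python
import time

class EscalationTimer:
    """Informal 20-minute rule: if there has been no measurable progress
    since the last checkpoint, hand the incident off.

    Illustrative sketch only, not a real incident-tooling API.
    """

    def __init__(self, limit_seconds: float = 20 * 60):
        self.limit = limit_seconds
        self.last_progress = time.monotonic()

    def record_progress(self) -> None:
        """Call whenever you learn something new or rule something out."""
        self.last_progress = time.monotonic()

    def should_escalate(self) -> bool:
        """True once the no-progress window exceeds the limit."""
        return time.monotonic() - self.last_progress > self.limit
```

The key detail is that the timer resets on *progress*, not on activity: typing more commands into a dead end does not count.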


Correlation is not causation, but during triage, correlation is all you have

During active triage, you don't have time for rigorous causal analysis. You're working with correlations: "this deployment happened 10 minutes before the alert, so it's probably related." Experienced triagers develop a mental model of "likely vs. unlikely correlations" that comes only with practice. The most common triage mistake is confusing temporal correlation (A happened before B) with causation (A caused B).
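The "what changed shortly before the alert?" check can be sketched as a simple time-window filter. The change records and 15-minute window here are illustrative assumptions:

```python
from datetime import datetime, timedelta

def recent_changes(alert_time, changes, window_minutes=15):
    """Return changes that landed shortly before the alert.

    These are correlated with the alert, not proven causes -- but during
    triage they are usually the best leads available.
    """
    window = timedelta(minutes=window_minutes)
    return [c for c in changes
            if timedelta(0) <= alert_time - c["at"] <= window]

changes = [
    {"name": "deploy api v2.3.1", "at": datetime(2024, 5, 1, 10, 50)},
    {"name": "config flag flip",  "at": datetime(2024, 5, 1, 9, 0)},
]
alert = datetime(2024, 5, 1, 11, 0)

for c in recent_changes(alert, changes):
    print(c["name"])  # deploy api v2.3.1
```

Note the filter only surfaces candidates; confirming that the deploy actually caused the alert is a separate, post-triage exercise.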


Auto-remediation handles 60-80% of incidents at mature organizations

Companies like Google, Netflix, and LinkedIn report that 60-80% of their incidents are detected and resolved automatically without human intervention. The humans only get paged for the remaining 20-40% that automation can't handle. This dramatically changes the nature of triage — the incidents that reach humans are, by definition, the weird ones that don't match known patterns.
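Architecturally, that split is a dispatch step: known patterns get a scripted fix, everything else pages a human. A toy sketch, with invented alert types and remediations:

```python
# Hypothetical known-pattern table. In a real system these would be
# tested remediation scripts, not strings.
KNOWN_REMEDIATIONS = {
    "disk_full": "rotate logs and purge tmp",
    "cert_expiring": "renew certificate",
    "pod_crashloop": "restart deployment",
}

def handle_alert(alert_type: str) -> str:
    """Auto-remediate known patterns; page a human for everything else."""
    fix = KNOWN_REMEDIATIONS.get(alert_type)
    if fix is not None:
        return f"auto: {fix}"
    return "page human"

print(handle_alert("disk_full"))        # auto: rotate logs and purge tmp
print(handle_alert("novel_weirdness"))  # page human
```

The table's misses are exactly the "weird ones" the paragraph describes: by construction, humans only ever see alerts that fell through the lookup.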


The "blast radius assessment" should happen before any fix is attempted

Experienced incident responders assess blast radius (how many users, services, or systems are affected) before attempting any fix. This is counterintuitive — the instinct is to fix first, assess later. But a fix attempt that fails can expand the blast radius (e.g., a rollback that breaks more things), while understanding the current impact helps prioritize the response appropriately.
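One common way to assess blast radius is to walk the service dependency graph downstream from the failing component. The graph below is an invented example:

```python
from collections import deque

# Hypothetical dependency graph: edges point from a service to the
# services that depend on it (its callers).
DEPENDENTS = {
    "db-primary": ["orders-api", "billing"],
    "orders-api": ["web-frontend", "mobile-api"],
    "billing": [],
    "web-frontend": [],
    "mobile-api": [],
}

def blast_radius(failed_service: str) -> set:
    """Breadth-first walk over downstream dependents to estimate which
    services may be affected by a failure."""
    seen, queue = {failed_service}, deque([failed_service])
    while queue:
        svc = queue.popleft()
        for dep in DEPENDENTS.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("db-primary")))
# ['billing', 'db-primary', 'mobile-api', 'orders-api', 'web-frontend']
```

The output is an upper bound, which is fine for triage: it is better to over-count affected services than to attempt a fix blind.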


Most incidents are detected by monitoring, not by customers — but the ratio matters

Mature organizations aim for 90%+ of incidents to be detected by monitoring before customers notice. The industry term for this ratio is "detection coverage." When customers report issues before monitoring catches them, it indicates a gap in observability. Tracking the ratio of "monitoring-detected" vs. "customer-reported" incidents is one of the most revealing SRE metrics.
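The metric itself is a one-line ratio. A minimal sketch, with made-up counts:

```python
def detection_coverage(monitoring_detected: int, customer_reported: int) -> float:
    """Fraction of incidents caught by monitoring before customers noticed."""
    total = monitoring_detected + customer_reported
    return monitoring_detected / total if total else 0.0

# e.g. in one quarter: 47 incidents caught by alerts, 3 reported by
# customers first
print(f"{detection_coverage(47, 3):.0%}")  # 94%
```

The hard part is not the arithmetic but the bookkeeping: every incident record needs an honest "how was this first detected?" field for the ratio to mean anything.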


Runbook-driven triage reduces MTTR by 40-60% in studies

Organizations that maintain up-to-date runbooks for common failure scenarios consistently show 40-60% reduction in Mean Time to Resolve compared to ad-hoc troubleshooting. The key insight is that runbooks don't need to solve the problem — they just need to quickly eliminate the most common causes, narrowing the search space. A runbook that says "check these 5 things first" is worth more than one that covers every possible scenario.
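Such a "check these things first" runbook can even live as data. In this sketch, every check and suggested action is an illustrative placeholder:

```python
# Ordered runbook: each quick diagnostic is paired with the action its
# failure would suggest. Entries are illustrative, not prescriptive.
RUNBOOK = [
    ("recent deploy in last 30 min?",   "roll back the deploy"),
    ("error rate spiking on one host?", "drain the bad host"),
    ("database connections saturated?", "look for a slow query"),
    ("upstream dependency degraded?",   "fail over or shed load"),
    ("disk or memory near limits?",     "scale up or clear space"),
]

def next_action(check_results: dict) -> str:
    """Given {check: bool}, return the action for the first check that fired.

    Falls through to escalation when no common cause matches, which is
    the point: the runbook narrows the search space, it doesn't solve
    every incident.
    """
    for check, action in RUNBOOK:
        if check_results.get(check):
            return action
    return "escalate: no common cause matched"

print(next_action({"recent deploy in last 30 min?": True}))
# roll back the deploy
```

Keeping the checks ordered by likelihood is what makes the list worth more than an exhaustive catalog: the common causes are eliminated in minutes.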