
Postmortems & SLOs — Trivia & Interesting Facts

Surprising, historical, and little-known facts about postmortems and Service Level Objectives.


NASA's Apollo-era reviews are an ancestor of the blameless postmortem

NASA's post-incident review process, developed during the Apollo program in the 1960s, is an ancestor of the modern blameless postmortem. The Apollo 13 review board (1970) focused on systemic failures and process improvements rather than on singling out individual engineers. Similar principles were adopted by the aviation industry and eventually reached the tech industry, most visibly through Google's SRE practices.


Google ties SLOs to error budgets that gate feature releases

Google formalized the concept of error budgets in its 2016 SRE book: if a service has a 99.9% availability SLO, it has a "budget" of roughly 43 minutes of downtime per month. When the error budget is exhausted, the development team freezes feature releases and focuses on reliability. This mechanism turned SLOs from aspirational targets into enforceable contracts between development and operations teams.


The term "postmortem" literally means "after death" and originally referred to autopsies

Borrowed from medical pathology, the term "postmortem" was adopted by engineering to describe the analysis performed after a system failure. Some organizations prefer "incident review" or "retrospective" because the medical association implies something has died. Regardless of terminology, the practice of systematic post-incident analysis is considered the single most valuable reliability improvement practice.


99.99% availability allows only about 4.3 minutes of downtime per month

The difference between "three nines" (99.9%) and "four nines" (99.99%) is enormous in operational terms: over a 30-day month, 99.9% allows about 43 minutes of downtime, while 99.99% allows only about 4.3 minutes. Each additional nine is often estimated to cost roughly 10x more in engineering effort, infrastructure redundancy, and operational rigor, and many teams that claim 99.99% availability have never measured it rigorously.
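These allowances follow directly from the SLO arithmetic; a minimal sketch, assuming a 30-day month:

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime budget (in minutes) implied by an availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return (1 - slo) * total_minutes

# Compare successive "nines" over a 30-day month
for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{slo:.3%}  ->  {allowed_downtime_minutes(slo):7.2f} min")
```

Three nines works out to 43.2 minutes and four nines to 4.32 minutes; the figures quoted above are these values rounded. Using an average calendar month (about 30.44 days) instead of 30 days shifts each number by roughly 1.5%.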


Blameless postmortems can increase incident reporting threefold or more

Organizations that adopt blameless postmortem cultures often report a 3-5x increase in the number of incidents voluntarily reported. When engineers know they will not be punished, they report near-misses and minor incidents that were previously swept under the rug. This increased reporting dramatically improves the organization's ability to prevent major outages.


The "5 Whys" technique was invented at Toyota in the 1930s

The "5 Whys" root cause analysis technique, commonly used in postmortems, was developed by Sakichi Toyoda at the Toyota Motor Corporation in the 1930s. The technique involves asking "why" five times to drill past symptoms to root causes. While useful, modern incident analysis practitioners (like Sidney Dekker and John Allspaw) argue that complex system failures rarely have a single root cause, making "5 Whys" insufficient for software incidents.


SLIs must be measured from the user's perspective to be meaningful

A common mistake is measuring Service Level Indicators (SLIs) at the server rather than the client. A server reporting 99.99% success rate might mask client-side failures caused by DNS issues, CDN problems, or network partitions. Google's SRE practices recommend measuring SLIs as close to the user as possible — ideally from synthetic probes or real-user monitoring, not from server-side logs.
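As an illustration, SLIs computed from client-side probe results might look like the following sketch; the `ProbeResult` type and the 300 ms latency threshold are hypothetical, not taken from any particular monitoring product:

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    ok: bool           # did the request succeed end to end, as the user saw it?
    latency_ms: float  # latency measured at the client, not the server

def availability_sli(results: list[ProbeResult]) -> float:
    """Fraction of probes that succeeded, measured from the client side."""
    return sum(r.ok for r in results) / len(results)

def latency_sli(results: list[ProbeResult], threshold_ms: float = 300.0) -> float:
    """Fraction of successful probes that finished within the threshold."""
    good = [r for r in results if r.ok]
    return sum(r.latency_ms <= threshold_ms for r in good) / len(good)

probes = [ProbeResult(True, 120.0), ProbeResult(True, 450.0),
          ProbeResult(False, 0.0), ProbeResult(True, 200.0)]
print(availability_sli(probes))  # 0.75: one of four probes failed
```

Feeding this from synthetic probes or real-user monitoring captures DNS, CDN, and network-partition failures that server-side logs never see.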


The best postmortems are written within 48 hours of the incident

Experience with incident learning shows that the quality of postmortem analysis degrades rapidly with time: details are forgotten, logs are rotated, and the emotional context that helps explain decisions fades. Google's SRE guidance recommends drafting the postmortem promptly, commonly within 48 hours, with the full review meeting within a week. Organizations that delay postmortems for weeks or months extract significantly less learning value.


Error budgets resolve the fundamental tension between reliability and velocity

Before error budgets, development and operations teams were often in conflict: developers wanted to ship features (risking reliability), while operators wanted stability (blocking releases). Error budgets provide an objective mechanism: if the error budget is healthy, ship faster; if it is exhausted, slow down. This simple concept, formalized by Google, helped defuse a decades-old organizational tension.
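The gating mechanism can be sketched in a few lines; the 25% early-warning threshold below is a hypothetical policy choice, not part of Google's published model:

```python
def budget_minutes(slo: float, days: int = 30) -> float:
    """Total downtime allowance implied by an availability SLO."""
    return (1 - slo) * days * 24 * 60

def release_decision(slo: float, downtime_so_far_min: float) -> str:
    """Gate feature releases on remaining error budget (simplified policy)."""
    budget = budget_minutes(slo)
    remaining = budget - downtime_so_far_min
    if remaining <= 0:
        return "freeze"   # budget exhausted: reliability work only
    if remaining < 0.25 * budget:
        return "caution"  # hypothetical early-warning threshold
    return "ship"

print(release_decision(0.999, 10.0))  # plenty of budget left -> "ship"
print(release_decision(0.999, 50.0))  # 43.2-minute budget blown -> "freeze"
```

Real policies layer on details such as rolling measurement windows and burn-rate alerts, but the core decision rule is exactly this comparison.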