Runbook Craft: Trivia & Interesting Facts

Surprising, historical, and little-known facts about runbooks and operational documentation.

The term "runbook" comes from mainframe operations in the 1960s

In the mainframe era, operators literally ran programs by following a book of procedures — loading tapes in specific sequences, setting switches, and responding to console messages. These physical binders, called "run books," sat next to every operator console. The term survived the transition to modern infrastructure, though the format evolved from laminated pages to wikis to automated playbooks.

NASA's checklists for Apollo were so critical that they were considered part of the spacecraft

NASA's Apollo mission checklists and procedures were developed with the same rigor as the spacecraft hardware. The crew carried procedure books in the cabin and, on lunar surface excursions, cuff checklists strapped to the wrists of their spacesuits. The famous improvisation during Apollo 13, in which engineers in Houston wrote new procedures and read them up to the crew, saved three lives. NASA built on aviation's existing checklist tradition, and its approach in turn shaped how later operations teams think about procedural documentation.

The Checklist Manifesto was inspired by a $0.05 intervention that saves thousands of lives

Atul Gawande's 2009 book "The Checklist Manifesto" was inspired by Peter Pronovost's checklist for inserting central lines in the ICU at Johns Hopkins. The checklist, five simple steps costing essentially nothing, cut the ten-day line infection rate at Johns Hopkins from 11% to zero, with only two infections over the following 15 months. When Michigan hospitals adopted it through the Keystone Initiative, the program was estimated to have saved more than 1,500 lives and about $175 million in its first 18 months. The book became popular reading in operations because it showed that simple procedures prevent catastrophic failures.

Google's SRE book dedicates an entire chapter to "Managing Incidents" with runbook emphasis

Google's SRE book (2016) argues that on-call engineers should not have to improvise under pressure when a failure has been seen before. Its introduction reports that recording best practices ahead of time in a "playbook" produces roughly a 3x improvement in mean time to repair (MTTR) compared with "winging it." The goal is that engineers can handle most pages by following documented procedures, reserving creative problem-solving for truly novel failures.

Runbooks that aren't tested regularly have a failure rate exceeding 50%

Practitioners commonly report that untested runbooks fail more often than they succeed when needed. Procedures reference commands that no longer work, paths that have changed, or tools that have been replaced. PagerDuty's research suggests that runbooks should be tested at least quarterly, and ideally exercised during Game Days. A common failure mode is a step that assumes access or permissions that have since been revoked.
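
One cheap form of regular testing is a check that every tool a runbook names is still installed. The sketch below is a minimal illustration, not a complete validator: the function name `missing_executables` is hypothetical, and it only confirms that the first word of each pipeline stage resolves on PATH. It does not verify flags, paths, or permissions.

```python
import shlex
import shutil


def missing_executables(commands):
    """Return executables named in `commands` that cannot be found on PATH."""
    missing = []
    for command in commands:
        # Naive split on "|": good enough for simple pipelines, but it would
        # mis-handle a pipe character that appears inside a quoted argument.
        for stage in command.split("|"):
            words = shlex.split(stage)
            if not words:
                continue
            tool = words[0]
            if shutil.which(tool) is None and tool not in missing:
                missing.append(tool)
    return missing


if __name__ == "__main__":
    runbook_commands = [
        'mysql -e "SHOW SLAVE STATUS\\G" | grep Seconds_Behind_Master',
    ]
    for tool in missing_executables(runbook_commands):
        print(f"runbook references a tool that is not installed: {tool}")
```

Running a check like this on a schedule catches replaced or removed tools before an incident does, though it is no substitute for exercising the full procedure.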

The US military's concept of "Standard Operating Procedures" dates to the 18th century

Friedrich Wilhelm von Steuben drafted the "Blue Book" (Regulations for the Order and Discipline of the Troops of the United States) over the winter of 1778-79, during the Revolutionary War, and Congress adopted it in 1779. It standardized everything from marching formations to camp layouts. The military SOP tradition (detailed, tested, updated procedures for predictable situations) is a direct ancestor of modern operational runbooks.

Automation from runbooks follows a predictable maturity ladder

Organizations typically evolve through four stages: (1) tribal knowledge (procedures exist only in people's heads), (2) written runbooks (documented but manual), (3) semi-automated runbooks (copy-pasteable commands with human judgment), and (4) fully automated playbooks (triggered by alerts, executed by machines). Most organizations are stuck between stages 2 and 3, and many never reach stage 4.
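
Stage 3 can be sketched as a small wrapper that shows the exact command and leaves the decision to run it with the operator. This is an illustrative sketch only: `run_step` and its `confirm` parameter are hypothetical names, and the command is run without a shell, so pipes and redirects are not supported.

```python
import shlex
import subprocess


def run_step(description, command, confirm=input):
    """Show one runbook step and run it only if the operator answers 'y'."""
    print(f"STEP: {description}")
    print(f"  $ {command}")
    answer = confirm("Run this command? [y/N] ")
    if answer.strip().lower() != "y":
        return None  # operator declined, so nothing was executed
    # The command is split into arguments and executed directly (no shell).
    completed = subprocess.run(shlex.split(command), capture_output=True, text=True)
    print(completed.stdout, end="")
    return completed.returncode


if __name__ == "__main__":
    run_step("Check free disk space on the host", "df -h")
```

Moving from this toward stage 4 mostly means replacing the `confirm` prompt with a condition evaluated by the alerting system, which is also where most of the risk lies.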

The "bus factor" for operational knowledge at most companies is 1

The "bus factor" — how many people would need to be hit by a bus before critical knowledge is lost — is effectively 1 for most operational procedures at most companies. One person knows how to restart the payment system, one person knows the database failover procedure. Runbooks are the primary mitigation for this risk, but only if they capture the implicit knowledge that experts carry in their heads.
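
The arithmetic is simple enough to audit directly. The sketch below assumes a hand-maintained mapping from each procedure to the people who can perform it without help. The procedure and person names are invented for illustration.

```python
def bus_factor(knowledge):
    """Return (people-per-procedure counts, overall bus factor).

    The overall bus factor is the smallest count: losing that many people
    is enough to leave at least one procedure with nobody who knows it.
    """
    per_procedure = {name: len(set(people)) for name, people in knowledge.items()}
    overall = min(per_procedure.values(), default=0)
    return per_procedure, overall


if __name__ == "__main__":
    # Hypothetical team data, for illustration only.
    knowledge = {
        "restart payment system": ["alice"],
        "database failover": ["bob"],
        "rotate TLS certificates": ["alice", "carol", "dave"],
    }
    counts, overall = bus_factor(knowledge)
    for name, count in sorted(counts.items(), key=lambda item: item[1]):
        print(f"{count}  {name}")
    print(f"overall bus factor: {overall}")
```

Sorting by count puts the procedures most in need of a written runbook at the top of the list.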

Copy-pasteable commands in runbooks are a deliberate design choice, not laziness

Experienced runbook authors include exact, copy-pasteable commands rather than describing what to do in prose. This is deliberate: during a 3 AM incident, cognitive function is impaired, and translating "check the replication lag on the secondary database" into the actual command introduces a failure point. The runbook should say mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master, ready to paste. (On MySQL 8.0.22 and later, the equivalent statement is SHOW REPLICA STATUS, and the field is named Seconds_Behind_Source.)
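
The same principle extends to reading the result: rather than asking a tired operator to interpret raw output, a runbook can ship a small helper that extracts the number. The sketch below parses the vertical-format status text and accepts either Seconds_Behind_Master or its newer name, Seconds_Behind_Source. The function name `replication_lag_seconds` is hypothetical, and the sample text is abbreviated for illustration.

```python
def replication_lag_seconds(status_output):
    """Return replication lag in seconds, or None if it is NULL or absent."""
    for line in status_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("Seconds_Behind_Master", "Seconds_Behind_Source"):
            value = value.strip()
            # MySQL reports NULL when the replication threads are not running.
            return None if value == "NULL" else int(value)
    return None


if __name__ == "__main__":
    # Abbreviated sample of vertical-format output, for illustration only.
    sample = """
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 42
    """
    print(replication_lag_seconds(sample))
```

A `None` result deserves its own runbook branch, since it usually means replication has stopped rather than caught up.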

PagerDuty open-sourced their entire incident response documentation

PagerDuty published their incident response documentation, including runbook templates, on-call guides, and postmortem processes, as an open-source resource at response.pagerduty.com. It became one of the most widely referenced operational documentation templates in the industry, and many companies use it as a starting point for their own incident management practices.

The OODA loop applies directly to runbook structure

Colonel John Boyd's OODA loop (Observe, Orient, Decide, Act), developed to describe fighter-pilot decision-making, maps closely to runbook structure: observe (check metrics, read alerts), orient (interpret those signals to understand the current state), decide (choose a remediation path), and act (execute the fix). Well-structured runbooks follow this pattern naturally, guiding the operator through each phase rather than jumping straight to action.
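
One way to enforce this structure is a template that refuses to treat a runbook as complete until every phase has at least one step. The sketch below is an assumption about how such a template might look. The `Runbook` class, its methods, and the sample steps are all hypothetical.

```python
from dataclasses import dataclass, field

PHASES = ("observe", "orient", "decide", "act")


@dataclass
class Runbook:
    title: str
    steps: dict = field(default_factory=lambda: {phase: [] for phase in PHASES})

    def add(self, phase, instruction):
        """Append an instruction to one of the four OODA phases."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.steps[phase].append(instruction)

    def missing_phases(self):
        """List phases that have no steps yet, in OODA order."""
        return [phase for phase in PHASES if not self.steps[phase]]

    def render(self):
        """Render the runbook as plain text, one numbered list per phase."""
        lines = [self.title]
        for phase in PHASES:
            lines.append(f"{phase.upper()}:")
            lines.extend(
                f"  {number}. {text}"
                for number, text in enumerate(self.steps[phase], 1)
            )
        return "\n".join(lines)


if __name__ == "__main__":
    runbook = Runbook("High replication lag")
    runbook.add("observe", "Check the replication lag on the secondary database")
    runbook.add("act", "Fail over to the standby replica")
    print(runbook.render())
    print("missing phases:", runbook.missing_phases())
```

In this example the check flags the missing orient and decide phases, which is exactly the "jumping straight to action" pattern the structure is meant to prevent.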