Skip to content

Runbook Craft Footguns

  1. Runbooks that nobody maintains. The runbook was written 18 months ago during onboarding. Since then, the service moved from Docker to Kubernetes, the database was migrated, and the monitoring system changed. Every command in the runbook is wrong. The on-call engineer discovers this at 3 AM during an incident.

Fix: Assign an owner to every runbook. Add a "last verified" date at the top. Set a calendar reminder for quarterly reviews. Run a CI check that flags runbooks not modified in 90 days. Update the runbook after every incident that uses it.

War story: Google's SRE book documents a case where a datacenter failover runbook had not been tested in 18 months. When the failover was needed, the runbook referenced a configuration system that had been replaced and a monitoring dashboard that no longer existed. The team improvised for 4 hours. Google's response was to make runbook "freshness" a trackable SLO — runbooks used in anger must be updated within 48 hours of the incident.

  1. Copy-paste commands without explaining what they do. The runbook says: kubectl delete pod -l app=gateway -n production. The engineer copies it, runs it, and deletes all gateway pods simultaneously instead of one at a time. The runbook didn't explain the blast radius or suggest --field-selector to target a single pod.

Fix: Every command needs a one-line comment explaining what it does and what its blast radius is. Include the expected output. If a command is destructive, mark it clearly and explain the consequences.

  1. No escalation path defined. The runbook covers three scenarios. The on-call engineer hits a fourth. There's no "if none of the above, do this" section. They spend 45 minutes trying variations before finally paging someone. The escalation could have happened at minute 10.

Fix: Every runbook ends with an escalation section: when to escalate (time limit, scenario mismatch, no progress), who to escalate to (names, roles, contact methods), and what information to provide when escalating.

  1. Assuming the reader knows the system intimately. The runbook says "check the connection pool" without specifying which pool, where the metric lives, what the normal value is, or what "too high" means. The author knew all this from context. The on-call engineer does not.

Fix: Write for someone who has never touched this service. Include exact metric names, dashboard URLs, normal baselines ("usually 10-20"), alert thresholds ("bad if > 40"), and exact commands to check.

  1. Writing runbooks as prose paragraphs instead of structured steps. "When the database is slow, you should first check the connection count, and if that's high then look for idle connections, and maybe there's a long-running query, in which case you might want to cancel it, unless it's a replication query." At 3 AM, this reads like a novel nobody will parse.

Fix: Numbered steps. One action per step. Decision trees for branches. Commands in code blocks. Short sentences. Bullet points. Format for scanning, not reading.

  1. No verification step after the fix. The runbook says to restart the service. The engineer restarts it. The runbook ends. But the restart didn't actually fix the problem — the service came back up and immediately hit the same error. Nobody checked because the runbook didn't tell them to.

Fix: Every action section must be followed by a verification section: exact commands to confirm the fix worked, expected output, how long to wait before declaring success, and what to do if verification fails.

Remember: The pattern for every runbook action step is: Do (the command), Verify (check it worked), Fallback (what to do if it didn't). A runbook without verification steps is a recipe, not a runbook. Recipes assume everything goes right. Runbooks assume everything will go wrong.

  1. Runbooks that have never been tested by someone other than the author. The author tested it once when they wrote it. On their machine. With their access levels. With their mental model of the system. A different engineer tries it during an incident and discovers three broken assumptions.

Fix: Test runbooks with a team member who did not write them. Game days and tabletop exercises catch gaps that the author can't see. New team member onboarding is perfect — have them follow runbooks and report confusion.

  1. Including commands that require interactive input. The runbook says mysql -u root -p which prompts for a password. Or vim /etc/config.yml which drops into an editor. At 3 AM, over a flaky SSH connection, interactive commands are fragile and dangerous.

Fix: Use non-interactive alternatives. Pass passwords via environment variables or credential managers. Use sed or kubectl patch instead of interactive editors. If interaction is truly necessary, document exactly what to type and what to expect at each prompt.

  1. One giant runbook that covers everything. The "Database Runbook" is 500 lines covering connection issues, replication lag, disk full, slow queries, failover, backup restoration, and schema migrations. The engineer has to scroll through all of it to find the relevant section during an active incident.

Fix: One runbook per alert or per failure mode. Link between them for related scenarios. "Database Connection Pool Exhaustion" should be separate from "Database Replication Lag." Keep each runbook focused and short enough to follow without scrolling.

  1. Not linking alerts to runbooks. The engineer gets paged. The alert says "HighMemoryUsage on prod-web-03." They open the wiki, search for "memory," find 12 results, spend 5 minutes finding the right one, and it's for a different service.

    Fix: Every alert must include a runbook_url annotation that links directly to the relevant runbook. If an alert doesn't have a runbook, either write one or question whether the alert is actionable. Unactionable alerts without runbooks are just noise.

    Gotcha: Prometheus alerting rules support a runbook_url annotation that Alertmanager passes through to PagerDuty, Slack, and other receivers. But the URL must be a direct deep-link to the specific runbook — not a wiki homepage. If the link goes to a search page or a table of contents, it is nearly useless at 3 AM. Use anchored URLs: https://wiki.example.com/runbooks/database#connection-pool-exhaustion.