Runbook Craft

10 cards — 🟢 3 easy | 🟡 4 medium | 🔴 3 hard

🟢 Easy (3)

1. What are the five sections every effective runbook must have?

Answer: (1) Trigger — what alert/symptom activates this runbook, (2) Diagnose — commands to run before taking action, (3) Act — the fix with decision trees for multiple scenarios, (4) Verify — confirm the fix worked, (5) Escalate — when and how to call for help.

Remember: "A runbook is a recipe for operations." It turns tribal knowledge into repeatable procedures.

Remember: "TDAVE" = Trigger, Diagnose, Act, Verify, Escalate. Every runbook needs all five.

2. Why should runbooks include observable thresholds with baselines instead of vague descriptions?

Answer: "High latency" is meaningless without a number. A good runbook says "Check p99 latency. If > 500ms (normal: 80-120ms), proceed." Including the normal baseline tells the responder what "good" looks like, enabling faster diagnosis at 3 AM.

Gotcha: Runbooks rot fast — schedule quarterly reviews. A wrong runbook is worse than no runbook.

Remember: "Runbook rule #1: test it. Rule #2: keep it current."

Example: Instead of "check if latency is high," write: "Check p99 latency. If > 500ms (normal: 80-120ms), proceed to step 3."
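That threshold check can be sketched as a tiny shell helper. This is a hypothetical sketch: the function name is made up, and in practice the p99 value would come from your monitoring system (e.g. a Prometheus query) rather than being passed in by hand.

```shell
#!/bin/bash
set -euo pipefail

P99_THRESHOLD_MS=500      # alert threshold from the runbook
P99_BASELINE="80-120ms"   # documented normal range

# Hypothetical helper: compares a measured p99 (in ms) against the
# runbook threshold and tells the responder exactly what to do next.
check_p99() {
  local p99_ms="$1"
  if [ "$p99_ms" -gt "$P99_THRESHOLD_MS" ]; then
    echo "p99=${p99_ms}ms > ${P99_THRESHOLD_MS}ms (normal: ${P99_BASELINE}) -> proceed to step 3"
  else
    echo "p99=${p99_ms}ms within normal range (${P99_BASELINE}) -> stop here"
  fi
}

# Example: check_p99 730
```

Printing the baseline alongside the measurement is the point: the responder sees what "good" looks like in the same line as the verdict.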

3. What are the five automation levels for runbooks (L0-L4)?

Answer: L0: fully manual prose instructions. L1: copy-paste commands. L2: scripts with parameters. L3: triggered scripts with human approval (e.g., chatbot). L4: fully automated self-healing (no human in the loop). The goal is to move every runbook toward L4 over time, but L1 is infinitely better than nothing.

Remember: "L0=prose, L1=paste, L2=script, L3=approve, L4=auto." Every runbook should climb from L0 toward L4 over time.

🟡 Medium (4)

1. Why do complex incident runbooks need decision trees instead of linear steps?

Answer: Complex incidents have multiple possible causes requiring different remediation paths. Decision trees provide branching logic — "IF recent deploy caused it: rollback. IF dependency is down: page owning team and enable circuit breaker. IF pods are OOM-killed: increase memory limit." This prevents the on-call engineer from guessing which path to follow.

Analogy: A decision tree runbook is like a flowchart — each branch leads to a different fix. Linear runbooks assume only one cause.
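A minimal sketch of that branching logic in shell. The scenario labels and the commented remediation commands are illustrative placeholders, not from the source:

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical decision-tree dispatcher: maps a diagnosis to its
# remediation path; anything unrecognized falls through to escalation.
remediate() {
  case "$1" in
    recent-deploy)
      echo "rollback"
      # e.g. kubectl rollout undo deployment/api-gateway
      ;;
    dependency-down)
      echo "page owning team + enable circuit breaker"
      ;;
    oom-killed)
      echo "increase memory limit"
      ;;
    *)
      echo "escalate"
      ;;
  esac
}
```

Note the catch-all branch: a decision tree that covers only the known causes silently strands the responder on a novel one, so the default must always be "escalate".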

2. What is a game day and how does it validate a runbook?

Answer: A scheduled event where a failure is injected into a system and a responder (deliberately chosen as someone unfamiliar with the service) follows the runbook to resolve it. Success criteria: resolution without asking for help, within time limit, all commands work as documented, and escalation path is clear.

Name origin: "Game day" comes from sports — a scheduled event where you test readiness under realistic conditions.

3. Under what five conditions should a runbook direct the on-call engineer to escalate?

Answer: (1) Diagnosis doesn't match any known scenario. (2) Fix didn't work after one attempt. (3) Multiple services are affected. (4) Data integrity may be compromised. (5) You've been working on it for 15 minutes without progress.

Remember: "Good runbook = copy-paste commands + expected output." If the operator has to think about syntax, the runbook failed.

Remember: "15-minute rule: escalate if no progress in 15 minutes." Better to escalate early than to extend an outage.
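The five conditions can even be encoded as a pre-flight check the responder runs before continuing. This is a hypothetical sketch — the function name, argument order, and yes/no encoding are all invented for illustration:

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical escalation check: prints "escalate" if ANY of the five
# runbook conditions is met, "continue" otherwise.
should_escalate() {
  local known_scenario="$1"    # yes/no: diagnosis matches a documented scenario
  local failed_attempts="$2"   # count of fix attempts that did not work
  local services_affected="$3" # number of affected services
  local data_at_risk="$4"      # yes/no: data integrity may be compromised
  local minutes_elapsed="$5"   # time spent without progress

  if [ "$known_scenario" = "no" ] \
     || [ "$failed_attempts" -ge 1 ] \
     || [ "$services_affected" -gt 1 ] \
     || [ "$data_at_risk" = "yes" ] \
     || [ "$minutes_elapsed" -ge 15 ]; then
    echo "escalate"
  else
    echo "continue"
  fi
}
```

For example, `should_escalate yes 0 1 no 20` escalates on the 15-minute rule alone, even though every other condition is green.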

4. What are the five triggers that should prompt a runbook review?

Answer: (1) After every incident — update with lessons learned. (2) After every deploy — verify commands still work. (3) Monthly — owner reviews for accuracy. (4) Quarterly — full team walkthrough of critical runbooks. (5) New team member onboarding — have them follow runbooks and report gaps.

Gotcha: The trigger teams most often skip in practice is the first one — the post-incident update. Post-incident runbook review should be mandatory, not optional.

🔴 Hard (3)

1. Why are bad runbooks worse than no runbooks at all?

Answer: Bad runbooks give false confidence, contain outdated commands, and send engineers down wrong paths during incidents. An engineer following a stale runbook trusts it is correct, wasting precious incident time on invalid steps. No runbook at least signals uncertainty, prompting the engineer to investigate from first principles or escalate sooner.

War story: An engineer followed a stale runbook that said to restart the primary database. The runbook was written before the HA setup, and the restart caused a failover cascade.

2. How does chaos engineering relate to runbook validation, and what tool can automate this?

Answer: Chaos engineering provides automated failure injection that validates runbooks continuously, not just during scheduled game days. Tools like LitmusChaos define ChaosEngine experiments that target specific applications (e.g., pod-delete for api-gateway), automatically inject failures, and verify that the system recovers as documented in the runbook.

Name origin: Chaos engineering was pioneered by Netflix's Chaos Monkey (2011) — randomly terminating production instances to ensure resilience.
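As a sketch, a pod-delete experiment for the api-gateway example might look like the manifest below. The field layout follows the LitmusChaos v1alpha1 ChaosEngine API as commonly documented, but verify it against the Litmus docs for your version; the namespace, label, and service account names are assumptions.

```shell
#!/bin/bash
set -euo pipefail

# Write a hypothetical ChaosEngine manifest for a pod-delete experiment.
# All names (namespace, app label, service account) are placeholders.
cat > chaosengine.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-gateway-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=api-gateway
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
EOF

# Apply with: kubectl apply -f chaosengine.yaml
# Then confirm the system recovers exactly as the runbook documents.
```

Running this on a schedule turns the game-day check into a continuous one: if the runbook's recovery story stops being true, the experiment fails long before a real incident does.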

3. What makes an L2 runbook script better than an L1 copy-paste command, and what should it include?

Answer: An L2 script is parameterized, reusable, and includes validation. It should accept the deployment name and namespace as parameters, provide a usage message if arguments are missing, echo what it is doing for operator awareness, run the remediation command, and verify the result (e.g., kubectl rollout status with a timeout). This eliminates copy-paste errors and provides guardrails.

Example: A good L2 script starts with `#!/bin/bash` and `set -euo pipefail`, and includes a usage() function, parameter validation, confirmation prompts, and verification steps.
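Putting those pieces together, a minimal L2 script might look like the sketch below. The specific remediation (a `kubectl rollout restart`) is an illustrative choice, not the only valid L2 action, and the script name in the comment is hypothetical:

```shell
#!/bin/bash
# Hypothetical L2 remediation script (e.g. restart-deploy.sh):
# parameterized, validated, confirmed, and verified.
set -euo pipefail

usage() {
  echo "Usage: $0 <deployment> <namespace>" >&2
  return 1
}

restart_and_verify() {
  # Parameter validation: refuse to run without both arguments.
  [ "$#" -eq 2 ] || { usage; return 1; }
  local deploy="$1" ns="$2"

  # Confirmation prompt: guardrail against targeting the wrong service.
  read -r -p "Restart deployment '${deploy}' in '${ns}'? [y/N] " reply
  [ "$reply" = "y" ] || { echo "Aborted."; return 1; }

  # Echo the action for operator awareness, then remediate.
  echo "Restarting deployment/${deploy} in namespace ${ns}..."
  kubectl rollout restart "deployment/${deploy}" -n "${ns}"

  # Verify: fail loudly if the rollout does not finish within 2 minutes.
  kubectl rollout status "deployment/${deploy}" -n "${ns}" --timeout=120s
  echo "Rollout complete."
}

# When run as a script: restart_and_verify "$@"
```

Compared with an L1 copy-paste command, every failure mode here is explicit: missing arguments print usage, the confirmation prompt blocks fat-finger runs, and the rollout-status timeout turns a silent hang into a visible error.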