Portal | Level: L1: Foundations | Topics: Runbook Craft, On-Call & Incident Command, Incident Response | Domain: DevOps & Tooling

Runbook Craft - Primer

Why This Matters

At 3 AM, your on-call engineer gets paged. They're half awake, stressed, and possibly dealing with a service they didn't write. The difference between a 5-minute resolution and a 2-hour firefight is often one thing: a good runbook.

A runbook is an operational playbook that transforms tribal knowledge into repeatable procedure. It's the document that lets a junior engineer resolve an incident that previously required a senior engineer. It's the bridge between "I know how to fix this" and "anyone on the team can fix this."

Bad runbooks are worse than no runbooks. They give false confidence, contain outdated commands, and send engineers down wrong paths during incidents. This primer teaches you to write runbooks that actually work at 3 AM.

Name origin: "Runbook" comes from mainframe operations in the 1950s-60s, where operators literally had physical binders (books) of procedures to "run" for different situations. The term survived the transition to distributed systems. Some organizations call them "playbooks" (Ansible borrowed this term) or "Standard Operating Procedures" (SOPs).


Runbook Anatomy

Every effective runbook has five sections:

                  RUNBOOK

  ┌─────────────┐
  │ 1. Trigger  │  ← What fires this runbook?
  └──────┬──────┘
         │
  ┌──────▼──────┐
  │ 2. Diagnose │  ← Confirm the problem
  └──────┬──────┘
         │
  ┌──────▼──────┐
  │ 3. Act      │  ← Fix it (decision tree)
  └──────┬──────┘
         │
  ┌──────▼──────┐
  │ 4. Verify   │  ← Confirm it's fixed
  └──────┬──────┘
         │
  ┌──────▼──────┐
  │ 5. Escalate │  ← When to call for help
  └─────────────┘

1. Trigger

What alert, symptom, or event activates this runbook. Be specific:

TRIGGER: PagerDuty alert "HighErrorRate-API-Gateway"
  - Alert fires when: 5xx error rate > 1% for 5 minutes
  - Dashboard: https://grafana.internal/d/api-gateway
  - Service: api-gateway (production)

2. Diagnose

Commands to run and questions to answer before taking action:

DIAGNOSE:
1. Check if the service is actually down or if monitoring is lying:
   $ curl -s https://api.example.com/health | jq .
   Expected: {"status": "healthy", "version": "2.14.3"}

2. Check recent deployments:
   $ kubectl rollout history deployment/api-gateway -n production | tail -5
   If a deploy happened in the last 30 minutes → go to "Recent Deploy" section

3. Check error logs:
   $ kubectl logs -n production -l app=api-gateway --since=10m | grep -i error | head -20
   Look for: connection refused, timeout, OOM, panic

4. Check dependencies:
   $ curl -s https://auth-service.internal/health
   $ pg_isready -h database.internal -p 5432
   If a dependency is down → go to "Dependency Failure" section
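The diagnosis steps above can be bundled into a small wrapper so the responder pastes one command instead of four. A minimal sketch (`run_check` is an illustrative helper, not an existing tool; the URLs in the usage comment are the ones from the steps above):

```shell
#!/bin/sh
# run_check NAME CMD...
# Runs CMD silently and prints PASS/FAIL, so a full triage is one paste.
run_check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name -- follow the matching runbook section"
  fi
}

# Assumed usage, mirroring the DIAGNOSE steps above:
# run_check "health endpoint" curl -sf https://api.example.com/health
# run_check "auth dependency" curl -sf https://auth-service.internal/health
```

The wrapper never hides output a human needs; on FAIL, the responder re-runs the single failing command from the runbook to see details.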

3. Act

The fix. Use decision trees for multiple scenarios:

ACTION (choose based on diagnosis):

IF recent deploy caused it:
  $ kubectl rollout undo deployment/api-gateway -n production
  → Go to VERIFY

IF dependency is down:
  1. Page the owning team: [escalation contact]
  2. If API gateway can degrade gracefully, enable circuit breaker:
     $ kubectl set env deployment/api-gateway CIRCUIT_BREAKER=true -n production
  → Go to VERIFY

IF pods are OOM-killed:
  $ kubectl get pods -n production -l app=api-gateway -o wide
  Look for: OOMKilled in status
  Temporary fix: increase memory limit
  $ kubectl set resources deployment/api-gateway -n production \
    --limits=memory=2Gi
  → Go to VERIFY

IF none of the above:
  → Go to ESCALATE

4. Verify

Confirm the fix worked:

VERIFY:
1. Health check returns 200:
   $ curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health
   Expected: 200

2. Error rate back to normal:
   Check dashboard: https://grafana.internal/d/api-gateway
   Error rate should be < 0.1% within 5 minutes

3. No new error logs:
   $ kubectl logs -n production -l app=api-gateway --since=2m | grep -ci error
   Expected: 0 or near 0

4. Alert auto-resolves in PagerDuty (may take 5-10 minutes)
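Verification can also be scripted as a polling loop instead of re-running the curl by hand. A sketch, assuming the same health endpoint as above (`poll_until_healthy` is a hypothetical helper, not an existing command):

```shell
#!/bin/sh
# poll_until_healthy CHECK_CMD MAX_TRIES DELAY_SECS
# Re-runs CHECK_CMD (which must print an HTTP status code) until it
# prints 200 or the attempts run out.
poll_until_healthy() {
  cmd=$1; max=$2; delay=$3; i=0
  while [ "$i" -lt "$max" ]; do
    code=$($cmd)
    if [ "$code" = "200" ]; then
      echo "healthy after $((i + 1)) check(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $max checks"
  return 1
}

# Assumed usage with the runbook's health check:
# poll_until_healthy 'curl -s -o /dev/null -w %{http_code} https://api.example.com/health' 10 30
```

The nonzero exit status on failure makes this usable as a gate in an L2 script: the script can stop and tell the responder to escalate instead of silently declaring victory.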

5. Escalate

When and how to call for help:

ESCALATE if:
- Diagnosis doesn't match any known scenario
- Fix didn't work after one attempt
- Multiple services are affected
- Data integrity may be compromised
- You've been working on it for 15 minutes without progress

ESCALATION PATH:
1. Primary: @alice (api-gateway owner) — Slack + phone
2. Secondary: @bob (platform team lead) — Slack + phone
3. Incident Commander: @charlie — phone only
Phone numbers: [link to secure contact list]

Decision Trees

Complex incidents need branching logic. Represent it clearly:

 Alert: HighLatency-Database
 ├── Is the database reachable?
 │   ├── NO → Check network / security groups / DNS
 │   │        → Escalate to DBA if unreachable
 │   │
 │   └── YES → Check active connections
 │       ├── Connections near max_connections?
 │       │   ├── YES → Find connection leak
 │       │   │        psql> SELECT count(*) FROM pg_stat_activity;
 │       │   │        psql> SELECT * FROM pg_stat_activity WHERE state='idle' ORDER BY query_start;
 │       │   │        → Kill idle connections > 1 hour
 │       │   │
 │       │   └── NO → Check for long-running queries
 │       │       ├── Long query found?
 │       │       │   ├── YES → Evaluate if safe to cancel
 │       │       │   │        psql> SELECT pg_cancel_backend(pid);
 │       │       │   │
 │       │       │   └── NO → Check disk I/O
 │       │       │       → iostat -xz 1 5 on DB host
 │       │       │       → If I/O saturated, escalate to DBA
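The "kill idle connections > 1 hour" leaf above deserves a concrete statement. A hedged sketch that only builds the SQL so you can review it before running it (`idle_kill_sql` is an illustrative helper; `pg_terminate_backend` is the standard Postgres function for ending a backend):

```shell
#!/bin/sh
# idle_kill_sql HOURS
# Emits SQL that terminates connections idle for longer than HOURS.
# Print and review before piping to psql -- terminating backends is disruptive.
idle_kill_sql() {
  hours=${1:-1}
  cat <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '$hours hour';
SQL
}

# Assumed usage against the affected database:
# idle_kill_sql 1 | psql -h database.internal
```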

Automation Levels

Runbooks exist on a spectrum from manual to fully automated:

Level  Description                          Example
L0     Fully manual, prose instructions     "SSH to server, edit config..."
L1     Copy-paste commands                  Specific commands in runbook
L2     Scripts with parameters              ./fix-connections.sh --kill-idle
L3     Triggered scripts (human approval)   Chatbot: "approve remediation?"
L4     Fully automated (self-healing)       Auto-restart, auto-scale

Goal: Move every runbook from L0 toward L4 over time. But L1 (good copy-paste commands) is still far better than nothing.

Gotcha: L4 (fully automated remediation) sounds ideal but has a dangerous failure mode: auto-remediation can mask underlying problems. If your system auto-restarts a crashing service every 5 minutes, you might not notice a memory leak for weeks -- until it overwhelms the auto-healer. Always pair L4 automation with alerts that fire when remediation frequency exceeds a threshold.
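That last sentence is straightforward to implement. A sketch of a remediation-frequency guard, assuming each auto-remediation appends a Unix timestamp to a log file (`allow_remediation` and the file format are assumptions, not an existing tool):

```shell
#!/bin/sh
# allow_remediation LOG_FILE MAX_PER_HOUR
# Records each auto-remediation and refuses once the hourly rate hits
# MAX_PER_HOUR, so a flapping service pages a human instead of looping.
allow_remediation() {
  log=$1; max=$2
  now=$(date +%s); cutoff=$((now - 3600))
  # keep only timestamps still inside the one-hour window
  recent=$(awk -v c="$cutoff" '$1 >= c' "$log" 2>/dev/null)
  count=$(printf '%s\n' "$recent" | awk 'NF { n++ } END { print n + 0 }')
  if [ "$count" -ge "$max" ]; then
    echo "DENY: $count remediations in the last hour (max $max); page a human"
    return 1
  fi
  { printf '%s\n' "$recent"; echo "$now"; } > "$log"
  echo "ALLOW: remediation $((count + 1)) of $max this hour"
}
```

An L4 pipeline would call this before each auto-restart and fire an alert on DENY, which is exactly the "remediation frequency exceeds a threshold" alert described above.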

 Manual                                        Automated
 ├────────┼────────┼────────┼────────┼────────┤
 L0       L1       L2       L3       L4

 L0: Read the prose and figure it out
 L1: Here is the command; copy-paste it
 L2: Here is a script you can run
 L3: Approve, and the machine runs it
 L4: The machine handles it without humans

Moving Up the Automation Ladder

# L1: Copy-paste command in runbook
kubectl rollout undo deployment/api-gateway -n production

# L2: Parameterized script
#!/bin/bash
# rollback.sh — Roll back a deployment
DEPLOY=${1:?"Usage: rollback.sh <deployment> [namespace]"}
NS=${2:-production}
echo "Rolling back $DEPLOY in $NS..."
kubectl rollout undo deployment/"$DEPLOY" -n "$NS"
kubectl rollout status deployment/"$DEPLOY" -n "$NS" --timeout=120s

# L3: Chatops with approval
# Slack bot: "Deployment api-gateway error rate is 5%. Rollback? [Approve] [Deny]"
# On approve → runs rollback.sh

# L4: Fully automated
# Prometheus alert → webhook → rollback script → notification
# No human in the loop
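The L3 approval gate can be prototyped in a few lines before any chat-bot exists. A sketch (`approve_then_run` is illustrative; real ChatOps would read the approval from Slack, not stdin):

```shell
#!/bin/sh
# approve_then_run PROMPT CMD...
# Stand-in for the L3 approval step: run CMD only on an explicit "yes".
approve_then_run() {
  prompt=$1; shift
  printf '%s [yes/no]: ' "$prompt"
  read -r answer
  if [ "$answer" = "yes" ]; then
    "$@"
  else
    echo "denied; no action taken"
    return 1
  fi
}

# Assumed usage with the rollback script sketched above:
# approve_then_run "Roll back api-gateway?" ./rollback.sh api-gateway production
```

Requiring the literal word "yes" (not just "y" or Enter) is deliberate: at 3 AM, a default-to-act prompt is how accidental remediations happen.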

Testing Runbooks

A runbook you've never tested is a runbook that doesn't work. Test methods:

Game Days

Schedule a time, inject a failure, and have someone follow the runbook:

Game Day: API Gateway Failure Scenario
Date: 2026-03-20, 14:00 UTC
Facilitator: @alice
Responder: @dave (deliberately chosen: hasn't worked on api-gateway)

Scenario: Kill 2 of 3 api-gateway pods
Inject: kubectl get pods -n production -l app=api-gateway
        kubectl delete pod -n production <pod-1> <pod-2>   (pick 2 of the 3 running pods)

Success criteria:
- Responder follows runbook without asking for help
- Resolution time < 10 minutes
- All commands in runbook work as documented
- Escalation path is clear and correct

Tabletop Exercises

Walk through the runbook verbally without injecting real failures:

"It's 2 AM. You get paged for HighErrorRate-API-Gateway.
You open the runbook. What's your first step?"

Walk through each step. Ask: "What if this command returns X instead of Y?"
Identify gaps, outdated commands, missing decision branches.

Chaos Engineering

Automated failure injection that validates runbooks continuously:

# Litmus chaos experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-gateway-pod-delete
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=api-gateway
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: FORCE
              value: "false"

Runbook Review Process

Runbooks rot faster than code. Review them:

Trigger                       Action
After every incident          Update runbook with lessons learned
After every deploy            Verify commands still work
Monthly (scheduled)           Owner reviews for accuracy
Quarterly                     Full team walkthrough of critical runbooks
New team member onboarding    Have them follow runbooks and report gaps
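The monthly review can start from a simple staleness check. A sketch, assuming runbooks live as markdown files in a repo (`stale_runbooks` and the `./runbooks` path are assumptions):

```shell
#!/bin/sh
# stale_runbooks DIR DAYS
# Lists runbook files not modified in more than DAYS days -- review candidates.
stale_runbooks() {
  find "$1" -name '*.md' -mtime +"$2"
}

# Assumed usage:
# stale_runbooks ./runbooks 30
```

File modification time is a crude proxy (a runbook can be edited and still be wrong), so this flags candidates for the owner's review rather than replacing it.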

Metrics-Driven Runbooks

The best runbooks include observable thresholds:

Don't: "Check if latency is high"
Do:    "Check p99 latency. If > 500ms (normal: 80-120ms), proceed."

Don't: "See if there are too many connections"
Do:    "Check pg_stat_activity count. If > 180 (max: 200), proceed."

Don't: "Check if disk is almost full"
Do:    "Check df -h /data. If usage > 85% (normal: 40-60%), proceed."

Include the normal baseline so the responder knows what "good" looks like.
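Those thresholds can also live in a tiny helper so every runbook phrases its checks the same way. A sketch (integer comparison only; `check_threshold` is an illustrative name, not an existing tool):

```shell
#!/bin/sh
# check_threshold NAME VALUE THRESHOLD BASELINE
# Prints PROCEED when VALUE exceeds THRESHOLD; always shows the normal
# baseline so the responder knows what "good" looks like.
# Note: shell integer comparison only -- use scaled units (ms, not s).
check_threshold() {
  name=$1; value=$2; threshold=$3; baseline=$4
  if [ "$value" -gt "$threshold" ]; then
    echo "PROCEED: $name=$value > $threshold (normal: $baseline)"
  else
    echo "OK: $name=$value <= $threshold (normal: $baseline)"
  fi
}

# Assumed usage with the p99 example above:
# check_threshold p99_latency_ms 640 500 "80-120ms"
```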

Remember the "3 AM test" for runbook quality: could a sleep-deprived engineer who has never seen this service follow the runbook and resolve the incident without calling anyone? If not, the runbook needs more detail, clearer decision trees, or better copy-paste commands.


Key Takeaways

  1. Every runbook needs: trigger, diagnose, act, verify, escalate.
  2. Decision trees handle the "it depends" — don't make the on-call engineer guess.
  3. Copy-paste commands (L1) beat prose instructions (L0) every time.
  4. Test runbooks with real humans who didn't write them.
  5. Include observable thresholds and baselines — "high" is meaningless without a number.
  6. Review runbooks after every incident. Stale runbooks are dangerous.
  7. The goal is automation (L4), but a good manual runbook (L1) saves lives today.

Interview tip: When asked "how do you handle on-call?", mention runbooks early. Interviewers want to hear that you systematize incident response rather than relying on heroics. Saying "we have runbooks for our top 10 failure modes, tested quarterly via game days" signals operational maturity.


Wiki Navigation

  • Incident Command & On-Call (Topic Pack, L2) — Incident Response, On-Call & Incident Command
  • The Psychology of Incidents (Topic Pack, L2) — Incident Response, On-Call & Incident Command
  • Vendor Management & Escalation (Topic Pack, L1) — Incident Response, On-Call & Incident Command
  • Change Management (Topic Pack, L1) — Incident Response
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • On Call Flashcards (CLI) (flashcard_deck, L1) — On-Call & Incident Command