
The Art of the Runbook

  • lesson
  • runbook-design
  • operational-documentation
  • 3am-readability
  • testing
  • maintenance
  • l1

Topics: runbook design, operational documentation, 3am readability, testing, maintenance
Level: L1 (Foundations — everyone should learn this)
Time: 30–45 minutes
Prerequisites: None


The Mission

It's 3am. You've been paged. You've never seen this alert before. You find the runbook. It says:

"Check the logs and restart the service if needed. Contact the team lead for escalation."

This is useless. Which logs? Which service? Restart how? The team lead's phone number isn't listed. You're on your own.

A good runbook is the difference between a 5-minute fix and a 2-hour investigation. This lesson teaches how to write runbooks that actually work at 3am.


The 3am Test

Every runbook must pass this test: Can a sleep-deprived engineer who has never seen this service before follow this runbook and resolve the issue?

If the answer is no, the runbook needs work.

What a 3am engineer needs

✓ Copy-pasteable commands (not "run the usual check")
✓ Expected output for each command ("you should see X")
✓ Decision points clearly marked ("if X then do A, if Y then do B")
✓ Escalation contacts with phone numbers (not "the team lead")
✓ What NOT to do (common mistakes that make things worse)

What a 3am engineer does NOT need

✗ Architecture explanations ("the system uses a microservice pattern...")
✗ History ("this service was created in 2019 when...")
✗ Theory ("this error occurs because of the CAP theorem...")
✗ Links to other docs that might have the answer

Runbook Template

# Runbook: [Alert Name]

## Alert
**Fires when:** [what condition triggers this alert]
**Severity:** SEV-[N]
**Service:** [service name]
**Dashboard:** [direct link to relevant Grafana dashboard]

## Triage (do this first, <2 minutes)

1. Check if it's a real problem:
   ```bash
   curl -s https://app.example.com/health | jq .
   ```
   **Expected:** `{"status": "healthy"}`
   **If unhealthy:** proceed to Fix section
   **If healthy:** likely a false alarm. Check if alert auto-resolves in 5 minutes.

## Quick Fix (try this first)

1. Restart the service:
   ```bash
   kubectl rollout restart deployment/myapp -n production
   ```
2. Watch for recovery (should take <60 seconds):
   ```bash
   kubectl rollout status deployment/myapp -n production
   ```
3. Verify health:
   ```bash
   curl -s https://app.example.com/health | jq .
   ```
   **If healthy:** incident resolved. Post to #incidents channel.
   **If still unhealthy:** proceed to Investigation.

## Investigation (if Quick Fix didn't work)

1. Check pod status:
   ```bash
   kubectl get pods -n production -l app=myapp
   ```
   **If CrashLoopBackOff:** check logs (step 2)
   **If Pending:** check node resources (step 3)
   **If Running but unhealthy:** check dependencies (step 4)

2. Check logs:
   ```bash
   kubectl logs -n production -l app=myapp --tail=50
   ```
   **If "connection refused" to database:** → Database runbook
   **If "out of memory":** → OOM runbook

## Do NOT

- Do NOT scale to 0 replicas (breaks the deployment)
- Do NOT delete the PVC (data loss)
- Do NOT restart the database without checking replication

## Escalation

| Level | Contact | When |
|-------|---------|------|
| L1 | On-call engineer | First response (this runbook) |
| L2 | @alice (platform) | If Quick Fix fails, phone: +1-555-0123 |
| L3 | @bob (database) | If database-related, phone: +1-555-0456 |
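A template only helps if every runbook actually follows it. One way to enforce that is a small linter in CI that fails when a required section is missing. A minimal sketch; the section names mirror the template above and the `runbooks/` path is an assumption:

```bash
#!/bin/sh
# Minimal runbook linter (sketch): fail if a runbook markdown file is
# missing any of the required template sections. Adjust names to taste.
lint_runbook() {
  file="$1"
  missing=0
  for section in "## Alert" "## Triage" "## Quick Fix" \
                 "## Investigation" "## Do NOT" "## Escalation"; do
    # Prefix match, so "## Triage (do this first, <2 minutes)" still passes.
    if ! grep -q "^$section" "$file"; then
      echo "$file: missing section '$section'"
      missing=1
    fi
  done
  return $missing
}

# Hypothetical CI usage:
#   for f in runbooks/*.md; do lint_runbook "$f" || exit 1; done
```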

The Five Rules

1. Commands must be copy-pasteable

# BAD
"Check the database connection"

# GOOD
psql -h db.example.com -U myapp -d production -c "SELECT 1"
# Expected output: 1 row
# If "connection refused": database is down. See Database Runbook.

2. Include expected output

Without expected output, the engineer doesn't know if the command worked:

# Check replication lag
psql -c "SELECT client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS lag_bytes FROM pg_stat_replication;"
# Expected:
#  client_addr | state     | lag_bytes
# -------------+-----------+-----------
#  10.0.2.50   | streaming | 1024       ← Normal: <1MB
#
# If lag_bytes > 100000000 (100MB): replica is falling behind.
# If no rows returned: replication is broken. Escalate to L3.

3. Decision trees, not paragraphs

Pod status:
  ├── Running but unhealthy → check /health endpoint, check logs
  ├── CrashLoopBackOff → check logs with --previous flag
  ├── ImagePullBackOff → check image name, check registry auth
  ├── Pending → check node resources, check PVC binding
  └── Evicted → check node disk pressure, check resource limits
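A decision tree this mechanical can even be encoded as a helper script, so the triage step becomes "run this, do what it says." A sketch, assuming the status strings reported by `kubectl get pods`; the function name is made up:

```bash
#!/bin/sh
# Hypothetical helper: map a pod status (as shown by `kubectl get pods`)
# to the next runbook step, following the tree above.
next_step() {
  case "$1" in
    Running)          echo "check /health endpoint, check logs" ;;
    CrashLoopBackOff) echo "check logs with --previous flag" ;;
    ImagePullBackOff) echo "check image name, check registry auth" ;;
    Pending)          echo "check node resources, check PVC binding" ;;
    Evicted)          echo "check node disk pressure, check resource limits" ;;
    *)                echo "unknown status: escalate" ;;
  esac
}

next_step CrashLoopBackOff
# -> check logs with --previous flag
```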

4. Test the runbook

Run through the runbook yourself (in staging). Every command. Every decision point. Fix what doesn't work. Update what's outdated. Schedule quarterly reviews.
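One way to make "every command, every decision point" repeatable is to extract the fenced bash blocks from the runbook so they can be reviewed or dry-run in staging during the quarterly check. A rough sketch using awk; the runbook path is an assumption:

```bash
#!/bin/sh
# Sketch: print the contents of every ```bash fenced block in a runbook,
# so the commands can be eyeballed or dry-run in staging.
extract_commands() {
  awk '/^[[:space:]]*```bash/ {in_block=1; next}
       /^[[:space:]]*```/     {in_block=0; next}
       in_block' "$1"
}

# Hypothetical usage:
#   extract_commands runbooks/api-high-error-rate.md
```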

5. Link the runbook from the alert

The alert annotation should contain a direct link to the runbook:

annotations:
  runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"
  summary: "API error rate > 1%"

When the engineer gets paged, the runbook link is in the alert. One click.
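That guarantee erodes unless it is checked: a CI guard can fail any rules file where an alert lacks a `runbook_url` annotation. A crude line-counting sketch (assumes Prometheus-style YAML; real tooling would parse the YAML properly):

```bash
#!/bin/sh
# CI guard (sketch): every `alert:` in a Prometheus rules file should
# have a matching `runbook_url:` annotation. Counting lines is crude
# but catches the common case of a forgotten link.
check_runbook_links() {
  file="$1"
  alerts=$(grep -c "alert:" "$file" || true)
  links=$(grep -c "runbook_url:" "$file" || true)
  if [ "$alerts" -ne "$links" ]; then
    echo "$file: $alerts alerts but only $links runbook_url annotations"
    return 1
  fi
}

# Hypothetical usage:
#   check_runbook_links alerts/api-rules.yml || exit 1
```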


Flashcard Check

Q1: The 3am test — what is it?

Can a sleep-deprived engineer who has never seen this service follow this runbook and resolve the issue? If no, the runbook needs work.

Q2: "Check the logs and restart if needed" — why is this bad?

Not copy-pasteable. Which logs? Which service? Restart how? A 3am engineer needs exact commands, expected output, and decision trees.

Q3: How do you keep runbooks up to date?

Test them quarterly (run through in staging). Schedule reviews. If an incident reveals a runbook gap, update it in the postmortem action items.


Takeaways

  1. Copy-pasteable commands. Not "check the service" — exact commands with expected output.

  2. Decision trees, not paragraphs. The engineer needs to know what to do next based on what they see. Trees, not prose.

  3. Test the runbook. If you haven't run through it, it doesn't work. Quarterly reviews catch drift.

  4. Link from the alert. The runbook URL in the alert annotation. One click from page to procedure.

  5. Include "Do NOT" sections. The things that make it worse are as important as the things that fix it.


  • How Incident Response Actually Works — when runbooks are the first tool you reach for
  • The Art of the Postmortem — runbook gaps as contributing factors
  • Prometheus and the Art of Not Alerting — linking alerts to runbooks