Runbook Craft - Street-Level Ops¶
Quick Diagnosis Commands¶
# ── Service Health ──
curl -s https://api.example.com/health | jq . # Application health
kubectl get pods -n production -l app=myapp -o wide # Pod status
kubectl top pods -n production -l app=myapp # Resource usage
kubectl get events -n production --sort-by=.metadata.creationTimestamp | tail -20
# ── Recent Changes ──
kubectl rollout history deployment/myapp -n production # Deploy history
helm history myapp -n production # Helm release history
git log --oneline --since="4 hours ago" -- deploy/ # Recent deploy changes
# ── Logs ──
kubectl logs -n production -l app=myapp --since=10m --tail=100 # Recent logs
kubectl logs -n production -l app=myapp -p --tail=50 # Previous container logs (crash)
journalctl -u myapp --since "10 min ago" --no-pager # Systemd service logs
# ── Dependencies ──
kubectl exec -it deploy/myapp -n production -- \
curl -s http://dependency-service:8080/health # Internal dependency
nslookup database.internal # DNS resolution
nc -zv database.internal 5432 # Port connectivity
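The commands above can be bundled into a one-shot triage script so the responder runs everything in order instead of copy-pasting line by line. A sketch, assuming the same namespace (`production`) and label (`app=myapp`) as above; adjust both for your service:

```shell
#!/bin/bash
# One-shot triage: run the quick-diagnosis commands in order with a
# labelled header per section. NAMESPACE and APP are placeholders.
NAMESPACE="${NAMESPACE:-production}"
APP="${APP:-myapp}"

section() { printf '\n== %s ==\n' "$1"; }

if command -v kubectl >/dev/null 2>&1; then
  section "Pod status"
  kubectl get pods -n "$NAMESPACE" -l app="$APP" -o wide

  section "Resource usage"
  kubectl top pods -n "$NAMESPACE" -l app="$APP"

  section "Recent events"
  kubectl get events -n "$NAMESPACE" \
    --sort-by=.metadata.creationTimestamp | tail -20

  section "Deploy history"
  kubectl rollout history "deployment/$APP" -n "$NAMESPACE"

  section "Recent logs"
  kubectl logs -n "$NAMESPACE" -l app="$APP" --since=10m --tail=100
else
  echo "kubectl not found; skipping cluster checks"
fi
```

Keeping the script in the same repo as the runbooks means it gets reviewed (and goes stale) alongside them.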
Pattern: Runbook Template (Copy and Adapt)¶
# Runbook: [Alert Name]
**Last updated:** YYYY-MM-DD
**Owner:** @team-or-person
**Service:** service-name
**Dashboard:** https://grafana.internal/d/xxxxx
---
## Trigger
This runbook is activated by:
- Alert: `AlertName` (PagerDuty / OpsGenie)
- Fires when: [condition, e.g., "error rate > 1% for 5 min"]
- Severity: [critical/warning]
---
## Diagnose
### Step 1: Confirm the problem is real
[Command to check if the service is actually affected]
### Step 2: Check for recent changes
[Command to check recent deploys, config changes]
### Step 3: Check dependencies
[Commands to verify upstream/downstream services]
### Step 4: Check resource pressure
[Commands to check CPU, memory, disk, connections]
---
## Act
### Scenario A: [Most common cause]
1. [Exact command]
2. [Exact command]
→ Go to Verify
### Scenario B: [Second most common cause]
1. [Exact command]
2. [Exact command]
→ Go to Verify
### Scenario C: [Third cause]
→ Escalate to [team/person]
---
## Verify
1. Health check: `curl -s -o /dev/null -w "%{http_code}" https://service/health` → expect 200
2. Error rate: Check dashboard, should be < 0.1%
3. Latency: p99 should return to < [X]ms within 5 min
4. Alert: Should auto-resolve within [X] minutes
---
## Escalate
Escalate if:
- None of the scenarios above match
- Fix didn't resolve the issue
- You've been troubleshooting for > [X] minutes
Escalation path:
1. [Name/team] — [contact method]
2. [Name/team] — [contact method]
3. Incident Commander — [contact method]
Pattern: Pager-to-Resolution Flow¶
The mental model for every alert response:
Page received (0:00)
│
├── Acknowledge alert (< 1 min)
│
├── Open runbook + dashboard (< 2 min)
│ Have both on screen before touching anything
│
├── Run diagnosis commands (2-5 min)
│ Answer: What's broken? What changed?
│
├── Decision point (5 min)
│ ├── Known scenario → Follow runbook action
│ ├── Unknown scenario → Escalate NOW
│ └── False alarm → Document and close
│
├── Execute fix (5-10 min)
│ Follow runbook commands exactly
│
├── Verify fix (10-15 min)
│ Run verification commands
│ Watch dashboard for 5+ minutes
│
└── Close out (15-20 min)
├── Resolve alert
├── Update incident channel
└── Note any runbook gaps for later update
Target: Known incidents resolved in < 15 minutes. If you're past 15 minutes without progress, escalate.
One-liner: A runbook's value is inversely proportional to the reader's stress level. At 3 AM with systems down, the reader needs copy-pasteable commands, not architecture explanations.
Gotcha: Runbooks With Outdated Commands¶
Six months ago, the team migrated from Docker Compose to Kubernetes. The runbook still says `docker restart myapp`. The on-call engineer runs it and nothing happens, or worse, it restarts a different service.
# Prevention: Add a "last verified" date and owner
# At the top of every runbook:
# Last verified: 2026-03-01
# Verified by: @alice
# Next review: 2026-06-01
# Automated staleness check (add to CI):
#!/bin/bash
find docs/runbooks/ -name "*.md" -mtime +90 -print
# List runbooks not updated in 90 days
# Route output to Slack #runbook-maintenance
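The Slack routing step can also be scripted; a minimal sketch, assuming an incoming-webhook URL in `SLACK_WEBHOOK_URL` (the variable name and the `docs/runbooks/` layout are assumptions):

```shell
#!/bin/bash
# CI step: flag runbooks untouched for 90+ days and post the list to
# #runbook-maintenance. RUNBOOK_DIR and SLACK_WEBHOOK_URL are
# assumptions; adapt to your repo and Slack setup.
STALE_DAYS=90
RUNBOOK_DIR="${RUNBOOK_DIR:-docs/runbooks}"

stale=$(find "$RUNBOOK_DIR" -name '*.md' -mtime +"$STALE_DAYS" 2>/dev/null | sort)

if [ -n "$stale" ]; then
  echo "Stale runbooks (no edits in ${STALE_DAYS}+ days):"
  echo "$stale"
  if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    # Escape quotes and turn newlines into \n so the list stays
    # readable in a single Slack message.
    body=$(printf '%s' "$stale" | sed 's/"/\\"/g' | awk '{printf "%s\\n", $0}')
    curl -s -X POST -H 'Content-Type: application/json' \
      -d "{\"text\": \"Stale runbooks:\\n${body}\"}" \
      "$SLACK_WEBHOOK_URL" >/dev/null
  fi
else
  echo "All runbooks fresh."
fi
```

One caveat: a fresh CI checkout resets file mtimes, so `-mtime` is unreliable there; in CI, `git log -1 --format=%ct -- <file>` is a more trustworthy staleness signal.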
Pattern: Metrics-Driven Diagnosis Section¶
Replace vague instructions with observable numbers:
## Diagnose: HighLatency-API
### Check 1: Overall latency
Dashboard: https://grafana.internal/d/api-latency
- Normal p99: 80-120ms
- Alert threshold: > 500ms
- Panic threshold: > 2000ms
### Check 2: Database query time
$ kubectl exec deploy/api -n prod -- curl -s localhost:9090/metrics | \
grep db_query_duration
- Normal p99: 15-30ms
- If > 100ms → Database is the bottleneck (go to DB runbook)
### Check 3: Connection pool
$ kubectl exec deploy/api -n prod -- curl -s localhost:9090/metrics | \
grep pool_active_connections
- Max pool size: 50
- Normal active: 10-20
- If > 40 → Connection pool exhaustion (go to Scenario B)
### Check 4: Upstream dependency
$ curl -s -w "\n%{time_total}s" https://auth-service.internal/health
- Normal response: < 50ms
- If > 200ms → Auth service is slow (page auth team)
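Encoding the thresholds makes the triage verdict unambiguous at 3 AM. A sketch of a helper that classifies a measured p99 against the normal/alert/panic bands from Check 1 (the band values come from the checks above; everything else is illustrative):

```shell
#!/bin/bash
# Classify an observed p99 latency (in ms) against the Check 1 bands:
# normal 80-120ms, alert > 500ms, panic > 2000ms.
classify_latency() {
  local p99_ms=$1
  if [ "$p99_ms" -gt 2000 ]; then
    echo "PANIC"
  elif [ "$p99_ms" -gt 500 ]; then
    echo "ALERT"
  elif [ "$p99_ms" -le 120 ]; then
    echo "NORMAL"
  else
    echo "ELEVATED"   # above normal but below the alert threshold
  fi
}
```

Usage: `classify_latency 600` prints `ALERT`; feed it the p99 pulled from the dashboard or the metrics endpoint.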
Pattern: Runbook-Linked Alerts¶
Wire alerts directly to runbooks so the responder doesn't have to search:
# Prometheus alerting rule
- alert: HighErrorRate_APIGateway
expr: |
sum(rate(http_requests_total{status=~"5..", service="api-gateway"}[5m]))
/ sum(rate(http_requests_total{service="api-gateway"}[5m])) > 0.01
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "API Gateway error rate is {{ $value | humanizePercentage }}"
runbook_url: "https://wiki.internal/runbooks/api-gateway-high-error-rate"
dashboard_url: "https://grafana.internal/d/api-gw?orgId=1"
Every alert annotation should include runbook_url. If an alert doesn't have a runbook link, it either needs a runbook written or the alert isn't actionable and should be removed.
Scale note: Teams with 50+ alerts should audit for runbook coverage quarterly. In practice, about 30% of alerts are either unactionable (should be removed or downgraded to dashboards) or duplicate another alert's signal. Fewer, higher-quality alerts with runbooks attached beat a wall of noise.
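The runbook-coverage audit can be partly mechanized. A rough grep/awk sketch, assuming rule files live under `prometheus/rules/` (the path is an assumption) and that each rule starts with a `- alert:` line as in the example above:

```shell
#!/bin/bash
# Rough audit: list alerts whose rule block has no runbook_url
# annotation. RULES_DIR is an assumption; adapt to your layout.
RULES_DIR="${RULES_DIR:-prometheus/rules}"

audit_file() {
  # Walk each "- alert:" block; report it if no runbook_url appears
  # before the next alert (or end of file).
  awk -v file="$1" '
    /- alert:/ {
      if (name != "" && !seen) print file ": " name " missing runbook_url"
      name = $NF; seen = 0
    }
    /runbook_url/ { seen = 1 }
    END { if (name != "" && !seen) print file ": " name " missing runbook_url" }
  ' "$1"
}

for f in "$RULES_DIR"/*.yml; do
  [ -e "$f" ] && audit_file "$f"
done
```

A YAML-aware tool would be more robust, but even this catches the common case during the quarterly audit.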
Gotcha: Copy-Paste Commands Without Variable Substitution¶
The runbook hardcodes an exact pod name. The on-call engineer copies it verbatim. The pod name has changed. The command fails. Panic.
# Bad:
kubectl delete pod api-gateway-7b9f8c6d4-xxxxx -n production
# Good:
# Find the problematic pod first:
kubectl get pods -n production -l app=api-gateway
# Then delete the specific pod (replace <pod-name> with actual):
kubectl delete pod <pod-name> -n production
# Better:
# Delete the oldest pod (likely the problematic one):
kubectl delete pod -n production \
$(kubectl get pods -n production -l app=api-gateway \
--sort-by=.metadata.creationTimestamp -o jsonpath='{.items[0].metadata.name}')
Pattern: Runbook Review Workflow¶
Trigger: Incident resolved using runbook
│
├── During post-incident review:
│ 1. Was the runbook used?
│ 2. Were all commands accurate?
│ 3. Were there missing scenarios?
│ 4. Was the escalation path correct?
│ 5. What would have made resolution faster?
│
├── Runbook owner updates within 48 hours:
│ - Fix incorrect commands
│ - Add new scenario if applicable
│ - Update thresholds/baselines
│ - Update "last verified" date
│
└── Quarterly full review:
- Team walks through each critical runbook
- New team members attempt to follow without help
- Commands executed in staging to verify
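The last step (executing commands in staging to verify) is easier if the commands can be pulled out mechanically. A sketch assuming runbooks are markdown files with fenced code blocks; the example path is hypothetical:

```shell
#!/bin/bash
# Helper for the quarterly review: extract fenced code blocks from a
# runbook so the commands can be replayed against staging. It only
# prints the commands; pipe them into a shell (with a staging
# kubeconfig) after a human has reviewed them.
extract_commands() {
  awk '
    /^```/ { in_block = !in_block; next }  # toggle at each code fence
    in_block { print }
  ' "$1"
}

# Example (hypothetical path):
# extract_commands docs/runbooks/api-gateway-high-error-rate.md
```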
Pattern: Graduated Escalation Runbook¶
For incidents that might resolve on their own but need monitoring:
## Alert: ElevatedLatency-Database
### T+0 (alert fires)
- Acknowledge alert
- Open dashboard: [link]
- Check if latency is trending up or was a spike
### T+5 (if not self-resolved)
- Run diagnosis commands
- Check for long-running queries
- Check connection count
### T+10 (if still elevated)
- Kill long-running queries (> 60s, non-critical)
$ SELECT pg_cancel_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '60 seconds'
    AND query NOT LIKE '%replication%';
- Notify #database channel
### T+15 (if no improvement)
- Page DBA: @dba-oncall
- Begin preparing failover if primary is degraded
### T+30 (if degraded and DBA not resolved)
- Incident Commander decision: failover to replica
- Follow "Database Failover" runbook
Gotcha: Assuming Prior Knowledge¶
The runbook says "drain the node." The on-call engineer joined two weeks ago and doesn't know what draining means. They Google it, find outdated advice, and run the wrong command.
# Bad:
Drain the node.
# Good:
Drain the node (removes all pods gracefully, prevents new scheduling):
$ kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
# Wait for all pods to be evicted (may take 1-5 minutes)
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=node-name
# Should return only DaemonSet pods
Write for the person who has the least context. If a senior engineer reads it, they'll skim. If a junior engineer reads it, they'll be grateful.