
Runbook Craft - Street-Level Ops

Quick Diagnosis Commands

# ── Service Health ──
curl -s https://api.example.com/health | jq .          # Application health
kubectl get pods -n production -l app=myapp -o wide     # Pod status
kubectl top pods -n production -l app=myapp             # Resource usage
kubectl get events -n production --sort-by=.metadata.creationTimestamp | tail -20

# ── Recent Changes ──
kubectl rollout history deployment/myapp -n production  # Deploy history
helm history myapp -n production                        # Helm release history
git log --oneline --since="4 hours ago" -- deploy/      # Recent deploy changes

# ── Logs ──
kubectl logs -n production -l app=myapp --since=10m --tail=100  # Recent logs
kubectl logs -n production -l app=myapp -p --tail=50    # Previous container logs (crash)
journalctl -u myapp --since "10 min ago" --no-pager     # Systemd service logs

# ── Dependencies ──
kubectl exec -it deploy/myapp -n production -- \
  curl -s http://dependency-service:8080/health         # Internal dependency
nslookup database.internal                              # DNS resolution
nc -zv database.internal 5432                           # Port connectivity
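The four blocks above can be bundled into one triage script. A sketch under this page's example names (`production`, `myapp`, `api.example.com`); the `DRY_RUN` flag is a convenience added here, not a kubectl feature:

```shell
#!/bin/bash
# triage.sh — run the diagnosis checks above in one pass.
# Namespace, app label, and health URL default to this page's examples;
# override via env vars. Set DRY_RUN=1 to print commands instead of running.
NS=${NS:-production}
APP=${APP:-myapp}
HEALTH_URL=${HEALTH_URL:-https://api.example.com/health}

run() { if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi; }

echo "── Service health ──"
run curl -s --max-time 5 "$HEALTH_URL"
echo "── Pod status ──"
run kubectl get pods -n "$NS" -l "app=$APP" -o wide
echo "── Resource usage ──"
run kubectl top pods -n "$NS" -l "app=$APP"
echo "── Recent logs ──"
run kubectl logs -n "$NS" -l "app=$APP" --since=10m --tail=100
```

At 3 AM, one command that runs everything beats four you have to remember.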

Pattern: Runbook Template (Copy and Adapt)

# Runbook: [Alert Name]

**Last updated:** YYYY-MM-DD
**Owner:** @team-or-person
**Service:** service-name
**Dashboard:** https://grafana.internal/d/xxxxx

---

## Trigger

This runbook is activated by:
- Alert: `AlertName` (PagerDuty / OpsGenie)
- Fires when: [condition, e.g., "error rate > 1% for 5 min"]
- Severity: [critical/warning]

---

## Diagnose

### Step 1: Confirm the problem is real
[Command to check if the service is actually affected]

### Step 2: Check for recent changes
[Command to check recent deploys, config changes]

### Step 3: Check dependencies
[Commands to verify upstream/downstream services]

### Step 4: Check resource pressure
[Commands to check CPU, memory, disk, connections]

---

## Act

### Scenario A: [Most common cause]
1. [Exact command]
2. [Exact command]
→ Go to Verify

### Scenario B: [Second most common cause]
1. [Exact command]
2. [Exact command]
→ Go to Verify

### Scenario C: [Third cause]
→ Escalate to [team/person]

---

## Verify

1. Health check: `curl -s -o /dev/null -w "%{http_code}" https://service/health` → expect 200
2. Error rate: Check dashboard, should be < 0.1%
3. Latency: p99 should return to < [X]ms within 5 min
4. Alert: Should auto-resolve within [X] minutes

---

## Escalate

Escalate if:
- None of the scenarios above match
- Fix didn't resolve the issue
- You've been troubleshooting for > [X] minutes

Escalation path:
1. [Name/team] — [contact method]
2. [Name/team] — [contact method]
3. Incident Commander — [contact method]

Pattern: Pager-to-Resolution Flow

The mental model for every alert response:

 Page received (0:00)
 ├── Acknowledge alert (< 1 min)
 ├── Open runbook + dashboard (< 2 min)
 │   Have both on screen before touching anything
 ├── Run diagnosis commands (2-5 min)
 │   Answer: What's broken? What changed?
 ├── Decision point (5 min)
 │   ├── Known scenario → Follow runbook action
 │   ├── Unknown scenario → Escalate NOW
 │   └── False alarm → Document and close
 ├── Execute fix (5-10 min)
 │   Follow runbook commands exactly
 ├── Verify fix (10-15 min)
 │   Run verification commands
 │   Watch dashboard for 5+ minutes
 └── Close out (15-20 min)
     ├── Resolve alert
     ├── Update incident channel
     └── Note any runbook gaps for later update

Target: Known incidents resolved in < 15 minutes. If you're past 15 minutes without progress, escalate.
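A tiny helper keeps the 0:00-relative timeline honest by timestamping notes as you go (a sketch; the script name and log path are illustrative):

```shell
#!/bin/bash
# note.sh — append a timestamped line to an incident log, with elapsed
# time since the first note, matching the 0:00-relative flow above.
# Log path is an assumption; override with INCIDENT_LOG.
LOG=${INCIDENT_LOG:-/tmp/incident-$(date +%Y%m%d).log}
START_FILE="${LOG}.start"
[ -f "$START_FILE" ] || date +%s > "$START_FILE"
elapsed=$(( $(date +%s) - $(cat "$START_FILE") ))
printf 'T+%02d:%02d  %s\n' $((elapsed / 60)) $((elapsed % 60)) "$*" | tee -a "$LOG"
```

Usage: `./note.sh "acknowledged alert"` prints something like `T+00:00  acknowledged alert`, and the log doubles as the incident timeline for the post-incident review.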

One-liner: A runbook's value is inversely proportional to the reader's stress level. At 3 AM with systems down, the reader needs copy-pasteable commands, not architecture explanations.


Gotcha: Runbooks With Outdated Commands

Six months ago, the team migrated from Docker Compose to Kubernetes. The runbook still says `docker restart myapp`. The on-call engineer runs it and nothing happens, or worse, it restarts a different service.

# Prevention: Add a "last verified" date and owner
# At the top of every runbook:
# Last verified: 2026-03-01
# Verified by: @alice
# Next review: 2026-06-01

# Automated staleness check (add to CI):
#!/bin/bash
# Note: -mtime is unreliable in CI, because a fresh git clone resets file
# modification times; run this on a long-lived checkout, or key off the
# "Last verified:" header instead.
find docs/runbooks/ -name "*.md" -mtime +90 -print
# List runbooks not updated in 90 days
# Route output to Slack #runbook-maintenance
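A sketch of a header-based check that survives fresh checkouts, assuming GNU `date` and a `Last verified: YYYY-MM-DD` line in each runbook as in the header above:

```shell
#!/bin/bash
# Staleness check keyed off the "Last verified:" header rather than mtime.
# Assumes GNU date and the YYYY-MM-DD header format sketched above.
cutoff=$(date -d "90 days ago" +%s)
for f in docs/runbooks/*.md; do
  d=$(grep -m1 -oE 'Last verified: [0-9]{4}-[0-9]{2}-[0-9]{2}' "$f" | awk '{print $3}')
  if [ -z "$d" ] || [ "$(date -d "$d" +%s)" -lt "$cutoff" ]; then
    echo "STALE: $f (last verified: ${d:-never})"
  fi
done
```

A runbook with no `Last verified:` header at all is flagged too, which is usually the right call.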

Pattern: Metrics-Driven Diagnosis Section

Replace vague instructions with observable numbers:

## Diagnose: HighLatency-API

### Check 1: Overall latency
Dashboard: https://grafana.internal/d/api-latency
- Normal p99: 80-120ms
- Alert threshold: > 500ms
- Panic threshold: > 2000ms

### Check 2: Database query time
$ kubectl exec -it deploy/api -n prod -- curl -s localhost:9090/metrics | \
  grep db_query_duration
- Normal p99: 15-30ms
- If > 100ms → Database is the bottleneck (go to DB runbook)

### Check 3: Connection pool
$ kubectl exec -it deploy/api -n prod -- curl -s localhost:9090/metrics | \
  grep pool_active_connections
- Max pool size: 50
- Normal active: 10-20
- If > 40 → Connection pool exhaustion (go to Scenario B)

### Check 4: Upstream dependency
$ curl -s -w "\n%{time_total}s" https://auth-service.internal/health
- Normal response: < 50ms
- If > 200ms → Auth service is slow (page auth team)

Pattern: Runbook-Linked Alerts

Wire alerts directly to runbooks so the responder doesn't have to search:

# Prometheus alerting rule
- alert: HighErrorRate_APIGateway
  expr: |
    rate(http_requests_total{status=~"5..", service="api-gateway"}[5m])
    / rate(http_requests_total{service="api-gateway"}[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "API Gateway error rate is {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki.internal/runbooks/api-gateway-high-error-rate"
    dashboard_url: "https://grafana.internal/d/api-gw?orgId=1"

Every alert annotation should include `runbook_url`. If an alert doesn't have a runbook link, either it needs a runbook written or the alert isn't actionable and should be removed.
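A CI guard can enforce that rule. This is a naive line-based scan of a Prometheus rules file (the filename is an assumption; a YAML-aware tool would be more robust):

```shell
#!/bin/bash
# Fail CI when an alert rule lacks a runbook_url annotation.
# Naive line-based scan: assumes "- alert:" starts each rule block.
fail=0
current=""
has_url=1
flush() {
  if [ -n "$current" ] && [ "$has_url" -eq 0 ]; then
    echo "MISSING runbook_url: $current"; fail=1
  fi
}
while IFS= read -r line; do
  case "$line" in
    *'- alert:'*)   flush; current="${line#*- alert: }"; has_url=0 ;;
    *runbook_url:*) has_url=1 ;;
  esac
done < "${1:-alerts.yml}"
flush
exit "$fail"
```

A nonzero exit blocks the merge, which turns "every alert needs a runbook" from a convention into a gate.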

Scale note: Teams with 50+ alerts should audit for runbook coverage quarterly. In practice, about 30% of alerts are either unactionable (should be removed or downgraded to dashboards) or duplicate another alert's signal. Fewer, higher-quality alerts with runbooks attached beat a wall of noise.


Gotcha: Copy-Paste Commands Without Variable Substitution

The runbook says:

kubectl delete pod api-gateway-7b9f8c6d4-xxxxx -n production

The on-call engineer copies it verbatim. Pod name has changed. Command fails. Panic.

# Bad:
kubectl delete pod api-gateway-7b9f8c6d4-xxxxx -n production

# Good:
# Find the problematic pod first:
kubectl get pods -n production -l app=api-gateway
# Then delete the specific pod (replace <pod-name> with actual):
kubectl delete pod <pod-name> -n production

# Better:
# Delete the oldest pod (likely the problematic one):
kubectl delete pod -n production \
  $(kubectl get pods -n production -l app=api-gateway \
    --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[0].metadata.name}')

Pattern: Runbook Review Workflow

 Trigger: Incident resolved using runbook
 
 ├── During post-incident review:
 │   1. Was the runbook used?
 │   2. Were all commands accurate?
 │   3. Were there missing scenarios?
 │   4. Was the escalation path correct?
 │   5. What would have made resolution faster?
 
 ├── Runbook owner updates within 48 hours:
 │   - Fix incorrect commands
 │   - Add new scenario if applicable
 │   - Update thresholds/baselines
 │   - Update "last verified" date
 
 └── Quarterly full review:
     - Team walks through each critical runbook
     - New team members attempt to follow without help
     - Commands executed in staging to verify

Pattern: Graduated Escalation Runbook

For incidents that might resolve on their own but need monitoring:

## Alert: ElevatedLatency-Database

### T+0 (alert fires)
- Acknowledge alert
- Open dashboard: [link]
- Check if latency is trending up or was a spike

### T+5 (if not self-resolved)
- Run diagnosis commands
- Check for long-running queries
- Check connection count

### T+10 (if still elevated)
- Kill long-running queries (> 60s, non-critical)
  $ psql -c "SELECT pg_cancel_backend(pid)
     FROM pg_stat_activity
     WHERE state = 'active'
       AND now() - query_start > interval '60 seconds'
       AND query NOT LIKE '%replication%';"
- Notify #database channel

### T+15 (if no improvement)
- Page DBA: @dba-oncall
- Begin preparing failover if primary is degraded

### T+30 (if degraded and DBA not resolved)
- Incident Commander decision: failover to replica
- Follow "Database Failover" runbook
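The T+10 query-kill step is safer as a script that previews before cancelling. A sketch assuming `psql` is installed and `$DATABASE_URL` points at the primary; the `DRY_RUN`/`APPLY` flags are invented here:

```shell
#!/bin/bash
# kill_long_queries.sh — list (or, with APPLY=1, cancel) queries running
# longer than 60s. DRY_RUN=1 prints the SQL instead of executing it.
cols="pid, now() - query_start AS runtime, left(query, 80) AS query"
[ -n "$APPLY" ] && cols="pg_cancel_backend(pid)"
sql="SELECT $cols
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '60 seconds'
  AND query NOT LIKE '%replication%';"
if [ -n "$DRY_RUN" ]; then
  echo "$sql"
else
  psql "$DATABASE_URL" -c "$sql"
fi
```

Run it once without `APPLY` to see what would be cancelled, then again with `APPLY=1`. That two-step habit is what keeps a latency incident from becoming a dropped-transaction incident.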

Gotcha: Assuming Prior Knowledge

The runbook says "drain the node." The on-call engineer joined two weeks ago and doesn't know what draining means. They Google it, find outdated advice, and run the wrong command.

# Bad:
Drain the node.

# Good:
Drain the node (evicts its pods gracefully and marks it unschedulable):
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Wait for all pods to be evicted (may take 1-5 minutes)
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
# Should return only DaemonSet pods
# When maintenance is done, re-enable scheduling:
$ kubectl uncordon <node-name>

Write for the person who has the least context. If a senior engineer reads it, they'll skim. If a junior engineer reads it, they'll be grateful.