Decision Tree: Should I Page Someone?¶
Category: Operational Decisions
Starting Question: "Something looks wrong — should I page the on-call?"
Estimated traversal: 2–5 minutes
Domains: incident-response, on-call, alerting, SRE
The Tree¶
Something looks wrong — should I page the on-call?
│
├── [Check 1] Is there active user impact right now?
│ │ (errors visible, service returning 5xx, latency > SLO, feature unavailable)
│ │
│ ├── YES — users are impacted
│ │ ├── [Check 2] What is the severity?
│ │ │ │
│ │ │ ├── P1: Total outage (100% error rate, service completely unreachable,
│ │ │ │ revenue-generating path down, SLA breach imminent)
│ │ │ │ └── → ✅ PAGE IMMEDIATELY — do not investigate first
│ │ │ │ Call on-call, post in #incidents, start bridge
│ │ │ │
│ │ │ ├── P2: Partial degradation (elevated errors 10–50%, core feature broken,
│ │ │ │ subset of users affected, no workaround)
│ │ │ │ ├── [Check 3] Is it business hours (Mon–Fri 09:00–18:00 local)?
│ │ │ │ │ ├── YES → ✅ PAGE TEAM LEAD directly (Slack + phone)
│ │ │ │ │ └── NO (off-hours / weekend)
│ │ │ │ │ ├── [Check 4] Is it getting worse? (error rate rising in last 5 min)
│ │ │ │ │ │ ├── YES → ✅ PAGE ON-CALL NOW
│ │ │ │ │ │ └── NO (stable, not worsening)
│ │ │ │ │ │ ├── [Check 6] Have you been working on it > 15 minutes?
│ │ │ │ │ │ │ ├── YES → ✅ PAGE ON-CALL
│ │ │ │ │ │ │ └── NO → attempt runbook for 15 min then page
│ │ │ │ └── (continue to Check 5)
│ │ │ │
│ │ │ └── P3: Degraded with workaround (slow responses, non-critical feature down,
│ │ │ error rate < 10%, users can complete task via alternate path)
│ │ │ ├── [Check 5] Is there a runbook you can execute alone?
│ │ │ │ ├── YES + you have required access
│ │ │ │ │ ├── [Check 6] Have you spent > 15 min without progress?
│ │ │ │ │ │ ├── YES → ✅ PAGE ON-CALL (you've tried, escalate)
│ │ │ │ │ │ └── NO → ✅ EXECUTE RUNBOOK, page if not resolved in 15 min
│ │ │ │ └── NO runbook or you lack access
│ │ │ │ ├── [Check 7] Is it business hours?
│ │ │ │ │ ├── YES → ✅ PAGE TEAM LEAD (Slack, no phone)
│ │ │ │ │ └── NO → ✅ PAGE AT START OF BUSINESS (add to handoff)
│ │
│ └── NO — no confirmed user impact
│ ├── [Check 8] What is the signal type?
│ │ │
│ │ ├── Symptom-based alert (latency spike, CPU high, queue depth rising)
│ │ │ ├── [Check 9] Is it trending toward SLO breach in < 30 minutes?
│ │ │ │ ├── YES (burn rate alert firing) → ✅ PAGE ON-CALL (preventive)
│ │ │ │ └── NO → ✅ DO NOT PAGE — document in ops log, monitor
│ │ │
│ │ ├── Anomaly / "looks weird" (metric looks unusual but no alert firing)
│ │ │ ├── [Check 10] Can you correlate with a recent change (deploy, config, cron)?
│ │ │ │ ├── YES + change is the cause → notify deployer, no page needed
│ │ │ │ └── NO correlation found
│ │ │ │ ├── Have you spent > 15 min investigating?
│ │ │ │ │ ├── YES → ✅ PAGE — unknown cause after real effort
│ │ │ │ │ └── NO → spend 15 min investigating first
│ │ │
│ │ └── Informational / low-priority alert (disk at 70%, cert expires in 60 days)
│ │ └── → ✅ DO NOT PAGE — create ticket, schedule work
│ │
│ └── [Check 11] Does acting alone risk a blast radius > your authority level?
│ ├── YES (would affect other teams, prod data, shared infra) → PAGE FIRST
│ └── NO (isolated, reversible, low blast radius) → act, then notify async
Node Details¶
Check 1: Confirm active user impact¶
Command/method:
# Real-time error rate (last 5 minutes)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Check if synthetic monitors are failing
curl -o /dev/null -sw "%{http_code} %{time_total}s\n" https://service.example.com/health
# Check error budget burn rate (if using SLO tooling; the recording-rule name below is sloth-generated and depends on your setup)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'slo:sli_error:ratio_rate5m{sloth_service="myapp"}'
# Recent user-visible errors in logs
kubectl logs -l app=myapp --since=5m | grep -E '"status":(5[0-9]{2})' | wc -l
Check 2: Determine severity (P1 / P2 / P3)¶
Command/method:
# P1 indicators
kubectl get deployment myapp -o jsonpath='{.status.readyReplicas}' # 0 = P1
curl -f https://service.example.com/api/critical-path || echo "P1 CONFIRMED"
# Error rate across time to assess P2 vs P3
kubectl exec -it prometheus-pod -- promtool query range \
  --start=$(date -d '10 minutes ago' +%s) --end=$(date +%s) --step=1m \
  http://localhost:9090 'rate(http_requests_total{status=~"5.."}[1m])'
# Check PagerDuty / OpsGenie for already-open incidents
pd incident list --statuses triggered,acknowledged
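Check 3: Is it business hours?¶
Command/method: A minimal shell sketch of the Mon–Fri 09:00–18:00 rule from the tree. The helper name is illustrative; arguments default to the current time so it can also be tested with explicit values:

```shell
# Business hours per the tree: Mon-Fri, 09:00-18:00 local time.
# dow: 1=Mon..7=Sun; hour: 0-23. Both default to "now".
is_business_hours() {
  local dow=${1:-$(date +%u)} hour=${2:-$(date +%H)}
  [ "$dow" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 18 ]
}

if is_business_hours; then
  echo "business hours: page team lead directly"
else
  echo "off-hours: continue to Check 4"
fi
```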
Check 4: Is it getting worse?¶
Command/method:
# Compare error rate now vs 5 minutes ago
now=$(date +%s)
five_min_ago=$((now - 300))
kubectl exec -it prometheus-pod -- sh -c "
  promtool query instant http://localhost:9090 'rate(http_requests_total{status=~\"5..\"}[1m])' &&
  echo '---' &&
  promtool query instant --time=$five_min_ago http://localhost:9090 'rate(http_requests_total{status=~\"5..\"}[1m])'
"
# Check if the set of affected services is expanding
kubectl get pods -A | grep -c CrashLoopBackOff
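The two samples from the comparison above can be reduced to a yes/no answer with a small helper; the 20% threshold here is an assumption, not a team standard:

```shell
# "Getting worse" = current error rate exceeds the 5-minutes-ago rate
# by more than 20% (threshold is an assumption; tune it to your service).
is_worsening() {
  awk -v now="$1" -v before="$2" 'BEGIN { exit !(now > before * 1.2) }'
}

if is_worsening 0.42 0.30; then echo "worsening: page on-call now"; fi
```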
Check 5: Runbook availability and your access level¶
Command/method:
# Search runbooks for this alert or service
ls /workspace/runbooks/ | grep -i myservice
find /workspace/runbooks/ -name "*.md" -exec grep -l "alert-name-or-symptom" {} \;
# Verify you have required access
kubectl auth can-i patch deployments/myapp -n production   # a rollout restart is a patch
aws iam simulate-principal-policy --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
  --action-names ec2:RebootInstances
What you're looking for: permission to run kubectl exec, kubectl scale, or whatever else the runbook requires.
Common pitfall: A runbook exists but was written for a different environment or an older version of the service. Validate the commands are still accurate before executing them on production.
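One way to guard against the stale-runbook pitfall is to tag each runbook with the environment it was written for and refuse to run it elsewhere. A sketch; the `context:` line convention and the runbook path are assumptions, not an existing standard:

```shell
# Refuse to execute a runbook whose declared target context doesn't match
# the cluster you are pointed at. Assumes runbooks carry a line like
# "context: prod-us-east" (a convention this sketch invents).
runbook_matches_context() {
  local runbook=$1 current=$2
  grep -qs "^context: ${current}$" "$runbook"
}

current=$(kubectl config current-context 2>/dev/null || echo unknown)
if ! runbook_matches_context /workspace/runbooks/myservice.md "$current"; then
  echo "runbook targets a different environment: validate commands before running"
fi
```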
Check 9: SLO burn rate assessment¶
Command/method:
# 1-hour burn rate (> 14.4x = page; at that rate the 30-day budget is exhausted in ~2 days)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  '(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)'
# Error budget remaining this period (sloth-generated recording rule; name depends on your setup)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'slo:period_error_budget_remaining:ratio{sloth_service="myapp"}'
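To translate a burn-rate reading into urgency, divide the window by the rate: a 30-day budget at burn rate B lasts 720/B hours. A sketch:

```shell
# Hours until the 30-day error budget is exhausted at a given burn rate.
# 30 days * 24 h = 720 budget-hours of allowance at burn rate 1.
hours_to_exhaustion() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", 720 / b }'
}

hours_to_exhaustion 14.4   # 50.0 hours, about 2 days: page
hours_to_exhaustion 1.0    # 720.0 hours: budget lasts the full window
```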
Check 11: Blast radius vs authority level¶
Command/method: Mentally map the change you are about to make.
What you're looking for: Will this affect other teams' services, shared databases, production data, or shared infrastructure? If yes, get sign-off before acting.
Common pitfall: "It's just a restart" on a shared service that 5 other teams depend on. Check service dependency maps before acting on shared infrastructure.
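The dependency-map check can be scripted if your team keeps one in a file; the `deps.txt` format below (one "consumer -> provider" edge per line) is an assumed convention:

```shell
# List every consumer of a service from a plain-text dependency map.
# Map format (assumed): one "consumer -> provider" edge per line.
consumers_of() {
  local service=$1 map=$2
  awk -v s="$service" '$2 == "->" && $3 == s { print $1 }' "$map"
}

# If anything comes back, the blast radius crosses team boundaries: page first.
# consumers_of shared-db /workspace/deps.txt   # path is a placeholder
```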
Terminal Actions¶
✅ Action: Page Immediately (P1)¶
Do:
# 1. Trigger PagerDuty alert manually if auto-alert hasn't fired
pd incident create \
--title "P1: [service-name] complete outage" \
--service-id SVC_ID \
--body "Error rate 100%. Started at $(date -u +%H:%M) UTC. No progress in N minutes."
# 2. Post in #incidents Slack channel
# Template: "P1 INCIDENT: [service] is down. Error rate: X%. Bridge: [link]. IC: [your name]"
# 3. Start incident bridge / war room
# 4. Open incident tracking doc and begin timeline
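Step 2's template can be scripted. The webhook URL is a placeholder; the payload shape is the standard Slack incoming-webhook `{"text": ...}` JSON:

```shell
# Build the #incidents announcement from the P1 template above.
build_p1_message() {
  local service=$1 error_rate=$2 bridge=$3 ic=$4
  printf 'P1 INCIDENT: %s is down. Error rate: %s. Bridge: %s. IC: %s' \
    "$service" "$error_rate" "$bridge" "$ic"
}

msg=$(build_p1_message myapp 100% "https://meet.example.com/bridge" alice)
# Post via a Slack incoming webhook (SLACK_WEBHOOK_URL is a placeholder):
# curl -sf -X POST -H 'Content-type: application/json' \
#      --data "{\"text\": \"${msg}\"}" "$SLACK_WEBHOOK_URL"
echo "$msg"
```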
✅ Action: Do Not Page — Informational / Non-Urgent¶
Do:
# 1. Document in ops log
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) [OBS] myapp: cpu spike to 80% for 3 min, resolved, correlated with batch job" \
>> /var/log/ops/ops-log.txt
# 2. Create a ticket if action is needed
gh issue create --repo org/myapp \
--title "Disk at 70% on prod-worker-1 — plan capacity increase" \
--label "operational,non-urgent"
# 3. Add to handoff notes for next shift
✅ Action: Page at Start of Business (P3 stable, off-hours)¶
Do:
# 1. Add to shift handoff document
# 2. Set a Slack reminder for 09:00 local
/remind me "Follow up: myapp P3 degradation from [time], runbook attempted, not resolved" at 9am
# 3. If any risk of worsening overnight, set up a monitoring alert
# 4. Document current state and what you tried
✅ Action: Execute Runbook Then Page If Not Resolved in 15 Minutes¶
Do:
# 1. Start a timer
start_time=$(date +%s)
# 2. Follow runbook steps exactly
# 3. After each step, check if issue resolved
# 4. If 15 minutes pass without resolution:
current_time=$(date +%s)
elapsed=$((current_time - start_time))
if [ $elapsed -gt 900 ]; then
echo "15 minutes elapsed without resolution — page on-call"
fi
# 5. When you page, include: what you tried, what changed, current state
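The fragments above can be folded into one timeboxed loop; `issue_resolved` stands in for whatever health check confirms recovery (a placeholder, not a real command):

```shell
# Re-run a health check until it passes or the timebox expires.
# Usage: run_with_timebox <seconds> <command...>
run_with_timebox() {
  local timebox=$1; shift
  local deadline=$(( $(date +%s) + timebox ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timebox exceeded: page on-call with what you tried and current state"
      return 1
    fi
    sleep 5
  done
  echo "resolved"
}

# run_with_timebox 900 issue_resolved   # issue_resolved is your own check
```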
⚠️ Warning: Blast Radius Exceeds Your Authority¶
When: The fix you are considering would affect shared infrastructure, other teams' services, or production data that you don't own.
Risk: Making a well-intentioned change that causes a secondary outage in a service you didn't realize was affected.
Mitigation: Page the owner of the affected system. Describe what you want to do and why. Get explicit sign-off before acting.
Edge Cases¶
- You are the on-call: The question is whether to escalate to team lead or manager. Apply the same severity framework — P1 always escalates up the chain.
- Alert fired but you believe it is a false positive: Still investigate before dismissing. "I think it's a false positive" has caused many real incidents to go unnoticed. Prove it is a false positive before silencing it.
- Off-hours P2 that you successfully resolve: Still send a post-incident message in the morning. "I handled a P2 at 2am" should be visible to the team even if no one was paged.
- Multiple alerts firing simultaneously: Do not treat them as separate incidents without checking if they share a common cause. A cascading failure often fires 10 alerts at once.
- You are new to the on-call rotation: When in doubt, page. The cost of waking someone who wasn't needed is far lower than the cost of a missed P1. Your team should make this explicit in onboarding.
Cross-References¶
- Topic Packs: incident-response, on-call
- Runbooks: incident-response.md, escalation-policy.md
- Related trees: rollback-or-fix-forward.md, scale-or-optimize.md