Decision Tree: Should I Page Someone?¶
Category: Operational Decisions
Starting Question: "Something looks wrong — should I page the on-call?"
Estimated traversal: 2–5 minutes
Domains: incident-response, on-call, alerting, SRE
The Tree¶
Something looks wrong — should I page the on-call?
│
├── [Check 1] Is there active user impact right now?
│ │ (errors visible, service returning 5xx, latency > SLO, feature unavailable)
│ │
│ ├── YES — users are impacted
│ │ ├── [Check 2] What is the severity?
│ │ │ │
│ │ │ ├── P1: Total outage (100% error rate, service completely unreachable,
│ │ │ │ revenue-generating path down, SLA breach imminent)
│ │ │ │ └── → ✅ PAGE IMMEDIATELY — do not investigate first
│ │ │ │ Call on-call, post in #incidents, start bridge
│ │ │ │
│ │ │ ├── P2: Partial degradation (elevated errors 10–50%, core feature broken,
│ │ │ │ subset of users affected, no workaround)
│ │ │ │ ├── [Check 3] Is it business hours (Mon–Fri 09:00–18:00 local)?
│ │ │ │ │ ├── YES → ✅ PAGE TEAM LEAD directly (Slack + phone)
│ │ │ │ │ └── NO (off-hours / weekend)
│ │ │ │ │ ├── [Check 4] Is it getting worse? (error rate rising in last 5 min)
│ │ │ │ │ │ ├── YES → ✅ PAGE ON-CALL NOW
│ │ │ │ │ │ └── NO (stable, not worsening)
│ │ │ │ │ │ ├── [Check 6] Have you been working on it > 15 minutes?
│ │ │ │ │ │ │ ├── YES → ✅ PAGE ON-CALL
│ │ │ │ │ │ │ └── NO → attempt runbook for 15 min then page
│ │ │ │ └── (continue to Check 5)
│ │ │ │
│ │ │ └── P3: Degraded with workaround (slow responses, non-critical feature down,
│ │ │ error rate < 10%, users can complete task via alternate path)
│ │ │ ├── [Check 5] Is there a runbook you can execute alone?
│ │ │ │ ├── YES + you have required access
│ │ │ │ │ ├── [Check 6] Have you spent > 15 min without progress?
│ │ │ │ │ │ ├── YES → ✅ PAGE ON-CALL (you've tried, escalate)
│ │ │ │ │ │ └── NO → ✅ EXECUTE RUNBOOK, page if not resolved in 15 min
│ │ │ │ └── NO runbook or you lack access
│ │ │ │ ├── [Check 7] Is it business hours?
│ │ │ │ │ ├── YES → ✅ PAGE TEAM LEAD (Slack, no phone)
│ │ │ │ │ └── NO → ✅ PAGE AT START OF BUSINESS (add to handoff)
│ │
│ └── NO — no confirmed user impact
│ ├── [Check 8] What is the signal type?
│ │ │
│ │ ├── Symptom-based alert (latency spike, CPU high, queue depth rising)
│ │ │ ├── [Check 9] Is it trending toward SLO breach in < 30 minutes?
│ │ │ │ ├── YES (burn rate alert firing) → ✅ PAGE ON-CALL (preventive)
│ │ │ │ └── NO → ✅ DO NOT PAGE — document in ops log, monitor
│ │ │
│ │ ├── Anomaly / "looks weird" (metric looks unusual but no alert firing)
│ │ │ ├── [Check 10] Can you correlate with a recent change (deploy, config, cron)?
│ │ │ │ ├── YES + change is the cause → notify deployer, no page needed
│ │ │ │ └── NO correlation found
│ │ │ │ ├── Have you spent > 15 min investigating?
│ │ │ │ │ ├── YES → ✅ PAGE — unknown cause after real effort
│ │ │ │ │ └── NO → spend 15 min investigating first
│ │ │
│ │ └── Informational / low-priority alert (disk at 70%, cert expires in 60 days)
│ │ └── → ✅ DO NOT PAGE — create ticket, schedule work
│ │
│ └── [Check 11] Does acting alone risk a blast radius > your authority level?
│ ├── YES (would affect other teams, prod data, shared infra) → PAGE FIRST
│ └── NO (isolated, reversible, low blast radius) → act, then notify async
Node Details¶
Check 1: Confirm active user impact¶
Command/method:
# Real-time error rate (last 5 minutes)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Check if synthetic monitors are failing
curl -o /dev/null -sw "%{http_code} %{time_total}s\n" https://service.example.com/health
# Check error budget burn rate (if using SLO tooling; the recording-rule name below is sloth-generated and depends on your setup)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'slo:sli_error:ratio_rate5m{sloth_service="myapp"}'
# Recent user-visible errors in logs
kubectl logs -l app=myapp --since=5m | grep -E '"status":(5[0-9]{2})' | wc -l
Check 2: Determine severity (P1 / P2 / P3)¶
Command/method:
# P1 indicators
kubectl get deployment myapp -o jsonpath='{.status.readyReplicas}' # 0 = P1
curl -f https://service.example.com/api/critical-path || echo "P1 CONFIRMED"
# Error rate across time to assess P2 vs P3
kubectl exec -it prometheus-pod -- promtool query range \
  --start=$(date -d '10 minutes ago' +%s) --end=$(date +%s) --step=1m \
  http://localhost:9090 'rate(http_requests_total{status=~"5.."}[1m])'
# Check PagerDuty / OpsGenie for already-open incidents
pd incident list --statuses triggered,acknowledged
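Check 3: Is it business hours?¶
Command/method: A minimal shell sketch of the Mon–Fri 09:00–18:00 rule from the tree. The helper name is illustrative; arguments default to the current time so it can also be tested with explicit values:

```shell
# Business hours per the tree: Mon-Fri, 09:00-18:00 local time.
# dow: 1=Mon..7=Sun; hour: 0-23. Both default to "now".
is_business_hours() {
  local dow=${1:-$(date +%u)} hour=${2:-$(date +%H)}
  [ "$dow" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 18 ]
}

if is_business_hours; then
  echo "business hours: page team lead directly"
else
  echo "off-hours: continue to Check 4"
fi
```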
Check 4: Is it getting worse?¶
Command/method:
# Compare error rate now vs 5 minutes ago
now=$(date +%s)
five_min_ago=$((now - 300))
kubectl exec -it prometheus-pod -- sh -c "
  promtool query instant http://localhost:9090 'rate(http_requests_total{status=~\"5..\"}[1m])' &&
  echo '---' &&
  promtool query instant --time=$five_min_ago http://localhost:9090 'rate(http_requests_total{status=~\"5..\"}[1m])'
"
# Check if the set of affected services is expanding
kubectl get pods -A | grep -c CrashLoopBackOff
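The two samples from the comparison above can be reduced to a yes/no answer with a small helper; the 20% threshold here is an assumption, not a team standard:

```shell
# "Getting worse" = current error rate exceeds the 5-minutes-ago rate
# by more than 20% (threshold is an assumption; tune it to your service).
is_worsening() {
  awk -v now="$1" -v before="$2" 'BEGIN { exit !(now > before * 1.2) }'
}

if is_worsening 0.42 0.30; then echo "worsening: page on-call now"; fi
```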
Check 5: Runbook availability and your access level¶
Command/method:
# Search runbooks for this alert or service
ls /workspace/runbooks/ | grep -i myservice
find /workspace/runbooks/ -name "*.md" -exec grep -l "alert-name-or-symptom" {} \;
# Verify you have required access
kubectl auth can-i patch deployments/myapp -n production   # a rollout restart is a patch
aws iam simulate-principal-policy --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
  --action-names ec2:RebootInstances
What you're looking for: permission to run kubectl exec, kubectl scale, or whatever else the runbook requires.
Common pitfall: A runbook exists but was written for a different environment or an older version of the service. Validate the commands are still accurate before executing them on production.
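One way to guard against the stale-runbook pitfall is to tag each runbook with the environment it was written for and refuse to run it elsewhere. A sketch; the `context:` line convention and the runbook path are assumptions, not an existing standard:

```shell
# Refuse to execute a runbook whose declared target context doesn't match
# the cluster you are pointed at. Assumes runbooks carry a line like
# "context: prod-us-east" (a convention this sketch invents).
runbook_matches_context() {
  local runbook=$1 current=$2
  grep -qs "^context: ${current}$" "$runbook"
}

current=$(kubectl config current-context 2>/dev/null || echo unknown)
if ! runbook_matches_context /workspace/runbooks/myservice.md "$current"; then
  echo "runbook targets a different environment: validate commands before running"
fi
```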
Check 9: SLO burn rate assessment¶
Command/method:
# 1-hour burn rate (> 14.4x = page; at that rate the 30-day budget is exhausted in ~2 days)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  '(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)'
# Error budget remaining this period (sloth-generated recording rule; name depends on your setup)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'slo:period_error_budget_remaining:ratio{sloth_service="myapp"}'
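To translate a burn-rate reading into urgency, divide the window by the rate: a 30-day budget at burn rate B lasts 720/B hours. A sketch:

```shell
# Hours until the 30-day error budget is exhausted at a given burn rate.
# 30 days * 24 h = 720 budget-hours of allowance at burn rate 1.
hours_to_exhaustion() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", 720 / b }'
}

hours_to_exhaustion 14.4   # 50.0 hours, about 2 days: page
hours_to_exhaustion 1.0    # 720.0 hours: budget lasts the full window
```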
Check 11: Blast radius vs authority level¶
Command/method: Mentally map the change you are about to make.
What you're looking for: Will this affect other teams' services, shared databases, production data, or shared infrastructure? If yes, get sign-off before acting.
Common pitfall: "It's just a restart" on a shared service that 5 other teams depend on. Check service dependency maps before acting on shared infrastructure.
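The dependency-map check can be scripted if your team keeps one in a file; the `deps.txt` format below (one "consumer -> provider" edge per line) is an assumed convention:

```shell
# List every consumer of a service from a plain-text dependency map.
# Map format (assumed): one "consumer -> provider" edge per line.
consumers_of() {
  local service=$1 map=$2
  awk -v s="$service" '$2 == "->" && $3 == s { print $1 }' "$map"
}

# If anything comes back, the blast radius crosses team boundaries: page first.
# consumers_of shared-db /workspace/deps.txt   # path is a placeholder
```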
Terminal Actions¶
✅ Action: Page Immediately (P1)¶
Do:
# 1. Trigger PagerDuty alert manually if auto-alert hasn't fired
pd incident create \
--title "P1: [service-name] complete outage" \
--service-id SVC_ID \
--body "Error rate 100%. Started at $(date -u +%H:%M) UTC. No progress in N minutes."
# 2. Post in #incidents Slack channel
# Template: "P1 INCIDENT: [service] is down. Error rate: X%. Bridge: [link]. IC: [your name]"
# 3. Start incident bridge / war room
# 4. Open incident tracking doc and begin timeline
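Step 2's template can be scripted. The webhook URL is a placeholder; the payload shape is the standard Slack incoming-webhook `{"text": ...}` JSON:

```shell
# Build the #incidents announcement from the P1 template above.
build_p1_message() {
  local service=$1 error_rate=$2 bridge=$3 ic=$4
  printf 'P1 INCIDENT: %s is down. Error rate: %s. Bridge: %s. IC: %s' \
    "$service" "$error_rate" "$bridge" "$ic"
}

msg=$(build_p1_message myapp 100% "https://meet.example.com/bridge" alice)
# Post via a Slack incoming webhook (SLACK_WEBHOOK_URL is a placeholder):
# curl -sf -X POST -H 'Content-type: application/json' \
#      --data "{\"text\": \"${msg}\"}" "$SLACK_WEBHOOK_URL"
echo "$msg"
```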
✅ Action: Do Not Page — Informational / Non-Urgent¶
Do:
# 1. Document in ops log
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) [OBS] myapp: cpu spike to 80% for 3 min, resolved, correlated with batch job" \
>> /var/log/ops/ops-log.txt
# 2. Create a ticket if action is needed
gh issue create --repo org/myapp \
--title "Disk at 70% on prod-worker-1 — plan capacity increase" \
--label "operational,non-urgent"
# 3. Add to handoff notes for next shift
✅ Action: Page at Start of Business (P3 stable, off-hours)¶
Do:
# 1. Add to shift handoff document
# 2. Set a Slack reminder for 09:00 local
/remind me "Follow up: myapp P3 degradation from [time], runbook attempted, not resolved" at 9am
# 3. If any risk of worsening overnight, set up a monitoring alert
# 4. Document current state and what you tried
✅ Action: Execute Runbook Then Page If Not Resolved in 15 Minutes¶
Do:
# 1. Start a timer
start_time=$(date +%s)
# 2. Follow runbook steps exactly
# 3. After each step, check if issue resolved
# 4. If 15 minutes pass without resolution:
current_time=$(date +%s)
elapsed=$((current_time - start_time))
if [ $elapsed -gt 900 ]; then
echo "15 minutes elapsed without resolution — page on-call"
fi
# 5. When you page, include: what you tried, what changed, current state
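The fragments above can be folded into one timeboxed loop; `issue_resolved` stands in for whatever health check confirms recovery (a placeholder, not a real command):

```shell
# Re-run a health check until it passes or the timebox expires.
# Usage: run_with_timebox <seconds> <command...>
run_with_timebox() {
  local timebox=$1; shift
  local deadline=$(( $(date +%s) + timebox ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timebox exceeded: page on-call with what you tried and current state"
      return 1
    fi
    sleep 5
  done
  echo "resolved"
}

# run_with_timebox 900 issue_resolved   # issue_resolved is your own check
```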
⚠️ Warning: Blast Radius Exceeds Your Authority¶
When: The fix you are considering would affect shared infrastructure, other teams' services, or production data that you don't own.
Risk: Making a well-intentioned change that causes a secondary outage in a service you didn't realize was affected.
Mitigation: Page the owner of the affected system. Describe what you want to do and why. Get explicit sign-off before acting.
Edge Cases¶
- You are the on-call: The question is whether to escalate to team lead or manager. Apply the same severity framework — P1 always escalates up the chain.
- Alert fired but you believe it is a false positive: Still investigate before dismissing. "I think it's a false positive" has caused many real incidents to go unnoticed. Prove it is a false positive before silencing it.
- Off-hours P2 that you successfully resolve: Still send a post-incident message in the morning. "I handled a P2 at 2am" should be visible to the team even if no one was paged.
- Multiple alerts firing simultaneously: Do not treat them as separate incidents without checking if they share a common cause. A cascading failure often fires 10 alerts at once.
- You are new to the on-call rotation: When in doubt, page. The cost of waking someone who wasn't needed is far lower than the cost of a missed P1. Your team should make this explicit in onboarding.
Cross-References¶
- Topic Packs: incident-response, on-call
- Runbooks: incident-response.md, escalation-policy.md
- Related trees: rollback-or-fix-forward.md, scale-or-optimize.md