Thinking Out Loud: Incident Triage¶
A senior SRE's internal monologue while working through a real incident triage. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
PagerDuty fires: "Checkout Success Rate < 95% for 5 minutes." It's 10:32 AM on a Tuesday — peak traffic time. The checkout success rate is normally 99.7%. I'm the Incident Commander on call.
The Monologue¶
Checkout success rate dropped to below 95%. That's a Sev-1. Let me start the clock and follow the process. First thing: declare the incident and start the communication channel.
# Create incident channel (our tooling does this automatically, but let me verify)
curl -s -X POST "https://slack.com/api/conversations.create" \
-H "Authorization: Bearer $SLACK_TOKEN" \
-d "name=inc-$(date +%Y%m%d)-checkout-failure"
Incident channel created. Let me post the initial status. As IC, my job is NOT to debug — it's to coordinate. But since I'm the first responder, let me do initial triage to assign the right people.
Step 1: What is the user impact? How bad is it and who is affected?
curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(checkout_total{status="failure"}[5m]))/sum(rate(checkout_total[5m]))' | jq '.data.result[0].value[1]'
Failure rate: 8.3%. At our current traffic, that's roughly 50 failed checkouts per minute, and at an average order value of $80, we're losing about $4,000 per minute. At that burn rate we cross $60,000 in lost revenue within 15 minutes, so that's my window before this becomes a major financial event.
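The napkin math above can be scripted so it's consistent from incident to incident. A minimal sketch; the total checkout rate here is an assumption for illustration, and in practice it comes from the same Prometheus query without the status filter:

```shell
# Hypothetical traffic figures for illustration; in the real incident these
# come from the dashboards queried above.
fail_rate=8.3          # percent of checkouts failing
checkouts_per_min=600  # assumed total checkout attempts per minute
aov=80                 # average order value, dollars

# Failed checkouts per minute, rounded to the nearest whole order
failed_per_min=$(awk -v r="$fail_rate" -v c="$checkouts_per_min" \
  'BEGIN { printf "%.0f", c * r / 100 }')
# Dollar loss per minute
loss_per_min=$(awk -v f="$failed_per_min" -v a="$aov" 'BEGIN { print f * a }')
echo "${failed_per_min} failed checkouts/min, ~\$${loss_per_min}/min lost"
```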
Mental Model: Impact-First Triage¶
Before diagnosing root cause, quantify user impact: what percentage of users are affected, what is the financial impact per minute, and is the impact growing or stable? This sets the severity level, determines who to page, and establishes the urgency. A 2% error rate that's stable is different from a 2% error rate that's climbing 1% per minute.
Step 2: Is the error rate stable, growing, or recovering?
curl -s 'http://prometheus:9090/api/v1/query_range?query=sum(rate(checkout_total{status="failure"}[1m]))/sum(rate(checkout_total[1m]))&start='"$(date -d '15 min ago' +%s)"'&end='"$(date +%s)"'&step=60' | jq '.data.result[0].values[-5:]'
5%, 6%, 7%, 8%, 8.3%. It's climbing slowly. Not a cliff, but not stable. This is a degradation, not a binary outage. That means some component is partially failing.
Step 3: Where in the checkout flow is it failing? The checkout flow is: cart -> payment validation -> payment processing -> order creation -> confirmation.
curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(checkout_step_failures[5m])) by (step)' | jq '.data.result[] | {step: .metric.step, rate: .value[1]}'
Payment processing step has the highest failure rate. The other steps are normal. Let me check the payment service specifically.
kubectl get pods -n payments -l app=payment-processor -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,AGE:.metadata.creationTimestamp'
All pods running, no restarts. Let me check the payment service's own error metrics.
curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(payment_requests_total{status=~"5.."}[5m])) by (payment_provider)' | jq '.data.result[] | {provider: .metric.payment_provider, error_rate: .value[1]}'
Stripe: 0.1% errors (normal). PayPal: 42% errors. The PayPal integration is failing. Let me check the PayPal-specific errors.
Mental Model: Narrow the Blast Radius by Dimension¶
When you have a partial failure (some requests fail, some succeed), the fastest diagnostic is to break down the failure by every available dimension: region, provider, endpoint, user segment, feature flag. The dimension that shows a stark split (e.g., "Stripe is fine, PayPal is broken") immediately points to the root cause.
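The stark split can also be surfaced mechanically rather than by eyeball. A sketch using jq over a canned instant-query result (the payload below is fabricated to mirror the Stripe/PayPal numbers above); pointing the same filter at live query output works the same way:

```shell
# Fabricated instant-query payload mirroring the by-provider numbers above
result='{"data":{"result":[
  {"metric":{"payment_provider":"stripe"},"value":[0,"0.001"]},
  {"metric":{"payment_provider":"paypal"},"value":[0,"0.42"]}]}}'

# Pick the label value with the starkest failure rate
top=$(echo "$result" | jq -r '
  .data.result
  | max_by(.value[1] | tonumber)
  | "\(.metric.payment_provider) fails at \(.value[1])"')
echo "$top"
```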
"PayPal API returned 503 Service Unavailable." PayPal's API is returning 503s. This is an external dependency failure — not our code, not our infrastructure. PayPal is having an outage.
Let me confirm by checking PayPal's status page.
curl -s https://www.paypal-status.com/api/v3/summary | jq '.components[] | select(.name | test("Checkout"; "i")) | {name: .name, status: .status}'
PayPal status: "Degraded Performance" on their Checkout component. It's them, not us. But our users are still impacted. What can I do?
Options:
1. Do nothing and wait for PayPal to recover — bad; we keep losing revenue
2. Temporarily disable PayPal as a payment option and show users only Stripe — fast; reduces error rate
3. Add a retry with exponential backoff for PayPal failures — might help if PayPal is intermittent
Option 2 is the fastest impact reduction. Our checkout flow has a feature flag for payment providers.
kubectl patch configmap payment-config -n payments --type merge -p '{"data":{"PAYPAL_ENABLED":"false"}}'
Wait — before restarting anything: the pods only see the new ConfigMap value when they re-read it. Let me check whether the app reads the ConfigMap on every request or only at startup.
Ah, it reads at startup. The restart will pick up the change. Let me also check if this will cause any issues — are there in-flight PayPal transactions?
kubectl exec payment-processor-5d8f7-k2x9m -n payments -- curl -s localhost:8080/metrics | grep 'payment_in_flight.*paypal'
12 in-flight PayPal transactions. Deleting the pods outright would drop those, so I'll use a rolling restart, which replaces pods gradually and honors the termination grace period so in-flight transactions can drain.
kubectl rollout restart deployment/payment-processor -n payments
kubectl rollout status deployment/payment-processor -n payments
Rolling restart complete. New pods don't have PayPal enabled. Let me check the error rate.
curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(checkout_total{status="failure"}[1m]))/sum(rate(checkout_total[1m]))' | jq '.data.result[0].value[1]'
Error rate: 0.8%. Down from 8.3%. Checkout is healthy again. Users see only Stripe as a payment option, which is acceptable during the PayPal outage.
Now let me set up monitoring to re-enable PayPal when they recover. I'll check their status every 5 minutes.
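A sketch of that check, factored so the status parsing is testable offline. The live call would pipe in the same status-page endpoint queried earlier, and running this from a cron or systemd timer every 5 minutes beats a foreground loop that dies when my laptop sleeps:

```shell
checkout_component_status() {
  # stdin: status-page summary JSON; prints the Checkout component status
  jq -r '.components[] | select(.name | test("Checkout"; "i")) | .status' \
    | head -1
}

is_operational() {
  [ "$1" = "Operational" ]
}

# Canned payload for illustration; the live check pipes in:
#   curl -s https://www.paypal-status.com/api/v3/summary
sample='{"components":[{"name":"Checkout","status":"Degraded Performance"}]}'
status=$(echo "$sample" | checkout_component_status)
if is_operational "$status"; then
  echo "PayPal recovered, safe to consider re-enabling"
else
  echo "Still degraded: $status"
fi
```

Re-enabling PayPal stays a human decision during the incident; the timer only notifies.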
As IC, let me update the incident channel with the timeline:
- 10:27: PayPal began returning 503s
- 10:32: Alert fired for checkout success rate < 95%
- 10:38: Root cause identified — PayPal API outage
- 10:42: PayPal disabled as payment option, error rate recovered
- ACTION: Re-enable PayPal when their status page shows "Operational"
One more thing — I need to add a circuit breaker for PayPal in the payment service. If PayPal returns >5% errors, the circuit breaker should automatically disable it instead of requiring manual intervention. That's a post-incident action item.
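A minimal sketch of that trip condition, assuming the breaker lives in a watchdog that flips the same PAYPAL_ENABLED flag we toggled by hand. A real implementation belongs inside the payment service itself, with half-open probing before re-enabling; this only shows the decision logic, and the kubectl action is commented out to keep the sketch side-effect free:

```shell
error_rate() {
  # stdin: Prometheus instant-query JSON; prints the error ratio (0..1)
  jq -r '.data.result[0].value[1]'
}

breaker_should_open() {
  # $1 = observed error ratio, $2 = threshold (0.05 = the 5% from above)
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}

# Canned payload mirroring PayPal's 42% error rate from the earlier query
sample='{"data":{"result":[{"value":[0,"0.42"]}]}}'
rate=$(echo "$sample" | error_rate)
if breaker_should_open "$rate" 0.05; then
  echo "breaker OPEN at rate ${rate}: would disable provider"
  # The automated action mirrors the manual mitigation:
  # kubectl patch configmap payment-config -n payments \
  #   --type merge -p '{"data":{"PAYPAL_ENABLED":"false"}}'
fi
```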
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Start debugging the payment service code | Quantify user impact first: error rate, financial impact per minute, trend direction | Impact quantification sets the severity and urgency — you can't triage without it |
| Look at aggregate error rates | Break down failures by dimension (payment provider) to find the split | Dimensional analysis turns "8% of checkouts fail" into "PayPal is down" in one query |
| Wait for PayPal to fix their outage | Disable the failing integration and fall back to the working provider | You can't fix a third-party outage, but you can reduce user impact by disabling the broken path |
| Fix the immediate issue and stop | Add circuit breaker as a post-incident action to automate the response next time | Manual intervention at 10 AM is possible; at 3 AM, the circuit breaker handles it automatically |
Key Heuristics Used¶
- Impact-First Triage: Quantify user impact (percentage, financial, trend) before debugging. This sets the response urgency and determines who needs to be involved.
- Dimensional Breakdown: When failures are partial, break down by every dimension (provider, region, endpoint). The dimension with a stark split reveals the root cause.
- Degrade Gracefully, Don't Wait: When a dependency fails, disable it and fall back to alternatives rather than waiting for it to recover. Users prefer a reduced-feature checkout over a broken one.
Cross-References¶
- Primer — Incident management framework, severity levels, and the IC role
- Street Ops — Incident triage commands, impact quantification queries, and communication templates
- Footguns — Debugging instead of mitigating, not quantifying impact, and no circuit breakers for external dependencies