Thinking Out Loud: Incident Triage

A senior SRE's internal monologue while working through a real incident triage. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

PagerDuty fires: "Checkout Success Rate < 95% for 5 minutes." It's 10:32 AM on a Tuesday — peak traffic time. The checkout success rate is normally 99.7%. I'm the Incident Commander on call.

The Monologue

Checkout success rate dropped to below 95%. That's a Sev-1. Let me start the clock and follow the process. First thing: declare the incident and start the communication channel.

# Create incident channel (our tooling does this automatically, but let me verify)
curl -s -X POST "https://slack.com/api/conversations.create" \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -d "name=inc-$(date +%Y%m%d)-checkout-failure"

Incident channel created. Let me post the initial status. As IC, my job is NOT to debug — it's to coordinate. But since I'm the first responder, let me do initial triage to assign the right people.
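The status post itself isn't shown above; a minimal sketch using Slack's chat.postMessage method (the message text and update cadence here are placeholders, not our actual template):

```shell
# Post the initial IC status into the incident channel.
# chat.postMessage is Slack's real API method; the text below is a made-up example.
CHANNEL="inc-$(date +%Y%m%d)-checkout-failure"
TEXT="Sev-1 declared: checkout success rate < 95%. Initial triage in progress; updates every 10 min."
if [ -n "${SLACK_TOKEN:-}" ]; then
  curl -s -X POST "https://slack.com/api/chat.postMessage" \
    -H "Authorization: Bearer $SLACK_TOKEN" \
    -d "channel=$CHANNEL" \
    --data-urlencode "text=$TEXT" >/dev/null || echo "status post failed" >&2
fi
echo "$CHANNEL"
```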

Step 1: What is the user impact? How bad is it and who is affected?

curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(checkout_total{status="failure"}[5m]))/sum(rate(checkout_total[5m]))' | jq '.data.result[0].value[1]'

Failure rate: 8.3%. At our current traffic, that's roughly 50 failed checkouts per minute, and at an average order value of $80 we're losing about $4,000 per minute in revenue. That gives me maybe 15 minutes before this becomes a significant financial event.
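The back-of-envelope math is worth a quick script so the numbers refresh as the failure rate moves. A sketch, assuming ~600 checkouts/minute (the traffic level implied by 8.3% being ~50 failures/minute) and the $80 average order value:

```shell
# Revenue-at-risk estimate. Traffic volume and AOV are assumptions taken from
# the monologue: ~600 checkouts/min, $80 average order value, 8.3% failure rate.
FAILURE_RATE=0.083
CHECKOUTS_PER_MIN=600
AOV=80
FAILED_PER_MIN=$(awk -v r="$FAILURE_RATE" -v c="$CHECKOUTS_PER_MIN" 'BEGIN { printf "%.0f", r * c }')
LOSS_PER_MIN=$(awk -v f="$FAILED_PER_MIN" -v a="$AOV" 'BEGIN { printf "%.0f", f * a }')
echo "~$FAILED_PER_MIN failed checkouts/min, ~\$$LOSS_PER_MIN/min at risk"
```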

Mental Model: Impact-First Triage

Before diagnosing root cause, quantify user impact: what percentage of users are affected, what is the financial impact per minute, and is the impact growing or stable? This sets the severity level, determines who to page, and establishes the urgency. A 2% error rate that's stable is different from a 2% error rate that's climbing 1% per minute.

Step 2: Is the error rate stable, growing, or recovering?

curl -s 'http://prometheus:9090/api/v1/query_range?query=sum(rate(checkout_total{status="failure"}[1m]))/sum(rate(checkout_total[1m]))&start='"$(date -d '15 min ago' +%s)"'&end='"$(date +%s)"'&step=60' | jq '.data.result[0].values[-5:]'

5%, 6%, 7%, 8%, 8.3%. It's climbing slowly. Not a cliff, but not stable. This is a degradation, not a binary outage. That means some component is partially failing.

Step 3: Where in the checkout flow is it failing? The checkout flow is: cart -> payment validation -> payment processing -> order creation -> confirmation.

curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(checkout_step_failures[5m])) by (step)' | jq '.data.result[] | {step: .metric.step, rate: .value[1]}'

Payment processing step has the highest failure rate. The other steps are normal. Let me check the payment service specifically.

kubectl get pods -n payments -l app=payment-processor -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,CREATED:.metadata.creationTimestamp'

All pods running, no restarts. Let me check the payment service's own error metrics.

curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(payment_requests_total{status=~"5.."}[5m])) by (payment_provider)' | jq '.data.result[] | {provider: .metric.payment_provider, error_rate: .value[1]}'

Stripe: 0.1% errors (normal). PayPal: 42% errors. The PayPal integration is failing. Let me check the PayPal-specific errors.

Mental Model: Narrow the Blast Radius by Dimension

When you have a partial failure (some requests fail, some succeed), the fastest diagnostic is to break down the failure by every available dimension: region, provider, endpoint, user segment, feature flag. The dimension that shows a stark split (e.g., "Stripe is fine, PayPal is broken") immediately points to the root cause.

kubectl logs deployment/payment-processor -n payments --tail=30 | grep -i paypal | grep -i error

"PayPal API returned 503 Service Unavailable." PayPal's API is returning 503s. This is an external dependency failure — not our code, not our infrastructure. PayPal is having an outage.

Let me confirm by checking PayPal's status page.

curl -s https://www.paypal-status.com/api/v3/summary | jq '.components[] | select(.name | test("Checkout"; "i")) | {name: .name, status: .status}'

PayPal status: "Degraded Performance" on their Checkout component. It's them, not us. But our users are still impacted. What can I do?

Options:

  1. Do nothing and wait for PayPal to recover — bad, we keep losing revenue.
  2. Temporarily disable PayPal as a payment option and show users only Stripe — fast, reduces error rate.
  3. Add a retry with exponential backoff for PayPal failures — might help if PayPal is intermittent.

Option 2 is the fastest impact reduction. Our checkout flow has a feature flag for payment providers.

kubectl patch configmap payment-config -n payments --type merge -p '{"data":{"PAYPAL_ENABLED":"false"}}'

Wait — before restarting anything: a ConfigMap change only takes effect once the pods pick it up. Let me check whether the app reads the ConfigMap on every request or only at startup.
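A quick way to tell is to look at how the Deployment consumes the ConfigMap: envFrom means the flag arrives as an environment variable, which pods read only at startup. A sketch (deployment and ConfigMap names as above; it falls back to a fabricated example manifest when no cluster is reachable, so the jq pipeline stays illustrative):

```shell
# If the container pulls payment-config in via envFrom, the flag is an env var,
# and env vars are only read at pod startup, hence the restart.
# The fallback manifest below is a fabricated example, not real cluster state.
MANIFEST=$(kubectl get deployment payment-processor -n payments -o json 2>/dev/null) \
  || MANIFEST='{"spec":{"template":{"spec":{"containers":[{"name":"payment-processor","envFrom":[{"configMapRef":{"name":"payment-config"}}]}]}}}}'
echo "$MANIFEST" | jq -r '.spec.template.spec.containers[0].envFrom[0].configMapRef.name // "not env-injected"'
```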

Ah, it reads at startup. The restart will pick up the change. Let me also check if this will cause any issues — are there in-flight PayPal transactions?

kubectl exec -it payment-processor-5d8f7-k2x9m -n payments -- curl -s localhost:8080/metrics | grep 'payment_in_flight.*paypal'

12 in-flight PayPal transactions. A hard restart would interrupt all of them at once. A rolling restart replaces pods one at a time, so graceful termination can drain each pod's in-flight requests before it goes away.
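Before kicking it off, it's worth confirming the pods actually get enough time on shutdown to drain those in-flight calls. A hedged check (names as above; 30 seconds is the Kubernetes default when the field is unset or no cluster is reachable):

```shell
# Termination grace period must exceed the longest in-flight PayPal call.
# Falls back to the Kubernetes default of 30s when the field is unset.
GRACE=$(kubectl get deployment payment-processor -n payments \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}' 2>/dev/null)
GRACE=${GRACE:-30}
echo "terminationGracePeriodSeconds=$GRACE"
```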

kubectl rollout restart deployment/payment-processor -n payments
kubectl rollout status deployment/payment-processor -n payments

Rolling restart complete. New pods don't have PayPal enabled. Let me check the error rate.

curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(checkout_total{status="failure"}[1m]))/sum(rate(checkout_total[1m]))' | jq '.data.result[0].value[1]'

Error rate: 0.8%. Down from 8.3%. Checkout is healthy again. Users see only Stripe as a payment option, which is acceptable during the PayPal outage.

Now let me set up monitoring to re-enable PayPal when they recover. I'll check their status every 5 minutes.
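A rough watcher for that, reusing the status-page query from earlier; the component naming and the re-enable mechanics are assumptions, and this would run in a tmux session for the duration of the incident rather than as anything durable:

```shell
# Recovery watcher sketch: poll PayPal's status page every 5 minutes and
# re-enable the flag once their Checkout component reports Operational.
should_reenable() {
  # Pure predicate, kept separate so the decision logic is testable offline.
  [ "$1" = "Operational" ]
}
paypal_checkout_status() {
  curl -s https://www.paypal-status.com/api/v3/summary \
    | jq -r '[.components[] | select(.name | test("Checkout"; "i")) | .status] | first'
}
watch_paypal() {
  while true; do
    if should_reenable "$(paypal_checkout_status)"; then
      kubectl patch configmap payment-config -n payments --type merge \
        -p '{"data":{"PAYPAL_ENABLED":"true"}}'
      kubectl rollout restart deployment/payment-processor -n payments
      echo "PayPal re-enabled"
      break
    fi
    sleep 300  # 5-minute poll interval
  done
}
# During the incident: run watch_paypal in a tmux/screen session.
```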

As IC, let me update the incident channel with the timeline:

  • 10:27: PayPal began returning 503s
  • 10:32: Alert fired for checkout success rate < 95%
  • 10:38: Root cause identified: PayPal API outage
  • 10:42: PayPal disabled as payment option; error rate recovered
  • ACTION: Re-enable PayPal when their status page shows "Operational"

One more thing — I need to add a circuit breaker for PayPal in the payment service. If PayPal returns >5% errors, the circuit breaker should automatically disable it instead of requiring manual intervention. That's a post-incident action item.
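The durable fix belongs inside the payment service itself, but as a stopgap the same idea can run as external automation on a cron. A sketch reusing the Prometheus metric and ConfigMap flag from above (the threshold and query shape are assumptions, not our production config):

```shell
# Crude provider circuit breaker as a cron job: trip the flag off when a
# provider's 5xx rate exceeds the threshold. Metric names mirror the queries
# used during triage; the 5% threshold is the one from the action item.
ERROR_THRESHOLD=0.05

provider_error_rate() {
  # Fraction of 5xx responses for one provider over the last 5 minutes.
  curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(payment_requests_total{status=~\"5..\",payment_provider=\"$1\"}[5m]))/sum(rate(payment_requests_total{payment_provider=\"$1\"}[5m]))" \
    | jq -r '.data.result[0].value[1] // "0"'
}

breaker_open() {
  # Pure comparison so the trip logic is testable: succeeds when rate > threshold.
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}

trip_if_needed() {
  rate=$(provider_error_rate paypal)
  if breaker_open "$rate" "$ERROR_THRESHOLD"; then
    kubectl patch configmap payment-config -n payments --type merge \
      -p '{"data":{"PAYPAL_ENABLED":"false"}}'
    kubectl rollout restart deployment/payment-processor -n payments
  fi
}
```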

What Made This Senior-Level

Junior would: Start debugging the payment service code.
Senior does: Quantify user impact first: error rate, financial impact per minute, trend direction.
Why: Impact quantification sets the severity and urgency — you can't triage without it.

Junior would: Look at aggregate error rates.
Senior does: Break down failures by dimension (payment provider) to find the split.
Why: Dimensional analysis turns "8% of checkouts fail" into "PayPal is down" in one query.

Junior would: Wait for PayPal to fix their outage.
Senior does: Disable the failing integration and fall back to the working provider.
Why: You can't fix a third-party outage, but you can reduce user impact by disabling the broken path.

Junior would: Fix the immediate issue and stop.
Senior does: Add a circuit breaker as a post-incident action to automate the response next time.
Why: Manual intervention at 10 AM is possible; at 3 AM, the circuit breaker handles it automatically.

Key Heuristics Used

  1. Impact-First Triage: Quantify user impact (percentage, financial, trend) before debugging. This sets the response urgency and determines who needs to be involved.
  2. Dimensional Breakdown: When failures are partial, break down by every dimension (provider, region, endpoint). The dimension with a stark split reveals the root cause.
  3. Degrade Gracefully, Don't Wait: When a dependency fails, disable it and fall back to alternatives rather than waiting for it to recover. Users prefer a reduced-feature checkout over a broken one.

Cross-References

  • Primer — Incident management framework, severity levels, and the IC role
  • Street Ops — Incident triage commands, impact quantification queries, and communication templates
  • Footguns — Debugging instead of mitigating, not quantifying impact, and no circuit breakers for external dependencies