Incident Triage - Street-Level Ops
Real-world workflows for responding to alerts, assessing impact, and coordinating incident response.
First 2 Minutes: Acknowledge and Assess
# Acknowledge the alert (stop escalation timer)
# PagerDuty CLI
pd incident acknowledge --id INC-12345
# Read the alert context
# What service? What metric? What threshold was breached?
# Check the alert's runbook link first — someone may have already solved this.
# Quick sanity check: is the service actually down?
curl -s -o /dev/null -w '%{http_code}' https://api.myapp.com/health
# 503
# Check if this is a known issue
# Search recent incident channel in Slack
# Check status page for ongoing incidents
Minutes 2-5: Verify the Signal
# Check multiple data sources — is the alert real?
# 1. Service health endpoints
for svc in api payments orders users; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://${svc}.internal/health")
  echo "${svc}: ${code}"
done
# Output:
# api: 503
# payments: 503
# orders: 200
# users: 200
# 2. Error rate in metrics (Prometheus)
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# 3. Check for recent deployments (most common cause)
kubectl rollout history deployment/api-service -n production | tail -5
git log --oneline --since="2 hours ago" -- api/
# 4. Check for related alerts
# Are multiple alerts firing? Correlated symptoms point to a shared root cause.
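The per-service health sweep above can be wrapped in a reusable function. This is a minimal sketch: the probe command is passed in as a parameter (the `http_probe` helper below assumes the same hypothetical `https://<svc>.internal/health` endpoints as the loop above), so the same function works against any environment.

```shell
# check_health PROBE SVC...
# PROBE is a command that takes a service name and prints an HTTP status code.
# Prints one "svc: code" line per service; exits non-zero if any is not 2xx.
check_health() {
  probe=$1; shift
  rc=0
  for svc in "$@"; do
    code=$($probe "$svc")
    echo "${svc}: ${code}"
    case $code in
      2??) ;;       # healthy
      *) rc=1 ;;    # anything else (5xx, 000 for timeouts) counts as failing
    esac
  done
  return $rc
}

# Probe for the internal health endpoints used above (assumption, adjust to taste):
http_probe() {
  curl -s -o /dev/null -w '%{http_code}' "https://$1.internal/health"
}

# Usage: check_health http_probe api payments orders users || echo "degraded"
```

The non-zero exit code makes the sweep easy to script into a watch loop or a triage bot.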
Minutes 5-10: Assess Blast Radius
# Which pods are unhealthy?
kubectl get pods -n production | grep -v Running
# Output:
# NAME READY STATUS RESTARTS
# api-service-7d4b8c9f6-x2k9l 0/1 CrashLoopBackOff 5
# api-service-7d4b8c9f6-m3n7p 0/1 CrashLoopBackOff 5
# api-service-7d4b8c9f6-q8r2s 0/1 CrashLoopBackOff 5
# Check recent pod events
kubectl describe pod api-service-7d4b8c9f6-x2k9l -n production | tail -20
# Check logs for the crash reason
kubectl logs api-service-7d4b8c9f6-x2k9l -n production --previous | tail -30
# Check node health
kubectl get nodes -o wide
kubectl top nodes
# Check external dependencies
# Database
psql -h db.internal -c "SELECT 1" mydb && echo "DB OK" || echo "DB DOWN"
# Redis
redis-cli -h redis.internal ping
# Check recent changes in the cluster
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
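When many pods are unhealthy, a quick per-deployment tally is more useful than raw pod lines. A rough sketch, assuming standard `kubectl get pods` output columns and the usual `<deployment>-<rs-hash>-<pod-hash>` naming:

```shell
# blast_radius: summarize unhealthy pods per deployment.
# Reads `kubectl get pods` output on stdin (NAME READY STATUS RESTARTS ...),
# strips the replicaset and pod hash suffixes, and counts non-Running pods.
blast_radius() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" {
         name = $1
         sub(/-[a-z0-9]+-[a-z0-9]+$/, "", name)   # api-service-7d4b8c9f6-x2k9l -> api-service
         count[name]++
       }
       END { for (d in count) printf "%s: %d unhealthy\n", d, count[d] }'
}

# Usage: kubectl get pods -n production | blast_radius
```

The deployment-level counts drop straight into the incident channel's IMPACT line.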
Communicate: Open Incident Channel
# Post in #incidents (or create a dedicated channel)
INCIDENT: [SEV-2] API service returning 503 errors
STATUS: Investigating
IMPACT: All API endpoints returning 503. Checkout flow blocked for all users.
IC: @your-name (Incident Commander)
LAST UPDATE: 14:35 UTC — API pods in CrashLoopBackOff, investigating root cause
NEXT UPDATE: 14:50 UTC
# Update status page
# Set component status to "Major Outage" or "Degraded Performance"
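Consistent updates are easier to keep up under pressure if the template above is generated rather than retyped. A sketch (the `+15 minutes` relative date needs GNU date; on macOS use `date -u -v +15M` instead):

```shell
# incident_update SEV TITLE STATUS IMPACT IC NOTE
# Prints an update in the channel format above, stamped with the current UTC time.
incident_update() {
  now=$(date -u +'%H:%M UTC')
  next=$(date -u -d '+15 minutes' +'%H:%M UTC' 2>/dev/null || date -u +'%H:%M UTC')
  printf 'INCIDENT: [%s] %s\n' "$1" "$2"
  printf 'STATUS: %s\n' "$3"
  printf 'IMPACT: %s\n' "$4"
  printf 'IC: %s (Incident Commander)\n' "$5"
  printf 'LAST UPDATE: %s - %s\n' "$now" "$6"
  printf 'NEXT UPDATE: %s\n' "$next"
}

# incident_update SEV-2 "API service returning 503 errors" Investigating \
#   "Checkout flow blocked for all users" @your-name "API pods in CrashLoopBackOff"
```

Pipe the output into your chat tool's CLI or webhook so every update carries the same fields.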
Mitigate: Common Quick Actions
# Rollback a bad deployment (most common fix)
kubectl rollout undo deployment/api-service -n production
# Verify rollback is progressing
kubectl rollout status deployment/api-service -n production
# Scale up to handle load (if performance issue)
kubectl scale deployment/api-service --replicas=10 -n production
# Restart pods (if stuck in bad state)
kubectl rollout restart deployment/api-service -n production
# Failover to secondary region (if regional issue)
# Update DNS weight or load balancer target
# Temporary feature flag to disable broken feature
curl -X POST https://api.internal/admin/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"payment_v2": false}'
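Whichever mitigation you pick, confirm it worked before declaring progress. A minimal polling helper, with the check command injected as a parameter so it works for rollbacks, restarts, and failovers alike (the health URL in the usage comment is an assumption from earlier in this page):

```shell
# wait_healthy CHECK [TIMEOUT_SECS]
# Re-run CHECK every 5 seconds until it succeeds; give up after TIMEOUT_SECS (default 120).
wait_healthy() {
  check=$1
  timeout=${2:-120}
  waited=0
  until $check; do
    waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # still unhealthy at the deadline
    fi
    sleep 5
  done
}

# After a rollback:
# kubectl rollout undo deployment/api-service -n production
# wait_healthy 'curl -sf -o /dev/null https://api.myapp.com/health' 120 \
#   && echo "healthy again" || echo "still down, escalate"
```

A failure here is your cue to move to the escalation template below rather than stacking more mitigations.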
Escalation Template
# When you cannot resolve in 15 minutes, escalate with this context:
ESCALATION REQUEST
------------------
Service: api-service (production)
Severity: SEV-2
Started: 14:30 UTC
Duration: 15 minutes
Symptoms: All API pods CrashLoopBackOff, 100% error rate
Tried: Rollback to previous version, pod restart
Findings: Crash log shows OOM kill, heap usage 100%
Blast radius: API and Payments down, Orders and Users unaffected
Need: DBA or platform engineer — suspected database connection pool exhaustion
Dashboard: https://grafana.internal/d/api-overview
Logs: https://kibana.internal/app/logs?query=api-service
Post-Incident Wrap-up
# While memory is fresh, capture the timeline
cat > /tmp/incident-timeline.md << 'EOF'
## Incident Timeline: API Outage 2024-03-15
- 14:30 UTC — Alert fires: API error rate > 5%
- 14:32 UTC — Acknowledged by @your-name
- 14:35 UTC — Verified: all API pods in CrashLoopBackOff
- 14:38 UTC — Blast radius: API + Payments down
- 14:40 UTC — Cause identified: OOM due to connection pool leak in v2.3.1
- 14:42 UTC — Rollback to v2.3.0 initiated
- 14:45 UTC — Rollback complete, pods healthy
- 14:50 UTC — Error rate back to normal
- 14:55 UTC — Monitoring, status page updated to Operational
- 15:00 UTC — Incident resolved
## Root Cause
Connection pool leak introduced in PR #456 (merged 14:00 UTC)
## Action Items
- [ ] Fix connection pool leak in api-service
- [ ] Add connection pool metrics to dashboard
- [ ] Add canary deployment for api-service
EOF
# Schedule postmortem within 48 hours for SEV-1/2
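The postmortem will want the standard response-time numbers (time to acknowledge, time to mitigate, time to resolve). They fall out of the timeline above with a small helper; this sketch assumes GNU date and same-day UTC timestamps:

```shell
# minutes_between HH:MM HH:MM
# Elapsed minutes between two same-day UTC timestamps (requires GNU date -d).
minutes_between() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}

# From the timeline above:
# minutes_between 14:30 14:32   # time to acknowledge
# minutes_between 14:30 14:42   # time to start mitigation
# minutes_between 14:30 15:00   # total time to resolution
```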
Quick Reference: Severity Decision
Customer-facing AND >50% users affected → SEV-1
Customer-facing AND <50% users, no workaround → SEV-2
Customer-facing AND workaround exists → SEV-3
Internal-only, blocking operations → SEV-3
Internal-only, non-blocking → SEV-4
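The decision table above is mechanical enough to encode, which keeps severity calls consistent across responders. A sketch (function name and yes/no argument convention are made up for illustration):

```shell
# severity CUSTOMER_FACING PCT_AFFECTED WORKAROUND BLOCKING
# CUSTOMER_FACING/WORKAROUND/BLOCKING are yes|no; PCT_AFFECTED is an integer 0-100.
# Rules are checked in the same order as the table above.
severity() {
  facing=$1 pct=$2 workaround=$3 blocking=$4
  if [ "$facing" = yes ]; then
    if [ "$pct" -gt 50 ]; then echo SEV-1          # customer-facing, majority affected
    elif [ "$workaround" = yes ]; then echo SEV-3  # customer-facing, workaround exists
    else echo SEV-2                                # customer-facing, no workaround
    fi
  elif [ "$blocking" = yes ]; then echo SEV-3      # internal-only, blocking operations
  else echo SEV-4                                  # internal-only, non-blocking
  fi
}

# severity yes 100 no no   -> SEV-1 (matches the outage in this page's example)
```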