Incident Triage - Street-Level Ops

Real-world workflows for responding to alerts, assessing impact, and coordinating incident response.

First 2 Minutes: Acknowledge and Assess

# Acknowledge the alert (stop escalation timer)
# PagerDuty CLI
pd incident acknowledge --id INC-12345

# Read the alert context
# What service? What metric? What threshold was breached?
# Check the alert's runbook link first — someone may have already solved this.

# Quick sanity check: is the service actually down?
curl -s -o /dev/null -w '%{http_code}' https://api.myapp.com/health
# 503

# Check if this is a known issue
# Search recent incident channel in Slack
# Check status page for ongoing incidents
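If the status page is Statuspage-style, it exposes a JSON endpoint you can hit from the terminal. A sketch, assuming a hypothetical status.myapp.com URL, with sed standing in for jq so it works on a bare box:

```shell
# Hypothetical status-page endpoint (Statuspage-style APIs expose this path):
#   curl -s https://status.myapp.com/api/v2/status.json
# Parsing the overall indicator from a sample payload:
payload='{"status":{"indicator":"major","description":"Major System Outage"}}'
echo "$payload" | sed -n 's/.*"indicator":"\([a-z]*\)".*/\1/p'
# major
```

Anything other than `none` means an incident is already being tracked.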

Minutes 2-5: Verify the Signal

# Check multiple data sources — is the alert real?

# 1. Service health endpoints
for svc in api payments orders users; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "https://${svc}.internal/health")
    echo "${svc}: ${code}"
done

# Output:
# api: 503
# payments: 503
# orders: 200
# users: 200

# 2. Error rate in metrics (Prometheus)
# rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
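The PromQL above is the 5xx share of all requests. A sketch of the same arithmetic on raw counts, with made-up sample numbers, as a sanity check on what the query returns (the Prometheus host in the comment is an assumption):

```shell
# The same query via the Prometheus HTTP API:
#   curl -s 'http://prometheus.internal:9090/api/v1/query' \
#     --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])'
# The arithmetic behind it, with sample counts:
errors=120; total=2400
awk -v e="$errors" -v t="$total" 'BEGIN { printf "error rate: %.1f%%\n", 100 * e / t }'
# error rate: 5.0%
```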

# 3. Check for recent deployments (most common cause)
kubectl rollout history deployment/api-service -n production | tail -5
git log --oneline --since="2 hours ago" -- api/

# 4. Check for related alerts
# Are multiple alerts firing? Correlated symptoms point to a shared root cause.
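The health-check loop above can feed a small classifier so every responder reads the codes the same way. A minimal sketch; the buckets are an assumption, tune them to your services:

```shell
# Map an HTTP status code from a health check to a verdict.
# curl reports 000 when the connection itself fails.
classify_health() {
  case "$1" in
    2??) echo "healthy" ;;
    5??) echo "failing" ;;
    000) echo "unreachable" ;;
    *)   echo "degraded" ;;
  esac
}

# In the loop above:
#   echo "${svc}: ${code} ($(classify_health "${code}"))"
classify_health 503   # failing
```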

Minutes 5-10: Assess Blast Radius

# Which pods are unhealthy?
kubectl get pods -n production | grep -v Running

# Output:
# NAME                          READY   STATUS             RESTARTS
# api-service-7d4b8c9f6-x2k9l  0/1     CrashLoopBackOff   5
# api-service-7d4b8c9f6-m3n7p  0/1     CrashLoopBackOff   5
# api-service-7d4b8c9f6-q8r2s  0/1     CrashLoopBackOff   5

# Check recent pod events
kubectl describe pod api-service-7d4b8c9f6-x2k9l -n production | tail -20

# Check logs for the crash reason
kubectl logs api-service-7d4b8c9f6-x2k9l -n production --previous | tail -30
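When the `--previous` log is long, grepping for the usual failure signatures narrows it fast. A sketch; the pattern list is a starting point, not exhaustive:

```shell
# Count occurrences of common crash signatures in whatever is piped in.
crash_cause() {
  grep -ioE 'OOMKilled|out of memory|connection refused|panic:|segfault' \
    | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}

# kubectl logs api-service-7d4b8c9f6-x2k9l -n production --previous | crash_cause
echo "fatal: out of memory while allocating heap" | crash_cause
```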

# Check node health
kubectl get nodes -o wide
kubectl top nodes

# Check external dependencies
# Database
psql -h db.internal -c "SELECT 1" mydb && echo "DB OK" || echo "DB DOWN"

# Redis
redis-cli -h redis.internal ping

# Check recent changes in the cluster
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
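The dependency checks above can be wrapped so one command prints a pass/fail summary instead of being run ad hoc. A sketch; the commented-out checks reuse the commands and hosts from this section:

```shell
checks_failed=0

# Run a named check; any command whose exit status signals health works.
check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   ${name}"
  else
    echo "FAIL ${name}"
    checks_failed=$((checks_failed + 1))
  fi
}

# check "postgres" psql -h db.internal -c "SELECT 1" mydb
# check "redis"    redis-cli -h redis.internal ping
check "shell" true
echo "${checks_failed} check(s) failed"
```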

Communicate: Open Incident Channel

# Post in #incidents (or create a dedicated channel)

INCIDENT: [SEV-2] API service returning 503 errors
STATUS: Investigating
IMPACT: All API endpoints returning 503. Checkout flow blocked for all users.
IC: @your-name (Incident Commander)
LAST UPDATE: 14:35 UTC — API pods in CrashLoopBackOff, investigating root cause
NEXT UPDATE: 14:50 UTC
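Keeping every update in the same shape makes the channel scannable. A sketch that fills the template above from arguments; `date -u` supplies the UTC timestamp:

```shell
# status_update <severity> <title> <status> <IC> <latest finding>
status_update() {
  printf 'INCIDENT: [%s] %s\nSTATUS: %s\nIC: %s (Incident Commander)\nLAST UPDATE: %s UTC — %s\n' \
    "$1" "$2" "$3" "$4" "$(date -u +%H:%M)" "$5"
}

status_update SEV-2 "API service returning 503 errors" Investigating \
  @your-name "API pods in CrashLoopBackOff, investigating root cause"
```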

# Update status page
# Set component status to "Major Outage" or "Degraded Performance"

Mitigate: Common Quick Actions

# Rollback a bad deployment (most common fix)
kubectl rollout undo deployment/api-service -n production

# Verify rollback is progressing
kubectl rollout status deployment/api-service -n production

# Scale up to handle load (if performance issue)
kubectl scale deployment/api-service --replicas=10 -n production

# Restart pods (if stuck in bad state)
kubectl rollout restart deployment/api-service -n production

# Failover to secondary region (if regional issue)
# Update DNS weight or load balancer target

# Temporary feature flag to disable broken feature
curl -X POST https://api.internal/admin/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"payment_v2": false}'
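After any of these actions, confirm recovery rather than assuming it. A sketch of a retry loop; the attempt budget and sleep interval are arbitrary:

```shell
# Retry a health command until it succeeds or the attempt budget runs out.
wait_healthy() {
  tries="$1"; shift
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    [ "$i" -le "$tries" ] && sleep 5
  done
  echo "still unhealthy after ${tries} attempt(s)"
  return 1
}

# wait_healthy 30 curl -sf https://api.myapp.com/health
```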

Escalation Template

# When you cannot resolve in 15 minutes, escalate with this context:

ESCALATION REQUEST
------------------
Service:    api-service (production)
Severity:   SEV-2
Started:    14:30 UTC
Duration:   15 minutes
Symptoms:   All API pods CrashLoopBackOff, 100% error rate
Tried:      Rollback to previous version, pod restart
Findings:   Crash log shows OOM kill, heap usage 100%
Blast radius: API and Payments down, Orders and Users unaffected
Need:       DBA or platform engineer; suspected database connection pool exhaustion
Dashboard:  https://grafana.internal/d/api-overview
Logs:       https://kibana.internal/app/logs?query=api-service
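The Duration field is easy to get wrong under stress; computing it from the start time avoids that. A sketch in awk, assuming HH:MM inputs on the same UTC day:

```shell
# Minutes elapsed between two HH:MM timestamps (same day).
elapsed_minutes() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s, ":"); split(b, n, ":")
    print (n[1] * 60 + n[2]) - (s[1] * 60 + s[2])
  }'
}

elapsed_minutes 14:30 14:45   # 15
```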

Post-Incident Wrap-up

# While memory is fresh, capture the timeline
cat > /tmp/incident-timeline.md << 'EOF'
## Incident Timeline: API Outage 2024-03-15

- 14:30 UTC — Alert fires: API error rate > 5%
- 14:32 UTC — Acknowledged by @your-name
- 14:35 UTC — Verified: all API pods in CrashLoopBackOff
- 14:38 UTC — Blast radius: API + Payments down
- 14:40 UTC — Cause identified: OOM due to connection pool leak in v2.3.1
- 14:42 UTC — Rollback to v2.3.0 initiated
- 14:45 UTC — Rollback complete, pods healthy
- 14:50 UTC — Error rate back to normal
- 14:55 UTC — Monitoring, status page updated to Operational
- 15:00 UTC — Incident resolved

## Root Cause
Connection pool leak introduced in PR #456 (merged 14:00 UTC)

## Action Items
- [ ] Fix connection pool leak in api-service
- [ ] Add connection pool metrics to dashboard
- [ ] Add canary deployment for api-service
EOF

# Schedule postmortem within 48 hours for SEV-1/2
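Action items tend to get lost once the channel scrolls away. A small helper pulls the unchecked ones out of the timeline file for pasting into the tracker:

```shell
# List unchecked markdown task items ("- [ ]") from a write-up.
open_items() {
  grep '^- \[ \]' "$1" || echo "(no open action items)"
}

# open_items /tmp/incident-timeline.md
```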

Quick Reference: Severity Decision

Customer-facing AND >50% users affected        → SEV-1
Customer-facing AND <50% users, no workaround  → SEV-2
Customer-facing AND workaround exists          → SEV-3
Internal-only, blocking operations             → SEV-3
Internal-only, non-blocking                    → SEV-4
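The table above can be encoded as a function so the call is reproducible at 3 a.m. A sketch; inputs are yes/no flags mirroring the table's conditions, nothing more:

```shell
# severity <customer_facing> <majority_affected> <workaround_exists> <blocking>
severity() {
  if [ "$1" = yes ]; then
    if [ "$2" = yes ]; then echo SEV-1
    elif [ "$3" = no ]; then echo SEV-2
    else echo SEV-3
    fi
  elif [ "$4" = yes ]; then echo SEV-3
  else echo SEV-4
  fi
}

severity yes yes no no   # SEV-1: customer-facing, majority affected
severity no no no yes    # SEV-3: internal-only but blocking
```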