# Incident Postmortem & SLO/SLI - Street-Level Ops
## Quick SLO Calculations
SLO: 99.9% availability (30-day window)
Error budget: 0.1% = 43.2 minutes
Per day: 1.44 minutes of allowed downtime

SLO: 99.5% availability (30-day window)
Error budget: 0.5% = 216 minutes (3.6 hours)

SLO: 99.99% availability (30-day window)
Error budget: 0.01% = 4.32 minutes (don't pick this unless you're Google)
Remember: SLO math trick: take the allowed failure fraction (100% minus the SLO) and multiply by the 43,200 minutes in a 30-day month. 99.9% = 43 min, 99.95% = 22 min, 99.99% = 4 min. Each additional nine cuts your budget by 10x.
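The budget numbers above all come from one formula: (1 - SLO) x minutes in the window. A minimal shell sketch (the function name is my own):

```shell
# Error budget in minutes for a 30-day (43,200-minute) window.
# budget = (1 - SLO/100) * 43200
slo_budget_minutes() {
  awk -v slo="$1" 'BEGIN { printf "%.2f\n", (1 - slo / 100) * 43200 }'
}

slo_budget_minutes 99.9    # 43.20
slo_budget_minutes 99.95   # 21.60
slo_budget_minutes 99.99   # 4.32
```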
Interview tip: "What SLIs would you pick for a checkout service?" Best answer: availability (% of non-5xx responses), latency (p99 under 500ms), and correctness (% of orders that charge the right amount). Picking just availability misses the user experience.
## Pattern: Postmortem Within 48 Hours
Write the postmortem within 48 hours of incident resolution while memory is fresh:
- Hour 0-2: Collect artifacts (logs, graphs, chat transcripts, deploy timestamps)
- Hour 2-4: Draft timeline from artifacts
- Hour 4-24: Write root cause and contributing factors
- Hour 24-48: Review meeting with team, finalize action items
- Day 3-5: Share with broader org
## Pattern: Postmortem Artifacts to Collect
```shell
# Deployment history around incident time
kubectl rollout history deployment/grokdevops -n grokdevops

# Events during incident (sorted; image-pull noise filtered out)
kubectl get events -n grokdevops --sort-by='.lastTimestamp' \
  --field-selector 'reason!=Pulled'

# Prometheus query: error rate during incident
# Use Grafana Explore with the incident time range

# Loki query: error logs during incident (apply a time range filter)
{namespace="grokdevops"} |= "error"

# Git log: what was deployed?
git log --oneline --since="2024-01-15T14:00:00" --until="2024-01-15T15:00:00"
```
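Because evidence like events ages out, it can help to script the capture instead of running commands ad hoc. A sketch (the function name, output path, and command list are illustrative) that dumps each command's output to a numbered file, recording the command alongside it for the timeline:

```shell
# Snapshot artifact commands to files so evidence survives event GC.
snapshot_artifacts() {
  local outdir="$1"; shift
  mkdir -p "$outdir"
  local i=0 cmd
  for cmd in "$@"; do
    i=$((i + 1))
    # First line of each file records which command produced it.
    { echo "# $cmd"; eval "$cmd" 2>&1 || echo "# (command failed)"; } > "$outdir/$i.txt"
  done
}

snapshot_artifacts /tmp/inc-artifacts \
  "kubectl get events -n grokdevops --sort-by=.lastTimestamp" \
  "kubectl rollout history deployment/grokdevops -n grokdevops"
```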
Debug clue: Always capture `kubectl get events` immediately; Kubernetes garbage-collects events after 1 hour by default (the apiserver's `--event-ttl`). If you wait until the postmortem meeting, the evidence is gone.

One-liner: Quick error-rate check during an incident:

```shell
curl -s localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq '.data.result[0].value[1]'
```
## Pattern: SLO Review Cadence
| Frequency | Activity |
|---|---|
| Daily | Check error budget dashboard |
| Weekly | Review burn rate trends, flag risks |
| Monthly | Review SLO attainment, postmortem action items |
| Quarterly | Adjust SLO targets based on data and business needs |
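The weekly burn-rate check in the table is simple arithmetic: burn rate = observed error ratio divided by the allowed error ratio (1 - SLO). A burn rate of 1.0 spends the budget exactly over the window. A sketch (the function name is my own):

```shell
# Burn rate: how fast the error budget is being consumed.
# 1.0 = budget lasts exactly the 30-day window; 5.0 = gone in 6 days.
burn_rate() {
  awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", err / (1 - slo / 100) }'
}

burn_rate 0.005 99.9   # 0.5% errors against a 0.1% budget -> 5.0
```

Days to exhaustion is the window length divided by the burn rate: at burn rate 5, a 30-day budget is gone in 6 days.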
## Gotcha: SLOs Without Teeth
An SLO without an error budget policy is just a number on a dashboard.
Bad: "Our SLO is 99.9%" (and nothing happens when it's violated).

Good: "Our SLO is 99.9%. When budget is exhausted: feature freeze, on-call reviews all deploys, team focus shifts to reliability."
Analogy: An error budget is like a credit card balance. You start the month with a limit (e.g., 43 minutes of downtime). Every incident charges against it. When the balance is zero, you stop spending (feature freeze) until the next billing cycle resets it.
## Gotcha: Too Many SLOs
Start with 2 per service:

1. Availability: % of successful requests
2. Latency: p99 response time
Add more only when you have a specific need (correctness, freshness, throughput).
Default trap: Teams default to 99.99% because "higher is better." But 99.99% gives you 4 minutes/month of error budget — one bad deploy eats it. Start at 99.5% and tighten only when customers demand it.
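One way to sanity-check a target before adopting it: divide the monthly budget by your typical time to recover. A sketch (the function name and the 15-minute MTTR are my assumptions):

```shell
# Whole incidents of a given MTTR that fit in one month's error budget.
incidents_per_month() {
  awk -v slo="$1" -v mttr="$2" \
    'BEGIN { printf "%d\n", ((1 - slo / 100) * 43200) / mttr }'
}

incidents_per_month 99.5 15    # 216-minute budget -> 14 incidents
incidents_per_month 99.99 15   # 4.32-minute budget -> 0 incidents
```

If the answer is zero, the target leaves no room for a single ordinary incident.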
## Gotcha: Gaming the SLO
Teams might exclude badly behaved endpoints from the SLI to look better. Prevent this by:

- Including ALL user-facing endpoints
- Auditing SLI definitions quarterly
- Publishing SLIs transparently
## Template: Quick Incident Summary (for Chat)
When declaring an incident in Slack/chat:
```
INCIDENT: [one-line description]
SEVERITY: SEV-1/2/3
IMPACT: [who is affected and how]
STATUS: Investigating / Identified / Mitigating / Resolved
COMMANDER: @name
CHANNEL: #inc-YYYYMMDD-short-name
NEXT UPDATE: [time]
```
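The template above can be scripted so nobody has to type it from memory under pressure. A hypothetical helper (function name, argument order, and the example values are all illustrative):

```shell
# Print a filled-in incident declaration ready to paste into chat.
incident_summary() {
  local desc="$1" sev="$2" impact="$3" commander="$4" slug="$5" datestamp="$6" next="$7"
  printf 'INCIDENT: %s\n' "$desc"
  printf 'SEVERITY: SEV-%s\n' "$sev"
  printf 'IMPACT: %s\n' "$impact"
  printf 'STATUS: Investigating\n'
  printf 'COMMANDER: @%s\n' "$commander"
  printf 'CHANNEL: #inc-%s-%s\n' "$datestamp" "$slug"
  printf 'NEXT UPDATE: %s\n' "$next"
}

incident_summary "Checkout returning 5xx" 1 "All checkouts failing" alice checkout-5xx 20240115 "15:30 UTC"
```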
Remember: Incident severity mnemonic: SEV-1 = "customers are screaming," SEV-2 = "customers are grumbling," SEV-3 = "we noticed before customers did." If you cannot articulate the customer impact, you probably have the severity wrong.
## Template: Action Item Format
```
[PM-YYYYMMDD-NN] Title
Owner: @name
Priority: P1/P2/P3
Due: YYYY-MM-DD
Status: Open/In Progress/Done
Source: Postmortem for [incident name]
Description: Specific, actionable task
```
## Anti-Pattern: Blame in Disguise
These sound blameless but aren't:
| Sounds blameless | Actually blame |
|---|---|
| "The on-call should have escalated sooner" | Blaming the individual |
| "If the developer had tested..." | Blaming the developer |
| "The team failed to follow the process" | Blaming the team |
Better:

- "The escalation criteria were unclear" -> Action: Document escalation criteria
- "The test suite didn't cover this case" -> Action: Add integration test
- "The process was documented but not discoverable" -> Action: Link process from deploy tool
War story: A team wrote "the engineer should have noticed the alert" in a postmortem. The real fix was that the alert fired in a channel with 200+ daily messages. Moving it to a dedicated channel with PagerDuty integration solved the problem permanently.
Gotcha: Action items without owners and due dates never get done. Track postmortem actions in your issue tracker alongside feature work — if they only live in a Google Doc, they die there.