Level: L2: Operations | Topics: Postmortems & SLOs | Domain: DevOps & Tooling
Incident Postmortem & SLO/SLI Drills¶
Remember: SLI -> SLO -> SLA, from measurement to promise. SLI = what you measure (request latency, error rate), SLO = what you aim for (99.9% of requests < 300ms), SLA = what you promise with consequences (99.5% uptime or credits). Your SLO should always be stricter than your SLA — the gap is your safety buffer. Mnemonic: "SLI is the Indicator, SLO is the Objective, SLA is the Agreement."
Gotcha: 99.9% vs 99.99% availability sounds like a tiny difference, but it is a 10x difference in allowed downtime. 99.9% allows about 43 minutes of downtime per 30-day month; 99.99% allows about 4.3 minutes. Each additional "nine" is an order of magnitude harder and more expensive to achieve.
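The arithmetic behind these figures fits in a few lines. A minimal sketch (standalone illustration, not part of the drills; function names are ours):

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability SLO over a window."""
    window_minutes = window_days * 24 * 60  # 43,200 min in a 30-day window
    return window_minutes * (1 - slo)

def budget_remaining_minutes(slo: float, downtime_used_min: float,
                             window_days: int = 30) -> float:
    """Error budget left after some downtime has been spent."""
    return allowed_downtime_minutes(slo, window_days) - downtime_used_min

print(round(allowed_downtime_minutes(0.999), 1))     # 43.2 min/month at 99.9%
print(round(allowed_downtime_minutes(0.9999), 1))    # 4.3 min/month at 99.99%
print(round(budget_remaining_minutes(0.999, 20), 1)) # 23.2 min left after 20 min down
```

The same helper answers Drill 2 below: 20 minutes of downtime against a 99.9% SLO leaves 23.2 minutes of budget.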
Drill 1: Define SLI, SLO, SLA¶
Difficulty: Easy
Q: Define SLI, SLO, and SLA. Give a concrete example for a REST API.
Answer
- **SLI** (Service Level Indicator): A measurement of service behavior. "What we measure."
- **SLO** (Service Level Objective): A target for the SLI. "What we aim for."
- **SLA** (Service Level Agreement): A contract with consequences. "What we promise."

Example for `api.example.com`:

| Layer | Metric | Value |
|-------|--------|-------|
| **SLI** | Proportion of requests < 300ms returning non-5xx | Measured per rolling 30d window |
| **SLO** | 99.9% availability, p99 latency < 500ms | Internal engineering target |
| **SLA** | 99.5% availability | Contract: credits if breached |

The SLO should be stricter than the SLA. If SLO = 99.9% and SLA = 99.5%, you have a buffer before contractual consequences.

Drill 2: Error Budget Calculation¶
Difficulty: Medium
Q: Your API has a 99.9% availability SLO over a 30-day window. You've had 20 minutes of downtime this month. How much error budget remains?
Answer
A 99.9% SLO over 30 days allows 43,200 minutes × 0.1% = 43.2 minutes of downtime. With 20 minutes used, 23.2 minutes (about 54%) of the error budget remains. At this rate, you're on track to stay within budget, but one more incident of similar length puts you at risk. PromQL for the fraction of error budget remaining (request-based rather than time-based, assuming the standard `http_requests_total` counter):

# Fraction of 30-day error budget remaining (SLO = 99.9%)
1 - (
  (
    sum(rate(http_requests_total{code=~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
  /
  (1 - 0.999)
)

Drill 3: Multi-Window Burn Rate Alert¶
Difficulty: Hard
Q: Why is a simple "error rate > threshold" alert bad for SLO monitoring? Explain multi-window burn rate alerting.
Answer
Problems with a simple threshold:
- `error_rate > 0.1%` fires on brief spikes that don't threaten the SLO
- Too many false positives lead to alert fatigue
- Or the threshold is set too high and slow burns are missed

**Burn rate** = how fast you are consuming error budget relative to the window.
- Burn rate 1 = consuming budget evenly across the period (exactly depleted at window end)
- Burn rate 10 = consuming 10x faster, so the budget is gone in 1/10th of the window

**Multi-window**: check both a long window (confirms the trend) and a short window (confirms it is still happening).

# Page-worthy: 14.4x burn rate sustained over 1h (long window) and 5m (short window, still active)
- alert: ErrorBudgetBurn
  expr: |
    (
      http_requests:burnrate5m{job="api"} > (14.4 * (1 - 0.999))
      and
      http_requests:burnrate1h{job="api"} > (14.4 * (1 - 0.999))
    )
  labels:
    severity: critical

# Ticket-worthy: 3x burn rate sustained over 6h (and 30m)
- alert: ErrorBudgetBurnSlow
  expr: |
    (
      http_requests:burnrate30m{job="api"} > (3 * (1 - 0.999))
      and
      http_requests:burnrate6h{job="api"} > (3 * (1 - 0.999))
    )
  labels:
    severity: warning
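The 14.4x and 3x factors are not arbitrary: a burn-rate factor maps directly to the fraction of error budget consumed while the alert condition holds. A quick sketch of that arithmetic (Python, using the SLO and windows from the rules above; function names are ours):

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the total error budget consumed if `burn_rate` is
    sustained for `window_hours` out of a `period_hours` SLO window."""
    return burn_rate * window_hours / period_hours

def alert_threshold(burn_rate: float, slo: float = 0.999) -> float:
    """Error-ratio threshold equivalent to a given burn rate."""
    return burn_rate * (1 - slo)

# 14.4x for 1h burns 2% of a 30-day budget; 3x for 6h burns 2.5%
print(round(budget_consumed(14.4, 1), 4))  # 0.02
print(round(budget_consumed(3, 6), 4))     # 0.025
print(round(alert_threshold(14.4), 4))     # 0.0144 (page at 1.44% errors)
```

This is why the page-worthy alert can afford a high threshold: anything milder simply cannot exhaust the budget fast enough to need a human at 3 a.m.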
Drill 4: SLI Implementation¶
Difficulty: Medium
Q: Write PromQL queries for these SLIs: (a) availability, (b) latency, (c) throughput.
Answer
# (a) Availability: proportion of successful requests
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# (b) Latency: proportion of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# (c) Throughput: requests per second
sum(rate(http_requests_total[5m]))
# Combined SLI (availability AND latency)
(
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
*
(
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
)
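Note that the combined SLI above multiplies the two ratios, which treats availability and latency as independent. An alternative is to count a request as good only when it is both non-5xx and under 300ms. A toy comparison with made-up counts (all numbers hypothetical):

```python
# Toy counters over one window (hypothetical numbers for illustration)
total = 1000
non_5xx = 990       # 99.0% availability
under_300ms = 980   # 98.0% fast enough
good_both = 975     # non-5xx AND under 300ms, counted jointly

multiplied = (non_5xx / total) * (under_300ms / total)
joint = good_both / total

print(round(multiplied, 4))  # 0.9702 -- assumes independence
print(round(joint, 4))       # 0.975  -- what users actually experienced
```

When slow requests and failing requests overlap (a struggling backend often produces both), the joint count is the more faithful SLI.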
Drill 5: Postmortem Template¶
Difficulty: Easy
Q: What are the essential sections of a blameless postmortem?
Answer
# Incident Postmortem: [Title]
## Summary
- **Date**: 2024-01-15
- **Duration**: 45 minutes (14:15 - 15:00 UTC)
- **Severity**: SEV-2
- **Impact**: 30% of API requests returned 500 errors
## Timeline
- 14:15 — Deployment of v2.3.1 begins
- 14:18 — Error rate spikes to 30% (PagerDuty alert fires)
- 14:22 — On-call acknowledges, begins investigation
- 14:30 — Root cause identified: missing env var in new config
- 14:35 — Rollback initiated
- 14:40 — Rollback complete, error rate drops
- 15:00 — Confirmed fully recovered
## Root Cause
[Technical description of what went wrong]
## Contributing Factors
- Config change not validated in staging
- No smoke test after deployment
- Alert delay due to evaluation interval
## What Went Well
- Fast detection (3 minutes)
- Clear rollback procedure
## What Went Poorly
- 12 minutes to identify root cause
- No pre-deploy validation caught the issue
## Action Items
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|
| Add config validation to CI | @alice | P1 | JIRA-123 |
| Add post-deploy smoke test | @bob | P1 | JIRA-124 |
| Reduce alert evaluation interval to 30s | @carol | P2 | JIRA-125 |
## Lessons Learned
[Key takeaways for the team]
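A timeline like the one above also yields the standard response metrics. A sketch of the arithmetic (timestamps taken from the example timeline; metric names are common conventions, not mandated by the template):

```python
from datetime import datetime

t = lambda s: datetime.fromisoformat(f"2024-01-15T{s}:00")

deploy_start = t("14:15")  # change that triggered the incident
alert_fired  = t("14:18")
root_cause   = t("14:30")
mitigated    = t("14:40")  # rollback complete, error rate drops
resolved     = t("15:00")

minutes = lambda a, b: (b - a).total_seconds() / 60

print(minutes(deploy_start, alert_fired))  # time to detect: 3.0
print(minutes(alert_fired, root_cause))    # time to diagnose: 12.0
print(minutes(deploy_start, mitigated))    # time to mitigate: 25.0
print(minutes(deploy_start, resolved))     # total duration: 45.0
```

Tracking these per incident makes "What Went Well / Poorly" quantitative instead of anecdotal.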
Drill 6: Error Budget Policy¶
Difficulty: Medium
Q: Design an error budget policy that defines what happens at different budget consumption levels.
Answer
Error Budget Policy for API Service (99.9% SLO, 30-day window)
| Budget remaining | Action |
|------------------|--------|
| > 50% | Normal operations. Ship features. Standard deployment cadence. |
| 25-50% | Caution mode. Require extra review for risky changes; prioritize reliability work; no experimental deployments. |
| 10-25% | Slow down. Feature freeze for this service; all engineering effort on reliability; daily standup on error budget status; postmortem on recent incidents required. |
| < 10% | Full stop. No deployments except reliability fixes; incident commander assigned; escalate to engineering leadership; canary all changes with auto-rollback. |
| 0% (exhausted) | SLO breach. Formal review with stakeholders; prioritize reliability over features for next quarter; consider SLO adjustment if unrealistic. |
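A policy like this is easiest to enforce when dashboards and bots can report the current stage mechanically. A minimal sketch (stage names and thresholds from the table above; `policy_stage` is a hypothetical helper):

```python
def policy_stage(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to a policy stage."""
    if budget_remaining > 0.50:
        return "normal"     # ship features at the usual cadence
    if budget_remaining > 0.25:
        return "caution"    # extra review, prioritize reliability
    if budget_remaining > 0.10:
        return "slowdown"   # feature freeze, daily budget standup
    if budget_remaining > 0:
        return "full-stop"  # reliability fixes only, escalate
    return "breach"         # SLO missed: formal stakeholder review

print(policy_stage(0.6))   # normal
print(policy_stage(0.15))  # slowdown
print(policy_stage(0.0))   # breach
```

The point of pre-agreeing on thresholds is that the argument about whether to freeze features happens once, calmly, instead of during every incident.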
Drill 7: Four Golden Signals¶
Difficulty: Easy
Q: What are the four golden signals of monitoring? How does each map to a user experience problem?
Answer
| Signal | Measures | User Impact | PromQL Example |
|--------|----------|-------------|----------------|
| **Latency** | Request duration | "It's slow" | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
| **Traffic** | Request rate | "Is anyone using it?" | `sum(rate(http_requests_total[5m]))` |
| **Errors** | Failure rate | "It's broken" | `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| **Saturation** | Resource utilization | "It's about to break" | `container_memory_working_set_bytes / container_spec_memory_limit_bytes` |

Key insight: separate **successful request latency** from **error request latency**. A fast error is still an error, and a slow success still degrades user experience.

Drill 8: Incident Severity Levels¶
Difficulty: Easy
Q: Define severity levels for incidents with clear criteria and expected response.
Answer
| Level | Criteria | Response Time | Example |
|-------|----------|---------------|---------|
| **SEV-1** | Complete outage or data loss risk | 5 min ack, all-hands | API down for all users |
| **SEV-2** | Major feature degraded, high error rate | 15 min ack, on-call + backup | 30% of requests failing |
| **SEV-3** | Minor feature degraded, workaround exists | 1 hour ack, on-call | One endpoint slow |
| **SEV-4** | Cosmetic or minor issue | Next business day | Dashboard graph broken |

For each severity, define:
1. **Who gets paged**: SEV-1 = IC + on-call + backup + management. SEV-4 = ticket only.
2. **Communication cadence**: SEV-1 = updates every 15 min. SEV-3 = updates every 2h.
3. **Postmortem requirement**: SEV-1/2 = required. SEV-3/4 = optional.

Wiki Navigation¶
Prerequisites¶
- Postmortems & SLOs (Topic Pack, L2)
Related Content¶
- Devops Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
- Postmortem SLO Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
- Postmortems & SLOs (Topic Pack, L2) — Postmortems & SLOs
- SRE Practices (Topic Pack, L2) — Postmortems & SLOs
- Skillcheck: Postmortems & SLOs (Assessment, L2) — Postmortems & SLOs