
Portal | Level: L2: Operations | Topics: Postmortems & SLOs | Domain: DevOps & Tooling

Incident Postmortem & SLO/SLI Drills

Remember: SLI -> SLO -> SLA, from measurement to promise. SLI = what you measure (request latency, error rate), SLO = what you aim for (99.9% of requests < 300ms), SLA = what you promise with consequences (99.5% uptime or credits). Your SLO should always be stricter than your SLA — the gap is your safety buffer. Mnemonic: "SLI is the Indicator, SLO is the Objective, SLA is the Agreement."

Gotcha: 99.9% vs 99.99% availability sounds like a tiny difference, but the allowed downtime differs by 10x. 99.9% = 43.2 minutes of downtime per 30-day month; 99.99% = 4.32 minutes. Each additional "nine" is an order of magnitude harder and more expensive to achieve.
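A quick sketch to verify these budgets yourself (assumes a 30-day window; function name is illustrative):

```python
def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability)

for target in (0.999, 0.9999):
    print(f"{target:.4%}: {downtime_budget_minutes(target):.2f} min / 30 days")
```

Running this shows the 10x gap directly: 43.20 minutes at three nines versus 4.32 at four.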

Drill 1: Define SLI, SLO, SLA

Difficulty: Easy

Q: Define SLI, SLO, and SLA. Give a concrete example for a REST API.

Answer

- **SLI** (Service Level Indicator): a measurement of service behavior. "What we measure."
- **SLO** (Service Level Objective): a target for the SLI. "What we aim for."
- **SLA** (Service Level Agreement): a contract with consequences. "What we promise."

Example for `api.example.com`:

| Layer | Metric | Value |
|-------|--------|-------|
| **SLI** | Proportion of requests < 300ms returning non-5xx | Measured per rolling 30d window |
| **SLO** | 99.9% availability, p99 latency < 500ms | Internal engineering target |
| **SLA** | 99.5% availability | Contract: credits if breached |

The SLO should be stricter than the SLA. If SLO = 99.9% and SLA = 99.5%, you have a buffer before contractual consequences.

Drill 2: Error Budget Calculation

Difficulty: Medium

Q: Your API has a 99.9% availability SLO over a 30-day window. You've had 20 minutes of downtime this month. How much error budget remains?

Answer
Total minutes in 30 days: 30 × 24 × 60 = 43,200 minutes

Error budget (0.1%): 43,200 × 0.001 = 43.2 minutes

Used: 20 minutes
Remaining: 43.2 - 20 = 23.2 minutes
Budget consumed: 20 / 43.2 = 46.3%
At this rate you're on track to stay within budget, but one more incident of similar length puts you at risk. PromQL for the fraction of error budget consumed:
# Error budget consumed (0 = untouched, 1 = exhausted)
(
  1 - (
    sum(rate(http_requests_total{code!~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
) / (1 - 0.999)
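The same arithmetic as a small helper, useful for reports or dashboards (names are illustrative):

```python
def error_budget_report(slo: float, window_days: int, downtime_min: float) -> dict:
    """Summarize error budget consumption for an availability SLO."""
    total_min = window_days * 24 * 60      # minutes in the window
    budget_min = total_min * (1 - slo)     # allowed downtime
    return {
        "budget_min": budget_min,
        "remaining_min": budget_min - downtime_min,
        "consumed_pct": 100 * downtime_min / budget_min,
    }

# The drill's scenario: 99.9% SLO, 30-day window, 20 minutes down.
report = error_budget_report(slo=0.999, window_days=30, downtime_min=20)
# budget 43.2 min, remaining 23.2 min, ~46.3% consumed
```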

Drill 3: Multi-Window Burn Rate Alert

Difficulty: Hard

Q: Why is a simple "error rate > threshold" alert bad for SLO monitoring? Explain multi-window burn rate alerting.

Answer

Problems with a simple threshold:
- `error_rate > 0.1%` fires on brief spikes that don't threaten the SLO
- Too many false positives → alert fatigue
- Or the threshold is set too high → slow burns are missed

**Burn rate** = how fast you're consuming error budget relative to the window.
- Burn rate 1 = consuming budget evenly across the period (exactly depleted at window end)
- Burn rate 10 = consuming 10x faster → budget gone in 1/10th the window

**Multi-window**: check both a long window (trend) and a short window (still happening).
# Page-worthy: 14.4x burn rate over 1h (long window) and 5m (short window, still active)
- alert: ErrorBudgetBurn
  expr: |
    (
      http_requests:burnrate5m{job="api"} > (14.4 * (1 - 0.999))
      and
      http_requests:burnrate1h{job="api"} > (14.4 * (1 - 0.999))
    )
  labels:
    severity: critical

# Ticket-worthy: 3x burn rate over 3d (long window) and 6h (short window)
- alert: ErrorBudgetBurnSlow
  expr: |
    (
      http_requests:burnrate6h{job="api"} > (3 * (1 - 0.999))
      and
      http_requests:burnrate3d{job="api"} > (3 * (1 - 0.999))
    )
  labels:
    severity: warning
Burn-rate windows (adapted from the Google SRE Workbook):

| Burn Rate | Long Window | Short Window | Action |
|-----------|-------------|--------------|--------|
| 14.4x | 1h | 5m | Page |
| 6x | 6h | 30m | Page |
| 3x | 3d | 6h | Ticket |
| 1x | 30d | 3d | Log |
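The intuition behind those thresholds can be checked with a few lines (a sketch of the math, not the recording rules themselves):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget burns relative to even consumption."""
    return error_ratio / (1 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Time until the budget is gone at a constant burn rate."""
    return window_days * 24 / rate

# A sustained 1.44% error ratio against a 99.9% SLO is a 14.4x burn:
r = burn_rate(0.0144, 0.999)   # 14.4
# At that pace a 30-day budget lasts only 50 hours, which is why
# 14.4x gets a page on a 1h window rather than a ticket.
h = hours_to_exhaustion(r)     # 50.0
```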

Drill 4: SLI Implementation

Difficulty: Medium

Q: Write PromQL queries for these SLIs: (a) availability, (b) latency, (c) throughput.

Answer
# (a) Availability: proportion of successful requests
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# (b) Latency: proportion of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# (c) Throughput: requests per second
sum(rate(http_requests_total[5m]))

# Combined SLI (availability AND latency)
(
  sum(rate(http_requests_total{code!~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
*
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  /
  sum(rate(http_request_duration_seconds_count[5m]))
)
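Outside Prometheus, the same combined SLI can be computed from raw request counters (counter names here are made up for illustration):

```python
def combined_sli(total: int, non_5xx: int, under_300ms: int) -> float:
    """Availability x latency SLI from request counts over a window."""
    if total == 0:
        return 1.0  # no traffic: conventionally treated as meeting the SLO
    return (non_5xx / total) * (under_300ms / total)

# 10,000 requests: 9,990 succeeded, 9,900 finished under 300ms.
sli = combined_sli(total=10_000, non_5xx=9_990, under_300ms=9_900)
# 0.999 * 0.99 ≈ 0.98901
```

Note the multiplication treats the two SLIs as independent; a stricter variant would count only requests that are both successful and fast.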

Drill 5: Postmortem Template

Difficulty: Easy

Q: What are the essential sections of a blameless postmortem?

Answer
# Incident Postmortem: [Title]

## Summary
- **Date**: 2024-01-15
- **Duration**: 45 minutes (14:15 - 15:00 UTC)
- **Severity**: SEV-2
- **Impact**: 30% of API requests returned 500 errors

## Timeline
- 14:15 — Deployment of v2.3.1 begins
- 14:18 — Error rate spikes to 30% (PagerDuty alert fires)
- 14:22 — On-call acknowledges, begins investigation
- 14:30 — Root cause identified: missing env var in new config
- 14:35 — Rollback initiated
- 14:40 — Rollback complete, error rate drops
- 15:00 — Confirmed fully recovered

## Root Cause
[Technical description of what went wrong]

## Contributing Factors
- Config change not validated in staging
- No smoke test after deployment
- Alert delay due to evaluation interval

## What Went Well
- Fast detection (3 minutes)
- Clear rollback procedure

## What Went Poorly
- 12 minutes to identify root cause
- No pre-deploy validation caught the issue

## Action Items
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|
| Add config validation to CI | @alice | P1 | JIRA-123 |
| Add post-deploy smoke test | @bob | P1 | JIRA-124 |
| Reduce alert evaluation interval to 30s | @carol | P2 | JIRA-125 |

## Lessons Learned
[Key takeaways for the team]
Key principles:
- **Blameless**: focus on systems, not people
- **Timeline**: be specific (UTC timestamps)
- **Action items**: each must have an owner and a ticket
- **Review**: share with the broader team

Drill 6: Error Budget Policy

Difficulty: Medium

Q: Design an error budget policy that defines what happens at different budget consumption levels.

Answer
Error Budget Policy for API Service (99.9% SLO, 30-day window)

Budget Remaining    Action
─────────────────   ──────────────────────────────────────
> 50%               Normal operations. Ship features.
                    Standard deployment cadence.

25-50%              Caution mode.
                    - Require extra review for risky changes
                    - Prioritize reliability work
                    - No experimental deployments

10-25%              Slow down.
                    - Feature freeze for this service
                    - All engineering effort on reliability
                    - Daily standup on error budget status
                    - Postmortem on recent incidents required

< 10%               Full stop.
                    - No deployments except reliability fixes
                    - Incident commander assigned
                    - Escalate to engineering leadership
                    - Canary all changes with auto-rollback

0% (exhausted)      SLO breach.
                    - Formal review with stakeholders
                    - Prioritize reliability over features for next quarter
                    - Consider SLO adjustment if unrealistic
This creates a shared understanding between product and engineering about the cost of unreliability.
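A policy like this is easy to encode so tooling (deploy gates, dashboards) can surface the current mode; thresholds below are copied from the table above, boundary handling is an assumption:

```python
def budget_policy_mode(remaining_pct: float) -> str:
    """Map remaining error budget (%) to the policy's operating mode."""
    if remaining_pct <= 0:
        return "slo-breach"   # budget exhausted: formal review
    if remaining_pct < 10:
        return "full-stop"    # reliability fixes only
    if remaining_pct < 25:
        return "slow-down"    # feature freeze
    if remaining_pct <= 50:
        return "caution"      # extra review, prioritize reliability
    return "normal"           # ship features as usual

mode = budget_policy_mode(remaining_pct=18.0)  # "slow-down": feature freeze
```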

Drill 7: Four Golden Signals

Difficulty: Easy

Q: What are the four golden signals of monitoring? How does each map to a user experience problem?

Answer

| Signal | Measures | User Impact | PromQL Example |
|--------|----------|-------------|----------------|
| **Latency** | Request duration | "It's slow" | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
| **Traffic** | Request rate | "Is anyone using it?" | `sum(rate(http_requests_total[5m]))` |
| **Errors** | Failure rate | "It's broken" | `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| **Saturation** | Resource utilization | "It's about to break" | `container_memory_working_set_bytes / container_spec_memory_limit_bytes` |

Key insight: separate **successful request latency** from **error request latency**. A fast error is still an error; a slow success still degrades user experience.

Drill 8: Incident Severity Levels

Difficulty: Easy

Q: Define severity levels for incidents with clear criteria and expected response.

Answer

| Level | Criteria | Response Time | Example |
|-------|----------|---------------|---------|
| **SEV-1** | Complete outage or data loss risk | 5 min ack, all-hands | API down for all users |
| **SEV-2** | Major feature degraded, high error rate | 15 min ack, on-call + backup | 30% of requests failing |
| **SEV-3** | Minor feature degraded, workaround exists | 1 hour ack, on-call | One endpoint slow |
| **SEV-4** | Cosmetic or minor issue | Next business day | Dashboard graph broken |

For each severity, define:
1. **Who gets paged**: SEV-1 = IC + on-call + backup + management. SEV-4 = ticket only.
2. **Communication cadence**: SEV-1 = updates every 15 min. SEV-3 = updates every 2h.
3. **Postmortem requirement**: SEV-1/2 = required. SEV-3/4 = optional.

Wiki Navigation

Prerequisites

  • Devops Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
  • Postmortem SLO Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
  • Postmortems & SLOs (Topic Pack, L2) — Postmortems & SLOs
  • SRE Practices (Topic Pack, L2) — Postmortems & SLOs
  • Skillcheck: Postmortems & SLOs (Assessment, L2) — Postmortems & SLOs