
Portal | Level: L2: Operations | Topics: Postmortems & SLOs, Incident Response | Domain: DevOps & Tooling

Incident Postmortem Writing & SLO/SLI - Primer

Why This Matters

You can be the best debugger in the world, but if you can't write a clear postmortem, your organization will repeat the same incidents. And without SLOs, you have no objective way to decide when to invest in reliability vs features. This is the discipline that separates firefighting from engineering.

SLI / SLO / SLA

Fun fact: The concept of SLOs and error budgets was popularized by Google's SRE book (2016), but the underlying idea traces back to statistical process control in manufacturing (1920s, Walter Shewhart at Bell Labs). Google's innovation was applying the same "acceptable defect rate" thinking to software reliability — and making the error budget a currency that engineering teams "spend" to ship features.

Definitions

| Term | What it is | Who owns it | Example |
|------|------------|-------------|---------|
| SLI (Service Level Indicator) | A metric that measures service quality | Engineering | 99.2% of requests succeed |
| SLO (Service Level Objective) | A target for an SLI | Engineering + Product | 99.9% of requests should succeed |
| SLA (Service Level Agreement) | A contract with consequences | Business + Legal | 99.5% uptime or we refund |

Rule: SLA < SLO < theoretical max. Your SLO should be stricter than your SLA so you have a buffer.

Choosing SLIs

Good SLIs measure what users experience, not what infrastructure does.

| Type | Good SLI | Bad SLI |
|------|----------|---------|
| Availability | % of successful HTTP requests | CPU utilization |
| Latency | p99 response time | Average response time |
| Throughput | Successful requests per second | Network bandwidth |
| Correctness | % of responses with correct data | Test pass rate |
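The latency row deserves a demonstration. Here is a minimal Python sketch, using hypothetical latency samples, of why an average hides the slow tail that a p99 exposes:

```python
import math
import random

# Hypothetical traffic: 985 fast requests plus 15 pathologically slow ones.
random.seed(7)
latencies_ms = [random.uniform(10, 50) for _ in range(985)] + [5000.0] * 15

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * N)."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms:.0f} ms")  # looks healthy; the tail is averaged away
print(f"p99:  {p99_ms:.0f} ms")   # exposes the slow requests users actually felt
```

The mean lands around 100 ms, which looks fine on a dashboard, while the p99 reports the full 5-second tail.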

Defining SLOs

SLO: 99.9% of HTTP requests return non-5xx responses over a 30-day rolling window.

What this means:
- 30-day error budget: 0.1% of requests, equivalent to 43.2 minutes of total downtime
- At 1,000 req/min (43.2M requests per 30 days), the budget is 43,200 failed requests
- That averages 1,440 errors/day, or 1 failed request per minute; sustain more than that and the budget runs out before the window does
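The arithmetic above can be sketched in Python (the 30-day window and the 1,000 req/min traffic rate are the example's assumptions):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime-equivalent budget for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def allowed_errors(slo: float, requests: int) -> float:
    """How many failed requests the budget allows out of `requests`."""
    return (1 - slo) * requests

# 99.9% over 30 days -> 43.2 minutes of downtime equivalent
print(error_budget_minutes(0.999))

# At 1,000 req/min there are 43.2M requests in 30 days,
# so the budget is 43,200 failed requests (1,440/day, 1/min).
total = 1000 * 60 * 24 * 30
print(allowed_errors(0.999, total))
print(allowed_errors(0.999, total) / 30)
```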

Error Budgets

Error Budget = 1 - SLO

For 99.9% SLO:
  Error budget = 0.1%
  In 30 days = 43.2 minutes
  In a quarter = 129.6 minutes

Budget remaining = Error budget - actual errors

If budget > 0: Ship features
If budget < 0: Focus on reliability

Error Budget Policy

An agreed-upon document that says what happens when the error budget is exhausted:

Error Budget Policy for grokdevops:

1. When >50% of monthly budget consumed:
   - Review recent changes for reliability impact
   - Enable canary deployments for all changes

2. When 100% of budget consumed:
   - Feature freeze until budget recovers
   - All engineering effort on reliability
   - Mandatory postmortem for budget-depleting incidents

3. When budget is positive:
   - Teams can ship features at normal pace
   - Risk tolerance for experiments is higher
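A policy like this is easiest to apply consistently when it is encoded, so dashboards and bots report the same answer a human would. A hypothetical sketch mirroring the thresholds above:

```python
def policy_action(budget_consumed: float) -> str:
    """Map the fraction of the monthly error budget consumed to an action.
    Thresholds mirror the hypothetical grokdevops policy above."""
    if budget_consumed >= 1.0:
        return "feature freeze: all effort on reliability, mandatory postmortems"
    if budget_consumed > 0.5:
        return "review recent changes; canary all deployments"
    return "normal pace: ship features, higher risk tolerance"

print(policy_action(0.3))
print(policy_action(0.7))
print(policy_action(1.2))
```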

Incident Postmortem

What Is a Postmortem?

A blameless written record of an incident: what happened, why, and how to prevent recurrence. The goal is organizational learning, not punishment.

Who made it: The blameless postmortem practice was popularized by John Allspaw and Paul Hammond in their landmark 2009 Velocity Conference talk "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr." The practice was further formalized in Google's Site Reliability Engineering book (2016, edited by Betsy Beyer et al.), which made SLOs, error budgets, and blameless postmortems standard industry vocabulary.

Blameless Culture

Blameless does NOT mean accountable-less.

| Blameless | Blame-full |
|-----------|------------|
| "The deploy pipeline lacked a canary step" | "John pushed bad code" |
| "The runbook was outdated" | "The on-call should have known" |
| "The alert didn't fire because..." | "Nobody was watching the dashboard" |

Focus on systems and processes, not individuals.

Remember: The "nines" availability cheat sheet, as downtime budget per month: 99% = 7.3 hours, 99.9% = 43.8 minutes, 99.95% = 21.9 minutes, 99.99% = 4.4 minutes, 99.999% = 26 seconds. Mnemonic: "each nine costs 10x more". Going from 99.9% to 99.99% does not double the cost; it typically requires an order of magnitude more operational investment.
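The cheat sheet is easier to regenerate than to memorize. A small Python sketch, using an average month of 30.44 days (a flat 30-day window, as used elsewhere on this page, gives slightly smaller numbers, e.g. 43.2 instead of 43.8 minutes):

```python
def monthly_budget_minutes(slo: float) -> float:
    """Downtime budget per average month (30.44 days) for an availability SLO."""
    return (1 - slo) * 30.44 * 24 * 60

for label, slo in [("99%", 0.99), ("99.9%", 0.999), ("99.95%", 0.9995),
                   ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    m = monthly_budget_minutes(slo)
    # Pick a human-friendly unit for the budget.
    pretty = (f"{m / 60:.1f} hours" if m >= 60
              else f"{m:.1f} minutes" if m >= 1
              else f"{m * 60:.0f} seconds")
    print(f"{label:>8} -> {pretty}")
```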

Postmortem Template

# Incident Postmortem: [Title]

**Date**: YYYY-MM-DD
**Duration**: Start time - End time (X hours Y minutes)
**Severity**: SEV-1 / SEV-2 / SEV-3
**Author**: [Name]
**Reviewers**: [Names]

## Summary
One paragraph describing what happened and the customer impact.

## Impact
- **Duration**: How long were customers affected?
- **Affected users**: Number or percentage
- **Revenue impact**: If applicable
- **SLO impact**: Error budget consumed

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Deploy v2.3.1 pushed to production |
| 14:05 | Error rate alert fires |
| 14:08 | On-call acknowledges page |
| 14:15 | Root cause identified: database migration timeout |
| 14:22 | Rollback initiated |
| 14:25 | Service recovered |

## Root Cause
What specifically caused the incident. Be precise.
"The database migration in v2.3.1 added an index on the 50M-row users table.
The migration timed out after 5 minutes, leaving the table in a locked state.
All queries to the users table blocked, causing 503 errors."

## Contributing Factors
- Migration was not tested against a production-sized dataset
- No timeout protection on migration Jobs
- Health check did not verify database connectivity

## What Went Well
- Alert fired within 5 minutes
- Rollback was quick (3 minutes)
- On-call engineer had access to all needed tools

## What Went Wrong
- Migration was not flagged as high-risk in code review
- Staging database had only 1000 rows (vs 50M in prod)
- Rollback required manual intervention

## Action Items
| # | Action | Owner | Priority | Due Date |
|---|--------|-------|----------|----------|
| 1 | Add migration dry-run step to CI pipeline | @alice | P1 | 2024-02-01 |
| 2 | Populate staging DB with production-scale data | @bob | P1 | 2024-02-01 |
| 3 | Add database connectivity to health check | @charlie | P2 | 2024-02-15 |
| 4 | Set activeDeadlineSeconds on migration Jobs | @alice | P2 | 2024-02-15 |
| 5 | Add postmortem writing to incident response training | @dave | P3 | 2024-03-01 |

## Lessons Learned
- Migrations on large tables need explicit testing at production scale
- Health checks should verify all critical dependencies

Gotcha: The most common postmortem failure mode: excellent analysis, action items created, then nobody tracks them. Within 6 weeks the action items are forgotten and the same incident recurs. The fix is simple: action items go into your normal issue tracker (Jira, Linear, GitHub Issues) with owners and due dates — not in a Google Doc that nobody re-reads. Track completion rate as a metric.

Postmortem Quality Checklist

  • Timeline has precise timestamps (UTC)
  • Root cause is specific (not "human error")
  • Contributing factors explain why the root cause wasn't caught
  • Action items have owners and due dates
  • Action items are specific (not "improve monitoring")
  • Postmortem was reviewed by at least 2 people
  • Postmortem was shared with the broader team

Incident Severity Levels

| Level | Meaning | Response | Postmortem? |
|-------|---------|----------|-------------|
| SEV-1 | Major customer impact, revenue loss | All hands, page leadership | Required |
| SEV-2 | Significant impact, degraded service | Page on-call team | Required |
| SEV-3 | Minor impact, workaround available | Next business day | Optional |
| SEV-4 | No customer impact, internal only | Best effort | No |

Measuring Reliability

Error Budget Dashboard

# Current success rate (SLI) — 1 minus the error ratio
1 - (
  sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[30d]))
  / sum(rate(http_requests_total{job="grokdevops"}[30d]))
)

# Error budget remaining
(
  1 - (
    sum(increase(http_requests_total{job="grokdevops",status=~"5.."}[30d]))
    / sum(increase(http_requests_total{job="grokdevops"}[30d]))
  )
) - 0.999  # subtract SLO target

# Error budget burn rate (should be < 1.0)
(
  sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="grokdevops"}[1h]))
) / 0.001  # divide by budget (1 - SLO)
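The burn-rate query translates directly into plain arithmetic. A minimal Python sketch with hypothetical request counts for the last hour:

```python
def burn_rate(error_requests: float, total_requests: float,
              slo: float = 0.999) -> float:
    """Error-budget burn rate over a window: the observed error ratio divided
    by the budget ratio (1 - SLO). 1.0 means the budget burns exactly over the
    SLO window; above 1.0 the budget will be exhausted early."""
    error_ratio = error_requests / total_requests
    return error_ratio / (1 - slo)

# Hypothetical last-hour counts: 60,000 requests, 300 of them 5xx.
rate = burn_rate(error_requests=300, total_requests=60_000)
print(f"burn rate: {rate:.1f}")  # 0.5% errors against a 0.1% budget = 5x burn
```

A sustained 5x burn rate means a 30-day budget would be gone in about 6 days, which is why burn-rate alerts page well before the raw error ratio looks alarming.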

Analogy: Error budgets work like a financial budget. You start the month with a "reliability allowance" (e.g., 43 minutes for a 99.9% SLO). Every incident "spends" some of that budget. When the budget is exhausted, you "freeze spending" — no risky deploys until reliability recovers. This framing makes SLOs tangible for product managers who otherwise struggle with abstract reliability targets.

Common Pitfalls

  1. Postmortem assigned to the person who caused the incident — Use a neutral facilitator.
  2. Action items never completed — Track them in your issue tracker with deadlines.
  3. Too many SLOs — Start with 1-2 SLOs per service (availability + latency).
  4. SLO too tight — 99.999% leaves 26 seconds/month. No human can respond in time.
  5. No error budget policy — Without consequences, SLOs are just numbers.
  6. Vague action items — "Improve monitoring" is not actionable. "Add latency alert at p99 > 500ms" is.

Wiki Navigation

Prerequisites

Next Steps

  • Change Management (Topic Pack, L1) — Incident Response
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Devops Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
  • Incident Command & On-Call (Topic Pack, L2) — Incident Response
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
  • Postmortem & SLO Drills (Drill, L2) — Postmortems & SLOs