
Portal | Level: L2: Operations | Topics: Postmortems & SLOs, Incident Response | Domain: DevOps & Tooling

Incident Postmortem Writing & SLO/SLI - Primer

Why This Matters

You can be the best debugger in the world, but if you can't write a clear postmortem, your organization will repeat the same incidents. And without SLOs, you have no objective way to decide when to invest in reliability vs features. This is the discipline that separates firefighting from engineering.

SLI / SLO / SLA

Fun fact: The concept of SLOs and error budgets was popularized by Google's SRE book (2016), but the underlying idea traces back to statistical process control in manufacturing (1920s, Walter Shewhart at Bell Labs). Google's innovation was applying the same "acceptable defect rate" thinking to software reliability — and making the error budget a currency that engineering teams "spend" to ship features.

Definitions

| Term | What it is | Who owns it | Example |
|------|------------|-------------|---------|
| SLI (Service Level Indicator) | A metric that measures service quality | Engineering | 99.2% of requests succeed |
| SLO (Service Level Objective) | A target for an SLI | Engineering + Product | 99.9% of requests should succeed |
| SLA (Service Level Agreement) | A contract with consequences | Business + Legal | 99.5% uptime or we refund |

Rule: SLA < SLO < theoretical max. Your SLO should be stricter than your SLA so you have a buffer.

Choosing SLIs

Good SLIs measure what users experience, not what infrastructure does.

| Type | Good SLI | Bad SLI |
|------|----------|---------|
| Availability | % of successful HTTP requests | CPU utilization |
| Latency | p99 response time | Average response time |
| Throughput | Successful requests per second | Network bandwidth |
| Correctness | % of responses with correct data | Test pass rate |
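The latency row deserves a demonstration. Here is a minimal Python sketch, using hypothetical latency samples, of why an average hides the slow tail that a p99 exposes:

```python
import math
import random

# Hypothetical traffic: 985 fast requests plus 15 pathologically slow ones.
random.seed(7)
latencies_ms = [random.uniform(10, 50) for _ in range(985)] + [5000.0] * 15

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * N)."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms:.0f} ms")  # looks healthy; the tail is averaged away
print(f"p99:  {p99_ms:.0f} ms")   # exposes the slow requests users actually felt
```

The mean lands around 100 ms, which looks fine on a dashboard, while the p99 reports the full 5-second tail.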

Defining SLOs

SLO: 99.9% of HTTP requests return non-5xx responses over a 30-day rolling window.

What this means:
- 30-day error budget: 0.1% of requests, equivalent to 43.2 minutes of total downtime
- At 1,000 req/min (43.2M requests per 30 days), the budget is 43,200 failed requests
- That averages 1,440 errors/day, or 1 failed request per minute; sustain more than that and the budget runs out before the window does
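The arithmetic above can be sketched in Python (the 30-day window and the 1,000 req/min traffic rate are the example's assumptions):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime-equivalent budget for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def allowed_errors(slo: float, requests: int) -> float:
    """How many failed requests the budget allows out of `requests`."""
    return (1 - slo) * requests

# 99.9% over 30 days -> 43.2 minutes of downtime equivalent
print(error_budget_minutes(0.999))

# At 1,000 req/min there are 43.2M requests in 30 days,
# so the budget is 43,200 failed requests (1,440/day, 1/min).
total = 1000 * 60 * 24 * 30
print(allowed_errors(0.999, total))
print(allowed_errors(0.999, total) / 30)
```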

Error Budgets

Error Budget = 1 - SLO

For 99.9% SLO:
  Error budget = 0.1%
  In 30 days = 43.2 minutes
  In a quarter = 129.6 minutes

Budget remaining = Error budget - actual errors

If budget > 0: Ship features
If budget < 0: Focus on reliability

Error Budget Policy

An agreed-upon document that says what happens when the error budget is exhausted:

Error Budget Policy for grokdevops:

1. When >50% of monthly budget consumed:
   - Review recent changes for reliability impact
   - Enable canary deployments for all changes

2. When 100% of budget consumed:
   - Feature freeze until budget recovers
   - All engineering effort on reliability
   - Mandatory postmortem for budget-depleting incidents

3. When budget is positive:
   - Teams can ship features at normal pace
   - Risk tolerance for experiments is higher
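A policy like this is easiest to apply consistently when it is encoded, so dashboards and bots report the same answer a human would. A hypothetical sketch mirroring the thresholds above:

```python
def policy_action(budget_consumed: float) -> str:
    """Map the fraction of the monthly error budget consumed to an action.
    Thresholds mirror the hypothetical grokdevops policy above."""
    if budget_consumed >= 1.0:
        return "feature freeze: all effort on reliability, mandatory postmortems"
    if budget_consumed > 0.5:
        return "review recent changes; canary all deployments"
    return "normal pace: ship features, higher risk tolerance"

print(policy_action(0.3))
print(policy_action(0.7))
print(policy_action(1.2))
```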

Incident Postmortem

What Is a Postmortem?

A blameless written record of an incident: what happened, why, and how to prevent recurrence. The goal is organizational learning, not punishment.

Who made it: The blameless postmortem practice was popularized by John Allspaw and Paul Hammond in their landmark 2009 Velocity Conference talk "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr." The practice was further formalized in Google's Site Reliability Engineering book (2016, edited by Betsy Beyer et al.), which made SLOs, error budgets, and blameless postmortems standard industry vocabulary.

Blameless Culture

Blameless does NOT mean accountable-less.

| Blameless | Blame-full |
|-----------|------------|
| "The deploy pipeline lacked a canary step" | "John pushed bad code" |
| "The runbook was outdated" | "The on-call should have known" |
| "The alert didn't fire because..." | "Nobody was watching the dashboard" |

Focus on systems and processes, not individuals.

Remember: The "nines" availability cheat sheet, as downtime budget per month: 99% = 7.3 hours, 99.9% = 43.8 minutes, 99.95% = 21.9 minutes, 99.99% = 4.4 minutes, 99.999% = 26 seconds. Mnemonic: "each nine costs 10x more". Going from 99.9% to 99.99% does not double the cost; it typically requires an order of magnitude more operational investment.
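The cheat sheet is easier to regenerate than to memorize. A small Python sketch, using an average month of 30.44 days (a flat 30-day window, as used elsewhere on this page, gives slightly smaller numbers, e.g. 43.2 instead of 43.8 minutes):

```python
def monthly_budget_minutes(slo: float) -> float:
    """Downtime budget per average month (30.44 days) for an availability SLO."""
    return (1 - slo) * 30.44 * 24 * 60

for label, slo in [("99%", 0.99), ("99.9%", 0.999), ("99.95%", 0.9995),
                   ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    m = monthly_budget_minutes(slo)
    # Pick a human-friendly unit for the budget.
    pretty = (f"{m / 60:.1f} hours" if m >= 60
              else f"{m:.1f} minutes" if m >= 1
              else f"{m * 60:.0f} seconds")
    print(f"{label:>8} -> {pretty}")
```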

Postmortem Template

# Incident Postmortem: [Title]

**Date**: YYYY-MM-DD
**Duration**: Start time - End time (X hours Y minutes)
**Severity**: SEV-1 / SEV-2 / SEV-3
**Author**: [Name]
**Reviewers**: [Names]

## Summary
One paragraph describing what happened and the customer impact.

## Impact
- **Duration**: How long were customers affected?
- **Affected users**: Number or percentage
- **Revenue impact**: If applicable
- **SLO impact**: Error budget consumed

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Deploy v2.3.1 pushed to production |
| 14:05 | Error rate alert fires |
| 14:08 | On-call acknowledges page |
| 14:15 | Root cause identified: database migration timeout |
| 14:22 | Rollback initiated |
| 14:25 | Service recovered |

## Root Cause
What specifically caused the incident. Be precise.
"The database migration in v2.3.1 added an index on the 50M-row users table.
The migration timed out after 5 minutes, leaving the table in a locked state.
All queries to the users table blocked, causing 503 errors."

## Contributing Factors
- Migration was not tested against a production-sized dataset
- No timeout protection on migration Jobs
- Health check did not verify database connectivity

## What Went Well
- Alert fired within 5 minutes
- Rollback was quick (3 minutes)
- On-call engineer had access to all needed tools

## What Went Wrong
- Migration was not flagged as high-risk in code review
- Staging database had only 1000 rows (vs 50M in prod)
- Rollback required manual intervention

## Action Items
| # | Action | Owner | Priority | Due Date |
|---|--------|-------|----------|----------|
| 1 | Add migration dry-run step to CI pipeline | @alice | P1 | 2024-02-01 |
| 2 | Populate staging DB with production-scale data | @bob | P1 | 2024-02-01 |
| 3 | Add database connectivity to health check | @charlie | P2 | 2024-02-15 |
| 4 | Set activeDeadlineSeconds on migration Jobs | @alice | P2 | 2024-02-15 |
| 5 | Add postmortem writing to incident response training | @dave | P3 | 2024-03-01 |

## Lessons Learned
- Migrations on large tables need explicit testing at production scale
- Health checks should verify all critical dependencies

Gotcha: The most common postmortem failure mode: excellent analysis, action items created, then nobody tracks them. Within 6 weeks the action items are forgotten and the same incident recurs. The fix is simple: action items go into your normal issue tracker (Jira, Linear, GitHub Issues) with owners and due dates — not in a Google Doc that nobody re-reads. Track completion rate as a metric.

Postmortem Quality Checklist

  • Timeline has precise timestamps (UTC)
  • Root cause is specific (not "human error")
  • Contributing factors explain why the root cause wasn't caught
  • Action items have owners and due dates
  • Action items are specific (not "improve monitoring")
  • Postmortem was reviewed by at least 2 people
  • Postmortem was shared with the broader team

Incident Severity Levels

| Level | Meaning | Response | Postmortem? |
|-------|---------|----------|-------------|
| SEV-1 | Major customer impact, revenue loss | All hands, page leadership | Required |
| SEV-2 | Significant impact, degraded service | Page on-call team | Required |
| SEV-3 | Minor impact, workaround available | Next business day | Optional |
| SEV-4 | No customer impact, internal only | Best effort | No |

Measuring Reliability

Error Budget Dashboard

# Current success rate (SLI) — 1 minus the error ratio
1 - (
  sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[30d]))
  / sum(rate(http_requests_total{job="grokdevops"}[30d]))
)

# Error budget remaining
(
  1 - (
    sum(increase(http_requests_total{job="grokdevops",status=~"5.."}[30d]))
    / sum(increase(http_requests_total{job="grokdevops"}[30d]))
  )
) - 0.999  # subtract SLO target

# Error budget burn rate (should be < 1.0)
(
  sum(rate(http_requests_total{job="grokdevops",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="grokdevops"}[1h]))
) / 0.001  # divide by budget (1 - SLO)
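The burn-rate query translates directly into plain arithmetic. A minimal Python sketch with hypothetical request counts for the last hour:

```python
def burn_rate(error_requests: float, total_requests: float,
              slo: float = 0.999) -> float:
    """Error-budget burn rate over a window: the observed error ratio divided
    by the budget ratio (1 - SLO). 1.0 means the budget burns exactly over the
    SLO window; above 1.0 the budget will be exhausted early."""
    error_ratio = error_requests / total_requests
    return error_ratio / (1 - slo)

# Hypothetical last-hour counts: 60,000 requests, 300 of them 5xx.
rate = burn_rate(error_requests=300, total_requests=60_000)
print(f"burn rate: {rate:.1f}")  # 0.5% errors against a 0.1% budget = 5x burn
```

A sustained 5x burn rate means a 30-day budget would be gone in about 6 days, which is why burn-rate alerts page well before the raw error ratio looks alarming.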

Analogy: Error budgets work like a financial budget. You start the month with a "reliability allowance" (e.g., 43 minutes for a 99.9% SLO). Every incident "spends" some of that budget. When the budget is exhausted, you "freeze spending" — no risky deploys until reliability recovers. This framing makes SLOs tangible for product managers who otherwise struggle with abstract reliability targets.

Common Pitfalls

  1. Postmortem assigned to the person who caused the incident — Use a neutral facilitator.
  2. Action items never completed — Track them in your issue tracker with deadlines.
  3. Too many SLOs — Start with 1-2 SLOs per service (availability + latency).
  4. SLO too tight — 99.999% leaves 26 seconds/month. No human can respond in time.
  5. No error budget policy — Without consequences, SLOs are just numbers.
  6. Vague action items — "Improve monitoring" is not actionable. "Add latency alert at p99 > 500ms" is.

Wiki Navigation

Prerequisites

Next Steps

  • Change Management (Topic Pack, L1) — Incident Response
  • Chaos Engineering Scripts (CLI) (Exercise Set, L2) — Incident Response
  • Debugging Methodology (Topic Pack, L1) — Incident Response
  • Devops Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
  • Incident Command & On-Call (Topic Pack, L2) — Incident Response
  • Incident Response Flashcards (CLI) (flashcard_deck, L1) — Incident Response
  • Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Incident Response
  • Investigation Engine (CLI) (Exercise Set, L2) — Incident Response
  • Ops War Stories & Pattern Recognition (Topic Pack, L2) — Incident Response
  • Postmortem & SLO Drills (Drill, L2) — Postmortems & SLOs