Postmortem & SLO Cheat Sheet

Name origin: SLO comes from Google's Site Reliability Engineering (SRE) book (2016). The SLI/SLO/SLA hierarchy was formalized there, but the concepts trace back to telecom's "five nines" (99.999%) availability targets from the 1990s. The term "error budget" was coined by Google SRE Ben Treynor Sloss to reframe reliability as a finite, spendable resource rather than an absolute goal.

SLI / SLO / SLA

SLI (indicator)  →  What you measure
SLO (objective)  →  What you target internally
SLA (agreement)  →  What you promise contractually

SLO should be stricter than SLA (buffer zone)
Type Example
Availability SLI % of requests returning non-5xx
Latency SLI % of requests completing in < 300ms
Availability SLO 99.9% over 30-day rolling window
Availability SLA 99.5% (credits if breached)

Remember: The hierarchy flows downward in strictness: SLI (raw measurement) feeds into SLO (internal target) which must be stricter than SLA (contractual promise). Mnemonic: "I-O-A" — Indicator measures, Objective targets, Agreement promises. If your SLO equals your SLA, you have zero buffer for unexpected incidents.
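The buffer zone is easy to quantify: an availability target directly implies an allowed amount of downtime. A minimal sketch, using the 99.9% SLO and 99.5% SLA from the table above:

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability target over the window."""
    return (1 - target) * window_days * 24 * 60

# SLO (internal) vs. SLA (contractual), from the example table:
slo_minutes = allowed_downtime_minutes(0.999)   # ~43.2 min
sla_minutes = allowed_downtime_minutes(0.995)   # ~216 min

# The gap is the buffer: incidents burn through the SLO allowance
# long before the SLA (and customer credits) are at risk.
buffer_minutes = sla_minutes - slo_minutes      # ~172.8 min
```

Here the 0.5-point gap between targets translates to roughly four extra hours of headroom per month.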

Error Budget

Error budget = 1 - SLO target

For 99.9% SLO over 30 days:
  Budget = 0.1% of 43,200 minutes = 43.2 minutes of downtime allowed

Budget consumed = actual_downtime / total_budget
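The two formulas above can be sketched in Python (function names are illustrative, not from any library):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime the SLO allows over the window: (1 - SLO) * window."""
    return (1 - slo) * window_days * 24 * 60

def budget_consumed(actual_downtime_min: float, slo: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return actual_downtime_min / error_budget_minutes(slo, window_days)

budget = error_budget_minutes(0.999)   # ~43.2 minutes, as above
spent = budget_consumed(20, 0.999)     # a 20-minute outage: ~46% of budget
```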

Error Budget Policy

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal ops. Ship features. |
| 25-50% | Extra review for risky changes |
| 10-25% | Feature freeze. Reliability work only. |
| < 10% | No deploys except reliability fixes |
| 0% (breached) | Formal review. Reliability > features next quarter. |
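A policy like this is worth encoding so it is applied mechanically rather than argued about per-incident. A minimal sketch of the table as a lookup (thresholds and wording from the table above):

```python
def budget_policy(remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy action."""
    if remaining <= 0:
        return "breached: formal review, reliability > features next quarter"
    if remaining < 0.10:
        return "no deploys except reliability fixes"
    if remaining < 0.25:
        return "feature freeze, reliability work only"
    if remaining < 0.50:
        return "extra review for risky changes"
    return "normal ops, ship features"
```

Wiring this into a deploy pipeline gate makes the error budget an enforced contract instead of a dashboard nobody checks.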

Gotcha: Error budget math assumes uniform traffic. A 99.9% SLO over 30 days gives 43.2 minutes of downtime, but if 80% of your traffic happens during business hours, a 20-minute outage at peak costs far more error budget than 20 minutes at 3 AM. Consider weighting your SLI by traffic volume, not wall-clock time.
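One way to remove the wall-clock bias is to define the budget in requests rather than minutes. A sketch under that assumption (the traffic numbers are hypothetical):

```python
def request_budget(total_requests: int, slo: float) -> int:
    """Failed requests the SLO permits over the window."""
    return round(total_requests * (1 - slo))

def request_budget_consumed(failed: int, total: int, slo: float) -> float:
    """Fraction of the request-based budget spent."""
    return failed / request_budget(total, slo)

# 10M requests/month at 99.9%: 10,000 failed requests allowed.
# A 20-minute peak-hour outage failing 5,000 requests burns 50% of
# the budget; the same 20 minutes at 3 AM might fail only a few hundred.
```

With this definition, an outage's cost automatically scales with how many users it actually affected.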

Burn Rate Alerting

Burn rate = how fast you consume error budget
  1x = budget lasts exactly the full window
  10x = budget exhausted in 1/10th the window
| Burn Rate | Long Window | Short Window | Action |
|-----------|-------------|--------------|--------|
| 14.4x | 1h | 5m | Page (critical) |
| 6x | 6h | 30m | Page (high) |
| 3x | 3d | 6h | Ticket |
| 1x | 30d | 3d | Log only |
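Burn rate is just the observed error ratio divided by the budget rate (1 - SLO), and the two windows in each row guard against different failure modes. A sketch of both ideas:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget burns: 1.0 means it lasts the full window."""
    return error_ratio / (1 - slo)

def should_page(long_rate: float, short_rate: float,
                threshold: float) -> bool:
    """Multiwindow rule: BOTH windows must exceed the threshold, so a
    brief spike (short window only) or an already-recovered incident
    (long window only) does not fire."""
    return long_rate >= threshold and short_rate >= threshold

# With a 99.9% SLO, a sustained 1.44% error rate is a 14.4x burn,
# which exhausts a 30-day budget in about 2 days.
rate = burn_rate(0.0144, 0.999)
```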

Four Golden Signals

| Signal | What | Alert When |
|--------|------|------------|
| Latency | Request duration | p99 > threshold |
| Traffic | Request rate | Anomalous drop/spike |
| Errors | Error rate | > error budget burn rate |
| Saturation | Resource usage | > 80% capacity |

PromQL for SLOs

# Availability (success ratio)
sum(rate(http_requests_total{code!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))

# Latency (% under threshold)
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count[30d]))

# Error budget remaining (0 to 1)
1 - (1 - availability) / (1 - 0.999)
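The last expression can be sanity-checked outside Prometheus. For example, with 99.95% measured availability against a 99.9% SLO, exactly half the budget remains:

```python
def budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Same formula as the PromQL above: 1 - consumed/total budget."""
    return 1 - (1 - availability) / (1 - slo)

half_left = budget_remaining(0.9995)  # ~0.5: half the budget remains
none_left = budget_remaining(0.999)   # ~0.0: exactly at the SLO
```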

Blameless Postmortem Template

# Incident: [Title]

## Summary
Date | Duration | Severity | Impact

## Timeline (UTC)
- HH:MM — Event
- HH:MM — Detection
- HH:MM — Response
- HH:MM — Mitigation
- HH:MM — Resolution

## Root Cause
[Technical: what broke and why]

## Contributing Factors
[Process/system gaps that allowed it]

## What Went Well
[Detection speed, teamwork, etc.]

## What Went Poorly
[Gaps in monitoring, slow response, etc.]

## Action Items
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|

## Lessons Learned

Incident Severity

| SEV | Criteria | Response | Postmortem |
|-----|----------|----------|------------|
| 1 | Full outage / data loss | 5 min, all-hands | Required |
| 2 | Major degradation | 15 min, oncall + backup | Required |
| 3 | Minor feature down | 1 hour, oncall | Optional |
| 4 | Cosmetic / low impact | Next business day | No |

Under the hood: "Blameless" does not mean "without accountability." A blameless postmortem focuses on systemic causes (why did the system allow this failure?) rather than individual blame (who caused this?). The goal is to make it psychologically safe to report honestly, which leads to better root-cause analysis. If people fear punishment, they hide information, and the real causes go unfixed.

Postmortem Anti-Patterns

  • Blaming individuals instead of systems
  • No action items (or action items with no owners)
  • "Be more careful" as an action item
  • Not sharing with the broader team
  • Not following up on action items
  • Writing the postmortem but never reviewing it