Postmortem & SLO Cheat Sheet

Name origin: SLO comes from Google's Site Reliability Engineering (SRE) book (2016). The SLI/SLO/SLA hierarchy was formalized there, but the concepts trace back to telecom's "five nines" (99.999%) availability targets from the 1990s. The term "error budget" was coined by Google SRE Ben Treynor Sloss to reframe reliability as a finite, spendable resource rather than an absolute goal.

SLI / SLO / SLA

SLI (indicator)  →  What you measure
SLO (objective)  →  What you target internally
SLA (agreement)  →  What you promise contractually

SLO should be stricter than SLA (buffer zone)
Type Example
Availability SLI % of requests returning non-5xx
Latency SLI % of requests completing in < 300ms
Availability SLO 99.9% over 30-day rolling window
Availability SLA 99.5% (credits if breached)

Remember: The hierarchy flows downward in strictness: SLI (raw measurement) feeds into SLO (internal target) which must be stricter than SLA (contractual promise). Mnemonic: "I-O-A" — Indicator measures, Objective targets, Agreement promises. If your SLO equals your SLA, you have zero buffer for unexpected incidents.
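The buffer zone is easy to quantify: an availability target directly implies an allowed amount of downtime. A minimal sketch, using the 99.9% SLO and 99.5% SLA from the table above:

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability target over the window."""
    return (1 - target) * window_days * 24 * 60

# SLO (internal) vs. SLA (contractual), from the example table:
slo_minutes = allowed_downtime_minutes(0.999)   # ~43.2 min
sla_minutes = allowed_downtime_minutes(0.995)   # ~216 min

# The gap is the buffer: incidents burn through the SLO allowance
# long before the SLA (and customer credits) are at risk.
buffer_minutes = sla_minutes - slo_minutes      # ~172.8 min
```

Here the 0.5-point gap between targets translates to roughly four extra hours of headroom per month.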

Error Budget

Error budget = 1 - SLO target

For 99.9% SLO over 30 days:
  Budget = 0.1% of 43,200 minutes = 43.2 minutes of downtime allowed

Budget consumed = actual_downtime / total_budget
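The two formulas above can be sketched in Python (function names are illustrative, not from any library):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime the SLO allows over the window: (1 - SLO) * window."""
    return (1 - slo) * window_days * 24 * 60

def budget_consumed(actual_downtime_min: float, slo: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return actual_downtime_min / error_budget_minutes(slo, window_days)

budget = error_budget_minutes(0.999)   # ~43.2 minutes, as above
spent = budget_consumed(20, 0.999)     # a 20-minute outage: ~46% of budget
```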

Error Budget Policy

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal ops. Ship features. |
| 25-50% | Extra review for risky changes |
| 10-25% | Feature freeze. Reliability work only. |
| < 10% | No deploys except reliability fixes |
| 0% (breached) | Formal review. Reliability > features next quarter. |
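A policy like this is worth encoding so it is applied mechanically rather than argued about per-incident. A minimal sketch of the table as a lookup (thresholds and wording from the table above):

```python
def budget_policy(remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy action."""
    if remaining <= 0:
        return "breached: formal review, reliability > features next quarter"
    if remaining < 0.10:
        return "no deploys except reliability fixes"
    if remaining < 0.25:
        return "feature freeze, reliability work only"
    if remaining < 0.50:
        return "extra review for risky changes"
    return "normal ops, ship features"
```

Wiring this into a deploy pipeline gate makes the error budget an enforced contract instead of a dashboard nobody checks.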

Gotcha: Error budget math assumes uniform traffic. A 99.9% SLO over 30 days gives 43.2 minutes of downtime, but if 80% of your traffic happens during business hours, a 20-minute outage at peak costs far more error budget than 20 minutes at 3 AM. Consider weighting your SLI by traffic volume, not wall-clock time.
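One way to remove the wall-clock bias is to define the budget in requests rather than minutes. A sketch under that assumption (the traffic numbers are hypothetical):

```python
def request_budget(total_requests: int, slo: float) -> int:
    """Failed requests the SLO permits over the window."""
    return round(total_requests * (1 - slo))

def request_budget_consumed(failed: int, total: int, slo: float) -> float:
    """Fraction of the request-based budget spent."""
    return failed / request_budget(total, slo)

# 10M requests/month at 99.9%: 10,000 failed requests allowed.
# A 20-minute peak-hour outage failing 5,000 requests burns 50% of
# the budget; the same 20 minutes at 3 AM might fail only a few hundred.
```

With this definition, an outage's cost automatically scales with how many users it actually affected.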

Burn Rate Alerting

Burn rate = how fast you consume error budget
  1x = budget lasts exactly the full window
  10x = budget exhausted in 1/10th the window
| Burn Rate | Long Window | Short Window | Action |
|-----------|-------------|--------------|--------|
| 14.4x | 1h | 5m | Page (critical) |
| 6x | 6h | 30m | Page (high) |
| 3x | 3d | 6h | Ticket |
| 1x | 30d | 3d | Log only |
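Burn rate is just the observed error ratio divided by the budget rate (1 - SLO), and the two windows in each row guard against different failure modes. A sketch of both ideas:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget burns: 1.0 means it lasts the full window."""
    return error_ratio / (1 - slo)

def should_page(long_rate: float, short_rate: float,
                threshold: float) -> bool:
    """Multiwindow rule: BOTH windows must exceed the threshold, so a
    brief spike (short window only) or an already-recovered incident
    (long window only) does not fire."""
    return long_rate >= threshold and short_rate >= threshold

# With a 99.9% SLO, a sustained 1.44% error rate is a 14.4x burn,
# which exhausts a 30-day budget in about 2 days.
rate = burn_rate(0.0144, 0.999)
```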

Four Golden Signals

| Signal | What | Alert When |
|--------|------|------------|
| Latency | Request duration | p99 > threshold |
| Traffic | Request rate | Anomalous drop/spike |
| Errors | Error rate | > error budget burn rate |
| Saturation | Resource usage | > 80% capacity |

PromQL for SLOs

# Availability (success ratio)
sum(rate(http_requests_total{code!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))

# Latency (% under threshold)
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count[30d]))

# Error budget remaining (0 to 1)
1 - (1 - availability) / (1 - 0.999)
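The last expression can be sanity-checked outside Prometheus. For example, with 99.95% measured availability against a 99.9% SLO, exactly half the budget remains:

```python
def budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Same formula as the PromQL above: 1 - consumed/total budget."""
    return 1 - (1 - availability) / (1 - slo)

half_left = budget_remaining(0.9995)  # ~0.5: half the budget remains
none_left = budget_remaining(0.999)   # ~0.0: exactly at the SLO
```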

Blameless Postmortem Template

# Incident: [Title]

## Summary
Date | Duration | Severity | Impact

## Timeline (UTC)
- HH:MM — Event
- HH:MM — Detection
- HH:MM — Response
- HH:MM — Mitigation
- HH:MM — Resolution

## Root Cause
[Technical: what broke and why]

## Contributing Factors
[Process/system gaps that allowed it]

## What Went Well
[Detection speed, teamwork, etc.]

## What Went Poorly
[Gaps in monitoring, slow response, etc.]

## Action Items
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|

## Lessons Learned

Incident Severity

| SEV | Criteria | Response | Postmortem |
|-----|----------|----------|------------|
| 1 | Full outage / data loss | 5 min, all-hands | Required |
| 2 | Major degradation | 15 min, oncall + backup | Required |
| 3 | Minor feature down | 1 hour, oncall | Optional |
| 4 | Cosmetic / low impact | Next business day | No |

Under the hood: "Blameless" does not mean "without accountability." A blameless postmortem focuses on systemic causes (why did the system allow this failure?) rather than individual blame (who caused this?). The goal is to make it psychologically safe to report honestly, which leads to better root-cause analysis. If people fear punishment, they hide information, and the real causes go unfixed.

Postmortem Anti-Patterns

  • Blaming individuals instead of systems
  • No action items (or action items with no owners)
  • "Be more careful" as an action item
  • Not sharing with the broader team
  • Not following up on action items
  • Writing the postmortem but never reviewing it