Postmortem & SLO Footguns

Mistakes that make your SLOs useless, your postmortems performative, or your reliability culture toxic.


1. SLO set unrealistically high

You set your availability SLO at 99.99% because "we should aim high." Your error budget for the month is 4.3 minutes. A single deploy takes longer than that. You can't deploy, can't do maintenance, can't experiment. The SLO is violated every month and nobody takes it seriously.

Fix: Start with a realistic SLO (99.5% or 99.9%). Tighten it as your system matures. An SLO that's always violated is worse than no SLO.

Remember: over a 30-day month, 99.9% = 43.2 minutes of allowed downtime, 99.95% = 21.6 minutes, 99.99% = 4.3 minutes. Pick the number where violating it would actually trigger a meaningful response, not the number that sounds impressive.
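The arithmetic above is worth having as a one-liner, since it's the first thing to check before committing to a target. A minimal sketch, assuming a 30-day (43,200-minute) window to match the figures quoted:

```python
# Error budget in minutes for an availability SLO, over a 30-day window.
def error_budget_minutes(slo_percent: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per window for a given SLO target."""
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.5, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.1f} min/month")
# 99.5%  -> 216.0 min/month
# 99.9%  -> 43.2 min/month
# 99.95% -> 21.6 min/month
# 99.99% -> 4.3 min/month
```

If a routine deploy takes longer than the budget line for your proposed SLO, the target is too tight.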


2. SLI that doesn't reflect user experience

Your SLI is up{job="api"} == 1 — is the pod running? The pod is running, but it's returning 500 errors to every request. Your SLI says 100% availability while users can't use the product.

Fix: Measure at the user boundary: successful request ratio, latency at the load balancer, end-to-end transaction success. The SLI should answer "can users do what they need to do?"
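A sketch of what "measure at the user boundary" means in practice, using a hypothetical load-balancer log format (the record fields are illustrative): both HTTP 5xx responses and client-side timeouts count as failures, so a pod that is up but erroring correctly drags the SLI down.

```python
# A request-success SLI computed at the user boundary, not from pod health.
def availability_sli(requests: list[dict]) -> float:
    """Fraction of requests that succeeded from the user's perspective."""
    if not requests:
        return 1.0  # no traffic: treat as meeting the SLO
    ok = sum(1 for r in requests if r["status"] < 500 and not r["timed_out"])
    return ok / len(requests)

logs = [
    {"status": 200, "timed_out": False},
    {"status": 500, "timed_out": False},  # pod is "up", but erroring
    {"status": 200, "timed_out": True},   # server logged 200, user never saw it
    {"status": 200, "timed_out": False},
]
print(availability_sli(logs))  # 0.5
```

A pod-health SLI like up{job="api"} would report 100% for the same traffic.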


3. Error budget consumed by non-incidents

Your SLO is 99.9%. A dependency has a 5-minute blip. Another dependency has a 10-minute blip. Planned maintenance takes 15 minutes. You've burned 30 of your ~43 monthly error-budget minutes on things that aren't your fault and aren't incidents.

Fix: Separate internal vs external error budget impact. Exclude planned maintenance from SLI calculations (or budget for it explicitly). Attribute errors to the responsible team/dependency.
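One way to make the attribution concrete is to tag every outage with a cause and account for the budget per category. A sketch with illustrative categories and figures, matching the scenario above:

```python
# Attribute error-budget burn by cause, so "not our fault" minutes
# don't silently eat the budget. Causes and numbers are illustrative.
from collections import defaultdict

BUDGET_MIN = 43.2  # 99.9% SLO over a 30-day month

outages = [
    {"minutes": 5,  "cause": "dependency:payments-api"},
    {"minutes": 10, "cause": "dependency:auth-provider"},
    {"minutes": 15, "cause": "planned-maintenance"},
    {"minutes": 4,  "cause": "internal:bad-deploy"},
]

burn = defaultdict(float)
for o in outages:
    if o["cause"] != "planned-maintenance":  # excluded per policy
        burn[o["cause"]] += o["minutes"]

internal = sum(m for c, m in burn.items() if c.startswith("internal:"))
print(f"internal burn: {internal} of {BUDGET_MIN} min")
```

Now the monthly review can answer "what did *we* break?" separately from "which dependency do we need to push on?".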


4. Postmortem blame disguised as "action items"

Your postmortem says "Action: Developer X should review changes more carefully." This is blame dressed as an action item. Developer X feels attacked. Nobody reports near-misses anymore. Your postmortem culture is dead.

Fix: Focus on systemic failures, not individual behavior. Ask "what about the system allowed this to happen?" Good action item: "Add integration test for config validation in CI." Bad: "Developer should be more careful."

Under the hood: the "Just Culture" framework, popularized in tech by Etsy's blameless postmortems, categorizes errors as human error (the system failed the person), at-risk behavior (the system incentivized shortcuts), or reckless behavior (rare, and requires a pattern). Almost all incidents fall in the first category: the system made it easy to fail.


5. Action items that never get done

Every postmortem produces 5 action items. They go into a Jira backlog. They get deprioritized against features. The same failure mode causes another incident 3 months later. The postmortem cites the same unfixed action items.

Fix: Assign owners and deadlines to every action item. Track completion rate. Limit action items to 3 high-impact changes. Review open items weekly in team standup.


6. Measuring availability from the server side only

Your server reports 99.95% availability. But 2% of requests time out at the client before the server even logs them. Mobile users on slow networks experience completely different reliability from what your metrics show.

Fix: Measure from the client perspective too — synthetic monitoring, real user monitoring (RUM), or edge/CDN metrics. Server-side metrics are necessary but not sufficient.

Gotcha: TCP retransmits are invisible to server-side HTTP metrics. A request the server measured at 200ms may have taken 3 seconds at the client because of packet loss and retransmits. Only client-side or network-layer metrics capture this.


7. Burn rate alerts with no context

You set up multi-window burn rate alerts per the SRE book. The alert for 5% budget burn fires. On-call opens it and asks: what exactly is failing? The alert just says "error budget burning". No service name, no affected endpoint, no start timestamp.

Fix: Include context in alert annotations: which service, which SLI, current burn rate, link to relevant dashboard, estimated time until budget exhaustion.
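For reference, burn rate is just the observed error ratio divided by the error ratio the SLO allows: a rate of 14.4 sustained over one hour burns about 2% of a 30-day budget, which is the classic fast-page threshold. The sketch below computes that and shows the kind of context a useful alert payload carries; the field names and dashboard URL are illustrative, not a real alerting API.

```python
# Burn rate = observed error ratio / error ratio the SLO allows.
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    allowed = 1 - slo_percent / 100
    return error_ratio / allowed

rate = burn_rate(error_ratio=0.0144, slo_percent=99.9)  # ~14.4
hours_to_exhaustion = (30 * 24) / rate  # whole budget gone at this pace

# What a context-rich alert annotation might carry (illustrative fields):
alert = {
    "service": "checkout-api",
    "sli": "http_request_success_ratio",
    "burn_rate_1h": round(rate, 1),
    "hours_until_budget_exhausted": round(hours_to_exhaustion, 1),
    "dashboard": "https://grafana.example.com/d/checkout-slo",  # hypothetical
}
print(alert)
```

With this payload, on-call knows which service, which SLI, how fast the budget is going, and where to look next before the first dashboard loads.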


8. SLOs without consequences

You define SLOs but nothing happens when they're violated. Teams consistently miss their SLO. Leadership asks "why are we tracking this?" Nobody changes behavior because the SLO has no teeth.

Fix: Define error budget policies: when budget is exhausted, freeze feature releases and focus on reliability. When budget is healthy, deploy faster and experiment. The SLO should drive actual decisions.
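An error budget policy can literally be written down as a decision function, so the SLO maps to a concrete release posture instead of a dashboard nobody acts on. A minimal sketch; the thresholds are illustrative:

```python
# A minimal error-budget policy as code: remaining budget drives the
# release decision. Thresholds are illustrative, not prescriptive.
def release_policy(budget_remaining_fraction: float) -> str:
    if budget_remaining_fraction <= 0:
        return "freeze: reliability work only"
    if budget_remaining_fraction < 0.25:
        return "caution: ship fixes, defer risky launches"
    return "healthy: deploy and experiment freely"

print(release_policy(0.6))   # prints "healthy: deploy and experiment freely"
print(release_policy(-0.1))  # prints "freeze: reliability work only"
```

The important part is that the policy is agreed on in advance, so the release freeze is an automatic consequence rather than a negotiation during an incident review.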


9. Postmortem with no timeline

Your postmortem says "something broke and we fixed it." It doesn't say when the issue started, when it was detected, when the root cause was found, or when the fix was deployed. You can't measure time-to-detect or time-to-recover.

Fix: Every postmortem needs a detailed timeline: incident start, detection, escalation, diagnosis, mitigation, resolution, communication milestones. From it, measure MTTD (time to detect) and MTTR (time to recover).
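Once the timeline has real timestamps, the detection and recovery metrics fall out mechanically. A sketch with illustrative timestamps:

```python
# Deriving detect/recover durations from a postmortem timeline.
# The timestamps below are illustrative; without them, neither
# metric is measurable at all.
from datetime import datetime

fmt = "%Y-%m-%d %H:%M"
timeline = {
    "start":     datetime.strptime("2024-05-01 14:02", fmt),  # first bad deploy
    "detected":  datetime.strptime("2024-05-01 14:19", fmt),  # paging alert fired
    "mitigated": datetime.strptime("2024-05-01 14:41", fmt),  # rollback finished
}

ttd = (timeline["detected"] - timeline["start"]).total_seconds() / 60
ttr = (timeline["mitigated"] - timeline["start"]).total_seconds() / 60
print(f"time to detect: {ttd:.0f} min, time to recover: {ttr:.0f} min")
# prints "time to detect: 17 min, time to recover: 39 min"
```

Aggregating these per-incident durations across a quarter gives you the "mean" in MTTD/MTTR, and tells you whether your detection or your recovery is the bottleneck.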


10. Conflating SLAs and SLOs

You set your SLO equal to your SLA (99.9%). You have zero margin. A single missed SLO means an SLA breach with financial penalties. Teams game the metrics instead of improving reliability.

Fix: SLO should be stricter than SLA. If your SLA is 99.9%, set your SLO at 99.95%. The gap gives you a buffer to detect and fix problems before they become contractual violations.

Remember: SLA = external contract with financial penalties. SLO = internal target that triggers engineering action. SLI = the actual measurement. SLA without SLO means you only learn about reliability problems when the customer invoice arrives.