Anti-Primer: Capacity Planning

Everything that can go wrong, will — and in this story, it does.

The Setup

An SRE team is implementing Capacity Planning practices for a production system with a 99.95% SLA. The system has been running on heroics and tribal knowledge. The team has 4 weeks to formalize processes before the next quarterly review.

The Timeline

Hour 0: SLOs Without Measurement

The engineer defines SLOs on a whiteboard but does not instrument the system to measure them. The deadline was looming, and this seemed like the fastest path forward. As a result, the SLOs are aspirational fiction: nobody knows whether they are met or breached until customers complain.

Footgun #1: SLOs Without Measurement — the engineer defines SLOs on a whiteboard but does not instrument the system to measure them, so the SLOs are aspirational fiction; nobody knows whether they are met or breached until customers complain.
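
To make the gap concrete, here is a minimal sketch of what "measuring" could mean for this footgun. It assumes a request-based availability SLI and treats the 99.95% figure from the setup as the SLO target; the `SliWindow` shape and the sample counts are hypothetical, not taken from the system in the story.

```python
from dataclasses import dataclass

SLO_TARGET = 0.9995  # assumption: the 99.95% from the setup, used as an availability SLO


@dataclass
class SliWindow:
    """Request counts for one measurement window (hypothetical data shape)."""
    total_requests: int
    failed_requests: int

    @property
    def availability(self) -> float:
        # With no traffic there is nothing to breach, so report 1.0.
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests


def error_budget_remaining(window: SliWindow, target: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent; a negative value means a breach."""
    allowed_failures = (1 - target) * window.total_requests
    if allowed_failures == 0:
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1 - window.failed_requests / allowed_failures


# Made-up counts for illustration only.
window = SliWindow(total_requests=1_000_000, failed_requests=300)
print(f"availability: {window.availability:.4%}")
print(f"budget left:  {error_budget_remaining(window):.0%}")
```

Without counts like these flowing in from real instrumentation, neither number can be computed, which is exactly the state the team is in at Hour 0.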

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Incident Response Without Roles

Everyone jumps on an incident simultaneously with no assigned roles. Under time pressure, the team chose speed over caution. But the result is duplicated effort, conflicting actions, and nobody communicating status to stakeholders.

Footgun #2: Incident Response Without Roles — everyone jumps on an incident simultaneously with no assigned roles, leading to duplicated effort, conflicting actions, and nobody communicating status to stakeholders.
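
One way to picture the missing structure is a check that refuses to call an incident "staffed" until each role has an owner. The sketch below uses the three roles named in the Root Cause Chain (incident commander, communication lead, technical lead); the `Incident` class, the incident title, and the responder names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "incident commander"
    COMMUNICATION_LEAD = "communication lead"
    TECHNICAL_LEAD = "technical lead"


@dataclass
class Incident:
    title: str
    assignments: dict[Role, str] = field(default_factory=dict)

    def assign(self, role: Role, responder: str) -> None:
        self.assignments[role] = responder

    def unfilled_roles(self) -> list[Role]:
        # Roles are returned in declaration order.
        return [role for role in Role if role not in self.assignments]

    def ready_to_respond(self) -> bool:
        return not self.unfilled_roles()


# Hypothetical incident: two of the three roles are filled.
incident = Incident("elevated error rate")
incident.assign(Role.INCIDENT_COMMANDER, "alice")
incident.assign(Role.TECHNICAL_LEAD, "bob")
print([role.value for role in incident.unfilled_roles()])
```

In the story, the unfilled role would be the one that keeps stakeholders informed, which is why nobody hears a status update.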

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Toil Accepted as Normal

Manual operational tasks are performed weekly without tracking or reduction plans. Nobody pushed back because the shortcut looked harmless in the moment. As a result, toil grows as the system scales; the team spends 80% of its time on manual work and 20% on improvements.

Footgun #3: Toil Accepted as Normal — manual operational tasks are performed weekly without tracking or reduction plans, so toil grows as the system scales; the team spends 80% of its time on manual work and 20% on improvements.
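
Tracking toil does not need much machinery. The sketch below assumes the team logs time entries tagged as toil or not; the task names and hours are invented so that they reproduce the 80/20 split from the story, and the 50% budget is the example figure from the Root Cause Chain.

```python
from collections import defaultdict

TOIL_BUDGET = 0.50  # the "max 50%" example budget from the Root Cause Chain

# Hypothetical week of time entries: (task, hours, is_toil)
entries = [
    ("manual certificate rotation", 6.0, True),
    ("restarting stuck workers", 10.0, True),
    ("hand-built capacity report", 16.0, True),
    ("autoscaling design work", 8.0, False),
]


def toil_fraction(entries):
    """Share of logged hours spent on toil."""
    total = sum(hours for _, hours, _ in entries)
    toil = sum(hours for _, hours, is_toil in entries if is_toil)
    return toil / total if total else 0.0


def automation_candidates(entries):
    """Toil tasks ranked by hours consumed, largest first."""
    hours_by_task = defaultdict(float)
    for task, hours, is_toil in entries:
        if is_toil:
            hours_by_task[task] += hours
    return sorted(hours_by_task.items(), key=lambda item: item[1], reverse=True)


fraction = toil_fraction(entries)
print(f"toil: {fraction:.0%} (budget {TOIL_BUDGET:.0%})")
if fraction > TOIL_BUDGET:
    for task, hours in automation_candidates(entries):
        print(f"  automate next: {task} ({hours:g} h/week)")
```

Ranking by hours is only one possible ordering; a team might instead weight by error-proneness or on-call pain.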

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Postmortems Without Action Items

The team writes detailed blameless postmortems but never follows through on the action items. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, the same incident recurs 3 months later, and the team loses faith in the postmortem process.

Footgun #4: Postmortems Without Action Items — the team writes detailed blameless postmortems but never follows through on the action items, so the same incident recurs 3 months later and the team loses faith in the postmortem process.
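
Following through mostly means making open action items visible on a schedule. The sketch below is one assumed shape for that: a list of action items with owners and due dates, and a helper that surfaces the overdue ones for a weekly review. The `ActionItem` fields and sample data are hypothetical; a real setup would pull these from the team's issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    postmortem: str
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Open action items past their due date, oldest first."""
    return sorted(
        (item for item in items if not item.done and item.due < today),
        key=lambda item: item.due,
    )


# Hypothetical backlog for illustration.
items = [
    ActionItem("PM-1", "add alert on queue depth", "alice", date(2024, 1, 15)),
    ActionItem("PM-1", "document failover steps", "bob", date(2024, 1, 10), done=True),
    ActionItem("PM-2", "load-test the new tier", "carol", date(2024, 1, 5)),
    ActionItem("PM-2", "raise connection limit", "dave", date(2024, 3, 1)),
]

for item in overdue(items, today=date(2024, 2, 1)):
    print(f"{item.due} {item.owner}: {item.description} ({item.postmortem})")
```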

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | SLOs Without Measurement | SLOs are aspirational fiction; nobody knows if they are met or breached until customers complain | Primer: Instrument SLI measurements before committing to SLO targets |
| 2 | Incident Response Without Roles | Duplicated effort, conflicting actions, and nobody communicating status to stakeholders | Primer: Define incident commander, communication lead, and technical lead roles before incidents happen |
| 3 | Toil Accepted as Normal | Toil grows as the system scales; the team spends 80% of time on manual work and 20% on improvements | Primer: Measure toil; set a budget (e.g., max 50%); automate the most time-consuming tasks first |
| 4 | Postmortems Without Action Items | The same incident recurs 3 months later; team loses faith in the postmortem process | Primer: Track action items in the issue tracker; review completion in weekly meetings |

Damage Report

  • Downtime: 1-4 hours of uncoordinated incident response
  • Data loss: Possible if remediation actions conflict or are applied incorrectly
  • Customer impact: Extended customer-facing impact due to slow or chaotic response
  • Engineering time to remediate: 12-24 engineer-hours across response, remediation, and postmortem
  • Reputation cost: Stakeholder confidence in SRE practices shaken; process overhaul demanded
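
To put the downtime figure in context, it helps to work out how much downtime a 99.95% target allows. The sketch below assumes the SLA is measured as time-based availability over a fixed window; the source does not say how the SLA is defined or over what period, so the window lengths are illustrative.

```python
SLA_TARGET = 0.9995  # the 99.95% SLA from the setup


def allowed_downtime_minutes(target, window_days):
    """Downtime budget in minutes for a time-based availability target."""
    return (1 - target) * window_days * 24 * 60


# Assumed measurement windows; the real SLA window is not stated in the story.
for label, days in [("30-day month", 30), ("90-day quarter", 90)]:
    budget = allowed_downtime_minutes(SLA_TARGET, days)
    print(f"{label}: {budget:.1f} min allowed")
```

Under these assumptions the budget is about 21.6 minutes per 30 days, so even the low end of the 1-4 hour range would exceed a month's allowance in a single incident.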

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer section on SLOs Without Measurement, they would have learned: Instrument SLI measurements before committing to SLO targets.
  • Footgun #2: If the engineer had read the primer section on Incident Response Without Roles, they would have learned: Define incident commander, communication lead, and technical lead roles before incidents happen.
  • Footgun #3: If the engineer had read the primer section on Toil Accepted as Normal, they would have learned: Measure toil; set a budget (e.g., max 50%); automate the most time-consuming tasks first.
  • Footgun #4: If the engineer had read the primer section on Postmortems Without Action Items, they would have learned: Track action items in the issue tracker; review completion in weekly meetings.

Cross-References