SRE Practices Footguns¶
Mistakes that turn your SRE practice into rebranded ops, your error budgets into fiction, and your reliability culture into theater.
1. Adopting SRE titles without SRE practices¶
You rename your ops team to "SRE" but nothing changes. They still do manual deploys, fight fires, and have no time for engineering. Leadership checks the "we do SRE" box. The team burns out at the same rate, now with fancier titles.
Fix: SRE is a practice, not a title. The minimum viable SRE adoption: define SLOs, implement error budgets, and allocate at least 50% of SRE time to engineering work. If you can't commit to that, keep the honest title.
Remember: Google's SRE book defines the 50% rule: SREs should spend at most 50% of their time on operational work (toil). If toil exceeds 50%, the team redirects work to the development team until the balance is restored. Without this enforcement mechanism, "SRE" is just "ops with a raise."
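The 50% cap only works if someone actually computes the ratio and acts on it. A minimal sketch of that check, assuming you already log hours per category (the team data and `toil_ratio` helper here are illustrative, not from any real tracker):

```python
# Sketch: flag a team whose toil ratio exceeds the 50% cap.
# The hours below are illustrative sample data.
weekly_hours = {
    "alice": {"toil": 25, "engineering": 15},
    "bob": {"toil": 30, "engineering": 10},
}

def toil_ratio(hours: dict) -> float:
    """Fraction of tracked time spent on toil, across the whole team."""
    toil = sum(h["toil"] for h in hours.values())
    total = sum(h["toil"] + h["engineering"] for h in hours.values())
    return toil / total

ratio = toil_ratio(weekly_hours)
if ratio > 0.5:
    # Policy action from the SRE book: redirect operational work
    # to the development team until balance is restored.
    print(f"Toil at {ratio:.0%}: redirect ops work to the dev team")
```

The point of the sketch is the trigger, not the arithmetic: the ratio check must be wired to a standing policy action, or it is just another dashboard.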
2. Setting SLOs without involving product¶
Engineering picks SLO targets in isolation. Product doesn't know they exist. When the error budget burns and engineering declares a feature freeze, product is blindsided. The SLO gets overridden by executive fiat. Trust in the SRE process dies.
Fix: SLOs are a contract between engineering and product. Define them together. Both sides must agree on the target and the error budget policy before the first incident. The SLO negotiation is the most important SRE conversation you'll have.
Gotcha: "99.9% availability" sounds like a reasonable SLO, but do you measure it per-request or per-minute? Per-request: 1 in 1,000 requests can fail. Per-minute: you can be down for 43 minutes/month. These behave very differently in practice. Under per-minute accounting, a 30-second blip gets rounded up to a whole bad minute, while a partial outage that fails 20% of requests for an hour can burn a large chunk of a per-request budget yet leave every minute looking "up" to a coarse prober. Define the measurement window and method explicitly.
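The divergence is easy to see with concrete numbers. A minimal sketch, assuming a 30-day month and a steady 500 req/s (the traffic rate and the partial-outage scenario are illustrative assumptions):

```python
# Sketch: what "99.9%" means under each measurement method.
SLO = 0.999
MONTH_MINUTES = 30 * 24 * 60               # 43,200 minutes in a 30-day month

# Time-based budget: allowed downtime per month.
downtime_budget = (1 - SLO) * MONTH_MINUTES        # 43.2 minutes

# Request-based budget: allowed failed requests per month.
rps = 500                                   # assumed steady request rate
month_requests = rps * 60 * MONTH_MINUTES
failure_budget = (1 - SLO) * month_requests # ~1.3 million failures

# Partial outage: 20% of requests fail for one hour.
failed = 0.2 * rps * 3600                   # 360,000 failed requests
request_burn = failed / failure_budget      # ~28% of the monthly budget

# A per-minute prober that marks a minute "down" only on majority
# failure would record zero bad minutes for that same hour.
print(f"request-based burn: {request_burn:.0%}, time-based burn: 0%")
```

Same nominal target, same incident, wildly different budget burn: that gap is why the measurement method belongs in the SLO definition itself.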
3. Toil tracking without authority to stop it¶
Engineers dutifully log 65% toil week after week. The data goes into a report. Leadership says "we'll address it next quarter." Next quarter, there's a new priority. The toil ratio hits 75%. Your best people quit because they didn't become engineers to restart pods all day.
Fix: Toil data must connect to a budget. "We spend 40 engineer-hours/week on toil. That's $X/year. Automating the top 3 items costs $Y and recovers Z hours/week." Make it a financial conversation, not just a feelings conversation.
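That financial framing can be a two-line calculation. A sketch, where the loaded rate, hours, and automation cost are all illustrative assumptions you would replace with your own numbers:

```python
# Sketch: turn toil hours into dollars and a payback period.
TOIL_HOURS_PER_WEEK = 40
LOADED_RATE = 120           # assumed fully-loaded $/engineer-hour
WEEKS_PER_YEAR = 48         # assumed working weeks

annual_toil_cost = TOIL_HOURS_PER_WEEK * LOADED_RATE * WEEKS_PER_YEAR

def payback_weeks(automation_cost: float, hours_recovered_per_week: float) -> float:
    """Weeks until an automation project pays for itself."""
    return automation_cost / (hours_recovered_per_week * LOADED_RATE)

print(f"Toil costs ${annual_toil_cost:,}/year")
print(f"Payback on a $30k automation recovering 15 h/week: "
      f"{payback_weeks(30_000, 15):.1f} weeks")
```

A payback period under a quarter is a much harder proposal for leadership to defer than "the team feels burned out."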
4. Capacity planning only for compute¶
You model CPU and memory growth but forget about connection pools, file descriptors, DNS query volume, database connections, or Kubernetes pod limits. Your nodes have 40% CPU headroom but your app crashes because it exhausted the ephemeral port range.
Fix: Capacity plan for every resource your service consumes. Start with the obvious (CPU, memory, disk, network) but include application-level limits: connection pools, thread counts, queue depths, rate limits, and any hard caps in your stack.
Under the hood: the Linux ephemeral port range is 32768-60999 by default (28,232 ports). A service making outbound connections to a single destination can exhaust this in seconds under load.
`ss -s` shows current socket counts. `net.ipv4.ip_local_port_range` is tunable, but there's always a ceiling.
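If you want this as a monitorable number rather than an ad-hoc check, the kernel exposes both the range and the socket table under `/proc`. A rough sketch, assuming a Linux host (note `/proc/net/tcp` counts all TCP sockets in the namespace, so this is an upper-bound proxy, not an exact per-destination count):

```python
# Sketch: estimate ephemeral-port headroom on Linux via /proc.
def ephemeral_port_range() -> tuple[int, int]:
    # Standard kernel interface for net.ipv4.ip_local_port_range.
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        low, high = map(int, f.read().split())
    return low, high

def count_tcp_sockets() -> int:
    # Each line after the header in /proc/net/tcp is one IPv4 socket.
    with open("/proc/net/tcp") as f:
        return sum(1 for _ in f) - 1

low, high = ephemeral_port_range()
capacity = high - low + 1
in_use = count_tcp_sockets()
print(f"ephemeral range {low}-{high}: {in_use} TCP sockets vs ~{capacity} ports")
```

Exporting that ratio as a metric turns "we hit the port limit" from a surprise outage into a capacity-planning line item.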
5. Error budget policy with no teeth¶
Your error budget policy says "feature freeze when budget exhausted." Budget exhausted in March. VP of Product says "we can't freeze, the board demo is next week." Engineering caves. The policy is now known to be optional. Nobody takes it seriously again.
Fix: Error budget policies need executive sponsorship before the first violation. Get the VP of Engineering and VP of Product to co-sign the policy. When it triggers, the sponsoring exec enforces it. This must be established in peacetime, not during an incident.
6. PRR checklist that's all "Yes"¶
Every item on your Production Readiness Review is checked "Yes." The service launches. It has no runbooks, no capacity testing, and monitoring that alerts on "up/down" only. The first incident takes four hours because nobody knows how the service works.
Fix: PRR reviewers must verify evidence, not accept self-attestation. "Do you have monitoring?" is a bad question. "Show me the dashboard and explain the three key metrics" is a good one. Require screenshots, links, or live demos for each PRR item.
7. Automating toil before understanding it¶
You automate a manual process without understanding why it was manual. The process had a human judgment step that your automation skips. Now the automation runs and silently does the wrong thing in 5% of cases. You've replaced visible toil with invisible bugs.
Fix: Before automating, document the full manual process including decision points. Identify which steps are pure mechanics (automate immediately) and which involve judgment (automate with guardrails and alerts). Shadow the process 3 times before writing code.
Debug clue: If your automation is "mostly right but sometimes wrong," it probably skipped a human judgment step. Look for conditionals in the manual process: "if the database is large, do X instead of Y" or "check with the team lead if this affects more than 100 users." These branches are where automation needs guardrails, not blind execution.
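The guardrail pattern is simple to express in code: automate the mechanical branch, and have the judgment branch escalate instead of guessing. A sketch where the threshold and function names (`automated_cleanup`, `BLAST_RADIUS_LIMIT`) are hypothetical stand-ins for whatever your manual runbook actually says:

```python
# Sketch: automation with a guardrail around a human-judgment step.
BLAST_RADIUS_LIMIT = 100   # assumed threshold from the manual runbook

def automated_cleanup(affected_users: int, dry_run: bool = True) -> str:
    # Pure-mechanics path: small blast radius, safe to automate.
    if affected_users <= BLAST_RADIUS_LIMIT:
        if dry_run:
            return f"would clean up ({affected_users} users)"
        return f"cleaned up ({affected_users} users)"
    # Judgment path: the manual process said "check with the team lead"
    # here, so the automation stops and pages a human instead of guessing.
    return f"escalated: {affected_users} users exceeds limit, paging human"
```

The dry-run default is deliberate: shadow mode lets you compare the automation's decisions against the humans' for a few cycles before letting it act.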
8. On-call rotation with no primary/secondary split¶
Everyone is on a flat rotation. When a complex incident hits, the on-call engineer is alone. They don't know whether to escalate or keep debugging. They burn two hours solo before waking someone up. The incident is now two hours longer than it needed to be.
Fix: Run primary/secondary on-call. Primary handles the page. Secondary is available for escalation. Define clear escalation criteria: "If you haven't identified root cause in 15 minutes, page secondary." This is not weakness — it's the system working.
9. Measuring availability but not latency¶
Your SLO is "99.9% of requests return non-5xx." Your service returns 200 OK to 99.95% of requests. But 30% of those 200s take over 10 seconds. Users abandon the page. Your SLO says you're fine. Your users say you're broken.
Fix: Every service needs at least two SLOs: availability (success rate) and latency (p99 response time). A slow success is a failure from the user's perspective. "99.9% of requests succeed AND p99 latency < 500ms" is a real SLO.
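Both SLIs fall out of the same request log. A minimal sketch with illustrative sample data (real SLIs come from your metrics pipeline, and the nearest-rank percentile here is one of several common definitions):

```python
# Sketch: availability and p99 latency from one batch of requests.
import math

def percentile(values, p):
    """Nearest-rank percentile of the values."""
    s = sorted(values)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(k, 0)]

# (status_code, latency_ms) per request; illustrative sample data.
requests = [(200, 45), (200, 120), (200, 9800), (500, 30), (200, 80),
            (200, 60), (200, 200), (200, 95), (200, 70), (200, 55)]

availability = sum(1 for code, _ in requests if code < 500) / len(requests)
p99_latency = percentile([ms for _, ms in requests], 99)

# The compound SLO: succeed AND be fast.
slo_met = availability >= 0.999 and p99_latency < 500
print(f"availability={availability:.1%} p99={p99_latency}ms met={slo_met}")
```

Note the 9,800 ms request: it's a 200 OK, so it never touches the availability SLI, but it blows the latency SLO on its own. That is exactly the failure mode the compound SLO exists to catch.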
10. Treating every alert as equally urgent¶
Your pager fires 15 times a day. Three are real issues. Twelve are noise. On-call learns to glance and dismiss. Then a real SEV-1 fires and gets the same glance-and-dismiss treatment. The outage runs for 45 minutes before someone actually investigates.
Fix: Ruthlessly tune your alerts. Every alert must be actionable, novel, and urgent. If an alert fires more than once a week without requiring action, silence it or fix the underlying issue. Target: fewer than 2 pages per on-call shift. More than that is alert fatigue, and alert fatigue kills.
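Finding the alerts to kill is a counting exercise over your paging history. A sketch, assuming you can export each page with a flag for whether it required action (the alert names and sample week are illustrative):

```python
# Sketch: audit a week of pages for non-actionable noise.
from collections import Counter

# (alert_name, required_action) from one on-call week; sample data.
pages = [("disk_full", True), ("cpu_spike", False), ("cpu_spike", False),
         ("cpu_spike", False), ("latency_slo_burn", True),
         ("pod_restart", False), ("pod_restart", False)]

fired = Counter(name for name, _ in pages)
acted = Counter(name for name, action in pages if action)

for name, count in fired.most_common():
    # Fired repeatedly, never required action: silence or fix it.
    if count > 1 and acted[name] == 0:
        print(f"silence or fix: {name} fired {count}x, never actionable")
```

Run this every rotation handoff and the noisy alerts become a standing work item instead of a vague complaint.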
War story: A PagerDuty analysis of their customer data found that teams with more than 5 alerts per on-call shift had 3x longer MTTR than teams with fewer than 2. Alert fatigue doesn't just annoy engineers — it measurably degrades incident response quality.