SRE Practices - Street-Level Ops¶
What experienced SREs know about running reliability as an engineering discipline, not a hope.
Quick Diagnosis Commands¶
# Check current error budget burn rate (Prometheus)
# Burn rate > 1.0 means you're consuming budget faster than allowed
# (the /0.001 divisor assumes a 99.9% SLO, i.e., a 0.1% error budget)
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
    / 0.001' | jq -r '.data.result[0].value[1]'
# Quick SLO compliance check — last 30 days
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=1 - (sum(increase(http_requests_total{status=~"5.."}[30d]))
    / sum(increase(http_requests_total[30d])))' | jq -r '.data.result[0].value[1]'
# Toil snapshot — how many manual interventions this week?
grep -c "manual\|restart\|reboot\|cleared\|acked" /var/log/ops-actions.log
# Capacity check — fleet-wide CPU utilization
kubectl top nodes --no-headers | awk '{print $1, $3}' | sort -k2 -rn | head -10
# Disk usage now — a point-in-time input for predicting days until full
df -h / | tail -1 | awk '{print $5}' | tr -d '%'
# Combine with the historical growth trend from monitoring to project exhaustion
# Check how many alerts fired in the last 24h (Alertmanager)
curl -s http://alertmanager:9093/api/v2/alerts | jq '[.[] | select(.startsAt > (now - 86400 | todate))] | length'
Remember the terminology chain (mnemonic: I-O-A-B): SLI (what you measure, e.g., latency), SLO (the target, e.g., p99 < 200ms), SLA (the contract with consequences), Error Budget (the allowed failure, e.g., 0.1% of requests). SLIs feed SLOs, SLOs feed error budgets, and error budgets drive engineering prioritization.
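The budget itself is simple arithmetic. A minimal sketch, assuming a 99.9% availability SLO (a 0.1% error budget) and made-up request counts:

```shell
# Error budget for a 99.9% availability SLO over a 30-day window.
slo_target=0.999
total_requests=10000000   # requests served in the window (illustrative)
errors_so_far=6000        # failed requests so far (illustrative)

# Budget = allowed failures = (1 - SLO) * total requests
budget=$(awk -v t="$total_requests" -v s="$slo_target" \
  'BEGIN { printf "%d", t * (1 - s) }')
remaining=$((budget - errors_so_far))

echo "allowed failures this window: $budget"
echo "budget remaining: $remaining"
```

With these numbers the window allows 10,000 failed requests, and 4,000 of them are still unspent.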
Gotcha: SLOs That Nobody Looks At¶
You defined SLOs six months ago. They're in a Confluence page. Nobody has checked them since. The dashboard exists but isn't on any team's daily routine. When an incident happens, nobody thinks about error budget impact.
Fix: Build SLO review into existing rituals. Add the error budget burn rate to your daily standup dashboard. Include SLO attainment in weekly team reports. Make the error budget widget the first thing on your Grafana home dashboard. If nobody is looking at it, it doesn't exist.
Under the hood: Google's SRE workbook recommends multi-window, multi-burn-rate alerting: page on a 14.4x burn rate over 1 hour and file a ticket on a 6x burn rate over 6 hours. A single-threshold alert either fires too late (you burned 100% before noticing) or too often (every minor blip pages). The multi-window approach catches fast burns quickly and slow burns before they exhaust the budget.
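A sketch of those two alerts as Prometheus alerting rules, reusing the `http_requests_total` metric and the 0.001 (99.9% SLO) budget from the queries above. Each rule also checks a shorter confirmation window (5m/30m) so the alert resolves quickly once the burn stops; the 14.4x and 6x thresholds are common choices, not gospel — tune them to your own SLO and windows:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x over 1h, confirmed by a 5m window -> page
      - alert: ErrorBudgetFastBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > 14.4 * 0.001)
          and
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 14.4 * 0.001)
        labels:
          severity: page
      # Slow burn: 6x over 6h, confirmed by a 30m window -> ticket
      - alert: ErrorBudgetSlowBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h])) > 6 * 0.001)
          and
          (sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m])) > 6 * 0.001)
        labels:
          severity: ticket
```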
Gotcha: Measuring Toil But Never Reducing It¶
Your team dutifully logs toil in a spreadsheet every week. The spreadsheet shows 60% toil. Leadership nods. Nothing changes because there's no allocated engineering time to build the automation. The spreadsheet becomes a monument to futility.
Fix: Reserve 30% of sprint capacity for toil reduction projects. Track toil reduction like you track feature delivery. Set a quarterly target: "Reduce toil ratio from 60% to 40%." Make it a KPI that leadership reviews.
Gotcha: Capacity Planning by Gut Feel¶
"We'll probably need more nodes in Q3." Based on what? A feeling. When Q3 hits, you're either over-provisioned (wasting money) or under-provisioned (scrambling during a traffic spike). Neither outcome is engineering.
Fix: Build a capacity model from actual data:
1. Measure current utilization per resource type (CPU, memory, disk, network)
2. Measure growth rate (requests/day trend over 90 days)
3. Identify your bottleneck resource (usually one hits the wall first)
4. Project: at current growth rate, when does bottleneck hit 70%?
5. Plan provisioning to land 30 days before that date
6. Validate with load tests quarterly
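Steps 2–4 reduce to arithmetic once the measurements exist. A minimal sketch of the projection, with made-up utilization and growth numbers (pull real ones from your monitoring):

```shell
# Project when the bottleneck resource crosses the 70% planning line.
current_util=40      # bottleneck utilization today, percent (illustrative)
growth_per_day=0.25  # percentage points per day, from the 90-day trend
threshold=70         # plan against 70%, not 100%

days_until_threshold=$(awk -v c="$current_util" -v g="$growth_per_day" -v t="$threshold" \
  'BEGIN { printf "%d", (t - c) / g }')

echo "bottleneck hits ${threshold}% in ~${days_until_threshold} days"
echo "provisioning must land by day $((days_until_threshold - 30))"
```

For these numbers: 120 days to the threshold, so provisioning lands by day 90.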
Gotcha: PRR Rubber Stamps¶
Every service passes the Production Readiness Review because the reviewer doesn't want to be the person blocking a launch. The checklist is filled out but the answers are "N/A" or "TODO." The service launches, breaks at 2am, and the SRE team inherits a service they never approved.
Fix: PRR must have veto power. Assign PRR reviewers who are not on the launching team. Require evidence, not promises: "Show me the dashboard" not "We'll add monitoring." Failed PRR items become blockers, not action items.
Pattern: The Error Budget Conversation¶
When error budget is burned, the conversation between SRE and product should follow this structure:
SRE: "Our 30-day error budget for service X is exhausted.
We burned 80% of it on Tuesday's deploy incident."
PM: "We have Feature Y launching next week. Can we still ship?"
SRE: "Per our error budget policy, we're in feature freeze.
I recommend we spend this week on:
1. Fixing the flaky integration test that missed Tuesday's bug
2. Adding the canary deploy step we've been deferring
3. Backfilling the missing runbook for database failover"
PM: "When can we resume feature work?"
SRE: "Error budget resets on the rolling 30-day window.
If we have zero incidents this week, we'll have budget back
by next Thursday. The reliability work makes that more likely."
This conversation only works if the error budget policy was agreed upon in advance. Negotiating during an incident is a losing game.
Pattern: Toil Elimination Sprint¶
Dedicate one sprint per quarter entirely to toil reduction:
Toil Sprint Structure (2-week sprint):
Day 1: Review toil log. Rank by (frequency × time × frustration).
Pick top 3 toil items to eliminate.
Day 2-3: Design automation for item #1 (usually the daily one).
Prototype and test.
Day 4-6: Implement and deploy automation for #1.
Begin design for #2.
Day 7-8: Implement #2. Begin #3 if time allows.
Day 9: Test all automations in staging.
Document what was automated and how.
Day 10: Deploy to production. Update toil tracking.
Demo to team. Measure: how much time reclaimed per week?
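Day 1's ranking step is easy to mechanize if the toil log is structured. A sketch assuming a tab-separated log of task, runs per week, minutes per run, and frustration (1–5) — the file format and entries here are made up:

```shell
# Score each toil item as frequency x time x frustration, highest first.
printf 'cert-renewal\t1\t45\t4\ndisk-cleanup\t7\t10\t3\nfailed-job-rerun\t14\t5\t5\n' \
  > /tmp/toil.tsv

# Column 2 = runs/week, 3 = minutes each, 4 = frustration (1-5)
awk -F'\t' '{ print $2 * $3 * $4 "\t" $1 }' /tmp/toil.tsv | sort -rn
```

With this data, failed-job-rerun scores 14 × 5 × 5 = 350 and gets automated first.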
Pattern: Capacity Planning With N+1 and N+2¶
N+0: Current capacity handles current load.
Problem: zero headroom. Any spike = outage.
N+1: One redundant unit (server, node, pod) beyond what's needed.
Handles: single failure or small traffic spike.
Minimum viable for production.
N+2: Two redundant units.
Handles: one planned maintenance + one unplanned failure simultaneously.
Standard for business-critical services.
Rule of thumb:
- Stateless services: N+1 (scale horizontally, fast to add)
- Stateful services: N+2 (harder to recover, needs more buffer)
- Single points of failure: eliminate entirely (no N+anything saves you)
Gotcha: N+1 only works if the remaining N units can handle the full load. If your 3-node cluster runs at 80% CPU, losing one node puts the remaining two at 120% — they crash too. The real formula is: peak load / (N-1) < capacity per unit. Run load tests with one node intentionally removed to validate your N+1 claim.
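A quick way to check that claim with real numbers before anyone writes "N+1" in a design doc (fleet size, peak load, and per-node capacity here are illustrative):

```shell
# N+1 sanity check: peak_load / (N - 1) must stay below per-node capacity.
n_nodes=3
peak_load=240        # fleet-wide peak, requests/sec (so 80 req/s per node now)
node_capacity=100    # what one node handles before degrading

per_node_after_failure=$(( peak_load / (n_nodes - 1) ))
if [ "$per_node_after_failure" -lt "$node_capacity" ]; then
  echo "N+1 holds: ${per_node_after_failure} req/s per surviving node"
else
  echo "N+1 is fiction here: ${per_node_after_failure} req/s exceeds capacity ${node_capacity}"
fi
```

This is exactly the 80%-utilized 3-node cluster from the gotcha: lose one node and the survivors face 120 req/s against a 100 req/s ceiling.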
Pattern: Release Engineering Checklist¶
Before every production deploy, verify:
Pre-Deploy:
□ CI pipeline green (tests, lint, security scan)
□ Change reviewed by someone who didn't write it
□ Rollback procedure documented and tested
□ Canary percentage configured (start at 5%)
□ SLI dashboards open and baselined
During Deploy:
□ Watch error rate for 10 minutes at each canary stage
□ Watch latency p99 for regression
□ Spot-check logs for new error patterns
□ Verify health checks passing on new pods
Post-Deploy:
□ All canary stages promoted to 100%
□ SLI metrics stable for 30 minutes
□ No new alert firing
□ Deploy recorded in change log with SHA and timestamp
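The error-rate watch during canary stages can be reduced to a mechanical gate. A sketch of the decision only — the function name, the 0.1-point margin, and the sample rates are made up, and the real inputs would come from your SLI dashboards:

```shell
# Gate a canary stage: allow promotion only if the canary's error rate is
# within an absolute margin (0.1 percentage points) of the baseline fleet's.
canary_gate() {
  baseline_pct=$1   # stable fleet error rate, percent
  canary_pct=$2     # canary pods error rate, percent
  awk -v b="$baseline_pct" -v c="$canary_pct" 'BEGIN { exit !(c <= b + 0.1) }'
}

canary_gate 0.05 0.08 && echo promote || echo rollback   # within margin: promote
canary_gate 0.05 0.40 && echo promote || echo rollback   # regression: rollback
```

An absolute margin is deliberate here: a relative margin ("canary may be 2x baseline") pages constantly when the baseline error rate is near zero.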
Emergency: Error Budget Exhausted Mid-Sprint¶
Your 30-day error budget just hit zero with two weeks left in the window. Product is pushing for a feature release.
1. Invoke the error budget policy (this is why you wrote it)
2. Communicate to stakeholders:
- "We are in reliability-only mode per our agreed policy"
- "Feature work resumes when budget recovers"
- "Here is what we're doing to restore reliability"
3. Focus engineering effort on:
- Root cause of the budget-burning incidents
- Preventing recurrence (not just fixing symptoms)
- Reducing blast radius of future failures
4. Track budget recovery daily
5. Resume feature work only when budget is positive again
Emergency: New Service Failing PRR Before Hard Launch Date¶
Marketing has announced a launch date. The service fails its Production Readiness Review. You're caught between business pressure and engineering safety.
1. Itemize exactly what failed:
- Which PRR items are red? Which are yellow?
- What's the risk of launching without each item?
2. Propose a tiered response:
- "Must fix before launch" (no monitoring = no launch)
- "Accept risk with mitigation" (missing runbook = senior SRE on standby)
- "Fix within 30 days post-launch" (non-critical items)
3. Get explicit sign-off from VP Eng on accepted risks
- Not verbal. Written. With names.
4. Document the exception:
- What was waived, why, and the plan to close the gap
- Schedule the 30-day review and hold to it
5. Staff the launch with extra on-call coverage
- The mitigation for missing automation is human attention (temporarily)