Portal | Level: L2: Operations | Topics: SRE Practices, Postmortems & SLOs, Alerting Rules | Domain: DevOps & Tooling

SRE Practices - Primer

Why This Matters

Site Reliability Engineering is the discipline that turns "keep it running" from a vague hope into a measurable engineering practice. Without SRE principles, you end up in a cycle: ship features, break things, fight fires, repeat. SRE gives you the framework to quantify reliability, make data-driven tradeoffs between velocity and stability, and systematically reduce the operational tax that grinds teams down.

This pack goes beyond SLOs and postmortems (covered in the postmortem-slo pack) into the full SRE operational framework: how to measure and eliminate toil, plan capacity before you need it, gate releases on production readiness, and treat reliability as a feature that competes for engineering time just like any product feature.

Core Principles

Who made it: The term "Site Reliability Engineering" was coined by Ben Treynor Sloss at Google around 2003. His famous definition: "SRE is what happens when you ask a software engineer to design an operations function." Google published the SRE Book (O'Reilly, 2016), which codified the discipline and sparked industry-wide adoption. The key insight: SREs are software engineers who happen to work on operations problems, not operators who learned to script. The 50% toil cap exists to enforce this — if more than half your time is manual operations, you are not doing SRE.

1. Reliability Is a Feature

Reliability is not something that happens by default. It is an explicit product requirement that must be funded, staffed, and tracked. The Google SRE model treats reliability as the most important feature — because if users can't reach your service, no other feature matters.

The Reliability Stack:

  Product Features     ← What users want
  ─────────────────
  Reliability          ← What makes features usable
  ─────────────────
  Infrastructure       ← What makes reliability possible

If the bottom two layers are underfunded, the top layer is worthless.

2. Error Budgets Drive Decisions

Error budgets are the bridge between product velocity and operational stability. The concept is simple: if your SLO is 99.9%, you have a 0.1% error budget. When the budget is healthy, ship fast. When it's burned, slow down and fix things.

Error Budget Policy — Decision Matrix:

  Budget Status        │ Engineering Action           │ Release Policy
  ─────────────────────┼──────────────────────────────┼─────────────────────
  > 50% remaining      │ Normal development           │ Ship at will
  25-50% remaining     │ Review recent reliability    │ Canary all deploys
  < 25% remaining      │ Prioritize reliability work  │ Reduced deploy frequency
  Exhausted (0%)       │ Feature freeze               │ Only reliability fixes
  Negative (overdrawn) │ All hands on reliability     │ Full stop on features

Remember: The error budget formula: Budget = 1 - SLO. If SLO = 99.9%, budget = 0.1% = 43.2 minutes per 30-day month. If SLO = 99.95%, budget = 0.05% = 21.6 minutes/month: adding just half a "9" cuts your budget in half. This is why going from "three nines" to "four nines" is not a 0.09% improvement; it is a 10x reduction in allowed downtime.
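
The arithmetic above can be scripted for any SLO (a minimal sketch; awk handles the floating-point math, and the 30-day month is an assumption):

```shell
#!/bin/sh
# Error budget in minutes per 30-day month: budget = (1 - SLO) * 43200 min.
budget_minutes() {
  awk -v slo="$1" 'BEGIN { printf "%.1f\n", (100 - slo) / 100 * 30 * 24 * 60 }'
}

budget_minutes 99.9    # 43.2
budget_minutes 99.95   # 21.6
budget_minutes 99.99   # 4.3
```

Note how each step toward another "9" shrinks the budget multiplicatively, not linearly.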

Analogy: Error budgets work like a bank account. Reliability incidents are withdrawals. Feature velocity is the interest rate. When the account is flush, you spend freely (ship fast). When the balance is low, you stop spending (feature freeze) and focus on deposits (reliability work). The budget policy is your overdraft protection — it triggers automatically and forces the hard conversation between product and engineering.

3. Toil Measurement and Reduction

Toil is manual, repetitive, automatable work that scales linearly with service growth and has no enduring value. The SRE target: no more than 50% of an SRE's time should be toil. Above that, the team is a glorified ops team, not an engineering team.

  Characteristic      │ Toil                                  │ NOT Toil
  ────────────────────┼───────────────────────────────────────┼─────────────────────────────────
  Manual              │ Restarting a crashed pod by hand      │ Pod restarts via liveness probe
  Repetitive          │ Running the same deploy script weekly │ Writing the deploy automation
  Automatable         │ Rotating certs by hand every 90 days  │ cert-manager auto-renewal
  Reactive            │ Manually scaling on traffic spike     │ HPA auto-scaling
  No enduring value   │ Acknowledging known-false alerts      │ Tuning alert thresholds
  Scales with service │ Adding firewall rules per customer    │ Self-service customer onboarding

Measuring Toil

Toil Tracking Spreadsheet (per engineer, per week):

  Task                          │ Time (min) │ Frequency │ Toil? │ Automatable?
  ──────────────────────────────┼────────────┼───────────┼───────┼─────────────
  Restart failed jobs           │ 15         │ 3x/week   │ Yes   │ Yes
  Rotate staging certs          │ 30         │ Monthly   │ Yes   │ Yes
  Investigate false alerts      │ 45         │ Daily     │ Yes   │ Yes (tune)
  Capacity planning review      │ 60         │ Monthly   │ No    │ No
  Design review for new service │ 120        │ Weekly    │ No    │ No

Toil ratio = Toil hours / Total hours
Target: < 50%
Alarm: > 60%
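
The ratio and its thresholds are simple enough to compute straight from the spreadsheet totals (a sketch using this pack's target and alarm values):

```shell
#!/bin/sh
# Toil ratio with the pack's thresholds: target < 50%, alarm > 60%.
toil_ratio() {  # args: toil_hours total_hours
  awk -v toil="$1" -v total="$2" 'BEGIN {
    r = toil / total * 100
    status = (r > 60) ? "ALARM" : (r >= 50) ? "OVER TARGET" : "OK"
    printf "%.0f%% %s\n", r, status
  }'
}

toil_ratio 12 40   # 30% OK
toil_ratio 26 40   # 65% ALARM
```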

The Toil Reduction Loop

  ┌──────────────┐     ┌────────────────┐     ┌───────────────┐
  │ Identify     │────▶│ Measure        │────▶│ Prioritize    │
  │ (what's toil)│     │ (how much time)│     │ (ROI ranking) │
  └──────────────┘     └────────────────┘     └───────┬───────┘
                                                      │
  ┌──────────────┐     ┌────────────────┐             │
  │ Verify       │◀────│ Automate       │◀────────────┘
  │ (toil gone?) │     │ (build it)     │
  └──────────────┘     └────────────────┘

4. Capacity Planning

Capacity planning answers: "Will we have enough resources when demand grows?" Bad capacity planning manifests as either outages (too little) or waste (too much).

Capacity Planning Cadence:

  Weekly:   Check utilization dashboards, flag hosts > 70% CPU/memory
  Monthly:  Review growth trends, project 90-day resource needs
  Quarterly: Budget request for next quarter's infrastructure
  Annually: Long-range forecasting tied to product roadmap
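
The weekly sweep can be a one-liner over whatever utilization report your metrics system exports (a sketch; the three-column `hostname cpu_pct mem_pct` format is an assumption):

```shell
#!/bin/sh
# Flag hosts above the 70% CPU-or-memory threshold from the weekly check.
# Input format (assumed): one host per line, "hostname cpu_pct mem_pct".
flag_hot_hosts() {
  awk '$2 > 70 || $3 > 70 { print $1, "cpu=" $2 "%", "mem=" $3 "%" }' "$1"
}

printf 'web-1 85 40\nweb-2 50 45\ndb-1 60 90\n' > /tmp/util.txt
flag_hot_hosts /tmp/util.txt
# web-1 cpu=85% mem=40%
# db-1 cpu=60% mem=90%
```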

The Four Signals of Capacity

  Signal  │ What to Watch                            │ Danger Zone
  ────────┼──────────────────────────────────────────┼─────────────────────────────────────
  CPU     │ Sustained utilization across fleet       │ > 70% average (no burst headroom)
  Memory  │ RSS growth trends, OOM frequency         │ > 80% average, any OOM kills
  Disk    │ Growth rate vs provisioned space         │ < 20% free, or < 30 days until full
  Network │ Bandwidth utilization, connection counts │ > 60% of link capacity
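
The disk signal's "< 30 days until full" threshold is just current headroom divided by growth rate (a minimal sketch):

```shell
#!/bin/sh
# Days until a disk fills: (provisioned - used) / observed daily growth.
days_until_full() {  # args: used_gb total_gb growth_gb_per_day
  awk -v u="$1" -v t="$2" -v g="$3" 'BEGIN { printf "%.0f\n", (t - u) / g }'
}

days_until_full 700 1000 10   # 30 -> right at the danger-zone boundary
```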

Load Testing for Capacity

# Establish a baseline: what does your service handle today?
# Record: requests/sec, latency p50/p99, error rate, resource usage

# Synthetic load test (example with k6)
k6 run --vus 100 --duration 5m loadtest.js

# Key questions to answer:
# 1. At what RPS does p99 latency exceed your SLO?
# 2. At what RPS do errors start appearing?
# 3. What resource hits its limit first? (CPU? Memory? Connections?)
# 4. How does the system behave when overloaded? (graceful degradation or crash?)

5. Release Engineering

Release engineering is the discipline of getting code from a developer's branch to production safely, repeatably, and reversibly.

Release Safety Ladder:

  Level 0: YOLO deploy (git push to main, auto-deploy)
  Level 1: CI checks pass before merge
  Level 2: Canary deploy (small % of traffic first)
  Level 3: Progressive rollout (10% → 25% → 50% → 100%)
  Level 4: Automated rollback on SLI degradation
  Level 5: Dark launches + feature flags + automated analysis

  Practice             │ What It Prevents                      │ Complexity
  ─────────────────────┼───────────────────────────────────────┼───────────
  CI gate (tests pass) │ Obviously broken code reaching prod   │ Low
  Canary deploy        │ Bad code affecting all users at once  │ Medium
  Feature flags        │ Coupling deploy with release          │ Medium
  Blue-green deploy    │ Downtime during deploy                │ Medium
  Progressive rollout  │ Blast radius of bad releases          │ High
  Automated rollback   │ Slow human response to broken deploys │ High
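
Level 4 hinges on comparing canary and baseline SLIs; the decision itself can be tiny (a sketch; the "canary error rate exceeds baseline plus tolerance" rule is one common trigger, not the only one):

```shell
#!/bin/sh
# Automated-rollback trigger sketch: roll back when the canary's error
# rate exceeds the baseline's by more than the allowed tolerance.
should_rollback() {  # args: canary_error_pct baseline_error_pct tolerance_pct
  awk -v c="$1" -v b="$2" -v tol="$3" 'BEGIN { exit !(c > b + tol) }'
}

if should_rollback 2.5 0.4 1.0; then
  echo "rollback"   # 2.5% > 0.4% + 1.0% -> roll the canary back
fi
```

In practice this comparison runs continuously during the rollout, and a "rollback" decision invokes your deploy tool's revert path.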

6. Production Readiness Reviews

A Production Readiness Review (PRR) is a checklist-driven evaluation of whether a new service is ready to be supported in production. It happens before launch, not after the first outage.

PRR Checklist — Minimum Viable:

  □ SLOs defined and measurable
  □ Monitoring: dashboards exist, key metrics identified
  □ Alerting: SLO-based alerts configured, on-call routed
  □ Runbooks: common failure modes documented with remediation steps
  □ Capacity: load tested, resource limits set, scaling policy defined
  □ Dependencies: failure modes of each dependency understood
  □ Rollback: deployment can be reversed in < 5 minutes
  □ Data: backup strategy defined and tested
  □ Security: auth, encryption, secrets management reviewed
  □ On-call: team trained, escalation path defined
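
A checklist only has teeth if something enforces it. A minimal gate sketch, assuming items are tracked as `[x]` / `[ ]` lines in a plain-text file (a hypothetical format):

```shell
#!/bin/sh
# PRR gate sketch: launch is blocked while any "[ ]" (open) item remains.
prr_pass() {
  ! grep -q '^\[ \]' "$1"
}

printf '[x] SLOs defined and measurable\n[ ] Runbooks documented\n' > /tmp/prr.txt
if prr_pass /tmp/prr.txt; then echo "launch approved"; else echo "launch blocked"; fi
# launch blocked
```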

PRR Graduation Model

  ┌──────────────────┐
  │ Development       │  No SLO, no on-call, break/fix by dev team
  └────────┬─────────┘
           │ PRR Review
  ┌────────▼─────────┐
  │ Early Production  │  SLO defined, shared on-call with SRE
  └────────┬─────────┘
           │ 90-day review
  ┌────────▼─────────┐
  │ Full Production   │  SLO enforced, error budget policy active
  └────────┬─────────┘
           │ Maturity review
  ┌────────▼─────────┐
  │ Mature Service    │  Self-service, automated everything
  └──────────────────┘

7. On-Call and Escalation

SRE on-call is not "wake someone up for every alert." It is a structured system with clear escalation paths, bounded response times, and protection against burnout.

  Tier      │ Who                            │ Response Time │ Handles
  ──────────┼────────────────────────────────┼───────────────┼───────────────────────────────────────────
  L1        │ On-call SRE                    │ 5 min ack     │ Alert triage, runbook execution
  L2        │ Senior SRE / Service owner     │ 15 min ack    │ Complex diagnosis, non-runbook issues
  L3        │ Principal engineer / Architect │ 30 min ack    │ System-wide failures, unknown unknowns
  Executive │ VP Eng / CTO                   │ As needed     │ Customer communication, business decisions

Escalation Rule of Thumb:

  If you haven't made progress in 15 minutes → escalate to L2
  If L2 hasn't made progress in 30 minutes → escalate to L3
  If customer impact exceeds 1 hour → notify executive

  Escalation is NOT failure. It is the system working correctly.
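
The rule of thumb maps directly onto a function of elapsed time (a sketch; the 45-minute L3 cutoff is the 15-minute L1 window plus the 30-minute L2 window):

```shell
#!/bin/sh
# Escalation sketch: tier from minutes without progress, plus an
# executive notification once customer impact exceeds an hour.
escalate_to() {  # args: minutes_without_progress customer_impact_minutes
  awk -v m="$1" -v impact="$2" 'BEGIN {
    tier = (m >= 45) ? "L3" : (m >= 15) ? "L2" : "L1"
    if (impact > 60) tier = tier " + notify executive"
    print tier
  }'
}

escalate_to 20 10   # L2
escalate_to 50 90   # L3 + notify executive
```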

8. Eliminating Toil Through Automation Priorities

Not all toil is equally worth automating. Prioritize by frequency and time cost:

Automation Priority Matrix:

                      High Frequency
             ┌──────────────┬──────────────┐
             │  AUTOMATE    │  AUTOMATE    │
             │  NEXT        │  FIRST       │
   Low       │  (weekly,    │  (daily,     │       High
   Time ─────┤  5 min each) │  30 min each)├───── Time
   Cost      │              │              │       Cost
             │  CONSIDER    │  AUTOMATE    │
             │  (is it      │  SOON        │
             │  worth it?)  │  (monthly,   │
             │              │  2 hours)    │
             └──────────────┴──────────────┘
                      Low Frequency
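
Ranking candidates for the matrix comes down to one number: annual hours recovered (a sketch of the arithmetic):

```shell
#!/bin/sh
# Automation ROI: annual hours recovered = runs per year * minutes per run / 60.
hours_saved_per_year() {  # args: runs_per_year minutes_per_run
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n * m / 60 }'
}

hours_saved_per_year 12 5     # 1.0   (monthly 5-minute task)
hours_saved_per_year 365 30   # 182.5 (daily 30-minute task)
```

Subtract the estimated cost of building the automation and you have a rough payback period for each candidate.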

Common Pitfalls

  1. Treating SRE as "ops with a new name" — SRE is an engineering discipline. If your SREs aren't writing code to eliminate toil, they're just ops engineers with fancier titles.
  2. No error budget policy — SLOs without consequences are just dashboards nobody looks at. The policy is what makes SLOs actionable.
  3. Toil measurement without toil reduction — Tracking toil in a spreadsheet feels productive but changes nothing. Allocate 30% of sprint capacity to toil elimination projects.
  4. Capacity planning by crisis — If you only think about capacity when the pager fires, you'll always be behind. Build the dashboards and review cadence before the emergency.
  5. PRR as a checkbox exercise — A PRR that approves everything is worthless. It should have teeth: services that fail PRR don't launch.
  6. Automating the wrong things first — Automating a monthly 5-minute task saves 1 hour/year. Automating a daily 30-minute task saves 182 hours/year. Do the math.
  7. Ignoring the human side — On-call burnout, alert fatigue, and toil demoralization are SRE problems too. A reliable system with a burned-out team is a temporary situation.
