Portal | Level: L2: Operations | Topics: SRE Practices, Postmortems & SLOs, Alerting Rules | Domain: DevOps & Tooling

SRE Practices - Primer

Why This Matters

Site Reliability Engineering is the discipline that turns "keep it running" from a vague hope into a measurable engineering practice. Without SRE principles, you end up in a cycle: ship features, break things, fight fires, repeat. SRE gives you the framework to quantify reliability, make data-driven tradeoffs between velocity and stability, and systematically reduce the operational tax that grinds teams down.

This pack goes beyond SLOs and postmortems (covered in the postmortem-slo pack) into the full SRE operational framework: how to measure and eliminate toil, plan capacity before you need it, gate releases on production readiness, and treat reliability as a feature that competes for engineering time just like any product feature.

Core Principles

Who made it: The term "Site Reliability Engineering" was coined by Ben Treynor Sloss at Google around 2003. His famous definition: "SRE is what happens when you ask a software engineer to design an operations function." Google published the SRE Book (O'Reilly, 2016), which codified the discipline and sparked industry-wide adoption. The key insight: SREs are software engineers who happen to work on operations problems, not operators who learned to script. The 50% toil cap exists to enforce this — if more than half your time is manual operations, you are not doing SRE.

1. Reliability Is a Feature

Reliability is not something that happens by default. It is an explicit product requirement that must be funded, staffed, and tracked. The Google SRE model treats reliability as the most important feature — because if users can't reach your service, no other feature matters.

The Reliability Stack:

  Product Features     ← What users want
  ─────────────────
  Reliability          ← What makes features usable
  ─────────────────
  Infrastructure       ← What makes reliability possible

If the bottom two layers are underfunded, the top layer is worthless.

2. Error Budgets Drive Decisions

Error budgets are the bridge between product velocity and operational stability. The concept is simple: if your SLO is 99.9%, you have a 0.1% error budget. When the budget is healthy, ship fast. When it's burned, slow down and fix things.

Error Budget Policy — Decision Matrix:

  Budget Status        │ Engineering Action           │ Release Policy
  ─────────────────────┼──────────────────────────────┼─────────────────────
  > 50% remaining      │ Normal development           │ Ship at will
  25-50% remaining     │ Review recent reliability    │ Canary all deploys
  < 25% remaining      │ Prioritize reliability work  │ Reduced deploy frequency
  Exhausted (0%)       │ Feature freeze               │ Only reliability fixes
  Negative (overdrawn) │ All hands on reliability     │ Full stop on features

Remember: The error budget formula: Budget = 1 - SLO. If SLO = 99.9%, budget = 0.1% = 43.2 minutes per 30-day month. If SLO = 99.95%, budget = 0.05% = 21.6 minutes/month: adding just half a "9" cuts your budget in half. This is why going from "three nines" to "four nines" is not a 0.09% improvement; it is a 10x reduction in allowed downtime.
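
The arithmetic above can be scripted for any SLO (a minimal sketch; awk handles the floating-point math, and the 30-day month is an assumption):

```shell
#!/bin/sh
# Error budget in minutes per 30-day month: budget = (1 - SLO) * 43200 min.
budget_minutes() {
  awk -v slo="$1" 'BEGIN { printf "%.1f\n", (100 - slo) / 100 * 30 * 24 * 60 }'
}

budget_minutes 99.9    # 43.2
budget_minutes 99.95   # 21.6
budget_minutes 99.99   # 4.3
```

Note how each step toward another "9" shrinks the budget multiplicatively, not linearly.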

Analogy: Error budgets work like a bank account. Reliability incidents are withdrawals. Feature velocity is the interest rate. When the account is flush, you spend freely (ship fast). When the balance is low, you stop spending (feature freeze) and focus on deposits (reliability work). The budget policy is your overdraft protection — it triggers automatically and forces the hard conversation between product and engineering.

3. Toil Measurement and Reduction

Toil is manual, repetitive, automatable work that scales linearly with service growth and has no enduring value. The SRE target: no more than 50% of an SRE's time should be toil. Above that, the team is a glorified ops team, not an engineering team.

  Characteristic      │ Toil                                  │ NOT Toil
  ────────────────────┼───────────────────────────────────────┼─────────────────────────────────
  Manual              │ Restarting a crashed pod by hand      │ Pod restarts via liveness probe
  Repetitive          │ Running the same deploy script weekly │ Writing the deploy automation
  Automatable         │ Rotating certs by hand every 90 days  │ cert-manager auto-renewal
  Reactive            │ Manually scaling on traffic spike     │ HPA auto-scaling
  No enduring value   │ Acknowledging known-false alerts      │ Tuning alert thresholds
  Scales with service │ Adding firewall rules per customer    │ Self-service customer onboarding

Measuring Toil

Toil Tracking Spreadsheet (per engineer, per week):

  Task                          │ Time (min) │ Frequency │ Toil? │ Automatable?
  ──────────────────────────────┼────────────┼───────────┼───────┼─────────────
  Restart failed jobs           │ 15         │ 3x/week   │ Yes   │ Yes
  Rotate staging certs          │ 30         │ Monthly   │ Yes   │ Yes
  Investigate false alerts      │ 45         │ Daily     │ Yes   │ Yes (tune)
  Capacity planning review      │ 60         │ Monthly   │ No    │ No
  Design review for new service │ 120        │ Weekly    │ No    │ No

Toil ratio = Toil hours / Total hours
Target: < 50%
Alarm: > 60%
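
The ratio and its thresholds are simple enough to compute straight from the spreadsheet totals (a sketch using this pack's target and alarm values):

```shell
#!/bin/sh
# Toil ratio with the pack's thresholds: target < 50%, alarm > 60%.
toil_ratio() {  # args: toil_hours total_hours
  awk -v toil="$1" -v total="$2" 'BEGIN {
    r = toil / total * 100
    status = (r > 60) ? "ALARM" : (r >= 50) ? "OVER TARGET" : "OK"
    printf "%.0f%% %s\n", r, status
  }'
}

toil_ratio 12 40   # 30% OK
toil_ratio 26 40   # 65% ALARM
```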

The Toil Reduction Loop

  ┌──────────────┐     ┌────────────────┐     ┌───────────────┐
  │ Identify     │────▶│ Measure        │────▶│ Prioritize    │
  │ (what's toil)│     │ (how much time)│     │ (ROI ranking) │
  └──────────────┘     └────────────────┘     └───────┬───────┘
                                                      │
  ┌──────────────┐     ┌────────────────┐             │
  │ Verify       │◀────│ Automate       │◀────────────┘
  │ (toil gone?) │     │ (build it)     │
  └──────────────┘     └────────────────┘

4. Capacity Planning

Capacity planning answers: "Will we have enough resources when demand grows?" Bad capacity planning manifests as either outages (too little) or waste (too much).

Capacity Planning Cadence:

  Weekly:   Check utilization dashboards, flag hosts > 70% CPU/memory
  Monthly:  Review growth trends, project 90-day resource needs
  Quarterly: Budget request for next quarter's infrastructure
  Annually: Long-range forecasting tied to product roadmap
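
The weekly sweep can be a one-liner over whatever utilization report your metrics system exports (a sketch; the three-column `hostname cpu_pct mem_pct` format is an assumption):

```shell
#!/bin/sh
# Flag hosts above the 70% CPU-or-memory threshold from the weekly check.
# Input format (assumed): one host per line, "hostname cpu_pct mem_pct".
flag_hot_hosts() {
  awk '$2 > 70 || $3 > 70 { print $1, "cpu=" $2 "%", "mem=" $3 "%" }' "$1"
}

printf 'web-1 85 40\nweb-2 50 45\ndb-1 60 90\n' > /tmp/util.txt
flag_hot_hosts /tmp/util.txt
# web-1 cpu=85% mem=40%
# db-1 cpu=60% mem=90%
```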

The Four Signals of Capacity

  Signal  │ What to Watch                            │ Danger Zone
  ────────┼──────────────────────────────────────────┼─────────────────────────────────────
  CPU     │ Sustained utilization across fleet       │ > 70% average (no burst headroom)
  Memory  │ RSS growth trends, OOM frequency         │ > 80% average, any OOM kills
  Disk    │ Growth rate vs provisioned space         │ < 20% free, or < 30 days until full
  Network │ Bandwidth utilization, connection counts │ > 60% of link capacity
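
The disk signal's "< 30 days until full" threshold is just current headroom divided by growth rate (a minimal sketch):

```shell
#!/bin/sh
# Days until a disk fills: (provisioned - used) / observed daily growth.
days_until_full() {  # args: used_gb total_gb growth_gb_per_day
  awk -v u="$1" -v t="$2" -v g="$3" 'BEGIN { printf "%.0f\n", (t - u) / g }'
}

days_until_full 700 1000 10   # 30 -> right at the danger-zone boundary
```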

Load Testing for Capacity

# Establish a baseline: what does your service handle today?
# Record: requests/sec, latency p50/p99, error rate, resource usage

# Synthetic load test (example with k6)
k6 run --vus 100 --duration 5m loadtest.js

# Key questions to answer:
# 1. At what RPS does p99 latency exceed your SLO?
# 2. At what RPS do errors start appearing?
# 3. What resource hits its limit first? (CPU? Memory? Connections?)
# 4. How does the system behave when overloaded? (graceful degradation or crash?)

5. Release Engineering

Release engineering is the discipline of getting code from a developer's branch to production safely, repeatably, and reversibly.

Release Safety Ladder:

  Level 0: YOLO deploy (git push to main, auto-deploy)
  Level 1: CI checks pass before merge
  Level 2: Canary deploy (small % of traffic first)
  Level 3: Progressive rollout (10% → 25% → 50% → 100%)
  Level 4: Automated rollback on SLI degradation
  Level 5: Dark launches + feature flags + automated analysis

  Practice             │ What It Prevents                      │ Complexity
  ─────────────────────┼───────────────────────────────────────┼───────────
  CI gate (tests pass) │ Obviously broken code reaching prod   │ Low
  Canary deploy        │ Bad code affecting all users at once  │ Medium
  Feature flags        │ Coupling deploy with release          │ Medium
  Blue-green deploy    │ Downtime during deploy                │ Medium
  Progressive rollout  │ Blast radius of bad releases          │ High
  Automated rollback   │ Slow human response to broken deploys │ High
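
Level 4 hinges on comparing canary and baseline SLIs; the decision itself can be tiny (a sketch; the "canary error rate exceeds baseline plus tolerance" rule is one common trigger, not the only one):

```shell
#!/bin/sh
# Automated-rollback trigger sketch: roll back when the canary's error
# rate exceeds the baseline's by more than the allowed tolerance.
should_rollback() {  # args: canary_error_pct baseline_error_pct tolerance_pct
  awk -v c="$1" -v b="$2" -v tol="$3" 'BEGIN { exit !(c > b + tol) }'
}

if should_rollback 2.5 0.4 1.0; then
  echo "rollback"   # 2.5% > 0.4% + 1.0% -> roll the canary back
fi
```

In practice this comparison runs continuously during the rollout, and a "rollback" decision invokes your deploy tool's revert path.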

6. Production Readiness Reviews

A Production Readiness Review (PRR) is a checklist-driven evaluation of whether a new service is ready to be supported in production. It happens before launch, not after the first outage.

PRR Checklist — Minimum Viable:

  □ SLOs defined and measurable
  □ Monitoring: dashboards exist, key metrics identified
  □ Alerting: SLO-based alerts configured, on-call routed
  □ Runbooks: common failure modes documented with remediation steps
  □ Capacity: load tested, resource limits set, scaling policy defined
  □ Dependencies: failure modes of each dependency understood
  □ Rollback: deployment can be reversed in < 5 minutes
  □ Data: backup strategy defined and tested
  □ Security: auth, encryption, secrets management reviewed
  □ On-call: team trained, escalation path defined
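
A checklist only has teeth if something enforces it. A minimal gate sketch, assuming items are tracked as `[x]` / `[ ]` lines in a plain-text file (a hypothetical format):

```shell
#!/bin/sh
# PRR gate sketch: launch is blocked while any "[ ]" (open) item remains.
prr_pass() {
  ! grep -q '^\[ \]' "$1"
}

printf '[x] SLOs defined and measurable\n[ ] Runbooks documented\n' > /tmp/prr.txt
if prr_pass /tmp/prr.txt; then echo "launch approved"; else echo "launch blocked"; fi
# launch blocked
```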

PRR Graduation Model

  ┌──────────────────┐
  │ Development       │  No SLO, no on-call, break/fix by dev team
  └────────┬─────────┘
           │ PRR Review
  ┌────────▼─────────┐
  │ Early Production  │  SLO defined, shared on-call with SRE
  └────────┬─────────┘
           │ 90-day review
  ┌────────▼─────────┐
  │ Full Production   │  SLO enforced, error budget policy active
  └────────┬─────────┘
           │ Maturity review
  ┌────────▼─────────┐
  │ Mature Service    │  Self-service, automated everything
  └──────────────────┘

7. On-Call and Escalation

SRE on-call is not "wake someone up for every alert." It is a structured system with clear escalation paths, bounded response times, and protection against burnout.

  Tier      │ Who                            │ Response Time │ Handles
  ──────────┼────────────────────────────────┼───────────────┼───────────────────────────────────────────
  L1        │ On-call SRE                    │ 5 min ack     │ Alert triage, runbook execution
  L2        │ Senior SRE / Service owner     │ 15 min ack    │ Complex diagnosis, non-runbook issues
  L3        │ Principal engineer / Architect │ 30 min ack    │ System-wide failures, unknown unknowns
  Executive │ VP Eng / CTO                   │ As needed     │ Customer communication, business decisions

Escalation Rule of Thumb:

  If you haven't made progress in 15 minutes → escalate to L2
  If L2 hasn't made progress in 30 minutes → escalate to L3
  If customer impact exceeds 1 hour → notify executive

  Escalation is NOT failure. It is the system working correctly.
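
The rule of thumb maps directly onto a function of elapsed time (a sketch; the 45-minute L3 cutoff is the 15-minute L1 window plus the 30-minute L2 window):

```shell
#!/bin/sh
# Escalation sketch: tier from minutes without progress, plus an
# executive notification once customer impact exceeds an hour.
escalate_to() {  # args: minutes_without_progress customer_impact_minutes
  awk -v m="$1" -v impact="$2" 'BEGIN {
    tier = (m >= 45) ? "L3" : (m >= 15) ? "L2" : "L1"
    if (impact > 60) tier = tier " + notify executive"
    print tier
  }'
}

escalate_to 20 10   # L2
escalate_to 50 90   # L3 + notify executive
```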

8. Eliminating Toil Through Automation Priorities

Not all toil is equally worth automating. Prioritize by frequency and time cost:

Automation Priority Matrix:

                      High Frequency
             ┌──────────────┬──────────────┐
             │  AUTOMATE    │  AUTOMATE    │
             │  NEXT        │  FIRST       │
   Low       │  (weekly,    │  (daily,     │       High
   Time ─────┤  5 min each) │  30 min each)├───── Time
   Cost      │              │              │       Cost
             │  CONSIDER    │  AUTOMATE    │
             │  (is it      │  SOON        │
             │  worth it?)  │  (monthly,   │
             │              │  2 hours)    │
             └──────────────┴──────────────┘
                      Low Frequency
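
Ranking candidates for the matrix comes down to one number: annual hours recovered (a sketch of the arithmetic):

```shell
#!/bin/sh
# Automation ROI: annual hours recovered = runs per year * minutes per run / 60.
hours_saved_per_year() {  # args: runs_per_year minutes_per_run
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n * m / 60 }'
}

hours_saved_per_year 12 5     # 1.0   (monthly 5-minute task)
hours_saved_per_year 365 30   # 182.5 (daily 30-minute task)
```

Subtract the estimated cost of building the automation and you have a rough payback period for each candidate.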

Common Pitfalls

  1. Treating SRE as "ops with a new name" — SRE is an engineering discipline. If your SREs aren't writing code to eliminate toil, they're just ops engineers with fancier titles.
  2. No error budget policy — SLOs without consequences are just dashboards nobody looks at. The policy is what makes SLOs actionable.
  3. Toil measurement without toil reduction — Tracking toil in a spreadsheet feels productive but changes nothing. Allocate 30% of sprint capacity to toil elimination projects.
  4. Capacity planning by crisis — If you only think about capacity when the pager fires, you'll always be behind. Build the dashboards and review cadence before the emergency.
  5. PRR as a checkbox exercise — A PRR that approves everything is worthless. It should have teeth: services that fail PRR don't launch.
  6. Automating the wrong things first — Automating a monthly 5-minute task saves 1 hour/year. Automating a daily 30-minute task saves 182 hours/year. Do the math.
  7. Ignoring the human side — On-call burnout, alert fatigue, and toil demoralization are SRE problems too. A reliable system with a burned-out team is a temporary situation.
