Portal | Level: L2: Operations | Topics: SRE Practices, Postmortems & SLOs, Alerting Rules | Domain: DevOps & Tooling
SRE Practices - Primer¶
Why This Matters¶
Site Reliability Engineering is the discipline that turns "keep it running" from a vague hope into a measurable engineering practice. Without SRE principles, you end up in a cycle: ship features, break things, fight fires, repeat. SRE gives you the framework to quantify reliability, make data-driven tradeoffs between velocity and stability, and systematically reduce the operational tax that grinds teams down.
This pack goes beyond SLOs and postmortems (covered in the postmortem-slo pack) into the full SRE operational framework: how to measure and eliminate toil, plan capacity before you need it, gate releases on production readiness, and treat reliability as a feature that competes for engineering time just like any product feature.
Core Principles¶
Who made it: The term "Site Reliability Engineering" was coined by Ben Treynor Sloss at Google around 2003. His famous definition: "SRE is what happens when you ask a software engineer to design an operations function." Google published the SRE Book (O'Reilly, 2016), which codified the discipline and sparked industry-wide adoption. The key insight: SREs are software engineers who happen to work on operations problems, not operators who learned to script. The 50% toil cap exists to enforce this — if more than half your time is manual operations, you are not doing SRE.
1. Reliability Is a Feature¶
Reliability is not something that happens by default. It is an explicit product requirement that must be funded, staffed, and tracked. The Google SRE model treats reliability as the most important feature — because if users can't reach your service, no other feature matters.
The Reliability Stack:
Product Features ← What users want
─────────────────
Reliability ← What makes features usable
─────────────────
Infrastructure ← What makes reliability possible
If the bottom two layers are underfunded, the top layer is worthless.
2. Error Budgets Drive Decisions¶
Error budgets are the bridge between product velocity and operational stability. The concept is simple: if your SLO is 99.9%, you have a 0.1% error budget. When the budget is healthy, ship fast. When it's burned, slow down and fix things.
Error Budget Policy — Decision Matrix:
Budget Status │ Engineering Action │ Release Policy
─────────────────────┼──────────────────────────────┼─────────────────────
> 50% remaining │ Normal development │ Ship at will
25-50% remaining │ Review recent reliability │ Canary all deploys
< 25% remaining │ Prioritize reliability work │ Reduced deploy frequency
Exhausted (0%) │ Feature freeze │ Only reliability fixes
Negative (overdrawn) │ All hands on reliability │ Full stop on features
Remember: The error budget formula: Budget = 1 - SLO. If SLO = 99.9%, budget = 0.1% = 43.2 minutes/month. If SLO = 99.95%, budget = 0.05% = 21.6 minutes/month. Adding that half-nine cuts your budget in half, and going from "three nines" (43.2 min/month) to "four nines" (4.32 min/month) is not a 0.09% improvement: it is a 10x reduction in allowed downtime.
Analogy: Error budgets work like a bank account. Reliability incidents are withdrawals; reliability work is a deposit. When the account is flush, you spend freely (ship fast). When the balance runs low, you stop spending (feature freeze) and focus on deposits. The budget policy is your overdraft protection: it triggers automatically and forces the hard conversation between product and engineering.
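The budget arithmetic and the decision matrix can be sketched together. This is an illustrative sketch, not a standard API: `error_budget_minutes` and `policy_action` are made-up names, and the matrix thresholds mirror the table above.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime per period for a given availability SLO."""
    return (1 - slo) * period_days * 24 * 60

def policy_action(budget_remaining: float) -> str:
    """Map the remaining budget fraction to the decision-matrix row."""
    if budget_remaining > 0.50:
        return "Normal development / ship at will"
    if budget_remaining >= 0.25:
        return "Review recent reliability / canary all deploys"
    if budget_remaining > 0:
        return "Prioritize reliability work / reduced deploy frequency"
    if budget_remaining == 0:
        return "Feature freeze / only reliability fixes"
    return "All hands on reliability / full stop on features"

print(error_budget_minutes(0.999))   # ~43.2 min/month for three nines
print(error_budget_minutes(0.9995))  # ~21.6 min/month: half-nine, half budget
print(policy_action(0.6))
```

Running the two calls side by side makes the "each nine is 10x" point concrete: `error_budget_minutes(0.9999)` leaves only ~4.3 minutes a month.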
3. Toil Measurement and Reduction¶
Toil is manual, repetitive, automatable work that scales linearly with service growth and has no enduring value. The SRE target: no more than 50% of an SRE's time should be toil. Above that, the team is a glorified ops team, not an engineering team.
| Characteristic | Toil | NOT Toil |
|---|---|---|
| Manual | Restarting a crashed pod by hand | Pod restarts via liveness probe |
| Repetitive | Running the same deploy script weekly | Writing the deploy automation |
| Automatable | Rotating certs by hand every 90 days | cert-manager auto-renewal |
| Reactive | Manually scaling on traffic spike | HPA auto-scaling |
| No enduring value | Acknowledging known-false alerts | Tuning alert thresholds |
| Scales with service | Adding firewall rules per customer | Self-service customer onboarding |
Measuring Toil¶
Toil Tracking Spreadsheet (per engineer, per week):
Task │ Time (min) │ Frequency │ Toil? │ Automatable?
──────────────────────────────┼────────────┼───────────┼───────┼─────────────
Restart failed jobs │ 15 │ 3x/week │ Yes │ Yes
Rotate staging certs │ 30 │ Monthly │ Yes │ Yes
Investigate false alerts │ 45 │ Daily │ Yes │ Yes (tune)
Capacity planning review │ 60 │ Monthly │ No │ No
Design review for new service │ 120 │ Weekly │ No │ No
Toil ratio = Toil hours / Total hours
Target: < 50%
Alarm: > 60%
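The ratio above is easy to compute from the tracking sheet. A toy sketch, with the sheet's tasks converted to per-week occurrences; the 40-hour-week denominator and the task frequencies are assumptions for illustration.

```python
WEEKLY_MINUTES = 40 * 60  # assumed total working time per engineer per week

tasks = [
    # (name, minutes per occurrence, occurrences per week, is_toil)
    ("Restart failed jobs",            15, 3,    True),
    ("Rotate staging certs",           30, 0.25, True),   # monthly ~0.25/wk
    ("Investigate false alerts",       45, 5,    True),   # weekday-daily
    ("Capacity planning review",       60, 0.25, False),
    ("Design review for new service", 120, 1,    False),
]

toil_min = sum(mins * freq for _, mins, freq, is_toil in tasks if is_toil)
toil_ratio = toil_min / WEEKLY_MINUTES

print(f"toil: {toil_min:.0f} min/week, ratio {toil_ratio:.0%}")
if toil_ratio > 0.60:
    print("ALARM: toil above 60%")
elif toil_ratio > 0.50:
    print("Above the 50% target: schedule toil-reduction work")
```

Note the denominator is total working time, not just the logged tasks; otherwise a sheet that only tracks toil would always report a ratio near 100%.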
The Toil Reduction Loop¶
┌─────────────┐ ┌───────────────┐ ┌──────────────┐
│ Identify │────▶│ Measure │────▶│ Prioritize │
│ (what's toil)│ │ (how much time)│ │ (ROI ranking) │
└─────────────┘ └───────────────┘ └──────┬───────┘
│
┌─────────────┐ ┌───────────────┐ │
│ Verify │◀────│ Automate │◀────────────┘
│ (toil gone?) │ │ (build it) │
└─────────────┘ └───────────────┘
4. Capacity Planning¶
Capacity planning answers: "Will we have enough resources when demand grows?" Bad capacity planning manifests as either outages (too little) or waste (too much).
Capacity Planning Cadence:
Weekly: Check utilization dashboards, flag hosts > 70% CPU/memory
Monthly: Review growth trends, project 90-day resource needs
Quarterly: Budget request for next quarter's infrastructure
Annually: Long-range forecasting tied to product roadmap
The Four Signals of Capacity¶
| Signal | What to Watch | Danger Zone |
|---|---|---|
| CPU | Sustained utilization across fleet | > 70% average (no burst headroom) |
| Memory | RSS growth trends, OOM frequency | > 80% average, any OOM kills |
| Disk | Growth rate vs provisioned space | < 20% free, or < 30 days until full |
| Network | Bandwidth utilization, connection counts | > 60% of link capacity |
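The disk row's danger zone is the easiest of the four signals to mechanize: project the current growth rate forward. A sketch under a linear-growth assumption; function names and the sample numbers are illustrative.

```python
def days_until_full(used_gb: float, capacity_gb: float,
                    growth_gb_per_day: float) -> float:
    """Linear projection of when the disk fills at the current growth rate."""
    if growth_gb_per_day <= 0:
        return float("inf")
    return (capacity_gb - used_gb) / growth_gb_per_day

def disk_danger(used_gb: float, capacity_gb: float,
                growth_gb_per_day: float) -> bool:
    """Danger zone per the table: < 20% free OR < 30 days until full."""
    free_frac = 1 - used_gb / capacity_gb
    return (free_frac < 0.20
            or days_until_full(used_gb, capacity_gb, growth_gb_per_day) < 30)

print(days_until_full(700, 1000, 5))   # 60.0 days of headroom
print(disk_danger(700, 1000, 5))       # 30% free, 60 days out -> False
print(disk_danger(700, 1000, 15))      # 20 days until full -> True
```

The "or" matters: a disk can be 50% free and still be 10 days from full if growth just spiked, which is exactly the case a free-space-only alert misses.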
Load Testing for Capacity¶
# Establish a baseline: what does your service handle today?
# Record: requests/sec, latency p50/p99, error rate, resource usage
# Synthetic load test (example with k6)
k6 run --vus 100 --duration 5m loadtest.js
# Key questions to answer:
# 1. At what RPS does p99 latency exceed your SLO?
# 2. At what RPS do errors start appearing?
# 3. What resource hits its limit first? (CPU? Memory? Connections?)
# 4. How does the system behave when overloaded? (graceful degradation or crash?)
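Question 1 reduces to finding the knee in your measurements. A minimal sketch: the (RPS, p99) pairs and the 300 ms threshold are made-up sample data, standing in for the summaries of successive k6 runs at rising load.

```python
SLO_P99_MS = 300  # assumed latency SLO for the example

# (requests/sec, observed p99 latency in ms) from successive load tests
results = [(100, 120), (200, 150), (400, 210), (800, 290), (1600, 950)]

def max_rps_within_slo(results, slo_ms):
    """Highest tested request rate whose p99 still meets the latency SLO."""
    ok = [rps for rps, p99 in results if p99 <= slo_ms]
    return max(ok) if ok else 0

print(max_rps_within_slo(results, SLO_P99_MS))  # 800
```

In this sample the service holds its SLO up to 800 RPS and falls off a cliff by 1600, so the next test run should bisect that range to locate the knee more precisely.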
5. Release Engineering¶
Release engineering is the discipline of getting code from a developer's branch to production safely, repeatably, and reversibly.
Release Safety Ladder:
Level 0: YOLO deploy (git push to main, auto-deploy)
Level 1: CI checks pass before merge
Level 2: Canary deploy (small % of traffic first)
Level 3: Progressive rollout (10% → 25% → 50% → 100%)
Level 4: Automated rollback on SLI degradation
Level 5: Dark launches + feature flags + automated analysis
| Practice | What It Prevents | Complexity |
|---|---|---|
| CI gate (tests pass) | Obviously broken code reaching prod | Low |
| Canary deploy | Bad code affecting all users at once | Medium |
| Feature flags | Coupling deploy with release | Medium |
| Blue-green deploy | Downtime during deploy | Medium |
| Progressive rollout | Blast radius of bad releases | High |
| Automated rollback | Slow human response to broken deploys | High |
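Level 4 on the ladder is, at its core, a comparison between the canary and the stable fleet. A minimal sketch of that decision; the 2x multiplier and the noise floor are illustrative policy knobs, not a standard.

```python
def should_rollback(canary_errors: int, canary_requests: int,
                    stable_errors: int, stable_requests: int,
                    max_ratio: float = 2.0, floor: float = 0.001) -> bool:
    """Roll back if the canary errors at more than max_ratio times the
    stable fleet's rate, ignoring rates below an absolute noise floor."""
    canary_rate = canary_errors / max(canary_requests, 1)
    stable_rate = stable_errors / max(stable_requests, 1)
    if canary_rate < floor:   # too little signal to act on
        return False
    return canary_rate > max_ratio * stable_rate

print(should_rollback(50, 10_000, 20, 100_000))  # 0.5% vs 0.02% -> True
print(should_rollback(2, 10_000, 20, 100_000))   # 0.02%: below floor -> False
```

The noise floor is what keeps a single flaky request from rolling back a healthy canary at low traffic; real systems add a minimum sample size and a soak time on top of this.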
6. Production Readiness Reviews¶
A Production Readiness Review (PRR) is a checklist-driven evaluation of whether a new service is ready to be supported in production. It happens before launch, not after the first outage.
PRR Checklist — Minimum Viable:
□ SLOs defined and measurable
□ Monitoring: dashboards exist, key metrics identified
□ Alerting: SLO-based alerts configured, on-call routed
□ Runbooks: common failure modes documented with remediation steps
□ Capacity: load tested, resource limits set, scaling policy defined
□ Dependencies: failure modes of each dependency understood
□ Rollback: deployment can be reversed in < 5 minutes
□ Data: backup strategy defined and tested
□ Security: auth, encryption, secrets management reviewed
□ On-call: team trained, escalation path defined
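A PRR with teeth can be as simple as a gate that refuses to pass while any item is unmet. A sketch: the item labels paraphrase the checklist above, and the dict format is an illustrative stand-in for however your org records review outcomes.

```python
PRR_CHECKLIST = {
    "SLOs defined and measurable":          True,
    "Dashboards and key metrics exist":     True,
    "SLO-based alerts routed to on-call":   True,
    "Runbooks for common failure modes":    False,
    "Load tested, limits and scaling set":  True,
    "Dependency failure modes understood":  True,
    "Rollback possible in under 5 minutes": True,
    "Backups defined and tested":           False,
    "Security review complete":             True,
    "On-call trained, escalation defined":  True,
}

def prr_gate(checklist):
    """Return the unmet items; an empty list means launch is approved."""
    return [item for item, done in checklist.items() if not done]

missing = prr_gate(PRR_CHECKLIST)
print("APPROVED" if not missing else f"BLOCKED on: {missing}")
```

Wiring this into CI (fail the launch pipeline while the list is non-empty) is what turns the checklist from documentation into a gate.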
PRR Graduation Model¶
┌──────────────────┐
│ Development │ No SLO, no on-call, break/fix by dev team
└────────┬─────────┘
│ PRR Review
┌────────▼─────────┐
│ Early Production │ SLO defined, shared on-call with SRE
└────────┬─────────┘
│ 90-day review
┌────────▼─────────┐
│ Full Production │ SLO enforced, error budget policy active
└────────┬─────────┘
│ Maturity review
┌────────▼─────────┐
│ Mature Service │ Self-service, automated everything
└──────────────────┘
7. On-Call and Escalation¶
SRE on-call is not "wake someone up for every alert." It is a structured system with clear escalation paths, bounded response times, and protection against burnout.
| Tier | Who | Response Time | Handles |
|---|---|---|---|
| L1 | On-call SRE | 5 min ack | Alert triage, runbook execution |
| L2 | Senior SRE / Service owner | 15 min ack | Complex diagnosis, non-runbook issues |
| L3 | Principal engineer / Architect | 30 min ack | System-wide failures, unknown unknowns |
| Executive | VP Eng / CTO | As needed | Customer communication, business decisions |
Escalation Rule of Thumb:
If you haven't made progress in 15 minutes → escalate to L2
If L2 hasn't made progress in 30 minutes → escalate to L3
If customer impact exceeds 1 hour → notify executive
Escalation is NOT failure. It is the system working correctly.
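The rule of thumb above can be written down as a function, which is also how paging tools encode it. An illustrative sketch: the tier labels match the table, but the function and its thresholds are assumptions, not any vendor's API.

```python
def escalation_tier(minutes_without_progress: float,
                    customer_impact_minutes: float = 0) -> str:
    """Who should be engaged, per the rule of thumb above."""
    tiers = ["L1"]
    if minutes_without_progress >= 15:
        tiers.append("L2")
    if minutes_without_progress >= 45:   # 15 min at L1 + 30 min at L2
        tiers.append("L3")
    if customer_impact_minutes >= 60:
        tiers.append("Executive notified")
    return " + ".join(tiers)

print(escalation_tier(10))       # L1
print(escalation_tier(20))       # L1 + L2
print(escalation_tier(50, 70))   # L1 + L2 + L3 + Executive notified
```

Encoding the thresholds removes the in-the-moment judgment call, which is the point: a stressed responder should never have to decide whether escalating "counts as failure."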
8. Eliminating Toil Through Automation Priorities¶
Not all toil is equally worth automating. Prioritize by frequency and time cost:
Automation Priority Matrix:
High Frequency
│
┌─────────────┼──────────────┐
│ AUTOMATE │ AUTOMATE │
│ NEXT │ FIRST │
Low │ (weekly, │ (daily, │ High
Time ────┤ 5 min each) │ 30 min each)│── Time
Cost │ │ │ Cost
│ CONSIDER │ AUTOMATE │
│ (is it │ SOON │
│ worth it?) │ (monthly, │
│ │ 2 hours) │
└─────────────┼──────────────┘
│
Low Frequency
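The matrix is frequency times time cost, so the ranking can be computed directly. A toy sketch; the task list and per-year frequencies are made up for illustration.

```python
tasks = [
    # (name, minutes per occurrence, occurrences per year)
    ("Monthly cert rotation",   5,  12),
    ("Weekly deploy babysit",  30,  52),
    ("Daily alert triage",     30, 365),
    ("Quarterly DR drill",    120,   4),
]

# Rank by annual hours saved if fully automated
ranked = sorted(tasks, key=lambda t: t[1] * t[2], reverse=True)
for name, minutes, freq in ranked:
    print(f"{name:24s} saves {minutes * freq / 60:6.1f} h/year if automated")
```

The daily 30-minute task tops the list at 182.5 hours/year while the monthly 5-minute task saves just 1 hour/year, the same arithmetic as the "do the math" pitfall below.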
Common Pitfalls¶
- Treating SRE as "ops with a new name" — SRE is an engineering discipline. If your SREs aren't writing code to eliminate toil, they're just ops engineers with fancier titles.
- No error budget policy — SLOs without consequences are just dashboards nobody looks at. The policy is what makes SLOs actionable.
- Toil measurement without toil reduction — Tracking toil in a spreadsheet feels productive but changes nothing. Allocate 30% of sprint capacity to toil elimination projects.
- Capacity planning by crisis — If you only think about capacity when the pager fires, you'll always be behind. Build the dashboards and review cadence before the emergency.
- PRR as a checkbox exercise — A PRR that approves everything is worthless. It should have teeth: services that fail PRR don't launch.
- Automating the wrong things first — Automating a monthly 5-minute task saves 1 hour/year. Automating a daily 30-minute task saves 182 hours/year. Do the math.
- Ignoring the human side — On-call burnout, alert fatigue, and toil demoralization are SRE problems too. A reliable system with a burned-out team is a temporary situation.
Wiki Navigation¶
Prerequisites¶
- Observability Deep Dive (Topic Pack, L2)
- Postmortems & SLOs (Topic Pack, L2)
Next Steps¶
- Capacity Planning (Topic Pack, L2)
Related Content¶
- Alerting Flashcards (CLI) (flashcard_deck, L1) — Alerting Rules
- Alerting Rules (Topic Pack, L2) — Alerting Rules
- Alerting Rules Drills (Drill, L2) — Alerting Rules
- Capacity Planning (Topic Pack, L2) — SRE Practices
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Alerting Rules
- Change Management (Topic Pack, L1) — SRE Practices
- Devops Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs
- On-Call (Topic Pack, L2) — Alerting Rules
- Postmortem & SLO Drills (Drill, L2) — Postmortems & SLOs
- Postmortem SLO Flashcards (CLI) (flashcard_deck, L1) — Postmortems & SLOs