# SLOs: When Good Enough Is a Number

Tags: lesson, slos, slis, slas, error-budgets, burn-rates, prometheus-alerting, sre-practices, incident-response

Topics: SLOs, SLIs, SLAs, error budgets, burn rates, Prometheus alerting, SRE practices, incident response
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Strategy: Build-up + incident-driven
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's the morning of day 12 in a 30-day cycle. You check the SLO dashboard for order-service
and see this:
SLO target: 99.9% availability
Error budget: 43.2 minutes/month
Budget consumed: 28.7 minutes
Budget remaining: 14.5 minutes (33.6%)
Current burn rate: 2.4x
You've used two-thirds of your monthly error budget and you're not even halfway through the month. At a 2.4x burn rate, budget is disappearing at 2.4 × (43.2 / 30) ≈ 3.46 minutes per day, so your remaining 14.5 minutes will be gone in about 4 days. That puts you at roughly day 16 — two weeks of the month left with zero budget and a feature release queued for next week.
What do you do? How did you get here? And what does "33.6% remaining" actually mean?
By the end of this lesson you'll understand:

- What SLIs, SLOs, and SLAs are — and the critical differences between them
- How to choose SLIs that reflect what users actually experience
- The math behind error budgets (it's simpler than it sounds)
- How burn rate alerts work and why they replaced threshold alerts
- How to implement all of this in Prometheus with real recording and alerting rules
- What to do when the budget runs low — the error budget policy that makes SLOs actionable
We'll build up from definitions to math to implementation, with a real incident threaded through.
Part 1: The Vocabulary — SLI vs SLO vs SLA¶
These three terms sound similar, get confused constantly, and mean completely different things.
| Term | What it is | Who owns it | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | A measurement of service quality | Engineering | 99.2% of requests returned non-5xx this week |
| SLO (Service Level Objective) | An internal target for an SLI | Engineering + Product | 99.9% of requests should succeed over 30 days |
| SLA (Service Level Agreement) | A contract with consequences | Business + Legal | 99.5% uptime or we issue service credits |
The hierarchy matters: SLA < SLO < theoretical max. Your SLO should be stricter than your SLA, giving you a buffer: problems get detected and fixed internally before they ever threaten the contract.
Name Origin: "SLA" comes from telecommunications. Telcos have written service level agreements since the 1980s, specifying uptime commitments for leased lines. "SLO" and "SLI" were formalized by Google's SRE book (2016), though the underlying idea — measuring defect rates against a target — traces back to Walter Shewhart's statistical process control work at Bell Labs in the 1920s. Google's innovation was applying "acceptable defect rate" thinking to software services and making the error budget a currency teams spend to ship features.
Here's the key insight that most people miss on first encounter: the SLI is what you measure, the SLO is the line you draw, and the SLA is the promise you make to someone who can sue you. If your SLI is "percentage of successful HTTP requests," your SLO might be "99.9% over 30 days," and your SLA to customers might be "99.5% or we refund."
Gotcha: Setting your SLO equal to your SLA is a trap. If SLA = SLO = 99.9%, you have zero margin. A single dip below SLO means immediate SLA breach with financial penalties. Set SLO stricter than SLA (e.g., SLO = 99.95% when SLA = 99.9%) so you detect and fix problems before they become contractual violations.
Flashcard Check #1¶
| Question | Answer (cover this column) |
|---|---|
| What does SLI stand for and what does it measure? | Service Level Indicator. A metric measuring service quality from the user's perspective. |
| If your SLO is 99.9% and your SLA is 99.5%, which is stricter? | The SLO. It gives you a buffer before SLA breach. |
| Why is CPU utilization a bad SLI? | It measures infrastructure, not user experience. A server at 90% CPU might serve perfectly; one at 10% CPU might return errors. |
Part 2: Choosing SLIs — Measuring What Users Care About¶
Not all metrics make good SLIs. The rule is simple: good SLIs measure what users experience, not what infrastructure does.
| SLI Type | Good SLI | Bad SLI | Why the bad one fails |
|---|---|---|---|
| Availability | % of HTTP requests returning non-5xx | up{job="api"} == 1 | Pod can be "up" while returning errors to every request |
| Latency | p99 response time < 300ms | Average response time | Averages hide tail latency — 1% of users could wait 30 seconds |
| Throughput | Successful requests per second | Network bandwidth | Bandwidth says nothing about whether requests succeed |
| Correctness | % of responses returning correct data | Test pass rate | Tests pass in CI; production data is different |
Mental Model: Think of SLIs as answering one question: "Can users do what they came to do?" If a user clicks "Place Order" and gets a 500 error, your availability SLI should reflect that. If they click "Place Order" and it takes 8 seconds, your latency SLI should reflect that. If their order goes through but charges the wrong amount, your correctness SLI should reflect that.
The four golden signals¶
Google's SRE book defined four signals every service should measure. They map directly to SLI categories:
| Signal | What it answers | PromQL example |
|---|---|---|
| Latency | How fast? | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) |
| Traffic | How much? | sum(rate(http_requests_total[5m])) |
| Errors | How often does it fail? | sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Saturation | How full? | container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"} |
Remember: Mnemonic for the four golden signals: LETS — Latency, Errors, Traffic, Saturation. RED (Rate, Errors, Duration) is the microservice-focused subset. USE (Utilization, Saturation, Errors) is Brendan Gregg's method for infrastructure resources.
Excluding internal traffic from SLIs¶
This is a footgun that bites almost everyone on their first SLO implementation:
```promql
# WRONG — includes health checks and metrics scrapes
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# RIGHT — only user-facing traffic
sum(rate(http_requests_total{code!~"5..",handler!~"/healthz|/readyz|/metrics"}[5m]))
/
sum(rate(http_requests_total{handler!~"/healthz|/readyz|/metrics"}[5m]))
```
Health checks from Kubernetes always succeed (that's their job). Including them inflates your good-event count and makes your SLI look better than reality.
Part 3: The Math — Error Budgets¶
This is the part that sounds scary and turns out to be arithmetic.
The formula¶
Error budget = 1 − SLO target

That's it. For a 99.9% SLO, your error budget is 0.1%. Now let's make that concrete.
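Converting an SLO into concrete minutes is a one-liner. A minimal sketch (the function name is illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget in minutes for a given SLO over a rolling window."""
    minutes_in_window = window_days * 24 * 60   # 30 days -> 43,200 minutes
    return minutes_in_window * (1 - slo)        # budget = window * (1 - SLO)

print(round(error_budget_minutes(0.999), 2))   # ~43.2 minutes/month for three nines
print(round(error_budget_minutes(0.9999), 2))  # ~4.32 minutes/month for four nines
```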
The nines table¶
There are 43,200 minutes in a 30-day month. Your error budget in minutes:
| SLO | Error budget | Minutes/month | Translation |
|---|---|---|---|
| 99% | 1% | 432 | ~7.2 hours. Generous. |
| 99.5% | 0.5% | 216 | ~3.6 hours. Comfortable. |
| 99.9% | 0.1% | 43.2 | Less than 45 minutes. This is what most teams target. |
| 99.95% | 0.05% | 21.6 | About 20 minutes. Getting tight. |
| 99.99% | 0.01% | 4.32 | 4 minutes and 19 seconds. No human can respond this fast. |
| 99.999% | 0.001% | 0.432 | 26 seconds. This requires full automation. |
Remember: "Each nine costs 10x more." Going from 99.9% to 99.99% doesn't add 0.09% reliability — it removes 90% of your error budget. The jump from three nines to four nines typically requires an order of magnitude more investment in redundancy, automation, and operational rigor.
Trivia: Most teams that claim 99.99% availability have not actually measured it correctly. When you account for client-side timeouts, DNS failures, and edge-case error codes that don't register as 5xx, the real number is often worse. Honest measurement is harder than ambitious targets.
Back to the mission¶
Let's do the math on our order-service situation:
SLO: 99.9%
Error budget: 0.1% = 43.2 minutes/month
Day of month: 12 (of 30)
Budget consumed: 28.7 minutes
Budget remaining: 43.2 - 28.7 = 14.5 minutes
What happened?
- Day 3: Deploy caused 12 minutes of elevated errors
- Day 7: Dependency timeout caused 8.2 minutes
- Day 8-12: Slow leak — 0.5% error rate intermittently (8.5 minutes total)
The first two incidents were acute — they burned budget fast but were fixed fast. The third one is the slow burn that's hard to notice and hard to stop.
Part 4: Burn Rate — How Fast Are You Spending?¶
Error budget remaining tells you where you are. Burn rate tells you where you're headed.
The formula¶
Burn rate = (current error rate) / (error budget rate)
Where:
error budget rate = 1 - SLO target = 0.001 for a 99.9% SLO
Burn rate of 1.0 means you're consuming budget at exactly the rate that would exhaust it over the full 30-day window. You'll land right at zero — technically meeting SLO, but with no margin.
| Burn rate | What it means | Budget lasts | Action |
|---|---|---|---|
| 0.5x | Burning slowly, you'll have budget left at month-end | 60 days | No action needed |
| 1.0x | Exactly on pace to exhaust budget at month-end | 30 days | Monitor closely |
| 2.0x | Burning twice as fast as allowed | 15 days | Investigate |
| 6.0x | Burning 6x — budget gone in 5 days | 5 days | Immediate action |
| 14.4x | Budget gone in ~2 days | 2.08 days | Page. Wake someone up. |
Worked example¶
Our order-service has a 0.24% error rate right now. With a 99.9% SLO:
Burn rate = 0.0024 / 0.001 = 2.4x
At 2.4x, the budget burns at 2.4 × (43.2 / 30) = 3.456 minutes/day, so the remaining 14.5 minutes will last:
14.5 minutes / 3.456 minutes per day ≈ 4.2 days
Day 12 + 4 days ≈ Day 16
That leaves roughly 14 days with zero budget.
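The worked example can be reproduced with a short Python sketch (function names are illustrative):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'allowed' the budget is burning."""
    return error_rate / (1 - slo)

def days_to_exhaustion(budget_left_min: float, rate: float,
                       slo: float = 0.999, window_days: int = 30) -> float:
    """Days until the remaining budget is gone at the current burn rate."""
    # Budget consumed per day at a 1x burn: full budget spread over the window.
    budget_min_per_day = (window_days * 24 * 60) * (1 - slo) / window_days
    return budget_left_min / (budget_min_per_day * rate)

rate = burn_rate(0.0024, 0.999)                  # the mission's 0.24% error rate
print(round(rate, 2))                            # -> 2.4
print(round(days_to_exhaustion(14.5, rate), 1))  # -> 4.2 (exhaustion around day 16)
```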
This is the number that should make you sit up. Not "33.6% remaining" — that sounds manageable. But "zero budget by day 16 with a feature release on day 20" — that's a problem.
Mental Model: Think of burn rate like a speedometer on a road trip. The error budget is your fuel tank. "33% fuel remaining" matters a lot less than "you're doing 140 in a 60 zone." Burn rate tells you the speed; time-to-exhaustion tells you when you run out.
Flashcard Check #2¶
| Question | Answer (cover this column) |
|---|---|
| What burn rate means "budget lasts exactly 30 days"? | 1.0x — you're consuming at the exact rate that exhausts the budget over the SLO window. |
| If a 99.9% SLO has a burn rate of 10x, how long until budget exhaustion? | 30 days / 10 = 3 days. |
| Why is burn rate more useful than "budget remaining %"? | Budget remaining is a snapshot; burn rate tells you the trajectory. 50% remaining is fine on day 15, terrifying on day 2. |
Part 5: Multi-Window Burn Rate Alerting¶
This is where Google's SRE Workbook changed the industry. Before multi-window burn rate alerting, SLO alerts were either too noisy (fire on every blip) or too slow (miss real incidents).
The problem with naive alerting¶
```yaml
# BAD — do not use this
- alert: HighErrorRate
  expr: |
    rate(http_requests_total{code=~"5.."}[5m])
    / rate(http_requests_total[5m]) > 0.001
  for: 5m
```
This fires whenever the 5-minute error rate exceeds the budget rate. Problems:

- A 30-second spike pages on-call for something that consumed 0.001% of the budget
- No urgency context — is this a 14x burn or a 1.1x burn?
- By the time on-call responds, the spike may have resolved
The Google approach: four alert tiers¶
The SRE Workbook (2018, Chapter 6) defines four alerting tiers. Each uses two windows — a short window to detect the severity and a long window to confirm it's real:
| Tier | Burn rate | Short window | Long window | Severity | Budget consumed before alert |
|---|---|---|---|---|---|
| 1 | 14.4x | 5m | 1h | Page (critical) | 2% in 1 hour |
| 2 | 6.0x | 30m | 6h | Page (critical) | 5% in 6 hours |
| 3 | 3.0x | 2h | 24h | Ticket (warning) | 10% in 24 hours |
| 4 | 1.0x | 6h | 72h | Ticket (warning) | 100% in 30 days |
Why two windows? The short window catches the spike. The long window confirms it's not a blip. Both must exceed the burn rate threshold before the alert fires.
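The two-window condition is just a conjunction over two precomputed error ratios. A minimal sketch of the predicate (names are illustrative, not a Prometheus API):

```python
def tier_fires(short_window_ratio: float, long_window_ratio: float,
               burn_threshold: float, slo: float = 0.999) -> bool:
    """Fire only when BOTH windows exceed the tier's burn-rate threshold.

    The short window reacts quickly; the long window filters out blips.
    """
    budget_rate = 1 - slo                       # 0.001 for a 99.9% SLO
    threshold = burn_threshold * budget_rate    # e.g. 14.4 * 0.001 = 1.44%
    return short_window_ratio > threshold and long_window_ratio > threshold

# A 30-second spike: the 5m window is hot but the 1h window is not -> no page
print(tier_fires(0.05, 0.0004, burn_threshold=14.4))  # False
# A sustained incident: both windows exceed 14.4x -> page
print(tier_fires(0.05, 0.02, burn_threshold=14.4))    # True
```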
Trivia: The specific burn rate numbers (14.4, 6, 3, 1) aren't arbitrary. They're derived from the percentage of error budget you're willing to consume before being alerted. 14.4x burn for 1 hour consumes 2% of a 30-day budget. 6x burn for 6 hours consumes 5%. The math:
burn_rate = (budget_fraction * SLO_window_hours) / alert_window_hours. For tier 1: 14.4 = (0.02 * 720) / 1.
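All four tiers fall out of the same formula. A quick sketch that derives them (tier values from the SRE Workbook):

```python
# Derive each tier's burn-rate threshold from the fraction of budget you are
# willing to spend before being alerted.
WINDOW_HOURS = 30 * 24  # 720 hours in the 30-day SLO window

tiers = [
    # (budget fraction consumed, long alert window in hours)
    (0.02, 1),    # tier 1: 2% of budget in 1 hour
    (0.05, 6),    # tier 2: 5% in 6 hours
    (0.10, 24),   # tier 3: 10% in 24 hours
    (1.00, 720),  # tier 4: 100% over the whole window
]

for budget_fraction, window_h in tiers:
    burn = budget_fraction * WINDOW_HOURS / window_h
    print(f"{budget_fraction:.0%} in {window_h}h -> {burn:g}x burn rate")
```

Running this reproduces the 14.4x, 6x, 3x, and 1x thresholds from the table above.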
What the error rates actually look like¶
For a 99.9% SLO, here's what each burn rate tier translates to in real error rates:
14.4x burn rate → 14.4 × 0.001 = 1.44% error rate
6.0x burn rate → 6.0 × 0.001 = 0.60% error rate
3.0x burn rate → 3.0 × 0.001 = 0.30% error rate
1.0x burn rate → 1.0 × 0.001 = 0.10% error rate
A 1.44% error rate doesn't sound terrible — but at that rate, your entire monthly error budget is gone in 50 hours. That's why burn rate is a better signal than raw error rate.
Part 6: SLO Alerting in Prometheus — Real Rules¶
Let's implement multi-window burn rate alerting for order-service.
Step 1: Recording rules (pre-compute the ratios)¶
Recording rules compute the error ratios once and store them as new time series. This saves query-time computation and keeps alerting rules clean.
```yaml
# prometheus-recording-rules.yaml
groups:
  - name: slo:order-service:availability
    interval: 30s
    rules:
      # 5-minute error ratio
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="order-service"}[5m]))
        labels:
          service: order-service
          slo: availability
      # 30-minute error ratio
      - record: slo:sli_error:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{job="order-service"}[30m]))
        labels:
          service: order-service
          slo: availability
      # 1-hour error ratio
      - record: slo:sli_error:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="order-service"}[1h]))
        labels:
          service: order-service
          slo: availability
      # 2-hour error ratio (used by the tier-3 alert)
      - record: slo:sli_error:ratio_rate2h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[2h]))
          /
          sum(rate(http_requests_total{job="order-service"}[2h]))
        labels:
          service: order-service
          slo: availability
      # 6-hour error ratio
      - record: slo:sli_error:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="order-service"}[6h]))
        labels:
          service: order-service
          slo: availability
      # 24-hour error ratio
      - record: slo:sli_error:ratio_rate24h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[24h]))
          /
          sum(rate(http_requests_total{job="order-service"}[24h]))
        labels:
          service: order-service
          slo: availability
      # 72-hour (3-day) error ratio
      - record: slo:sli_error:ratio_rate3d
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="order-service"}[3d]))
        labels:
          service: order-service
          slo: availability
```
Step 2: Alerting rules (multi-window burn rate)¶
```yaml
# prometheus-alerting-rules.yaml
groups:
  - name: slo:order-service:alerts
    rules:
      # Tier 1: Fast burn — page immediately
      # 14.4x burn: budget gone in ~2 days
      - alert: OrderServiceBudgetFastBurn
        expr: |
          slo:sli_error:ratio_rate5m{service="order-service",slo="availability"} > (14.4 * 0.001)
          and
          slo:sli_error:ratio_rate1h{service="order-service",slo="availability"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: "true"
          team: platform
        annotations:
          summary: "order-service burning error budget at 14.4x rate"
          description: |
            Current 5m error rate: {{ $value | humanizePercentage }}
            At this rate, 30-day budget exhausted in ~2 days.
          runbook_url: "https://runbooks.example.com/slo-budget-burn"

      # Tier 2: Moderate burn — page
      # 6x burn: budget gone in ~5 days
      - alert: OrderServiceBudgetModerateBurn
        expr: |
          slo:sli_error:ratio_rate30m{service="order-service",slo="availability"} > (6 * 0.001)
          and
          slo:sli_error:ratio_rate6h{service="order-service",slo="availability"} > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
          slo: "true"
          team: platform
        annotations:
          summary: "order-service burning error budget at 6x rate"
          description: |
            Current 30m error rate: {{ $value | humanizePercentage }}
            At this rate, 30-day budget exhausted in ~5 days.
          runbook_url: "https://runbooks.example.com/slo-budget-burn"

      # Tier 3: Slow burn — ticket
      # 3x burn: budget gone in ~10 days
      - alert: OrderServiceBudgetSlowBurn
        expr: |
          slo:sli_error:ratio_rate2h{service="order-service",slo="availability"} > (3 * 0.001)
          and
          slo:sli_error:ratio_rate24h{service="order-service",slo="availability"} > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: "true"
          team: platform
        annotations:
          summary: "order-service burning error budget at 3x rate"

      # Tier 4: Chronic burn — ticket
      # 1x burn: budget exactly meets SLO
      - alert: OrderServiceBudgetChronicBurn
        expr: |
          slo:sli_error:ratio_rate6h{service="order-service",slo="availability"} > (1 * 0.001)
          and
          slo:sli_error:ratio_rate3d{service="order-service",slo="availability"} > (1 * 0.001)
        for: 1h
        labels:
          severity: warning
          slo: "true"
          team: platform
        annotations:
          summary: "order-service chronically burning error budget"
```
Under the Hood: Why recording rules instead of inline expressions? Each alerting rule evaluation recalculates the PromQL expression. With 4 alert tiers of two windows each, that's 8 rate calculations every evaluation cycle. Recording rules compute each of the 7 distinct windows once per interval and store the result. The alert rules then read a simple label lookup instead of re-aggregating raw metrics. This matters at scale — 50 services with 2 SLOs each means 100 SLOs, 700 recording rules, and 400 alerting rules.
Step 3: Burn rate in PromQL (for dashboards)¶
```promql
# Current burn rate (1h window)
(
  sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="order-service"}[1h]))
) / 0.001

# Error budget remaining (0 = exhausted, 1 = full)
1 - (
  sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
  /
  sum(increase(http_requests_total{job="order-service"}[30d]))
) / 0.001

# Time to budget exhaustion at current burn rate (hours)
# numerator: budget remaining (fraction) × 720 hours in the 30-day window
# denominator: burn rate — the outer parentheses matter, since PromQL
# division is left-associative
(
  1 - (
    sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
    / sum(increase(http_requests_total{job="order-service"}[30d]))
  ) / 0.001
) * 720
/
(
  (
    sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
    / sum(rate(http_requests_total{job="order-service"}[1h]))
  ) / 0.001
)
```
Gotcha: Using increase() over a 30-day window can produce inaccurate values if the metric series had gaps (restarts, scrape failures). Prometheus extrapolates across gaps, which can inflate or deflate the count. For production SLO dashboards, prefer recording rules that accumulate daily totals, or use Sloth or Pyrra, which handle this correctly by computing over pre-aggregated 5-minute ratios.
Part 7: SLO-Based Alerting vs Threshold Alerting¶
This is the conceptual shift that makes SLOs worth the effort.
| | Threshold alerting | SLO-based alerting |
|---|---|---|
| Fires when | Error rate > 5% for 5 minutes | Error budget burning at 14.4x for 1 hour |
| Context | "Errors are high right now" | "At this rate, your monthly budget is gone in 2 days" |
| False positives | High — fires on brief spikes | Low — two-window confirmation |
| Missed incidents | High — 0.5% sustained error rate flies under 5% threshold | Low — 1x chronic burn still triggers tier 4 alert |
| Actionability | Vague — "errors high, do something" | Specific — "budget exhausts in X hours, investigate or freeze deploys" |
War Story: A widely-reported SRE pattern: a team set a threshold alert at 1% error rate. Their normal error rate was 0.05%. A code change introduced a bug affecting 0.8% of requests — below the 1% threshold. The alert never fired. Over 3 weeks, the slow burn consumed the entire monthly error budget three times over. With burn rate alerting, the tier 4 alert (1x burn over 72 hours) would have caught it on day 3. This pattern — the "slow burn that threshold alerts miss" — is the single most common argument for switching to SLO-based alerting. Google's SRE Workbook (2018, Chapter 5) documents several variants of this failure mode.
Part 8: The Error Budget Policy — Making SLOs Matter¶
SLOs without consequences are dashboards nobody looks at. The error budget policy is what gives SLOs teeth.
A real error budget policy¶
Error Budget Policy — order-service
SLO: 99.9% availability over 30-day rolling window
Budget: 43.2 minutes/month
THRESHOLDS:
1. Budget > 50% remaining:
- Normal development pace
- Ship features at will
- Standard deploy cadence
2. Budget 25-50% remaining:
- Review recent deploys for reliability impact
- Enable canary deployments for all changes
- Daily check on budget burn rate
3. Budget < 25% remaining:
- Prioritize reliability work over features
- Reduced deploy frequency (max 1/day)
- Mandatory rollback plan for every deploy
- Escalate to engineering lead
4. Budget exhausted (0%):
- Feature freeze — only reliability fixes ship
- Mandatory postmortem for budget-depleting incidents
- All engineering effort on reliability
- Exception: emergency security patches (approved by VP Eng)
REVIEW:
- Monthly: SLO review meeting (eng lead + product)
- Quarterly: SLO target reassessment
Mental Model: Error budgets work like a credit card limit. When you're flush, spend freely — ship features, run experiments, try risky deploys. When you're maxed out, stop spending and pay down the debt — fix flaky tests, add retries, improve failover. The error budget policy is your overdraft protection. It triggers automatically and forces the hard conversation between product ("we need this feature") and engineering ("we need the service to stay up").
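A budget-policy check like this is easy to automate — e.g., a daily bot that posts the current tier. A minimal Python sketch using the thresholds above (function name and tier strings are illustrative):

```python
def policy_tier(budget_remaining: float) -> str:
    """Map remaining error-budget fraction (0.0-1.0) to a policy tier."""
    if budget_remaining <= 0:
        return "4: exhausted — feature freeze, reliability fixes only"
    if budget_remaining < 0.25:
        return "3: <25% — prioritize reliability, max 1 deploy/day"
    if budget_remaining < 0.50:
        return "2: 25-50% — canary deploys, daily burn-rate check"
    return "1: >50% — normal development pace"

# The mission scenario (33.6% remaining) lands in tier 2
print(policy_tier(0.336))
```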
The Google SRE book origin¶
The error budget concept was formalized in Google's Site Reliability Engineering book (O'Reilly, 2016, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy). The book is available free at sre.google. The companion Site Reliability Workbook (2018) added practical implementation details including the multi-window burn rate alerting approach.
Trivia: The Google SRE book was the most downloaded free book in O'Reilly's publishing history. The concept that had the most industry impact wasn't SLOs themselves — it was the error budget as an objective mechanism for resolving the velocity-vs-reliability conflict. Before error budgets, developers and operators were in permanent political tension. After error budgets, the argument became: "Do we have budget? Ship. No budget? Fix." Data replaced politics.
Flashcard Check #3¶
| Question | Answer (cover this column) |
|---|---|
| What happens when the error budget is exhausted? | Feature freeze. Only reliability fixes ship until the budget recovers. |
| Why must an error budget policy exist before defining SLOs? | Without consequences, SLOs are just numbers on a dashboard. The policy makes them actionable. |
| Who published the SRE book that formalized error budgets? | Google (2016). Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy. Free at sre.google. |
Part 9: War Story — The SLO That Was Too Tight¶
War Story: A platform team set a 99.99% availability SLO for their internal API gateway — 4.3 minutes of allowed downtime per month. The gateway handled 50,000 req/sec, so even brief blips consumed the budget fast. Every routine deploy triggered a burn rate alert because the rolling restart caused 2-3 seconds of connection resets. The team stopped deploying to avoid burning budget. After 6 weeks without deploys, a security patch sat undeployed for 11 days. The fix: they lowered the SLO to 99.9% (43 minutes/month), which gave room for graceful deploys. Deploy frequency went from "almost never" to twice a week. Counterintuitively, the service became more reliable — regular deploys meant smaller changes, faster rollbacks, and no more security patch backlogs.
Lesson: an SLO so tight that it prevents deploys makes the system less reliable over time. The SLO should be "as low as users will tolerate" — not "as high as we can imagine."
This is the tension at the heart of SRE: reliability and velocity are not opposites, but an SLO that's too tight makes them adversaries. The right SLO creates space for both.
Part 10: SLO Documents and Review Cadence¶
SLOs are not set-and-forget. They need a document and a review cycle.
What goes in an SLO document¶
```markdown
# SLO Document: order-service

## Service description
Handles customer order placement, payment processing, and order status queries.
~50,000 requests/minute during peak hours.

## SLIs
1. Availability: % of non-5xx responses (excluding /healthz, /readyz, /metrics)
2. Latency: % of requests completing in < 500ms (p99)

## SLOs
1. Availability: 99.9% over 30-day rolling window
2. Latency: 99.0% of requests < 500ms over 30-day rolling window

## Error budget
1. Availability: 0.1% = 43.2 minutes/month
2. Latency: 1.0% = 432 minutes/month (latency budget is more generous)

## Dependencies
- payment-gateway (external, 99.5% SLA from vendor)
- inventory-service (internal, 99.9% SLO)
- postgres-primary (internal, 99.95% SLO)

## Error budget policy
[Link to policy document]

## Dashboard
[Link to Grafana SLO dashboard]

## Last review: 2026-02-15
## Next review: 2026-05-15
```
Review cadence¶
| Review | Frequency | Who | Questions |
|---|---|---|---|
| Budget check | Daily (automated) | Dashboard/Slack bot | "Is burn rate normal?" |
| SLO meeting | Monthly | Eng lead + product | "Did we meet SLO? What consumed budget?" |
| SLO reassessment | Quarterly | Eng lead + product + SRE | "Is the target still right? Too tight? Too loose?" |
| Full SLO audit | Annually | SRE + architecture | "Are we measuring the right things? Any new SLIs needed?" |
Gotcha: The quarterly reassessment is where teams discover their SLO is wrong. Common signs: SLO is never breached (too loose — tighten it or the budget policy never activates and SLOs become irrelevant). SLO is breached every month (too tight — the team ignores it because it's always red). Either extreme makes SLOs useless.
Part 11: Resolving the Mission¶
Back to day 12. Budget at 33.6%, burn rate at 2.4x.
Here's the decision framework:
Step 1: What's the burn rate?
→ 2.4x. Not critical (that's < 6x), but not sustainable.
Step 2: What's the time-to-exhaustion?
→ 14.5 minutes / (2.4 × 43.2 / 30) = 14.5 / 3.456 ≈ 4.2 days. Budget exhausted by roughly day 16.
Step 3: What's consuming budget?
→ Dashboard shows: intermittent 503s from the payment-gateway dependency.
→ Not our code — but it IS our SLI.
Step 4: Consult the error budget policy.
→ 33.6% remaining = "Budget 25-50% remaining" tier.
→ Actions: enable canary deploys, daily burn rate checks, review recent changes.
Step 5: Address the root cause.
→ payment-gateway is rate-limiting us during peak hours.
→ Short-term: add retry with exponential backoff.
→ Long-term: discuss rate limit increase with vendor, add circuit breaker.
Step 6: Reassess the feature release.
→ The release is on day 20. If the fix reduces burn rate to < 1x by day 14,
we'll have enough budget to absorb the deploy risk.
→ If burn rate stays at 2.4x, we defer the release per the budget policy.
This is what SLOs look like in practice. Not a dashboard you glance at — a decision framework that tells you what to do and when.
Exercises¶
Exercise 1: Calculate the budget (2 minutes)¶
Your service has a 99.5% SLO over a 30-day window. How many minutes of downtime equivalent does your error budget allow?
Exercise 2: Interpret a burn rate (5 minutes)¶
Your 99.9% SLO shows a current burn rate of 4.5x. It's day 20 of 30, and you've consumed 60% of your budget.
- How many minutes of budget remain?
- At 4.5x burn, how long until budget exhaustion?
- What tier of alert should this trigger?
Solution
1. Budget remaining: 43.2 × 0.40 = 17.28 minutes
2. Daily budget consumption at 4.5x: 43.2 / 30 × 4.5 = 6.48 minutes/day
Time to exhaustion: 17.28 / 6.48 = 2.67 days
3. 4.5x is between 3x (tier 3, warning) and 6x (tier 2, critical).
This triggers the tier 3 alert (slow burn, ticket).
But with only 2.67 days until exhaustion and 10 days left in the month,
the error budget policy likely calls for prioritizing reliability work.
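The arithmetic above can be sanity-checked in a few lines of Python (a sketch; variable names are illustrative):

```python
BUDGET_MIN = 43.2                        # 99.9% SLO over 30 days
remaining = BUDGET_MIN * (1 - 0.60)      # 60% already consumed
daily_burn = BUDGET_MIN / 30 * 4.5       # minutes/day at a 4.5x burn rate

print(round(remaining, 2))               # -> 17.28 minutes left
print(round(remaining / daily_burn, 2))  # -> 2.67 days to exhaustion
```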
Exercise 3: Write a recording rule (10 minutes)¶
Write a Prometheus recording rule that computes the 1-hour error ratio for a service
called checkout-service, counting HTTP 5xx responses as errors. Exclude the /health
endpoint.
Solution
```yaml
groups:
  - name: slo:checkout-service:availability
    rules:
      - record: slo:sli_error:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout-service",code=~"5..",handler!="/health"}[1h]))
          /
          sum(rate(http_requests_total{job="checkout-service",handler!="/health"}[1h]))
        labels:
          service: checkout-service
          slo: availability
```
Exercise 4: Design an error budget policy (15 minutes)¶
You're launching a new service. Historical data shows baseline availability of 98.8%. Draft: (a) an appropriate initial SLO, (b) a three-tier error budget policy with concrete actions at each tier.
Hints
- Start at or below the historical baseline
- The policy should be achievable *today*, not aspirational
- Each tier needs specific actions, not vague guidance

Solution sketch
SLO: 98.5% (below the 98.8% baseline — gives room to breathe)
Error budget: 1.5% = 648 minutes/month (~10.8 hours)
Tier 1 (> 50% budget remaining): Normal operations. Ship at will.
Tier 2 (25-50% remaining): Review error sources. Enable canary deploys.
All deploys require rollback plan. Weekly budget review meeting.
Tier 3 (< 25% remaining): Feature freeze. Reliability sprint.
Mandatory postmortem for any incident consuming > 5% of budget.
Improvement path: Once you consistently meet 98.5% with >40% budget
remaining, raise to 99.0%. Repeat until you reach the target your
users actually need.
Cheat Sheet¶
| Concept | Formula / Value |
|---|---|
| Error budget | 1 - SLO target |
| Budget in minutes (30d) | 43,200 × (1 - SLO) |
| Burn rate | current_error_rate / (1 - SLO) |
| Time to exhaustion (from full budget) | SLO_window / burn_rate |
| 99% = | 432 min/month (7.2 hours) |
| 99.9% = | 43.2 min/month |
| 99.99% = | 4.32 min/month |
| 99.999% = | 26 seconds/month |
Burn rate alert tiers (99.9% SLO, 30-day window):
| Tier | Burn rate | Error rate | Windows | Severity |
|---|---|---|---|---|
| 1 | 14.4x | 1.44% | 5m + 1h | Critical (page) |
| 2 | 6.0x | 0.60% | 30m + 6h | Critical (page) |
| 3 | 3.0x | 0.30% | 2h + 24h | Warning (ticket) |
| 4 | 1.0x | 0.10% | 6h + 3d | Warning (ticket) |
PromQL patterns:
| What | Query |
|---|---|
| Error ratio | sum(rate(errors[W])) / sum(rate(total[W])) |
| Burn rate | error_ratio / (1 - SLO) |
| Budget remaining | 1 - (sum(increase(errors[30d])) / sum(increase(total[30d]))) / (1 - SLO) |
Tools that generate SLO rules automatically:
| Tool | What it does |
|---|---|
| Sloth | Generates Prometheus recording + alerting rules from YAML SLO spec |
| Pyrra | Kubernetes-native SLO CRDs with built-in error budget UI |
| OpenSLO | Vendor-neutral SLO spec (converts to Sloth/Pyrra format) |
Takeaways¶
-
SLIs measure user experience, SLOs set the target, SLAs are the contract. Never confuse them. Never set SLO equal to SLA.
-
99.9% sounds like a small number but it's only 43 minutes a month. Each additional nine costs 10x more and cuts your budget by 90%.
-
Burn rate, not error rate. A 0.5% error rate sounds harmless; at 5x burn rate it exhausts your monthly budget in 6 days.
-
Two windows catch what one window misses. Multi-window burn rate alerting eliminates false positives from spikes and false negatives from slow burns.
-
SLOs without an error budget policy are just dashboards. The policy is what makes SLOs drive actual decisions — feature freezes, reliability sprints, deploy cadence changes.
-
Start loose, tighten with data. Set your initial SLO at or below your historical baseline. A 99% SLO you consistently meet teaches you more than a 99.99% SLO you constantly breach.
Related Lessons¶
- Prometheus and the Art of Not Alerting — deeper dive on alerting philosophy and alert fatigue
- The Monitoring That Lied — when your observability tools give you the wrong answer
- How Incident Response Actually Works — what happens when the burn rate alert fires at 3am
- The Art of the Postmortem — writing the postmortem after the SLO breach
- Deploy a Web App from Nothing — building the service that needs SLOs