
SLOs: When Good Enough Is a Number


Topics: SLOs, SLIs, SLAs, error budgets, burn rates, Prometheus alerting, SRE practices, incident response
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Strategy: Build-up + incident-driven
Prerequisites: None (everything is explained from scratch)


The Mission

It's the morning of day 12 in a 30-day cycle. You check the SLO dashboard for order-service and see this:

SLO target:        99.9% availability
Error budget:      43.2 minutes/month
Budget consumed:   28.7 minutes
Budget remaining:  14.5 minutes (33.6%)
Current burn rate: 2.4x

You've used two-thirds of your monthly error budget and you're not even halfway through the month. At the current 2.4x burn rate, your remaining 14.5 minutes of budget will be gone in about 4 days. That puts you at day 16 — fourteen days of the month left with zero budget and a feature release queued for next week.

What do you do? How did you get here? And what does "33.6% remaining" actually mean?

By the end of this lesson you'll understand:

- What SLIs, SLOs, and SLAs are — and the critical differences between them
- How to choose SLIs that reflect what users actually experience
- The math behind error budgets (it's simpler than it sounds)
- How burn rate alerts work and why they replaced threshold alerts
- How to implement all of this in Prometheus with real recording and alerting rules
- What to do when the budget runs low — the error budget policy that makes SLOs actionable

We'll build up from definitions to math to implementation, with a real incident threaded through.


Part 1: The Vocabulary — SLI vs SLO vs SLA

These three terms sound similar, get confused constantly, and mean completely different things.

| Term | What it is | Who owns it | Example |
| --- | --- | --- | --- |
| SLI (Service Level Indicator) | A measurement of service quality | Engineering | 99.2% of requests returned non-5xx this week |
| SLO (Service Level Objective) | An internal target for an SLI | Engineering + Product | 99.9% of requests should succeed over 30 days |
| SLA (Service Level Agreement) | A contract with consequences | Business + Legal | 99.5% uptime or we issue service credits |

The hierarchy matters: SLA < SLO < theoretical max. Your SLO should be stricter than your SLA, giving you a buffer to detect and fix problems before they become contractual violations.

Name Origin: "SLA" comes from telecommunications. Telcos have written service level agreements since the 1980s, specifying uptime commitments for leased lines. "SLO" and "SLI" were formalized by Google's SRE book (2016), though the underlying idea — measuring defect rates against a target — traces back to Walter Shewhart's statistical process control work at Bell Labs in the 1920s. Google's innovation was applying "acceptable defect rate" thinking to software services and making the error budget a currency teams spend to ship features.

Here's the key insight that most people miss on first encounter: the SLI is what you measure, the SLO is the line you draw, and the SLA is the promise you make to someone who can sue you. If your SLI is "percentage of successful HTTP requests," your SLO might be "99.9% over 30 days," and your SLA to customers might be "99.5% or we refund."

Gotcha: Setting your SLO equal to your SLA is a trap. If SLA = SLO = 99.9%, you have zero margin. A single dip below SLO means immediate SLA breach with financial penalties. Set SLO stricter than SLA (e.g., SLO = 99.95% when SLA = 99.9%) so you detect and fix problems before they become contractual violations.
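The margin logic can be sketched in a few lines of Python (the function name and the example thresholds are illustrative, not from any library):

```python
def classify(success_ratio: float, slo: float, sla: float) -> str:
    """Classify a measured success ratio against internal SLO and contractual SLA."""
    assert slo > sla, "SLO should be stricter (higher) than the SLA"
    if success_ratio >= slo:
        return "healthy"
    if success_ratio >= sla:
        return "SLO breached: fix before it becomes an SLA breach"
    return "SLA breached: contractual penalties apply"

# 99.92% measured: below the 99.95% SLO, still above the 99.9% SLA.
# The buffer did its job — you get paged before the lawyers do.
print(classify(0.9992, slo=0.9995, sla=0.999))
```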

Flashcard Check #1

| Question | Answer (cover this column) |
| --- | --- |
| What does SLI stand for and what does it measure? | Service Level Indicator. A metric measuring service quality from the user's perspective. |
| If your SLO is 99.9% and your SLA is 99.5%, which is stricter? | The SLO. It gives you a buffer before SLA breach. |
| Why is CPU utilization a bad SLI? | It measures infrastructure, not user experience. A server at 90% CPU might serve perfectly; one at 10% CPU might return errors. |

Part 2: Choosing SLIs — Measuring What Users Care About

Not all metrics make good SLIs. The rule is simple: good SLIs measure what users experience, not what infrastructure does.

| SLI Type | Good SLI | Bad SLI | Why the bad one fails |
| --- | --- | --- | --- |
| Availability | % of HTTP requests returning non-5xx | `up{job="api"} == 1` | Pod can be "up" while returning errors to every request |
| Latency | p99 response time < 300ms | Average response time | Averages hide tail latency — 1% of users could wait 30 seconds |
| Throughput | Successful requests per second | Network bandwidth | Bandwidth says nothing about whether requests succeed |
| Correctness | % of responses returning correct data | Test pass rate | Tests pass in CI; production data is different |

Mental Model: Think of SLIs as answering one question: "Can users do what they came to do?" If a user clicks "Place Order" and gets a 500 error, your availability SLI should reflect that. If they click "Place Order" and it takes 8 seconds, your latency SLI should reflect that. If their order goes through but charges the wrong amount, your correctness SLI should reflect that.

The four golden signals

Google's SRE book defined four signals every service should measure. They map directly to SLI categories:

| Signal | What it answers | PromQL example |
| --- | --- | --- |
| Latency | How fast? | `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` |
| Traffic | How much? | `sum(rate(http_requests_total[5m]))` |
| Errors | How often does it fail? | `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| Saturation | How full? | `container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"}` |

Remember: Mnemonic for the four golden signals: LETS (Latency, Errors, Traffic, Saturation). RED (Rate, Errors, Duration) is the microservice-focused subset. USE (Utilization, Saturation, Errors) is Brendan Gregg's method for infrastructure resources.

Excluding internal traffic from SLIs

This is a footgun that bites almost everyone on their first SLO implementation:

# WRONG — includes health checks and metrics scrapes
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# RIGHT — only user-facing traffic
sum(rate(http_requests_total{code!~"5..",handler!~"/healthz|/readyz|/metrics"}[5m]))
/
sum(rate(http_requests_total{handler!~"/healthz|/readyz|/metrics"}[5m]))

Health checks from Kubernetes always succeed (that's their job). Including them inflates your good-event count and makes your SLI look better than reality.
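A quick back-of-the-envelope shows how much always-green probe traffic distorts the number (all figures here are made up for illustration):

```python
# Illustrative numbers: 1,000 user requests/min with 2% failures,
# plus 200 probe/scrape requests/min that always succeed.
user_total, user_errors = 1000, 20
health_total = 200  # kubelet probes + /metrics scrapes, never fail

naive_sli = 1 - user_errors / (user_total + health_total)  # probes included
true_sli = 1 - user_errors / user_total                    # user traffic only

print(f"naive SLI: {naive_sli:.4f}")  # 0.9833 — looks better than reality
print(f"true SLI:  {true_sli:.4f}")   # 0.9800
```

At these numbers the inflation is a third of a percentage point — more than three entire 99.9% error budgets hidden by probe traffic.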


Part 3: The Math — Error Budgets

This is the part that sounds scary and turns out to be arithmetic.

The formula

Error budget = 1 - SLO target

That's it. For a 99.9% SLO, your error budget is 0.1%. Now let's make that concrete.

The nines table

There are 43,200 minutes in a 30-day month. Your error budget in minutes:

| SLO | Error budget | Minutes/month | Translation |
| --- | --- | --- | --- |
| 99% | 1% | 432 | ~7.2 hours. Generous. |
| 99.5% | 0.5% | 216 | ~3.6 hours. Comfortable. |
| 99.9% | 0.1% | 43.2 | Less than 45 minutes. This is what most teams target. |
| 99.95% | 0.05% | 21.6 | About 20 minutes. Getting tight. |
| 99.99% | 0.01% | 4.32 | 4 minutes and 19 seconds. No human can respond this fast. |
| 99.999% | 0.001% | 0.432 | 26 seconds. This requires full automation. |
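The table reduces to a one-line formula, which makes a handy sanity check when picking a target:

```python
MINUTES_PER_30D = 30 * 24 * 60  # 43,200

def budget_minutes(slo: float) -> float:
    """Error budget over a 30-day window, in minutes of full downtime."""
    return MINUTES_PER_30D * (1 - slo)

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%} -> {budget_minutes(slo):8.3f} min/month")
```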

Remember: "Each nine costs 10x more." Going from 99.9% to 99.99% doesn't add 0.09% reliability — it removes 90% of your error budget. The jump from three nines to four nines typically requires an order of magnitude more investment in redundancy, automation, and operational rigor.

Trivia: Most teams that claim 99.99% availability have not actually measured it correctly. When you account for client-side timeouts, DNS failures, and edge-case error codes that don't register as 5xx, the real number is often worse. Honest measurement is harder than ambitious targets.

Back to the mission

Let's do the math on our order-service situation:

SLO:                99.9%
Error budget:       0.1% = 43.2 minutes/month
Day of month:       12 (of 30)
Budget consumed:    28.7 minutes
Budget remaining:   43.2 - 28.7 = 14.5 minutes

What happened?
- Day 3: Deploy caused 12 minutes of elevated errors
- Day 7: Dependency timeout caused 8.2 minutes
- Day 8-12: Slow leak — 0.5% error rate intermittently (8.5 minutes total)

The first two incidents were acute — they burned budget fast but were fixed fast. The third one is the slow burn that's hard to notice and hard to stop.


Part 4: Burn Rate — How Fast Are You Spending?

Error budget remaining tells you where you are. Burn rate tells you where you're headed.

The formula

Burn rate = (current error rate) / (error budget rate)

Where:
  error budget rate = 1 - SLO target = 0.001 for a 99.9% SLO

Burn rate of 1.0 means you're consuming budget at exactly the rate that would exhaust it over the full 30-day window. You'll land right at zero — technically meeting SLO, but with no margin.

| Burn rate | What it means | Budget lasts | Action |
| --- | --- | --- | --- |
| 0.5x | Burning slowly, you'll have budget left at month-end | 60 days | No action needed |
| 1.0x | Exactly on pace to exhaust budget at month-end | 30 days | Monitor closely |
| 2.0x | Burning twice as fast as allowed | 15 days | Investigate |
| 6.0x | Burning 6x — budget gone in 5 days | 5 days | Immediate action |
| 14.4x | Budget gone in ~2 days | 2.08 days | Page. Wake someone up. |

Worked example

Our order-service has a 0.24% error rate right now. With a 99.9% SLO:

Burn rate = 0.0024 / 0.001 = 2.4x

At 2.4x, the budget burns at 2.4 × (43.2 / 30) = 3.456 minutes/day,
so the remaining 14.5 minutes of budget will last:
  14.5 / 3.456 ≈ 4.2 days

Day 12 + 4 days = Day 16
That leaves 14 days with zero budget.

This is the number that should make you sit up. Not "33.6% remaining" — that sounds manageable. But "zero budget by day 16 with a feature release on day 20" — that's a problem.
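The same arithmetic in Python, using the mission's numbers:

```python
SLO = 0.999
WINDOW_DAYS = 30
BUDGET_MIN = 43_200 * (1 - SLO)  # 43.2 minutes over 30 days

error_rate = 0.0024                    # current 5xx ratio
burn_rate = error_rate / (1 - SLO)     # 2.4x

remaining_min = 14.5
daily_burn_min = (BUDGET_MIN / WINDOW_DAYS) * burn_rate  # 3.456 min/day
days_left = remaining_min / daily_burn_min

print(f"burn rate {burn_rate:.1f}x, budget exhausted in {days_left:.1f} days")
# prints: burn rate 2.4x, budget exhausted in 4.2 days
```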

Mental Model: Think of burn rate like a speedometer on a road trip. The error budget is your fuel tank. "33% fuel remaining" matters a lot less than "you're doing 140 in a 60 zone." Burn rate tells you the speed; time-to-exhaustion tells you when you run out.

Flashcard Check #2

| Question | Answer (cover this column) |
| --- | --- |
| What burn rate means "budget lasts exactly 30 days"? | 1.0x — you're consuming at the exact rate that exhausts the budget over the SLO window. |
| If a 99.9% SLO has a burn rate of 10x, how long until budget exhaustion? | 30 days / 10 = 3 days. |
| Why is burn rate more useful than "budget remaining %"? | Budget remaining is a snapshot; burn rate tells you the trajectory. 50% remaining is fine on day 15, terrifying on day 2. |

Part 5: Multi-Window Burn Rate Alerting

This is where Google's SRE Workbook changed the industry. Before multi-window burn rate alerting, SLO alerts were either too noisy (fire on every blip) or too slow (miss real incidents).

The problem with naive alerting

# BAD — do not use this
- alert: HighErrorRate
  expr: |
    rate(http_requests_total{code=~"5.."}[5m])
    / rate(http_requests_total[5m]) > 0.001
  for: 5m

This fires whenever the 5-minute error rate exceeds the budget rate. Problems:

- A 30-second spike pages on-call for something that consumed 0.001% of the budget
- No urgency context — is this a 14x burn or a 1.1x burn?
- By the time on-call responds, the spike may have resolved

The Google approach: four alert tiers

The SRE Workbook (2018, Chapter 6) defines four alerting tiers. Each uses two windows — a short window to detect the severity and a long window to confirm it's real:

| Tier | Burn rate | Short window | Long window | Severity | Budget consumed before alert |
| --- | --- | --- | --- | --- | --- |
| 1 | 14.4x | 5m | 1h | Page (critical) | 2% in 1 hour |
| 2 | 6.0x | 30m | 6h | Page (critical) | 5% in 6 hours |
| 3 | 3.0x | 2h | 24h | Ticket (warning) | 10% in 24 hours |
| 4 | 1.0x | 6h | 72h | Ticket (warning) | 100% in 30 days |

Why two windows? The short window catches the spike. The long window confirms it's not a blip. Both must exceed the burn rate threshold before the alert fires.

Trivia: The specific burn rate numbers (14.4, 6, 3, 1) aren't arbitrary. They're derived from the percentage of error budget you're willing to consume before being alerted. 14.4x burn for 1 hour consumes 2% of a 30-day budget. 6x burn for 6 hours consumes 5%. The math: burn_rate = (budget_fraction * SLO_window_hours) / alert_window_hours. For tier 1: 14.4 = (0.02 * 720) / 1.
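The derivation is easy to check in code — a direct transcription of the formula above, nothing library-specific:

```python
SLO_WINDOW_HOURS = 30 * 24  # 720 hours in the 30-day SLO window

def burn_rate_for(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget
    within `alert_window_hours` (SRE Workbook derivation)."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours

for frac, window_h, tier in ((0.02, 1, 1), (0.05, 6, 2), (0.10, 24, 3)):
    print(f"tier {tier}: {burn_rate_for(frac, window_h):.1f}x")
# prints: tier 1: 14.4x / tier 2: 6.0x / tier 3: 3.0x
```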

What the error rates actually look like

For a 99.9% SLO, here's what each burn rate tier translates to in real error rates:

14.4x burn rate → 14.4 × 0.001 = 1.44% error rate
 6.0x burn rate →  6.0 × 0.001 = 0.60% error rate
 3.0x burn rate →  3.0 × 0.001 = 0.30% error rate
 1.0x burn rate →  1.0 × 0.001 = 0.10% error rate

A 1.44% error rate doesn't sound terrible — but at that rate, your entire monthly error budget is gone in 50 hours. That's why burn rate is a better signal than raw error rate.
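The two-window condition itself is just a pair of comparisons. A minimal sketch, assuming a 99.9% SLO and made-up error ratios:

```python
def should_alert(short_ratio: float, long_ratio: float,
                 burn_threshold: float, slo: float = 0.999) -> bool:
    """Fire only when BOTH windows exceed the burn-rate threshold:
    the short window proves it's happening now, the long window
    proves it isn't just a blip."""
    budget_rate = 1 - slo
    return (short_ratio > burn_threshold * budget_rate
            and long_ratio > burn_threshold * budget_rate)

# 30-second spike: 5m window hot, 1h window still calm -> no page
print(should_alert(short_ratio=0.03, long_ratio=0.0004, burn_threshold=14.4))  # False
# Sustained outage: both windows hot -> page
print(should_alert(short_ratio=0.03, long_ratio=0.02, burn_threshold=14.4))    # True
```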


Part 6: SLO Alerting in Prometheus — Real Rules

Let's implement multi-window burn rate alerting for order-service.

Step 1: Recording rules (pre-compute the ratios)

Recording rules compute the error ratios once and store them as new time series. This saves query-time computation and keeps alerting rules clean.

# prometheus-recording-rules.yaml
groups:
  - name: slo:order-service:availability
    interval: 30s
    rules:
      # 5-minute error ratio
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="order-service"}[5m]))
        labels:
          service: order-service
          slo: availability

      # 30-minute error ratio
      - record: slo:sli_error:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{job="order-service"}[30m]))
        labels:
          service: order-service
          slo: availability

      # 1-hour error ratio
      - record: slo:sli_error:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="order-service"}[1h]))
        labels:
          service: order-service
          slo: availability

      # 2-hour error ratio (used by the tier 3 alert)
      - record: slo:sli_error:ratio_rate2h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[2h]))
          /
          sum(rate(http_requests_total{job="order-service"}[2h]))
        labels:
          service: order-service
          slo: availability

      # 6-hour error ratio
      - record: slo:sli_error:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="order-service"}[6h]))
        labels:
          service: order-service
          slo: availability

      # 24-hour error ratio
      - record: slo:sli_error:ratio_rate24h
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[24h]))
          /
          sum(rate(http_requests_total{job="order-service"}[24h]))
        labels:
          service: order-service
          slo: availability

      # 72-hour (3-day) error ratio
      - record: slo:sli_error:ratio_rate3d
        expr: |
          sum(rate(http_requests_total{job="order-service",code=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="order-service"}[3d]))
        labels:
          service: order-service
          slo: availability

Step 2: Alerting rules (multi-window burn rate)

# prometheus-alerting-rules.yaml
groups:
  - name: slo:order-service:alerts
    rules:
      # Tier 1: Fast burn — page immediately
      # 14.4x burn: budget gone in ~2 days
      - alert: OrderServiceBudgetFastBurn
        expr: |
          slo:sli_error:ratio_rate5m{service="order-service",slo="availability"} > (14.4 * 0.001)
          and
          slo:sli_error:ratio_rate1h{service="order-service",slo="availability"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: "true"
          team: platform
        annotations:
          summary: "order-service burning error budget at 14.4x rate"
          description: |
            Current 5m error rate: {{ $value | humanizePercentage }}
            At this rate, 30-day budget exhausted in ~2 days.
          runbook_url: "https://runbooks.example.com/slo-budget-burn"

      # Tier 2: Moderate burn — page
      # 6x burn: budget gone in ~5 days
      - alert: OrderServiceBudgetModerateBurn
        expr: |
          slo:sli_error:ratio_rate30m{service="order-service",slo="availability"} > (6 * 0.001)
          and
          slo:sli_error:ratio_rate6h{service="order-service",slo="availability"} > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
          slo: "true"
          team: platform
        annotations:
          summary: "order-service burning error budget at 6x rate"
          description: |
            Current 30m error rate: {{ $value | humanizePercentage }}
            At this rate, 30-day budget exhausted in ~5 days.
          runbook_url: "https://runbooks.example.com/slo-budget-burn"

      # Tier 3: Slow burn — ticket
      # 3x burn: budget gone in ~10 days
      - alert: OrderServiceBudgetSlowBurn
        expr: |
          slo:sli_error:ratio_rate2h{service="order-service",slo="availability"} > (3 * 0.001)
          and
          slo:sli_error:ratio_rate24h{service="order-service",slo="availability"} > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: "true"
          team: platform
        annotations:
          summary: "order-service burning error budget at 3x rate"

      # Tier 4: Chronic burn — ticket
      # 1x burn: budget exactly meets SLO
      - alert: OrderServiceBudgetChronicBurn
        expr: |
          slo:sli_error:ratio_rate6h{service="order-service",slo="availability"} > (1 * 0.001)
          and
          slo:sli_error:ratio_rate3d{service="order-service",slo="availability"} > (1 * 0.001)
        for: 1h
        labels:
          severity: warning
          slo: "true"
          team: platform
        annotations:
          summary: "order-service chronically burning error budget"

Under the Hood: Why recording rules instead of inline expressions? Each alerting rule evaluation recalculates its PromQL expression. With 4 alert tiers, each checking a short and a long window, that's 7 distinct error ratios (5m, 30m, 1h, 2h, 6h, 24h, 3d) re-aggregated from raw metrics every evaluation cycle. Recording rules compute each window once per interval and store the result; the alert rules then read a cheap pre-computed series instead of re-aggregating. This matters at scale — 50 services with 2 SLOs each means 100 SLOs, 700 recording rules, and 400 alerting rules.

Step 3: Burn rate in PromQL (for dashboards)

# Current burn rate (1h window)
(
  sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="order-service"}[1h]))
) / 0.001

# Error budget remaining (0 = exhausted, 1 = full)
1 - (
  sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
  /
  sum(increase(http_requests_total{job="order-service"}[30d]))
) / 0.001

# Time to budget exhaustion at current burn rate (hours)
# = remaining budget fraction × 720 hours / burn rate
# (the burn-rate denominator must be parenthesized as a whole)
(
  1 - (
    sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
    / sum(increase(http_requests_total{job="order-service"}[30d]))
  ) / 0.001
) * 720
/
(
  (
    sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
    / sum(rate(http_requests_total{job="order-service"}[1h]))
  ) / 0.001
)

Gotcha: Using increase() over a 30-day window can produce inaccurate values if the metric series had gaps (restarts, scrape failures). Prometheus extrapolates across gaps, which can inflate or deflate the count. For production SLO dashboards, prefer recording rules that accumulate daily totals, or use Sloth/Pyrra which handle this correctly by computing over pre-aggregated 5-minute ratios.


Part 7: SLO-Based Alerting vs Threshold Alerting

This is the conceptual shift that makes SLOs worth the effort.

| | Threshold alerting | SLO-based alerting |
| --- | --- | --- |
| Fires when | Error rate > 5% for 5 minutes | Error budget burning at 14.4x for 1 hour |
| Context | "Errors are high right now" | "At this rate, your monthly budget is gone in 2 days" |
| False positives | High — fires on brief spikes | Low — two-window confirmation |
| Missed incidents | High — 0.5% sustained error rate flies under 5% threshold | Low — 1x chronic burn still triggers tier 4 alert |
| Actionability | Vague — "errors high, do something" | Specific — "budget exhausts in X hours, investigate or freeze deploys" |

War Story: A widely-reported SRE pattern: a team set a threshold alert at 1% error rate. Their normal error rate was 0.05%. A code change introduced a bug affecting 0.8% of requests — below the 1% threshold. The alert never fired. Over 3 weeks, the slow burn consumed the entire monthly error budget three times over. With burn rate alerting, the tier 4 alert (1x burn over 72 hours) would have caught it on day 3. This pattern — the "slow burn that threshold alerts miss" — is the single most common argument for switching to SLO-based alerting. Google's SRE Workbook (2018, Chapter 5) documents several variants of this failure mode.


Part 8: The Error Budget Policy — Making SLOs Matter

SLOs without consequences are dashboards nobody looks at. The error budget policy is what gives SLOs teeth.

A real error budget policy

Error Budget Policy — order-service

SLO: 99.9% availability over 30-day rolling window
Budget: 43.2 minutes/month

THRESHOLDS:

1. Budget > 50% remaining:
   - Normal development pace
   - Ship features at will
   - Standard deploy cadence

2. Budget 25-50% remaining:
   - Review recent deploys for reliability impact
   - Enable canary deployments for all changes
   - Daily check on budget burn rate

3. Budget < 25% remaining:
   - Prioritize reliability work over features
   - Reduced deploy frequency (max 1/day)
   - Mandatory rollback plan for every deploy
   - Escalate to engineering lead

4. Budget exhausted (0%):
   - Feature freeze — only reliability fixes ship
   - Mandatory postmortem for budget-depleting incidents
   - All engineering effort on reliability
   - Exception: emergency security patches (approved by VP Eng)

REVIEW:
   - Monthly: SLO review meeting (eng lead + product)
   - Quarterly: SLO target reassessment

Mental Model: Error budgets work like a credit card limit. When you're flush, spend freely — ship features, run experiments, try risky deploys. When you're maxed out, stop spending and pay down the debt — fix flaky tests, add retries, improve failover. The error budget policy is your overdraft protection. It triggers automatically and forces the hard conversation between product ("we need this feature") and engineering ("we need the service to stay up").
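As a sketch, the policy tiers reduce to a lookup on the remaining-budget fraction (the function name is made up; the actions paraphrase the policy above):

```python
def policy_tier(budget_remaining: float) -> str:
    """Map remaining error-budget fraction (0.0 to 1.0) to a policy tier."""
    if budget_remaining <= 0:
        return "exhausted: feature freeze, only reliability fixes ship"
    if budget_remaining < 0.25:
        return "low: reliability work first, max 1 deploy/day, escalate"
    if budget_remaining < 0.50:
        return "caution: canary all changes, daily burn-rate check"
    return "normal: ship features at will"

print(policy_tier(0.336))  # the mission scenario: 33.6% remaining
```

The point of encoding it is that the trigger is mechanical — nobody argues about which tier applies on a bad day.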

The Google SRE book origin

The error budget concept was formalized in Google's Site Reliability Engineering book (O'Reilly, 2016, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy). The book is available free at sre.google. The companion Site Reliability Workbook (2018) added practical implementation details including the multi-window burn rate alerting approach.

Trivia: The Google SRE book was the most downloaded free book in O'Reilly's publishing history. The concept that had the most industry impact wasn't SLOs themselves — it was the error budget as an objective mechanism for resolving the velocity-vs-reliability conflict. Before error budgets, developers and operators were in permanent political tension. After error budgets, the argument became: "Do we have budget? Ship. No budget? Fix." Data replaced politics.

Flashcard Check #3

| Question | Answer (cover this column) |
| --- | --- |
| What happens when the error budget is exhausted? | Feature freeze. Only reliability fixes ship until the budget recovers. |
| Why must an error budget policy exist before defining SLOs? | Without consequences, SLOs are just numbers on a dashboard. The policy makes them actionable. |
| Who published the SRE book that formalized error budgets? | Google (2016). Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy. Free at sre.google. |

Part 9: War Story — The SLO That Was Too Tight

War Story: A platform team set a 99.99% availability SLO for their internal API gateway — 4.3 minutes of allowed downtime per month. The gateway handled 50,000 req/sec, so even brief blips consumed the budget fast. Every routine deploy triggered a burn rate alert because the rolling restart caused 2-3 seconds of connection resets. The team stopped deploying to avoid burning budget. After 6 weeks without deploys, a security patch sat undeployed for 11 days. The fix: they lowered the SLO to 99.9% (43 minutes/month), which gave room for graceful deploys. Deploy frequency went from "almost never" to twice a week. Counterintuitively, the service became more reliable — regular deploys meant smaller changes, faster rollbacks, and no more security patch backlogs.

Lesson: an SLO so tight that it prevents deploys makes the system less reliable over time. The SLO should be "as low as users will tolerate" — not "as high as we can imagine."

This is the tension at the heart of SRE: reliability and velocity are not opposites, but an SLO that's too tight makes them adversaries. The right SLO creates space for both.


Part 10: SLO Documents and Review Cadence

SLOs are not set-and-forget. They need a document and a review cycle.

What goes in an SLO document

# SLO Document: order-service

## Service description
Handles customer order placement, payment processing, and order status queries.
~50,000 requests/minute during peak hours.

## SLIs
1. Availability: % of non-5xx responses (excluding /healthz, /readyz, /metrics)
2. Latency: % of requests completing in < 500ms (p99)

## SLOs
1. Availability: 99.9% over 30-day rolling window
2. Latency: 99.0% of requests < 500ms over 30-day rolling window

## Error budget
1. Availability: 0.1% = 43.2 minutes/month
2. Latency: 1.0% = 432 minutes/month (latency budget is more generous)

## Dependencies
- payment-gateway (external, 99.5% SLA from vendor)
- inventory-service (internal, 99.9% SLO)
- postgres-primary (internal, 99.95% SLO)

## Error budget policy
[Link to policy document]

## Dashboard
[Link to Grafana SLO dashboard]

## Last review: 2026-02-15
## Next review: 2026-05-15

Review cadence

| Review | Frequency | Who | Questions |
| --- | --- | --- | --- |
| Budget check | Daily (automated) | Dashboard/Slack bot | "Is burn rate normal?" |
| SLO meeting | Monthly | Eng lead + product | "Did we meet SLO? What consumed budget?" |
| SLO reassessment | Quarterly | Eng lead + product + SRE | "Is the target still right? Too tight? Too loose?" |
| Full SLO audit | Annually | SRE + architecture | "Are we measuring the right things? Any new SLIs needed?" |

Gotcha: The quarterly reassessment is where teams discover their SLO is wrong. Common signs: SLO is never breached (too loose — tighten it or the budget policy never activates and SLOs become irrelevant). SLO is breached every month (too tight — the team ignores it because it's always red). Either extreme makes SLOs useless.


Part 11: Resolving the Mission

Back to day 12. Budget at 33.6%, burn rate at 2.4x.

Here's the decision framework:

Step 1: What's the burn rate?
  → 2.4x. Not critical (that's < 6x), but not sustainable.

Step 2: What's the time-to-exhaustion?
  → 14.5 minutes / (2.4 × 43.2 / 30) = 14.5 / 3.456 ≈ 4.2 days. Budget exhausted by day 16.

Step 3: What's consuming budget?
  → Dashboard shows: intermittent 503s from the payment-gateway dependency.
  → Not our code — but it IS our SLI.

Step 4: Consult the error budget policy.
  → 33.6% remaining = "Budget 25-50% remaining" tier.
  → Actions: enable canary deploys, daily burn rate checks, review recent changes.

Step 5: Address the root cause.
  → payment-gateway is rate-limiting us during peak hours.
  → Short-term: add retry with exponential backoff.
  → Long-term: discuss rate limit increase with vendor, add circuit breaker.

Step 6: Reassess the feature release.
  → The release is on day 20. If the fix reduces burn rate to < 1x by day 14,
     we'll have enough budget to absorb the deploy risk.
  → If burn rate stays at 2.4x, we defer the release per the budget policy.

This is what SLOs look like in practice. Not a dashboard you glance at — a decision framework that tells you what to do and when.


Exercises

Exercise 1: Calculate the budget (2 minutes)

Your service has a 99.5% SLO over a 30-day window. How many minutes of downtime equivalent does your error budget allow?

Solution
Error budget = 1 - 0.995 = 0.5%
Minutes = 43,200 × 0.005 = 216 minutes = 3.6 hours

Exercise 2: Interpret a burn rate (5 minutes)

Your 99.9% SLO shows a current burn rate of 4.5x. It's day 20 of 30, and you've consumed 60% of your budget.

  1. How many minutes of budget remain?
  2. At 4.5x burn, how long until budget exhaustion?
  3. What tier of alert should this trigger?
Solution
1. Budget remaining: 43.2 × 0.40 = 17.28 minutes
2. Daily budget consumption at 4.5x: 43.2 / 30 × 4.5 = 6.48 minutes/day
   Time to exhaustion: 17.28 / 6.48 = 2.67 days
3. 4.5x is between 3x (tier 3, warning) and 6x (tier 2, critical).
   This triggers the tier 3 alert (slow burn, ticket).
   But with only 2.67 days until exhaustion and 10 days left in the month,
   the error budget policy likely calls for prioritizing reliability work.

Exercise 3: Write a recording rule (10 minutes)

Write a Prometheus recording rule that computes the 1-hour error ratio for a service called checkout-service, counting HTTP 5xx responses as errors. Exclude the /health endpoint.

Solution
groups:
  - name: slo:checkout-service:availability
    rules:
      - record: slo:sli_error:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout-service",code=~"5..",handler!="/health"}[1h]))
          /
          sum(rate(http_requests_total{job="checkout-service",handler!="/health"}[1h]))
        labels:
          service: checkout-service
          slo: availability
Key points:

- The `handler!="/health"` exclusion appears in **both** numerator and denominator
- Labels identify the service and SLO for alerting rule consumption
- The naming convention `slo:sli_error:ratio_rate1h` follows the Prometheus recommended format: `level:metric:operations`

Exercise 4: Design an error budget policy (15 minutes)

You're launching a new service. Historical data shows baseline availability of 98.8%. Draft: (a) an appropriate initial SLO, (b) a three-tier error budget policy with concrete actions at each tier.

Hints:

- Start at or below the historical baseline
- The policy should be achievable *today*, not aspirational
- Each tier needs specific actions, not vague guidance
Solution sketch
SLO: 98.5% (below the 98.8% baseline — gives room to breathe)
Error budget: 1.5% = 648 minutes/month (~10.8 hours)

Tier 1 (> 50% budget remaining): Normal operations. Ship at will.
Tier 2 (25-50% remaining): Review error sources. Enable canary deploys.
  All deploys require rollback plan. Weekly budget review meeting.
Tier 3 (< 25% remaining): Feature freeze. Reliability sprint.
  Mandatory postmortem for any incident consuming > 5% of budget.

Improvement path: Once you consistently meet 98.5% with >40% budget
remaining, raise to 99.0%. Repeat until you reach the target your
users actually need.

Cheat Sheet

| Concept | Formula / Value |
| --- | --- |
| Error budget | 1 - SLO target |
| Budget in minutes (30d) | 43,200 × (1 - SLO) |
| Burn rate | current_error_rate / (1 - SLO) |
| Time to exhaustion (from full budget) | SLO_window / burn_rate |
| 99% | 432 min/month (7.2 hours) |
| 99.9% | 43.2 min/month |
| 99.99% | 4.32 min/month |
| 99.999% | 26 seconds/month |

Burn rate alert tiers (99.9% SLO, 30-day window):

| Tier | Burn rate | Error rate | Windows | Severity |
| --- | --- | --- | --- | --- |
| 1 | 14.4x | 1.44% | 5m + 1h | Critical (page) |
| 2 | 6.0x | 0.60% | 30m + 6h | Critical (page) |
| 3 | 3.0x | 0.30% | 2h + 24h | Warning (ticket) |
| 4 | 1.0x | 0.10% | 6h + 3d | Warning (ticket) |

PromQL patterns:

| What | Query |
| --- | --- |
| Error ratio | `sum(rate(errors[W])) / sum(rate(total[W]))` |
| Burn rate | `error_ratio / (1 - SLO)` |
| Budget remaining | `1 - (sum(increase(errors[30d])) / sum(increase(total[30d]))) / (1 - SLO)` |

Tools that generate SLO rules automatically:

| Tool | What it does |
| --- | --- |
| Sloth | Generates Prometheus recording + alerting rules from a YAML SLO spec |
| Pyrra | Kubernetes-native SLO CRDs with built-in error budget UI |
| OpenSLO | Vendor-neutral SLO spec (converts to Sloth/Pyrra format) |

Takeaways

  1. SLIs measure user experience, SLOs set the target, SLAs are the contract. Never confuse them. Never set SLO equal to SLA.

  2. 99.9% sounds like a small number but it's only 43 minutes a month. Each additional nine costs 10x more and cuts your budget by 90%.

  3. Burn rate, not error rate. A 0.5% error rate sounds harmless; at 5x burn rate it exhausts your monthly budget in 6 days.

  4. Two windows catch what one window misses. Multi-window burn rate alerting eliminates false positives from spikes and false negatives from slow burns.

  5. SLOs without an error budget policy are just dashboards. The policy is what makes SLOs drive actual decisions — feature freezes, reliability sprints, deploy cadence changes.

  6. Start loose, tighten with data. Set your initial SLO at or below your historical baseline. A 99% SLO you consistently meet teaches you more than a 99.99% SLO you constantly breach.