Portal | Level: L2: Operations | Topics: SLO Tooling | Domain: Observability
SLO Tooling — Primer¶
Why This Matters¶
Writing SLO alerting rules by hand is error-prone and hard to reason about. The math behind multi-window multi-burn-rate alerting — the approach Google describes in the SRE Workbook — involves at least 4 alert rules per SLO, each with different time windows and burn rate thresholds. Get the math wrong and you either miss incidents (false negatives) or page engineers for non-issues (false positives).
Sloth, Pyrra, and the OpenSLO spec exist to solve this. They take a high-level SLO definition ("99.9% of requests should succeed") and generate the correct Prometheus alerting rules, recording rules, and dashboards automatically. You write the objective once; the tooling generates everything that implements it correctly.
This matters operationally because teams that hand-roll SLO rules consistently make the same mistakes: using absolute thresholds instead of burn rates, ignoring the page budget, alerting on the wrong window. This primer covers the math, the tools, and how to wire them into your stack.
Core Concepts¶
1. SLI / SLO / Error Budget Recap¶
SLI (Service Level Indicator): A metric that measures service behavior from the user's perspective.
# SLI: ratio of successful HTTP requests
SLI = good_events / total_events
= http_requests_total{code!~"5.."} / http_requests_total
# SLI: ratio of requests below latency threshold
SLI = http_request_duration_seconds_bucket{le="0.5"} / http_request_duration_seconds_count
SLO (Service Level Objective): A target value for the SLI over a time window.
Error Budget: The allowed amount of badness within the SLO window.
# For a 99.9% SLO:
Error budget = 1 - SLO target = 1 - 0.999 = 0.001 = 0.1%
Over 30 days (43,200 minutes):
Allowed bad minutes = 43,200 × 0.001 = 43.2 minutes
Error budget is consumed when the service performs below SLO. It is the team's license to take risks (deploy, experiment) and the signal that the service needs reliability work.
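The budget arithmetic above is easy to script. A minimal sketch in plain Python (the function name is illustrative, not from any SLO library):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed bad minutes in the window for a given SLO target (e.g. 0.999)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)

# 99.9% over 30 days -> 43.2 minutes of allowed badness
print(round(error_budget_minutes(0.999), 1))   # 43.2
# 99.99% over 30 days -> only 4.3 minutes
print(round(error_budget_minutes(0.9999), 1))  # 4.3
```

Note how one extra nine shrinks the budget tenfold; this is why tightening an SLO target is an operational decision, not a cosmetic one.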
2. Why Hand-Writing SLO Rules Is Error-Prone¶
A naive SLO alert:
# BAD — do not use this
- alert: HighErrorRate
expr: rate(http_requests_total{code=~"5.."}[5m]) /
rate(http_requests_total[5m]) > 0.001
for: 5m
Problems with this approach:

- No burn rate context: a 0.1% error rate for 5 minutes consumes a tiny fraction of the budget. This fires for benign blips.
- No budget awareness: you cannot tell whether this alert signals 1% of the budget consumed or 100%.
- Paging without urgency: on-call gets paged for something that would take weeks to exhaust the budget.
- Missing recovery: no information about how fast the budget is recovering.
The correct approach uses multi-window multi-burn-rate alerting.
3. Multi-Window Multi-Burn-Rate Alerting (Google's Approach)¶
The Google SRE Workbook defines four alert rules per SLO:
| Alert | Burn Rate | Long Window | Short Window | Severity |
|---|---|---|---|---|
| Page | 14.4x | 1h | 5m | Critical |
| Page | 6x | 6h | 30m | Critical |
| Ticket | 3x | 24h | 2h | Warning |
| Ticket | 1x | 72h | 6h | Warning |
Burn rate is how fast the error budget is being consumed relative to the SLO window:
Burn rate = error_rate / (1 - SLO_target)
# For 99.9% SLO:
14.4x burn rate = 14.4 × 0.001 = 1.44% error rate
6.0x burn rate = 6.0 × 0.001 = 0.60% error rate
3.0x burn rate = 3.0 × 0.001 = 0.30% error rate
1.0x burn rate = 1.0 × 0.001 = 0.10% error rate
# At 14.4x burn rate, 30-day budget exhausted in:
30 days / 14.4 = ~2.1 days
The two-window requirement means both windows must exceed the threshold before the alert fires: the long window confirms the burn is sustained rather than a brief spike, and the short window confirms it is still happening, so the alert resolves quickly once the error rate recovers.
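The threshold arithmetic is mechanical and worth sanity-checking before trusting any generated rules. A small Python sketch (function names are illustrative):

```python
def error_rate_threshold(burn_rate: float, slo_target: float) -> float:
    """Error rate at which the budget burns at the given multiple of the allowed rate."""
    return burn_rate * (1 - slo_target)

def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the full window's budget is gone at a sustained burn rate."""
    return window_days / burn_rate

# Reproduce the table for a 99.9% SLO over 30 days
for burn in (14.4, 6.0, 3.0, 1.0):
    print(f"{burn:>5}x -> {error_rate_threshold(burn, 0.999):.4%} error rate, "
          f"budget gone in {days_to_exhaustion(burn):.1f} days")
```

At 1x burn you exhaust the budget in exactly the SLO window (30 days), which is the definition of "on target"; anything sustained above 1x means the SLO will be missed.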
Manually written multi-burn-rate rules:
# This is correct — but writing it for every SLO is tedious and error-prone.
# This is why you use Sloth or Pyrra.
groups:
  - name: slo.order-service.availability
    rules:
      # Fast burn — page
      - alert: OrderServiceAvailabilityBudgetFastBurn
        expr: |
          (
            rate(http_requests_total{job="order-service",code=~"5.."}[5m])
            /
            rate(http_requests_total{job="order-service"}[5m])
          ) > (14.4 * 0.001)
          and
          (
            rate(http_requests_total{job="order-service",code=~"5.."}[1h])
            /
            rate(http_requests_total{job="order-service"}[1h])
          ) > (14.4 * 0.001)
        labels:
          severity: critical
          slo: order-service-availability
        annotations:
          summary: "Order service is burning error budget at 14.4x rate"
          description: >
            At a sustained 14.4x burn rate, the 30-day error budget is
            exhausted in roughly 50 hours (30 days / 14.4).
4. Sloth — PrometheusServiceLevel CRD¶
Who made it: Sloth was created by Xabier Larrakoetxea (slok) at Spotahome. The name follows the SRE ethos — a sloth is slow and deliberate, which is exactly how you want your error budget burn. If you are burning budget fast, something is wrong.
Sloth takes a YAML SLO spec and generates the correct Prometheus rules — recording rules for intermediate calculations and alerting rules for all four burn rate windows.
Sloth SLO spec:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: order-service
  namespace: monitoring
spec:
  service: "order-service"
  labels:
    team: platform
    env: production
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "99.9% of API requests complete successfully"
      sli:
        events:
          error_query: sum(rate(http_requests_total{job="order-service",code=~"5.."}[{{.window}}]))
          total_query: sum(rate(http_requests_total{job="order-service"}[{{.window}}]))
      alerting:
        name: OrderServiceHighErrorRate
        labels:
          team: platform
          channel: pagerduty
        annotations:
          runbook: "https://runbooks.example.com/order-service-availability"
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning
    - name: "requests-latency"
      objective: 99.0
      description: "99% of requests complete in under 500ms"
      sli:
        events:
          # error_query must count BAD events: total requests minus fast requests
          error_query: |
            sum(rate(http_request_duration_seconds_count{job="order-service"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{job="order-service",le="0.5"}[{{.window}}]))
          total_query: |
            sum(rate(http_request_duration_seconds_count{job="order-service"}[{{.window}}]))
      alerting:
        name: OrderServiceHighLatency
        labels:
          team: platform
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning
Generate rules from Sloth:
# CLI — generate Prometheus rules YAML from SLO spec
sloth generate -i order-service-slo.yaml -o order-service-rules.yaml
# Validate spec first
sloth validate -i order-service-slo.yaml
# As Kubernetes controller — apply the CRD and Sloth reconciles rules automatically
kubectl apply -f order-service-slo.yaml
# Sloth controller generates PrometheusRule resources automatically
# Inspect generated rules
kubectl get prometheusrule order-service -n monitoring -o yaml
What Sloth generates for a 99.9% SLO:
# Recording rules (intermediate values for efficiency)
- record: slo:sli_error:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{job="order-service",code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="order-service"}[5m]))
- record: slo:sli_error:ratio_rate30m
  expr: <same for 30m window>
- record: slo:sli_error:ratio_rate1h
  expr: <same for 1h window>
# ... recording rules for 2h, 6h, 24h, 72h windows

# Alerting rules (4 rules using pre-computed recording rules)
- alert: SLOBudgetFastBurn
  expr: |
    slo:sli_error:ratio_rate5m{...} > (14.4 * (1 - 0.999))
    and
    slo:sli_error:ratio_rate1h{...} > (14.4 * (1 - 0.999))
  labels: {severity: critical, ...}
- alert: SLOBudgetFastBurn
  expr: |
    slo:sli_error:ratio_rate30m{...} > (6 * (1 - 0.999))
    and
    slo:sli_error:ratio_rate6h{...} > (6 * (1 - 0.999))
  labels: {severity: critical, ...}
# ... two more ticket-level alerts
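Whichever generator you use, validate the emitted rules file before loading it into Prometheus. A sketch using promtool (the file name comes from the earlier sloth generate example; the Prometheus URL is illustrative):

```shell
# Syntax-check the generated recording and alerting rules
promtool check rules order-service-rules.yaml

# Against a running Prometheus, confirm the recording rules produce data
promtool query instant http://localhost:9090 'slo:sli_error:ratio_rate5m'
```

Running `promtool check rules` in CI catches a malformed generated file before it silently breaks alerting.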
5. Pyrra — SLO CRD with UI¶
Pyrra is another open-source SLO tool with two notable differences from Sloth: it ships a built-in UI showing error budget status, and you define SLOs through its own ServiceLevelObjective CRD, which Pyrra reconciles into PrometheusRule resources.
Pyrra SLO definition:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: order-service-availability
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  target: "99.9"
  window: 4w  # 28-day window
  description: "Percentage of successful HTTP requests to the order service"
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="order-service",code=~"5.."}
      total:
        metric: http_requests_total{job="order-service"}
      grouping:
        - handler
Deploying Pyrra:
# Install Pyrra operator
kubectl apply -f https://github.com/pyrra-dev/pyrra/releases/latest/download/config-operator.yaml
# Apply SLO CRDs
kubectl apply -f order-service-slo.yaml
# Pyrra reconciles to PrometheusRule resources
kubectl get prometheusrule -n monitoring | grep order-service
# Access Pyrra UI (shows error budget status per SLO)
kubectl port-forward svc/pyrra-api 9099:9099 -n monitoring
# Open http://localhost:9099
Pyrra's UI shows:

- Current error budget remaining (as a percentage and as absolute time)
- Burn rate over different windows
- Historical error budget consumption
6. OpenSLO Spec¶
Fun fact: OpenSLO was initiated by Nobl9 in 2021 and contributed to a community governance model. It borrows its structure from the Kubernetes API convention (apiVersion, kind, metadata, spec) to feel familiar to platform engineers, even though OpenSLO resources are not Kubernetes CRDs by default.
OpenSLO is a vendor-neutral specification for SLO definitions. It decouples the SLO definition from any specific tooling:
# openslo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: order-service-availability
  displayName: "Order Service Availability"
spec:
  service: order-service
  description: "Ratio of successful requests to total requests"
  timeWindow:
    - duration: 4w
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: "Good requests"
      target: 0.999  # 99.9%
  indicator:
    metadata:
      name: order-service-request-success
    spec:
      ratioMetric:
        counter: true
        good:
          metricSource:
            type: Prometheus
            spec:
              query: http_requests_total{job="order-service",code!~"5.."}
        total:
          metricSource:
            type: Prometheus
            spec:
              query: http_requests_total{job="order-service"}
  alertPolicies:
    - kind: AlertPolicy
      metadata:
        name: order-service-alerts
      spec:
        conditions:
          - kind: AlertCondition
            metadata:
              name: BudgetBurn
            spec:
              condition:
                kind: BurnRate
                burnRateThreshold: 2
                lookbackWindow: 1h
                alertAfter: 5m
Sloth can consume OpenSLO specs directly (support is partial; check the Sloth documentation for which OpenSLO features are covered):
# Generate Prometheus rules from an OpenSLO spec
sloth generate -i openslo.yaml -o rules.yaml
7. Error Budget Math and Burn Rate Formulas¶
# Core formulas
Error budget (%) = 100 - SLO_target_percent
Error budget (minutes/30d) = 43200 × (1 - SLO_decimal)
# Burn rate → time to exhaustion
Time to exhaustion = SLO_window / burn_rate
# At what error rate is the budget exhausted at exactly 1x burn rate?
Budget exhaustion rate = 1 - SLO_target
# What error rate corresponds to N× burn rate?
Error rate at Nx = N × (1 - SLO_target)
# How much budget does an incident consume?
Budget consumed = incident_error_rate × incident_duration / (SLO_window × (1 - SLO_target))
# Examples for 99.9% SLO (30-day window):
# 14.4x burn: 1.44% error rate → budget exhausted in 2.08 days
# 6x burn: 0.6% error rate → budget exhausted in 5 days
# 3x burn: 0.3% error rate → budget exhausted in 10 days
# 1x burn: 0.1% error rate → budget exhausted in 30 days (exactly on target)
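The incident-consumption formula above is the one to reach for in postmortems. A Python sketch (function name is illustrative):

```python
def budget_consumed(error_rate: float, duration_min: float,
                    slo_target: float = 0.999, window_days: int = 30) -> float:
    """Fraction of the error budget consumed by an incident.

    error_rate:   fraction of requests failing during the incident (e.g. 0.05)
    duration_min: incident length in minutes
    """
    window_min = window_days * 24 * 60
    return (error_rate * duration_min) / (window_min * (1 - slo_target))

# A 30-minute incident at 5% errors against a 99.9% SLO:
# bad minutes = 0.05 * 30 = 1.5; budget = 43.2 minutes -> ~3.5% consumed
print(f"{budget_consumed(0.05, 30):.1%}")
```

The same function shows why full outages are so expensive: a total outage (error_rate = 1.0) lasting 43.2 minutes burns the entire 30-day budget of a 99.9% SLO.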
8. SLO Dashboards in Grafana¶
Sloth generates dashboard annotations, but you need to configure the Grafana dashboard manually or use a pre-built one:
// Grafana dashboard panel — error budget remaining
{
  "type": "stat",
  "title": "Error Budget Remaining",
  "targets": [{
    "expr": "1 - (sum(increase(http_requests_total{job=\"order-service\",code=~\"5..\"}[30d])) / sum(increase(http_requests_total{job=\"order-service\"}[30d]))) / (1 - 0.999)"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 0.25},
          {"color": "green", "value": 0.50}
        ]
      }
    }
  }
}
PromQL for SLO dashboard panels:
# Current error rate (1h window)
sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="order-service"}[1h]))
# Error budget consumed this month (0-1 scale, 0 = full budget, 1 = exhausted)
(
sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
/
sum(increase(http_requests_total{job="order-service"}[30d]))
) / (1 - 0.999)
# Burn rate (how fast budget is burning right now, 1x = on target)
(
sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="order-service"}[1h]))
) / (1 - 0.999)
# Time to budget exhaustion at current burn rate (in hours)
(
(1 - (
sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
/
sum(increase(http_requests_total{job="order-service"}[30d]))
) / (1 - 0.999))
*
(30 * 24) # hours in 30 days
)
/
(
(
sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="order-service"}[1h]))
) / (1 - 0.999)
)
9. Alertmanager Integration¶
Route SLO alerts differently based on severity and burn rate:
# alertmanager.yaml
route:
  receiver: default
  routes:
    # Critical burn rate — page immediately
    # (match any alert carrying an slo label; the generated rules set
    #  slo to the SLO name, e.g. order-service-availability)
    - matchers:
        - severity = critical
        - slo =~ ".+"
      receiver: pagerduty-critical
      group_wait: 0s
      group_interval: 5m
      repeat_interval: 1h
    # Warning burn rate — create ticket
    - matchers:
        - severity = warning
        - slo =~ ".+"
      receiver: slack-slo-warnings
      group_wait: 10m
      group_interval: 1h
      repeat_interval: 8h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        description: '{{ template "pagerduty.description" . }}'
        details:
          slo: '{{ index .GroupLabels "slo" }}'
          burn_rate: '{{ index .CommonAnnotations "burn_rate" }}'
          runbook: '{{ index .CommonAnnotations "runbook" }}'
  - name: slack-slo-warnings
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#slo-alerts"
        text: |
          SLO Warning: {{ .GroupLabels.alertname }}
          Service: {{ .GroupLabels.service }}
          Budget remaining: see Pyrra dashboard
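You can check which receiver a given alert would reach without firing anything, using amtool against this config (the label values below are illustrative):

```shell
# Show the routing tree
amtool config routes show --config.file=alertmanager.yaml

# Test which receiver a critical SLO alert would hit
amtool config routes test --config.file=alertmanager.yaml \
  severity=critical slo=true
# prints the name of the matching receiver
```

Running a routes test like this in CI guards against a refactor of the routing tree silently sending pages to the default receiver.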
10. Nobl9 and Commercial SLO Platforms¶
For teams that want a managed SLO platform, Nobl9 provides:

- SLO definitions via their CRD or web UI
- Native integrations with Prometheus, Datadog, New Relic, Splunk, and Dynatrace
- Error budget tracking across multiple data sources
- SLO-based alerting policies
Nobl9 SLO definition (YAML API):
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: order-service-availability
  project: production
spec:
  service: order-service
  indicator:
    metricSource:
      name: prometheus-prod
      type: Prometheus
      project: production
    rawMetric:
      query: |
        rate(http_requests_total{job="order-service",code!~"5.."}[{{.Resolution}}])
        /
        rate(http_requests_total{job="order-service"}[{{.Resolution}}])
  timeWindows:
    - unit: Day
      count: 30
      isRolling: true
  objectives:
    - displayName: "Good"
      target: 0.999
      value: 1
      op: gte
  alertPolicies:
    - burnRateCondition:
        burnRate: 14.4
        exhaustionTime: 2h
Quick Reference¶
| Task | Command |
|---|---|
| Validate Sloth spec | sloth validate -i slo.yaml |
| Generate rules (CLI) | sloth generate -i slo.yaml -o rules.yaml |
| Apply Sloth CRD | kubectl apply -f slo.yaml |
| List Pyrra SLOs | kubectl get slo -n monitoring |
| Current burn rate | See PromQL above |
| Error budget remaining PromQL | 1 - (sum(increase(errors[30d]))/sum(increase(total[30d]))) / (1 - slo_target) |
| Install Sloth operator | helm repo add sloth https://slok.github.io/sloth && helm install sloth sloth/sloth -n monitoring |
| Install Pyrra operator | kubectl apply -f https://github.com/pyrra-dev/pyrra/releases/latest/download/config-operator.yaml |
| Burn rate at 14.4× | 14.4 × (1 - SLO) error rate |
Wiki Navigation¶
Prerequisites¶
- Observability Deep Dive (Topic Pack, L2)
- Postmortems & SLOs (Topic Pack, L2)