Portal | Level: L2: Operations | Topics: SLO Tooling | Domain: Observability

SLO Tooling — Primer

Why This Matters

Writing SLO alerting rules by hand is error-prone and hard to reason about. The math behind multi-window multi-burn-rate alerting — the approach Google describes in the SRE Workbook — involves at least 4 alert rules per SLO, each with different time windows and burn rate thresholds. Get the math wrong and you either miss incidents (false negatives) or page engineers for non-issues (false positives).

Sloth, Pyrra, and the OpenSLO spec exist to solve this. They take a high-level SLO definition ("99.9% of requests should succeed") and generate the correct Prometheus alerting rules, recording rules, and dashboards automatically. You write the objective once; the tooling generates everything that implements it correctly.

This matters operationally because teams that hand-roll SLO rules consistently make the same mistakes: using absolute thresholds instead of burn rates, ignoring the page budget, alerting on the wrong window. This primer covers the math, the tools, and how to wire them into your stack.


Core Concepts

1. SLI / SLO / Error Budget Recap

SLI (Service Level Indicator): A metric that measures service behavior from the user's perspective.

# SLI: ratio of successful HTTP requests
SLI = good_events / total_events
    = http_requests_total{code!~"5.."} / http_requests_total

# SLI: ratio of requests below latency threshold
SLI = http_request_duration_seconds_bucket{le="0.5"} / http_request_duration_seconds_count

SLO (Service Level Objective): A target value for the SLI over a time window.

SLO: 99.9% of requests return 2xx or 3xx over a rolling 30-day window

Error Budget: The allowed amount of badness within the SLO window.

Error budget = 1 - SLO target = 0.1%
Over 30 days (43,200 minutes):
  Allowed bad minutes = 43,200 × 0.001 = 43.2 minutes

Error budget is consumed when the service performs below SLO. It is the team's license to take risks (deploy, experiment) and the signal that the service needs reliability work.
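
The budget arithmetic above is easy to script. A minimal sketch in plain Python (illustrative function name, not part of any tool):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed bad minutes in the window for a given SLO target (e.g. 0.999)."""
    window_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return window_minutes * (1 - slo_target)

# 99.9% over 30 days allows 43.2 bad minutes
print(round(error_budget_minutes(0.999), 1))   # → 43.2
print(round(error_budget_minutes(0.99), 1))    # → 432.0
print(round(error_budget_minutes(0.9999), 2))  # → 4.32
```

Note how each additional "nine" divides the budget by ten: tightening from 99.9% to 99.99% leaves only about 4.3 minutes of slack per month.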

2. Why Hand-Writing SLO Rules Is Error-Prone

A naive SLO alert:

# BAD — do not use this
- alert: HighErrorRate
  expr: rate(http_requests_total{code=~"5.."}[5m]) /
        rate(http_requests_total[5m]) > 0.001
  for: 5m

Problems with this approach:

- No burn rate context: a 0.1% error rate sustained for 5 minutes consumes a tiny fraction of the budget. This fires for benign blips.
- No budget awareness: you cannot tell whether this alert signals 1% of the budget consumed or 100%.
- Paging without urgency: on-call gets paged for something that would take weeks to exhaust the budget.
- Missing recovery: no information about how fast the budget is recovering.

The correct approach uses multi-window multi-burn-rate alerting.

3. Multi-Window Multi-Burn-Rate Alerting (Google's Approach)

The Google SRE Workbook defines four alert rules per SLO:

Alert   Burn Rate   Long Window   Short Window   Severity
Page    14.4x       1h            5m             Critical
Page    6x          6h            30m            Critical
Ticket  3x          24h           2h             Warning
Ticket  1x          72h           6h             Warning

Burn rate is how fast the error budget is being consumed relative to the SLO window:

Burn rate = error_rate / (1 - SLO_target)

# For 99.9% SLO:
14.4x burn rate = 14.4 × 0.001 = 1.44% error rate
 6.0x burn rate =  6.0 × 0.001 = 0.60% error rate
 3.0x burn rate =  3.0 × 0.001 = 0.30% error rate
 1.0x burn rate =  1.0 × 0.001 = 0.10% error rate

# At 14.4x burn rate, 30-day budget exhausted in:
30 days / 14.4 = ~2.1 days

The two-window requirement (short + long) prevents false positives from brief spikes. Both windows must exceed the threshold before alerting.
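
The two-window logic can be sketched as a simple predicate (illustrative Python; real deployments let Sloth or Pyrra generate the equivalent PromQL):

```python
def should_alert(short_window_error_rate: float,
                 long_window_error_rate: float,
                 burn_rate: float,
                 slo_target: float) -> bool:
    """Fire only when BOTH windows exceed the burn-rate threshold.

    The long window proves the burn is sustained; the short window
    makes the alert resolve quickly once the problem stops.
    """
    threshold = burn_rate * (1 - slo_target)
    return (short_window_error_rate > threshold and
            long_window_error_rate > threshold)

# 99.9% SLO, 14.4x page threshold = 1.44% error rate
print(should_alert(0.05, 0.002, 14.4, 0.999))  # brief spike, long window quiet → False
print(should_alert(0.05, 0.02, 14.4, 0.999))   # sustained burn → True
```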

Manually written multi-burn-rate rules:

# This is correct — but writing it for every SLO is tedious and error-prone
# This is why you use Sloth or Pyrra

groups:
  - name: slo.order-service.availability
    rules:
      # Fast burn — page
      - alert: OrderServiceAvailabilityBudgetFastBurn
        expr: |
          (
            rate(http_requests_total{job="order-service",code=~"5.."}[5m])
            /
            rate(http_requests_total{job="order-service"}[5m])
          ) > (14.4 * 0.001)
          and
          (
            rate(http_requests_total{job="order-service",code=~"5.."}[1h])
            /
            rate(http_requests_total{job="order-service"}[1h])
          ) > (14.4 * 0.001)
        labels:
          severity: critical
          slo: order-service-availability
        annotations:
          summary: "Order service is burning error budget at 14.4x rate"
          description: >
            Current error rate will exhaust the 30-day error budget in
            roughly 50 hours (30 days / 14.4). Note: Prometheus alert
            templates cannot do this arithmetic inline, so state it
            statically or precompute it in a recording rule.

4. Sloth — PrometheusServiceLevel CRD

Who made it: Sloth was created by Xabier Larrakoetxea (slok) at Spotahome. The name follows the SRE ethos — a sloth is slow and deliberate, which is exactly how you want your error budget burn. If you are burning budget fast, something is wrong.

Sloth takes a YAML SLO spec and generates the correct Prometheus rules — recording rules for intermediate calculations and alerting rules for all four burn rate windows.

Sloth SLO spec:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: order-service
  namespace: monitoring
spec:
  service: "order-service"
  labels:
    team: platform
    env: production
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "99.9% of API requests complete successfully"
      sli:
        events:
          error_query: sum(rate(http_requests_total{job="order-service",code=~"5.."}[{{.window}}]))
          total_query: sum(rate(http_requests_total{job="order-service"}[{{.window}}]))
      alerting:
        name: OrderServiceHighErrorRate
        labels:
          team: platform
          channel: pagerduty
        annotations:
          runbook: "https://runbooks.example.com/order-service-availability"
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

    - name: "requests-latency"
      objective: 99.0
      description: "99% of requests complete in under 500ms"
      sli:
        events:
          # errors = all requests minus requests that completed under 500ms
          error_query: |
            sum(rate(http_request_duration_seconds_count{job="order-service"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{job="order-service",le="0.5"}[{{.window}}]))
          total_query: |
            sum(rate(http_request_duration_seconds_count{job="order-service"}[{{.window}}]))
      alerting:
        name: OrderServiceHighLatency
        labels:
          team: platform
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning

Generate rules from Sloth:

# CLI — generate Prometheus rules YAML from SLO spec
sloth generate -i order-service-slo.yaml -o order-service-rules.yaml

# Validate spec first
sloth validate -i order-service-slo.yaml

# As Kubernetes controller — apply the CRD and Sloth reconciles rules automatically
kubectl apply -f order-service-slo.yaml
# Sloth controller generates PrometheusRule resources automatically

# Inspect generated rules
kubectl get prometheusrule order-service -n monitoring -o yaml

What Sloth generates for a 99.9% SLO:

# Recording rules (intermediate values for efficiency)
- record: slo:sli_error:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{job="order-service",code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="order-service"}[5m]))

- record: slo:sli_error:ratio_rate30m
  expr: <same for 30m window>

- record: slo:sli_error:ratio_rate1h
  expr: <same for 1h window>

# ... recording rules for 2h, 6h, 24h, 72h windows

# Alerting rules (4 rules using pre-computed recording rules)
- alert: SLOBudgetFastBurn
  expr: |
    slo:sli_error:ratio_rate5m{...} > (14.4 * (1 - 0.999))
    and
    slo:sli_error:ratio_rate1h{...} > (14.4 * (1 - 0.999))
  labels: {severity: critical, ...}

- alert: SLOBudgetFastBurn
  expr: |
    slo:sli_error:ratio_rate30m{...} > (6 * (1 - 0.999))
    and
    slo:sli_error:ratio_rate6h{...} > (6 * (1 - 0.999))
  labels: {severity: critical, ...}

# ... two more ticket-level alerts

5. Pyrra — SLO CRD with UI

Pyrra is another open-source SLO tool with two notable differences from Sloth: it ships a built-in UI showing error budget status, and it defines SLOs through its own ServiceLevelObjective CRD, which the Pyrra operator reconciles into PrometheusRule resources.

Pyrra SLO definition:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: order-service-availability
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  target: "99.9"
  window: 4w  # 28-day window
  description: "Percentage of successful HTTP requests to the order service"
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="order-service",code=~"5.."}
      total:
        metric: http_requests_total{job="order-service"}
      grouping:
        - handler

Deploying Pyrra:

# Install Pyrra operator
kubectl apply -f https://github.com/pyrra-dev/pyrra/releases/latest/download/config-operator.yaml

# Apply SLO CRDs
kubectl apply -f order-service-slo.yaml

# Pyrra reconciles to PrometheusRule resources
kubectl get prometheusrule -n monitoring | grep order-service

# Access Pyrra UI (shows error budget status per SLO)
kubectl port-forward svc/pyrra-api 9099:9099 -n monitoring
# Open http://localhost:9099

Pyrra's UI shows:

- Current error budget remaining (as a percentage and as absolute time)
- Burn rate over different windows
- Historical error budget consumption

6. OpenSLO Spec

Fun fact: OpenSLO was initiated by Nobl9 in 2021 and contributed to a community governance model. It borrows its structure from the Kubernetes API convention (apiVersion, kind, metadata, spec) to feel familiar to platform engineers, even though OpenSLO resources are not Kubernetes CRDs by default.

OpenSLO is a vendor-neutral specification for SLO definitions. It decouples the SLO definition from any specific tooling:

# openslo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: order-service-availability
  displayName: "Order Service Availability"
spec:
  service: order-service
  description: "Ratio of successful requests to total requests"
  timeWindow:
    - duration: 4w
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: "Good requests"
      target: 0.999  # 99.9%
  indicator:
    metadata:
      name: order-service-request-success
    spec:
      ratioMetric:
        counter: true
        good:
          metricSource:
            type: Prometheus
            spec:
              query: http_requests_total{job="order-service",code!~"5.."}
        total:
          metricSource:
            type: Prometheus
            spec:
              query: http_requests_total{job="order-service"}
  alertPolicies:
    - kind: AlertPolicy
      metadata:
        name: order-service-alerts
      spec:
        conditions:
          - kind: AlertCondition
            metadata:
              name: BudgetBurn
            spec:
              condition:
                kind: BurnRate
                burnRateThreshold: 2
                lookbackWindow: 1h
                alertAfter: 5m

Using OpenSLO specs with Sloth:

# Sloth has (partial) native support for reading OpenSLO specs directly;
# check your Sloth version's documentation for the supported fields
sloth generate -i openslo.yaml -o rules.yaml

7. Error Budget Math and Burn Rate Formulas

# Core formulas
Error budget (%) = 100 - SLO_target_percent
Error budget (minutes/30d) = 43200 × (1 - SLO_decimal)

# Burn rate → time to exhaustion
Time to exhaustion = SLO_window / burn_rate

# At what error rate is the budget exhausted at exactly 1x burn rate?
Budget exhaustion rate = 1 - SLO_target

# What error rate corresponds to N× burn rate?
Error rate at Nx = N × (1 - SLO_target)

# How much budget does an incident consume?
Budget consumed = incident_error_rate × incident_duration / (SLO_window × (1 - SLO_target))

# Examples for 99.9% SLO (30-day window):
# 14.4x burn: 1.44% error rate → budget exhausted in 2.08 days
# 6x burn:    0.6%  error rate → budget exhausted in 5 days
# 3x burn:    0.3%  error rate → budget exhausted in 10 days
# 1x burn:    0.1%  error rate → budget exhausted in 30 days (exactly on target)
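
The incident-consumption formula above can be checked with a short Python sketch (illustrative names):

```python
def budget_consumed(incident_error_rate: float,
                    incident_minutes: float,
                    slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget one incident consumes.

    budget_consumed = error_rate * duration / (window * (1 - slo_target))
    """
    window_minutes = window_days * 24 * 60
    return (incident_error_rate * incident_minutes) / (window_minutes * (1 - slo_target))

# Total outage (100% errors) for 21.6 minutes against a 99.9%/30d SLO
# consumes half of the 43.2-minute budget:
print(budget_consumed(1.0, 21.6, 0.999))  # → ~0.5
```

The same incident at a 50% error rate would consume a quarter of the budget; error rate and duration trade off linearly.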

8. SLO Dashboards in Grafana

Sloth generates dashboard annotations, but you need to configure the Grafana dashboard manually or use a pre-built one:

// Grafana dashboard panel — error budget remaining
{
  "type": "stat",
  "title": "Error Budget Remaining",
  "targets": [{
    "expr": "1 - (sum(increase(http_requests_total{job=\"order-service\",code=~\"5..\"}[30d])) / sum(increase(http_requests_total{job=\"order-service\"}[30d]))) / (1 - 0.999)"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          {"color": "red",    "value": 0},
          {"color": "yellow", "value": 0.25},
          {"color": "green",  "value": 0.50}
        ]
      }
    }
  }
}

PromQL for SLO dashboard panels:

# Current error rate (1h window)
sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="order-service"}[1h]))

# Error budget consumed this month (0-1 scale, 0 = full budget, 1 = exhausted)
(
  sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
  /
  sum(increase(http_requests_total{job="order-service"}[30d]))
) / (1 - 0.999)

# Burn rate (how fast budget is burning right now, 1x = on target)
(
  sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="order-service"}[1h]))
) / (1 - 0.999)

# Time to budget exhaustion at current burn rate (in hours)
(
  (1 - (
    sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d]))
    /
    sum(increase(http_requests_total{job="order-service"}[30d]))
  ) / (1 - 0.999))
  *
  (30 * 24)  # hours in 30 days
)
/
(
  (
    sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h]))
    /
    sum(rate(http_requests_total{job="order-service"}[1h]))
  ) / (1 - 0.999)
)
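
The same time-to-exhaustion arithmetic as the PromQL above, in plain Python (illustrative, mirroring the query structure):

```python
def hours_to_exhaustion(budget_consumed_fraction: float,
                        current_burn_rate: float,
                        window_days: int = 30) -> float:
    """Hours until the budget runs out at the current burn rate.

    remaining budget (expressed in window-hours) / burn rate
    """
    window_hours = window_days * 24  # 720 for a 30-day window
    remaining = 1 - budget_consumed_fraction
    return remaining * window_hours / current_burn_rate

# Half the budget left, burning at 6x → 60 hours to exhaustion
print(hours_to_exhaustion(0.5, 6.0))  # → 60.0
# Full budget at exactly 1x burn → the whole 720-hour window
print(hours_to_exhaustion(0.0, 1.0))  # → 720.0
```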

9. Alertmanager Integration

Route SLO alerts differently based on severity and burn rate:

# alertmanager.yaml
route:
  receiver: default
  routes:
    # Critical burn rate — page immediately
    - matchers:
        - severity = critical
        - slo = "true"
      receiver: pagerduty-critical
      group_wait: 0s
      group_interval: 5m
      repeat_interval: 1h

    # Warning burn rate — create ticket
    - matchers:
        - severity = warning
        - slo = "true"
      receiver: slack-slo-warnings
      group_wait: 10m
      group_interval: 1h
      repeat_interval: 8h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        description: '{{ template "pagerduty.description" . }}'
        details:
          slo: '{{ index .GroupLabels "slo" }}'
          burn_rate: '{{ index .CommonAnnotations "burn_rate" }}'
          runbook: '{{ index .CommonAnnotations "runbook" }}'

  - name: slack-slo-warnings
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#slo-alerts"
        text: |
          SLO Warning: {{ .GroupLabels.alertname }}
          Service: {{ .GroupLabels.service }}
          Budget remaining: see Pyrra dashboard

10. Nobl9 and Commercial SLO Platforms

For teams that want a managed SLO platform, Nobl9 provides:

- SLO definitions via its YAML API or web UI
- Native integrations with Prometheus, Datadog, New Relic, Splunk, and Dynatrace
- Error budget tracking across multiple data sources
- SLO-based alerting policies

Nobl9 SLO definition (YAML API):

apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: order-service-availability
  project: production
spec:
  service: order-service
  indicator:
    metricSource:
      name: prometheus-prod
      type: Prometheus
      project: production
    rawMetric:
      query: |
        rate(http_requests_total{job="order-service",code!~"5.."}[{{.Resolution}}])
        /
        rate(http_requests_total{job="order-service"}[{{.Resolution}}])
  timeWindows:
    - unit: Day
      count: 30
      isRolling: true
  objectives:
    - displayName: "Good"
      target: 0.999
      value: 1
      op: gte
  alertPolicies:
    - burnRateCondition:
        burnRate: 14.4
        exhaustionTime: 2h

Quick Reference

Task                       Command
Validate Sloth spec        sloth validate -i slo.yaml
Generate rules (CLI)       sloth generate -i slo.yaml -o rules.yaml
Apply Sloth CRD            kubectl apply -f slo.yaml
List Pyrra SLOs            kubectl get slo -n monitoring
Current burn rate          see PromQL above
Budget consumed (PromQL)   (sum(increase(errors[30d])) / sum(increase(total[30d]))) / (1 - slo_target)
Install Sloth operator     helm install sloth slok/sloth -n monitoring
Install Pyrra operator     kubectl apply -f config-operator.yaml (from Pyrra releases)
Error rate at 14.4x burn   14.4 × (1 - SLO_target)

Wiki Navigation

Prerequisites