SLO Tooling — Street-Level Ops¶
Quick Diagnosis Commands¶
# Check if Sloth controller is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=sloth
# List all PrometheusServiceLevel CRDs
kubectl get prometheusservicelevel -A
# Check if Sloth generated PrometheusRules from an SLO
kubectl get prometheusrule -n monitoring | grep sloth
# Inspect generated rules from Sloth
kubectl get prometheusrule sloth-slo-order-service -n monitoring -o yaml | grep -A5 "alert:"
# Check Pyrra SLOs
kubectl get servicelevelobjective -A
# Check if PrometheusRules are being picked up by Prometheus Operator
kubectl get prometheusrule -n monitoring -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n'
# Query the 30-day error ratio (budget consumed = this ratio / 0.001 for a 99.9% SLO)
promtool query instant http://localhost:9090 \
'sum(increase(http_requests_total{job="order-service",code=~"5.."}[30d])) /
sum(increase(http_requests_total{job="order-service"}[30d]))'
# Check the current 1h burn rate (0.001 = error budget of a 99.9% SLO; healthy services stay < 1)
promtool query instant http://localhost:9090 \
'(sum(rate(http_requests_total{job="order-service",code=~"5.."}[1h])) /
sum(rate(http_requests_total{job="order-service"}[1h]))) / 0.001'
# View active SLO alerts
kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
promtool query instant http://localhost:9090 'ALERTS{slo=~".+"}'
Gotcha: Sloth Controller Not Generating Rules¶
You apply a PrometheusServiceLevel CRD and no PrometheusRule appears.
Diagnosis:
# Check Sloth controller logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sloth --tail=50
# Common errors:
# "unknown field" → Sloth CRD version mismatch
# "failed to render" → invalid SLI query template
# "reconciliation error" → RBAC issue
# Check Sloth RBAC — it needs permission to manage PrometheusRule CRDs
kubectl auth can-i create prometheusrule --as=system:serviceaccount:monitoring:sloth -n monitoring
# Validate your SLO spec before applying
sloth validate -i your-slo.yaml
# Check CRD is installed
kubectl get crd | grep sloth
# Expected: prometheusservicelevels.sloth.slok.dev
Common causes:
1. Sloth version mismatch — the spec apiVersion must match the deployed CRD version
2. SLI query missing the {{.window}} template — Sloth substitutes each alert window into the query; a hardcoded window like [5m] produces rules that all measure the same period
3. RBAC missing — Sloth's service account cannot create PrometheusRule in your namespace
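A minimal spec that avoids all three pitfalls might look like the sketch below. This is an illustration, not a drop-in manifest — the service name, labels, and queries are placeholders:

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: order-service            # placeholder
  namespace: monitoring
spec:
  service: "order-service"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "99.9% of requests succeed"
      sli:
        events:
          # Note the {{.window}} template — never hardcode a range like [5m]
          error_query: sum(rate(http_requests_total{job="order-service",code=~"5.."}[{{.window}}]))
          total_query: sum(rate(http_requests_total{job="order-service"}[{{.window}}]))
      alerting:
        name: OrderServiceHighErrorRate
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning
```

Run `sloth validate` against the file before applying; the apiVersion above must match the CRD version your controller deployed.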
Pattern: Validate SLO Rules Before They Fire¶
Never trust generated rules without testing them manually:
# Check if the recording rules exist and return data
promtool query instant http://localhost:9090 \
'slo:sli_error:ratio_rate5m{sloth_service="order-service"}'
# If empty, the recording rule is not being evaluated
# → check PrometheusRule was created
# → check Prometheus picked it up (rules tab in Prometheus UI)
# → check the SLI query returns data with actual windows
# Test the raw SLI query with a real window
promtool query instant http://localhost:9090 \
'sum(rate(http_requests_total{job="order-service",code=~"5.."}[5m])) /
sum(rate(http_requests_total{job="order-service"}[5m]))'
# If the result is empty, the selectors match no series
# → check job label is correct
# → check metric exists: kubectl exec prometheus -- promtool query instant ... 'http_requests_total'
# Simulate a burn rate spike to test alerting
# (in a test/staging environment only)
for i in $(seq 1 1000); do
curl -s -o /dev/null -w "%{http_code}\n" http://order-service/api/orders/999999
done
Gotcha: SLO Alert Fires Constantly on Low-Traffic Services¶
A service processes 100 requests per day. One 500 error fires the 14.4× burn rate alert. You get paged at 3 AM for a single failed request.
Rule: Multi-window multi-burn-rate math breaks down at very low traffic. One bad event in a 5-minute window of 2 total requests = 50% error rate = a 500× burn rate against a 0.1% budget.
# Add a minimum traffic threshold to the alert condition
# Only alert if there is meaningful traffic AND high error rate
(
sum(rate(http_requests_total{job="order-service",code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="order-service"}[5m]))
) > (14.4 * 0.001)
and
sum(rate(http_requests_total{job="order-service"}[5m])) > 0.1
# ^ minimum 0.1 rps (6 requests per minute) before alerting
Remember: Multi-burn-rate alerting is designed for services with steady traffic. Below ~1 RPS, a single error creates astronomical burn rates. The minimum traffic guard is not optional for low-traffic services — without it, you will train your on-call team to ignore SLO alerts.
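The arithmetic behind that warning, as a quick shell check (numbers from the example above: 1 error in 2 requests, against the 0.1% budget of a 99.9% SLO):

```shell
# errors / total = error rate; error rate / budget = burn rate
errors=1; total=2; budget=0.001
awk -v e="$errors" -v t="$total" -v b="$budget" \
  'BEGIN { printf "error_rate=%.2f burn_rate=%.0fx\n", e/t, (e/t)/b }'
# → error_rate=0.50 burn_rate=500x
```

At steady traffic of 100 RPS the same single error in a 5-minute window would barely register; the formula is identical, only the denominator changes.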
In Sloth, note that the alerting block only customizes the generated alerts' labels and annotations — it cannot change the alert expression. Keep the traffic-guarded alert as a separate, hand-maintained PrometheusRule, and use the annotation to document the guard:
# Sloth override — labels/annotations only; the traffic guard lives in a separate rule
alerting:
  page_alert:
    labels:
      severity: critical
    annotations:
      description: "Fires only when traffic > 0.1 rps AND burn rate > 14.4x"
Pattern: Multi-Service SLO Aggregation¶
When you have 20 microservices and want a single "platform availability" SLO:
# Aggregate availability across all services
# (weighted by request volume — heavier services dominate)
sum(rate(http_requests_total{code!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
# Per-service error budget consumption ranked
sort_desc(
(
sum by (job) (
increase(http_requests_total{code=~"5.."}[30d])
)
/
sum by (job) (
increase(http_requests_total[30d])
)
) / 0.001 # normalize to budget units (1.0 = 100% budget consumed)
)
# Services that have consumed >50% of their error budget this month
(
sum by (job) (
increase(http_requests_total{code=~"5.."}[30d])
)
/
sum by (job) (
increase(http_requests_total[30d])
)
) / 0.001 > 0.5
Scenario: Error Budget Exhausted — What Now?¶
Your SLO alert fires at "ticket" severity. You check: 80% of the monthly error budget is already consumed, with 20 days left in the month.
# Step 1: Quantify the remaining budget
promtool query instant http://localhost:9090 \
'(0.001 - sum(rate(http_requests_total{job="order-service",code=~"5.."}[30d])) /
sum(rate(http_requests_total{job="order-service"}[30d]))) / 0.001'
# Positive = budget remaining, negative = already over SLO
# Step 2: Find when budget started draining fast
# Look at 6-hour burn rate over the past week in Grafana
# Use the SLO dashboard — look for the point where burn rate jumped above 1.0
# Step 3: Correlate with deploys
kubectl get events -n production --sort-by='.lastTimestamp' | grep -i "deploy\|image\|update" | tail -20
# Step 4: Decision framework
# Burn rate < 1x: service is recovering, monitor closely
# Burn rate 1-3x: schedule remediation this sprint
# Burn rate 3-6x: remediation this week, freeze non-critical deploys
# Burn rate > 6x: page, escalate, consider rollback
# Step 5: If rolling back
kubectl rollout undo deployment/order-service -n production
kubectl rollout status deployment/order-service -n production
Error budget exhaustion policy options:
- Freeze deploys: no new features until the budget is replenished (new month or budget recovered)
- Tech debt sprint: use the remaining budget window to prioritize reliability work
- SLO negotiation: if the budget is consistently exhausted, revisit whether 99.9% is the right target
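The decision framework is easier with a concrete number. A sketch of the days-until-exhaustion arithmetic, using the scenario's 80% consumed budget and a hypothetical steady burn rate of 1.5×:

```shell
# remaining budget fraction * window length / burn rate = days until exhausted
consumed=0.80; burn_rate=1.5; window_days=30
awk -v c="$consumed" -v r="$burn_rate" -v w="$window_days" \
  'BEGIN { printf "days_until_exhausted=%.1f\n", (1-c)*w/r }'
# → days_until_exhausted=4.0
```

Four days of budget against 20 calendar days remaining makes the "remediation this week, freeze non-critical deploys" tier the obvious call.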
Gotcha: Recording Rules Not Matching Alert Rules¶
Sloth generates recording rules with specific label sets. If your alert rules query recording rules with different labels, you get "no data" and alerts never fire — or never resolve.
# Find what labels the recording rule actually produces
promtool query instant http://localhost:9090 \
'{__name__=~"slo:sli_error:ratio_rate.*"}'
# Typical labels from Sloth:
# {sloth_service="order-service", sloth_slo="requests-availability", sloth_id="order-service-requests-availability"}
# Your alert rule must match these labels exactly
# If you hand-modify alert rules and change label selectors, they break silently
# Always regenerate rules through Sloth rather than editing PrometheusRule directly
Pattern: Latency SLOs with Histograms¶
Availability SLOs are straightforward. Latency SLOs require histogram metrics and careful threshold selection:
# Sloth SLO for latency (fragment of a PrometheusServiceLevel spec)
- name: "requests-latency-p99"
  objective: 99.0
  description: "99% of requests complete in under 500ms"
  sli:
    events:
      # Error: requests slower than the 500ms threshold (total minus the fast bucket)
      error_query: |
        sum(rate(http_request_duration_seconds_count{job="order-service"}[{{.window}}]))
        -
        sum(rate(http_request_duration_seconds_bucket{job="order-service",le="0.5"}[{{.window}}]))
      # Total: all requests
      total_query: |
        sum(rate(http_request_duration_seconds_count{job="order-service"}[{{.window}}]))
This is an inverted availability SLI: "good" events are those within the latency threshold, so error_query is the total minus the le="0.5" bucket. Sloth counts errors, not goods — if you put the good-event query in error_query, a healthy service reports a ~99% error rate. The burn-rate math otherwise works the same way.
# Verify latency SLI returns sensible values (should be between 0 and 1)
sum(rate(http_request_duration_seconds_bucket{job="order-service",le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="order-service"}[5m]))
# Expected: ~0.99 for a healthy service with 99th percentile < 500ms
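To sanity-check the inversion by hand: if the good-ratio query above returns 0.995 against a 99% objective (a 1% error budget), the error ratio and burn rate come out as:

```shell
# good ratio -> error ratio -> burn rate against the SLO's error budget
good_ratio=0.995; objective=99.0
awk -v g="$good_ratio" -v o="$objective" \
  'BEGIN { budget=(100-o)/100; err=1-g; printf "error_ratio=%.3f burn_rate=%.1fx\n", err, err/budget }'
# → error_ratio=0.005 burn_rate=0.5x
```

A burn rate of 0.5× means the service is spending budget at half the sustainable pace — healthy.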
Watch out: le="0.5" must be an exact bucket boundary in your histogram. If your histogram has le values of [0.1, 0.25, 1.0, 5.0], using le="0.5" returns nothing. Use le="1.0" or adjust your histogram bucket boundaries.
Default trap: Prometheus client libraries use default histogram buckets optimized for HTTP latency (e.g., Go defaults: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds). If your SLO threshold is 200ms but you have no bucket between 100ms and 250ms, your SLI measurement is coarse and the SLO alert may fire late. Always define custom buckets that include your SLO threshold as an exact boundary.
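A quick way to check bucket coverage before committing to a threshold — this awk sketch finds the smallest default Go bucket at or above a hypothetical 200ms SLO threshold:

```shell
# Default Go client histogram buckets (seconds)
threshold=0.2
buckets="0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10"
awk -v t="$threshold" -v b="$buckets" 'BEGIN {
  n = split(b, a, " ")
  for (i = 1; i <= n; i++) if (a[i]+0 >= t+0) { cover = a[i]; break }
  # an exact match means the SLI can measure the threshold precisely
  printf "covering_bucket=%s exact=%s\n", cover, (cover+0 == t+0 ? "yes" : "no")
}'
# → covering_bucket=0.25 exact=no
```

Here a 200ms threshold silently becomes "≤250ms" in the SLI, so either move the threshold to a real boundary or add a 0.2 bucket to the instrumented service.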
Pattern: Pyrra UI for Error Budget Reviews¶
Use Pyrra's API for automated budget reviews:
# List all SLOs with current status
curl -s http://pyrra:9099/api/objectives | jq '.[] | {name: .metadata.name, target: .spec.target, availability: .status.availability}'
# Get error budget for specific SLO
curl -s "http://pyrra:9099/api/objectives/order-service-availability" | jq '.status'
# Export SLO status for a weekly review report
curl -s http://pyrra:9099/api/objectives | jq -r '.[] | [
.metadata.name,
.spec.target,
(.status.availability | tostring),
(.status.errorBudget.remaining | tostring)
] | @csv'
Emergency: SLO Alert Flood — Silence and Triage¶
When multiple SLO alerts fire simultaneously (cascade failure):
# Step 1: Understand scope — how many SLOs are breached?
curl -s http://alertmanager:9093/api/v2/alerts | \
jq '[.[] | select(.labels.slo)] | group_by(.labels.service) | map({service: .[0].labels.service, count: length})'
# Step 2: Silence warning-level SLO alerts to reduce noise during the incident
# (slo=~".+" matches any SLO label value; slo=true would only match the literal string "true")
amtool silence add \
--alertmanager.url=http://alertmanager:9093 \
--duration=4h \
--comment="Incident in progress, silencing warning-level SLO alerts" \
severity=warning 'slo=~".+"'
# Step 3: Focus on the highest burn rate first
promtool query instant http://localhost:9090 \
'sort_desc(sum by (sloth_service) (slo:current_burn_rate:ratio))'
# Step 4: After incident, expire the silence
amtool silence expire --alertmanager.url=http://alertmanager:9093 <silence_id>
Useful One-Liners¶
# Current error rate for all services with job label
promtool query instant http://localhost:9090 \
'sort_desc(sum by (job) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])))'
# Services consuming >10% of 99.9% budget per hour
promtool query instant http://localhost:9090 \
'(sum by (job) (rate(http_requests_total{code=~"5.."}[1h])) / sum by (job) (rate(http_requests_total[1h]))) / 0.001 > 0.1'
# Days of budget remaining at current burn rate
promtool query instant http://localhost:9090 \
'30 / (sum by (job) (rate(http_requests_total{code=~"5.."}[1h])) / sum by (job) (rate(http_requests_total[1h])) / 0.001)'
# Sloth dry-run — preview generated rules without applying
sloth generate -i slo.yaml | promtool check rules /dev/stdin
# Apply Sloth SLOs from a directory
for f in slos/*.yaml; do
sloth validate -i "$f" && kubectl apply -f "$f"
done
# Find all Prometheus recording rules related to SLOs
kubectl get prometheusrule -A -o json | \
jq -r '.items[] | select(any(.spec.groups[]?.rules[]?; .record? // "" | startswith("slo:"))) | .metadata.name'