SLO Tooling Footguns¶
1. Using Absolute Error Count Instead of Error Rate in SLIs¶
Your SLO alert fires because the absolute count of 500 errors exceeded a threshold. A service that processes 1,000,000 requests/day can have 1,000 errors and still be well within a 99.9% SLO. But if you alert on count > 500, you page on a normal Tuesday.
Fix: SLIs must be ratios — good events divided by total events. Never use absolute counts in SLO definitions. The burn rate formula only makes sense when your SLI is bounded between 0 and 1. Check every SLO definition: if the query does not produce a value between 0 and 1, it is not a valid SLI.
Remember: The SLI formula is always good events / total events. For availability: successful requests / total requests. For latency: requests faster than threshold / total requests. If your query returns a number outside [0, 1], it is not an SLI — it is a metric. This distinction matters because error budgets and burn rates are derived mathematically from the SLI ratio.
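As a concrete sketch (metric names are illustrative, assuming a standard `http_requests_total` counter and a duration histogram), both SLI shapes in PromQL:

```promql
# Availability SLI: good events / total events, always lands in [0, 1]
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: requests faster than 500ms / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```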
2. Setting SLO Targets Without Historical Data¶
You pick 99.9% because it sounds professional. Your service's actual historical availability is 98.5%. Now your team is permanently out of error budget. Every deploy causes a severity ticket. Engineers start ignoring SLO alerts because they fire constantly.
Fix: Calculate your current baseline before setting a target. Query 90 days of historical data:
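A minimal baseline query, assuming a standard `http_requests_total` counter (adjust the `job` label and error matcher to your service; this also assumes your Prometheus retention covers 90 days, otherwise use the longest window you have):

```promql
# 90-day availability baseline: the target you can actually sustain today
sum(rate(http_requests_total{job="my-service",code!~"5.."}[90d]))
/
sum(rate(http_requests_total{job="my-service"}[90d]))
```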
Set your initial SLO target at or below this baseline. Then improve reliability until you can sustainably raise it. A 98% SLO you reliably meet is more useful than a 99.9% SLO you constantly breach.
3. Hand-Editing PrometheusRules Generated by Sloth¶
Sloth generates a PrometheusRule resource and labels it with sloth_service, sloth_slo, and sloth_mode. You edit the rule directly in Kubernetes to fix a threshold. The next time Sloth reconciles (on any change to the CRD), it overwrites your edit. Your change is silently lost.
Fix: Treat Sloth-generated PrometheusRule resources as generated artifacts — never edit them directly. All changes must go through the PrometheusServiceLevel CRD source. If you need to override an alert threshold, use Sloth's override fields or disable a generated alert and write a supplemental rule in a separate PrometheusRule.
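For orientation, a sketch of the source-of-truth CRD (field names follow Sloth's v1 API as of recent versions; verify against the version you run; the service and queries are placeholders):

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-service
  namespace: monitoring
spec:
  service: my-service
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          # Sloth substitutes {{.window}} for each alert window it generates
          errorQuery: sum(rate(http_requests_total{job="my-service",code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{job="my-service"}[{{.window}}]))
      alerting:
        name: MyServiceAvailability
        pageAlert:
          labels:
            severity: page
```

Edit this CRD, let Sloth regenerate the PrometheusRule, and the reconcile loop works for you instead of against you.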
4. Single-Window Burn Rate Alerts (Forgetting the Second Window)¶
You write a burn rate alert that fires when the 1-hour error rate exceeds 14.4× the budget consumption rate. A 30-second spike in errors makes the 1-hour window spike and pages on-call. By the time they respond, the spike is resolved and the 1-hour rate has dropped. This is a false positive caused by using only one time window.
Fix: Always use two windows — a short window to detect urgency and a long window to confirm the signal is real. Both must exceed the burn rate threshold. (The 14.4 factor is the rate that burns 2% of a 30-day budget in one hour: 0.02 × 720 h / 1 h = 14.4.)
```promql
# SHORT window: the burn is still happening right now
rate(errors[5m]) / rate(total[5m]) > 14.4 * (1 - 0.999)
and
# LONG window: the burn is sustained, not a momentary blip
rate(errors[1h]) / rate(total[1h]) > 14.4 * (1 - 0.999)
```
5. Forgetting That increase() Does Not Work With Staleness Gaps¶
Your SLO is defined over a 30-day window using increase(http_requests_total[30d]). The service was restarted 15 days ago and Prometheus has a counter reset recorded. increase() handles resets correctly within its window. But if the series was absent (service down, scrape failing) for several hours, increase() extrapolates across the gap, producing incorrect values — either inflated or deflated error budget calculations.
Fix: Use sum(rate(...[window])) * seconds_in_window for multi-day windows with potentially missing data, or use recording rules that accumulate daily and sum them. Alternatively, use avg_over_time(slo:sli_error:ratio_rate5m[30d]) over Sloth's pre-computed 5-minute error-ratio recording rule — the 5-minute window limits extrapolation error, and averaging the per-5m ratios approximates the 30-day error ratio (unweighted by traffic volume).
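A sketch of the recording-rule approach (metric and rule names are illustrative; Sloth generates an equivalent `slo:sli_error:ratio_rate5m` rule for you):

```yaml
groups:
  - name: slo-sli-recording
    rules:
      # Pre-compute the 5m error ratio. Keeping the rate window short
      # bounds extrapolation error when samples are missing, and the
      # recorded series is cheap to aggregate over 30 days later.
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```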
6. Alerting on Error Budget Percentage Without Urgency Context¶
Your alert reads "Error budget at 20% remaining." The on-call engineer does not know whether to panic or monitor. 20% remaining with 1 day left in the month at 1× burn rate means the service will coast to month-end fine. 20% remaining with 20 days left at 6× burn rate means the budget is exhausted in 1 day (0.20 × 30 days ÷ 6) and action is urgent.
Fix: Include burn rate and time-to-exhaustion in every SLO alert annotation:
```yaml
annotations:
  summary: "Error budget at {{ $value | humanizePercentage }} remaining"
  description: |
    Current burn rate: {{ with query "..." }}{{ . | first | value | printf "%.1f" }}{{ end }}x
    At this rate, budget exhausted in: {{ ... }} hours
    Runbook: https://runbooks.example.com/slo-budget-response
```
Gotcha: Prometheus annotation templates only expose built-ins such as {{ $value }} and {{ $labels }}; there is no custom {{ $burnRate }} in annotation templates. Compute the burn rate inline with the template query function instead.
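The time-to-exhaustion arithmetic itself is simple: remaining budget fraction × window length ÷ burn rate. A sketch in PromQL, assuming two hypothetical recording rules you would define yourself (`slo:error_budget_remaining:ratio`, the fraction of budget left in [0, 1], and `slo:burn_rate:1h`, the current burn rate multiple):

```promql
# Hours until the error budget is exhausted for a 30-day window.
# Both series names are placeholders, not standard Sloth/Pyrra rules.
(slo:error_budget_remaining:ratio * 30 * 24)
/
slo:burn_rate:1h
```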
7. SLO Target Covers the Wrong Population¶
Your availability SLO counts all HTTP requests, including internal health checks from Kubernetes (/healthz, /readyz), and metrics scrapes (/metrics). These always succeed and inflate your good-event count. Your real user-facing availability is worse than the SLO reports.
Fix: Exclude internal traffic from SLO metrics at the SLI level:
```promql
# Wrong — includes health checks
sum(rate(http_requests_total{code!~"5.."}[5m]))

# Right — exclude internal endpoints
sum(rate(http_requests_total{code!~"5..",handler!~"/healthz|/readyz|/metrics"}[5m]))
```
Alternatively, attach an slo="true" label at the instrumentation layer and only count metrics that carry it.
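If you control instrumentation, the allowlist approach looks like this (the `slo="true"` label is something your application must set itself; it is not a convention any library provides):

```promql
# Only series explicitly opted in to SLO accounting are counted
sum(rate(http_requests_total{slo="true",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{slo="true"}[5m]))
```

The allowlist is more robust than the blocklist: a newly added internal endpoint is excluded by default instead of silently inflating the SLI.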
8. Missing for Duration on Burn Rate Alerts¶
You write a burn rate alert without a for duration. Prometheus fires the alert as soon as the expression is true at a single rule evaluation (every evaluation_interval, commonly 15s to 1m). A momentary spike causes an immediate page. With for: 0s (the default), there is zero debounce.
Fix: Use tiered for durations per alert severity, in the spirit of the SRE Workbook's burn-rate alerting table:
- 14.4× burn rate: for: 2m (confirms the fast burn is sustained)
- 6× burn rate: for: 5m
- 3× burn rate: for: 15m
- 1× burn rate: for: 1h
Sloth generates these correctly. If writing manually, include for: on every SLO alert.
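Putting the two windows and the for duration together, a hand-written fast-burn rule might look like this (metric names `errors` and `total` are placeholders, matching the snippet in section 4):

```yaml
- alert: SLOFastBurnRate
  expr: |
    (rate(errors[5m]) / rate(total[5m]) > 14.4 * (1 - 0.999))
    and
    (rate(errors[1h]) / rate(total[1h]) > 14.4 * (1 - 0.999))
  # 2m debounce: the fast burn must be sustained across evaluations
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Fast error budget burn (14.4x) detected"
```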
9. Using SLOs Without Defining an Error Budget Policy¶
You have beautiful SLO dashboards and working alerts. But when the budget is exhausted, nothing changes. No deploys are frozen, no reliability sprint is triggered, no escalation happens. The SLOs become decoration — engineers stop caring because there are no consequences.
Fix: Before defining SLOs, define the error budget policy:
1. What happens when 50% of budget is consumed? (Warning, monitor closely)
2. What happens when 80% is consumed? (Reliability work prioritized over features)
3. What happens when 100% is consumed? (Deploy freeze, mandatory postmortem)
4. Who approves exceptions? (Approver for emergency deploys during budget exhaustion)
Write this as a documented policy and link it from every SLO alert annotation. The tooling is only valuable if the organizational process exists.
10. Treating Pyrra's UI Error Budget as Authoritative Without Verifying the Underlying Data¶
Pyrra shows 45% error budget remaining. You trust it and skip investigation. But the SLI query is counting only a subset of error codes — code="500" — while your application also returns 503s from the load balancer (which Pyrra's query misses). Real availability is lower than displayed.
Fix: Periodically verify SLO calculations manually. Run the raw SLI query in Prometheus and compare to what Pyrra shows. Review SLI query coverage — are you catching all error conditions? Typical misses: gateway errors from the load balancer (not instrumented by the app), timeout responses that return 200 with an error body, gRPC status codes (which use different error semantics than HTTP status codes).
Gotcha: gRPC uses numeric status codes (0=OK, 13=INTERNAL, 14=UNAVAILABLE) that are entirely separate from HTTP status codes. A gRPC error returns HTTP 200 with the error in the gRPC trailer. If your SLI only counts HTTP 5xx, you miss every gRPC failure. Use grpc_server_handled_total with grpc_code!="OK" for gRPC SLIs.
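A minimal gRPC error-ratio SLI along those lines, assuming the `grpc_server_handled_total` metric exported by go-grpc-prometheus-style middleware:

```promql
# Fraction of gRPC calls completing with a non-OK status code
sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m]))
/
sum(rate(grpc_server_handled_total[5m]))
```

Note that some non-OK codes (for example NotFound or InvalidArgument) may be expected client behavior; decide per service which codes count against the budget.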
11. Creating Too Many SLOs (Alert Fatigue Through Completeness)¶
You define SLOs for every endpoint, every service, every error code combination. 150 SLOs in Pyrra, 600 generated alert rules in Prometheus. Everything fires simultaneously during any incident. On-call cannot determine which SLO matters most. Engineers tune out SLO alerts entirely.
Fix: Fewer, better SLOs beat many incomplete ones. Start with one availability SLO and one latency SLO per user-facing service (not per endpoint). Add SLOs for critical background jobs and data pipelines only if their failure is user-visible. Review alert volume monthly — if more than 2 SLO alerts fire per on-call shift on average, you have too many or the thresholds are wrong.
12. Defining SLOs for Metrics You Do Not Own¶
Your team's service calls a third-party payment API. You define an SLO that includes payment API errors in your error rate. The payment provider has an outage. Your SLO is breached. Your team gets paged and blamed. But you cannot fix the payment provider.
Fix: SLOs should only cover behaviors your team controls. Use dependency probing (synthetic monitoring) to track third-party SLOs separately. In your service's SLO, either exclude errors caused by dependency failures (add a label to distinguish them) or define a separate "dependency availability" SLO that has a different response process — escalate to vendor, not to your on-call.
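One way to implement the exclusion, assuming your application attaches a hypothetical `error_source` label that distinguishes dependency-caused failures from your own:

```promql
# Count only the errors your team can act on.
# error_source is an application-defined label (illustrative),
# not a standard Prometheus or client-library convention.
sum(rate(http_requests_total{code=~"5..",error_source!="dependency"}[5m]))
/
sum(rate(http_requests_total[5m]))
```

The dependency-attributed errors then feed the separate "dependency availability" SLO with its vendor-escalation response process.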