
SLO Tooling — Trivia & Interesting Facts

Surprising, historical, and little-known facts about SLO tooling.


Google built an internal SLO management tool used by thousands of teams

Google's internal SLO tooling, described in the SRE Workbook (2018), automatically tracks error budgets, generates burn rate alerts, and produces reports for every service. The tooling is deeply integrated into Google's production infrastructure, and every service launch requires defined SLOs. This internal system inspired the entire category of commercial SLO tooling that emerged in the late 2010s.


Sloth was created because manually writing SLO alerting rules is error-prone

Sloth, an open-source SLO generator for Prometheus, was created because correctly implementing multi-window multi-burn-rate alerts (as described in Google's SRE Workbook) requires writing dozens of complex PromQL recording and alerting rules. A single SLO specification in Sloth generates 6-12 recording rules and multiple alerting rules, eliminating the manual math that frequently led to incorrect alert thresholds.
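For illustration, a Sloth specification is compact; a sketch of roughly what one looks like (the service and metric names are invented, and the `{{.window}}` placeholder is expanded by Sloth into each generated time window):

```yaml
version: "prometheus/v1"
service: "payments-api"          # illustrative service name
slos:
  - name: "requests-availability"
    objective: 99.9              # target as a percentage
    description: "Ratio of non-5xx responses."
    sli:
      events:
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: PaymentsApiHighErrorRate
      page_alert:
        labels:
          severity: critical     # fast-burn alerts page
      ticket_alert:
        labels:
          severity: warning      # slow-burn alerts file tickets
```

From this single file, Sloth emits the recording rules for each window plus the multi-window burn-rate alert expressions.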


Multi-window multi-burn-rate alerting was considered revolutionary when Google published it

The multi-window, multi-burn-rate alerting technique, described in Chapter 5 ("Alerting on SLOs") of the SRE Workbook, combines several lookback windows (e.g., 1h, 6h, 3d) with different burn-rate thresholds to detect SLO violations at different speeds. A fast burn (2% of a 30-day budget consumed in one hour, a 14.4x burn rate) pages immediately, while a slow burn (10% of the budget over 3 days, a 1x rate) creates a ticket. Before this approach, most alerting was either too sensitive (false pages) or too slow (missed outages).
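The thresholds behind these tiers are simple arithmetic; a minimal sketch (function names are illustrative, not from any particular tool):

```python
def burn_rate(budget_fraction, window_hours, period_hours=30 * 24):
    """Burn-rate multiple that consumes `budget_fraction` of the
    error budget within `window_hours` of a `period_hours` SLO period."""
    return budget_fraction * period_hours / window_hours


def alert_threshold(slo_target, budget_fraction, window_hours):
    """Error-rate threshold for a burn-rate alert: the burn-rate
    multiple times the allowed error rate (1 - SLO target)."""
    return burn_rate(budget_fraction, window_hours) * (1 - slo_target)


# The SRE Workbook's recommended tiers for a 30-day period:
fast = burn_rate(0.02, 1)      # page:   2% of budget in 1h  -> 14.4x
medium = burn_rate(0.05, 6)    # page:   5% of budget in 6h  -> 6x
slow = burn_rate(0.10, 72)     # ticket: 10% of budget in 3d -> 1x
```

For a 99.9% SLO, the fast-burn page thus fires when the 1-hour error rate exceeds `14.4 * 0.001 = 1.44%`.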


OpenSLO was created to standardize SLO definitions across vendors

OpenSLO, launched in 2021 by Nobl9 and the community, provides a vendor-neutral YAML specification for defining SLOs. Before OpenSLO, every tool (Datadog, Dynatrace, Nobl9, Google Cloud) had its own proprietary format for SLO definitions, making it difficult to migrate SLO configurations between platforms. The spec covers SLIs, SLOs, error budgets, and alert policies.
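For a feel of the format, an abbreviated OpenSLO definition might look roughly like this (service and query names are invented; consult the v1 spec for the full schema):

```yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
spec:
  service: payments-api          # illustrative service name
  budgetingMethod: Occurrences   # count good vs. total events
  timeWindow:
    - duration: 30d
      isRolling: true
  objectives:
    - displayName: availability
      target: 0.999
  indicator:
    metadata:
      name: api-availability-ratio
    spec:
      ratioMetric:
        counter: true
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total{code!~"5.."}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total[5m]))
```

Because the `metricSource` block is pluggable, the same SLO definition can point at Prometheus, Datadog, or another backend.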


Most organizations define SLOs that are either too aggressive or never enforced

A 2023 industry survey found that approximately 60% of organizations that defined SLOs either set targets so aggressive they were immediately violated (rendering them meaningless) or set them but never tied consequences to breaching them. Effective SLO programs require organizational buy-in — when error budgets are exhausted, feature development must actually stop, which requires executive sponsorship.


Burn rate alerts replaced threshold alerts for SLO monitoring

Traditional threshold alerts (e.g., "error rate > 5%") do not account for the cumulative impact on the error budget. A 5% error rate for 30 seconds barely dents the budget; a 0.5% error rate sustained for a week will exhaust the entire monthly budget of a 99.9% SLO. Burn rate alerts measure how quickly the error budget is being consumed relative to the rate that would exactly exhaust it by the end of the period, providing much more actionable signals.
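The arithmetic behind the 0.5%-for-a-week example can be sketched as follows (a simplified model, not any tool's implementation):

```python
def days_to_exhaust_budget(error_rate, slo_target, period_days=30):
    """Days until the error budget is gone if `error_rate` is sustained.

    The burn rate is the observed error rate divided by the allowed
    error rate (1 - SLO target); the budget lasts period/burn days.
    """
    allowed = 1 - slo_target
    burn = error_rate / allowed
    return period_days / burn


# Against a 99.9% SLO, a sustained 0.5% error rate is a 5x burn rate,
# so a 30-day budget is gone in ~6 days -- well inside a week.
print(days_to_exhaust_budget(0.005, 0.999))
```

A threshold alert at "error rate > 5%" would never fire in this scenario, which is exactly the gap burn-rate alerting closes.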


SLO dashboards became the primary executive communication tool for reliability

Before SLO tooling, reliability was communicated to executives through incident counts and uptime percentages, which lacked context. SLO dashboards showing error budget remaining (e.g., "87% of monthly budget remaining on day 15") provide intuitive, actionable status that non-technical stakeholders can understand. This transformed reliability conversations from reactive incident discussions to proactive budget management.


Synthetic SLOs fill the gap when real user metrics are unavailable

For backend services and infrastructure components without direct user traffic (databases, message queues, internal APIs), SLO tooling supports "synthetic SLOs" — measuring service health through synthetic probes that simulate user behavior. These synthetic checks run continuously and generate the SLI data needed for error budget calculations when real user metrics are not available or not representative.


The Pyrra project makes SLO management Kubernetes-native

Pyrra, an open-source SLO manager, uses Kubernetes Custom Resources (CRDs) to define SLOs and automatically generates Prometheus recording rules and Grafana dashboards. By making SLOs part of the Kubernetes resource model, Pyrra enables SLO definitions to be version-controlled, reviewed in pull requests, and deployed through GitOps pipelines alongside the services they monitor.
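A Pyrra SLO is just a small Kubernetes object; a sketch of roughly what one looks like (names and label matchers are illustrative):

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-request-errors
  namespace: monitoring
spec:
  target: "99.5"        # objective as a percentage string
  window: 4w            # rolling SLO window
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="api", code=~"5.."}
      total:
        metric: http_requests_total{job="api"}
```

Applied with `kubectl`, this resource causes Pyrra to generate the corresponding Prometheus recording and alerting rules, so the SLO lives in the same Git repository and review workflow as the deployment manifests.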