SRE Practices: Trivia & Interesting Facts
Surprising, historical, and little-known facts about Site Reliability Engineering practices.
The title "Site Reliability Engineer" was coined by Ben Treynor Sloss at Google in 2003
Ben Treynor Sloss, VP of Engineering at Google, created the SRE role in 2003 with the simple explanation: "SRE is what happens when you ask a software engineer to design an operations team." The original team was 7 engineers. By 2024, Google's SRE organization had grown to thousands of engineers and had inspired SRE teams at virtually every major tech company.
The Google SRE book was the most popular free O'Reilly book ever published
"Site Reliability Engineering: How Google Runs Production Systems" (2016), available free online at sre.google, is often described as the most downloaded free book in O'Reilly's history. The book defined SRE as a discipline and established concepts like error budgets, SLOs, toil, and blameless postmortems as industry vocabulary. The companion book, "The Site Reliability Workbook" (2018), provided practical implementation guidance.
Error budgets flip the adversarial relationship between developers and ops
Before error budgets, developers wanted to ship fast (risking reliability) and ops wanted stability (resisting change). Error budgets created an objective framework: if the service has remaining error budget, developers can ship; if the budget is exhausted, they must focus on reliability. This simple mechanism transformed a political argument into a data-driven process. The concept is often cited as one of Google's most influential SRE ideas.
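The ship-or-stabilize rule above can be sketched as a small check. This is a minimal illustration, assuming a request-based SLO; the function names `remaining_error_budget` and `can_ship` are hypothetical and not taken from any Google tooling.

```python
def remaining_error_budget(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_ship(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Allow feature releases only while some error budget remains."""
    return remaining_error_budget(slo, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    # 99.9% SLO over one million requests tolerates roughly 1,000 failures.
    print(can_ship(0.999, 1_000_000, 400))   # True: budget remains
    print(can_ship(0.999, 1_000_000, 1500))  # False: budget is exhausted
```

A real policy would likely add a measurement window and an escalation path; the point here is only that the decision reduces to arithmetic on agreed numbers.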
Google SREs spend a maximum of 50% of their time on operational work
Google's SRE charter mandates that SREs spend no more than 50% of their time on "toil": repetitive, automatable operational work. The remaining 50% must be spent on engineering projects that eliminate toil. If a team's toil exceeds 50%, the SRE book describes redirecting the excess operational work back to the product development team (reassigning tickets, putting developers on the pager rotation); teams may also add headcount, reduce service scope, or invest in automation. This 50% rule is the most commonly cited, and most commonly violated, SRE principle at non-Google companies.
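As a rough sketch of how a team might track the 50% cap, the snippet below sums logged hours and flags a breach. The `WeekLog` type, its fields, and the idea of logging hours weekly are assumptions made for illustration, not a documented Google process.

```python
from dataclasses import dataclass
from typing import List

TOIL_CAP = 0.5  # the 50% ceiling described above


@dataclass
class WeekLog:
    """Hypothetical weekly time log for one engineer."""
    toil_hours: float
    engineering_hours: float


def toil_fraction(logs: List[WeekLog]) -> float:
    """Share of all logged hours that went to toil."""
    toil = sum(log.toil_hours for log in logs)
    total = toil + sum(log.engineering_hours for log in logs)
    return toil / total if total else 0.0


def exceeds_cap(logs: List[WeekLog], cap: float = TOIL_CAP) -> bool:
    """True when toil has crossed the cap and should trigger a review."""
    return toil_fraction(logs) > cap


if __name__ == "__main__":
    logs = [WeekLog(25, 15), WeekLog(22, 18)]
    print(toil_fraction(logs))  # 0.5875 (47 of 80 hours)
    print(exceeds_cap(logs))    # True
```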
SLOs should be set below 100% because 100% is the wrong target
A counterintuitive SRE principle: setting a 100% reliability target is harmful because it prevents deployments (any change could reduce reliability), wastes engineering resources pursuing diminishing returns, and is physically impossible at scale. Google's guidance is that an SLO should be no stricter than users actually need, typically 99.9% to 99.99% for most services. The gap between the SLO and 100% is the error budget: a 99.9% SLO, for example, leaves a budget of 0.1%.
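To make the gap between the SLO and 100% concrete, here is a small sketch that converts an availability SLO into allowed downtime. The 30-day window and the function name `allowed_downtime_minutes` are assumptions chosen for the example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that fits inside the error budget for the window."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.9, 99.99, 100.0):
        print(slo, round(allowed_downtime_minutes(slo), 2))
    # 99.9  -> 43.2 minutes per 30 days
    # 99.99 -> 4.32 minutes per 30 days
    # 100.0 -> 0.0 minutes, which leaves no room for any change to fail
```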
The "toil" definition is more specific than most people think
In SRE, "toil" is not just "work I don't like." It has a precise definition: work that is manual, repetitive, automatable, tactical, devoid of lasting value, and scales linearly with service growth. Answering a novel question from a customer is not toil (it's not repetitive). Writing a design document is not toil (it has lasting value). Running the same manual failover procedure for the third time this month is toil.
Postmortems at Google follow a specific template that enforces blamelessness
Google's postmortem template includes: summary, impact, root cause, trigger, detection, response timeline, action items, and lessons learned. Crucially, it does not include a "who caused it" field. Action items must be systemic (better monitoring, safer deployment pipelines, improved testing) rather than personal ("engineer X needs more training"). This structure makes blamelessness a procedural guarantee, not just a cultural aspiration.
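The template's sections can be modeled as a plain record, which makes the absence of a "who caused it" field explicit in the structure itself. This is only a sketch: the `Postmortem` class and `missing_sections` helper are hypothetical and the field names simply mirror the list above, not Google's actual document format.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Postmortem:
    """Sections listed above; deliberately has no field naming a person."""
    summary: str = ""
    impact: str = ""
    root_cause: str = ""
    trigger: str = ""
    detection: str = ""
    response_timeline: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)
    lessons_learned: str = ""


def missing_sections(pm: Postmortem) -> List[str]:
    """Names of sections that are still empty, in template order."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name)]


if __name__ == "__main__":
    draft = Postmortem(summary="Checkout API outage", impact="12% of requests failed")
    print(missing_sections(draft))
```

A review step could refuse to close a postmortem until `missing_sections` returns an empty list.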
The concept of "production readiness review" was invented at Google
Before a new service can be handed off to Google SREs, it must pass a Production Readiness Review (PRR) that covers monitoring, capacity planning, failure modes, emergency procedures, and SLOs. The PRR ensures that SREs aren't handed services that are impossible to operate. Many companies have adopted their own versions, though most lack the enforcement mechanism — at Google, failing the PRR means SREs won't carry your pager.
Dickerson's hierarchy of reliability is SRE's version of Maslow's hierarchy
Mikey Dickerson, a former Google SRE who led the rescue of HealthCare.gov and later became the first administrator of the US Digital Service, proposed a hierarchy of service reliability needs (often visualized as a pyramid): monitoring at the base, then incident response, then postmortems, then testing, then capacity planning, then development practices, and finally product design at the top. You can't work on the higher levels until the lower levels are solid, just as in Maslow's hierarchy.
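The "lower levels first" rule can be expressed as a lookup for the lowest level that is not yet solid. The sketch below is illustrative only; the level names follow the pyramid described above, and `next_focus` is a hypothetical helper.

```python
from typing import Optional, Set

# Levels from the base of the pyramid to the top.
HIERARCHY = [
    "monitoring",
    "incident response",
    "postmortems",
    "testing",
    "capacity planning",
    "development practices",
    "product design",
]


def next_focus(solid_levels: Set[str]) -> Optional[str]:
    """Return the lowest level that is not yet solid, or None if all are."""
    for level in HIERARCHY:
        if level not in solid_levels:
            return level
    return None


if __name__ == "__main__":
    # Testing is in place, but incident response is not, so it comes first.
    print(next_focus({"monitoring", "testing"}))  # incident response
```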
On-call compensation varies wildly and is a contentious topic in SRE
Some companies pay SREs extra for on-call (typically $1,000-2,000 per week of primary on-call), while others consider it part of the base salary. Google provides on-call bonuses and ensures adequate rest after night pages. Many companies do neither. The lack of standardized on-call compensation is one of the most common complaints in SRE surveys and a significant factor in burnout and attrition.
The "four golden signals" for monitoring came from Google SRE
Google's SRE book defined the four golden signals: latency, traffic, errors, and saturation. These four metrics, monitored for every service, provide a baseline understanding of service health. The concept has been widely adopted in dashboards and alerting built on tools such as Prometheus, Datadog, and Grafana. RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) are related frameworks from Tom Wilkie and Brendan Gregg respectively.
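As an illustration of what the four signals look like in practice, the sketch below derives them from a batch of request records. Everything here is an assumption for the example: the `Request` type, the choice of p99 for latency, and passing CPU utilization in as the saturation measure. Real systems usually compute these in a metrics backend rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_seconds: float,
                   cpu_utilization: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile using integer arithmetic: ceil(0.99 * n) - 1.
    p99 = latencies[(99 * n + 99) // 100 - 1] if n else 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": p99,
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n if n else 0.0,
        "saturation": cpu_utilization,  # supplied by the caller, not derived
    }


if __name__ == "__main__":
    sample = [Request(10, True), Request(20, True), Request(30, True), Request(400, False)]
    print(golden_signals(sample, window_seconds=2.0, cpu_utilization=0.7))
```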