Alerting Rules — Trivia & Interesting Facts
Surprising, historical, and little-known facts about alerting rules.
Google SREs popularized the term "alert fatigue" to describe the #1 operational problem
Google's SRE team found that on-call engineers who received more than about 2 alerts per 12-hour shift began ignoring or dismissing alerts. Their SRE book (2016) popularized the term "alert fatigue" and established the principle that every alert should be actionable, require human intelligence, and represent a real threat to user experience. Anything else should be a ticket or a log entry.
The Three Mile Island nuclear accident was worsened by 100+ simultaneous alarms
On March 28, 1979, operators at the Three Mile Island nuclear plant were overwhelmed by more than 100 simultaneous alarms, many of them contradictory. The alarm printer fell two hours behind real time, and operators could not distinguish critical warnings from minor notifications. The incident became a foundational case study in alarm management and directly influenced how modern alerting systems use severity levels and deduplication.
PagerDuty was founded in 2009 because on-call notification was done via phone trees
Before PagerDuty, many ops teams used manual phone trees — a list of numbers to call in order until someone answered. Alex Solomon, Andrew Miklas, and Baskar Puvanathasan founded PagerDuty after experiencing the pain of ad-hoc on-call notification at Amazon. The company reached a $2 billion valuation by the early 2020s, built entirely around the concept of reliable alert routing.
The "boy who cried wolf" problem costs companies millions in missed real alerts
A 2019 study by Dimensional Research found that 83% of IT professionals reported experiencing alert fatigue, and 44% admitted to ignoring alerts or turning off notifications entirely. In multiple post-incident analyses, teams discovered that the critical alert had fired but was lost in a flood of non-actionable warnings — sometimes hundreds per day.
Prometheus recording rules exist because real-time aggregation is too expensive
Prometheus's recording rules pre-compute expensive queries at regular intervals, storing the results as new time series. This feature was necessary because large Prometheus installations with millions of time series found that complex aggregation queries (e.g., percentiles across thousands of services) could take minutes to compute. Recording rules reduce alert evaluation from minutes to milliseconds.
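A minimal sketch of a recording rule that pre-computes a percentile; the metric name `http_request_duration_seconds` and the 30-second interval are illustrative assumptions, not taken from any particular deployment:

```yaml
groups:
  - name: latency-recording
    interval: 30s
    rules:
      # Pre-compute the p99 latency per job once per interval, so alert
      # expressions can read one cheap pre-aggregated series instead of
      # re-aggregating raw histogram buckets at every evaluation.
      - record: job:http_request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

The `level:metric:operation` naming convention shown here is the one Prometheus's documentation recommends for recorded series.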
The "for" clause in Prometheus alerting rules prevents thousands of false pages
Prometheus alerting rules support a "for" duration that requires a condition to be true for a sustained period before firing. Without this clause, brief metric spikes (even single scrape failures) would generate pages. The typical recommendation is a "for" duration of 5-15 minutes for most alerts, which eliminates transient false positives while still catching genuine outages quickly.
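A hedged sketch of the pattern: an error-rate alert that must hold for ten minutes before paging. The metric name, 5% threshold, and label values are assumptions for illustration:

```yaml
groups:
  - name: availability-alerts
    rules:
      - alert: HighErrorRate
        # The condition must stay true for 10 consecutive minutes before
        # the alert fires; a single failed scrape or brief spike resets
        # the timer and never pages anyone.
        expr: >
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
```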
Dead man's switch alerts detect when your monitoring itself has failed
A "dead man's switch" alert fires when a system stops sending heartbeats, rather than when a threshold is crossed. This pattern is critical for detecting silent monitoring failures — if your alerting pipeline goes down, no alerts fire, and everything appears healthy. Dead man's switch patterns are considered mandatory in mature SRE organizations and are typically implemented as the last line of defense.
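In Prometheus terms, the common implementation is an alert that always fires, routed to an external heartbeat service that pages when the stream of notifications stops. This is the "Watchdog" pattern shipped by kube-prometheus; the external receiver is assumed, not shown:

```yaml
groups:
  - name: meta-monitoring
    rules:
      # vector(1) is always true, so this alert fires continuously by
      # design. An external service outside your infrastructure expects
      # to receive it every few minutes and pages you if it ever stops
      # arriving, which means the monitoring pipeline itself is down.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Alerting pipeline heartbeat; absence means monitoring is broken"
```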
Alert routing trees can have over 100 branches in large organizations
Enterprise alerting systems like PagerDuty and Opsgenie support complex routing trees that direct alerts to different teams based on severity, service, time of day, and geography. Large organizations with hundreds of microservices may have routing configurations with over 100 branches. Maintaining these routing rules has become a specialized skill, and misconfigured routing is a common factor in delayed incident response.
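A small sketch of what one such tree looks like in Alertmanager's configuration, where branches match on labels and the first matching route wins; the receiver names and service labels are hypothetical:

```yaml
route:
  receiver: default-team          # fallback when no branch matches
  routes:
    # Branch selection by severity and service label; evaluation stops
    # at the first match unless a route sets `continue: true`.
    - matchers: ['severity="critical"', 'service=~"checkout|payments"']
      receiver: payments-oncall
    - matchers: ['severity="critical"']
      receiver: sre-oncall
    - matchers: ['severity="warning"']
      receiver: ticket-queue
```

At a hundred-plus branches, ordering bugs like these first-match semantics are exactly how misrouted alerts delay incident response.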
The SLO-based alerting model reduced Google's alert volume by 90%
When Google SRE teams shifted from symptom-based alerting ("CPU > 80%") to SLO-based alerting ("error budget burn rate exceeds threshold"), they reported reductions in alert volume of up to 90%. The key insight was that most resource utilization alerts are not correlated with user-visible impact, while SLO burn rate directly measures whether users are affected.
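A sketch of a multiwindow burn-rate alert in the style described in Google's SRE Workbook, for an assumed 99.9% availability SLO over a 30-day window (metric names and the 14.4x burn-rate factor, which corresponds to consuming roughly 2% of the budget in an hour, are illustrative):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetBurn
        # The 1h window detects a sustained burn; the 5m window confirms
        # the burn is still happening right now, so recovered incidents
        # do not keep paging.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: page
```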