The Monitoring We Ignored

Category: The Hard Lesson · Domains: monitoring, alerting · Read time: ~5 min


Setting the Scene

We had 847 alerts configured in PagerDuty across 12 services. I know the exact number because I counted them during the postmortem. On an average week, the on-call engineer received 300+ notifications. Most were noise — transient CPU spikes, brief connection pool exhaustion, GC pauses that resolved in seconds. The team had developed a reflex: phone buzzes, glance at title, acknowledge, go back to sleep.

We called it "alert yoga." Acknowledge, breathe, acknowledge, breathe.

What Happened

At 11:42 PM on a Friday, a Prometheus alert fired: disk_usage_critical on prod-db-primary-01. This was our main PostgreSQL server — 96 cores, 768 GB RAM, 4 TB NVMe. The alert threshold was 90% disk usage. Real, actionable, important.

The on-call engineer acknowledged it in 8 seconds. Then went back to watching a movie. He later told me he genuinely didn't read it. His thumb had developed muscle memory for the acknowledge button.

The disk continued filling. WAL segments were accumulating because a stale replication slot, left behind by a decommissioned replica, was preventing WAL recycling — PostgreSQL retains WAL segments as long as any slot might still need them. At 12:18 AM, the disk hit 100%. PostgreSQL went read-only to protect data integrity, and every write operation across the platform failed.
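The stale-slot pattern is easy to confirm from the database itself. A diagnostic query along these lines (the column choices and ordering are illustrative; pg_replication_slots and pg_wal_lsn_diff are standard in PostgreSQL 10+) would have shown an inactive slot pinning gigabytes of WAL:

```sql
-- Illustrative check: list every replication slot and how much WAL it pins.
-- An inactive slot with a large retained_wal value is the smoking gun.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```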

Our application error rate went from 0.1% to 94% in under a minute. But the error rate alert? It had been firing intermittently for three weeks due to a known bug in a non-critical endpoint. The on-call had snoozed it.

The actual outage detection came at 12:47 AM, 29 minutes after the database went read-only, when a batch job owner emailed the ops mailing list asking why their ETL was failing. The on-call saw the email at 1:03 AM.

The fix was fast once we understood it: SELECT pg_drop_replication_slot('replica_03_slot'); followed by waiting for WAL cleanup. The database was accepting writes again by 1:22 AM. Total write outage: 64 minutes. Time to detect: 45 minutes of that.

The Moment of Truth

In the postmortem, we pulled the PagerDuty analytics. In the previous 30 days, the team had received 1,247 alerts. Of those, 1,190 had been acknowledged within 30 seconds with no action taken. The mean time to acknowledge was 11 seconds. The mean time to actually investigate was 4 hours and 12 minutes. We'd built a system that was indistinguishable from having no monitoring at all.

The Aftermath

We declared "alert bankruptcy." Over two weeks, we deleted 680 alerts. Every remaining alert had to meet three criteria: it indicates a customer-facing impact, it requires human action, and the responder knows what action to take. We dropped from 847 alerts to 167. On-call notifications dropped from 300+ per week to about 15.

Six months later, mean time to engage on a real alert was under 3 minutes.

The Lessons

  1. Alert fatigue kills: When everything is an emergency, nothing is. Your team will develop coping mechanisms that make real alerts invisible.
  2. Fix or remove noisy alerts: Every alert that fires without requiring action is actively making your system less safe. Delete it or fix the underlying condition.
  3. Every alert should be actionable: If the runbook for an alert is "check if it resolves itself," that's not an alert — it's a log line.
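Lesson 2 can be made mechanical rather than aspirational. Here's a minimal sketch of the kind of review job that flags inactionable alerts — the event model, field names, and threshold are assumptions for illustration, not our actual tooling:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertEvent:
    """One firing of an alert, as it might be exported from an
    alerting tool's history (hypothetical schema)."""
    name: str
    acted_on: bool  # did the responder do anything beyond acknowledging?


def flag_noisy_alerts(events, max_inactionable=3):
    """Return names of alerts that fired more than `max_inactionable`
    times without a single firing resulting in human action."""
    fired = Counter(e.name for e in events)
    actioned = {e.name for e in events if e.acted_on}
    return sorted(name for name, count in fired.items()
                  if count > max_inactionable and name not in actioned)
```

Anything this returns is a candidate for deletion or for fixing the underlying condition — not for another snooze.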

What I'd Do Differently

I'd implement an alert review process from day one: any alert that fires more than 3 times in 30 days without resulting in human action gets auto-disabled and reviewed. I'd also add a # Runbook section to every Prometheus alerting rule, and reject rules in code review that don't include one. Finally, I'd track the "acknowledge-to-action" time as a team metric, not just "acknowledge time."
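For the runbook requirement, a Prometheus alerting rule might look like the sketch below. The metric selectors, threshold, and URL are illustrative, and `runbook_url` is a widely used annotation convention rather than anything Prometheus enforces — the point is that code review can reject any rule missing it:

```yaml
# Illustrative rule: customer-facing impact, requires human action,
# and carries its runbook. Names and thresholds are examples only.
groups:
  - name: database
    rules:
      - alert: DiskUsageCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) < 0.10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Disk on {{ $labels.instance }} is above 90% full"
          # Runbook: check pg_replication_slots for inactive slots pinning WAL.
          runbook_url: "https://wiki.example.com/runbooks/disk-usage-critical"
```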

The Quote

"We didn't have a monitoring problem. We had 847 monitoring problems, and we'd trained ourselves to ignore all of them."
