# Postmortem: Alert Routing Sends All Pages to Decommissioned Channel
| Field | Value |
|---|---|
| ID | PM-030 |
| Date | 2025-10-17 |
| Severity | Near-Miss |
| Duration | 0m (no customer impact) |
| Time to Detect | 63h (Monday morning; discovery by manual check) |
| Time to Mitigate | 47m (Alertmanager config corrected and reloaded) |
| Customer Impact | None |
| Revenue Impact | None |
| Teams Involved | SRE, Observability, On-Call Engineering |
| Postmortem Author | Cassandra Njoku |
| Postmortem Date | 2025-10-20 |
## Executive Summary
On Friday 2025-10-17 at 17:02 UTC, an Alertmanager configuration migration from Slack webhooks to PagerDuty receivers was deployed by SRE engineer Tomás Guerrero. The Slack receiver was removed correctly, but the PagerDuty receiver was defined in the receivers block without being wired into the route tree — the routing rules still referenced the old Slack receiver name, which now resolved to the implicit null receiver. All alerts fired from 17:02 Friday to 09:14 Monday were silently dropped: 47 alerts over 63 hours, zero delivered. The blackout was discovered Monday morning when SRE team lead Cassandra Njoku noticed the PagerDuty incident timeline was empty for the weekend and checked the Alertmanager status UI. No real incidents occurred during the blackout window; all 47 alerts were transient and self-resolved. If a SEV-1 had occurred over the weekend, there would have been no automated notification to the on-call engineer.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 2025-10-17 16:45 | Tomás Guerrero begins applying Alertmanager config migration: removes Slack receiver slack-ops-alerts, adds PagerDuty receiver pagerduty-prod; config passes amtool check-config validation |
| 2025-10-17 17:02 | Alertmanager config reloaded via HTTP POST to /-/reload; Alertmanager returns HTTP 200 |
| 2025-10-17 17:05 | Tomás sends test alert using amtool alert add with severity=critical; observes alert appears in Alertmanager's "Alerts" UI; does not check PagerDuty to confirm receipt |
| 2025-10-17 17:08 | Tomás marks the migration task complete in Jira (OBS-441); signs off for the weekend |
| 2025-10-17 17:22 | First alert fires post-migration: DiskPressureWarning on node worker-14; routed to null; dropped |
| 2025-10-17 18:41 | HighMemoryUtilization alert fires and self-resolves; dropped |
| 2025-10-18 (Saturday) | 19 alerts fire over the day (disk pressure, GC pauses, health check flaps); all dropped |
| 2025-10-19 (Sunday) | 26 alerts fire; all dropped |
| 2025-10-20 09:02 | Cassandra Njoku starts Monday shift; reviews PagerDuty for weekend incident summary; finds zero incidents |
| 2025-10-20 09:07 | Cassandra queries the Alertmanager API (/api/v2/alerts) directly; sees 3 active alerts with status suppressed and receiver null |
| 2025-10-20 09:09 | Cassandra inspects Alertmanager config: routing tree references slack-ops-alerts receiver (does not exist); pagerduty-prod receiver defined but not referenced in any route |
| 2025-10-20 09:11 | Cassandra pages Tomás and Observability team lead Ingrid Holm; opens #sre-incidents thread |
| 2025-10-20 09:14 | Alertmanager config updated: routing tree receiver field changed to pagerduty-prod |
| 2025-10-20 09:17 | Config reloaded; test alert sent and confirmed received in PagerDuty within 45 seconds |
| 2025-10-20 09:35 | Retrospective call begins; historical alert review determines all 47 alerts were transient |
| 2025-10-20 10:01 | Incident closed; postmortem scheduled |
| 2025-10-20 14:30 | Cassandra begins postmortem authorship; Observability team opens OBS-449 (alerting end-to-end test) |
## Impact
### Customer Impact
None — the 47 alerts that fired during the blackout were all transient, self-resolving conditions (disk pressure warnings, GC pause spikes, health check flaps caused by a scheduled rolling restart Saturday morning). None crossed the threshold that would have triggered a user-visible incident.
### Internal Impact
- Tomás Guerrero (SRE): ~2 hours (Monday remediation, postmortem participation, root cause review)
- Cassandra Njoku (SRE lead): ~3 hours (discovery, incident coordination, postmortem authorship, process review)
- Ingrid Holm (Observability lead): ~2 hours (config investigation, process redesign discussion)
- On-call engineers (Friday-Monday): unaware of the blackout during the weekend; no active hours were lost, but they unknowingly carried on-call risk with no functioning paging path
- Total: approximately 7 engineering-hours
### Data Impact
None.
## What Would Have Happened
Meridian Cloud's on-call rotation covers the weekend with a single engineer per tier (Tier-1 and Tier-2). The mean time to detect for weekend incidents is already approximately 3x the weekday rate, because engineers are not actively monitoring dashboards and rely exclusively on PagerDuty pages to learn of problems. Without any alerting, the only detection paths for a weekend incident would have been: a customer support ticket escalation (typical lag: 20-90 minutes after customers notice, plus ticket routing delay); an engineer proactively checking dashboards on their own initiative (rare and unreliable on weekends); or automated external uptime monitoring (which exists but only monitors HTTP 200 on the public health check endpoint — it does not detect partial failures or internal service degradation).
Based on Meridian Cloud's historical incident frequency — approximately 2.3 SEV-1/2 incidents per month — there is a roughly 15% probability that a qualifying incident would occur on any given weekend. This is not a low probability; over the course of a year, a 15% weekly probability implies approximately 7-8 weekends per year where a real incident fires. The 63-hour blackout encompassed one full weekend. Historical data from 3 real weekend SEV-1s over the past 18 months shows a median time-to-detect (from incident start to first engineer acknowledgment) of 47 minutes without alerting failures. Extrapolating to a zero-alerting scenario using the support-ticket detection path suggests a median time-to-detect of 3-5 hours — a 4-6x degradation.
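The weekend-incident figures above follow from a simple Poisson model of incident arrivals; a quick sketch of the arithmetic, assuming incidents arrive independently at the historical rate:

```python
import math

# Historical incident rate: ~2.3 SEV-1/2 incidents per month (~30.44 days).
rate_per_day = 2.3 / 30.44

def p_incident(window_days: float) -> float:
    """Probability of at least one incident in the window (Poisson arrivals)."""
    return 1 - math.exp(-rate_per_day * window_days)

p_weekend = p_incident(2.0)        # a 48-hour weekend: roughly 14-15%
p_blackout = p_incident(63 / 24)   # the actual 63-hour window: roughly 18-20%
weekends_per_year = 52 * p_weekend # roughly 7-8 incident-weekends per year

print(f"{p_weekend:.0%} {p_blackout:.0%} {weekends_per_year:.1f}")
```

The model treats incidents as memoryless, which is a simplification, but it reproduces both the ~15% weekend figure and the ~20% figure for the full 63-hour blackout quoted in this report.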
For Meridian Cloud's most critical service (the API gateway), 3-5 hours of undetected downtime would represent approximately $180K-$300K in SLA credits (based on committed uptime SLAs of 99.9% monthly) and potential customer churn from enterprise accounts whose own operations depend on the API. Additionally, Meridian's enterprise MSA with three customers includes a contractual 30-minute maximum notification requirement for SEV-1 incidents. A 3-5 hour detection gap would have been a material breach of those contracts.
## Root Cause
### What Happened (Technical)
Alertmanager's configuration has two distinct sections that must be kept in sync: the receivers block (which defines how to send notifications — Slack webhook URL, PagerDuty routing key, etc.) and the route tree (which defines which alerts go to which receiver, by name). A receiver defined in the receivers block but not referenced in any route rule is inert. A route rule referencing a receiver name that does not exist in the receivers block falls through to the null receiver, which discards the alert silently.
Tomás's migration correctly added the pagerduty-prod entry to the receivers block. However, the routing tree's receiver field at the top level — which acts as the default catch-all for any unmatched alert — still referenced slack-ops-alerts. Because slack-ops-alerts had been deleted from the receivers block, the routing tree had a dangling reference. Alertmanager's amtool check-config validation confirmed that the YAML was syntactically valid — it does not validate that receiver names in route are present in receivers. The configuration loaded without error.
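A pre-deploy check for this class of error is mechanically simple. Below is a minimal sketch of the cross-reference the validation tooling lacked, with the configuration shown as an already-parsed dict (a real hook would first load the YAML file); the nested route/routes shape mirrors Alertmanager's routing tree:

```python
def dangling_receivers(config: dict) -> list[str]:
    """Return receiver names referenced in the route tree but not defined."""
    defined = {r["name"] for r in config.get("receivers", [])}
    dangling = []

    def walk(route: dict) -> None:
        name = route.get("receiver")
        if name is not None and name not in defined:
            dangling.append(name)
        for child in route.get("routes", []):  # recurse into nested routes
            walk(child)

    walk(config.get("route", {}))
    return dangling

# The Friday config, post-migration: pagerduty-prod defined but never routed
# to, while the route tree still points at the deleted slack-ops-alerts.
friday_config = {
    "receivers": [{"name": "pagerduty-prod"}],
    "route": {"receiver": "slack-ops-alerts", "routes": []},
}
print(dangling_receivers(friday_config))  # ['slack-ops-alerts'] -> fail the deploy
```

A deploy hook that exits non-zero whenever this list is non-empty would have blocked the Friday change outright.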
The test Tomás ran after reloading (amtool alert add followed by checking the Alertmanager UI) confirmed that alerts appeared in Alertmanager's internal queue — which they do correctly, regardless of whether the routing destination is valid. He did not check the PagerDuty service to confirm an incident had been created. The Alertmanager UI shows alerts in the "active" or "suppressed" states, and the receiver name shown in the UI is null for misconfigured routes — but this is not visually alarming if the engineer does not know to look for it.
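The end-to-end check the smoke test lacked can be sketched as a small script: fire a synthetic alert at Alertmanager's v2 API, then poll PagerDuty's REST incidents endpoint for a matching incident. The endpoint paths are the public Alertmanager and PagerDuty APIs, but the hostname, token value, marker label, and timeout below are illustrative placeholders:

```python
import json
import time
import urllib.request
from datetime import datetime, timezone

ALERTMANAGER = "http://alertmanager.internal:9093"  # placeholder host
PAGERDUTY_TOKEN = "REPLACE_ME"                      # placeholder API token

def synthetic_alert(marker: str) -> bytes:
    """JSON payload for POST {ALERTMANAGER}/api/v2/alerts."""
    now = datetime.now(timezone.utc).isoformat()
    return json.dumps([{
        "labels": {"alertname": "SmokeTest", "severity": "critical", "marker": marker},
        "annotations": {"summary": f"deploy smoke test {marker}"},
        "startsAt": now,
    }]).encode()

def pagerduty_received(incidents: list[dict], marker: str) -> bool:
    """True if any PagerDuty incident title carries our smoke-test marker."""
    return any(marker in inc.get("title", "") for inc in incidents)

def run_smoke_test(marker: str, timeout_s: int = 120) -> bool:
    """Fire the alert, then poll PagerDuty until the marker shows up or time runs out."""
    req = urllib.request.Request(
        f"{ALERTMANAGER}/api/v2/alerts", data=synthetic_alert(marker),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        q = urllib.request.Request(
            "https://api.pagerduty.com/incidents?statuses[]=triggered",
            headers={"Authorization": f"Token token={PAGERDUTY_TOKEN}"})
        with urllib.request.urlopen(q) as resp:
            if pagerduty_received(json.load(resp).get("incidents", []), marker):
                return True
        time.sleep(10)
    return False  # alert never reached PagerDuty: block the deploy sign-off
```

The difference from the check Tomás ran is the last hop: the script's success condition is an incident visible in PagerDuty, not an alert visible in Alertmanager's own queue.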
### Contributing Factors
- amtool check-config does not validate routing tree receiver references: The standard validation tool for Alertmanager configurations checks YAML syntax and schema but does not cross-reference receiver names in routes against the defined receivers list. This is a known gap in the tooling. The validation step gave false confidence that the configuration was correct.
- Smoke test stopped at the Alertmanager UI, not the downstream receiver: Tomás verified that the test alert appeared in Alertmanager's UI — a necessary but not sufficient condition. The correct smoke test is to verify that an alert fired in Alertmanager results in a delivered PagerDuty incident. The additional 30-60 seconds this check would have taken was the difference between catching and missing this misconfiguration.
- Friday 5pm deploy for an alerting infrastructure change: The migration was executed at the start of a weekend, the highest-risk window for undetected alerting failures. A weekend alerting blackout is worse than a weekday one by approximately 3x (detection lag) and there are no engineers actively watching dashboards to notice that pages aren't arriving.
### What We Got Lucky About
- No real incident occurred during the 63-hour window. This is the central lucky fact. Based on historical incident frequency (2.3 SEV-1/2 per month), the probability of a qualifying incident during any 63-hour window is approximately 20%. The weekend happened to be quiet: a scheduled rolling restart on Saturday morning caused a cluster of health check flaps (the bulk of the weekend's 47 alerts), but this was planned and operators had verified stability before signing off Friday. All other alerts were transient noise.
- All 47 alerts during the blackout were self-resolving. None of the alerts that fired represented a developing incident — they were all spikes that resolved within minutes. Had any of them been the leading indicator of a real incident (e.g., disk pressure progressing to disk-full, or a GC pause spike preceding a memory leak), the blackout would have delayed detection significantly. The alerts that fired happened to be the low-signal, high-frequency noise that typically would not have paged anyone anyway — but that distinction was invisible during the blackout.
## Detection
### How We Detected
Cassandra Njoku noticed on Monday morning that the PagerDuty incident timeline was empty for the entire weekend — an absence that is anomalous because at least a few alert firings are typical during any 48-hour window. She then directly queried the Alertmanager API and identified active alerts with receiver: null. Inspection of the routing configuration revealed the dangling receiver reference.
### Why This Almost Wasn't Caught
There was no automated test that validated the full alerting pipeline end-to-end (Alertmanager → PagerDuty incident creation). The amtool check-config validation passed, giving the false impression that the configuration was correct. Tomás's smoke test stopped at the Alertmanager UI rather than checking PagerDuty. Without Cassandra's Monday morning anomaly detection (noticing the absence of incidents), the blackout might have persisted indefinitely — or until a real incident revealed it.
## Response
### What Went Well
- Once Cassandra identified the problem (09:07 UTC), diagnosis was fast: reading the Alertmanager config file and identifying the dangling receiver reference took 2 minutes. The fix (changing one field in the routing tree) took 3 minutes. The simplicity of the fix reflects the clarity of the root cause.
- Cassandra's Monday morning habit of reviewing the PagerDuty weekend summary is what caught this. This is an informal practice — not a documented process step — but it was the only detection mechanism that worked. It should become a formal check.
### What Could Have Gone Better
- The smoke test after deploying the config change should have included verifying that a test alert reached PagerDuty — not just that it appeared in the Alertmanager UI. This is a 60-second check that would have caught the misconfiguration before Tomás signed off.
- An alerting infrastructure change should not be deployed on a Friday afternoon without a verified end-to-end test. The timing amplified the blast radius: a Monday deploy with the same bug would have been caught within hours by engineers who notice their pages aren't arriving.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM030-01 | Write and deploy an Alertmanager configuration linter (pre-deploy hook) that validates all route.receiver references exist in the receivers block; fail the deployment if any reference is dangling | P0 | Ingrid Holm | In Progress | 2025-10-24 |
| PM030-02 | Add end-to-end alerting pipeline smoke test to the Alertmanager deploy playbook: fire a synthetic test alert and assert a PagerDuty incident is created within 2 minutes; block deploy sign-off until verified | P0 | Tomás Guerrero | In Progress | 2025-10-24 |
| PM030-03 | Implement a scheduled "dead man's switch" alert: a synthetic alert that fires every 15 minutes and must be received by PagerDuty; if PagerDuty does not see it within 20 minutes, a secondary channel (email to all SREs) triggers | P0 | Observability | Open | 2025-10-31 |
| PM030-04 | Add change freeze policy for alerting infrastructure: no alertmanager config changes after 15:00 UTC on Fridays, or any day before a holiday weekend; changes must be deployed and verified during business hours | P1 | Cassandra Njoku | Open | 2025-10-27 |
| PM030-05 | Document "Alertmanager smoke test checklist" in the runbook (OBS-RB-003); make end-to-end PagerDuty verification a mandatory checklist item for all receiver config changes | P1 | Tomás Guerrero | Open | 2025-10-27 |
| PM030-06 | Add Monday morning "weekend alert health check" to SRE team lead checklist: verify PagerDuty received at least N alerts over the weekend; investigate absence if below threshold | P2 | Cassandra Njoku | Open | 2025-10-31 |
## Lessons Learned
- Validation tools that pass do not mean the system is correct. amtool check-config validated the YAML structure but not the semantic correctness of the configuration — specifically, that routing rules reference receivers that actually exist. A green check from a linter or validator is only as good as what the tool actually checks. Understanding the scope of your validation tooling is as important as running it.
- Smoke tests must cover the full path, not just the first hop. Verifying that Alertmanager accepted an alert is not the same as verifying that the alert reached the engineer's phone. End-to-end validation — from alert firing to PagerDuty incident creation — is the only test that proves the alerting pipeline works. Any shorter test path can pass while the pipeline is broken downstream.
- Alert silence can be a symptom, not just an absence. The absence of pages during a 63-hour window was the detection signal. Building proactive checks for anomalous silence — either via dead man's switch alerts or periodic review of alert volume against historical baselines — is as important as building alerts for anomalous noise. The system was failing silently, and only a human noticing the silence caught it.
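The dead man's switch (PM030-03) and the weekend alert health check (PM030-06) share one core idea: treat "too quiet" as an alert condition in its own right. A minimal sketch of the volume-baseline variant; the threshold fraction and the sample counts are illustrative assumptions, not measured values:

```python
from statistics import median

def weekend_silence_verdict(weekend_count: int, historical_counts: list[int],
                            floor_fraction: float = 0.25) -> str:
    """Flag a weekend whose delivered-alert volume falls far below the norm.

    historical_counts: alerts delivered per weekend over recent months.
    floor_fraction: fraction of the historical median below which we investigate.
    """
    baseline = median(historical_counts)
    if weekend_count == 0:
        return "CRITICAL: zero alerts delivered - assume the pipeline is broken"
    if weekend_count < baseline * floor_fraction:
        return f"WARN: {weekend_count} alerts vs median {baseline} - investigate"
    return "OK"

# The PM-030 weekend: 47 alerts fired in Alertmanager, 0 reached PagerDuty.
print(weekend_silence_verdict(0, [41, 38, 52, 47, 44]))   # CRITICAL
print(weekend_silence_verdict(45, [41, 38, 52, 47, 44]))  # OK
```

Run on Monday morning against PagerDuty's delivered-incident count, this turns Cassandra's informal habit into the automated check the action items call for.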
## Cross-References
- Failure Pattern: Alerting Pipeline Silent Failure / Configuration Drift / Dangling Reference
- Topic Packs: Alertmanager Configuration, PagerDuty Integration, Observability Reliability, On-Call Engineering
- Runbook: OBS-RB-003 — Alertmanager Configuration Change Procedure; OBS-RB-007 — Alert Pipeline Debugging
- Decision Tree: Observability Triage → No pages received → Check Alertmanager UI receiver column → null? → Inspect routing tree for dangling receiver references