Postmortem: On-Call Handoff Gap Leaves Alerts Unacknowledged for 3 Hours¶
| Field | Value |
|---|---|
| ID | PM-023 |
| Date | 2025-07-11 |
| Severity | SEV-3 |
| Duration | 3h 0m (coverage gap, 18:00–21:00 UTC) |
| Time to Detect | 62h 47m (gap discovered Monday morning, not in real time) |
| Time to Mitigate | 0m (all alerts self-resolved; no active incident to mitigate) |
| Customer Impact | None — all 15 alerts during the gap were transient and self-resolved |
| Revenue Impact | None |
| Teams Involved | Infrastructure Operations, Engineering Leadership |
| Postmortem Author | Carol Osei-Mensah |
| Postmortem Date | 2025-07-14 |
Executive Summary¶
On Friday 2025-07-11 at 18:00 UTC, the on-call rotation handed off from Alice Marchetti (outgoing) to Bob Nguyen (incoming). Bob had switched personal devices that week and had not reinstalled the PagerDuty mobile app on his new phone. Alice ended her shift with an informal Slack sign-off ("you're on, Bob"), assuming PagerDuty's escalation would handle any gaps. Between 18:00 and 21:00 UTC, 15 alerts fired, covering disk pressure, a flapping health check, and elevated error rates across three services. None were acknowledged or investigated. The gap was discovered Monday morning at 08:47 UTC by engineering team lead Carol Osei-Mensah while reviewing the PagerDuty incident list. All 15 alerts had self-resolved. No customer impact occurred, but the team was completely blind for 3 hours on a Friday evening, and no one was aware of the gap until the following Monday, more than 60 hours later.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 2025-07-11 17:45 | Alice Marchetti sends Slack message to Bob Nguyen: "Heading off, you're on call — quiet week so far" |
| 2025-07-11 18:00 | PagerDuty schedule transitions to Bob Nguyen as primary on-call; no secondary on-call configured for this rotation |
| 2025-07-11 18:00 | Bob's PagerDuty mobile app is not installed on his new phone; desktop notifications are off (he is not at his desk) |
| 2025-07-11 18:14 | Alert 1 fires: node_disk_pressure on worker-07; PagerDuty pages Bob; Bob does not acknowledge |
| 2025-07-11 18:19 | PagerDuty escalation timer expires (5 minutes); escalates to high-urgency; no secondary on-call configured — escalation chain terminates |
| 2025-07-11 18:21 | Alert 1 self-resolves: log rotation ran, disk pressure clears |
| 2025-07-11 18:35 | Alert 2 fires: health check flapping on payments-service pod (3 failed probes then recovery) |
| 2025-07-11 18:36 | Alert 2 self-resolves before PagerDuty escalation timer expires |
| 2025-07-11 19:02 | Alerts 3–7 fire: elevated 5xx error rate on catalog-service (deployment rollout causing brief traffic imbalance); 5 separate alert conditions trigger in sequence |
| 2025-07-11 19:14 | Alerts 3–7 self-resolve: rollout completes, traffic balances, error rate drops |
| 2025-07-11 19:30 | Alerts 8–12 fire: node_memory_pressure warnings on 5 nodes (GC pressure from catalog-service rollout, clears naturally) |
| 2025-07-11 19:47 | Alerts 8–12 self-resolve: GC completes, memory pressure clears |
| 2025-07-11 20:15 | Alerts 13–15 fire: elevated latency p99 on search-service; root cause unknown but self-clears |
| 2025-07-11 20:38 | Alerts 13–15 self-resolve |
| 2025-07-11 21:00 | Alert activity ceases; all 15 incidents in PagerDuty are in "resolved" state; none were acknowledged |
| 2025-07-14 08:47 | Carol Osei-Mensah opens PagerDuty Monday morning; sees 15 resolved incidents from Friday evening, all with 0 acknowledgements |
| 2025-07-14 08:52 | Carol confirms with Bob that he did not receive any pages; confirms handoff gap |
| 2025-07-14 09:15 | Carol escalates to engineering director; incident declared SEV-3 retrospectively; postmortem initiated |
Impact¶
Customer Impact¶
None. All 15 alerts during the coverage gap were for transient conditions that self-resolved. No customer-facing service degradation occurred, and no external monitoring systems reported any availability loss during the window.
Internal Impact¶
- Operational blind spot for 3 hours: The team had zero effective on-call coverage from 18:00 to 21:00 UTC on a Friday. Any real incident during that window — a database failure, a security event, a data pipeline crash — would have gone undetected until at least Monday morning, or until a customer reported it.
- Delayed discovery: The gap was not discovered in real time but retrospectively, more than 60 hours later. As a result, the team had no opportunity to determine whether any of the 15 alerts was an early warning of a larger issue.
- Process trust erosion: The incident revealed that the team's on-call process depended entirely on a verbal handoff and an individual's device setup — factors that are invisible to the system and unenforceable at scale.
- Response overhead: Carol Osei-Mensah spent approximately 3 hours on retrospective investigation, incident write-up, and remediation planning.
Data Impact¶
None. No data was written, lost, or corrupted during the gap. All services recovered to normal state on their own.
Root Cause¶
What Happened (Technical)¶
Meridian Systems uses PagerDuty for on-call scheduling and alert routing. The on-call rotation for the Infrastructure Operations team was configured with a primary on-call and no secondary or tertiary escalation target. The escalation policy was set to: page the primary on-call; if unacknowledged for 5 minutes, escalate to high-urgency page on the same contact. There was no fallback to a secondary person.
Bob Nguyen switched personal devices the week of the incident. His new phone did not have the PagerDuty mobile app installed. He had received PagerDuty pages previously via email (secondary channel) but had not updated his PagerDuty notification rules after the device switch. The email notification was configured with a 10-minute delay (to suppress brief alert flapping), meaning email pages would arrive 10 minutes after the initial page — but Bob was not checking work email on a Friday evening.
The on-call handoff process was entirely informal: Alice told Bob in a Slack message that he was taking over. There was no checklist, no PagerDuty acknowledgement step, no verification that Bob's devices were configured to receive pages, and no test page sent to confirm end-to-end delivery.
PagerDuty does not have a built-in feature to alert when an on-call engineer has not acknowledged any pages over a rolling window (i.e., "on-call engineer is unreachable"). The system assumes that a configured on-call schedule represents a reachable engineer. When the escalation chain terminates without a secondary contact, unacknowledged pages simply resolve or remain open with no further action.
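Because the escalation chain terminated at a single contact, the structural fix is a second escalation rule. The sketch below builds such a two-tier policy payload for PagerDuty's REST API v2 (`POST /escalation_policies`); the policy name, user IDs, API token, and delay values are illustrative assumptions, not the team's actual configuration:

```python
import json
import urllib.request

PAGERDUTY_API = "https://api.pagerduty.com/escalation_policies"


def two_tier_policy(primary_id: str, secondary_id: str) -> dict:
    """Build a two-rule escalation policy: page the primary first,
    then a secondary contact if the primary does not acknowledge."""
    return {
        "escalation_policy": {
            "type": "escalation_policy",
            "name": "Infra Ops (primary + secondary)",  # illustrative name
            "escalation_rules": [
                {   # Rule 1: page the primary; escalate after 10 minutes
                    "escalation_delay_in_minutes": 10,
                    "targets": [{"id": primary_id, "type": "user_reference"}],
                },
                {   # Rule 2: page the secondary, who is independently reachable
                    "escalation_delay_in_minutes": 10,
                    "targets": [{"id": secondary_id, "type": "user_reference"}],
                },
            ],
        }
    }


def create_policy(api_token: str, payload: dict) -> None:
    """POST the policy to PagerDuty (network call; not exercised here)."""
    req = urllib.request.Request(
        PAGERDUTY_API,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses


if __name__ == "__main__":
    # Hypothetical user IDs; real IDs come from PagerDuty's /users endpoint.
    print(json.dumps(two_tier_policy("PRIMARY_USER_ID", "SECONDARY_USER_ID"), indent=2))
```

With this shape in place, an unacknowledged primary page flows to a second human instead of dead-ending, which is the substance of action item PM-023-01.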
Contributing Factors¶
- No on-call handoff checklist: The handoff process was entirely informal. There was no standardized checklist requiring the incoming engineer to confirm device setup, verify PagerDuty notification delivery, and send a test acknowledgement. The process relied entirely on individual memory and good intentions.
- No secondary or backup on-call configured: The PagerDuty escalation policy had no secondary contact. When Bob failed to acknowledge, the escalation chain had nowhere to go. A single point of failure in human availability is equivalent to a single point of failure in infrastructure.
- PagerDuty lacks "on-call engineer unreachable" alerting: PagerDuty does not natively alert managers or team leads when the primary on-call has not acknowledged any incident over a configurable time window. This is a product gap that must be compensated for with process (e.g., a scheduled check-in) or tooling (e.g., a scheduled PagerDuty API query that alerts if no-ack count exceeds a threshold).
- Device change not treated as an on-call readiness event: Bob's team was aware he had changed devices. There was no policy requiring an engineer to re-verify on-call readiness (app installation, notification test) after a device change. This is a lifecycle gap in on-call operations.
- Post-incident discovery gap (2 days): Even if real incidents had occurred, the gap would not have been discovered until Monday morning. There was no mechanism to alert a manager or team lead that the on-call engineer had been unreachable for more than N minutes.
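The watchdog compensation mentioned above (tracked as PM-023-03) can be sketched as a small scheduled poller: fetch incidents still in PagerDuty's `triggered` state (i.e., never acknowledged) and page the team lead when too many have been open for too long. The thresholds follow the action item; the SMS delivery step is omitted, and the sample incident shapes are illustrative:

```python
import json
import urllib.request
from datetime import datetime, timedelta

# Thresholds from action item PM-023-03 (assumed values).
STALE_AFTER = timedelta(minutes=10)
ALERT_THRESHOLD = 2


def stale_unacked(incidents: list, now: datetime) -> list:
    """Return incidents still 'triggered' (never acknowledged) that were
    created more than STALE_AFTER before `now`."""
    stale = []
    for inc in incidents:
        if inc["status"] != "triggered":
            continue
        # PagerDuty timestamps are ISO 8601 with a trailing 'Z'.
        created = datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00"))
        if now - created >= STALE_AFTER:
            stale.append(inc)
    return stale


def should_page_team_lead(incidents: list, now: datetime) -> bool:
    """True when unacknowledged-incident count crosses the threshold."""
    return len(stale_unacked(incidents, now)) >= ALERT_THRESHOLD


def fetch_open_incidents(api_token: str) -> list:
    """Query PagerDuty's REST API for triggered incidents (network call)."""
    req = urllib.request.Request(
        "https://api.pagerduty.com/incidents?statuses[]=triggered",
        headers={"Authorization": f"Token token={api_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["incidents"]
```

Run on a 15-minute schedule, this would have flagged the 2025-07-11 gap by roughly 18:30 UTC, after Alert 1 sat unacknowledged past the threshold, instead of on Monday morning.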
What We Got Lucky About¶
- All 15 alerts during the 3-hour gap were transient — disk pressure resolved by log rotation, health check flapping corrected itself, error rates normalized after a deployment completed. If a real SEV-1 had occurred (database master failure, data corruption, security breach), there would have been zero response for hours with no customer notification until Monday.
- The gap window was a Friday evening with relatively low traffic. The same gap during a Tuesday afternoon deployment window, when change velocity is highest, would have coincided with a much higher probability of a real incident requiring human intervention.
Detection¶
How We Detected¶
Carol Osei-Mensah discovered the gap during a routine Monday morning PagerDuty review, a habit she had developed after a previous on-call coverage issue (unrelated). She noticed the unusually high number of resolved incidents with zero acknowledgements and immediately investigated. Detection was entirely dependent on one individual's personal habit, not any systematic monitoring.
Why We Didn't Detect Sooner¶
There was no mechanism to detect an on-call coverage gap in real time. PagerDuty does not natively emit an alert when its escalation chain is exhausted without acknowledgement. No manager, team lead, or secondary contact was paged. No scheduled job monitored PagerDuty's on-call status or unacknowledged incident counts. The gap was invisible to the organization until someone manually looked at the incident history.
Response¶
What Went Well¶
- Once Carol discovered the gap, the incident was escalated and documented promptly and without blame. The retrospective was initiated within 30 minutes of discovery.
- Alice and Bob were both transparent about the handoff details, enabling quick root cause identification without finger-pointing.
- As an immediate interim fix, Carol volunteered to serve as backup on-call for the following weekend while longer-term fixes were implemented, preventing a repeat the next week.
What Went Poorly¶
- The gap was discovered 60+ hours after it began. A real incident would have gone entirely unaddressed for the entire weekend.
- There was no secondary on-call. This is a fundamental process failure that made any single human failure (device loss, illness, emergencies) a complete coverage gap.
- The verbal handoff process had no verification step. "You're on, Bob" is not equivalent to "Bob has confirmed his devices are configured and he received a test page."
- The team lead had no visibility into on-call readiness. There was no dashboard, no check-in requirement, and no automated health signal for the on-call coverage status.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-023-01 | Configure a secondary on-call contact in PagerDuty for all rotations; secondary is paged after 10 minutes if primary does not acknowledge | P1 | Carol Osei-Mensah | In Progress | 2025-07-16 |
| PM-023-02 | Create an on-call handoff checklist (Notion template): incoming engineer must confirm app installed, test page sent and received, and check Slack acknowledgement before outgoing engineer signs off | P1 | Carol Osei-Mensah | In Progress | 2025-07-16 |
| PM-023-03 | Build a PagerDuty API monitor (scheduled Lambda, 15-min intervals during on-call hours) that alerts Carol via SMS if the on-call engineer has 2+ unacknowledged incidents older than 10 minutes | P1 | Bob Nguyen | Open | 2025-07-23 |
| PM-023-04 | Add "on-call readiness re-verification" to device change checklist in IT onboarding and off-boarding procedures | P2 | IT Operations | Open | 2025-07-30 |
| PM-023-05 | Schedule a monthly on-call process review: verify escalation policies, secondary contacts, and notification rules for all on-call engineers are current | P2 | Carol Osei-Mensah | Open | 2025-08-01 |
| PM-023-06 | Investigate PagerDuty stakeholder notification features or third-party integrations that provide real-time on-call coverage health dashboard | P3 | Infrastructure Ops | Open | 2025-08-15 |
Lessons Learned¶
- A verbal handoff is not a handoff. An on-call transition is complete only when the incoming engineer has demonstrably received a test page on their actual device configuration. A Slack message or verbal agreement is a social acknowledgement, not an operational one — it cannot verify device state.
- A single-person on-call with no backup is a single point of failure. Every human failure mode (illness, device failure, family emergency, honest mistake) results in zero coverage. Escalation policies must route to a secondary contact who is independently reachable.
- Unreachable on-call is invisible without explicit monitoring. PagerDuty does not know the on-call engineer is unreachable — it only knows pages went unacknowledged. Organizations must explicitly instrument this gap: either with a watchdog process that monitors acknowledgement health, or with a mandatory check-in protocol that fails loudly when missed.
Cross-References¶
- Failure Pattern: Process gap — operational dependency on individual discipline and informal communication with no system-level enforcement or visibility
- Topic Packs: On-call operations, incident management, PagerDuty configuration, escalation policy design
- Runbook: runbooks/ops/oncall-handoff-checklist.md
- Decision Tree: Alert fires → primary on-call does not acknowledge within 5m → escalate to secondary → if no secondary: alert team lead via SMS within 10m