
Postmortem: On-Call Handoff Gap Leaves Alerts Unacknowledged for 3 Hours

ID: PM-023
Date: 2025-07-11
Severity: SEV-3
Duration: 3h 0m (length of the coverage gap, 18:00–21:00 UTC)
Time to Detect: ~63h (gap began Friday 18:00 UTC; discovered Monday 08:47 UTC)
Time to Mitigate: 0m (all alerts self-resolved; no active incident to mitigate)
Customer Impact: None (all 15 alerts during the gap were transient and self-resolved)
Revenue Impact: None
Teams Involved: Infrastructure Operations, Engineering Leadership
Postmortem Author: Carol Osei-Mensah
Postmortem Date: 2025-07-14

Executive Summary

On Friday 2025-07-11 at 18:00 UTC, the on-call rotation handed off from Alice Marchetti (outgoing) to Bob Nguyen (incoming). Bob had switched personal devices that week and had not reinstalled the PagerDuty mobile app on his new phone. Alice ended her shift with an informal Slack message ("you're on, Bob"), assuming PagerDuty's escalation would handle any gaps. Between 18:00 and 21:00 UTC, 15 alerts fired, covering disk pressure, a flapping health check, and elevated error rates across three services. None were acknowledged or investigated. The gap was discovered Monday morning at 08:47 UTC by engineering team lead Carol Osei-Mensah while reviewing the PagerDuty incident list. All 15 alerts had self-resolved. No customer impact occurred, but the team was completely blind for 3 hours on a Friday evening, and no one was aware of the gap until the following Monday, roughly 63 hours later.

Timeline (All times UTC)

Time Event
2025-07-11 17:45 Alice Marchetti sends Slack message to Bob Nguyen: "Heading off, you're on call — quiet week so far"
2025-07-11 18:00 PagerDuty schedule transitions to Bob Nguyen as primary on-call; no secondary on-call configured for this rotation
2025-07-11 18:00 Bob's PagerDuty mobile app is not installed on his new phone; desktop notifications are off (he is not at his desk)
2025-07-11 18:14 Alert 1 fires: node_disk_pressure on worker-07; PagerDuty pages Bob; Bob does not acknowledge
2025-07-11 18:19 PagerDuty escalation timer expires (5 minutes); escalates to high-urgency; no secondary on-call configured — escalation chain terminates
2025-07-11 18:21 Alert 1 self-resolves: log rotation ran, disk pressure clears
2025-07-11 18:35 Alert 2 fires: health check flapping on payments-service pod (3 failed probes then recovery)
2025-07-11 18:36 Alert 2 self-resolves before PagerDuty escalation timer expires
2025-07-11 19:02 Alerts 3–7 fire: elevated 5xx error rate on catalog-service (deployment rollout causing brief traffic imbalance); 5 separate alert conditions trigger in sequence
2025-07-11 19:14 Alerts 3–7 self-resolve: rollout completes, traffic balances, error rate drops
2025-07-11 19:30 Alerts 8–12 fire: node_memory_pressure warnings on 5 nodes (GC pressure from catalog-service rollout, clears naturally)
2025-07-11 19:47 Alerts 8–12 self-resolve: GC completes, memory pressure clears
2025-07-11 20:15 Alerts 13–15 fire: elevated latency p99 on search-service; root cause unknown but self-clears
2025-07-11 20:38 Alerts 13–15 self-resolve
2025-07-11 21:00 Alert activity ceases; all 15 incidents in PagerDuty are in "resolved" state; none were acknowledged
2025-07-14 08:47 Carol Osei-Mensah opens PagerDuty Monday morning; sees 15 resolved incidents from Friday evening, all with 0 acknowledgements
2025-07-14 08:52 Carol confirms with Bob that he did not receive any pages; confirms handoff gap
2025-07-14 09:15 Carol escalates to engineering director; incident declared SEV-3 retrospectively; postmortem initiated

Impact

Customer Impact

None. All 15 alerts during the coverage gap were for transient conditions that self-resolved. No customer-facing service degradation occurred, and no external monitoring systems reported any availability loss during the window.

Internal Impact

  • Operational blind spot for 3 hours: The team had zero effective on-call coverage from 18:00 to 21:00 UTC on a Friday. Any real incident during that window — a database failure, a security event, a data pipeline crash — would have gone undetected until at least Monday morning, or until a customer reported it.
  • Delayed discovery: The gap was discovered not in real time but retrospectively, roughly 63 hours later on the following Monday. The team therefore had no opportunity to assess whether any of the 15 alerts was an early warning of a larger issue.
  • Process trust erosion: The incident revealed that the team's on-call process depended entirely on a verbal handoff and an individual's device setup — factors that are invisible to the system and unenforceable at scale.
  • Carol Osei-Mensah spent approximately 3 hours on retrospective investigation, incident write-up, and remediation planning.

Data Impact

None. No data was written, lost, or corrupted during the gap. All services recovered to normal state on their own.

Root Cause

What Happened (Technical)

Meridian Systems uses PagerDuty for on-call scheduling and alert routing. The on-call rotation for the Infrastructure Operations team was configured with a primary on-call and no secondary or tertiary escalation target. The escalation policy was set to: page the primary on-call; if unacknowledged for 5 minutes, escalate to high-urgency page on the same contact. There was no fallback to a secondary person.
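The remediation (action item PM-023-01) amounts to adding a second escalation rule. As a hedged sketch, a policy with a secondary target can be created via PagerDuty's REST API v2 (`POST /escalation_policies`); the user IDs below are placeholders, and the exact field set should be checked against the current API reference:

```python
import json

def escalation_policy_payload(name, primary_user_id, secondary_user_id):
    """Build a PagerDuty REST API v2 payload for POST /escalation_policies.

    Rule 1 pages the primary on-call; if the page is unacknowledged for
    10 minutes, the policy moves to rule 2, which pages the secondary
    (matching action item PM-023-01). User IDs are placeholders.
    """
    return {
        "escalation_policy": {
            "type": "escalation_policy",
            "name": name,
            "num_loops": 2,  # repeat the whole chain once more before giving up
            "escalation_rules": [
                {
                    "escalation_delay_in_minutes": 10,
                    "targets": [{"id": primary_user_id, "type": "user_reference"}],
                },
                {
                    "escalation_delay_in_minutes": 10,
                    "targets": [{"id": secondary_user_id, "type": "user_reference"}],
                },
            ],
        }
    }

payload = escalation_policy_payload("Infra Ops On-Call", "PPRIMARY", "PSECONDARY")
print(json.dumps(payload, indent=2))
```

The key structural point is that the escalation chain no longer terminates at a single person: a second `escalation_rules` entry gives unacknowledged pages somewhere to go.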

Bob Nguyen switched personal devices the week of the incident. His new phone did not have the PagerDuty mobile app installed. He had received PagerDuty pages previously via email (secondary channel) but had not updated his PagerDuty notification rules after the device switch. The email notification was configured with a 10-minute delay (to suppress brief alert flapping), meaning email pages would arrive 10 minutes after the initial page — but Bob was not checking work email on a Friday evening.

The on-call handoff process was entirely informal: Alice told Bob in Slack that he was taking over. There was no checklist, no PagerDuty acknowledgement step, no verification that Bob's devices were configured to receive pages, and no test page sent to confirm end-to-end delivery.
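The missing test-page step could be scripted against PagerDuty's Events API v2 (`POST https://events.pagerduty.com/v2/enqueue`). A minimal sketch, assuming a service integration exists; the routing key is a placeholder:

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"  # PagerDuty Events API v2

def handoff_test_event(routing_key, incoming_engineer):
    """Build a low-severity test event to verify end-to-end page delivery.

    The routing key identifies the service integration; the value passed
    in is a placeholder and must come from the actual integration.
    """
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"On-call handoff test page for {incoming_engineer}",
            "source": "oncall-handoff-checklist",
            "severity": "info",
        },
    }

def send_test_page(event):
    """POST the event; handoff completes only when the incoming engineer acks it."""
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a real routing key):
# send_test_page(handoff_test_event("YOUR_ROUTING_KEY", "Bob Nguyen"))
```

The handoff is complete only when the incoming engineer acknowledges this test incident from the device they will actually carry, which is exactly the verification the Slack-only handoff skipped.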

PagerDuty does not have a built-in feature to alert when an on-call engineer has not acknowledged any pages over a rolling window (i.e., "on-call engineer is unreachable"). The system assumes that a configured on-call schedule represents a reachable engineer. When the escalation chain terminates without a secondary contact, unacknowledged pages simply resolve or remain open with no further action.

Contributing Factors

  1. No on-call handoff checklist: The handoff process was entirely informal. There was no standardized checklist requiring the incoming engineer to confirm device setup, verify PagerDuty notification delivery, and send a test acknowledgement. The process relied entirely on individual memory and good intentions.
  2. No secondary or backup on-call configured: The PagerDuty escalation policy had no secondary contact. When Bob failed to acknowledge, the escalation chain had nowhere to go. A single point of failure in human availability is equivalent to a single point of failure in infrastructure.
  3. PagerDuty lacks "on-call engineer unreachable" alerting: PagerDuty does not natively alert managers or team leads when the primary on-call has not acknowledged any incident over a configurable time window. This is a product gap that must be compensated for with process (e.g., a scheduled check-in) or tooling (e.g., a scheduled PagerDuty API query that alerts if no-ack count exceeds a threshold).
  4. Device change not treated as an on-call readiness event: Bob's team was aware he had changed devices. There was no policy requiring an engineer to re-verify on-call readiness (app installation, notification test) after a device change. This is a lifecycle gap in on-call operations.
  5. Post-incident discovery gap (2 days): Even if real incidents had occurred, the gap would not have been discovered until Monday morning. There was no mechanism to alert a manager or team lead that the on-call engineer had been unreachable for more than N minutes.

What We Got Lucky About

  1. All 15 alerts during the 3-hour gap were transient — disk pressure resolved by log rotation, health check flapping corrected itself, error rates normalized after a deployment completed. If a real SEV-1 had occurred (database master failure, data corruption, security breach), there would have been zero response for hours with no customer notification until Monday.
  2. The gap window was a Friday evening with relatively low traffic. The same gap during a Tuesday afternoon deployment window, when change velocity is highest, would have coincided with a much higher probability of a real incident requiring human intervention.

Detection

How We Detected

Carol Osei-Mensah discovered the gap during a routine Monday morning PagerDuty review, a habit she had developed after a previous on-call coverage issue (unrelated). She noticed the unusually high number of resolved incidents with zero acknowledgements and immediately investigated. Detection was entirely dependent on one individual's personal habit, not any systematic monitoring.

Why We Didn't Detect Sooner

There was no mechanism to detect an on-call coverage gap in real time. PagerDuty does not natively emit an alert when its escalation chain is exhausted without acknowledgement. No manager, team lead, or secondary contact was paged. No scheduled job monitored PagerDuty's on-call status or unacknowledged incident counts. The gap was invisible to the organization until someone manually looked at the incident history.

Response

What Went Well

  1. Once Carol discovered the gap, the incident was escalated and documented promptly and without blame. The retrospective was initiated within 30 minutes of discovery.
  2. Alice and Bob were both transparent about the handoff details, enabling quick root cause identification without finger-pointing.
  3. As an immediate interim fix, Carol volunteered to serve as backup on-call for the following weekend while longer-term fixes were implemented, preventing a repeat the following week.

What Went Poorly

  1. The gap was discovered 60+ hours after it began. A real incident would have gone entirely unaddressed for the entire weekend.
  2. There was no secondary on-call. This is a fundamental process failure that made any single human failure (device loss, illness, emergencies) a complete coverage gap.
  3. The verbal handoff process had no verification step. "You're on, Bob" is not equivalent to "Bob has confirmed his devices are configured and he received a test page."
  4. The team lead had no visibility into on-call readiness. There was no dashboard, no check-in requirement, and no automated health signal for the on-call coverage status.

Action Items

PM-023-01 (P1, owner Carol Osei-Mensah, In Progress, due 2025-07-16): Configure a secondary on-call contact in PagerDuty for all rotations; the secondary is paged after 10 minutes if the primary does not acknowledge.
PM-023-02 (P1, owner Carol Osei-Mensah, In Progress, due 2025-07-16): Create an on-call handoff checklist (Notion template): the incoming engineer must confirm the app is installed, a test page was sent and received, and a Slack acknowledgement posted before the outgoing engineer signs off.
PM-023-03 (P1, owner Bob Nguyen, Open, due 2025-07-23): Build a PagerDuty API monitor (scheduled Lambda, 15-minute intervals during on-call hours) that alerts Carol via SMS if the on-call engineer has 2+ unacknowledged incidents older than 10 minutes.
PM-023-04 (P2, owner IT Operations, Open, due 2025-07-30): Add "on-call readiness re-verification" to the device change checklist in IT onboarding and off-boarding procedures.
PM-023-05 (P2, owner Carol Osei-Mensah, Open, due 2025-08-01): Schedule a monthly on-call process review: verify that escalation policies, secondary contacts, and notification rules for all on-call engineers are current.
PM-023-06 (P3, owner Infrastructure Ops, Open, due 2025-08-15): Investigate PagerDuty stakeholder notification features or third-party integrations that provide a real-time on-call coverage health dashboard.
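The core of the PM-023-03 watchdog can be sketched against PagerDuty's REST API v2 (`GET /incidents`). This is a hypothetical sketch, not the final Lambda: the API token is a placeholder, and the incident fields used (`status`, `created_at`, `acknowledgements`) should be confirmed against the current API reference:

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

API_URL = "https://api.pagerduty.com/incidents"  # PagerDuty REST API v2

def stale_unacked(incidents, now, max_age=timedelta(minutes=10)):
    """Return triggered incidents with no acknowledgements older than max_age.

    `incidents` is the list from the API response; `created_at` is an
    ISO-8601 UTC timestamp such as "2025-07-11T18:14:00Z".
    """
    stale = []
    for inc in incidents:
        created = datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00"))
        if (
            inc["status"] == "triggered"
            and not inc.get("acknowledgements")
            and now - created > max_age
        ):
            stale.append(inc)
    return stale

def should_alert(incidents, now, threshold=2):
    """PM-023-03 rule: alert if 2+ unacknowledged incidents are older than 10 minutes."""
    return len(stale_unacked(incidents, now)) >= threshold

def fetch_triggered(api_token):
    """Query PagerDuty for currently triggered incidents (token is a placeholder)."""
    req = urllib.request.Request(
        API_URL + "?statuses%5B%5D=triggered",
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["incidents"]

# Scheduled entry point (e.g. a 15-minute Lambda): fetch, evaluate, page Carol.
# if should_alert(fetch_triggered("YOUR_API_TOKEN"), datetime.now(timezone.utc)):
#     ...send SMS to the team lead via the paging channel of choice...
```

Keeping the decision rule (`should_alert`) separate from the API call makes the threshold logic testable without network access, and lets the schedule, threshold, and age window be tuned independently.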

Lessons Learned

  1. A verbal handoff is not a handoff. An on-call transition is complete only when the incoming engineer has demonstrably received a test page on their actual device configuration. A Slack message or verbal agreement is a social acknowledgement, not an operational one — it cannot verify device state.
  2. A single-person on-call with no backup is a single point of failure. Every human failure mode (illness, device failure, family emergency, honest mistake) results in zero coverage. Escalation policies must route to a secondary contact who is independently reachable.
  3. Unreachable on-call is invisible without explicit monitoring. PagerDuty does not know the on-call engineer is unreachable — it only knows pages went unacknowledged. Organizations must explicitly instrument this gap: either with a watchdog process that monitors acknowledgement health, or with a mandatory check-in protocol that fails loudly when missed.

Cross-References

  • Failure Pattern: Process gap — operational dependency on individual discipline and informal communication with no system-level enforcement or visibility
  • Topic Packs: On-call operations, incident management, PagerDuty configuration, escalation policy design
  • Runbook: runbooks/ops/oncall-handoff-checklist.md
  • Decision Tree: Alert fires → primary on-call does not acknowledge within 5m → escalate to secondary → if no secondary: alert team lead via SMS within 10m