# Postmortem: UPS Battery Degradation Causes Rack Power Loss During Utility Blip
| Field | Value |
|---|---|
| ID | PM-025 |
| Date | 2025-11-19 |
| Severity | SEV-3 |
| Duration | 0h 35m (detection to full recovery) |
| Time to Detect | 2m |
| Time to Mitigate | 8m (Kubernetes rescheduled pods); 33m (full data service recovery) |
| Customer Impact | None — Kubernetes rescheduled all affected pods before external latency thresholds were breached |
| Revenue Impact | None |
| Teams Involved | Datacenter Operations, Infrastructure Engineering, Data Services |
| Postmortem Author | Renata Kowalczyk |
| Postmortem Date | 2025-11-22 |
## Executive Summary
On 2025-11-19 at 03:47 UTC, a 5-second utility power dip (a routine event that occurs 2-3 times per year at the Meridian Systems primary datacenter) caused a hard power loss to racks 14 and 15. The UPS units in those racks, whose batteries had degraded to approximately 40% of rated capacity, were unable to bridge a 5-second gap that their healthier peers (at 75–95% capacity) handled without incident. Approximately 30 servers went down hard without graceful shutdown. Kubernetes detected the node failures and rescheduled all affected pods within 8 minutes. Elasticsearch and Kafka, which had nodes on the affected racks, required 25 additional minutes to recover due to shard rebalancing and replica log replay. No customer-facing services were disrupted and no data was lost. The incident exposed a critical gap in UPS battery maintenance: the batteries were 4.5 years old (the vendor recommends replacement at 3 years), and the alert threshold for battery health was set at 20% capacity — far too low to trigger timely replacement.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 2025-11-19 03:47:12 | 5-second utility power dip begins at facility grid feed; UPS units across all 12 racks switch to battery mode |
| 2025-11-19 03:47:12 | All other racks: UPS batteries at 75–95% capacity; bridging 5 seconds uses approximately 8% of rated capacity; no interruption |
| 2025-11-19 03:47:14 | Racks 14 and 15: UPS batteries at ~40% capacity; 5-second draw exceeds available energy; UPS units reach low-battery cutoff and power down output |
| 2025-11-19 03:47:14 | ~30 servers in racks 14 and 15 lose power; no graceful shutdown; IPMI/BMC connections drop |
| 2025-11-19 03:47:17 | Utility power restored at facility feed; UPS units in racks 14–15 begin recharging |
| 2025-11-19 03:47:22 | Servers in racks 14–15 begin power-on sequence (automatic power restore configured in BIOS) |
| 2025-11-19 03:47:30 | Monitoring system detects node loss; Prometheus `up` metric drops to 0 for 28 node targets |
| 2025-11-19 03:48:00 | PagerDuty fires: NodeNotReady for 28 Kubernetes nodes; on-call engineer Leandro Ferreira acknowledges |
| 2025-11-19 03:49 | Leandro contacts datacenter operations; confirms power event on racks 14–15 from DCIM dashboard |
| 2025-11-19 03:50 | Kubernetes node controller marks affected nodes NotReady; pod eviction begins |
| 2025-11-19 03:52 | Majority of rescheduled pods are running on surviving nodes; workload traffic normalizes |
| 2025-11-19 03:55 | Servers in racks 14–15 fully booted and rejoining the cluster (kubelet restarting) |
| 2025-11-19 03:55 | Elasticsearch shard rebalancing begins: 47 primary shards on affected nodes were abruptly terminated; replica promotion and rebalancing underway |
| 2025-11-19 03:58 | Kafka log recovery begins on 3 affected brokers: log truncation to last committed offset; partition leadership re-election in progress |
| 2025-11-19 04:10 | Kafka partition leadership re-election complete; all topics have active leaders; consumer lag begins recovering |
| 2025-11-19 04:12 | Affected nodes fully rejoined Kubernetes cluster; all node statuses Ready |
| 2025-11-19 04:15 | Elasticsearch cluster health returns to green; all shards allocated and replicated |
| 2025-11-19 04:22 | Kafka consumer lag fully recovered; all consumer groups at head |
| 2025-11-19 04:22 | All systems nominal; Leandro closes incident |
| 2025-11-19 08:30 | Datacenter Ops team (Renata Kowalczyk) conducts UPS health audit across all 12 racks |
## Impact
### Customer Impact
None. All customer-facing services recovered within 8 minutes of the power event as Kubernetes rescheduled pods from failed nodes to healthy nodes. The recovery happened before any external synthetic monitoring thresholds (which require 3+ minutes of sustained failure) were breached. External API availability was 100% for the incident window as measured by external monitors.
### Internal Impact
- Elasticsearch: 47 primary shards on the affected nodes required replica promotion and re-sync. The cluster was in `yellow` health state (primaries active but some replicas still recovering) for approximately 20 minutes, including roughly 8 minutes of active shard rebalancing. Internal search and analytics queries were served from replicas throughout.
- Kafka: 3 of 9 brokers went down hard without flushing in-flight writes. Log recovery truncated each affected partition to the last committed offset. This introduced up to 300ms of message rewind for in-flight producers, which Kafka's at-least-once delivery guarantee handles via producer retries (a minimal producer sketch follows this list). No messages were permanently lost.
- Stateless services: All stateless pods rescheduled successfully within 8 minutes with no data exposure.
- Engineering time: Leandro Ferreira (on-call) spent 90 minutes on incident response and monitoring. Renata Kowalczyk spent 4 hours on postmortem, UPS audit, and remediation planning. Datacenter Ops spent 6 hours on the UPS battery audit across all 12 racks.
- 30 servers required fsck on reboot due to unclean shutdown. Automated fsck added approximately 45-90 seconds to each server's boot time; all servers rejoined the cluster without manual intervention.
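The rewind above is harmless only because producers run with acknowledgement and retry settings that give at-least-once delivery. A minimal sketch of such a producer, assuming the kafka-python client (the broker address and topic name are illustrative, not our production values):

```python
# Minimal at-least-once producer sketch (kafka-python client assumed).
# The broker address and topic name are illustrative, not production values.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],  # hypothetical broker address
    acks="all",   # require all in-sync replicas to ack before success
    retries=5,    # re-send on transient errors such as leader re-election
)

# If a broker truncates uncommitted records after a hard crash, the producer
# never received an ack for them, so the retry path re-sends: at-least-once.
future = producer.send("events", b"payload")  # "events" is a hypothetical topic
future.get(timeout=30)  # block until acked, or raise on failure
producer.flush()
```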
### Data Impact
No data was permanently lost. Kafka's at-least-once delivery and Elasticsearch's replica-based redundancy absorbed the hard shutdowns. The Kafka truncation to the last committed offset resulted in producer retries for at most 300ms of messages, all of which were successfully re-delivered.
## Root Cause
### What Happened (Technical)
Uninterruptible power supply (UPS) units are the last line of defense between utility power interruptions and server hardware. Their battery banks are rated for a fixed capacity at installation and degrade over time through charge/discharge cycling and electrochemical aging. The manufacturer of the UPS units in racks 14 and 15 (APC Symmetra LX) specifies battery replacement at 3 years or when capacity drops below 80% of rated capacity — whichever comes first.
The UPS units in racks 14 and 15 were installed in May 2021, making them 4.5 years old at the time of the incident. No battery replacement had been performed. The DCIM (Data Center Infrastructure Management) system monitored UPS battery health and was configured to alert at 20% remaining capacity — a threshold so low that the batteries were functionally useless by the time an alert would fire. The actual capacity at time of incident was approximately 40%, which is above the 20% alert threshold and therefore generated no alert. However, 40% capacity is insufficient to bridge even a 5-second power gap at full rack load.
The 5-second utility power dip is a known, recurring event at the Meridian Systems datacenter, occurring 2-3 times per year. All UPS units are expected to bridge this gap. For racks 14 and 15, the energy available during a 5-second bridge at full rack draw (approximately 12kW per rack) was exhausted before utility power was restored. The UPS low-battery protection circuit cut output power to avoid damaging the batteries further, causing an abrupt hard shutdown of all servers in the racks.
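For a rough sense of the arithmetic, the sketch below compares the energy a 5-second bridge demands at full rack draw with the energy nominally usable above a low-battery cutoff. Only the 12 kW draw and the 5-second gap come from the incident; the rated capacity and cutoff fraction are illustrative assumptions:

```python
# Back-of-envelope bridge check. Only the 12 kW rack draw and the 5 s gap are
# from the incident; every battery figure below is an illustrative assumption.
RACK_DRAW_W = 12_000        # full rack load in watts
GAP_S = 5                   # utility dip duration in seconds

RATED_WH = 1_200            # assumed rated bank capacity (hypothetical)
CAPACITY_FRACTION = 0.40    # degraded capacity at incident time
CUTOFF_FRACTION = 0.35      # assumed low-battery cutoff floor (hypothetical)

bridge_wh = RACK_DRAW_W * GAP_S / 3600                                # ~16.7 Wh
usable_wh = RATED_WH * max(0.0, CAPACITY_FRACTION - CUTOFF_FRACTION)  # ~60 Wh

print(f"bridge needs {bridge_wh:.1f} Wh; nominally usable {usable_wh:.1f} Wh")
# On paper the bank still covers the bridge, which is exactly why capacity
# percentage alone misleads: aged cells sag under a 12 kW discharge, so the
# UPS can reach its cutoff voltage well before delivering the nominal
# watt-hours. The cutoff, not the watt-hour total, is what ends the bridge.
```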
Servers with IPMI "automatic power restore on AC return" set to Last State (which returns each server to its pre-outage power state once AC is available) began booting immediately after the utility blip resolved — 5 seconds after the outage began. The brief outage was nonetheless long enough to leave in-flight writes unflushed on the affected Kafka brokers and Elasticsearch nodes, requiring log truncation and replica recovery on reboot.
### Contributing Factors
- UPS battery alert threshold set at 20% — far too low for operational usefulness: The DCIM alert threshold was configured when the batteries were new, without consideration for the degradation profile over time. A battery at 20% capacity cannot bridge even a 1-second power event at full load. The alert threshold should have been set at 75-80% (the vendor's replacement trigger) so that batteries are replaced before they become operationally ineffective, not after.
- Battery replacement schedule not enforced: The vendor recommends replacement at 3 years. The batteries were 4.5 years old. No maintenance contract renewal had been tracked, no calendar reminder or asset management alert existed for the replacement schedule, and responsibility for UPS maintenance had drifted between datacenter ops and a facilities team without clear ownership.
- No owner for UPS maintenance contract: The original UPS maintenance contract with APC had expired 18 months earlier. The contract included annual battery capacity testing and proactive replacement recommendations. Without the contract, there were no scheduled capacity tests and no vendor-initiated replacement notices.
- Power dip risk not incorporated into infrastructure risk model: The team's risk model documented server failures and network partitions as failure modes but did not explicitly model a utility power dip against the degradation profile of aging UPS batteries. The risk of a common, recurring event (utility blip 2-3x/year) was treated as fully mitigated by UPS presence, without accounting for battery age.
### What We Got Lucky About
- Only 2 of 12 racks were affected — the 2 with the oldest UPS units (4.5 years old). The next-oldest units (racks 11–13, approximately 3.2 years old) still had sufficient capacity to bridge the 5-second gap. A utility dip of 15 seconds rather than 5 seconds would have exhausted those batteries as well, potentially taking three additional racks offline.
- The Elasticsearch and Kafka clusters were designed with sufficient replica headroom that losing 2 of 9 nodes in each cluster did not cause quorum loss or data unavailability. If the affected racks had hosted a higher proportion of data service nodes (e.g., 4 of 9), the recovery path would have been significantly more complicated.
- The power event occurred at 03:47 UTC — early morning, with very low traffic. If the event had occurred during peak traffic hours, the 8-minute pod rescheduling window would have coincided with high query volume, potentially causing visible latency increases even if no availability threshold was breached.
## Detection
### How We Detected
Prometheus detected 28 nodes going offline simultaneously and fired a NodeNotReady alert within 30 seconds of the power event. The DCIM system also logged the UPS output power failure event with a timestamp of 03:47:14. Detection was fast because node-level monitoring was comprehensive and the alert threshold for NodeNotReady was set to fire immediately on any node entering the NotReady state (no minimum duration, given that simultaneous node loss at this scale is always actionable).
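As an illustration of the detection logic (not our actual alert definition), a mass-node-loss check along these lines can be run against the Prometheus HTTP query API; the server URL and the `job="node"` matcher are assumptions:

```python
# Sketch: count node targets currently down via the Prometheus HTTP API.
# The Prometheus URL and the `job="node"` label matcher are hypothetical.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = 'count(up{job="node"} == 0)'  # node targets whose scrape is failing

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

down = int(float(result[0]["value"][1])) if result else 0
if down > 0:
    # Simultaneous loss at any scale is actionable here, hence the
    # zero-duration NodeNotReady alert described above.
    print(f"ALERT: {down} node targets down")
```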
### Why We Didn't Detect Sooner
Detection was fast (2 minutes from event to alert acknowledgement). The monitoring gap was not in detecting the incident, but in detecting the pre-condition (degraded UPS batteries) before the incident occurred. The DCIM alert threshold of 20% capacity provided no warning that batteries were below the vendor's 80% replacement threshold. A properly set threshold (alert at <75% capacity) would have fired months or years before the incident and triggered replacement before any operational impact.
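To put rough numbers on that lead time, the sketch below computes when alerts at 85%, 75%, and 20% capacity would fire under an assumed linear degradation rate; the 13%/year figure is hypothetical, chosen only because it roughly matches these batteries reaching ~40% capacity at 4.5 years:

```python
# Lead-time comparison for capacity alert thresholds, assuming linear
# degradation. The 13%/year rate is hypothetical, picked only because it
# roughly matches these batteries reaching ~40% capacity at 4.5 years.
DEGRADATION_PER_YEAR = 0.13  # assumed fractional capacity loss per year

def years_until(threshold: float, start: float = 1.0) -> float:
    """Years from installation until capacity falls to `threshold`."""
    return (start - threshold) / DEGRADATION_PER_YEAR

for threshold in (0.85, 0.75, 0.20):
    print(f"alert at {threshold:.0%} fires ~{years_until(threshold):.1f}y after install")
# ~1.2y for 85% and ~1.9y for 75% (comfortably before the 3-year replacement
# point), versus ~6.2y for 20%, long after the batteries had already failed
# operationally.
```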
## Response
### What Went Well
- Kubernetes node failure detection and pod rescheduling worked as designed. The 8-minute recovery for stateless workloads was within the expected bounds for the cluster configuration and represented effective infrastructure resilience.
- Leandro Ferreira contacted datacenter operations immediately upon acknowledgement and received DCIM confirmation of the power event within 2 minutes, which eliminated all other possible root causes rapidly.
- Elasticsearch and Kafka were both configured with sufficient replication (3x for Elasticsearch, replication factor 3 for Kafka) to absorb the loss of affected nodes without data loss or availability interruption (see the replica-settings sketch after this list).
- The server BIOS setting (`automatic power restore: Last State`) caused servers to reboot immediately when utility power was restored, minimizing the time nodes spent offline and speeding Kubernetes recovery.
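For reference, 3x redundancy in Elasticsearch means two replicas per primary shard. A minimal sketch of applying and verifying that setting over the REST API, assuming a hypothetical index name and cluster address:

```python
# Sketch: pin an index at two replicas (three copies total) and check cluster
# health via the Elasticsearch REST API. URL and index name are hypothetical.
import requests

ES_URL = "http://elasticsearch:9200"
INDEX = "events"

resp = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 2}},  # 1 primary + 2 replicas
    timeout=10,
)
resp.raise_for_status()

# "green" means every primary and replica shard is allocated; "yellow" means
# all primaries are active but some replicas are not (the post-incident state).
health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
print(health["status"])
```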
### What Went Poorly
- The UPS battery alert threshold (20%) was dangerously misconfigured and had not been reviewed since the batteries were installed 4.5 years earlier. No periodic review of DCIM alert thresholds existed.
- The UPS maintenance contract had lapsed 18 months before the incident with no active owner tracking the renewal or scheduling battery capacity tests.
- Battery age and replacement schedule were not tracked in the infrastructure asset management system. The team had no visibility into when any UPS battery had been installed or when it was due for replacement.
- The power dip risk model assumed UPS mitigation was binary (present/not present). Battery capacity degradation as a continuous risk factor was not modeled.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-025-01 | Replace UPS batteries in racks 14 and 15 immediately; conduct capacity test on all 12 racks; replace any battery bank below 80% rated capacity | P1 | Renata Kowalczyk | In Progress | 2025-11-26 |
| PM-025-02 | Update DCIM UPS battery health alert threshold from 20% to 75% (vendor replacement trigger); add a second alert at 85% as an advance warning | P1 | Datacenter Operations | In Progress | 2025-11-21 |
| PM-025-03 | Add UPS battery installation date and replacement due date to infrastructure asset management system; configure automated alert when any battery bank approaches its 3-year lifecycle (a lifecycle-check sketch follows the table) | P1 | Renata Kowalczyk | Open | 2025-12-03 |
| PM-025-04 | Renew APC UPS maintenance contract for all 12 racks; include annual capacity testing and proactive battery replacement notification in SLA | P1 | Datacenter Operations | Open | 2025-12-10 |
| PM-025-05 | Add utility power dip + degraded UPS battery scenario to the infrastructure risk register with quantified probability (2-3x/year event frequency × battery degradation rate) | P2 | Infrastructure Engineering | Open | 2025-12-10 |
| PM-025-06 | Review and document ownership matrix for all datacenter physical infrastructure (UPS, PDU, CRAC units, generators); ensure no maintenance responsibility gap exists | P2 | Renata Kowalczyk | Open | 2025-12-17 |
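As a sketch of what the PM-025-03 lifecycle alert might look like (the warning window and date fields are illustrative; the actual asset management system's schema will differ):

```python
# Sketch of the PM-025-03 lifecycle check. The 90-day warning window is an
# assumption; the 3-year lifecycle is the vendor recommendation.
from datetime import date, timedelta

LIFECYCLE = timedelta(days=3 * 365)   # vendor-recommended replacement age
WARNING_WINDOW = timedelta(days=90)   # assumed advance-warning lead time

def battery_status(installed: date, today: date) -> str:
    """Classify a battery bank's age against the replacement lifecycle."""
    due = installed + LIFECYCLE
    if today >= due:
        return "OVERDUE"
    if today >= due - WARNING_WINDOW:
        return "DUE_SOON"
    return "OK"

# Example: the racks 14-15 banks (installed May 2021), checked on incident day.
print(battery_status(date(2021, 5, 1), date(2025, 11, 19)))  # -> OVERDUE
```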
## Lessons Learned
- Infrastructure alert thresholds must reflect operational reality, not worst-case states. A UPS battery alert at 20% capacity provides no actionable lead time — the battery is already operationally useless before the alert fires. Alert thresholds for degradable resources (batteries, disk, capacitor banks) should be set at the point where the resource should be replaced or repaired, not at the point where it has completely failed.
- Maintenance contracts are not optional for critical physical infrastructure. Software systems have automated dependency tracking and CVE feeds; physical infrastructure depends on vendor relationships, maintenance schedules, and proactive capacity testing. When a maintenance contract lapses, the organization loses the institutional pressure and scheduling mechanisms that prevent hardware from quietly degrading past safe operating parameters.
- Risk models must account for continuous degradation, not just binary failure. A UPS being "present" is not equivalent to a UPS being effective. Any resource that degrades over time (batteries, disks, bearings, seals) must be modeled with a degradation curve, not a binary healthy/failed state. The difference between a 95%-capacity battery and a 40%-capacity battery is invisible in a binary model but is the difference between riding out a 5-second outage and losing two racks of servers.
## Cross-References
- Failure Pattern: Infrastructure degradation — physical resource (UPS battery) degraded past operational effectiveness without detection due to misconfigured monitoring threshold and lapsed maintenance contract
- Topic Packs: Datacenter power infrastructure, UPS design and maintenance, Kubernetes node failure and rescheduling, Elasticsearch shard recovery, Kafka partition recovery
- Runbooks: `runbooks/datacenter/ups-battery-replacement.md`, `runbooks/kubernetes/node-mass-failure-recovery.md`
- Decision Tree: Mass node failure alert → correlate with DCIM power events → UPS output state → battery capacity → replacement vs. temporary load reduction decision