Incident Replay: Power Supply Redundancy Lost¶
Setup¶
- System context: Production database server with dual redundant power supplies (PSU1 and PSU2) connected to separate PDUs on separate power circuits. This is a critical Tier-1 server.
- Time: Friday 23:15 UTC
- Your role: On-call SRE / datacenter ops
Round 1: Alert Fires¶
[Pressure cue: "iDRAC fires — 'Power Supply 2 status is Critical — AC power lost.' Server is running on single PSU. One more power event and this server goes down."]
What you see: iDRAC dashboard shows PSU2 has lost AC input. PSU1 is healthy and carrying the full load. Server is operational but has lost power redundancy. The PDU dashboard shows Circuit B in the rack is offline.
Choose your action:
- A) Immediately migrate all workloads off this server
- B) Check the PDU and circuit breaker status for Circuit B
- C) Replace PSU2 with a spare power supply
- D) Check if other servers in the rack are also affected
If you chose D (recommended):¶
[Result: 6 of 12 servers in the rack show PSU2 warnings — all on the same PDU/circuit. This is a PDU or circuit issue, not a PSU issue. Proceed to Round 2.]
If you chose A:¶
[Result: Migration is safe but takes 20 minutes, and you still have not scoped the problem: other servers on the same circuit may be at risk too.]
If you chose B:¶
[Result: Good instinct — gets to the same data as D but only for your specific server. Checking fleet impact first is more efficient.]
If you chose C:¶
[Result: The PSU is fine — it is the AC input from the PDU that is gone. Swapping the PSU changes nothing.]
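The fleet-wide scoping in option D can be sketched in a few lines. This assumes PSU health has already been pulled from each server's iDRAC (e.g. via its Redfish API); the hostnames and the payload shape are illustrative, not a real fleet.

```python
from collections import defaultdict

def scope_psu_alerts(fleet_status):
    """Group servers by which PSU slot reports a non-OK state.

    fleet_status: {hostname: {psu_name: "OK" | "Critical"}}
    Returns {psu_name: sorted list of affected hostnames}.
    """
    affected = defaultdict(list)
    for host, psus in fleet_status.items():
        for psu, state in psus.items():
            if state != "OK":
                affected[psu].append(host)
    return {psu: sorted(hosts) for psu, hosts in affected.items()}

# Illustrative rack snapshot (hostnames are hypothetical)
rack = {
    "db-01": {"PSU1": "OK", "PSU2": "Critical"},
    "db-02": {"PSU1": "OK", "PSU2": "Critical"},
    "db-03": {"PSU1": "OK", "PSU2": "OK"},
}
print(scope_psu_alerts(rack))  # -> {'PSU2': ['db-01', 'db-02']}
```

Many servers failing on the *same* PSU slot points at the shared PDU or circuit, not at the individual power supplies.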
Round 2: First Triage Data¶
[Pressure cue: "6 servers have lost power redundancy. If Circuit A also fails, we lose the entire rack. Datacenter team is 30 minutes away."]
What you see: PDU B is powered off and its management interface is unreachable. The Circuit B breaker on the rack PDU panel has tripped. Total load on Circuit B before the trip was 4.8kW on a 5kW circuit.
Choose your action:
- A) Reset the circuit breaker immediately
- B) Check if any new equipment was added to Circuit B recently
- C) Redistribute some servers to different PDUs before resetting
- D) Call the electrician to inspect before resetting the breaker
If you chose B (recommended):¶
[Result: A new GPU server was added to the rack 2 days ago and plugged into Circuit B. It draws 1.2kW, pushing the circuit from 3.6kW to 4.8kW — over the 80% safety threshold. The breaker tripped under a load spike. Proceed to Round 3.]
If you chose A:¶
[Result: Breaker resets, Circuit B comes back. But the overload condition still exists — it will trip again under the next load spike. Band-aid.]
If you chose C:¶
[Result: Correct approach but you need to identify what is overloading the circuit first.]
If you chose D:¶
[Result: Electrician is hours away for a weekend call. Meanwhile you have 6 servers without redundancy.]
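The numbers in this round follow the standard 80% continuous-load derate: a 5kW circuit should not carry more than 4kW sustained. A minimal check, using the loads from this scenario:

```python
def circuit_headroom(current_kw, capacity_kw, derate=0.80):
    """Return (safe_limit_kw, headroom_kw) under a continuous-load derate."""
    limit = capacity_kw * derate
    return limit, round(limit - current_kw, 2)

def can_add(load_kw, new_kw, capacity_kw, derate=0.80):
    """Would adding new_kw keep the circuit under its derated limit?"""
    return (load_kw + new_kw) <= capacity_kw * derate

# Circuit B before the GPU server: 3.6 kW on a 5 kW breaker.
print(circuit_headroom(3.6, 5.0))  # -> (4.0, 0.4): only 0.4 kW of safe headroom
print(can_add(3.6, 1.2, 5.0))      # -> False: 4.8 kW exceeds the 4.0 kW limit
```

A capacity review running this check before the rack-and-stack would have rejected the GPU server's placement on Circuit B.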
Round 3: Root Cause Identification¶
[Pressure cue: "Root cause is overloaded circuit. Fix it."]
What you see: Root cause: New GPU server added without a power capacity review. Circuit B was at 72% utilization before the add (3.6kW / 5kW). The GPU server pushed it to 96% (4.8kW / 5kW). A transient load spike tripped the breaker.
Choose your action:
- A) Move the GPU server's PSU2 to a different circuit, then reset the breaker
- B) Remove the GPU server from the rack entirely
- C) Reset the breaker and set a lower alert threshold on the PDU
- D) Move two older 1U servers to a different circuit to make room for the GPU server
If you chose A (recommended):¶
[Result: GPU server PSU2 moved to Circuit C (which has capacity). Circuit B load drops to 3.6kW. Breaker reset. All servers regain power redundancy. Proceed to Round 4.]
If you chose B:¶
[Result: The GPU server is needed for production workloads. Removing it creates a different problem.]
If you chose C:¶
[Result: Breaker resets but Circuit B is still overloaded. The alert threshold does not prevent tripping.]
If you chose D:¶
[Result: Works but more complex — you are moving 2 servers instead of re-cabling 1. Takes longer.]
Round 4: Remediation¶
[Pressure cue: "Redundancy restored. Close the incident."]
Actions:
1. Verify all 6 servers show both PSUs healthy in iDRAC
2. Verify PDU B load is within the safe threshold via the PDU management interface
3. Update the rack power capacity spreadsheet
4. Implement a mandatory power capacity review for all new rack-and-stack installs
5. Set PDU alerts at 75% circuit utilization to catch overloads before breakers trip
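Steps 2 and 5 above can be sanity-checked with the scenario's numbers. Circuit B's post-fix load (3.6kW) comes from Round 3; the loads for Circuits A and C are illustrative placeholders.

```python
def alert_threshold_kw(capacity_kw, alert_pct=0.75):
    """PDU alert setpoint in kW for a given circuit capacity."""
    return capacity_kw * alert_pct

def circuits_ok(loads_kw, capacity_kw=5.0, derate=0.80):
    """True if every circuit is under its derated continuous limit."""
    return all(load <= capacity_kw * derate for load in loads_kw.values())

# Loads after moving the GPU server's PSU2 to Circuit C.
# B is from the scenario; A and C are illustrative.
after_fix = {"A": 3.6, "B": 3.6, "C": 2.4}

print(alert_threshold_kw(5.0))  # -> 3.75: set the PDU alert at 3.75 kW
print(circuits_ok(after_fix))   # -> True: every circuit is under the 4.0 kW limit
```

An alert at 75% fires before the 80% derate limit is reached, giving time to redistribute load before a breaker trips.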
Damage Report¶
- Total downtime: 0 (servers ran on single PSU)
- Blast radius: 6 servers lost power redundancy for ~2 hours
- Optimal resolution time: 20 minutes (scope impact -> identify overload -> redistribute -> reset)
- If every wrong choice was made: 3+ hours with repeated breaker trips and potential full rack outage
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Capacity Planning
- Footguns: Datacenter