Incident Replay: Power Supply Redundancy Lost¶
Setup¶
- System context: Production database server with dual redundant power supplies (PSU1 and PSU2) connected to separate PDUs on separate power circuits. This is a critical Tier-1 server.
- Time: Friday 23:15 UTC
- Your role: On-call SRE / datacenter ops
Round 1: Alert Fires¶
[Pressure cue: "iDRAC fires — 'Power Supply 2 status is Critical — AC power lost.' Server is running on single PSU. One more power event and this server goes down."]
What you see: iDRAC dashboard shows PSU2 has lost AC input. PSU1 is healthy and carrying the full load. Server is operational but has lost power redundancy. The PDU dashboard shows Circuit B in the rack is offline.
Choose your action:
- A) Immediately migrate all workloads off this server
- B) Check the PDU and circuit breaker status for Circuit B
- C) Replace PSU2 with a spare power supply
- D) Check if other servers in the rack are also affected
If you chose D (recommended):¶
[Result: 6 of 12 servers in the rack show PSU2 warnings — all on the same PDU/circuit. This is a PDU or circuit issue, not a PSU issue. Proceed to Round 2.]
If you chose A:¶
[Result: Migration is safe but takes 20 minutes, and you still have not scoped the problem: other servers on the same circuit may be at risk too.]
If you chose B:¶
[Result: Good instinct — gets to the same data as D but only for your specific server. Checking fleet impact first is more efficient.]
If you chose C:¶
[Result: The PSU is fine — it is the AC input from the PDU that is gone. Swapping the PSU changes nothing.]
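The fleet-wide scoping in option D can be sketched in a few lines. This assumes PSU health has already been pulled from each server's iDRAC (e.g. via its Redfish API); the hostnames and the payload shape are illustrative, not a real fleet.

```python
from collections import defaultdict

def scope_psu_alerts(fleet_status):
    """Group servers by which PSU slot reports a non-OK state.

    fleet_status: {hostname: {psu_name: "OK" | "Critical"}}
    Returns {psu_name: sorted list of affected hostnames}.
    """
    affected = defaultdict(list)
    for host, psus in fleet_status.items():
        for psu, state in psus.items():
            if state != "OK":
                affected[psu].append(host)
    return {psu: sorted(hosts) for psu, hosts in affected.items()}

# Illustrative rack snapshot (hostnames are hypothetical)
rack = {
    "db-01": {"PSU1": "OK", "PSU2": "Critical"},
    "db-02": {"PSU1": "OK", "PSU2": "Critical"},
    "db-03": {"PSU1": "OK", "PSU2": "OK"},
}
print(scope_psu_alerts(rack))  # -> {'PSU2': ['db-01', 'db-02']}
```

Many servers failing on the *same* PSU slot points at the shared PDU or circuit, not at the individual power supplies.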
Round 2: First Triage Data¶
[Pressure cue: "6 servers have lost power redundancy. If Circuit A also fails, we lose the entire rack. Datacenter team is 30 minutes away."]
What you see: PDU B is powered off and its management interface is unreachable. The Circuit B breaker on the rack PDU panel has tripped. Total load on Circuit B before the trip was 4.8kW on a 5kW circuit.
Choose your action:
- A) Reset the circuit breaker immediately
- B) Check if any new equipment was added to Circuit B recently
- C) Redistribute some servers to different PDUs before resetting
- D) Call the electrician to inspect before resetting the breaker
If you chose B (recommended):¶
[Result: A new GPU server was added to the rack 2 days ago and plugged into Circuit B. It draws 1.2kW, pushing the circuit from 3.6kW to 4.8kW — over the 80% safety threshold. The breaker tripped under a load spike. Proceed to Round 3.]
If you chose A:¶
[Result: Breaker resets, Circuit B comes back. But the overload condition still exists — it will trip again under the next load spike. Band-aid.]
If you chose C:¶
[Result: Correct approach but you need to identify what is overloading the circuit first.]
If you chose D:¶
[Result: Electrician is hours away for a weekend call. Meanwhile you have 6 servers without redundancy.]
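The numbers in this round follow the standard 80% continuous-load derate: a 5kW circuit should not carry more than 4kW sustained. A minimal check, using the loads from this scenario:

```python
def circuit_headroom(current_kw, capacity_kw, derate=0.80):
    """Return (safe_limit_kw, headroom_kw) under a continuous-load derate."""
    limit = capacity_kw * derate
    return limit, round(limit - current_kw, 2)

def can_add(load_kw, new_kw, capacity_kw, derate=0.80):
    """Would adding new_kw keep the circuit under its derated limit?"""
    return (load_kw + new_kw) <= capacity_kw * derate

# Circuit B before the GPU server: 3.6 kW on a 5 kW breaker.
print(circuit_headroom(3.6, 5.0))  # -> (4.0, 0.4): only 0.4 kW of safe headroom
print(can_add(3.6, 1.2, 5.0))      # -> False: 4.8 kW exceeds the 4.0 kW limit
```

A capacity review running this check before the rack-and-stack would have rejected the GPU server's placement on Circuit B.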
Round 3: Root Cause Identification¶
[Pressure cue: "Root cause is overloaded circuit. Fix it."]
What you see: Root cause: New GPU server added without a power capacity review. Circuit B was at 72% utilization before the add (3.6kW / 5kW). The GPU server pushed it to 96% (4.8kW / 5kW). A transient load spike tripped the breaker.
Choose your action:
- A) Move the GPU server's PSU2 to a different circuit, then reset the breaker
- B) Remove the GPU server from the rack entirely
- C) Reset the breaker and set a lower alert threshold on the PDU
- D) Move two older 1U servers to a different circuit to make room for the GPU server
If you chose A (recommended):¶
[Result: GPU server PSU2 moved to Circuit C (which has capacity). Circuit B load drops to 3.6kW. Breaker reset. All servers regain power redundancy. Proceed to Round 4.]
If you chose B:¶
[Result: The GPU server is needed for production workloads. Removing it creates a different problem.]
If you chose C:¶
[Result: Breaker resets but Circuit B is still overloaded. The alert threshold does not prevent tripping.]
If you chose D:¶
[Result: Works but more complex — you are moving 2 servers instead of re-cabling 1. Takes longer.]
Round 4: Remediation¶
[Pressure cue: "Redundancy restored. Close the incident."]
Actions:
1. Verify all 6 servers show both PSUs healthy in iDRAC
2. Verify PDU B load is within the safe threshold via the PDU management interface
3. Update the rack power capacity spreadsheet
4. Implement a mandatory power capacity review for all new rack-and-stack installs
5. Set PDU alerts at 75% circuit utilization to catch overloads before breakers trip
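Steps 2 and 5 above can be sanity-checked with the scenario's numbers. Circuit B's post-fix load (3.6kW) comes from Round 3; the loads for Circuits A and C are illustrative placeholders.

```python
def alert_threshold_kw(capacity_kw, alert_pct=0.75):
    """PDU alert setpoint in kW for a given circuit capacity."""
    return capacity_kw * alert_pct

def circuits_ok(loads_kw, capacity_kw=5.0, derate=0.80):
    """True if every circuit is under its derated continuous limit."""
    return all(load <= capacity_kw * derate for load in loads_kw.values())

# Loads after moving the GPU server's PSU2 to Circuit C.
# B is from the scenario; A and C are illustrative.
after_fix = {"A": 3.6, "B": 3.6, "C": 2.4}

print(alert_threshold_kw(5.0))  # -> 3.75: set the PDU alert at 3.75 kW
print(circuits_ok(after_fix))   # -> True: every circuit is under the 4.0 kW limit
```

An alert at 75% fires before the 80% derate limit is reached, giving time to redistribute load before a breaker trips.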
Damage Report¶
- Total downtime: 0 (servers ran on single PSU)
- Blast radius: 6 servers lost power redundancy for ~2 hours
- Optimal resolution time: 20 minutes (scope impact -> identify overload -> redistribute -> reset)
- If every wrong choice was made: 3+ hours with repeated breaker trips and potential full rack outage
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Capacity Planning
- Footguns: Datacenter