Incident Replay: Rack PDU Overload Alert¶
Setup¶
- System context: High-density compute rack with 16 servers, 2 PDUs (A and B), each rated for 8.6kW. Metered PDUs with SNMP monitoring.
- Time: Thursday 15:00 UTC
- Your role: Datacenter operations engineer
Round 1: Alert Fires¶
[Pressure cue: "PDU monitoring fires — PDU-A in rack C12 at 95% capacity (8.17kW / 8.6kW). Ambient temperature in the rack is 2°C above normal. One more server power-up could trip the breaker."]
What you see: PDU-A SNMP data shows steady load increase over the past week. PDU-B is at 72%. The rack was recently filled to capacity with 2 new GPU-accelerated servers.
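The utilization figures above are easy to re-derive from raw wattage. A minimal sketch, with the scenario's readings hard-coded in place of a live SNMP poll (the OIDs and hostnames for a real poll are vendor- and site-specific, so they are omitted):

```python
def pdu_utilization(load_kw: float, rating_kw: float) -> float:
    """Return PDU utilization as a percentage of its rating."""
    return 100.0 * load_kw / rating_kw

def headroom_kw(load_kw: float, rating_kw: float) -> float:
    """Return remaining capacity before the rating is reached."""
    return rating_kw - load_kw

# Readings from the scenario (normally pulled via SNMP from the metered PDUs)
PDU_RATING_KW = 8.6
pdu_a, pdu_b = 8.17, 6.19  # kW

print(f"PDU-A: {pdu_utilization(pdu_a, PDU_RATING_KW):.0f}%, "
      f"headroom {headroom_kw(pdu_a, PDU_RATING_KW):.2f} kW")
print(f"PDU-B: {pdu_utilization(pdu_b, PDU_RATING_KW):.0f}%, "
      f"headroom {headroom_kw(pdu_b, PDU_RATING_KW):.2f} kW")
```

PDU-A's 0.43kW of headroom is less than a single additional 1U server's draw, which is why one more power-up risks a trip.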
Choose your action:
- A) Immediately power down non-critical servers to reduce load
- B) Check per-outlet power draw on PDU-A to identify the heaviest consumers
- C) Move some server PSU cables from PDU-A to PDU-B to balance the load
- D) Contact facilities to increase the circuit capacity
If you chose B (recommended):¶
[Result: Per-outlet data shows the 2 new GPU servers draw 1.8kW each — more than double the older 1U servers (~0.8kW each). The GPU servers both have PSU1 on PDU-A and PSU2 on PDU-B. One GPU server's PSU allocation should be swapped. Proceed to Round 2.]
If you chose A:¶
[Result: Powering down servers causes service disruption. The load is not an emergency yet — you have time to rebalance.]
If you chose C:¶
[Result: Correct direction but you need to know which cables to move first. Moving randomly could make PDU-B worse.]
If you chose D:¶
[Result: Circuit capacity increases require electrical work — weeks of lead time. Not a near-term fix.]
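The per-outlet check in option B amounts to sorting outlet telemetry by draw. A sketch with illustrative outlet labels and wattages (real values come from the PDU's per-outlet SNMP table, not from a hard-coded dict):

```python
# Rank PDU-A outlets by measured draw to find the heaviest consumers.
# Outlet labels and wattages here are illustrative, not real telemetry.
outlet_draw_kw = {
    "A1 (gpu-srv-1 PSU1)": 1.8,
    "A2 (gpu-srv-2 PSU1)": 1.8,
    "A3 (1u-srv-03)": 0.9,
    "A4 (1u-srv-04)": 0.8,
    "A5 (1u-srv-05)": 0.7,
}

heaviest = sorted(outlet_draw_kw.items(), key=lambda kv: kv[1], reverse=True)
for outlet, kw in heaviest[:3]:
    print(f"{outlet}: {kw} kW")
```

The top two outlets immediately point at the GPU servers, which is the signal Round 2 acts on.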
Round 2: First Triage Data¶
[Pressure cue: "Load is still at 95%. A server reboot or power spike could push it over."]
What you see: Both GPU servers have PSU1 on PDU-A carrying essentially the full server load (PSU2 is the redundant feed), contributing 3.6kW to that PDU. Swapping one GPU server's cabling (PSU1 on PDU-B, PSU2 on PDU-A) moves 1.8kW off PDU-A, but on its own that would push PDU-B to ~93%. Pairing the swap with one 1U server moved the opposite way lands both PDUs at ~83%.
Choose your action:
- A) Rebalance the cabling: swap one GPU server's PSU assignments (PSU1 <-> PSU2 PDU assignments) and move one 1U server the other way
- B) Move all GPU server cables to dedicated circuits
- C) Set power capping on the GPU servers via iDRAC
- D) Order higher-capacity PDUs
If you chose A (recommended):¶
[Result: With the servers running, you swap GPU-server-2's PSU cables at the PDU outlets, then move one 1U server's loaded feed from PDU-B to PDU-A. Each server stays up on its other PSU during every swap. PDU-A drops to ~83%, PDU-B rises to ~83%. Balanced. Proceed to Round 3.]
If you chose B:¶
[Result: No dedicated GPU circuits exist — this is a shared rack. Infrastructure change needed.]
If you chose C:¶
[Result: Power capping reduces GPU performance, which is the reason these servers exist. Bad trade-off.]
If you chose D:¶
[Result: Higher-capacity PDUs may require new circuits. Weeks of lead time.]
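Before touching cables, a move like the one above can be dry-run on paper. This sketch models each server as a single loaded feed on one PDU (server names and draws are illustrative, not the rack's real inventory):

```python
def pdu_loads(servers):
    """servers maps name -> (draw_kw, pdu carrying its loaded feed)."""
    loads = {}
    for draw_kw, pdu in servers.values():
        loads[pdu] = round(loads.get(pdu, 0.0) + draw_kw, 2)
    return loads

def apply_moves(servers, moves):
    """Return a new assignment with the named servers' loaded feeds moved."""
    updated = dict(servers)
    for name, new_pdu in moves.items():
        draw_kw, _ = updated[name]
        updated[name] = (draw_kw, new_pdu)
    return updated

# Illustrative rack: both heavy servers landed on PDU-A.
servers = {
    "gpu-1": (1.8, "A"), "gpu-2": (1.8, "A"), "web-1": (0.8, "A"),
    "web-2": (0.8, "B"), "web-3": (0.8, "B"), "web-4": (0.8, "B"),
}
print(pdu_loads(servers))  # before any move
# Swap one GPU server to B and one light server back to A:
print(pdu_loads(apply_moves(servers, {"gpu-2": "B", "web-2": "A"})))
```

Simulating first catches the failure mode from option C of Round 1: a move that helps one PDU can overload the other.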
Round 3: Root Cause Identification¶
[Pressure cue: "PDUs balanced. Why was this not caught at rack-and-stack?"]
What you see: Root cause: the rack-and-stack procedure assigns PSU feeds by a fixed pattern (PSU1 to PDU-A, PSU2 to PDU-B) without considering per-server draw, so both new GPU servers put their full 1.8kW loads on PDU-A. With heterogeneous power draws (GPU vs standard 1U), a draw-blind pattern concentrates the heavy loads on one PDU. The capacity planning spreadsheet did not model per-PDU loading.
Choose your action:
- A) Update the capacity planning model to include per-PDU load balancing
- B) Add a PDU load check to the post-rack-and-stack verification
- C) Set PDU alerting threshold to 80% instead of 90%
- D) All of the above
If you chose D (recommended):¶
[Result: Updated model, post-install verification, and lower alert threshold. Future racks will be balanced from day one. Proceed to Round 4.]
If you chose A:¶
[Result: Model helps future planning but does not catch existing imbalances.]
If you chose B:¶
[Result: Post-install check catches issues but the planning phase should prevent them.]
If you chose C:¶
[Result: Earlier alerts help but do not prevent the underlying imbalance.]
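Option A's model update can start small: a per-PDU sum over the planned PSU feeds, flagged against the alert threshold. A sketch (the rating matches this rack; the plan shown is a made-up example, not real inventory):

```python
PDU_RATING_KW = 8.6
ALERT_THRESHOLD = 0.80  # the new, earlier alerting level

def check_rack_plan(psu_feeds):
    """psu_feeds: list of (pdu_name, kw_carried_by_that_feed) per PSU cable.
    Returns warnings for any PDU planned above the alert threshold."""
    totals = {}
    for pdu, kw in psu_feeds:
        totals[pdu] = totals.get(pdu, 0.0) + kw
    return [
        f"{pdu}: planned {kw:.2f} kW ({100 * kw / PDU_RATING_KW:.0f}%) "
        f"exceeds {100 * ALERT_THRESHOLD:.0f}% threshold"
        for pdu, kw in totals.items()
        if kw > ALERT_THRESHOLD * PDU_RATING_KW
    ]

# Example: a plan that piles both GPU feeds plus other load onto PDU-A
plan = [("PDU-A", 1.8), ("PDU-A", 1.8), ("PDU-A", 3.6), ("PDU-B", 3.0)]
for warning in check_rack_plan(plan):
    print(warning)
```

Running this at planning time is what prevents the imbalance; the post-install SNMP check (option B) then confirms reality matches the plan.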
Round 4: Remediation¶
[Pressure cue: "Balanced and documented. Close."]
Actions:
1. Verify both PDUs are below 85%: check SNMP data
2. Verify all servers are running with dual-PSU redundancy
3. Update rack documentation with the corrected PDU assignments
4. Update the capacity planning model with GPU server power profiles
5. Set PDU alerting at 80% utilization (this rack now sits at ~83%, so expect a standing warning here until load is reduced; the 80% threshold mainly protects future racks)
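Step 1 can run as a script on every poll instead of a one-time check at close-out. A sketch with post-rebalance readings hard-coded in place of the SNMP query (the wattages are illustrative values near the ~83% post-fix level):

```python
PDU_RATING_KW = 8.6
CLOSEOUT_MAX_UTIL = 0.85  # step 1: both PDUs must be below 85%

def verify_closeout(readings_kw):
    """Return failure messages; an empty list means step 1 passes."""
    return [
        f"{pdu} at {kw / PDU_RATING_KW:.0%}, above the {CLOSEOUT_MAX_UTIL:.0%} limit"
        for pdu, kw in readings_kw.items()
        if kw / PDU_RATING_KW >= CLOSEOUT_MAX_UTIL
    ]

print(verify_closeout({"PDU-A": 7.17, "PDU-B": 7.19}))  # post-rebalance: passes
print(verify_closeout({"PDU-A": 8.17, "PDU-B": 6.19}))  # pre-incident: fails
```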
Damage Report¶
- Total downtime: 0 (rebalanced live)
- Blast radius: a PDU-A breaker trip would shift every dual-corded server's load onto PDU-B, overloading it in turn and risking a full-rack outage if unaddressed
- Optimal resolution time: 15 minutes (identify imbalance -> swap cables -> verify)
- If every wrong choice were made: 45+ minutes, plus risk of a PDU trip during the delay
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Capacity Planning
- Footguns: Datacenter