Solution: PDU Overload Warning - Phase Imbalance

Triage

  1. Assess immediate risk: 87% on L2 is above the critical threshold. If GPU servers spike under full load, L2 could reach 95%+ and trip the breaker.
  2. Pull per-outlet power readings from the PDU management interface:
     • Identify which outlets are on phase L2
     • Sort devices by power draw on L2
  3. Cross-reference with the server inventory to identify:
     • The two new GPU servers and their outlet assignments
     • Which servers are critical vs. deferrable
  4. Check PDU-B to understand the full redundancy picture:
     • If PDU-A L2 trips, does PDU-B L2 have enough headroom to absorb the shifted load?
     • At 78%, PDU-B L2 cannot safely absorb the full PDU-A L2 load.
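
The triage steps above can be sketched in a few lines of Python. This is a minimal illustration, not a PDU API: the per-outlet readings are made-up sample values, the names OutletReading, top_loads and failover_pct are hypothetical, and the failover estimate assumes both PDUs share the same per-phase rating and that PDU-A's L2 load would shift one-for-one onto PDU-B's L2.

```python
from typing import NamedTuple


class OutletReading(NamedTuple):
    outlet: int
    phase: str    # "L1", "L2" or "L3", taken from the PDU's phase map
    amps: float   # measured current on this outlet


def top_loads(readings, phase):
    """Return the readings on one phase, largest draw first."""
    on_phase = [r for r in readings if r.phase == phase]
    return sorted(on_phase, key=lambda r: r.amps, reverse=True)


def failover_pct(surviving_pct, failed_pct):
    """Worst-case load on the surviving PDU phase if the other PDU's phase trips."""
    return surviving_pct + failed_pct


# Hypothetical sample readings, for illustration only.
sample = [
    OutletReading(9, "L1", 2.1),
    OutletReading(13, "L2", 3.4),
    OutletReading(14, "L2", 3.3),
    OutletReading(21, "L3", 1.8),
]

for r in top_loads(sample, "L2"):
    print(f"outlet {r.outlet}: {r.amps} A")

# Figures from this incident: PDU-B L2 at 78%, PDU-A L2 at 87%.
print(failover_pct(78, 87))  # 165, far beyond 100% of one phase
```

Under those assumptions the surviving phase would be asked to carry about 165% of its rating, which is why the redundant feed cannot be relied on until the load is reduced.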

Root Cause

The two new GPU servers (Dell R750xa with A100 GPUs) were cabled to outlets 13-16 on PDU-A, which are all on phase L2. Each GPU server draws approximately 1400W (6.7A at 208V) at typical load, adding 13.4A to L2. The original phase balance was roughly even at ~21A per phase. The installation team did not consult the PDU phase mapping when cabling the new servers.
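
The arithmetic behind these figures can be checked with a short sketch. It follows the same simplification the text uses (current = power / voltage at 208V, power factor treated as 1); the helper name amps is hypothetical.

```python
def amps(watts: float, volts: float = 208.0) -> float:
    """Current for a given power draw, using I = P / V (power factor taken as 1)."""
    return watts / volts


per_server = amps(1400)        # about 6.7 A
added_to_l2 = 2 * per_server   # about 13.5 A (the text rounds to 13.4 A)
l2_after = 21 + added_to_l2    # about 34.5 A from a ~21 A baseline

print(round(per_server, 1), round(added_to_l2, 1), round(l2_after, 1))
```

If the per-phase limit is 40A (an assumption inferred from the percentages quoted elsewhere in this document, not a stated rating), roughly 34.5A is about 86%, in line with the 87% alarm reading.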

Fix

  1. Immediate (within 1 hour): Apply BMC power capping on both GPU servers:
     • Set power cap to 1000W per server via iDRAC: racadm set System.Power.Cap.Watts 1000
     • This reduces L2 draw by approximately 3.8A, bringing it to ~31A (77%)
  2. Short-term (within 24 hours): Rebalance power connections:
     • Move one GPU server's PSU-A cable from outlet 14 (L2) to outlet 9 (L1)
     • Move the other GPU server's PSU-A cable from outlet 16 (L2) to outlet 21 (L3)
     • Target: L1 ~27A, L2 ~25A, L3 ~28A, putting all phases at roughly 70% or below
  3. Verify after rebalancing:
     • Monitor PDU readings for 30 minutes under normal load
     • Confirm all three phases are within the 70% threshold
     • Remove power caps from GPU servers
     • Monitor again during peak GPU workload
  4. Long-term: Update the rack power plan:
     • Document actual power draw for the GPU server model
     • Create a phase-aware outlet assignment map for the rack
     • Set policy: new installs require a capacity review before cabling
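
The effect of the power cap and the post-rebalance targets can be sanity-checked with a rough sketch. The 40A per-phase limit is an assumption inferred from the "~31A (77%)" figure, not a rating stated in this document, and the function names are hypothetical.

```python
PHASE_LIMIT_A = 40.0  # assumption: inferred from "~31A (77%)", not a stated rating


def cap_reduction_amps(typical_w, cap_w, servers, volts=208.0):
    """Current removed from the phase when each server is capped below its typical draw."""
    return servers * (typical_w - cap_w) / volts


def utilization_pct(amps, limit_a=PHASE_LIMIT_A):
    """Phase current as a percentage of the assumed per-phase limit."""
    return 100.0 * amps / limit_a


saved = cap_reduction_amps(1400, 1000, servers=2)
print(round(saved, 1))  # 3.8

# Target currents from the rebalance plan above.
targets = {"L1": 27, "L2": 25, "L3": 28}
for phase, a in targets.items():
    print(phase, utilization_pct(a))
```

With these assumptions the L3 target of ~28A lands right at 70%, so the final verification should rely on measured readings, not on the planned figures.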

Rollback / Safety

  • Power capping requires no reboot and is instantly reversible, but it throttles GPU performance while the cap is in place.
  • When moving power cables, ensure the server has redundant PSUs and the other PSU stays connected.
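  • Never move power cables under load without confirming that the redundant power path is active.
  • If a breaker trips during rebalancing, the redundant PDU should keep servers running. Verify beforehand that PDU-B can handle the momentary full load; triage found PDU-B L2 at 78%, which leaves too little headroom to absorb all of PDU-A L2.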
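
The redundancy check described above could be expressed as a small pre-move gate. This is a sketch only: the Psu record and safe_to_unplug function are hypothetical, and in practice the connected/healthy flags would come from the BMC or a physical check.

```python
from dataclasses import dataclass


@dataclass
class Psu:
    name: str        # e.g. "PSU-A"
    connected: bool  # cable plugged into a live outlet
    healthy: bool    # BMC reports the supply as OK


def safe_to_unplug(psus, name):
    """True only if another PSU is connected and healthy while `name` is unplugged."""
    if not any(p.name == name for p in psus):
        raise ValueError(f"unknown PSU: {name}")
    return any(p.connected and p.healthy for p in psus if p.name != name)


dual = [Psu("PSU-A", True, True), Psu("PSU-B", True, True)]
single = [Psu("PSU-A", True, True)]

print(safe_to_unplug(dual, "PSU-A"))    # True
print(safe_to_unplug(single, "PSU-A"))  # False
```

The check deliberately fails for a single-PSU server, which needs a maintenance window instead of a live cable move.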

Common Traps

  • Moving cables without checking redundancy: If a server has a single PSU, moving its power cable kills it instantly.
  • Ignoring the redundant PDU: Rebalancing PDU-A without checking PDU-B can create the same problem on the backup feed.
  • Using nameplate ratings for capacity planning: GPU servers can draw 30-50% more than nameplate under burst loads. Use measured values.
  • Phase mapping confusion: PDU outlet numbers do not always map to phases sequentially. Always consult the PDU manual or management interface for the phase-to-outlet mapping.
  • Forgetting to remove power caps: BMC power caps left in place keep GPU performance throttled until someone removes them.
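
To guard against the phase-mapping trap above, a planned cable move can be validated against an explicit outlet-to-phase map instead of guessing from outlet numbers. The sketch below uses a partial map built only from the outlets named in this document; a real map should come from the PDU manual or management interface, and the function names are hypothetical.

```python
# Partial map built only from outlets named in this document; a real map
# should come from the PDU manual or management interface.
PHASE_MAP = {9: "L1", 13: "L2", 14: "L2", 15: "L2", 16: "L2", 21: "L3"}


def phase_of(outlet, phase_map=PHASE_MAP):
    """Look up an outlet's phase; refuse to guess for unmapped outlets."""
    try:
        return phase_map[outlet]
    except KeyError:
        raise KeyError(f"outlet {outlet} is not in the phase map") from None


def check_move(src, dst, phase_map=PHASE_MAP):
    """Return (from_phase, to_phase), rejecting moves that stay on the same phase."""
    a, b = phase_of(src, phase_map), phase_of(dst, phase_map)
    if a == b:
        raise ValueError(f"outlets {src} and {dst} are both on {a}")
    return a, b


print(check_move(14, 9))   # ('L2', 'L1')
print(check_move(16, 21))  # ('L2', 'L3')
```

Raising an error for an unmapped outlet is a deliberate choice: it forces a lookup in the authoritative mapping instead of an assumption that outlets map to phases sequentially.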