Solution: PDU Overload Warning - Phase Imbalance
Triage
- Assess immediate risk: 87% on L2 is above the critical threshold. If GPU servers spike under full load, L2 could reach 95%+ and trip the breaker.
- Pull per-outlet power readings from the PDU management interface:
    - Identify which outlets are on phase L2
    - Sort devices by power draw on L2
- Cross-reference with the server inventory to identify:
    - The two new GPU servers and their outlet assignments
    - Which servers are critical vs. deferrable
- Check PDU-B to understand the full redundancy picture:
    - If PDU-A L2 trips, does PDU-B L2 have enough headroom to absorb the shifted load?
    - At 78%, PDU-B L2 cannot safely absorb the full PDU-A L2 load.
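
The redundancy check above can be sketched as simple arithmetic. This is a rough illustration, not a sizing tool: it assumes both PDUs have the same per-phase rating (so percentages can be added directly) and that every load on PDU-A L2 is dual-corded to PDU-B L2. The function names are made up for the example.

```python
def failover_load_pct(surviving_pct: float, failed_pct: float) -> float:
    """Worst-case load on the surviving phase if it absorbs the failed phase's full load.

    Assumes both PDUs share the same per-phase rating, so percentages add directly.
    """
    return surviving_pct + failed_pct


def can_absorb(surviving_pct: float, failed_pct: float, limit_pct: float = 100.0) -> bool:
    """True if the surviving phase stays at or below limit_pct after failover."""
    return failover_load_pct(surviving_pct, failed_pct) <= limit_pct


# Readings from the triage steps: PDU-A L2 at 87%, PDU-B L2 at 78%.
worst_case = failover_load_pct(surviving_pct=78.0, failed_pct=87.0)
print(f"PDU-B L2 after a PDU-A L2 trip: {worst_case:.0f}% of rating")
print("Safe to absorb:", can_absorb(78.0, 87.0))
```

Under these assumptions the surviving phase would be asked to carry well over its rating, which is why the PDU-A imbalance also puts the backup feed at risk.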
Root Cause
The installation team did not consult the PDU phase mapping when cabling the two new GPU servers (Dell R750xa with A100 GPUs). Both were connected to outlets 13-16 on PDU-A, which are all on phase L2. Each GPU server draws approximately 1400W (6.7A at 208V) at typical load, so together they added 13.4A to L2. The original phase balance was roughly even at ~21A per phase, which puts L2 at roughly 34A while L1 and L3 stay near 21A.
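
The arithmetic behind these figures can be checked with a few lines. This is a sketch under stated assumptions: current is taken as watts divided by volts (power factor of 1), and the 40A per-phase rating is not stated in this write-up; it is inferred from the "~31A (77%)" figure in the fix steps.

```python
VOLTS = 208.0
ASSUMED_RATING_AMPS = 40.0  # assumption: inferred from "~31A (77%)", not stated in the source


def watts_to_amps(watts: float, volts: float = VOLTS) -> float:
    """Approximate current draw, assuming a power factor of 1."""
    return watts / volts


baseline_l2_amps = 21.0                      # original, roughly balanced load per phase
per_server_amps = watts_to_amps(1400.0)      # typical draw of one GPU server
l2_estimate = baseline_l2_amps + 2 * per_server_amps

print(f"Per GPU server: {per_server_amps:.1f} A")
print(f"Estimated L2:   {l2_estimate:.1f} A "
      f"({l2_estimate / ASSUMED_RATING_AMPS:.0%} of the assumed rating)")
```

With the assumed 40A rating, the estimate lands within about a point of the 87% reading on L2.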
Fix
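- Immediate (within 1 hour): Apply BMC power capping on both GPU servers:
    - Set power cap to 1000W per server via iDRAC: racadm set System.Power.Cap.Watts 1000
    - If the cap does not take effect, check that power capping is enabled on the iDRAC (the System.Power.Cap.Enable attribute)
    - This reduces L2 draw by approximately 3.8A (2 x 400W at 208V), bringing it to ~31A (77%)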
- Short-term (within 24 hours): Rebalance power connections:
    - Move one GPU server's PSU-A cable from outlet 14 (L2) to outlet 9 (L1)
    - Move the other GPU server's PSU-A cable from outlet 16 (L2) to outlet 21 (L3)
    - Target: L1 ~27A, L2 ~25A, L3 ~28A -- all phases under 70%
- Verify after rebalancing:
    - Monitor PDU readings for 30 minutes under normal load
    - Confirm all three phases are below 70% threshold
    - Remove power caps from GPU servers
    - Monitor again during peak GPU workload
- Long-term: Update the rack power plan:
    - Document the actual measured power draw for the GPU server model
    - Create a phase-aware outlet assignment map for the rack
    - Set policy: new installs require capacity review before cabling
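
As a rough check on the fix steps, the sketch below works out what the power cap saves on L2 and shows one way to test monitored readings against the 70% threshold. The 40A rating is again an assumption inferred from "~31A (77%)", the sample readings are invented for illustration, and the helper names are hypothetical.

```python
VOLTS = 208.0
ASSUMED_RATING_AMPS = 40.0  # assumption: inferred from "~31A (77%)"


def cap_savings_amps(typical_watts: float, cap_watts: float, servers: int,
                     volts: float = VOLTS) -> float:
    """Current saved on the phase when each server is capped below its typical draw."""
    per_server_watts = max(typical_watts - cap_watts, 0.0)
    return servers * per_server_watts / volts


def phases_over_threshold(samples: dict[str, list[float]], rating_amps: float,
                          threshold: float = 0.70) -> dict[str, float]:
    """Return {phase: peak amps} for phases whose peak reading reaches the threshold."""
    over = {}
    for phase, readings in samples.items():
        peak = max(readings)
        if peak / rating_amps >= threshold:
            over[phase] = peak
    return over


saved = cap_savings_amps(typical_watts=1400.0, cap_watts=1000.0, servers=2)
l2_before = 21.0 + 2 * 1400.0 / VOLTS
l2_capped = l2_before - saved
print(f"Cap saves {saved:.1f} A; L2 drops from {l2_before:.1f} A to {l2_capped:.1f} A")

# Hypothetical readings collected during the 30-minute monitoring window.
samples = {"L1": [26.5, 27.1], "L2": [24.8, 25.3], "L3": [27.2, 27.6]}
print("Phases at or over 70%:", phases_over_threshold(samples, ASSUMED_RATING_AMPS))
```

Checking the peak reading rather than the average is a deliberate choice here, since a breaker responds to the highest sustained draw, not the mean.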
Rollback / Safety
- Power capping needs no reboot or cable change and is instantly reversible, but capped servers run with throttled GPU performance until the cap is removed.
- When moving power cables, ensure the server has redundant PSUs and the other PSU stays connected.
- Never move power cables under load without confirming that the redundant power path is active.
- If a breaker trips during rebalancing, the redundant PDU should keep servers running -- verify PDU-B can handle the momentary full load.
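
The pre-move check described above could be expressed as a small guard. This is only a sketch: the `Psu` record and its fields are hypothetical, and in practice the health flag would come from whatever the BMC reports for each power supply.

```python
from dataclasses import dataclass


@dataclass
class Psu:
    name: str      # e.g. "PSU-A"
    feed: str      # PDU feeding this supply, e.g. "PDU-A"
    healthy: bool  # supply reports input power present and no fault


def safe_to_move(psus: list[Psu], moving: str) -> bool:
    """True only if another healthy PSU keeps the server powered while `moving` is unplugged."""
    if moving not in {p.name for p in psus}:
        raise ValueError(f"{moving} is not one of this server's PSUs")
    return any(p.healthy for p in psus if p.name != moving)


gpu_server = [Psu("PSU-A", "PDU-A", True), Psu("PSU-B", "PDU-B", True)]
single_psu_server = [Psu("PSU-A", "PDU-A", True)]

print("GPU server, move PSU-A:", safe_to_move(gpu_server, "PSU-A"))
print("Single-PSU server, move PSU-A:", safe_to_move(single_psu_server, "PSU-A"))
```

The guard returns False for a single-PSU server and for a server whose second supply is faulted, which are the two cases where pulling the cable takes the machine down.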
Common Traps
- Moving cables without checking redundancy: If a server has a single PSU, moving its power cable kills it instantly.
- Ignoring the redundant PDU: Rebalancing PDU-A without checking PDU-B can create the same problem on the backup feed.
- Using nameplate ratings for capacity planning: Nameplate figures are a poor proxy for real draw. GPU servers can pull 30-50% more under burst loads than they do at typical load, so plan from measured values, including peaks.
- Phase mapping confusion: PDU outlet numbers do not always map to phases sequentially. Always consult the PDU manual or management interface for the phase-to-outlet mapping.
- Forgetting to remove power caps: BMC power caps left in place keep throttling GPU performance until someone removes them.
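
To make the phase-mapping trap concrete, here is a minimal sketch of an explicit outlet-to-phase lookup. The map contains only the outlets named in this write-up; every other outlet is deliberately left out so that the lookup fails instead of guessing. The function names are illustrative.

```python
from collections import Counter

# Outlet-to-phase assignments on PDU-A, limited to the outlets mentioned in this document.
# A real map should be copied from the PDU manual or management interface.
PHASE_MAP = {9: "L1", 13: "L2", 14: "L2", 15: "L2", 16: "L2", 21: "L3"}


def phase_of(outlet: int) -> str:
    """Look up an outlet's phase; refuse to guess for outlets missing from the map."""
    try:
        return PHASE_MAP[outlet]
    except KeyError:
        raise LookupError(f"outlet {outlet} is not in the phase map; check the PDU manual") from None


def phase_counts(outlets: list[int]) -> dict[str, int]:
    """Count how many of the given outlets land on each phase."""
    return dict(Counter(phase_of(o) for o in outlets))


print("As installed:   ", phase_counts([13, 14, 15, 16]))
print("After rebalance:", phase_counts([13, 9, 15, 21]))
```

Running a planned outlet list through a lookup like this before cabling would have shown all four GPU server cords landing on L2.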