Solution: PDU Overload Warning - Phase Imbalance

Triage

1. Assess immediate risk: 87% on L2 is above the critical threshold. If the GPU servers spike under full load, L2 could reach 95% or more and trip the breaker.
2. Pull per-outlet power readings from the PDU management interface:
   - Identify which outlets are on phase L2.
   - Sort devices by power draw on L2.
3. Cross-reference with the server inventory to identify:
   - The two new GPU servers and their outlet assignments.
   - Which servers are critical and which are deferrable.
4. Check PDU-B to understand the full redundancy picture:
   - If PDU-A L2 trips, does PDU-B L2 have enough headroom to absorb the shifted load?
   - At 78%, PDU-B L2 cannot safely absorb the full PDU-A L2 load: taking on PDU-A's 87% would push it to roughly 165% of its rating, assuming both PDUs have the same per-phase rating.
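
The triage steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming the per-outlet readings have already been exported from the PDU management interface. The `OutletReading` layout, the function names, the device names, and the sample amp values are all illustrative assumptions, not output from a real PDU. Only the outlet-to-phase pairs (9 on L1, 13 and 14 on L2, 21 on L3) and the 87% and 78% figures come from the notes above.

```python
from collections import defaultdict
from typing import NamedTuple


class OutletReading(NamedTuple):
    outlet: int
    phase: str    # "L1", "L2" or "L3", taken from the PDU's phase mapping
    device: str
    amps: float   # measured current on this outlet


def phase_totals(readings):
    """Sum the measured current per phase."""
    totals = defaultdict(float)
    for r in readings:
        totals[r.phase] += r.amps
    return dict(totals)


def devices_on_phase(readings, phase):
    """Readings on one phase, highest draw first."""
    return sorted(
        (r for r in readings if r.phase == phase),
        key=lambda r: r.amps,
        reverse=True,
    )


def failover_utilization(pct_failed, pct_surviving):
    """Utilization (%) of the surviving PDU phase if it absorbs the failed one.

    Assumes both PDUs share the same per-phase rating, so percentages add.
    """
    return pct_failed + pct_surviving


# Illustrative readings only; real values come from the PDU interface.
sample = [
    OutletReading(9, "L1", "storage-01", 2.5),
    OutletReading(13, "L2", "gpu-01", 3.5),
    OutletReading(14, "L2", "gpu-01", 3.0),
    OutletReading(21, "L3", "switch-01", 1.0),
]

print(phase_totals(sample))
for r in devices_on_phase(sample, "L2"):
    print(r.outlet, r.device, r.amps)
print(failover_utilization(87, 78))  # figures from the triage notes
```

Any result above 100 from `failover_utilization` means the surviving feed cannot carry the shifted load, which is the situation described for PDU-B L2.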

Root Cause

The two new GPU servers (Dell R750xa with A100 GPUs) were cabled to outlets 13-16 on PDU-A, which are all on phase L2. Each GPU server draws approximately 1400W (6.7A at 208V) at typical load, adding 13.4A to L2. The original phase balance was roughly even at ~21A per phase. The installation team did not consult the PDU phase mapping when cabling the new servers.
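
The arithmetic behind these figures can be checked with a short calculation. It is a sketch under one loud assumption: the per-phase rating is not stated here, so 40A is inferred from the later "~31A (77%)" figure and may not match the actual PDU. Power factor is taken as 1 for simplicity.

```python
VOLTS = 208.0                # supply voltage from the text
WATTS_PER_SERVER = 1400.0    # typical draw per GPU server, from the text
BASELINE_AMPS = 21.0         # approximate per-phase load before the install
ASSUMED_RATING_AMPS = 40.0   # ASSUMPTION: not stated in the text


def amps(watts, volts=VOLTS):
    """Current for a given power draw, I = P / V (power factor taken as 1)."""
    return watts / volts


per_server = amps(WATTS_PER_SERVER)      # about 6.7 A
added_to_l2 = 2 * per_server             # both servers landed on L2
l2_after = BASELINE_AMPS + added_to_l2
l2_pct = 100 * l2_after / ASSUMED_RATING_AMPS

print(f"per server: {per_server:.1f} A")
print(f"L2 after install: {l2_after:.1f} A ({l2_pct:.0f}% of assumed rating)")
```

Under the assumed rating this comes out at roughly 34.5A, or about 86%, which is close to the 87% reading given that the 21A baseline is itself approximate.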

Fix

1. Immediate (within 1 hour): Apply BMC power capping on both GPU servers:
   - Set the power cap to 1000W per server via iDRAC: `racadm set System.Power.Cap.Watts 1000`
   - Confirm the cap is reported as active afterwards; on some configurations capping must be enabled separately before the value takes effect.
   - This reduces L2 draw by approximately 3.8A (2 × 400W at 208V), bringing it to ~31A (77%).
2. Short-term (within 24 hours): Rebalance power connections:
   - Move one GPU server's PSU-A cable from outlet 14 (L2) to outlet 9 (L1).
   - Move the other GPU server's PSU-A cable from outlet 16 (L2) to outlet 21 (L3).
   - Target: L1 ~27A, L2 ~25A, L3 ~28A, all phases at or below roughly 70%.
3. Verify after rebalancing:
   - Monitor PDU readings for 30 minutes under normal load.
   - Confirm all three phases are at or below the 70% threshold.
   - Remove the power caps from the GPU servers.
   - Monitor again during peak GPU workload.
4. Long-term: Update the rack power plan:
   - Document the actual power draw for the GPU server model.
   - Create a phase-aware outlet assignment map for the rack.
   - Set policy: new installs require a capacity review before cabling.
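
A phase-aware outlet map and a pre-cabling capacity review could look something like the sketch below. It is illustrative only: `review_install` and `least_loaded_outlet` are hypothetical helpers, the 40A per-phase rating is an assumption (the rating is not stated here), and the map contains only the outlet-to-phase pairs mentioned in this document. It also treats each device as a single-cable load, which simplifies servers with several PSUs.

```python
# Outlet-to-phase entries mentioned in the text; a real map covers every outlet.
PHASE_MAP = {9: "L1", 13: "L2", 14: "L2", 15: "L2", 16: "L2", 21: "L3"}

ASSUMED_RATING_AMPS = 40.0   # ASSUMPTION: per-phase rating, not stated in the text
LIMIT_PCT = 70.0             # target ceiling used in the fix steps


def review_install(phase_loads, outlet, device_amps,
                   phase_map=PHASE_MAP, rating=ASSUMED_RATING_AMPS,
                   limit_pct=LIMIT_PCT):
    """Return (phase, projected_pct, ok) for plugging a device into an outlet."""
    phase = phase_map[outlet]
    projected_pct = 100.0 * (phase_loads[phase] + device_amps) / rating
    return phase, projected_pct, projected_pct <= limit_pct


def least_loaded_outlet(phase_loads, free_outlets, phase_map=PHASE_MAP):
    """Pick the free outlet whose phase currently carries the least load."""
    return min(free_outlets, key=lambda o: phase_loads[phase_map[o]])


# One GPU server (about 6.7 A) is already on L2; where should the second go?
loads = {"L1": 21.0, "L2": 27.7, "L3": 21.0}
print(review_install(loads, outlet=14, device_amps=6.7))
print(least_loaded_outlet(loads, free_outlets=[14, 9, 21]))
```

With these inputs the review rejects outlet 14, because L2 would land near 86% of the assumed rating, and the helper suggests an outlet on a less loaded phase instead. A check like this, run before cabling, would have flagged the original installation.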

Rollback / Safety

- Power capping does not interrupt running workloads and can be reversed immediately, although GPU performance is throttled while the cap is active.
- When moving power cables, ensure the server has redundant PSUs and that the other PSU stays connected.
- Never move power cables under load without confirming that the redundant power path is active.
- If a breaker trips during rebalancing, the redundant PDU should keep servers running. Verify beforehand that PDU-B can handle the momentary full load.
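
The redundancy check before moving a cable can be expressed as a simple guard. This is a sketch, not a BMC integration: `PsuState` and `safe_to_unplug` are hypothetical names, and the health flags are assumed to have been read from the server's BMC or front-panel indicators beforehand.

```python
from dataclasses import dataclass


@dataclass
class PsuState:
    name: str       # e.g. "PSU-A"
    healthy: bool   # PSU reports OK and has input power


def safe_to_unplug(psus, target):
    """True if another healthy PSU will keep the server powered."""
    if not any(p.name == target for p in psus):
        raise ValueError(f"unknown PSU: {target}")
    return any(p.healthy for p in psus if p.name != target)


redundant = [PsuState("PSU-A", True), PsuState("PSU-B", True)]
single = [PsuState("PSU-A", True)]

print(safe_to_unplug(redundant, "PSU-A"))
print(safe_to_unplug(single, "PSU-A"))
```

The guard refuses the move for a single-PSU server and for a server whose second PSU is present but not healthy, which are the two cases where pulling the cable would take the server down.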

Common Traps

- Moving cables without checking redundancy: If a server has a single PSU, moving its power cable kills it instantly.
- Ignoring the redundant PDU: Rebalancing PDU-A without checking PDU-B can create the same problem on the backup feed.
- Using nameplate ratings for capacity planning: GPU servers can draw 30-50% more than nameplate under burst loads. Use measured values.
- Phase mapping confusion: PDU outlet numbers do not always map to phases sequentially. Always consult the PDU manual or management interface for the phase-to-outlet mapping.
- Forgetting to remove power caps: A BMC power cap left in place keeps throttling GPU performance until someone removes it.