Solution: Power Supply Redundancy Lost
Triage
- Check PSU status via BMC:
- Verify the remaining PSU has adequate capacity:
  - Current draw: ~520W
  - PSU 1 rated: 800W
  - Headroom: 280W (35%) -- adequate for current workload but no burst margin for CPU-intensive spikes.
- Check if the issue is the PSU itself or the power feed:
  - Request remote hands or facility team to verify the power cable at PDU B outlet is connected and the breaker is not tripped.
  - Check PSU 2 LED status (amber = internal fault).
- Assess workload risk:
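The capacity check in the triage steps is simple arithmetic, and it can be worth scripting so the same calculation is applied to every server. The following is a minimal sketch, not part of the original runbook: the function name `psu_headroom` is hypothetical, and the inputs are the figures quoted above (~520W draw, 800W rated).

```python
def psu_headroom(draw_watts: float, rated_watts: float) -> tuple[float, float]:
    """Return (headroom in watts, headroom as a percentage of rated capacity)."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    headroom = rated_watts - draw_watts
    return headroom, 100.0 * headroom / rated_watts


# Figures from this incident: ~520W draw on the remaining 800W PSU.
watts, percent = psu_headroom(520, 800)
print(f"Headroom: {watts:.0f}W ({percent:.0f}%)")  # Headroom: 280W (35%)
```

A negative headroom value would mean the remaining PSU is already over its rating, which makes workload migration urgent rather than precautionary.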
Root Cause
PSU 2 has an internal fault (most likely a capacitor or MOSFET failure). The BMC reports "Power Supply 2 Failed" with an internal fault code, and the amber LED confirms the unit is not functioning. The power feed from PDU B is healthy (verified by other servers on the same PDU), which rules out an upstream power issue.
The server continues to operate on PSU 1 alone, which is within its rated capacity for the current load. However, redundancy is lost -- any issue with PSU 1, PDU A, or the A-feed circuit would cause an immediate server shutdown.
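The reasoning behind this root cause (rule out the cable, rule out the feed, then trust the LED) can be written as a small decision function. This is an illustrative sketch only: `classify_psu_alert` and its three boolean inputs are hypothetical names for observations gathered by remote hands or the facility team, not fields exposed by any BMC.

```python
def classify_psu_alert(led_amber: bool, cable_seated: bool, feed_healthy: bool) -> str:
    """Classify a lost-redundancy alert from three on-site observations."""
    # Check the cheapest-to-fix causes first: cabling, then the upstream feed.
    if not cable_seated:
        return "loose cable"
    if not feed_healthy:
        return "power feed"
    # Cable and feed are fine, so an amber LED points at the PSU itself.
    if led_amber:
        return "psu internal fault"
    return "inconclusive"


# This incident: amber LED, cable connected, PDU B healthy for other servers.
print(classify_psu_alert(led_amber=True, cable_seated=True, feed_healthy=True))
```

The ordering is deliberate: a cable or feed problem is checked before the PSU is blamed, so a healthy unit is not replaced unnecessarily.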
Fix
- Immediate risk mitigation (if replacement is not available same-day): cordon and drain the node. This ensures that if PSU 1 fails, workloads have already been moved.
- Attempt PSU reseat first (remote hands or on-site):
  - Reseat PSU 2: pull it out, wait 10 seconds, re-insert firmly.
  - Check if the LED goes green and the BMC clears the fault.
  - If the fault persists after reseat, the PSU is genuinely failed.
- Replace the PSU (hot-swappable on DL380 Gen10):
  - HPE DL380 PSUs are hot-swap. No server downtime required.
  - Pull the failed PSU 2 from the rear bay.
  - Insert the replacement PSU.
  - Reconnect the power cable to PDU B.
  - Wait 30 seconds for the BMC to detect and initialize.
- Verify recovery: confirm the PSU 2 LED is green and the BMC has cleared the fault.
- Uncordon the node once redundancy is restored.
- Initiate RMA for the failed PSU if under warranty:
  - Record: PSU model, serial number, server serial, and failure date.
  - HPE ProSupport: call support with server serial for next-business-day part.
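The node-level steps in this procedure (move workloads off before the swap, return the node afterwards) can be sketched as a small command builder. This is a hedged example, not the runbook's original commands: the function names and the node name `node-01` are hypothetical, and the drain flags shown (`--ignore-daemonsets`, `--delete-emptydir-data`) are common choices that may not suit every cluster. The script only prints the commands; it does not run them.

```python
import shlex


def mitigation_commands(node: str) -> list[list[str]]:
    """Commands to review PDBs, then cordon and drain the node before the swap."""
    return [
        # List pod disruption budgets first so a drain does not stall unexpectedly.
        ["kubectl", "get", "pdb", "--all-namespaces"],
        ["kubectl", "cordon", node],
        # Flags are an assumption; adjust to the workloads on the node.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


def recovery_commands(node: str) -> list[list[str]]:
    """Command to return the node to service once redundancy is restored."""
    return [["kubectl", "uncordon", node]]


if __name__ == "__main__":
    node = "node-01"  # hypothetical node name
    for cmd in mitigation_commands(node) + recovery_commands(node):
        # Print for operator review rather than executing directly.
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps a human in the loop, which fits a situation where the server has no power redundancy and a mistaken drain could add to the disruption.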
Rollback / Safety
- PSU hot-swap is non-disruptive. The server runs on PSU 1 during the entire replacement.
- If the replacement PSU also fails immediately, check the PDU B outlet for voltage issues (request facility team to verify).
- Do NOT pull PSU 1 while PSU 2 is failed -- the server will power off immediately.
- If both PSUs are suspect, schedule a full power cycle during a maintenance window to test both.
Common Traps
- Trap: Ignoring the alert because "the server is still running." Without redundancy, a single additional failure = unplanned downtime.
- Trap: Not verifying the power cable/PDU before declaring the PSU failed. A loose cable or tripped PDU breaker looks identical to a PSU fault from the BMC's perspective.
- Trap: Draining a Kubernetes node without checking pod disruption budgets. Use kubectl drain with appropriate flags and check PDBs first.
- Trap: Not ordering a spare after using the last spare PSU. Maintain spare inventory for common parts.
- Trap: Assuming the replacement PSU is the exact same model. HPE uses different PSU models for different wattage ratings -- verify the part number matches (e.g., 865414-B21 for 800W).