Solution: Power Supply Redundancy Lost

Triage

  1. Check PSU status via the BMC:

    sudo ipmitool sensor list | grep -iE "ps|power"
    sudo ipmitool sdr list | grep -iE "ps|power"

  2. Verify the remaining PSU has adequate capacity:

    • Current draw: ~520W
    • PSU 1 rated: 800W
    • Headroom: 280W (35%) -- adequate for the current workload, but no burst margin for CPU-intensive spikes.

  3. Determine whether the fault is in the PSU itself or the power feed:

    • Ask remote hands or the facility team to verify that the power cable at the PDU B outlet is seated and the breaker is not tripped.
    • Check the PSU 2 LED status (amber = internal fault).

  4. Assess workload risk:

    kubectl get pods -A --no-headers --field-selector spec.nodeName=k8s-worker-17 | wc -l
    kubectl describe node k8s-worker-17 | grep -A5 "Allocated resources"
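The capacity math in step 2 can be scripted so the check is repeatable across incidents. A minimal sketch: the wattages below are hard-coded from this incident (in practice they would come from the BMC power sensor and the PSU data sheet), and the 40% threshold is an assumed local policy, not vendor guidance.

```shell
#!/bin/sh
# Assumed inputs: in practice, read the draw from `ipmitool sensor list`
# and the rating from the PSU label or data sheet.
PSU_RATED_W=800      # surviving PSU's rated capacity
CURRENT_DRAW_W=520   # observed draw reported by the BMC

HEADROOM_W=$((PSU_RATED_W - CURRENT_DRAW_W))
HEADROOM_PCT=$((HEADROOM_W * 100 / PSU_RATED_W))

echo "Headroom: ${HEADROOM_W}W (${HEADROOM_PCT}%)"

# Assumed policy threshold: flag when burst margin is thin.
if [ "$HEADROOM_PCT" -lt 40 ]; then
    echo "WARNING: limited burst margin on the surviving PSU"
fi
```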

Root Cause

PSU 2 has experienced an internal fault (capacitor or MOSFET failure). The BMC reports "Power Supply 2 Failed" with an internal fault code, and the amber LED confirms the unit is not functioning. The power feed from PDU B is healthy (verified by other servers on the same PDU), ruling out an upstream power issue.

The server continues to operate on PSU 1 alone, which is within its rated capacity for the current load. However, redundancy is lost -- any issue with PSU 1, PDU A, or the A-feed circuit would cause an immediate server shutdown.
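The PSU-fault-versus-feed-fault distinction above can be read straight from the SEL text. A hedged sketch of that classification: the sample entry is illustrative only (real entries come from `ipmitool sel list`, and exact wording varies by BMC firmware).

```shell
#!/bin/sh
# Illustrative SEL entry; replace with a real line from `ipmitool sel list`.
sel_entry='Power Supply PS2 Status | Failure detected | Asserted'

# "AC lost" points upstream (cable/PDU/breaker); "Failure detected"
# points at the PSU itself.
case "$sel_entry" in
  *"AC lost"*)          verdict="feed problem: check PDU outlet and cable" ;;
  *"Failure detected"*) verdict="internal PSU fault: reseat, then replace" ;;
  *)                    verdict="no PSU fault in this entry" ;;
esac
echo "$verdict"
```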

Fix

  1. Immediate risk mitigation (if a replacement is not available same-day):

    # Cordon the node to prevent new pods from scheduling
    kubectl cordon k8s-worker-17

    # Optionally drain the node if risk tolerance is low
    kubectl drain k8s-worker-17 --ignore-daemonsets --delete-emptydir-data

    This ensures that if PSU 1 fails, workloads have already been moved.

  2. Attempt a PSU reseat first (remote hands or on-site):

    • Power cycle PSU 2: pull it out, wait 10 seconds, and re-insert it firmly.
    • Check whether the LED turns green and the BMC clears the fault.
    • If the fault persists after the reseat, the PSU has genuinely failed.

  3. Replace the PSU (hot-swappable on the DL380 Gen10):

    • HPE DL380 PSUs are hot-swap; no server downtime is required.
    • Pull the failed PSU 2 from the rear bay.
    • Insert the replacement PSU.
    • Reconnect the power cable to PDU B.
    • Wait 30 seconds for the BMC to detect and initialize the new unit.

  4. Verify recovery:

    sudo ipmitool sensor list | grep -i "PS"
    sudo ipmitool sel list | tail -5
    # Should show: Power Supply 2 Presence detected, Redundancy Regained

  5. Uncordon the node:

    kubectl uncordon k8s-worker-17

  6. Initiate an RMA for the failed PSU if it is under warranty:

    • Record the PSU model, serial number, server serial number, and failure date.
    • HPE ProSupport: call support with the server serial number for next-business-day parts dispatch.
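The recovery check in step 4 can be wrapped in a poll-with-timeout loop instead of a one-shot grep. A sketch under stated assumptions: `read_psu_status` is a hypothetical stub standing in for `ipmitool sdr list | grep -i redundancy` so the sketch runs as-is; the "Redundant" match and 12x5s timeout are illustrative choices, not vendor-documented behavior.

```shell
#!/bin/sh
# Hypothetical stub: in production, replace the body with
#   sudo ipmitool sdr list | grep -i redundancy
read_psu_status() { echo "Power Supplies | Fully Redundant"; }

tries=0
until read_psu_status | grep -q "Redundant"; do
  tries=$((tries + 1))
  if [ "$tries" -ge 12 ]; then
    echo "timed out waiting for redundancy; re-check the PSU and feed"
    break
  fi
  sleep 5  # give the BMC time to detect and initialize the new PSU
done
echo "checked after $tries retries"
```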

Rollback / Safety

  • PSU hot-swap is non-disruptive. The server runs on PSU 1 during the entire replacement.
  • If the replacement PSU also fails immediately, check the PDU B outlet for voltage issues (request facility team to verify).
  • Do NOT pull PSU 1 while PSU 2 is failed -- the server will power off immediately.
  • If both PSUs are suspect, schedule a full power cycle during a maintenance window to test both.

Common Traps

  • Trap: Ignoring the alert because "the server is still running." Without redundancy, a single additional failure = unplanned downtime.
  • Trap: Not verifying the power cable/PDU before declaring the PSU failed. A loose cable or tripped PDU breaker looks identical to a PSU fault from the BMC's perspective.
  • Trap: Draining a Kubernetes node without checking pod disruption budgets. Use kubectl drain with appropriate flags and check PDBs first.
  • Trap: Not ordering a spare after using the last spare PSU. Maintain spare inventory for common parts.
  • Trap: Assuming the replacement PSU is the exact same model. HPE uses different PSU models for different wattage ratings -- verify the part number matches (e.g., 865414-B21 for 800W).
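The PDB trap above is scriptable: before draining, confirm that no PodDisruptionBudget currently allows zero disruptions. A minimal sketch: the sample text mimics `kubectl get pdb -A --no-headers` output (NAMESPACE, NAME, MIN-AVAILABLE, MAX-UNAVAILABLE, ALLOWED-DISRUPTIONS, AGE) and is hard-coded here, with hypothetical PDB names, so the sketch runs without a cluster.

```shell
#!/bin/sh
# Hypothetical sample; in practice:
#   pdb_sample=$(kubectl get pdb -A --no-headers)
pdb_sample='default   web-pdb   2   N/A   1   5d
default   db-pdb    1   N/A   0   5d'

# Column 5 is ALLOWED DISRUPTIONS; zero means a drain would stall or
# violate the budget.
blocked=$(echo "$pdb_sample" | awk '$5 == 0 { print $2 }')
if [ -n "$blocked" ]; then
  echo "drain would violate PDBs: $blocked"
fi
```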