Solution: Power Supply Redundancy Lost

Triage

  1. Check PSU status via the BMC:

    sudo ipmitool sensor list | grep -iE "ps|power"
    sudo ipmitool sdr list | grep -iE "ps|power"

  2. Verify the remaining PSU has adequate capacity:

    • Current draw: ~520W
    • PSU 1 rated: 800W
    • Headroom: 280W (35%) -- adequate for the current workload, but no burst margin for CPU-intensive spikes.

  3. Determine whether the fault is in the PSU itself or the power feed:

    • Ask remote hands or the facility team to verify that the power cable at the PDU B outlet is seated and the breaker is not tripped.
    • Check the PSU 2 LED status (amber = internal fault).

  4. Assess workload risk:

    kubectl get pods -A --no-headers --field-selector spec.nodeName=k8s-worker-17 | wc -l
    kubectl describe node k8s-worker-17 | grep -A5 "Allocated resources"
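The capacity math in step 2 can be scripted so the check is repeatable across incidents. A minimal sketch: the wattages below are hard-coded from this incident (in practice they would come from the BMC power sensor and the PSU data sheet), and the 40% threshold is an assumed local policy, not vendor guidance.

```shell
#!/bin/sh
# Assumed inputs: in practice, read the draw from `ipmitool sensor list`
# and the rating from the PSU label or data sheet.
PSU_RATED_W=800      # surviving PSU's rated capacity
CURRENT_DRAW_W=520   # observed draw reported by the BMC

HEADROOM_W=$((PSU_RATED_W - CURRENT_DRAW_W))
HEADROOM_PCT=$((HEADROOM_W * 100 / PSU_RATED_W))

echo "Headroom: ${HEADROOM_W}W (${HEADROOM_PCT}%)"

# Assumed policy threshold: flag when burst margin is thin.
if [ "$HEADROOM_PCT" -lt 40 ]; then
    echo "WARNING: limited burst margin on the surviving PSU"
fi
```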

Root Cause

PSU 2 has experienced an internal fault (capacitor or MOSFET failure). The BMC reports "Power Supply 2 Failed" with an internal fault code, and the amber LED confirms the unit is not functioning. The power feed from PDU B is healthy (verified by other servers on the same PDU), ruling out an upstream power issue.

The server continues to operate on PSU 1 alone, which is within its rated capacity for the current load. However, redundancy is lost -- any issue with PSU 1, PDU A, or the A-feed circuit would cause an immediate server shutdown.
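The PSU-fault-versus-feed-fault distinction above can be read straight from the SEL text. A hedged sketch of that classification: the sample entry is illustrative only (real entries come from `ipmitool sel list`, and exact wording varies by BMC firmware).

```shell
#!/bin/sh
# Illustrative SEL entry; replace with a real line from `ipmitool sel list`.
sel_entry='Power Supply PS2 Status | Failure detected | Asserted'

# "AC lost" points upstream (cable/PDU/breaker); "Failure detected"
# points at the PSU itself.
case "$sel_entry" in
  *"AC lost"*)          verdict="feed problem: check PDU outlet and cable" ;;
  *"Failure detected"*) verdict="internal PSU fault: reseat, then replace" ;;
  *)                    verdict="no PSU fault in this entry" ;;
esac
echo "$verdict"
```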

Fix

  1. Immediate risk mitigation (if a replacement is not available same-day):

    # Cordon the node to prevent new pods from scheduling
    kubectl cordon k8s-worker-17

    # Optionally drain the node if risk tolerance is low
    kubectl drain k8s-worker-17 --ignore-daemonsets --delete-emptydir-data

    This ensures that if PSU 1 fails, workloads have already been moved.

  2. Attempt a PSU reseat first (remote hands or on-site):

    • Power cycle PSU 2: pull it out, wait 10 seconds, and re-insert it firmly.
    • Check whether the LED turns green and the BMC clears the fault.
    • If the fault persists after the reseat, the PSU has genuinely failed.

  3. Replace the PSU (hot-swappable on the DL380 Gen10):

    • HPE DL380 PSUs are hot-swap; no server downtime is required.
    • Pull the failed PSU 2 from the rear bay.
    • Insert the replacement PSU.
    • Reconnect the power cable to PDU B.
    • Wait 30 seconds for the BMC to detect and initialize the new unit.

  4. Verify recovery:

    sudo ipmitool sensor list | grep -i "PS"
    sudo ipmitool sel list | tail -5
    # Should show: Power Supply 2 Presence detected, Redundancy Regained

  5. Uncordon the node:

    kubectl uncordon k8s-worker-17

  6. Initiate an RMA for the failed PSU if it is under warranty:

    • Record the PSU model, serial number, server serial number, and failure date.
    • HPE ProSupport: call support with the server serial number for next-business-day parts dispatch.
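The recovery check in step 4 can be wrapped in a poll-with-timeout loop instead of a one-shot grep. A sketch under stated assumptions: `read_psu_status` is a hypothetical stub standing in for `ipmitool sdr list | grep -i redundancy` so the sketch runs as-is; the "Redundant" match and 12x5s timeout are illustrative choices, not vendor-documented behavior.

```shell
#!/bin/sh
# Hypothetical stub: in production, replace the body with
#   sudo ipmitool sdr list | grep -i redundancy
read_psu_status() { echo "Power Supplies | Fully Redundant"; }

tries=0
until read_psu_status | grep -q "Redundant"; do
  tries=$((tries + 1))
  if [ "$tries" -ge 12 ]; then
    echo "timed out waiting for redundancy; re-check the PSU and feed"
    break
  fi
  sleep 5  # give the BMC time to detect and initialize the new PSU
done
echo "checked after $tries retries"
```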

Rollback / Safety

  • PSU hot-swap is non-disruptive. The server runs on PSU 1 during the entire replacement.
  • If the replacement PSU also fails immediately, check the PDU B outlet for voltage issues (request facility team to verify).
  • Do NOT pull PSU 1 while PSU 2 is failed -- the server will power off immediately.
  • If both PSUs are suspect, schedule a full power cycle during a maintenance window to test both.

Common Traps

  • Trap: Ignoring the alert because "the server is still running." Without redundancy, a single additional failure = unplanned downtime.
  • Trap: Not verifying the power cable/PDU before declaring the PSU failed. A loose cable or tripped PDU breaker looks identical to a PSU fault from the BMC's perspective.
  • Trap: Draining a Kubernetes node without checking pod disruption budgets. Use kubectl drain with appropriate flags and check PDBs first.
  • Trap: Not ordering a spare after using the last spare PSU. Maintain spare inventory for common parts.
  • Trap: Assuming the replacement PSU is the exact same model. HPE uses different PSU models for different wattage ratings -- verify the part number matches (e.g., 865414-B21 for 800W).
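The PDB trap above is scriptable: before draining, confirm that no PodDisruptionBudget currently allows zero disruptions. A minimal sketch: the sample text mimics `kubectl get pdb -A --no-headers` output (NAMESPACE, NAME, MIN-AVAILABLE, MAX-UNAVAILABLE, ALLOWED-DISRUPTIONS, AGE) and is hard-coded here, with hypothetical PDB names, so the sketch runs without a cluster.

```shell
#!/bin/sh
# Hypothetical sample; in practice:
#   pdb_sample=$(kubectl get pdb -A --no-headers)
pdb_sample='default   web-pdb   2   N/A   1   5d
default   db-pdb    1   N/A   0   5d'

# Column 5 is ALLOWED DISRUPTIONS; zero means a drain would stall or
# violate the budget.
blocked=$(echo "$pdb_sample" | awk '$5 == 0 { print $2 }')
if [ -n "$blocked" ]; then
  echo "drain would violate PDBs: $blocked"
fi
```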