Grading Checklist: Power Supply Redundancy Lost

A good response must include:

  • Verified PSU status via BMC (ipmitool sensor list, ipmitool sdr list)
  • Confirmed the remaining PSU can carry the full load on its own (520 W draw against 800 W capacity)
  • Checked the power cable and PDU outlet for PSU 2 (rule out cable/PDU issues)
  • Assessed the risk: with redundancy lost, any fault on PSU 1 means downtime
  • Considered draining workloads from the Kubernetes node as a precaution
  • Checked warranty status and initiated replacement process
  • Verified PSUs are hot-swappable (they are on the DL380 Gen10)
  • Planned the PSU replacement -- can be done live without downtime
  • Attempted reseating PSU 2 before declaring it failed
  • Documented the failure for asset management and trending
  • Assessed whether other servers with same-age PSUs should be proactively checked
  • Verified that after replacement, redundancy is restored and alerts clear
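
The load check and the BMC status check in the list above can be sketched in Python. This is a minimal illustration, not part of the grading criteria. The function names, the 80% utilization ceiling, and the sample sensor rows are all assumptions made for the example; the parser only assumes pipe-delimited rows with the status in the third column, as printed by ipmitool sdr list, and real sensor names vary by vendor.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HeadroomReport:
    load_w: float
    capacity_w: float
    utilization: float   # load / capacity, as a fraction
    headroom_w: float    # watts left on the surviving PSU
    within_limit: bool   # True if utilization <= the chosen ceiling


def single_psu_headroom(load_w: float, capacity_w: float,
                        max_utilization: float = 0.8) -> HeadroomReport:
    """Check whether one PSU can carry the whole load.

    max_utilization is an assumed safety ceiling, not a vendor figure.
    """
    if capacity_w <= 0:
        raise ValueError("capacity_w must be positive")
    utilization = load_w / capacity_w
    return HeadroomReport(
        load_w=load_w,
        capacity_w=capacity_w,
        utilization=utilization,
        headroom_w=capacity_w - load_w,
        within_limit=utilization <= max_utilization,
    )


def parse_sdr_status(text: str) -> Dict[str, str]:
    """Map sensor name -> status from pipe-delimited sensor rows."""
    statuses: Dict[str, str] = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 3 and fields[0]:
            statuses[fields[0]] = fields[2]
    return statuses


# Hypothetical sample rows; actual sensor names depend on the platform.
SAMPLE_SDR = """\
PS 1 Status | 0x00 | ok
PS 2 Status | 0x00 | cr
"""

report = single_psu_headroom(520, 800)
print(f"{report.utilization:.0%} utilized, {report.headroom_w:.0f} W headroom")

not_ok = {name: s for name, s in parse_sdr_status(SAMPLE_SDR).items() if s != "ok"}
print("Sensors needing attention:", not_ok)
```

With the figures from the checklist (520 W on an 800 W supply) the surviving PSU sits at 65% utilization with 280 W of headroom, which is under the assumed ceiling. In practice the sensor text would come from the BMC rather than from a hard-coded sample.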