Grading Checklist: Power Supply Redundancy Lost
A good response must include:
- Verified PSU status via BMC (`ipmitool sensor list`, `ipmitool sdr list`)
- Confirmed the remaining PSU can handle the full load (520W < 800W capacity)
- Checked the power cable and PDU outlet for PSU 2 (rule out cable/PDU issues)
- Assessed the risk: with redundancy lost, any fault on PSU 1 takes the server down
- Considered draining workloads from the Kubernetes node as a precaution
- Checked warranty status and initiated replacement process
- Verified PSUs are hot-swappable (they are on DL380 Gen10)
- Planned the PSU replacement, which can be done live without downtime because the PSUs are hot-swappable
- Attempted reseating PSU 2 before declaring it failed
- Documented the failure for asset management and trending
- Assessed whether other servers with same-age PSUs should be proactively checked
- Verified that after replacement, redundancy is restored and alerts clear
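
The load check in the checklist (520W against an 800W PSU) can be sketched as a small calculation. This is a minimal illustration, not part of the original checklist: the function names and the 80% utilization threshold are hypothetical assumptions, and the wattage figures are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class Headroom:
    """Load carried by a single PSU versus its rated capacity, in watts."""
    load_w: float
    capacity_w: float

    @property
    def utilization(self) -> float:
        # Fraction of the rated capacity currently in use.
        return self.load_w / self.capacity_w

    @property
    def margin_w(self) -> float:
        # Watts left before the PSU reaches its rated capacity.
        return self.capacity_w - self.load_w


def single_psu_headroom(load_w: float, capacity_w: float) -> Headroom:
    """Validate the inputs and return the headroom for one PSU."""
    if capacity_w <= 0:
        raise ValueError("capacity_w must be positive")
    if load_w < 0:
        raise ValueError("load_w must not be negative")
    return Headroom(load_w=load_w, capacity_w=capacity_w)


def can_carry_load(load_w: float, capacity_w: float,
                   max_utilization: float = 0.8) -> bool:
    """True if the load stays at or below the chosen utilization ceiling.

    The 0.8 default is an assumed safety margin, not a vendor figure.
    """
    return single_psu_headroom(load_w, capacity_w).utilization <= max_utilization


if __name__ == "__main__":
    h = single_psu_headroom(520, 800)  # figures from the checklist
    print(f"utilization: {h.utilization:.0%}, margin: {h.margin_w:.0f} W")
    print("within assumed ceiling:", can_carry_load(520, 800))
```

With the checklist figures, the remaining PSU runs at 65% of its rating with 280W of margin, which is why a single-PSU state is survivable but still unprotected against a second fault.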