Grading Checklist: Thermal Throttle - Fan Failure
A good response must include:
- Checked BMC sensor data (`ipmitool sensor list`) for fan RPM and CPU temperatures
- Identified the specific failed fan and its location in the server chassis
- Verified CPU throttling by checking current frequency vs. base frequency
- Assessed proximity to thermal shutdown threshold (typically 95-100 °C)
- Determined if the fan module is hot-swappable (most modern servers support this)
- Recommended reducing workload or migrating jobs before replacement if temps are critical
- Identified spare fan availability and ordered replacement if needed
- Checked remaining fans for increased RPM (compensation behavior)
- Considered physical inspection for dust, blanking panels, or airflow obstruction
- Proposed improved alert handling (the original fan alert was ignored for 2 days)
- Documented the fan failure for asset management and warranty tracking
- Recommended proactive thermal monitoring with escalation procedures