Grading Checklist: Thermal Throttle - Fan Failure

A good response must include:

  • Checked BMC sensor data (ipmitool sensor list) for fan RPM and CPU temperatures
  • Identified the specific failed fan and its location in the server chassis
  • Verified CPU throttling by checking current frequency vs. base frequency
  • Assessed proximity to thermal shutdown threshold (typically 95-100 °C)
  • Determined if the fan module is hot-swappable (most modern servers support this)
  • Recommended reducing workload or migrating jobs before replacement if temps are critical
  • Identified spare fan availability and ordered replacement if needed
  • Checked remaining fans for increased RPM (compensation behavior)
  • Considered physical inspection for dust, blanking panels, or airflow obstruction
  • Proposed improved alert handling (the original fan alert was ignored for 2 days)
  • Documented the fan failure for asset management and warranty tracking
  • Recommended proactive thermal monitoring with escalation procedures
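
As an illustration of the sensor check and shutdown-proximity items above, here is a minimal Python sketch. It assumes the pipe-separated layout printed by `ipmitool sensor list` (name, reading, unit, status, then thresholds). The sensor names, the readings in SAMPLE, the 500 RPM floor, and the 95 °C shutdown value are hypothetical; real names and thresholds vary by vendor and should be taken from the BMC itself.

```python
# Hypothetical sample in the pipe-separated layout of `ipmitool sensor list`.
# Sensor names and readings are invented for illustration only.
SAMPLE = """\
CPU1 Temp | 91.000   | degrees C | nc | na      | na      | na      | 85.000 | 95.000 | 100.000
FAN1      | 8900.000 | RPM       | ok | 300.000 | 500.000 | 700.000 | na     | na     | na
FAN2      | 0.000    | RPM       | cr | 300.000 | 500.000 | 700.000 | na     | na     | na
FAN3      | 9100.000 | RPM       | ok | 300.000 | 500.000 | 700.000 | na     | na     | na
"""


def parse_sensors(text):
    """Parse sensor rows into dicts with name, value, unit, and status."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        name, raw, unit, status = fields[:4]
        try:
            value = float(raw)
        except ValueError:
            value = None  # "na" or another non-numeric reading
        sensors.append({"name": name, "value": value, "unit": unit, "status": status})
    return sensors


def failed_fans(sensors, min_rpm=500.0):
    """Return names of fans reading below min_rpm (an assumed floor).

    A missing reading is also reported, although on some chassis it
    only means the fan slot is not populated.
    """
    return [
        s["name"]
        for s in sensors
        if s["unit"] == "RPM" and (s["value"] is None or s["value"] < min_rpm)
    ]


def thermal_headroom(sensors, shutdown_c=95.0):
    """Map each temperature sensor to degrees remaining before shutdown_c."""
    return {
        s["name"]: shutdown_c - s["value"]
        for s in sensors
        if s["unit"] == "degrees C" and s["value"] is not None
    }


if __name__ == "__main__":
    sensors = parse_sensors(SAMPLE)
    print("Failed fans:", failed_fans(sensors))
    for name, margin in thermal_headroom(sensors).items():
        print(f"{name}: {margin:.1f} C below the assumed shutdown threshold")
```

On a live host, the SAMPLE text could be replaced with the captured output of the real command, for example via `subprocess.run(["ipmitool", "sensor", "list"], capture_output=True, text=True).stdout`. Comparing the surviving fans' RPM against their usual baseline would cover the compensation check in the same pass.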