Solution: Thermal Throttle - Fan Failure

Triage

  1. Check all fan and temperature sensors immediately:

    ipmitool sensor list | grep -iE "fan|temp"
    

  2. Confirm CPU throttling:

    # Check current vs. max frequency
    grep "MHz" /proc/cpuinfo | head -4
    # Match any MHz line; some lscpu versions omit the "CPU MHz" line
    lscpu | grep -i "mhz"
    
    # Or use turbostat for detailed view
    turbostat --Summary --quiet sleep 1
    

  3. Check BMC event log for the timeline:

    ipmitool sel list | tail -20
    

  4. Assess urgency: if CPU temp is above 90C, consider immediate workload reduction.
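
The sensor check in step 1 can be narrowed to a filter that prints only the problem sensors. This is a minimal sketch, not a standard tool: `flag_sensors` is a hypothetical helper name, and it assumes the usual pipe-separated `ipmitool sensor list` layout (name, reading, unit, status). The 90C default mirrors the urgency threshold in step 4.

```shell
# Read `ipmitool sensor list` output on stdin and print one line per problem sensor.
# Assumed column layout: name | reading | unit | status | thresholds...
flag_sensors() {
    local temp_limit="${1:-90}"   # degrees C; assumed default taken from step 4
    awk -F'|' -v limit="$temp_limit" '
        {
            # Trim padding around the first four columns
            for (i = 1; i <= 4; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            name = $1; reading = $2; unit = $3; status = $4
        }
        reading == "na" { next }    # sensor absent or unreadable
        unit == "degrees C" && reading + 0 > limit {
            printf "HOT  %s %.0fC\n", name, reading
        }
        unit == "RPM" && (reading + 0 == 0 || status != "ok") {
            printf "FAN  %s %.0f RPM (%s)\n", name, reading, status
        }
    '
}
```

Typical use would be `ipmitool sensor list | flag_sensors 90`; empty output means no sensor crossed the limit.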

Root Cause

Fan 3 failed (bearing failure after 14 months of continuous operation). The remaining fans increased RPM to compensate but could not maintain adequate airflow across CPU 1's heatsink. CPU 1's temperature rose to 85C, triggering the processor's Thermal Control Circuit (TCC), which reduces clock frequency to prevent damage.

Thermal throttling explains why throughput dropped by ~50% while CPU utilization looked normal: the CPU was busy but running at a reduced clock speed.
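
The gap between "busy" and "fast" can be made visible by comparing each core's current frequency with its hardware maximum. The sketch below assumes the standard Linux cpufreq sysfs files (`scaling_cur_freq` and `cpuinfo_max_freq`, both in kHz). The function names and the 70% cutoff are illustrative assumptions, and idle cores also clock down, so the result only means something while the server is under load.

```shell
# Print current frequency as an integer percentage of the hardware maximum.
freq_percent() {
    local cur_khz="$1" max_khz="$2"
    echo $(( cur_khz * 100 / max_khz ))
}

# List CPUs running below min_pct of their maximum frequency.
# root defaults to the real sysfs tree; the 70% cutoff is an assumed value.
report_throttled() {
    local root="${1:-/sys/devices/system/cpu}" min_pct="${2:-70}"
    local dir cur max pct
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/scaling_cur_freq" ] || continue
        cur="$(cat "$dir/scaling_cur_freq")"
        max="$(cat "$dir/cpuinfo_max_freq")"
        pct="$(freq_percent "$cur" "$max")"
        if [ "$pct" -lt "$min_pct" ]; then
            echo "$(basename "$(dirname "$dir")") at ${pct}% of max"
        fi
    done
    return 0
}
```

Running `report_throttled` with no arguments on a loaded host lists the cores that are clocked well below their maximum, which is the pattern a ~50% throughput drop would produce.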

Fix

  1. Immediate: Reduce thermal load if temps are critical (>90C):

    # Optionally reduce CPU frequency cap temporarily
    cpupower frequency-set -u 2000MHz
    
    # Or migrate batch jobs to another server
    

  2. Replace the failed fan. Dell R740 fans are hot-swappable, so no shutdown is required:

      • Open the top cover (or front panel, depending on model).
      • Locate Fan 3 (labeled on the fan assembly).
      • Pull the release latch and remove the failed fan module.
      • Insert the replacement fan module until it clicks.
      • The BMC will detect the new fan within 30-60 seconds.

  3. Verify recovery:

    # Check fan RPM -- new fan should spin up
    ipmitool sensor list | grep -i fan
    
    # Monitor CPU temperature -- should start dropping within 2-3 minutes
    watch -n 5 'ipmitool sensor list | grep -i temp'
    
    # Verify CPU frequency returns to normal
    grep "MHz" /proc/cpuinfo | head -4
    

  4. Remove any temporary frequency cap:

    # lscpu reports MHz without a unit; cpupower assumes kHz unless one is given
    cpupower frequency-set -u "$(lscpu | awk '/CPU max MHz/ {printf "%dMHz", $NF}')"
    

  5. Clear BMC alerts after verification:

    ipmitool sel clear
    

  6. Physical inspection:

      • While the cover is open, inspect for dust accumulation on heatsinks and remaining fans.
      • Verify all blanking panels are installed in empty drive bays and PCIe slots.
      • Ensure cables are routed so they do not obstruct airflow.
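
The recovery check can also be scripted instead of watched by eye. The following is a rough sketch under stated assumptions: `max_temp` and `wait_for_cooldown` are hypothetical helper names, the 75C target is an assumed "recovered" value (the source only says throttling began at 85C), and the default of 36 checks at 5-second intervals covers the 2-3 minute window mentioned in the verification step.

```shell
# Print the highest "degrees C" reading from `ipmitool sensor list` output on stdin.
max_temp() {
    awk -F'|' '
        $3 ~ /degrees C/ && $2 + 0 > max { max = $2 + 0 }
        END { printf "%.0f\n", max }
    '
}

# Poll until the hottest sensor is at or below target_c, or give up after `tries` checks.
# read_cmd is the command that prints sensor output; it is overridable for dry runs.
wait_for_cooldown() {
    local target_c="${1:-75}" tries="${2:-36}" interval="${3:-5}"
    local read_cmd="${4:-ipmitool sensor list}"
    local i=1 t
    while [ "$i" -le "$tries" ]; do
        t="$($read_cmd | max_temp)"
        echo "check $i: hottest sensor ${t}C"
        if [ "$t" -le "$target_c" ]; then
            return 0    # cooled down
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 1            # still hot after all checks
}
```

For example, `wait_for_cooldown 75 36 5 && echo "temperatures recovered"` would block until the hottest sensor falls to the target, and a non-zero exit status signals that the replacement did not bring temperatures down in time.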

Rollback / Safety

  • Fan replacement is hot-swap and non-disruptive. If the replacement fan is also defective, the server will continue running (degraded) on remaining fans.
  • If CPU temperature exceeds 95C, initiate a graceful shutdown to prevent thermal damage.
  • Keep the server cover closed during operation, and minimize the time it is open during the fan swap -- modern servers are designed for directed airflow that requires the cover.
  • If no spare fan is available, reduce the workload to roughly 50% of normal to lower thermal output until the part arrives.
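
The temperature thresholds in this runbook (above 90C reduce load, above 95C shut down) can be expressed as a small decision helper for monitoring scripts. This is only an illustrative sketch: `thermal_action` and its output labels are made-up names, and it expects the temperature as a whole number of degrees C.

```shell
# Map a CPU temperature (whole degrees C) to the action this runbook recommends.
thermal_action() {
    local temp_c="$1"
    if [ "$temp_c" -gt 95 ]; then
        echo "shutdown"       # initiate a graceful shutdown
    elif [ "$temp_c" -gt 90 ]; then
        echo "reduce-load"    # cap CPU frequency or migrate batch jobs
    else
        echo "monitor"        # keep watching sensors
    fi
}

# Example (commented out so that sourcing this file has no side effects):
#   [ "$(thermal_action 96)" = "shutdown" ] && shutdown -h +1 "CPU over 95C"
```

Keeping the thresholds in one function means a change to the policy only has to be made in one place.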

Common Traps

  • Trap: Ignoring the initial fan alert. A failed fan is a P2 hardware issue that should be addressed within 24 hours, not ignored.
  • Trap: Looking at CPU utilization percentage instead of frequency. Throttled CPUs show high utilization but low throughput.
  • Trap: Running the server with the cover off "for better cooling." Server chassis are designed for directed airflow; removing the cover actually worsens cooling in most designs.
  • Trap: Not checking warranty status before ordering a replacement. Dell ProSupport covers fan modules and will ship next-business-day.
  • Trap: Forgetting to clear the BMC alert after the fix. A stale alert adds to alert fatigue and can mask future genuine issues.