Solution: Thermal Throttle - Fan Failure
Triage
- Check all fan and temperature sensors immediately.
- Confirm CPU throttling by checking clock frequency, not just utilization.
- Check the BMC event log for the timeline.
- Assess urgency: if CPU temp is above 90C, consider immediate workload reduction.
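The sensor check and urgency assessment above can be sketched as a small script. This is a minimal sketch, assuming `ipmitool` is installed and that `ipmitool sdr type Fan` and `ipmitool sdr type Temperature` print the usual five pipe-separated columns (name, sensor ID, status, entity, reading). Sensor names and layout vary by BMC firmware, and the helper names (`parse_sdr`, `failed_sensors`, `max_temp_c`, `read_sdr`) are hypothetical. The event log timeline can be read separately with `ipmitool sel list`.

```python
import subprocess

URGENT_TEMP_C = 90.0  # from the triage guidance: above 90C, reduce workload


def parse_sdr(text):
    """Split pipe-separated sensor rows into (name, status, reading) tuples."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 5:  # assumed layout: name | id | status | entity | reading
            rows.append((fields[0], fields[2], fields[4]))
    return rows


def failed_sensors(rows):
    """Names of sensors whose status column is anything other than 'ok'."""
    return [name for name, status, _ in rows if status.lower() != "ok"]


def max_temp_c(rows):
    """Highest reading of the form '<n> degrees C', or None if there is none."""
    temps = []
    for _, _, reading in rows:
        parts = reading.split()
        if len(parts) == 3 and parts[1:] == ["degrees", "C"]:
            try:
                temps.append(float(parts[0]))
            except ValueError:
                pass  # non-numeric reading; skip it
    return max(temps) if temps else None


def is_urgent(temp_rows):
    """True when the hottest temperature sensor is above the 90C threshold."""
    hottest = max_temp_c(temp_rows)
    return hottest is not None and hottest > URGENT_TEMP_C


def read_sdr(sensor_type):
    """Run ipmitool against the local BMC (typically needs root)."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", sensor_type],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the affected host, `failed_sensors(parse_sdr(read_sdr("Fan")))` would list fans that are not reporting `ok`, and `is_urgent(parse_sdr(read_sdr("Temperature")))` applies the 90C rule. A status other than `ok` is not always a failure (some sensors legitimately report no reading), so treat the result as a prompt for inspection.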
Root Cause
Fan 3 failed (bearing failure after 14 months of continuous operation). With one fan down, the remaining fans increased RPM to compensate but could not maintain adequate airflow across CPU 1's heatsink. CPU 1 temperature rose to 85C, triggering the processor's Thermal Control Circuit (TCC), which reduces clock frequency to prevent damage.
Thermal throttling explains why throughput dropped by ~50% while CPU utilization appeared normal: the CPU is busy but running at a reduced clock speed.
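To make the "busy but slow" symptom concrete, the sketch below compares each core's current frequency with its hardware maximum using the Linux cpufreq sysfs files `scaling_cur_freq` and `cpuinfo_max_freq`. It assumes a Linux host with a cpufreq driver loaded. The function names and the `busy` and `slow` thresholds are hypothetical, and because `cpuinfo_max_freq` usually reflects the turbo ceiling, a healthy fully loaded CPU can also sit somewhat below a ratio of 1.0.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def read_khz(path):
    """Read one integer kHz value from a cpufreq sysfs file."""
    return int(Path(path).read_text().strip())


def frequency_ratios(base=CPU_SYSFS):
    """Map each cpuN to its current frequency divided by its hardware maximum."""
    ratios = {}
    for cpufreq in sorted(Path(base).glob("cpu[0-9]*/cpufreq")):
        cur = read_khz(cpufreq / "scaling_cur_freq")
        top = read_khz(cpufreq / "cpuinfo_max_freq")
        ratios[cpufreq.parent.name] = cur / top
    return ratios


def looks_throttled(ratios, utilization, busy=0.8, slow=0.6):
    """Heuristic: the CPU is busy overall, yet the mean clock ratio is low.

    utilization is a 0..1 fraction supplied by the caller (for example from
    existing monitoring); busy and slow are illustrative thresholds.
    """
    if not ratios:
        return False
    mean_ratio = sum(ratios.values()) / len(ratios)
    return utilization >= busy and mean_ratio <= slow
```

A low ratio on an idle machine is ordinary power saving, which is why the heuristic also requires high utilization before flagging throttling.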
Fix
- Immediate: if temps are critical (>90C), reduce thermal load, for example with a temporary CPU frequency cap.
- Replace the failed fan (hot-swappable on the Dell R740):
  - Dell R740 fans are hot-swappable; no shutdown is required.
  - Open the top cover (or front panel, depending on model).
  - Locate Fan 3 (labeled on the fan assembly).
  - Pull the release latch and remove the failed fan module.
  - Insert the replacement fan module until it clicks.
- The BMC will detect the new fan within 30-60 seconds.
- Verify recovery: the fan and temperature sensors read normal and the CPU is no longer throttled.
- Remove any temporary frequency cap.
- Clear BMC alerts after verification.
- Physical inspection:
  - While the cover is open, inspect for dust accumulation on heatsinks and remaining fans.
  - Verify all blanking panels are installed in empty drive bays and PCIe slots.
  - Ensure cables are routed so they do not obstruct airflow.
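The software side of the fix (temporary frequency cap, waiting for the BMC to see the new fan, removing the cap, clearing alerts) might look like the sketch below. It assumes `cpupower` and `ipmitool` are installed. The helper names, the 90-second timeout (a margin over the 30-60 seconds quoted above), and any cap value such as `2.0GHz` are hypothetical. The functions only build command lines or poll a caller-supplied reader, so nothing runs until the operator chooses to execute it.

```python
import time


def cap_frequency_cmd(max_freq):
    """argv for a temporary upper frequency limit, e.g. max_freq='2.0GHz'."""
    return ["cpupower", "frequency-set", "-u", max_freq]


def remove_cap_cmd(hardware_max_khz):
    """argv restoring the upper limit to the hardware maximum.

    cpupower reads a bare number as kHz, matching the value found in
    /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq.
    """
    return ["cpupower", "frequency-set", "-u", str(hardware_max_khz)]


# Clears the BMC System Event Log; run only after recovery is verified.
CLEAR_SEL_CMD = ["ipmitool", "sel", "clear"]


def wait_for_fan(read_status, fan_name, timeout_s=90, interval_s=5,
                 sleep=time.sleep):
    """Poll until fan_name reports 'ok'; True on success, False on timeout.

    read_status is a caller-supplied function returning {sensor_name: status},
    for example built on top of `ipmitool sdr type Fan`.
    """
    waited = 0
    while True:
        if read_status().get(fan_name, "").lower() == "ok":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Each command list can be passed to `subprocess.run(cmd, check=True)` as root. Saving the event log (for example the output of `ipmitool sel list`) before clearing it keeps the failure timeline available for any warranty claim.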
Rollback / Safety
- Fan replacement is hot-swap and non-disruptive. If the replacement fan is also defective, the server will continue running (degraded) on remaining fans.
- If CPU temperature exceeds 95C, initiate a graceful shutdown to prevent thermal damage.
- Keep the server cover closed during operation, and minimize open time during the fan swap; modern servers are designed for directed airflow that depends on the cover.
- If no spare fan is available, reduce workload to 50% to lower thermal output until the part arrives.
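The temperature thresholds in this runbook can be collected into one decision helper. This is a sketch of the stated rules only (above 95C shut down, above 90C reduce load, no spare fan means running at reduced workload); the function name and the action labels are hypothetical, not part of any tool.

```python
SHUTDOWN_ABOVE_C = 95.0     # graceful shutdown to prevent thermal damage
REDUCE_LOAD_ABOVE_C = 90.0  # critical: reduce thermal load immediately


def thermal_action(cpu_temp_c, spare_fan_available=True):
    """Map a CPU temperature to the runbook's recommended action."""
    if cpu_temp_c > SHUTDOWN_ABOVE_C:
        return "graceful-shutdown"
    if cpu_temp_c > REDUCE_LOAD_ABOVE_C:
        return "reduce-load-now"
    if not spare_fan_available:
        # No part on hand: run at about 50% workload until it arrives.
        return "reduce-workload-to-50-percent"
    return "replace-fan"
```

For example, `thermal_action(85)` returns `"replace-fan"`, matching the incident described here, where the CPU was throttling at 85C but had not reached the critical range.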
Common Traps
- Trap: Ignoring the initial fan alert. A failed fan is a P2 hardware issue that should be addressed within 24 hours, not ignored.
- Trap: Looking at CPU utilization percentage instead of frequency. Throttled CPUs show high utilization but low throughput.
- Trap: Running the server with the cover off "for better cooling." Server chassis are designed for directed airflow; removing the cover actually worsens cooling in most designs.
- Trap: Not checking warranty status before ordering a replacement. Dell ProSupport covers fan modules and will ship next-business-day.
- Trap: Forgetting to clear the BMC alert after the fix. Stale alerts contribute to alert fatigue and can mask future genuine issues.