Incident Replay: Thermal Throttling from Fan Failure¶
Setup¶
- System context: High-performance compute server (Dell R740xd) running ML training workloads. Dual CPU, 8 GPU accelerators, heavy thermal load. Server is in a hot-aisle/cold-aisle datacenter.
- Time: Thursday 16:00 UTC
- Your role: On-call SRE / datacenter ops
Round 1: Alert Fires¶
[Pressure cue: "GPU monitoring fires — training job performance dropped 40%. GPU temperatures above 85C. Auto-throttling engaged. ML team wants their GPUs back at full speed."]
What you see:
nvidia-smi shows all 8 GPUs above 80C with thermal throttling active. CPU temperatures are also elevated. The server's ambient inlet temperature reads 28C (datacenter standard: 22-25C).
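One way to confirm the throttling quickly is to query temperature and the hardware thermal-slowdown flag per GPU and filter on both. A minimal sketch, parsing the CSV form of `nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.hw_thermal_slowdown --format=csv,noheader`; the sample text below is illustrative, not output captured from this incident:

```python
# Flag thermally throttled GPUs from nvidia-smi CSV query output.
# On a live host you would capture the text with:
#   nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.hw_thermal_slowdown \
#              --format=csv,noheader
# SAMPLE is a hypothetical capture for illustration.
SAMPLE = """\
0, 87, Active
1, 86, Active
2, 88, Active
3, 85, Active
"""

def throttled_gpus(csv_text: str, temp_limit: int = 85):
    """Return (index, temp) for GPUs at/above temp_limit with HW thermal slowdown active."""
    hot = []
    for line in csv_text.strip().splitlines():
        idx, temp, slowdown = (field.strip() for field in line.split(","))
        if slowdown == "Active" and int(temp) >= temp_limit:
            hot.append((int(idx), int(temp)))
    return hot

print(throttled_gpus(SAMPLE))  # → [(0, 87), (1, 86), (2, 88), (3, 85)]
```

Filtering on the slowdown flag, not temperature alone, distinguishes true hardware throttling from a GPU that is merely warm.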
Choose your action:
- A) Reduce GPU workload to lower temperatures
- B) Check iDRAC for fan status and thermal readings
- C) Check if the datacenter CRAC (cooling) unit is malfunctioning
- D) Open the server chassis to improve airflow
If you chose B (recommended):¶
[Result: iDRAC shows Fan 3 and Fan 4 (of 6) are at 0 RPM — failed. The remaining 4 fans are at 100% but cannot maintain adequate airflow for the full thermal load. Inlet temperature is 28C, slightly elevated because the weakened airflow lets the server recirculate some of its own hot exhaust. Proceed to Round 2.]
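The failed-fan check above can be scripted against sensor output so it doesn't depend on someone eyeballing the iDRAC UI. A minimal sketch that parses `ipmitool sdr type Fan` style lines; the sample text is illustrative of the Dell SDR format, not captured from this incident:

```python
# Find failed fans (0 RPM) from `ipmitool sdr type Fan` output.
# SAMPLE mimics Dell PowerEdge SDR lines and is hypothetical.
SAMPLE = """\
Fan1 RPM         | 30h | ok  | 7.1 | 7800 RPM
Fan2 RPM         | 31h | ok  | 7.1 | 7920 RPM
Fan3 RPM         | 32h | cr  | 7.1 | 0 RPM
Fan4 RPM         | 33h | cr  | 7.1 | 0 RPM
Fan5 RPM         | 34h | ok  | 7.1 | 7860 RPM
Fan6 RPM         | 35h | ok  | 7.1 | 7740 RPM
"""

def failed_fans(sdr_text: str):
    """Return sensor names of fans reading 0 RPM."""
    dead = []
    for line in sdr_text.strip().splitlines():
        fields = [f.strip() for f in line.split("|")]
        name, reading = fields[0], fields[-1]
        if reading.startswith("0 "):  # reading column is e.g. "0 RPM"
            dead.append(name)
    return dead

print(failed_fans(SAMPLE))  # → ['Fan3 RPM', 'Fan4 RPM']
```

Two adjacent fans failing in the same poll, as here, is itself a triage signal: it points at a shared power rail or connector rather than two independent fan deaths.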
If you chose A:¶
[Result: This lowers temperatures, but the ML training job misses its deadline. The team wants the root cause fixed, not a workaround.]
If you chose C:¶
[Result: CRAC units are operating normally. The elevated inlet temperature is localized to this server — caused by its own reduced cooling capacity recycling hot exhaust.]
If you chose D:¶
[Result: Never operate a server with an open chassis in a production datacenter. Removing the cover disrupts the designed internal airflow and the hot-aisle/cold-aisle containment, and may void the warranty.]
Round 2: First Triage Data¶
[Pressure cue: "2 of 6 fans are dead. GPU training is throttled. ML team deadline is tomorrow."]
What you see: Fans 3 and 4 failed simultaneously — both are on the same power rail from the fan board connector. iDRAC lifecycle log shows "Fan 3 RPM below threshold" and "Fan 4 RPM below threshold" starting 2 hours ago.
Choose your action:
- A) Replace the two failed fans with spares from datacenter stock
- B) Check the fan board connector — both fans on the same rail suggests a power issue
- C) Increase the remaining fans to maximum speed via iDRAC
- D) Move the workload to another server while you troubleshoot
If you chose B (recommended):¶
[Result: Physical inspection reveals the fan board connector for fans 3-4 is partially unseated — likely from vibration over time. Re-seating the connector restores power. Both fans spin up to full speed. Temperatures begin dropping. Proceed to Round 3.]
If you chose A:¶
[Result: You replace the fans, but the new fans do not spin either — the power connector is the issue, not the fans themselves. 20 minutes lost on an unnecessary swap.]
If you chose C:¶
[Result: The 4 working fans are already at 100%. There is no headroom.]
If you chose D:¶
[Result: Reasonable for immediate impact mitigation, but it does not fix the server, and spare GPU capacity may not be available elsewhere.]
Round 3: Root Cause Identification¶
[Pressure cue: "Fans back online. Temperatures dropping. What caused the connector to unseat?"]
What you see: Vibration from the high-speed fans and GPU cooling gradually worked the fan board connector loose over months. This is a known issue with this server model, documented in a Dell service bulletin.
Choose your action:
- A) Secure the connector with a retention clip and check all similar servers
- B) Apply the Dell service bulletin fix (retention bracket) during next maintenance
- C) Add fan RPM monitoring to detect partial failures earlier
- D) All of the above
If you chose D (recommended):¶
[Result: Connector secured, fleet checked for same issue (found 2 more servers with loose connectors), fan RPM monitoring added. Proceed to Round 4.]
If you chose A:¶
[Result: Good immediate fix but the monitoring gap remains.]
If you chose B:¶
[Result: Service bulletin is the right reference but waiting for the next maintenance window risks more failures.]
If you chose C:¶
[Result: Monitoring helps detect early but does not prevent the connector issue.]
Round 4: Remediation¶
[Pressure cue: "Temperatures nominal. GPU training resumed at full speed."]
Actions:
1. Verify all 6 fans are running at normal RPM (iDRAC)
2. Verify GPU temperatures are below 75C (nvidia-smi)
3. Verify ML training job resumed at full throughput
4. Apply retention clips to all fan board connectors in similar servers
5. Add fan RPM degradation alerting (alert if any fan drops below 3000 RPM)
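The degradation alert in step 5 can be sketched as a simple threshold check: flag any fan below a floor RPM, which catches a fan that is slowing down before it reaches 0. The 3000 RPM floor comes from the remediation list above; the fan names and readings below are illustrative:

```python
# Sketch of fan RPM degradation alerting: alert if any fan drops below a floor.
# Threshold from the remediation plan; readings here are hypothetical.
RPM_FLOOR = 3000

def degraded_fans(readings, floor=RPM_FLOOR):
    """Return {fan: rpm} for fans spinning below the floor RPM."""
    return {fan: rpm for fan, rpm in readings.items() if rpm < floor}

current = {"Fan1": 7800, "Fan2": 7920, "Fan3": 2400, "Fan4": 0,
           "Fan5": 7860, "Fan6": 7740}
print(degraded_fans(current))  # → {'Fan3': 2400, 'Fan4': 0}
```

In practice this check would run on each monitoring poll and feed the alerting pipeline; a below-floor reading is a page-worthy signal even when no fan has fully failed yet.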
Damage Report¶
- Total downtime: 0 (server running but thermally throttled)
- Blast radius: GPU training throughput reduced 40% for 2 hours; ML pipeline delayed
- Optimal resolution time: 15 minutes (check fans -> reseat connector -> verify)
- If every wrong choice was made: 3+ hours including unnecessary fan replacements and workload migrations
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Dell PowerEdge Servers
- Footguns: Datacenter