Portal | Level: L1: Foundations | Topics: Server Hardware, Rack & Stack | Domain: Datacenter & Hardware
Scenario: CPU Thermal Throttling During Peak Hours¶
Situation¶
At 11:30 AM, the application performance team reports that request latency on compute-node-12 (a 2U HP ProLiant DL380 Gen10) doubles between 10 AM and 2 PM daily but is normal at night. CPU utilization looks identical day vs night (around 70%), ruling out a load-related explanation. The server hosts a CPU-intensive data processing pipeline. This has been worsening gradually over the past two weeks.
What You Know¶
- Dual Intel Xeon Gold 6248R (24 cores each) processors
- The server is in the middle of a row in a hot-aisle/cold-aisle datacenter
- Ambient datacenter temperature has been normal according to the facility team (68F/20C)
- No hardware alerts have fired from iLO (HP's out-of-band management)
- top and htop show CPUs at 70%, but perf stat shows lower-than-expected IPC (instructions per cycle)
Investigation Steps¶
1. Check Current CPU Temperatures and Thermal State¶
Command(s):
# Read all sensor data via IPMI (temperatures, fan speeds, power)
sudo ipmitool sdr type Temperature
sudo ipmitool sdr type Fan
# Get detailed sensor readings with thresholds
sudo ipmitool sdr list full | grep -iE "temp|fan|inlet|exhaust|cpu"
# Check CPU frequency -- throttled CPUs drop below base frequency
cat /proc/cpuinfo | grep "cpu MHz" | sort -t: -k2 -n | head -5
cat /proc/cpuinfo | grep "cpu MHz" | sort -t: -k2 -n | tail -5
# More precise frequency check with turbostat
sudo turbostat --quiet --show Core,CPU,Avg_MHz,Busy%,Bzy_MHz,TSC_MHz,CoreTmp,PkgTmp --interval 5 --num_iterations 1
If Bzy_MHz (the actual frequency while busy) is below base clock (e.g., 2.1 GHz instead of the 6248R's 3.0 GHz), the CPU is actively thermal throttling. Comparing Avg_MHz to TSC_MHz also helps: a large gap means the CPU is spending significant time idle or in lower power states. Fan RPMs significantly below normal (or reading zero) indicate fan failure.
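The frequency comparison can be scripted for quick triage. A minimal sketch in bash; the 10% margin is an illustrative tolerance (not an Intel- or HP-documented threshold), and the live-system variant assumes the intel_pstate driver's base_frequency sysfs file:

```shell
# check_throttle: compare a CPU's current frequency to its base clock and
# flag likely thermal throttling. The 10% margin is an assumption for
# illustration -- tune it for your fleet.
check_throttle() {
  local cur_mhz=$1 base_mhz=$2
  if [ "$cur_mhz" -lt $(( base_mhz * 90 / 100 )) ]; then
    echo "THROTTLED: ${cur_mhz} MHz (base ${base_mhz} MHz)"
  else
    echo "OK: ${cur_mhz} MHz"
  fi
}

# On a live intel_pstate system, the base clock is exposed in sysfs:
#   base=$(( $(cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency) / 1000 ))
#   for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; do
#     check_throttle $(( $(cat "$f") / 1000 )) "$base"
#   done
check_throttle 2100 3000   # the throttled state seen in this scenario
check_throttle 2950 3000
```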
2. Check Thermal Event History and Throttle Counters¶
Command(s):
# Check IPMI system event log for thermal events
sudo ipmitool sel list | grep -iE "temp|thermal|throt|fan|critical"
# Check for kernel-reported throttle events
dmesg | grep -iE "throttl|thermal|temperature|cpu.*clock"
# Read CPU throttle status from MSRs (model-specific registers)
# Requires msr-tools and the msr kernel module
sudo modprobe msr
sudo rdmsr -a 0x19C # IA32_THERM_STATUS -- bit 0 = currently throttling, bit 1 = throttling has occurred since last clear
# Alternative: check via sysfs
for z in /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count; do echo "$z: $(cat $z)"; done
# Check package-level throttle count
for z in /sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count; do echo "$z: $(cat $z)"; done | head -4
Nonzero core_throttle_count or package_throttle_count values confirm thermal throttling has occurred. These counters increment each time the CPU enters a throttled state. If the SEL shows Upper Critical temperature events correlated with the 10 AM - 2 PM window, this matches the performance degradation. Check whether only one CPU package is affected (an asymmetric cooling problem) or both.
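The since-boot counters cannot distinguish old events from active throttling; sampling them twice makes that distinction. A small bash sketch, using the same sysfs paths as the commands above:

```shell
# sum_throttle_counts: total up all throttle counter files matching a glob.
# Pass the glob quoted so the function expands it itself.
sum_throttle_counts() {
  local total=0 f
  for f in $1; do
    [ -r "$f" ] && total=$(( total + $(cat "$f") ))
  done
  echo "$total"
}

# Live usage: sample twice, a minute apart -- a positive delta means the
# CPU is throttling right now, not just at some point since boot.
#   glob='/sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count'
#   before=$(sum_throttle_counts "$glob")
#   sleep 60
#   after=$(sum_throttle_counts "$glob")
#   echo "throttle events in the last 60s: $(( after - before ))"
```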
3. Identify the Physical Cause¶
Command(s):
# Compare inlet (front) and exhaust (rear) temperatures
sudo ipmitool sdr list full | grep -iE "inlet|exhaust|ambient"
# Check individual fan status -- look for failed or degraded fans
sudo ipmitool sdr list full | grep -i fan
# HP-specific: use iLO RESTful API or hpasmcli
sudo hpasmcli -s "show fans"
sudo hpasmcli -s "show temp"
sudo hpasmcli -s "show powersupply"
# Check if power cap is limiting CPU performance
sudo ipmitool dcmi power reading
sudo ipmitool raw 0x30 0xCE 0x01 0x00 # vendor-specific power limit query
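hpasmcli reports failed fans directly; when only raw ipmitool output is available, the inlet-to-exhaust delta is a quick airflow sanity check: with fans down, less air moves through the chassis, so each unit of air carries away more heat and the delta climbs well above its historical baseline. A hedged parsing sketch (sensor names and the "Name | NN degrees C | ok" field layout vary by BMC, so adjust the awk patterns for yours):

```shell
# temp_delta: extract inlet and exhaust readings from 'ipmitool sdr' style
# output and print the difference. Assumes whole-degree integer readings.
temp_delta() {
  local inlet exhaust
  inlet=$(echo "$1" | awk -F'|' '/[Ii]nlet/   { print int($2); exit }')
  exhaust=$(echo "$1" | awk -F'|' '/[Ee]xhaust/ { print int($2); exit }')
  echo "inlet ${inlet}C exhaust ${exhaust}C delta $(( exhaust - inlet ))C"
}

# Example with canned sensor output (live: temp_delta "$(sudo ipmitool sdr list full)")
sample='Inlet Temp   | 24 degrees C | ok
Exhaust Temp | 47 degrees C | ok'
temp_delta "$sample"   # prints: inlet 24C exhaust 47C delta 23C
```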
Root Cause¶
Two of the six internal fans (Fan 3 and Fan 4, which cool the CPU zone) had failed. The iLO firmware was configured with a "relaxed" thermal policy that only alerts at Critical threshold (95C) rather than Warning threshold (85C), so no alert fired even though CPUs were hitting 90C. The remaining four fans ran at maximum RPM but could not provide adequate cooling under sustained load. During off-peak hours (nighttime), lower ambient temperature and reduced compute load kept CPUs just below the throttle threshold. During peak hours (10 AM - 2 PM), the combination of full compute load and slightly higher ambient temperature (other servers in the row also running hot) pushed CPUs past the throttle point. The two-week worsening timeline correlated with the second fan failing a week after the first.
Fix¶
Immediate:
# Step 1: Reduce load on this node immediately
# Drain the node from the job scheduler / load balancer
# (application-specific: e.g., kubectl cordon, haproxy disable server, etc.)
# Step 2: Schedule emergency maintenance to replace failed fans
# HP ProLiant fans are hot-swappable -- no shutdown needed
# Datacenter tech replaces Fan 3 and Fan 4 modules
# Step 3: After fan replacement, verify temperatures drop
sudo ipmitool sdr type Temperature
sudo ipmitool sdr type Fan
# Step 4: Restore workload and verify no throttling
sudo turbostat --quiet --show Core,CPU,Avg_MHz,Bzy_MHz,CoreTmp,PkgTmp --interval 5 --num_iterations 3
Preventive:
- Lower iLO/iDRAC thermal alert thresholds to fire at Warning (85C), not just Critical (95C)
- Monitor CPU frequency alongside utilization -- throttled CPUs show normal utilization but reduced frequency and IPC
- Set up fan RPM monitoring with alerts for any fan dropping below 50% of expected RPM or reading 0 RPM
- Export CPU frequency metrics into your monitoring stack via turbostat, collectd, or Prometheus node_exporter (node_cpu_scaling_frequency_hertz)
- Schedule periodic datacenter walkthroughs to check for dust buildup, missing blanking panels, and hot-aisle containment breaches
- Track package_throttle_count as a metric -- any increase should trigger an alert
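The last two preventive items can be wired together. node_exporter's thermal_throttle collector already exports node_cpu_core_throttles_total and node_cpu_package_throttles_total on Linux; for hosts without it, a textfile-collector sketch (the metric name and output path below are illustrative assumptions, not node_exporter conventions):

```shell
# emit_throttle_metrics: render per-CPU throttle counters in Prometheus
# exposition format. The metric name is a placeholder for illustration.
emit_throttle_metrics() {
  local f cpu
  echo '# TYPE custom_cpu_package_throttles_total counter'
  for f in $1; do
    cpu=${f##*cpu}   # strip everything through the last 'cpu' in the path
    cpu=${cpu%%/*}   # keep just the cpu number
    printf 'custom_cpu_package_throttles_total{cpu="%s"} %s\n' "$cpu" "$(cat "$f")"
  done
}

# Live usage -- output directory must match your --collector.textfile.directory:
#   emit_throttle_metrics '/sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count' \
#     > /var/lib/node_exporter/textfile/throttle.prom
```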
Common Mistakes¶
- Looking only at CPU utilization and not CPU frequency -- a throttled CPU shows high utilization but is running at half speed
- Blaming the application for performance regression without checking hardware telemetry first
- Trusting that "no hardware alerts" means no hardware problem -- default alert thresholds are often too permissive
- Not checking the physical environment -- a missing blanking panel, a neighboring server's exhaust blowing into the cold aisle, or a blocked floor tile can cause localized hot spots
- Running stress tests to diagnose the problem -- this makes the thermal situation worse and can trigger a protective shutdown
Interview Angle¶
Q: An application has higher latency during the day but utilization looks the same. What could cause this?
Good answer shape: This pattern is a classic sign of thermal throttling. When CPUs overheat, they reduce clock speed to lower power consumption, which drops performance without changing utilization percentage. I would check CPU temperatures via IPMI (ipmitool sdr), compare actual CPU frequency to base clock (turbostat or /proc/cpuinfo), and look at throttle event counters in sysfs. The time-of-day correlation suggests a marginal cooling situation where slightly higher daytime ambient temperature pushes the system past its thermal limit. Root causes include failed fans, dust-clogged heatsinks, or datacenter airflow problems. A strong answer also mentions monitoring CPU frequency as a standard metric alongside utilization.
Wiki Navigation¶
Prerequisites¶
- Datacenter & Server Hardware (Topic Pack, L1)
Related Content¶
- Case Study: Cable Management Wrong Port (Case Study, L1) — Rack & Stack, Server Hardware
- Case Study: Link Flaps Bad Optic (Case Study, L1) — Rack & Stack, Server Hardware
- Datacenter & Server Hardware (Topic Pack, L1) — Rack & Stack, Server Hardware
- Skillcheck: Datacenter (Assessment, L1) — Rack & Stack, Server Hardware
- Bare-Metal Provisioning (Topic Pack, L2) — Server Hardware
- Case Study: BIOS Settings Reset After CMOS (Case Study, L1) — Server Hardware
- Case Study: Database Replication Lag — Root Cause Is RAID Degradation (Case Study, L2) — Server Hardware
- Case Study: Firmware Update Boot Loop (Case Study, L2) — Server Hardware
- Case Study: Memory ECC Errors Increasing (Case Study, L1) — Server Hardware
- Case Study: Power Supply Redundancy Lost (Case Study, L1) — Server Hardware