Portal | Level: L1: Foundations | Topics: Server Hardware, Rack & Stack | Domain: Datacenter & Hardware
Scenario: CPU Thermal Throttling During Peak Hours¶
Situation¶
At 11:30 AM, the application performance team reports that request latency on compute-node-12 (a 2U HP ProLiant DL380 Gen10) doubles between 10 AM and 2 PM daily but is normal at night. CPU utilization looks identical day vs night (around 70%), ruling out a load-related explanation. The server hosts a CPU-intensive data processing pipeline. This has been worsening gradually over the past two weeks.
What You Know¶
- Dual Intel Xeon Gold 6248R (24 cores each) processors
- The server is in the middle of a row in a hot-aisle/cold-aisle datacenter
- Ambient datacenter temperature has been normal according to the facility team (68F/20C)
- No hardware alerts have fired from iLO (HP's out-of-band management)
- top and htop show CPUs at 70%, but perf stat shows lower-than-expected IPC (instructions per cycle)
Investigation Steps¶
1. Check Current CPU Temperatures and Thermal State¶
Command(s):
# Read all sensor data via IPMI (temperatures, fan speeds, power)
sudo ipmitool sdr type Temperature
sudo ipmitool sdr type Fan
# Get detailed sensor readings with thresholds
sudo ipmitool sdr list full | grep -iE "temp|fan|inlet|exhaust|cpu"
# Check CPU frequency -- throttled CPUs drop below base frequency
cat /proc/cpuinfo | grep "cpu MHz" | sort -t: -k2 -n | head -5
cat /proc/cpuinfo | grep "cpu MHz" | sort -t: -k2 -n | tail -5
# More precise frequency check with turbostat
sudo turbostat --quiet --show Core,CPU,Avg_MHz,Busy%,Bzy_MHz,TSC_MHz,CoreTmp,PkgTmp --interval 5 --num_iterations 1
If Bzy_MHz (the actual frequency while busy) is below base clock (e.g., 2.1 GHz instead of the 6248R's 3.0 GHz), the CPU is actively thermal throttling. Comparing Avg_MHz to TSC_MHz also helps: a large gap means the CPU is spending significant time idle or in lower power states. Fan RPMs significantly below normal (or reading zero) indicate fan failure.
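The frequency comparison can be scripted for quick triage. A minimal sketch in bash; the 10% margin is an illustrative tolerance (not an Intel- or HP-documented threshold), and the live-system variant assumes the intel_pstate driver's base_frequency sysfs file:

```shell
# check_throttle: compare a CPU's current frequency to its base clock and
# flag likely thermal throttling. The 10% margin is an assumption for
# illustration -- tune it for your fleet.
check_throttle() {
  local cur_mhz=$1 base_mhz=$2
  if [ "$cur_mhz" -lt $(( base_mhz * 90 / 100 )) ]; then
    echo "THROTTLED: ${cur_mhz} MHz (base ${base_mhz} MHz)"
  else
    echo "OK: ${cur_mhz} MHz"
  fi
}

# On a live intel_pstate system, the base clock is exposed in sysfs:
#   base=$(( $(cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency) / 1000 ))
#   for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; do
#     check_throttle $(( $(cat "$f") / 1000 )) "$base"
#   done
check_throttle 2100 3000   # the throttled state seen in this scenario
check_throttle 2950 3000
```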
2. Check Thermal Event History and Throttle Counters¶
Command(s):
# Check IPMI system event log for thermal events
sudo ipmitool sel list | grep -iE "temp|thermal|throt|fan|critical"
# Check for kernel-reported throttle events
dmesg | grep -iE "throttl|thermal|temperature|cpu.*clock"
# Read CPU throttle status from MSRs (model-specific registers)
# Requires msr-tools and the msr kernel module
sudo modprobe msr
sudo rdmsr -a 0x19C # IA32_THERM_STATUS -- bit 0 = currently throttling, bit 1 = throttling has occurred since last clear
# Alternative: check via sysfs
for z in /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count; do echo "$z: $(cat $z)"; done
# Check package-level throttle count
for z in /sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count; do echo "$z: $(cat $z)"; done | head -4
Nonzero core_throttle_count or package_throttle_count values confirm thermal throttling has occurred. These counters increment each time the CPU enters a throttled state. If the SEL shows Upper Critical temperature events correlated with the 10 AM - 2 PM window, this matches the performance degradation. Check whether only one CPU package is affected (an asymmetric cooling problem) or both.
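The since-boot counters cannot distinguish old events from active throttling; sampling them twice makes that distinction. A small bash sketch, using the same sysfs paths as the commands above:

```shell
# sum_throttle_counts: total up all throttle counter files matching a glob.
# Pass the glob quoted so the function expands it itself.
sum_throttle_counts() {
  local total=0 f
  for f in $1; do
    [ -r "$f" ] && total=$(( total + $(cat "$f") ))
  done
  echo "$total"
}

# Live usage: sample twice, a minute apart -- a positive delta means the
# CPU is throttling right now, not just at some point since boot.
#   glob='/sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count'
#   before=$(sum_throttle_counts "$glob")
#   sleep 60
#   after=$(sum_throttle_counts "$glob")
#   echo "throttle events in the last 60s: $(( after - before ))"
```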
3. Identify the Physical Cause¶
Command(s):
# Compare inlet (front) and exhaust (rear) temperatures
sudo ipmitool sdr list full | grep -iE "inlet|exhaust|ambient"
# Check individual fan status -- look for failed or degraded fans
sudo ipmitool sdr list full | grep -i fan
# HP-specific: use iLO RESTful API or hpasmcli
sudo hpasmcli -s "show fans"
sudo hpasmcli -s "show temp"
sudo hpasmcli -s "show powersupply"
# Check if power cap is limiting CPU performance
sudo ipmitool dcmi power reading
sudo ipmitool raw 0x30 0xCE 0x01 0x00 # vendor-specific power limit query
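hpasmcli reports failed fans directly; when only raw ipmitool output is available, the inlet-to-exhaust delta is a quick airflow sanity check: with fans down, less air moves through the chassis, so each unit of air carries away more heat and the delta climbs well above its historical baseline. A hedged parsing sketch (sensor names and the "Name | NN degrees C | ok" field layout vary by BMC, so adjust the awk patterns for yours):

```shell
# temp_delta: extract inlet and exhaust readings from 'ipmitool sdr' style
# output and print the difference. Assumes whole-degree integer readings.
temp_delta() {
  local inlet exhaust
  inlet=$(echo "$1" | awk -F'|' '/[Ii]nlet/   { print int($2); exit }')
  exhaust=$(echo "$1" | awk -F'|' '/[Ee]xhaust/ { print int($2); exit }')
  echo "inlet ${inlet}C exhaust ${exhaust}C delta $(( exhaust - inlet ))C"
}

# Example with canned sensor output (live: temp_delta "$(sudo ipmitool sdr list full)")
sample='Inlet Temp   | 24 degrees C | ok
Exhaust Temp | 47 degrees C | ok'
temp_delta "$sample"   # prints: inlet 24C exhaust 47C delta 23C
```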
Root Cause¶
Two of the six internal fans (Fan 3 and Fan 4, which cool the CPU zone) had failed. The iLO firmware was configured with a "relaxed" thermal policy that only alerts at Critical threshold (95C) rather than Warning threshold (85C), so no alert fired even though CPUs were hitting 90C. The remaining four fans ran at maximum RPM but could not provide adequate cooling under sustained load. During off-peak hours (nighttime), lower ambient temperature and reduced compute load kept CPUs just below the throttle threshold. During peak hours (10 AM - 2 PM), the combination of full compute load and slightly higher ambient temperature (other servers in the row also running hot) pushed CPUs past the throttle point. The two-week worsening timeline correlated with the second fan failing a week after the first.
Fix¶
Immediate:
# Step 1: Reduce load on this node immediately
# Drain the node from the job scheduler / load balancer
# (application-specific: e.g., kubectl cordon, haproxy disable server, etc.)
# Step 2: Schedule emergency maintenance to replace failed fans
# HP ProLiant fans are hot-swappable -- no shutdown needed
# Datacenter tech replaces Fan 3 and Fan 4 modules
# Step 3: After fan replacement, verify temperatures drop
sudo ipmitool sdr type Temperature
sudo ipmitool sdr type Fan
# Step 4: Restore workload and verify no throttling
sudo turbostat --quiet --show Core,CPU,Avg_MHz,Bzy_MHz,CoreTmp,PkgTmp --interval 5 --num_iterations 3
Preventive:
- Lower iLO/iDRAC thermal alert thresholds to fire at Warning (85C), not just Critical (95C)
- Monitor CPU frequency alongside utilization -- throttled CPUs show normal utilization but reduced frequency and IPC
- Set up fan RPM monitoring with alerts for any fan dropping below 50% of expected RPM or reading 0 RPM
- Export CPU frequency metrics into your monitoring stack via turbostat, collectd, or Prometheus node_exporter (node_cpu_scaling_frequency_hertz)
- Schedule periodic datacenter walkthroughs to check for dust buildup, missing blanking panels, and hot-aisle containment breaches
- Track package_throttle_count as a metric -- any increase should trigger an alert
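The last two preventive items can be wired together. node_exporter's thermal_throttle collector already exports node_cpu_core_throttles_total and node_cpu_package_throttles_total on Linux; for hosts without it, a textfile-collector sketch (the metric name and output path below are illustrative assumptions, not node_exporter conventions):

```shell
# emit_throttle_metrics: render per-CPU throttle counters in Prometheus
# exposition format. The metric name is a placeholder for illustration.
emit_throttle_metrics() {
  local f cpu
  echo '# TYPE custom_cpu_package_throttles_total counter'
  for f in $1; do
    cpu=${f##*cpu}   # strip everything through the last 'cpu' in the path
    cpu=${cpu%%/*}   # keep just the cpu number
    printf 'custom_cpu_package_throttles_total{cpu="%s"} %s\n' "$cpu" "$(cat "$f")"
  done
}

# Live usage -- output directory must match your --collector.textfile.directory:
#   emit_throttle_metrics '/sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count' \
#     > /var/lib/node_exporter/textfile/throttle.prom
```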
Common Mistakes¶
- Looking only at CPU utilization and not CPU frequency -- a throttled CPU shows high utilization but is running at half speed
- Blaming the application for performance regression without checking hardware telemetry first
- Trusting that "no hardware alerts" means no hardware problem -- default alert thresholds are often too permissive
- Not checking the physical environment -- a missing blanking panel, a neighboring server's exhaust blowing into the cold aisle, or a blocked floor tile can cause localized hot spots
- Running stress tests to diagnose the problem -- this makes the thermal situation worse and can trigger a protective shutdown
Interview Angle¶
Q: An application has higher latency during the day but utilization looks the same. What could cause this?
Good answer shape: This pattern is a classic sign of thermal throttling. When CPUs overheat, they reduce clock speed to lower power consumption, which drops performance without changing utilization percentage. I would check CPU temperatures via IPMI (ipmitool sdr), compare actual CPU frequency to base clock (turbostat or /proc/cpuinfo), and look at throttle event counters in sysfs. The time-of-day correlation suggests a marginal cooling situation where slightly higher daytime ambient temperature pushes the system past its thermal limit. Root causes include failed fans, dust-clogged heatsinks, or datacenter airflow problems. A strong answer also mentions monitoring CPU frequency as a standard metric alongside utilization.
Wiki Navigation¶
Prerequisites¶
- Datacenter & Server Hardware (Topic Pack, L1)
Related Content¶
- Case Study: Cable Management Wrong Port (Case Study, L1) — Rack & Stack, Server Hardware
- Case Study: Link Flaps Bad Optic (Case Study, L1) — Rack & Stack, Server Hardware
- Datacenter & Server Hardware (Topic Pack, L1) — Rack & Stack, Server Hardware
- Skillcheck: Datacenter (Assessment, L1) — Rack & Stack, Server Hardware
- Bare-Metal Provisioning (Topic Pack, L2) — Server Hardware
- Case Study: BIOS Settings Reset After CMOS (Case Study, L1) — Server Hardware
- Case Study: Database Replication Lag — Root Cause Is RAID Degradation (Case Study, L2) — Server Hardware
- Case Study: Firmware Update Boot Loop (Case Study, L2) — Server Hardware
- Case Study: Memory ECC Errors Increasing (Case Study, L1) — Server Hardware
- Case Study: Power Supply Redundancy Lost (Case Study, L1) — Server Hardware