Incident Replay: Kernel Soft Lockup¶
Setup¶
- System context: High-traffic web server handling 10K requests/second. The server becomes unresponsive for 10-30 seconds periodically. Console shows "BUG: soft lockup" messages.
- Time: Wednesday 15:45 UTC
- Your role: On-call SRE / Linux systems engineer
Round 1: Alert Fires¶
[Pressure cue: "Load balancer health checks failing intermittently for web-prod-09. Server drops out of the pool every few minutes, causing request spikes on other servers."]
What you see:
SSH sessions freeze for 10-30 seconds periodically. dmesg (when you can read it) shows "BUG: soft lockup - CPU#3 stuck for 22s!" and a kernel stack trace pointing to the storage subsystem.
Choose your action:
- A) Reboot the server immediately
- B) Analyze the soft lockup stack trace to identify the stuck code path
- C) Increase the soft lockup threshold to suppress the warnings
- D) Check CPU temperature for thermal throttling
If you chose B (recommended):¶
[Result: The stack trace shows the CPU stuck in ext4_writepages -> blk_mq_run_hw_queues -> spin_lock wait. A storage I/O path is holding a lock and not releasing it. The soft lockup warning fires when a CPU runs in kernel code for more than 20 seconds without yielding. Proceed to Round 2.]
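The lockup message itself carries the stuck CPU and the stall duration, both worth extracting before the console freezes again. A minimal sketch, run here against a hypothetical sample of the message format (on the live host you would pipe `dmesg` or `journalctl -k` into it instead):

```shell
# Hypothetical sample line; the real input would be `dmesg` output.
sample='[ 1234.5678] BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:412]'

# Pull out the CPU number and stall duration from the message.
cpu=$(printf '%s\n' "$sample" | sed -n 's/.*CPU#\([0-9]*\) stuck for.*/\1/p')
secs=$(printf '%s\n' "$sample" | sed -n 's/.*stuck for \([0-9]*\)s.*/\1/p')
echo "CPU $cpu stalled for ${secs}s"
```

The CPU number tells you which per-CPU stack trace to read; the duration tells you how long the lock was held without yielding.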
If you chose A:¶
[Result: Reboot clears the symptom temporarily but it will recur. You need to understand why the storage path is stalling.]
If you chose C:¶
[Result: Increasing the threshold hides the symptom. The CPU is genuinely stuck — the warnings are correct and important.]
If you chose D:¶
[Result: CPU temperature is 62 °C, well within normal range. Not a thermal issue.]
Round 2: First Triage Data¶
[Pressure cue: "Server freezing every 3-5 minutes. Load balancer keeps removing it. Traffic pressure on remaining servers."]
What you see:
The soft lockup coincides with heavy disk I/O. iostat -x 1 shows the SSD reaching 100% utilization during the freezes. iotop shows a backup script running rsync with no I/O throttling, consuming all available IOPS.
Choose your action:
- A) Kill the backup rsync process immediately
- B) Use ionice to reduce the backup process I/O priority
- C) Check if the SSD has bad sectors causing I/O stalls
- D) Increase the I/O scheduler queue depth
If you chose A (recommended):¶
[Result: kill $(pgrep rsync) terminates the backup process. I/O utilization drops from 100% to 30%, the soft lockups stop, and the server becomes responsive again. Proceed to Round 3.]
If you chose B:¶
[Result: ionice -c 3 -p $(pgrep rsync) sets idle-class I/O priority. This helps somewhat, but the SSD is still saturated when application I/O bursts. Partial fix.]
If you chose C:¶
[Result: SMART data shows the SSD is healthy. The issue is I/O contention, not hardware failure.]
If you chose D:¶
[Result: Queue depth changes do not help when the device is already at 100% utilization. The bottleneck is total IOPS.]
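The saturation signal in `iostat -x` is the %util column (the last field of each device line). A minimal threshold check, sketched here against a hypothetical captured line rather than live `iostat` output:

```shell
# Hypothetical device line from `iostat -x 1`; the final column is %util.
line='nvme0n1  0.00 4120.00 0.00 512000.00 248.60 12.40 3.01 100.00'

util=$(printf '%s\n' "$line" | awk '{ print $NF }')
# Treat sustained utilization above 90% as saturation worth acting on.
if awk -v u="$util" 'BEGIN { exit !(u + 0 > 90) }'; then
  verdict="saturated (${util}%)"
else
  verdict="ok (${util}%)"
fi
echo "$verdict"
```

Note that on fast NVMe devices %util can overstate saturation, so in practice you would corroborate with await and queue-depth columns before acting.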
Round 3: Root Cause Identification¶
[Pressure cue: "Server stable. Why was a backup running during peak hours?"]
What you see:
Root cause: The backup cron job was scheduled for 15:00 UTC (intended to run at 03:00, but 3 PM was entered instead of 3 AM when the crontab was written). The rsync backup has no I/O throttling (--bwlimit or ionice), so it consumes all available IOPS. On a busy server, this causes kernel-level I/O stalls and soft lockups.
Choose your action:
- A) Fix the cron schedule to 03:00 and add ionice to the rsync command
- B) Move backups to a dedicated backup server using snapshots
- C) Add I/O throttling to the backup script and keep the current schedule
- D) Fix the schedule, add ionice, and add an I/O utilization alert
If you chose D (recommended):¶
[Result: Cron fixed to 03:00. rsync wrapped with ionice -c 2 -n 7 and --bwlimit=50m. I/O alert added at 90% utilization sustained for 60+ seconds. Proceed to Round 4.]
If you chose A:¶
[Result: Good but no monitoring to detect future I/O saturation from other causes.]
If you chose B:¶
[Result: Correct long-term architecture but requires snapshot infrastructure that does not exist.]
If you chose C:¶
[Result: Throttling helps but running a heavy backup during peak hours is still suboptimal.]
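The schedule-plus-throttling fix from choice D can be sketched as a cron entry wrapping a throttled rsync. The source path, destination host, and 50 MB/s cap below are illustrative placeholders, not values from the incident record:

```shell
# Illustrative source/destination; substitute your real backup targets.
src="/var/www/"
dest="backup-host:/backups/web-prod-09/"

# Best-effort class (-c 2) at the lowest priority (-n 7), plus rsync's
# own bandwidth cap, so the copy can never monopolize the SSD's IOPS.
backup_cmd="ionice -c 2 -n 7 rsync -a --bwlimit=50m $src $dest"

# Corrected schedule: 03:00, not 15:00 (the original AM/PM mix-up).
cron_entry="0 3 * * * $backup_cmd"
echo "$cron_entry"
```

Belt-and-suspenders is deliberate here: ionice yields to application I/O when the disk is busy, while --bwlimit caps throughput even when the idle-class hint is not honored by the active I/O scheduler.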
Round 4: Remediation¶
[Pressure cue: "Server stable. Verify and close."]
Actions:
1. Verify no soft lockup messages: dmesg | grep -i "soft lockup"
2. Verify I/O utilization is normal: iostat -x 1 5
3. Verify server is back in the load balancer pool
4. Fix the cron schedule and add ionice wrapper
5. Add I/O utilization monitoring with alerting
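Steps 1 and 2 above can be scripted into a single verification pass. A sketch, run here against a hypothetical saved kernel log (on the real server you would feed it `dmesg` output directly):

```shell
# Write a hypothetical post-fix kernel log to a temp file; on the real
# host this would be the output of `dmesg` or a saved /var/log/kern.log.
klog=$(mktemp)
printf '%s\n' \
  '[ 9876.1000] EXT4-fs (nvme0n1p2): mounted filesystem' \
  '[ 9900.2000] systemd[1]: Started daily tasks.' > "$klog"

# Verification step 1: no soft lockup messages after the fix.
if grep -qi 'soft lockup' "$klog"; then
  status="LOCKUPS STILL PRESENT"
else
  status="clean"
fi
echo "verification: $status"
rm -f "$klog"
```

A matching check on iostat output (%util back below alert threshold) closes out step 2 before returning the server to the pool.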
Damage Report¶
- Total downtime: 0 (server periodically dropped from LB pool; other servers absorbed traffic)
- Blast radius: Intermittent degradation for ~45 minutes; increased latency on remaining servers
- Optimal resolution time: 8 minutes (read dmesg -> identify I/O contention -> kill rsync)
- If every wrong choice was made: 2+ hours of periodic freezes plus risk of full server lockup
Cross-References¶
- Primer: Kernel Troubleshooting
- Primer: Linux Performance
- Primer: Disk & Storage Ops
- Footguns: Linux Ops