Incident Replay: Kernel Soft Lockup¶
Setup¶
- System context: High-traffic web server handling 10K requests/second. The server becomes unresponsive for 10-30 seconds periodically. Console shows "BUG: soft lockup" messages.
- Time: Wednesday 15:45 UTC
- Your role: On-call SRE / Linux systems engineer
Round 1: Alert Fires¶
[Pressure cue: "Load balancer health checks failing intermittently for web-prod-09. Server drops out of the pool every few minutes, causing request spikes on other servers."]
What you see:
SSH sessions freeze for 10-30 seconds periodically. dmesg (when you can read it) shows "BUG: soft lockup - CPU#3 stuck for 22s!" and a kernel stack trace pointing to the storage subsystem.
Choose your action:
- A) Reboot the server immediately
- B) Analyze the soft lockup stack trace to identify the stuck code path
- C) Increase the soft lockup threshold to suppress the warnings
- D) Check CPU temperature for thermal throttling
If you chose B (recommended):¶
[Result: The stack trace shows the CPU stuck in ext4_writepages -> blk_mq_run_hw_queues -> spin_lock wait. A storage I/O path is holding a lock and not releasing it. The soft lockup warning fires when a CPU runs in kernel code for more than 20 seconds without yielding. Proceed to Round 2.]
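The lockup message itself carries the stuck CPU and the stall duration, both worth extracting before the console freezes again. A minimal sketch, run here against a hypothetical sample of the message format (on the live host you would pipe `dmesg` or `journalctl -k` into it instead):

```shell
# Hypothetical sample line; the real input would be `dmesg` output.
sample='[ 1234.5678] BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:412]'

# Pull out the CPU number and stall duration from the message.
cpu=$(printf '%s\n' "$sample" | sed -n 's/.*CPU#\([0-9]*\) stuck for.*/\1/p')
secs=$(printf '%s\n' "$sample" | sed -n 's/.*stuck for \([0-9]*\)s.*/\1/p')
echo "CPU $cpu stalled for ${secs}s"
```

The CPU number tells you which per-CPU stack trace to read; the duration tells you how long the lock was held without yielding.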
If you chose A:¶
[Result: Reboot clears the symptom temporarily but it will recur. You need to understand why the storage path is stalling.]
If you chose C:¶
[Result: Increasing the threshold hides the symptom. The CPU is genuinely stuck — the warnings are correct and important.]
If you chose D:¶
[Result: CPU temperature is 62 °C, well within normal range. Not a thermal issue.]
Round 2: First Triage Data¶
[Pressure cue: "Server freezing every 3-5 minutes. Load balancer keeps removing it. Traffic pressure on remaining servers."]
What you see:
The soft lockup coincides with heavy disk I/O. iostat -x 1 shows the SSD reaching 100% utilization during the freezes. iotop shows a backup script running rsync with no I/O throttling, consuming all available IOPS.
Choose your action:
- A) Kill the backup rsync process immediately
- B) Use ionice to reduce the backup process I/O priority
- C) Check if the SSD has bad sectors causing I/O stalls
- D) Increase the I/O scheduler queue depth
If you chose A (recommended):¶
[Result: kill $(pgrep rsync) terminates the backup process. I/O utilization drops from 100% to 30%, the soft lockups stop, and the server becomes responsive again. Proceed to Round 3.]
If you chose B:¶
[Result: ionice -c 3 -p $(pgrep rsync) sets idle-class I/O priority. This helps somewhat, but the SSD is still saturated when application I/O bursts. Partial fix.]
If you chose C:¶
[Result: SMART data shows the SSD is healthy. The issue is I/O contention, not hardware failure.]
If you chose D:¶
[Result: Queue depth changes do not help when the device is already at 100% utilization. The bottleneck is total IOPS.]
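The saturation signal in `iostat -x` is the %util column (the last field of each device line). A minimal threshold check, sketched here against a hypothetical captured line rather than live `iostat` output:

```shell
# Hypothetical device line from `iostat -x 1`; the final column is %util.
line='nvme0n1  0.00 4120.00 0.00 512000.00 248.60 12.40 3.01 100.00'

util=$(printf '%s\n' "$line" | awk '{ print $NF }')
# Treat sustained utilization above 90% as saturation worth acting on.
if awk -v u="$util" 'BEGIN { exit !(u + 0 > 90) }'; then
  verdict="saturated (${util}%)"
else
  verdict="ok (${util}%)"
fi
echo "$verdict"
```

Note that on fast NVMe devices %util can overstate saturation, so in practice you would corroborate with await and queue-depth columns before acting.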
Round 3: Root Cause Identification¶
[Pressure cue: "Server stable. Why was a backup running during peak hours?"]
What you see:
Root cause: The backup cron job was scheduled for 15:00 UTC (intended to run at 03:00, but 3 PM was entered instead of 3 AM when the crontab was written). The rsync backup has no I/O throttling (--bwlimit or ionice), so it consumes all available IOPS. On a busy server, this causes kernel-level I/O stalls and soft lockups.
Choose your action:
- A) Fix the cron schedule to 03:00 and add ionice to the rsync command
- B) Move backups to a dedicated backup server using snapshots
- C) Add I/O throttling to the backup script and keep the current schedule
- D) Fix the schedule, add ionice, and add an I/O utilization alert
If you chose D (recommended):¶
[Result: Cron fixed to 03:00. rsync wrapped with ionice -c 2 -n 7 and --bwlimit=50m. I/O alert added at 90% utilization sustained for 60+ seconds. Proceed to Round 4.]
If you chose A:¶
[Result: Good but no monitoring to detect future I/O saturation from other causes.]
If you chose B:¶
[Result: Correct long-term architecture but requires snapshot infrastructure that does not exist.]
If you chose C:¶
[Result: Throttling helps but running a heavy backup during peak hours is still suboptimal.]
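The schedule-plus-throttling fix from choice D can be sketched as a cron entry wrapping a throttled rsync. The source path, destination host, and 50 MB/s cap below are illustrative placeholders, not values from the incident record:

```shell
# Illustrative source/destination; substitute your real backup targets.
src="/var/www/"
dest="backup-host:/backups/web-prod-09/"

# Best-effort class (-c 2) at the lowest priority (-n 7), plus rsync's
# own bandwidth cap, so the copy can never monopolize the SSD's IOPS.
backup_cmd="ionice -c 2 -n 7 rsync -a --bwlimit=50m $src $dest"

# Corrected schedule: 03:00, not 15:00 (the original AM/PM mix-up).
cron_entry="0 3 * * * $backup_cmd"
echo "$cron_entry"
```

Belt-and-suspenders is deliberate here: ionice yields to application I/O when the disk is busy, while --bwlimit caps throughput even when the idle-class hint is not honored by the active I/O scheduler.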
Round 4: Remediation¶
[Pressure cue: "Server stable. Verify and close."]
Actions:
1. Verify no soft lockup messages: dmesg | grep -i "soft lockup"
2. Verify I/O utilization is normal: iostat -x 1 5
3. Verify server is back in the load balancer pool
4. Fix the cron schedule and add ionice wrapper
5. Add I/O utilization monitoring with alerting
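Steps 1 and 2 above can be scripted into a single verification pass. A sketch, run here against a hypothetical saved kernel log (on the real server you would feed it `dmesg` output directly):

```shell
# Write a hypothetical post-fix kernel log to a temp file; on the real
# host this would be the output of `dmesg` or a saved /var/log/kern.log.
klog=$(mktemp)
printf '%s\n' \
  '[ 9876.1000] EXT4-fs (nvme0n1p2): mounted filesystem' \
  '[ 9900.2000] systemd[1]: Started daily tasks.' > "$klog"

# Verification step 1: no soft lockup messages after the fix.
if grep -qi 'soft lockup' "$klog"; then
  status="LOCKUPS STILL PRESENT"
else
  status="clean"
fi
echo "verification: $status"
rm -f "$klog"
```

A matching check on iostat output (%util back below alert threshold) closes out step 2 before returning the server to the pool.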
Damage Report¶
- Total downtime: 0 (server periodically dropped from LB pool; other servers absorbed traffic)
- Blast radius: Intermittent degradation for ~45 minutes; increased latency on remaining servers
- Optimal resolution time: 8 minutes (read dmesg -> identify I/O contention -> kill rsync)
- If every wrong choice was made: 2+ hours of periodic freezes plus risk of full server lockup
Cross-References¶
- Primer: Kernel Troubleshooting
- Primer: Linux Performance
- Primer: Disk & Storage Ops
- Footguns: Linux Ops