- linux
- l1
- runbook
- linux-performance
- process-management

Portal | Level: L1: Foundations | Topics: Linux Performance Tuning, Process Management | Domain: Linux
# Runbook: High CPU (Runaway Process)
| Field | Value |
|---|---|
| Domain | Linux |
| Alert | node_cpu_seconds_total{mode="idle"} < 0.10 sustained for 5 min (i.e., CPU > 90% busy) |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to the node, sudo access |
## Quick Assessment (30 seconds)
- If output shows one process consuming >80% CPU: a single runaway process. Continue from Step 1.
- If output shows CPU time split across many processes in `sy` (system) mode: a kernel-level issue (high syscall rate, interrupt storm). This may be an I/O saturation or network problem, not a simple runaway.
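The assessment above refers to command output without naming the command. A typical 30-second check (assuming procps `top` and standard `ps` are installed) is:

```shell
# One-shot snapshot: the summary line shows the user (us) vs system (sy)
# split, and the process list shows whether a single PID dominates %CPU.
top -b -n1 -o %CPU | head -15

# Equivalent per-process view with plain ps
ps -eo pid,pcpu,comm --sort=-pcpu | head -10
```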
## Step 1: Identify Top CPU Consumers
Why: You need to know exactly which process or processes are consuming CPU and whether this is expected load or a runaway before taking any action.
```shell
# Snapshot of top CPU consumers (non-interactive)
top -b -n1 -o %CPU | head -30

# More detailed view with threads
ps -eo pid,ppid,pcpu,pmem,nlwp,comm,args --sort=-%cpu | head -20

# Check which processes have consumed the most CPU time overall
ps -eo pid,pcpu,cputime,comm --sort=-%cpu | head -20

# For a Kubernetes node, check which pods are burning CPU
kubectl top pods -A --sort-by=cpu | head -20
kubectl top nodes
```
If `top` itself is slow or unresponsive, the system is severely overloaded. SSH in from a second terminal and use `kill -STOP <PID>` to pause the runaway process temporarily while you investigate (resume it later with `kill -CONT <PID>`).
## Step 2: Profile the Process
Why: Killing a runaway process without understanding what it is doing provides no information to prevent recurrence. A 30-second profile is worth taking.
```shell
# Quick strace summary of what syscalls the process is making (10-second sample)
sudo timeout 10 strace -c -f -p <PID>

# Check if the process is in a tight loop or doing I/O
sudo cat /proc/<PID>/wchan   # Shows what kernel function the process is sleeping in
sudo grep -E "VmRSS|VmSize|Threads|State" /proc/<PID>/status

# For deeper profiling (requires perf installed):
sudo perf top -p <PID>       # Live CPU profile — press 'q' to quit

# For Java: thread dump to see what the JVM is doing
sudo kill -3 <JAVA_PID>      # Prints the thread dump to the JVM's stdout/log
# Or: jstack <JAVA_PID> if JDK tools are available
```

In the `strace` summary, rapid repeated syscalls (read/write/epoll) indicate a tight loop:

```
% time     seconds  usecs/call     calls    errors syscall
 99.00   10.123000           1    101234         0 read
```

Use `sudo perf record -g -p <PID> -- sleep 10 && sudo perf report` for a post-collection report.
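To pull the dominant syscall out of a saved `strace -c` summary programmatically, a small `awk` sketch can help. The file name `strace.txt` and the sample contents below are illustrative:

```shell
# Sample `strace -c` summary (as produced in Step 2), saved for parsing
cat > strace.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.00   10.123000           1    101234         0 read
  1.00    0.102000           2       512         0 epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00   10.225000                101746         0 total
EOF

# Print the syscall with the highest "% time" share, skipping the
# header, separator, and "total" lines.
awk '$NF ~ /^[a-z_]+$/ && $NF != "total" && $1 ~ /^[0-9.]+$/ {
    if ($1 + 0 > max) { max = $1 + 0; top = $NF }
} END { print top, max "%" }' strace.txt
```

If one syscall dominates the time, that is where to look first in the code or in `perf report`.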
## Step 3: Determine If It Is a Tight Loop vs. Legitimate Load
Why: A tight loop (100% on a single core) indicates a bug. Load spread across many cores may be legitimate work. The response is different for each.
```shell
# Check if CPU consumption is on one core (tight loop) or spread
mpstat -P ALL 2 3

# Check if the process has been running long (legitimate) or just started
# (crash loop consuming CPU)
ps -o pid,etimes,pcpu,comm -p <PID>

# Check recent logs for the process
sudo journalctl -u <SERVICE_NAME> --since "10 minutes ago" | tail -50

# For a container: check recent pod logs
kubectl logs <POD_NAME> -n <NAMESPACE> --since=10m | tail -50
```

Legitimate load: CPU spread across many cores, process has run for hours. Tight loop: one CPU at 100%, others idle, process just started. Example `mpstat` output showing a tight loop:

```
CPU    %usr   %sys  %iowait   %irq  %soft  %idle
all    95.0    3.0      0.1    0.0    0.1    1.8
  0    99.8    0.1      0.0    0.0    0.1    0.1   <- tight loop on core 0
  1     0.1    0.1      0.0    0.0    0.0   99.8
```
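Eyeballing per-core stats works, but it can also be scripted. A sketch that flags pegged cores, assuming column layout like the sample output above saved to a file (`mpstat.txt` is an illustrative name):

```shell
# Sample per-core stats in the layout shown above (illustrative)
cat > mpstat.txt <<'EOF'
CPU %usr %sys %iowait %irq %soft %idle
all 95.0 3.0 0.1 0.0 0.1 1.8
0 99.8 0.1 0.0 0.0 0.1 0.1
1 0.1 0.1 0.0 0.0 0.0 99.8
EOF

# Flag individual cores above 95% user CPU: one hot core while the
# others sit idle suggests a single-threaded tight loop.
awk '$1 ~ /^[0-9]+$/ && $2 + 0 > 95 { print "core " $1 " pegged at " $2 "% usr" }' mpstat.txt
```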
## Step 4: Check If It Is a Known Workload or a Runaway
Why: Before terminating any process, verify it is not an expected batch job, reindex operation, or backup task that is legitimately CPU-intensive.
```shell
# Check process start time and command line
ps -o pid,lstart,cmd -p <PID>

# Check if it was recently deployed or changed
sudo journalctl -u <SERVICE_NAME> --since "1 hour ago" | grep -i "start\|restart\|deploy"

# Check crontabs for scheduled jobs that might explain the timing
sudo crontab -l
sudo cat /etc/cron.d/*
sudo cat /var/spool/cron/crontabs/*

# Check systemd timers
systemctl list-timers --all | grep active
```

Known workload: started at a scheduled time, matches a cron job or systemd timer, and the service description explains the load. Runaway: started recently with no matching schedule, consuming far more CPU than usual.
## Step 5: If Runaway — SIGTERM Gracefully, Then SIGKILL
Why: SIGTERM gives the process a chance to flush buffers, release locks, and clean up. SIGKILL is immediate but can leave behind corrupted state, lock files, or partial writes.
```shell
# First: try graceful shutdown
sudo kill -SIGTERM <PID>

# Wait 30 seconds and check if the process has exited
sleep 30
ps -p <PID>

# If still running, then use SIGKILL
sudo kill -SIGKILL <PID>

# For a Kubernetes pod (let Kubernetes handle the restart)
kubectl delete pod <POD_NAME> -n <NAMESPACE>
# This honors the pod's terminationGracePeriodSeconds, then SIGKILL if needed

# Verify the process is gone
ps -p <PID>
```
If the process survives SIGKILL, check its state with `ps -o stat -p <PID>`: a `D` state means it is stuck in an uninterruptible kernel wait (often I/O), and `cat /proc/<PID>/wchan` shows which kernel function it is blocked in. This may require a node reboot.
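The TERM-then-KILL escalation above can be wrapped in a small helper. This is a sketch; `graceful_kill` and its 30-second default are illustrative, not a standard tool:

```shell
# Send SIGTERM, wait up to $2 seconds for the process to exit,
# then escalate to SIGKILL. Usage: graceful_kill <PID> [timeout]
graceful_kill() {
    local pid=$1 timeout=${2:-30}
    kill -TERM "$pid" 2>/dev/null || return 0     # already gone
    local i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # exited cleanly
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid"                             # escalate
}
```

Note that `kill -0` only tests for the PID's existence; when killing a child of your own shell, `wait` is still needed to reap it.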
## Step 6: Investigate Root Cause in Logs and Traces
Why: A killed runaway process that restarts without a root cause fix will become a runaway again.
```shell
# Check application logs for errors leading up to the high CPU event
sudo journalctl -u <SERVICE_NAME> --since "30 minutes ago"

# Check for OOM, exception storms, or infinite-loop indicators
sudo journalctl -u <SERVICE_NAME> | grep -i "error\|exception\|loop\|retry" | tail -50

# Check git/deploy history to see if a recent deploy is correlated
git -C /path/to/app/repo log --oneline --since="2 hours ago" 2>/dev/null

# Review monitoring dashboards for the CPU trend (sudden spike or slow creep?)
```

Ideally you find an error message indicating the root cause, timestamped just before the CPU spike:

```
Mar 19 10:00:05 myapp[12345]: ERROR: Database connection pool exhausted, retrying forever
```
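Bucketing errors per minute makes an error storm easy to spot. A sketch assuming syslog-style timestamps like the line above; the log file `app.log` and its contents are illustrative:

```shell
# Sample syslog-style log (illustrative)
cat > app.log <<'EOF'
Mar 19 09:59:58 myapp[12345]: INFO: request served
Mar 19 10:00:05 myapp[12345]: ERROR: Database connection pool exhausted, retrying forever
Mar 19 10:00:06 myapp[12345]: ERROR: Database connection pool exhausted, retrying forever
Mar 19 10:00:07 myapp[12345]: ERROR: Database connection pool exhausted, retrying forever
Mar 19 10:01:02 myapp[12345]: ERROR: Database connection pool exhausted, retrying forever
EOF

# Total errors, then errors bucketed per minute; a sudden jump in the
# per-minute count marks the start of the storm.
grep -c ERROR app.log
awk '/ERROR/ { print $1, $2, substr($3, 1, 5) }' app.log | sort | uniq -c
```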
## Verification

Success looks like: CPU idle above 20%, no single process consuming more than 80% CPU, and the alert cleared in monitoring. If still broken: escalate, see below.

## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform/Application on-call | "Runaway CPU on node |
| Data loss suspected | Application team lead | "SIGKILL sent to |
| Scope expanding to multiple nodes | SRE lead | "High CPU spreading to multiple nodes, possible cascade from shared bottleneck or bad deploy" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
## Common Mistakes
- Killing with SIGKILL immediately: Always try SIGTERM first and wait 30 seconds. SIGKILL skips cleanup code and can leave lock files, corrupt local state, or drop queued work. SIGTERM gives the application a chance to shut down gracefully.
- Not capturing a profile before killing: A 10-second `perf top` or `strace` run takes almost no time and provides the evidence needed to file a proper bug report. Killing without profiling means the next engineer faces the same mystery.
- Confusing system CPU with user CPU: High `%sys` (kernel) CPU means the kernel is doing most of the work, often due to I/O, network syscalls, or context switching. High `%usr` (user) means application code is the culprit. The fix is different for each.
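The user/system split can be read directly from `/proc/stat` (Linux only). A minimal two-sample sketch; it ignores iowait/irq buckets for simplicity, so treat the percentages as approximate:

```shell
# /proc/stat first line holds cumulative jiffies since boot:
#   cpu  user nice system idle iowait irq softirq ...
read -r _ u1 n1 s1 i1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 rest < /proc/stat
du=$(( (u2 - u1) + (n2 - n1) ))   # user-mode delta (incl. nice)
ds=$(( s2 - s1 ))                 # kernel-mode delta
di=$(( i2 - i1 ))                 # idle delta
tot=$(( du + ds + di ))
echo "user=$(( 100 * du / tot ))% sys=$(( 100 * ds / tot ))% idle=$(( 100 * di / tot ))%"
```

A high `sys` share here points back to the kernel-level diagnosis in the Quick Assessment.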
## Cross-References
- Topic Pack: Linux CPU Performance and Profiling (deep background)
- Related Runbook: OOM Killer Activated
## Related Content
- /proc Filesystem (Topic Pack, L2) — Linux Performance Tuning
- Linux Kernel Tuning (Topic Pack, L2) — Linux Performance Tuning
- Linux Memory Management (Topic Pack, L1) — Linux Performance Tuning
- Linux Performance Flashcards (CLI) (flashcard_deck, L1) — Linux Performance Tuning
- Linux Performance Tuning (Topic Pack, L2) — Linux Performance Tuning
- Linux Processes Flashcards (CLI) (flashcard_deck, L1) — Process Management
- Process Management (Topic Pack, L1) — Process Management
- Runbook: Zombie Processes Accumulating (Runbook, L2) — Process Management