Runbook: High CPU (Runaway Process)

Field                 Value
Domain                Linux
Alert                 avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.10 for 5 min (i.e., CPU > 90% busy)
Severity              P2
Est. Resolution Time  15-30 minutes
Escalation Timeout    30 minutes — page if not resolved
Last Tested           2026-03-19
Prerequisites         SSH access to the node, sudo access

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
top -b -n1 | head -20
If output shows one process consuming >80% CPU: single runaway process, continue from Step 1.
If output shows CPU time split across many processes in sy (system) mode: kernel-level issue (high syscall rate, interrupt storm) — this may be an I/O saturation or network issue, not a simple runaway.
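If top itself is laggy, the same scope check can be done straight from the kernel counters. A minimal sketch, assuming a Linux /proc filesystem (iowait/irq buckets ignored for brevity):

```shell
# Compute overall CPU busy% from two /proc/stat samples,
# independent of top's output formatting.
read -r _ u1 n1 s1 i1 _ < /proc/stat      # user, nice, system, idle jiffies
sleep 2
read -r _ u2 n2 s2 i2 _ < /proc/stat
busy=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) ))
total=$(( busy + (i2 - i1) ))
echo "CPU busy over sample: $(( 100 * busy / total ))%"
```

Anything sustained above ~90% here corresponds to the alert condition.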

Step 1: Identify Top CPU Consumers

Why: You need to know exactly which process or processes are consuming CPU and whether this is expected load or a runaway before taking any action.

# Snapshot of top CPU consumers (non-interactive)
top -b -n1 -o %CPU | head -30

# More detailed view with threads
ps -eo pid,ppid,pcpu,pmem,nlwp,comm,args --sort=-%cpu | head -20

# Check which processes are consuming the most CPU time overall
ps -eo pid,pcpu,cputime,comm --sort=-%cpu | head -20

# For a Kubernetes node, check which pods are burning CPU
kubectl top pods -A --sort-by=cpu | head -20
kubectl top nodes
Expected output:
PID    PPID  %CPU  %MEM  NLWP  COMMAND
12345  1234  98.0   5.2    48  java
If this fails: If top itself is slow or unresponsive, the system is severely overloaded. SSH from a second terminal and use kill -STOP <PID> to pause the runaway process temporarily while you investigate.
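The pause-and-investigate trick works because SIGSTOP cannot be caught or ignored. A sketch of the sequence, using a throwaway `sleep` as a stand-in for the runaway PID:

```shell
# Freeze a process, confirm its state, then resume it.
sleep 60 & pid=$!
kill -STOP "$pid"                             # cannot be caught or ignored
sleep 1                                       # give the kernel a moment to stop it
state=$(awk '{print $3}' "/proc/$pid/stat")   # field 3 is process state; 'T' = stopped
echo "state=$state"
kill -CONT "$pid"                             # resume once you have what you need
kill "$pid"                                   # cleanup for this demo only
```

While the process is stopped it consumes no CPU, so the node stays responsive while you run Steps 2-4.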

Step 2: Profile the Process

Why: Killing a runaway process without understanding what it is doing provides no information to prevent recurrence. A 30-second profile is worth taking.

# Quick strace syscall summary — attach for 10 seconds, then detach
sudo timeout 10 strace -c -f -p <PID>

# Check if the process is in a tight loop or doing I/O
sudo cat /proc/<PID>/wchan     # Shows what kernel function the process is sleeping in
sudo cat /proc/<PID>/status | grep -E "VmRSS|VmSize|Threads|State"

# For deeper profiling (requires perf installed):
sudo perf top -p <PID>         # Live CPU profile — press 'q' to quit

# For Java: thread dump to see what the JVM is doing
sudo kill -3 <JAVA_PID>        # Thread dump goes to the JVM's stdout/log file, not your terminal
# Or: jstack <PID> if jdk tools available
Expected output:
# strace: if you see rapid repeated syscalls (read/write/epoll), it is a tight loop
% time     seconds  usecs/call     calls    errors  syscall
 99.00     10.123         1    101234           0  read
If this fails: If you cannot run strace (process too fast, permissions), use perf record -g -p <PID> -- sleep 10 && perf report for a post-collection report.
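For multi-threaded processes it helps to know which thread is hot before reading a dump; for Java, the thread ID in hex matches the `nid=0x...` field in jstack output. A sketch that inspects this shell's own PID (substitute the PID from Step 1):

```shell
# List per-thread CPU usage and print the hottest thread's ID in hex.
pid=$$
hot_tid=$(ps -L -o tid=,pcpu= -p "$pid" --sort=-pcpu | awk 'NR==1 {print $1}')
printf 'hottest thread: tid=%s (jstack nid=0x%x)\n' "$hot_tid" "$hot_tid"
```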

Step 3: Determine If It Is a Tight Loop vs. Legitimate Load

Why: A tight loop (100% on a single core) indicates a bug. Spread load across many cores may be legitimate work. The response is different.

# Check if CPU consumption is on one core (tight loop) or spread
mpstat -P ALL 2 3

# Check if the process has been running long (legitimate) or just started (crash loop consuming CPU)
ps -o pid,etimes,pcpu,comm -p <PID>

# Check recent logs for the process
sudo journalctl -u <SERVICE_NAME> --since "10 minutes ago" | tail -50

# For container: check recent pod logs
kubectl logs <POD_NAME> -n <NAMESPACE> --since=10m | tail -50
Expected output:
# Legitimate load: CPU spread across many cores, process has run for hours
# Tight loop: one CPU at 100%, others idle, process just started
CPU    %usr   %sys   %iowait  %irq   %soft  %idle
all    95.0    3.0      0.1    0.0     0.1    1.8
0     99.8    0.1      0.0    0.0     0.1    0.1   <- tight loop on core 0
1      0.1    0.1      0.0    0.0     0.0   99.8
If this fails: If the process appears stuck (not a tight loop but high CPU) and logs show errors, it may be waiting on a lock or spinning on a corrupt data structure. Capture a thread dump before proceeding.
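The age check above can be turned into a quick heuristic: a process that is both young and hot is more likely a crash/retry loop than steady-state load. A sketch, with `$$` standing in for the real PID from Step 1:

```shell
# Flag a suspiciously young, CPU-hot process (etimes is elapsed seconds).
pid=$$
read -r age cpu <<<"$(ps -o etimes=,pcpu= -p "$pid")"
echo "age=${age}s cpu=${cpu}%"
if [ "$age" -lt 300 ]; then
  echo "young process: check deploys, cron, and restarts before assuming legitimate load"
fi
```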

Step 4: Check If It Is a Known Workload or a Runaway

Why: Before terminating any process, verify it is not an expected batch job, reindex operation, or backup task that is legitimately CPU-intensive.

# Check process start time and command line
ps -o pid,lstart,cmd -p <PID>

# Check if it was recently deployed or changed
sudo journalctl -u <SERVICE_NAME> --since "1 hour ago" | grep -i "start\|restart\|deploy"

# Check crontab for scheduled jobs that might explain the timing
sudo crontab -l
sudo cat /etc/cron.d/*
sudo cat /var/spool/cron/crontabs/*

# Check systemd timers
systemctl list-timers --all --no-pager
Expected output:
# Known workload: started at a scheduled time, matches cron/systemd timer, service description explains the load
# Runaway: started recently with no matching schedule, consuming far more CPU than usual
If this fails: If unsure whether the load is expected, contact the service owner via your incident channel before killing the process — unnecessary kills can corrupt in-flight work.
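The cron checks above can be collapsed into one recursive grep for the suspect command name. A sketch against a temporary directory (`mybatch` and the temp dir are stand-ins; in production, search /etc/crontab, /etc/cron.d/, and /var/spool/cron/):

```shell
# Find which schedule file, if any, mentions the suspect command.
crondir=$(mktemp -d)
echo '15 3 * * * root /usr/local/bin/mybatch --full' > "$crondir/batch"
hits=$(grep -rl "mybatch" "$crondir" 2>/dev/null)
echo "schedule found in: ${hits:-nothing}"
rm -rf "$crondir"
```

A match tells you both that the load is scheduled and exactly when it will recur.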

Step 5: If Runaway — SIGTERM Gracefully, Then SIGKILL

Why: SIGTERM gives the process a chance to flush buffers, release locks, and clean up. SIGKILL is immediate but can leave behind corrupted state, lock files, or partial writes.

# First: try graceful shutdown
sudo kill -SIGTERM <PID>

# Wait 30 seconds and check if the process has exited
sleep 30
ps -p <PID>

# If still running, then use SIGKILL
sudo kill -SIGKILL <PID>

# For a Kubernetes pod (let Kubernetes handle the restart)
kubectl delete pod <POD_NAME> -n <NAMESPACE>
# This triggers the pod's terminationGracePeriodSeconds, then SIGKILL if needed

# Verify the process is gone
ps -p <PID>
Expected output:
# After SIGTERM or SIGKILL, ps -p <PID> prints only the header and exits non-zero:
  PID TTY          TIME CMD
If this fails: If the process ignores SIGTERM and SIGKILL does not work (rare: zombie or kernel thread), check cat /proc/<PID>/wchan — if it shows a kernel function, the process is stuck in an uninterruptible kernel state. This may require a node reboot.
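The TERM-wait-KILL sequence is worth wrapping in a function so the 30-second grace period is never skipped under pressure. A sketch (`term_then_kill` is a name invented here, not a standard tool):

```shell
# SIGTERM, poll up to 30 seconds, escalate to SIGKILL only if still alive.
term_then_kill() {
  local pid=$1 waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt 30 ]; do
    sleep 1
    waited=$((waited + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"
  fi
}
```

Usage: `term_then_kill 12345`. Note that `kill -0` reports success for your own un-reaped zombie children, so use this against unrelated PIDs, as you would during an incident.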

Step 6: Investigate Root Cause in Logs and Traces

Why: A killed runaway process that restarts without a root cause fix will become a runaway again.

# Check application logs for errors leading up to the high CPU event
sudo journalctl -u <SERVICE_NAME> --since "30 minutes ago"

# Check for OOM, exception storms, or infinite loop indicators
sudo journalctl -u <SERVICE_NAME> | grep -i "error\|exception\|loop\|retry" | tail -50

# Check git/deploy history to see if a recent deploy is correlated
git -C /path/to/app/repo log --oneline --since="2 hours ago" 2>/dev/null

# Review monitoring dashboards for CPU trend (did it spike suddenly or creep up?)
Expected output:
# Ideally: an error message indicating the root cause, timestamped before the CPU spike
Mar 19 10:00:05 myapp[12345]: ERROR: Database connection pool exhausted, retrying forever
If this fails: If no logs explain the behavior, enable debug logging for the service and re-deploy in a non-production environment to reproduce.

Verification

# Confirm the issue is resolved
top -b -n1 | head -5
Success looks like: CPU idle above 20%; no single process consuming more than 80% CPU; alert has cleared in monitoring.
If still broken: escalate — see below.
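The idle threshold can also be checked numerically rather than by eyeballing top. A sketch, assuming vmstat (from procps) is installed; the 15th column of its last sample line is %idle:

```shell
# Numeric recovery check: sample idle% once and compare to the 20% threshold.
idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
echo "idle: ${idle}%"
if [ "$idle" -gt 20 ]; then
  echo "OK: idle above 20%, the alert should clear shortly"
fi
```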

Escalation

Condition                           Who to Page                    What to Say
Not resolved in 30 min              Platform/Application on-call   "Runaway CPU on node <HOSTNAME>, process consuming >90% CPU, node may become unresponsive"
Data loss suspected                 Application team lead          "SIGKILL sent to <PROCESS> on <HOSTNAME> due to CPU runaway, in-flight requests may have been dropped"
Scope expanding to multiple nodes   SRE lead                       "High CPU spreading to multiple nodes, possible cascade from shared bottleneck or bad deploy"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Killing with SIGKILL immediately: Always try SIGTERM first and wait 30 seconds. SIGKILL skips cleanup code and can leave lock files, corrupt local state, or drop queued work. SIGTERM gives the application a chance to shut down gracefully.
  2. Not capturing a profile before killing: A 10-second perf top or strace run takes almost no time and provides the evidence needed to file a proper bug report. Killing without profiling means the next engineer faces the same mystery.
  3. Confusing system CPU with user CPU: High %sys (kernel) CPU means the kernel is doing most of the work — often due to I/O, network syscalls, or context switching. High %usr (user) means application code is the culprit. The fix is different for each.
