On-Call Survival: Linux/OS
Print this. Pin it. Read it at 3 AM.
Alert: Disk Full (/ or /var)
Severity: P1 (root disk full — system will freeze)
First command: df -h
What you're looking for: Which filesystem is at 100% or near it.
Decision tree:
Is /var/log full?
├── Yes → journalctl --vacuum-size=500M
│         Find biggest logs: du -sh /var/log/* | sort -rh | head -10
│         Rotate/truncate: truncate -s 0 /var/log/<big-file>.log
└── No → Is /var/lib/docker or /var/lib/containerd full?
         ├── Yes → docker system prune -f (or crictl rmi --prune for containerd)
         │         WARNING: removes stopped containers, unused networks, dangling images, and build cache
         │         (crictl rmi --prune removes all unused images); confirm none are needed
         └── No → Find what's eating space:
                  du -sh /* 2>/dev/null | sort -rh | head -10
                  du -sh /home/* | sort -rh | head -10
                  Large core dumps? ls -lh /var/crash/ /tmp/core*
                  Escalate if unsure what to delete: "Disk full on <host>, biggest dirs: <paste>"
Escalation trigger: Cannot free space without deleting unknown files; root filesystem at 100%; system calls failing with ENOSPC.
Safe actions: df -h, du to find usage, check logs, journal vacuum.
Dangerous actions: Deleting files in /var/lib/, docker prune, resizing filesystem.
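The "which filesystem is near 100%" check can be scripted so nothing is missed when reading df output at 3 AM. This is a minimal sketch, assuming POSIX-format `df -P` output on stdin. The function name `disk_over_threshold` and its 90% default are illustrative and not part of this runbook. Mount points that contain spaces would be cut off at the first space.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; the threshold percentage is the first
# argument (assumed default: 90).
disk_over_threshold() {
    awk -v t="${1:-90}" '
        NR > 1 {                      # skip the df header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= t + 0) print $6, use "%"
        }'
}
```

A typical read-only invocation would be `df -P | disk_over_threshold 90`.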
Alert: Out of Memory / OOM Kill
Severity: P1 (kernel OOM killer firing)
First command: dmesg -T | grep -i "oom\|killed process" | tail -20
What you're looking for: Which process was killed and how often.
Decision tree:
Is it killing a critical service (database, app server)?
├── Yes → systemctl status <service>; restart if down: systemctl restart <service>
│         Then: free -h; top -b -n 1 -o %MEM | head -20 (find memory hogs)
│         Escalate: "OOM killing <service> on <host>, restart loop possible"
└── No → Is total memory usage near 100%?
         ├── Yes → ps aux --sort=-%mem | head -15
         │         Is a runaway process eating memory? → try kill <pid> first; kill -9 <pid> only as a last resort
         └── No → Could be a memory leak in a specific process.
                  Watch: watch -n 5 'ps aux --sort=-%mem | head -10'
                  Escalate if leak suspected: "Process <name> PID <pid> memory growing on <host>"
Escalation trigger: OOM kill loop; critical database or app repeatedly killed; system unresponsive.
Safe actions: dmesg, free -h, top, ps aux — read-only.
Dangerous actions: kill -9 <pid> (kills the process immediately, with no chance to clean up), sysctl vm.overcommit_memory changes.
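To answer "which process was killed and how often", the kernel log lines can be tallied per process name. This is a rough sketch, assuming the kernel logs OOM kills as `Killed process <pid> (<name>)`. The exact wording varies between kernel versions, so check it against a real line from your hosts. The function name `oom_victims` is hypothetical.

```shell
# Hypothetical helper: count OOM kills per process name.
# Reads kernel log text on stdin and prints "<count> <name>" lines,
# most frequently killed first.
oom_victims() {
    # Keep only the name in parentheses after "Killed process <pid>".
    sed -n 's/.*[Kk]illed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

A read-only invocation would be `dmesg -T | oom_victims`. The same name appearing several times suggests a kill loop, which is an escalation trigger in this section.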
Alert: High CPU / Load Average Spike
Severity: P2
First command: uptime; nproc; ps aux --sort=-%cpu | head -15
What you're looking for: Load average vs CPU count (load > 2x CPU count = problem). Which PID is eating CPU.
Decision tree:
Is load average > 2x CPU count for > 5 min?
├── Yes → Is it one process consuming CPU?
│         ps aux --sort=-%cpu | head -10
│         Known process (backup, cron job)? → Wait for it to finish, or kill the job.
│         Unknown process? → Check: lsof -p <pid>; ls -la /proc/<pid>/exe
└── No (high load but low CPU)?
    → Could be I/O wait: iostat -x 1 5 (look at %iowait column)
      ├── High iowait → Which process? iotop -a -b -n 3 | head -20
      │                 Is a disk failing? dmesg -T | grep -i "error\|fail" | tail -20
      └── Low iowait, high load → CPU throttling? Check cgroup limits.
                                  Escalate: "High load on <host>, low CPU, low iowait: <paste top>"
Escalation trigger: Load average > 4x CPU count sustained; system unresponsive via SSH; suspected runaway/malicious process.
Safe actions: top, ps aux, uptime, dmesg — read-only.
Dangerous actions: kill -9 (stops process), changing CPU limits, rebooting.
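The load thresholds in this section (2x CPU count to investigate, 4x to escalate) are easy to misjudge by eye. This is a small sketch of the comparison. The function name `load_status` and its three output labels are assumptions made for illustration, and the caller is expected to pass in a load average and a CPU count.

```shell
# Hypothetical helper: compare a load average against the CPU count.
# $1 = load average (e.g. 9.5), $2 = number of CPUs (e.g. 4).
# Prints "escalate" above 4x CPUs, "investigate" above 2x, otherwise "ok".
load_status() {
    awk -v l="$1" -v c="$2" 'BEGIN {
        l += 0; c += 0                 # force numeric comparison
        if (l > 4 * c)      print "escalate"
        else if (l > 2 * c) print "investigate"
        else                print "ok"
    }'
}
```

On Linux, `load_status "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)"` feeds it the 5-minute load average, which roughly matches the "for > 5 min" condition in the decision tree.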
Alert: Service Failed (systemd)
Severity: P1 (critical service) / P2 (non-critical)
First command: systemctl status <service>
What you're looking for: Active: failed state, the last error lines, and the exit code.
Decision tree:
Is it a one-time crash or repeated failure?
├── One-time → systemctl restart <service>; watch status for 60s
└── Repeated (failed X times in Y min)?
    → journalctl -u <service> -n 50 --no-pager
      ├── Config error ("invalid config", "permission denied")?
      │   → Fix config; systemctl daemon-reload (only needed if a unit file changed); systemctl restart <service>
      ├── Port already in use?
      │   → ss -tlnp | grep <port>; kill conflicting process
      └── Unknown error → Escalate: "Service <name> in failed loop: <paste journal output>"
Escalation trigger: Critical service (nginx, postgresql, kubelet) cannot be restarted; unit file corruption; socket conflict that cannot be safely resolved.
Safe actions: systemctl status, journalctl -u <service> — read-only.
Dangerous actions: systemctl restart (brief downtime), systemctl daemon-reload, edit unit file.
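The three branches for a repeated failure can be approximated by pattern-matching the journal output. This is only a sketch. The "invalid config" and "permission denied" patterns come from the decision tree, while "address already in use" is an assumed pattern based on the usual EADDRINUSE error text. The function name `classify_failure` and its labels are hypothetical. Real services word their errors differently, so treat an "unknown" result as a cue to read the log, not as a diagnosis.

```shell
# Hypothetical helper: map journal text (on stdin) to a decision-tree branch.
# Prints one of: port-conflict, config-error, unknown.
classify_failure() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -qiE 'address already in use'; then
        echo "port-conflict"     # next step: ss -tlnp | grep <port>
    elif printf '%s\n' "$log" | grep -qiE 'invalid config|permission denied'; then
        echo "config-error"      # next step: fix config, then restart
    else
        echo "unknown"           # next step: escalate with the journal output
    fi
}
```

A read-only invocation would be `journalctl -u <service> -n 50 --no-pager | classify_failure`.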
Alert: Zombie Processes / PID Exhaustion
Severity: P2 (zombie count climbing)
First command:
What you're looking for: Large zombie count, or PID count approaching pid_max (kernel default 32768; often raised on 64-bit systems, so check /proc/sys/kernel/pid_max).
Decision tree:
Zombie count > 20 and growing?
├── Yes → Find parents: ps -eo ppid,stat | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn (zombie count per PPID)
│         kill -9 <parent-pid> (once the parent exits, init adopts and reaps its zombies; confirm the parent is safe to kill)
└── No → PID exhaustion risk?
         ├── Yes → Find the process creating the most PIDs: ps -eLo comm | sort | uniq -c | sort -rn | head -20
         │         Alert app/dev team: "PID exhaustion on <host>, <count> PIDs in use"
         └── No → Monitor: run ps -eo stat | grep -c '^Z' every 5 min; alert if the count stays above 5 or keeps rising
Escalation trigger: PID count > 90% of pid_max; zombie-creating parent is a critical service; system fork-bombing.
Safe actions: Count zombies, identify parent PID — read-only.
Dangerous actions: kill -9 parent process (may affect running service), increase pid_max.
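Both checks in this section (zombies per parent, and PID usage relative to pid_max) reduce to a little counting. This is a minimal sketch, assuming procps `ps` on Linux. The function names `zombie_parents` and `pid_usage_percent` are made up for this example.

```shell
# Hypothetical helper: count zombies per parent PID.
# Reads "<ppid> <stat>" pairs on stdin and prints "<zombie-count> <ppid>",
# worst parent first. A STAT value beginning with Z marks a zombie.
zombie_parents() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Hypothetical helper: PIDs in use as a whole-number percentage of pid_max.
# $1 = number of PIDs in use, $2 = pid_max.
pid_usage_percent() {
    awk -v used="$1" -v max="$2" 'BEGIN { printf "%d\n", used * 100 / max }'
}
```

Read-only invocations would be `ps -e -o ppid= -o stat= | zombie_parents` and `pid_usage_percent "$(ps -eL --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"`. The second counts threads as well as processes, since both consume PIDs.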
Quick Reference
Most Useful Commands
# Disk usage — filesystem level
df -h
# Disk usage — directory level (find what's big)
du -sh /* 2>/dev/null | sort -rh | head -10
# Memory overview
free -h
# OOM kills in kernel log
dmesg -T | grep -i "oom\|killed process" | tail -20
# Top processes by CPU
ps aux --sort=-%cpu | head -15
# Top processes by memory
ps aux --sort=-%mem | head -15
# Load average + CPU count
uptime; nproc
# I/O wait (%iowait in the avg-cpu row) and per-disk utilization
iostat -x 1 3
# Service status
systemctl status <service>
# Service logs (last 50 lines)
journalctl -u <service> -n 50 --no-pager
# Open files for a process
lsof -p <pid>
# Listening ports
ss -tlnp
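When escalating, it helps to capture several of these read-only outputs in one paste. This is a sketch of a wrapper that skips tools missing from the host instead of failing. The names `run_if_present` and `triage_snapshot` are hypothetical, and the selection of commands is only one possible subset of the list above.

```shell
# Hypothetical helper: run a command under a header line, or note that the
# tool is not installed on this host.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '== %s ==\n' "$*"
        "$@" 2>&1
    else
        printf '== %s == (skipped: not installed)\n' "$1"
    fi
}

# Hypothetical wrapper: collect a read-only snapshot for an escalation message.
triage_snapshot() {
    run_if_present df -h
    run_if_present free -h
    run_if_present uptime
    run_if_present nproc
    run_if_present ss -tlnp
}
```

Running `triage_snapshot > /tmp/triage.txt` writes a small file, so on a host with a full disk it may be better to let the output go to the terminal.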
Escalation Contacts
| Situation | Team | Channel |
|---|---|---|
| Disk full — unknown files | Infra / SRE | #infra-oncall |
| OOM loop on critical service | App team + SRE | #infra-oncall |
| System unresponsive | Infra (host-level access) | PagerDuty: infra-critical |
| Suspected compromise | Security | #security-incidents |
Safe vs Dangerous Actions
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Read df, top, ps, dmesg | Kill -9 any process |
| journalctl (read logs) | Restart critical services |
| lsof, ss (read sockets) | Delete files to free space |
| Journal vacuum (logs only) | docker/crictl prune |
| systemctl status | Reboot the host |