
On-Call Survival: Linux/OS

Print this. Pin it. Read it at 3 AM.


Alert: Disk Full (/ or /var)

Severity: P1 (root disk full — system will freeze)

First command:

df -h
What you're looking for: Which filesystem is at 100% or near it.

Decision tree:

Is /var/log full?
├── Yes → journalctl --vacuum-size=500M
│         Find biggest logs: du -sh /var/log/* | sort -rh | head -10
│         Rotate/truncate: truncate -s 0 /var/log/<big-file>.log
└── No → Is /var/lib/docker or /var/lib/containerd full?
    ├── Yes → docker system prune -f   (or crictl rmi --prune for containerd)
    │         WARNING: removes unused images → confirm they are not needed
    └── No → Find what's eating space:
              du -sh /* 2>/dev/null | sort -rh | head -10
              du -sh /home/* | sort -rh | head -10
              Large core dumps? ls -lh /var/crash/ /tmp/core*
              Escalate if unsure what to delete: "Disk full on <host>, biggest dirs: <paste>"

Escalation trigger: Cannot free space without deleting unknown files; root filesystem at 100%; system calls failing with ENOSPC.

Safe actions: df -h, du to find usage, check logs, journal vacuum.

Dangerous actions: Deleting files in /var/lib/, docker prune, resizing filesystem.
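The read-only branches above can be run as a single triage sweep. A sketch, not a fleet tool: the 90% threshold and the `/var/log` focus are assumptions — tune both for your hosts.

```shell
#!/bin/sh
# disk-triage.sh — read-only disk-full triage (assumed threshold: 90%)
THRESHOLD=90

# 1. Filesystems at or above the threshold (-P forces one line per fs)
df -hP | awk -v t="$THRESHOLD" 'NR > 1 {
    gsub("%", "", $5)
    if ($5 + 0 >= t) print "FULL:", $6, $5 "%"
}'

# 2. Ten biggest top-level dirs (-x stays on one filesystem)
du -shx /* 2>/dev/null | sort -rh | head -10

# 3. Ten biggest logs — the usual first suspects
du -sh /var/log/* 2>/dev/null | sort -rh | head -10
```

Everything here is safe to run without approval; it only reads. Paste its output straight into the escalation message.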


Alert: Out of Memory / OOM Kill

Severity: P1 (kernel OOM killer firing)

First command:

dmesg -T | grep -i "oom\|killed process" | tail -20
What you're looking for: Which process was killed and how often.

Decision tree:

Is it killing a critical service (database, app server)?
├── Yes → systemctl status <service>; restart if down: systemctl restart <service>
│         Then: free -h; top -b -n 1 | head -20 (find memory hogs)
│         Escalate: "OOM killing <service> on <host>, restart loop possible"
└── No → Is total memory usage near 100%?
    ├── Yes → ps aux --sort=-%mem | head -15
    │         Runaway process eating memory? → kill -9 <pid> (last resort)
    └── No → Could be a memory leak in a specific process.
              Watch: watch -n 5 'ps aux --sort=-%mem | head -10'
              Escalate if a leak is suspected: "Process <name> PID <pid> memory growing on <host>"

Escalation trigger: OOM kill loop; critical database or app repeatedly killed; system unresponsive.

Safe actions: dmesg, free -h, top, ps aux — read-only.

Dangerous actions: kill -9 <pid> (stops a process immediately), sysctl vm.overcommit_memory changes.
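To see how often the killer has fired and who it hit, the dmesg output can be summarized per victim. A sketch that assumes the stock kernel message format "Killed process <pid> (<name>) ..."; if your kernel logs differ, adjust the sed pattern.

```shell
#!/bin/sh
# oom-summary.sh — count OOM kills per process name from the kernel log
# Assumes the stock message: "Killed process <pid> (<name>) ..."
dmesg -T 2>/dev/null | grep -i 'killed process' |
    sed -n 's/.*[Kk]illed process [0-9]* (\([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -rn
```

A count greater than one for the same name is the restart-loop signal from the tree above. On hosts where dmesg needs root, journalctl -k is an alternative source for the same lines.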


Alert: High CPU / Load Average Spike

Severity: P2

First command:

uptime && top -b -n 1 | head -25
What you're looking for: Load average vs CPU count (load > 2x CPU count = problem). Which PID is eating CPU.

Decision tree:

Is load average > 2x CPU count for > 5 min?
├── Yes → Is one process consuming the CPU?
│         ps aux --sort=-%cpu | head -10
│         Known process (backup, cron job)? → Wait, or stop the cron job.
│         Unknown process? → Check: lsof -p <pid>; ls -la /proc/<pid>/exe
└── No (high load but low CPU)?
     Could be I/O wait: iostat -x 1 5 (look at the %iowait column)
    ├── High iowait → Which process? iotop -a -b -n 3 | head -20
    │                 Is a disk failing? dmesg -T | grep -i "error\|fail" | tail -20
    └── Low iowait, high load → CPU throttling? Check cgroup limits.
                                Escalate: "High load on <host>, low CPU, low iowait: <paste top>"

Escalation trigger: Load average > 4x CPU count sustained; system unresponsive via SSH; suspected runaway/malicious process.

Safe actions: top, ps aux, uptime, dmesg — read-only.

Dangerous actions: kill -9 (stops process), changing CPU limits, rebooting.
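The "load > 2x CPU count" rule at the top of the tree, as a one-shot check. A sketch: the 2x multiplier is this runbook's rule of thumb, not a kernel constant, and the script reads /proc/loadavg, so it is Linux-only.

```shell
#!/bin/sh
# load-check.sh — compare 1-min load average against 2x CPU count
cpus=$(nproc)
load=$(cut -d ' ' -f 1 /proc/loadavg)
# awk does the float comparison; shell arithmetic is integer-only
if awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l > 2 * c) }'; then
    echo "HIGH: load $load > 2x $cpus CPUs"
else
    echo "OK: load $load, $cpus CPUs"
fi
```

If it prints HIGH twice five minutes apart, walk the Yes branch above.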


Alert: Service Failed (systemd)

Severity: P1 (critical service) / P2 (non-critical)

First command:

systemctl status <service-name>
What you're looking for: Active: failed state, the last error lines, exit code.

Decision tree:

Is it a one-time crash or a repeated failure?
├── One-time → systemctl restart <service>; watch status for 60s
└── Repeated (failed X times in Y min)?
     journalctl -u <service> -n 50 --no-pager
    ├── Config error ("invalid config", "permission denied")?
    │   Fix the config; systemctl daemon-reload; systemctl restart <service>
    ├── Port already in use?
    │   ss -tlnp | grep <port>; identify and stop the conflicting process
    └── Unknown error → Escalate: "Service <name> in failed loop: <paste journal output>"

Escalation trigger: Critical service (nginx, postgresql, kubelet) cannot be restarted; unit file corruption; socket conflict that cannot be safely resolved.

Safe actions: systemctl status, journalctl -u <service> — read-only.

Dangerous actions: systemctl restart (brief downtime), systemctl daemon-reload, edit unit file.
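The "restart; watch status for 60s" step can be scripted so you don't forget the watch half. A sketch (the 60-second window matches the tree above; the service name is whatever fired the alert) — remember the restart itself is a dangerous action, so get approval for critical services first.

```shell
# restart_and_watch <service> — restart, then confirm it stays up for 60s
restart_and_watch() {
    svc="$1"
    systemctl restart "$svc" || return 1
    for i in 1 2 3 4 5 6; do
        sleep 10
        if ! systemctl is-active --quiet "$svc"; then
            echo "FAILED after $((i * 10))s"
            journalctl -u "$svc" -n 20 --no-pager
            return 1
        fi
    done
    echo "$svc stayed active for 60s"
}
```

Usage: restart_and_watch nginx. A FAILED line means you are in the Repeated branch — stop restarting and read the journal.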


Alert: Zombie Processes / PID Exhaustion

Severity: P2 (zombie count climbing)

First command:

ps -eo stat | grep -c '^Z'   # count zombies (match STAT, not any line containing Z)
cat /proc/sys/kernel/pid_max && ps -e --no-headers | wc -l
What you're looking for: Large zombie count or PID count approaching pid_max (default 32768).

Decision tree:

Zombie count > 20 and growing?
├── Yes → Find the parents: ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print $2}'
│         Kill the parent (confirm it is safe first) → once it exits, init
│         adopts and reaps the zombies; zombies themselves cannot be killed
└── No → PID exhaustion risk?
    ├── Yes → Find the PID-leaking process: ps aux --sort=pid | tail -20
    │         Alert the app/dev team: "PID exhaustion on <host>, <count> PIDs in use"
    └── No → Monitor: re-check the zombie count every 5 min; alert if it keeps climbing

Escalation trigger: PID count > 90% of pid_max; zombie-creating parent is a critical service; system fork-bombing.

Safe actions: Count zombies, identify parent PID — read-only.

Dangerous actions: kill -9 parent process (may affect running service), increase pid_max.
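Counting zombies and grouping them by parent in one pass. A sketch using ps -eo, which (unlike plain ps aux) exposes the PPID column directly:

```shell
#!/bin/sh
# zombie-triage.sh — count zombies and list the parents that must reap them
# A STAT beginning with Z marks a zombie; $2 is its parent's PID
ps -eo pid,ppid,stat,comm | awk '
    $3 ~ /^Z/ { count++; parents[$2]++ }
    END {
        print "zombies:", count + 0
        for (p in parents) print "parent PPID", p, "owns", parents[p]
    }'
```

One parent owning most of the zombies is the typical pattern: that process is not calling wait() on its children. Check what it is before considering the (dangerous) parent kill.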


Quick Reference

Most Useful Commands

# Disk usage — filesystem level
df -h

# Disk usage — directory level (find what's big)
du -sh /* 2>/dev/null | sort -rh | head -10

# Memory overview
free -h

# OOM kills in kernel log
dmesg -T | grep -i "oom\|killed process" | tail -20

# Top processes by CPU
ps aux --sort=-%cpu | head -15

# Top processes by memory
ps aux --sort=-%mem | head -15

# Load average + CPU count
uptime; nproc

# I/O wait per disk
iostat -x 1 3

# Service status
systemctl status <service>

# Service logs (last 50 lines)
journalctl -u <service> -n 50 --no-pager

# Open files for a process
lsof -p <pid>

# Listening ports
ss -tlnp

Escalation Contacts

Situation                      Team                        Channel
Disk full — unknown files      Infra / SRE                 #infra-oncall
OOM loop on critical service   App team + SRE              #infra-oncall
System unresponsive            Infra (host-level access)   PagerDuty: infra-critical
Suspected compromise           Security                    #security-incidents

Safe vs Dangerous Actions

Safe (do without asking)       Dangerous (get approval)
Read df, top, ps, dmesg        kill -9 any process
journalctl (read logs)         Restart critical services
lsof, ss (read sockets)        Delete files to free space
Journal vacuum (logs only)     docker/crictl prune
systemctl status               Reboot the host

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]