
On-Call Survival: Linux/OS

Print this. Pin it. Read it at 3 AM.


Alert: Disk Full (/ or /var)

Severity: P1 (root disk full — system will freeze)

First command:

df -h
What you're looking for: Which filesystem is at 100% or near it.

Decision tree:

Is /var/log full?
├── Yes → journalctl --vacuum-size=500M
│         Find biggest logs: du -sh /var/log/* | sort -rh | head -10
│         Rotate/truncate: truncate -s 0 /var/log/<big-file>.log
└── No → Is /var/lib/docker or /var/lib/containerd full?
    ├── Yes → docker system prune -f   (or crictl rmi --prune for containerd)
    │         WARNING: removes unused images → confirm they are not needed
    └── No → Find what's eating space:
              du -sh /* 2>/dev/null | sort -rh | head -10
              du -sh /home/* | sort -rh | head -10
              Large core dumps? ls -lh /var/crash/ /tmp/core*
              Escalate if unsure what to delete: "Disk full on <host>, biggest dirs: <paste>"

Escalation trigger: Cannot free space without deleting unknown files; root filesystem at 100%; system calls failing with ENOSPC.

Safe actions: df -h, du to find usage, check logs, journal vacuum.

Dangerous actions: Deleting files in /var/lib/, docker prune, resizing filesystem.
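The read-only branches above can be run as a single triage sweep. A sketch, not a fleet tool: the 90% threshold and the `/var/log` focus are assumptions — tune both for your hosts.

```shell
#!/bin/sh
# disk-triage.sh — read-only disk-full triage (assumed threshold: 90%)
THRESHOLD=90

# 1. Filesystems at or above the threshold (-P forces one line per fs)
df -hP | awk -v t="$THRESHOLD" 'NR > 1 {
    gsub("%", "", $5)
    if ($5 + 0 >= t) print "FULL:", $6, $5 "%"
}'

# 2. Ten biggest top-level dirs (-x stays on one filesystem)
du -shx /* 2>/dev/null | sort -rh | head -10

# 3. Ten biggest logs — the usual first suspects
du -sh /var/log/* 2>/dev/null | sort -rh | head -10
```

Everything here is safe to run without approval; it only reads. Paste its output straight into the escalation message.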


Alert: Out of Memory / OOM Kill

Severity: P1 (kernel OOM killer firing)

First command:

dmesg -T | grep -i "oom\|killed process" | tail -20
What you're looking for: Which process was killed and how often.

Decision tree:

Is it killing a critical service (database, app server)?
├── Yes → systemctl status <service>; restart if down: systemctl restart <service>
│         Then: free -h; top -b -n 1 | head -20 (find memory hogs)
│         Escalate: "OOM killing <service> on <host>, restart loop possible"
└── No → Is total memory usage near 100%?
    ├── Yes → ps aux --sort=-%mem | head -15
    │         Runaway process eating memory? → kill -9 <pid> (last resort)
    └── No → Could be a memory leak in a specific process.
              Watch: watch -n 5 'ps aux --sort=-%mem | head -10'
              Escalate if a leak is suspected: "Process <name> PID <pid> memory growing on <host>"

Escalation trigger: OOM kill loop; critical database or app repeatedly killed; system unresponsive.

Safe actions: dmesg, free -h, top, ps aux — read-only.

Dangerous actions: kill -9 <pid> (stops a process immediately), sysctl vm.overcommit_memory changes.
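To see how often the killer has fired and who it hit, the dmesg output can be summarized per victim. A sketch that assumes the stock kernel message format "Killed process <pid> (<name>) ..."; if your kernel logs differ, adjust the sed pattern.

```shell
#!/bin/sh
# oom-summary.sh — count OOM kills per process name from the kernel log
# Assumes the stock message: "Killed process <pid> (<name>) ..."
dmesg -T 2>/dev/null | grep -i 'killed process' |
    sed -n 's/.*[Kk]illed process [0-9]* (\([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -rn
```

A count greater than one for the same name is the restart-loop signal from the tree above. On hosts where dmesg needs root, journalctl -k is an alternative source for the same lines.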


Alert: High CPU / Load Average Spike

Severity: P2

First command:

uptime && top -b -n 1 | head -25
What you're looking for: Load average vs CPU count (load > 2x CPU count = problem). Which PID is eating CPU.

Decision tree:

Is load average > 2x CPU count for > 5 min?
├── Yes → Is one process consuming the CPU?
│         ps aux --sort=-%cpu | head -10
│         Known process (backup, cron job)? → Wait, or stop the cron job.
│         Unknown process? → Check: lsof -p <pid>; ls -la /proc/<pid>/exe
└── No (high load but low CPU)?
     Could be I/O wait: iostat -x 1 5 (look at the %iowait column)
    ├── High iowait → Which process? iotop -a -b -n 3 | head -20
    │                 Is a disk failing? dmesg -T | grep -i "error\|fail" | tail -20
    └── Low iowait, high load → CPU throttling? Check cgroup limits.
                                Escalate: "High load on <host>, low CPU, low iowait: <paste top>"

Escalation trigger: Load average > 4x CPU count sustained; system unresponsive via SSH; suspected runaway/malicious process.

Safe actions: top, ps aux, uptime, dmesg — read-only.

Dangerous actions: kill -9 (stops process), changing CPU limits, rebooting.
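The "load > 2x CPU count" rule at the top of the tree, as a one-shot check. A sketch: the 2x multiplier is this runbook's rule of thumb, not a kernel constant, and the script reads /proc/loadavg, so it is Linux-only.

```shell
#!/bin/sh
# load-check.sh — compare 1-min load average against 2x CPU count
cpus=$(nproc)
load=$(cut -d ' ' -f 1 /proc/loadavg)
# awk does the float comparison; shell arithmetic is integer-only
if awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l > 2 * c) }'; then
    echo "HIGH: load $load > 2x $cpus CPUs"
else
    echo "OK: load $load, $cpus CPUs"
fi
```

If it prints HIGH twice five minutes apart, walk the Yes branch above.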


Alert: Service Failed (systemd)

Severity: P1 (critical service) / P2 (non-critical)

First command:

systemctl status <service-name>
What you're looking for: Active: failed state, the last error lines, exit code.

Decision tree:

Is it a one-time crash or a repeated failure?
├── One-time → systemctl restart <service>; watch status for 60s
└── Repeated (failed X times in Y min)?
     journalctl -u <service> -n 50 --no-pager
    ├── Config error ("invalid config", "permission denied")?
    │   Fix the config; systemctl daemon-reload; systemctl restart <service>
    ├── Port already in use?
    │   ss -tlnp | grep <port>; identify and stop the conflicting process
    └── Unknown error → Escalate: "Service <name> in failed loop: <paste journal output>"

Escalation trigger: Critical service (nginx, postgresql, kubelet) cannot be restarted; unit file corruption; socket conflict that cannot be safely resolved.

Safe actions: systemctl status, journalctl -u <service> — read-only.

Dangerous actions: systemctl restart (brief downtime), systemctl daemon-reload, edit unit file.
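The "restart; watch status for 60s" step can be scripted so you don't forget the watch half. A sketch (the 60-second window matches the tree above; the service name is whatever fired the alert) — remember the restart itself is a dangerous action, so get approval for critical services first.

```shell
# restart_and_watch <service> — restart, then confirm it stays up for 60s
restart_and_watch() {
    svc="$1"
    systemctl restart "$svc" || return 1
    for i in 1 2 3 4 5 6; do
        sleep 10
        if ! systemctl is-active --quiet "$svc"; then
            echo "FAILED after $((i * 10))s"
            journalctl -u "$svc" -n 20 --no-pager
            return 1
        fi
    done
    echo "$svc stayed active for 60s"
}
```

Usage: restart_and_watch nginx. A FAILED line means you are in the Repeated branch — stop restarting and read the journal.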


Alert: Zombie Processes / PID Exhaustion

Severity: P2 (zombie count climbing)

First command:

ps -eo stat | grep -c '^Z'   # count zombies (match STAT, not any line containing Z)
cat /proc/sys/kernel/pid_max && ps -e --no-headers | wc -l
What you're looking for: Large zombie count or PID count approaching pid_max (default 32768).

Decision tree:

Zombie count > 20 and growing?
├── Yes → Find the parents: ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print $2}'
│         Kill the parent (confirm it is safe first) → once it exits, init
│         adopts and reaps the zombies; zombies themselves cannot be killed
└── No → PID exhaustion risk?
    ├── Yes → Find the PID-leaking process: ps aux --sort=pid | tail -20
    │         Alert the app/dev team: "PID exhaustion on <host>, <count> PIDs in use"
    └── No → Monitor: re-check the zombie count every 5 min; alert if it keeps climbing

Escalation trigger: PID count > 90% of pid_max; zombie-creating parent is a critical service; system fork-bombing.

Safe actions: Count zombies, identify parent PID — read-only.

Dangerous actions: kill -9 parent process (may affect running service), increase pid_max.
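Counting zombies and grouping them by parent in one pass. A sketch using ps -eo, which (unlike plain ps aux) exposes the PPID column directly:

```shell
#!/bin/sh
# zombie-triage.sh — count zombies and list the parents that must reap them
# A STAT beginning with Z marks a zombie; $2 is its parent's PID
ps -eo pid,ppid,stat,comm | awk '
    $3 ~ /^Z/ { count++; parents[$2]++ }
    END {
        print "zombies:", count + 0
        for (p in parents) print "parent PPID", p, "owns", parents[p]
    }'
```

One parent owning most of the zombies is the typical pattern: that process is not calling wait() on its children. Check what it is before considering the (dangerous) parent kill.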


Quick Reference

Most Useful Commands

# Disk usage — filesystem level
df -h

# Disk usage — directory level (find what's big)
du -sh /* 2>/dev/null | sort -rh | head -10

# Memory overview
free -h

# OOM kills in kernel log
dmesg -T | grep -i "oom\|killed process" | tail -20

# Top processes by CPU
ps aux --sort=-%cpu | head -15

# Top processes by memory
ps aux --sort=-%mem | head -15

# Load average + CPU count
uptime; nproc

# I/O wait per disk
iostat -x 1 3

# Service status
systemctl status <service>

# Service logs (last 50 lines)
journalctl -u <service> -n 50 --no-pager

# Open files for a process
lsof -p <pid>

# Listening ports
ss -tlnp

Escalation Contacts

Situation                      Team                        Channel
Disk full — unknown files      Infra / SRE                 #infra-oncall
OOM loop on critical service   App team + SRE              #infra-oncall
System unresponsive            Infra (host-level access)   PagerDuty: infra-critical
Suspected compromise           Security                    #security-incidents

Safe vs Dangerous Actions

Safe (do without asking)       Dangerous (get approval)
Read df, top, ps, dmesg        kill -9 any process
journalctl (read logs)         Restart critical services
lsof, ss (read sockets)        Delete files to free space
Journal vacuum (logs only)     docker/crictl prune
systemctl status               Reboot the host

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]