Linux System Administration - Street Ops

What experienced sysadmins know that textbooks don't teach.

Incident Runbooks

Disk Full Emergency

1. Identify the problem:
   df -h                         # Which filesystem is full?
   df -i                         # Is it inode exhaustion instead?

2. Find what's consuming space:
   du -sh /* 2>/dev/null | sort -rh | head -10
   # Drill into the biggest directory:
   du -sh /var/* | sort -rh | head -10
   du -sh /var/log/* | sort -rh | head -10

3. Quick wins to free space:
   # Truncate (don't delete) active log files:
   : > /var/log/huge-app.log     # Truncates to 0 bytes; the writer's file handle stays valid
   # WRONG: rm /var/log/huge-app.log  # Process still holds the handle, so space is not freed

   # Clean package manager cache:
   dnf clean all                 # RHEL/CentOS
   apt clean                     # Debian/Ubuntu

   # Find and remove old journal logs:
   journalctl --disk-usage
   journalctl --vacuum-size=500M

   # Find deleted-but-held files (space used, not visible in du):
   lsof +L1 | grep deleted
   # Restart the process holding the file to release space

4. Inode exhaustion (df shows space available, but "No space left on device"):
   find / -xdev -type d | while IFS= read -r d; do echo "$(ls -A "$d" 2>/dev/null | wc -l) $d"; done | sort -rn | head -20
   # Usually: millions of tiny session/cache files in one directory
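
Steps 1 and 4 can be turned into a quick filter. A minimal sketch, assuming POSIX-format output from df -P (blocks) or GNU df -Pi (inodes) on stdin. The function name full_filesystems and the default threshold of 90 are illustrative, and mount points containing spaces are not handled:

```shell
# full_filesystems [THRESHOLD]
# Reads `df -P` or `df -Pi` output on stdin and prints "mountpoint use%"
# for every filesystem at or above THRESHOLD percent (default 90).
# Hypothetical helper: the name and default are not from any standard tool.
full_filesystems() {
    awk -v t="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)      # "95%" -> "95"; "-" (no inode data) counts as 0
            if (pct + 0 >= t + 0) print $6, $5
        }'
}

# Usage (not run here):
#   df -P  | full_filesystems 90    # block usage
#   df -Pi | full_filesystems 90    # inode usage
```

Running both variants side by side shows at a glance whether the problem is blocks or inodes.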

Zombie Processes

1. Identify zombies:
   ps aux | awk '$8 ~ /^Z/'
   # STAT can be "Z", "Zs", "Z+", so match on the prefix
   # Or: ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'

2. Understand what they are:
   - A zombie is a process that finished but its parent hasn't read the exit status
   - They consume no CPU/memory, just a PID table entry
   - A few zombies are usually harmless
   - Thousands of zombies indicate a buggy parent process

3. Find the parent:
   ps -eo pid,ppid,stat,cmd | awk 'NR == 1 || $3 ~ /^Z/'
   # Note the PPID column - that's the parent to investigate

4. Fix:
   - Send SIGCHLD to the parent: kill -s CHLD <ppid> (only helps if the parent handles SIGCHLD by calling wait)
   - If parent is broken, restart the parent process
   - SIGKILL on the zombie itself does nothing (it's already dead)
   - If the parent exits, its zombies are re-parented to PID 1 (init/systemd), which reaps them
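
To see which parent is leaking zombies, the ps output from step 3 can be grouped by PPID. A minimal sketch, assuming procps-style `ps -eo pid,ppid,stat,comm` output on stdin. The function name zombies_by_parent is illustrative:

```shell
# zombies_by_parent
# Reads `ps -eo pid,ppid,stat,comm` output on stdin and prints
# "count ppid" for each parent that owns zombies, worst offender first.
zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Usage (not run here):
#   ps -eo pid,ppid,stat,comm | zombies_by_parent
```

A single PPID with a large count is the process to investigate or restart.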

Stuck/Hung Mount (NFS, CIFS)

1. Symptoms:
   - ls or df hangs when accessing the mount point
   - Processes accessing the mount go to "D" state (uninterruptible sleep)
   - Even `umount` hangs

2. Diagnose:
   mount | grep nfs               # Check NFS mounts
   nfsstat -m                     # NFS mount options and server
   showmount -e <nfs-server>      # What the server exports

3. Identify stuck processes:
   lsof +f -- /mnt/nfs-share     # What's using the mount?
   fuser -vm /mnt/nfs-share      # Processes using mount

4. Recovery:
   # Lazy unmount (detaches immediately, cleans up when idle):
   umount -l /mnt/nfs-share

   # Force unmount (for NFS):
   umount -f /mnt/nfs-share

   # Nuclear option - kill all processes using the mount:
   fuser -km /mnt/nfs-share
   umount -l /mnt/nfs-share

5. Prevention:
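   - Mount NFS with soft,timeo=30,retrans=3 (returns an I/O error instead of hanging; timeo is in tenths of a second, so 30 = 3 s per attempt). Soft mounts can silently corrupt data on interrupted writes, so prefer them for read-mostly shares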
   - Use autofs for on-demand mounting
   - Monitor NFS server availability
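
The "D" state symptom from step 1 is easy to check without touching the hung mount itself. A minimal sketch, assuming procps-style `ps -eo pid,stat,wchan:32,comm` output on stdin. The function name d_state_procs is illustrative:

```shell
# d_state_procs
# Reads `ps -eo pid,stat,wchan:32,comm` output on stdin and keeps the header
# plus every process whose state starts with D (uninterruptible sleep).
d_state_procs() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage (not run here):
#   ps -eo pid,stat,wchan:32,comm | d_state_procs
```

The WCHAN column shows where in the kernel each process is sleeping; names mentioning nfs or rpc often point at the stuck mount.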

Log Rotation Gone Wrong

1. Symptoms:
   - Log files growing without bound
   - Disk filling up from /var/log
   - Application writing to rotated file (old file descriptor)

2. Check logrotate config:
   cat /etc/logrotate.conf               # Global config
   ls /etc/logrotate.d/                   # Per-app configs
   cat /etc/logrotate.d/nginx            # Specific app

3. Test manually:
   logrotate -d /etc/logrotate.d/nginx   # Dry run (debug mode)
   logrotate -f /etc/logrotate.d/nginx   # Force rotation now

4. Common problems:
   - Missing "copytruncate" or "postrotate" signal:
     # Without these, the app keeps writing to the old (now renamed) file
     # Fix: add "copytruncate" for apps that can't reopen their log on a signal
     # (a few lines written between the copy and the truncate can be lost)
     # Or: add postrotate script to send SIGHUP

   - Wrong permissions on rotated files:
     # Add: create 0640 appuser appgroup

   - logrotate state file corrupted:
     cat /var/lib/logrotate/status
     # Remove the entry for the problematic log and re-run

5. Sample logrotate config:
   /var/log/myapp/*.log {
       daily
       rotate 14
       compress
       delaycompress
       missingok
       notifempty
       copytruncate
       maxsize 500M
   }
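
The first problem in step 4 can be checked mechanically. A minimal sketch that reads one logrotate config on stdin and reports which reopen strategy it declares. It looks at the whole file rather than per stanza, and the function name reopen_strategy is illustrative:

```shell
# reopen_strategy
# Reads a logrotate config on stdin and prints "copytruncate", "postrotate",
# or "none". Simplification: scans the whole file, not each stanza separately.
reopen_strategy() {
    awk '
        $1 ~ /^#/            { next }      # skip comment lines
        $1 == "copytruncate" { ct = 1 }
        $1 == "postrotate"   { pr = 1 }
        END {
            if (ct)      print "copytruncate"
            else if (pr) print "postrotate"
            else         print "none"
        }'
}

# Usage (not run here):
#   for f in /etc/logrotate.d/*; do
#       printf '%s: ' "$f"; reopen_strategy < "$f"
#   done
```

A result of "none" does not always mean a bug (some daemons reopen their logs on their own), but it is the first config to read when an app keeps writing to a rotated file.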

Kernel Panic / System Crash

1. After reboot, find what happened:
   journalctl -b -1                # Previous boot logs (needs a persistent journal, i.e. /var/log/journal exists)
   journalctl -k -b -1             # Kernel messages from previous boot
   last -x | head -20              # Reboot/shutdown history
   dmesg -T | tail -100            # Current boot kernel messages

2. Common causes:
   - OOM killer:
     journalctl -b -1 | grep -i "out of memory"
     journalctl -b -1 | grep -i "oom-killer"
     # Shows which process was killed and memory state

   - Kernel bug/driver crash:
     journalctl -b -1 | grep -i "panic\|BUG\|oops"

   - Hardware failure:
     journalctl -b -1 | grep -iE "hardware error|mce|edac"

3. OOM prevention:
   # Check current memory pressure:
   free -h
   grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree" /proc/meminfo

   # Check which processes use the most memory:
   ps aux --sort=-%mem | head -10

   # Tune OOM killer priority:
   echo -1000 > /proc/<pid>/oom_score_adj   # Protect this process
   echo 1000 > /proc/<pid>/oom_score_adj    # Kill this one first

4. Enable crash dumps for future analysis:
   # kdump needs memory reserved via crashkernel= on the kernel command line
   systemctl enable --now kdump
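
For the OOM case in step 2, the victim can be pulled straight out of the kernel log. A minimal sketch, assuming the kernel's "Killed process <pid> (<name>)" message format (the exact wording varies between kernel versions). The function name oom_victims is illustrative:

```shell
# oom_victims
# Reads kernel log lines on stdin and prints "pid name" for each
# "Killed process <pid> (<name>)" line. The message format is an assumption;
# check it against your kernel's actual output.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage (not run here):
#   journalctl -k -b -1 | oom_victims
```

If the same process name shows up repeatedly, look at its memory limits before tuning oom_score_adj.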

Service Won't Start

1. Get the full picture:
   systemctl status myservice -l      # Status + recent logs
   journalctl -u myservice --no-pager # Full log history

2. Common failure reasons:
   - Port already in use:
     ss -tlnp | grep :8080
     # Another process is listening. Find it, decide which wins.

   - Permission denied:
     # Check file ownership matches the User= in the unit file
     # Check SELinux: ausearch -m avc -ts recent
     # Check capabilities if using non-root user

   - Missing dependency:
     systemctl list-dependencies myservice
     # A required service (database, network) isn't up yet

   - Config syntax error:
     # Most services have a config test mode:
     nginx -t
     httpd -t
     named-checkconf

   - ExecStart path wrong:
     systemctl cat myservice    # Show the actual unit file
     which myapp                # Verify the binary path
     ls -l /usr/local/bin/myapp # Verify the execute bit is set
     file /usr/local/bin/myapp  # Verify the type (ELF binary, script) and architecture

3. SELinux (RHEL/CentOS):
   getenforce                         # Is SELinux enforcing?
   ausearch -m avc -ts recent         # Recent denials
   sealert -a /var/log/audit/audit.log # Human-readable
   restorecon -Rv /var/www/html       # Reset file contexts
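
The "ExecStart path wrong" check can be scripted. A minimal sketch that extracts the binary from unit-file text on stdin. It takes the first ExecStart= line only, so drop-in overrides are ignored, and the function name execstart_binary is illustrative:

```shell
# execstart_binary
# Reads unit-file text on stdin and prints the executable named in the first
# non-empty ExecStart= line, with systemd's prefix characters (- @ : + !) removed.
execstart_binary() {
    sed -n 's/^ExecStart=[-@:+!]*\([^[:space:]][^[:space:]]*\).*/\1/p' | head -n 1
}

# Usage (not run here):
#   bin="$(systemctl cat myservice | execstart_binary)"
#   [ -x "$bin" ] || echo "missing or not executable: $bin"
```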

Gotchas & War Stories

The deleted-but-open file trap: You rm a 50GB log file, but df still shows the disk full because the process still has the file handle open. The space isn't freed until the process closes the file or restarts. Use lsof +L1 | grep deleted to find these. Truncate instead of deleting, e.g. with : > /path/to/file.
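
The trap is easy to reproduce safely. A minimal sketch, assuming Linux with /proc mounted. The function name demo_deleted_but_open and the use of file descriptor 9 are arbitrary choices for the demonstration:

```shell
# demo_deleted_but_open
# Creates a temp file, holds it open, deletes it, then shows the data is
# still there until the handle is closed. Linux-only (relies on /proc).
demo_deleted_but_open() {
    tmp="$(mktemp)" || return 1
    exec 9>"$tmp"            # hold a write handle open, like a logging daemon would
    echo "still here" >&9
    rm -f "$tmp"             # unlink the name; the inode survives while fd 9 is open
    cat /proc/self/fd/9      # cat inherits fd 9, so the "deleted" data is still readable
    exec 9>&-                # closing the last handle is what actually frees the space
}

# Usage (not run here):
#   demo_deleted_but_open
```

The same mechanism explains why restarting (or otherwise making the process close the file) is what finally releases the space.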

The /etc/resolv.conf overwrite: NetworkManager, systemd-resolved, and DHCP clients all fight over /etc/resolv.conf. You edit it manually, then it gets overwritten on the next DHCP renewal. Fix: configure DNS through the appropriate manager (nmcli, netplan, or resolved.conf).

sudo vs su: sudo su - gives you a root login shell with root's environment; sudo -i does the same without the extra su. sudo <command> runs just that command as root. Skip sudo su: use sudo -i for a root shell, or sudo per command so each action is logged for auditing.

The fork bomb: the one-liner :(){ :|:& };: will bring an unprotected system to its knees in seconds. Protect against it by setting nproc limits in /etc/security/limits.conf. Every user should have a max process limit.

Timezone pain: A server in UTC logs an event with one timestamp, and an application in local time logs the same event with another. Always run servers in UTC: timedatectl set-timezone UTC.

Essential Troubleshooting Commands

# System overview
uptime                         # Load average, uptime
free -h                        # Memory usage
vmstat 1 5                     # CPU, memory, I/O snapshot
iostat -xz 1 5                 # Disk I/O per device
sar -u 1 5                     # CPU utilization over time

# Process investigation
strace -p <pid> -f -e trace=network   # System calls (network)
strace -p <pid> -c                    # Syscall summary
lsof -p <pid>                         # Open files/sockets
pmap <pid>                            # Memory map

# File and search
find / -name "*.conf" -mtime -1       # Config files changed in last day
find / -perm -4000 -type f            # SUID files
grep -r "error" /var/log/ --include="*.log" -l   # Files containing errors

# User activity
w                              # Who is logged in and what they're doing
last -20                       # Recent logins
lastb -20                      # Failed login attempts
ausearch -m USER_LOGIN -ts today   # Audit log (if auditd running)

Network Diagnostics

Task: Is the Interface Up and Configured?

# Brief view — fastest way to check link and IP state
$ ip -br link show
lo      UP      00:00:00:00:00:00 <LOOPBACK,UP>
eth0    UP      aa:bb:cc:dd:ee:ff <BROADCAST,MULTICAST,UP>
eth1    DOWN    11:22:33:44:55:66 <BROADCAST,MULTICAST>

$ ip -br addr show
lo      UP      127.0.0.1/8 ::1/128
eth0    UP      10.0.0.5/24
eth1    DOWN

# eth1 is down — bring it up
$ ip link set eth1 up

Task: Find What Is Listening on a Port

# Something is already using port 8080
$ ss -tlnp | grep 8080
LISTEN  0  128  *:8080  *:*  users:(("java",pid=12345,fd=42))

# It is a java process with PID 12345
$ ps -p 12345 -o pid,user,cmd
  PID USER     CMD
12345 appuser  /usr/bin/java -jar myapp.jar
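
A plain grep 8080 also matches ports like 18080. A minimal sketch of an exact-port filter, assuming `ss -tlnp` output on stdin in the column layout shown above. The function name listeners_on_port is illustrative:

```shell
# listeners_on_port PORT
# Reads `ss -tlnp` output on stdin and prints "pid name" for each listener
# whose local address ends in exactly :PORT.
listeners_on_port() {
    awk -v suffix=":$1" '
        $1 == "LISTEN" {
            n = length($4) - length(suffix)
            if (n >= 0 && substr($4, n + 1) == suffix) print
        }' |
    sed -n 's/.*(("\([^"]*\)",pid=\([0-9][0-9]*\).*/\2 \1/p'
}

# Usage (not run here; -p needs root to show other users' processes):
#   ss -tlnp | listeners_on_port 8080
```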

Task: Test Connectivity to a Remote Port

# Is the remote service reachable?
$ nc -zv 10.0.0.20 5432
Connection to 10.0.0.20 5432 port [tcp/postgresql] succeeded!

# Timeout quickly if not reachable
$ nc -zv -w 3 10.0.0.20 3306
nc: connect to 10.0.0.20 port 3306 (tcp) timed out: Operation now in progress

Task: Capture Traffic to Debug Application Issues

# See all traffic to/from a specific host on port 443
$ tcpdump -i eth0 -nn host 10.0.0.20 and port 443 -c 20
14:23:01.001 IP 10.0.0.5.48230 > 10.0.0.20.443: Flags [S], seq 1234
14:23:01.002 IP 10.0.0.20.443 > 10.0.0.5.48230: Flags [S.], seq 5678, ack 1235
14:23:01.002 IP 10.0.0.5.48230 > 10.0.0.20.443: Flags [.], ack 5679

# SYN, SYN-ACK, ACK — three-way handshake completes. Connection works.

# Capture to file for Wireshark analysis
$ tcpdump -i eth0 -w /tmp/capture.pcap -c 5000 host 10.0.0.20

# Look for RST packets (connection resets)
$ tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-rst != 0 and host 10.0.0.20'

Task: Trace the Network Path

# Continuous traceroute with packet loss stats
$ mtr -rw -c 50 8.8.8.8
HOST                    Loss%  Snt  Avg  Best  Wrst  StDev
1. 10.0.0.1             0.0%   50   0.5   0.3   1.2   0.2
2. 172.16.0.1            0.0%   50   2.1   1.8   4.5   0.4
3. ???                  100.0%  50   0.0   0.0   0.0   0.0
4. 72.14.215.65          2.0%   50  12.3  11.5  18.7   1.1

# Hop 3 never answers probes (ICMP filtered or rate-limited). Later hops respond, so this is not real loss; only loss that persists through to the final hop matters.

# TCP traceroute — bypasses firewalls that block ICMP/UDP
$ traceroute -T -p 443 -n 8.8.8.8

Task: Check Which Route the Kernel Uses

# Where does traffic to 10.100.5.3 go?
$ ip route get 10.100.5.3
10.100.5.3 via 10.0.0.1 dev eth0 src 10.0.0.5 uid 0

# Show all routes
$ ip route show
default via 10.0.0.1 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

Task: Diagnose NIC Errors

# Check for physical layer problems
$ ethtool eth0 | grep -E "Speed|Duplex|Link"
Speed: 10000Mb/s
Duplex: Full
Link detected: yes

# Check error counters
$ ethtool -S eth0 | grep -iE "error|drop|crc"
rx_errors: 0
tx_errors: 0
rx_dropped: 847
rx_crc_errors: 0

# rx_dropped > 0 — kernel dropping packets. Check ring buffer:
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:     4096
Current hardware settings:
RX:      256

# Ring buffer is small — increase it
$ ethtool -G eth0 rx 4096
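
The comparison between the current and maximum RX ring size can be automated. A minimal sketch, assuming `ethtool -g` output on stdin in the layout shown above. The function name rx_ring_headroom and its message wording are illustrative:

```shell
# rx_ring_headroom
# Reads `ethtool -g <iface>` output on stdin and reports whether the current
# RX ring is below the pre-set maximum.
rx_ring_headroom() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        $1 == "RX:"                   { val[section] = $2 }
        END {
            if (val["cur"] + 0 < val["max"] + 0)
                print "RX ring " val["cur"] " of " val["max"] ": can be increased"
            else
                print "RX ring " val["cur"] " of " val["max"] ": already at maximum"
        }'
}

# Usage (not run here):
#   ethtool -g eth0 | rx_ring_headroom
```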

Task: DNS Troubleshooting

# Quick resolution check
$ dig +short example.com
93.184.216.34

# Full query with timing
$ dig example.com
;; Query time: 12 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)

# Query a specific DNS server
$ dig @8.8.8.8 example.com A

# Reverse lookup
$ dig -x 93.184.216.34

# Check if DNS traffic is flowing
$ tcpdump -i eth0 -nn port 53 -c 5

Task: Bandwidth Test Between Two Hosts

# On the server side
$ iperf3 -s

# On the client side
$ iperf3 -c 10.0.0.20 -t 10
[ ID] Interval       Transfer   Bitrate
[  5] 0.00-10.00 sec  11.2 GBytes  9.62 Gbits/sec

Task: Find Connections in Bad States

# Too many TIME_WAIT connections — port exhaustion risk
$ ss -Htn state time-wait | wc -l
14823

# CLOSE_WAIT accumulating — app not closing connections
$ ss -tnp state close-wait
CLOSE-WAIT  1  0  10.0.0.5:45678  10.0.0.20:5432  users:(("python",pid=9876,fd=12))

# Socket summary
$ ss -s
Total: 1284
TCP:   982 (estab 456, closed 234, orphaned 12, timewait 287)
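
For a per-state breakdown instead of one state at a time, the state column can be tallied. A minimal sketch, assuming `ss -tan` output on stdin with the state in the first column and one header line. The function name count_tcp_states is illustrative:

```shell
# count_tcp_states
# Reads `ss -tan` output on stdin and prints "count STATE", most common first.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# Usage (not run here):
#   ss -tan | count_tcp_states
```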

Task: Debug HTTP with curl

# Verbose connection + TLS details
$ curl -vvv https://api.example.com/health 2>&1 | head -30

# Just timing and status code
$ curl -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' \
    https://api.example.com/health
HTTP 200 in 0.234s

# Resolve to a specific IP (bypass DNS)
$ curl --resolve api.example.com:443:10.0.0.50 https://api.example.com/health

Power One-Liners

Pretty-print mounted filesystems

mount | column -t

or the modern version:

findmnt --real -o TARGET,SOURCE,FSTYPE,OPTIONS

[!TIP] When to use: Quick filesystem overview during storage triage.

Create temporary RAM disk

mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk

Breakdown: tmpfs lives entirely in RAM (+ swap if needed). Lightning fast I/O. Data lost on unmount/reboot. Size can use m, g suffixes.

[!TIP] When to use: Speeding up builds, test suites, or temporary processing of large datasets.

Schedule a one-off command (no cron needed)

echo "systemctl restart nginx" | at 02:00
at now + 30 minutes <<< "echo 'reminder: check the deploy' | mail -s 'Deploy check' ops@example.com"

Breakdown: at is the forgotten cousin of cron — schedules a one-time execution. Supports natural time specs: midnight, noon, now + 2 hours, teatime (4pm). Job output is mailed.

[!TIP] When to use: Scheduling a restart during maintenance window, delayed cleanup tasks, reminders.

Follow log with full navigation

less +F /var/log/syslog

Breakdown: +F starts less in "follow mode" (like tail -f). Press ctrl-c to stop following and use full less navigation (search, scroll). Press F to resume following. Best of both worlds.

[!TIP] When to use: Tailing logs when you also need to search backward through them.

Emergency graceful reboot (frozen system)

Alt+SysRq+R-E-I-S-U-B

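Breakdown: Magic SysRq key combo: unRaw (take the keyboard out of raw mode), tErminate all processes (SIGTERM), kIll all (SIGKILL), Sync disks, remoUnt filesystems read-only, reBoot. Pause a second or two between keys. Requires SysRq to be enabled (kernel.sysrq sysctl). Mnemonic: "Reboot Even If System Utterly Broken."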

[!TIP] When to use: System completely frozen, no SSH, no shell. Last resort before hard power cycle.

Find the 20 largest files on disk

find / -xdev -type f -print0 | xargs -0 du -h | sort -rh | head -20

Breakdown: -xdev stays on one filesystem (don't cross into /proc, /sys). -print0/-0 handles spaces in names. sort -rh sorts human-readable sizes in reverse.

[!TIP] When to use: Emergency disk space triage — / is 95% full and you need to find what's eating it.

Find files modified in the last hour

find /var/log -xdev -mmin -60 -type f -ls

[!TIP] When to use: Investigating what changed after a deployment or incident.
