Linux System Administration - Street Ops¶
What experienced sysadmins know that textbooks don't teach.
Incident Runbooks¶
Disk Full Emergency¶
1. Identify the problem:
df -h # Which filesystem is full?
df -i # Is it inode exhaustion instead?
2. Find what's consuming space:
du -sh /* 2>/dev/null | sort -rh | head -10
# Drill into the biggest directory:
du -sh /var/* | sort -rh | head -10
du -sh /var/log/* | sort -rh | head -10
3. Quick wins to free space:
# Truncate (don't delete) active log files:
> /var/log/huge-app.log # Truncates to 0 without breaking file handle
# WRONG: rm /var/log/huge-app.log # Process still holds handle, space not freed
# Clean package manager cache:
dnf clean all # RHEL/CentOS
apt clean # Debian/Ubuntu
# Find and remove old journal logs:
journalctl --disk-usage
journalctl --vacuum-size=500M
# Find deleted-but-held files (space used, not visible in du):
lsof +L1 | grep deleted
# Restart the process holding the file to release space
4. Inode exhaustion (df shows space available, but "No space left on device"):
find / -xdev -type d | while IFS= read -r d; do echo "$(ls -A "$d" 2>/dev/null | wc -l) $d"; done | sort -rn | head -20
# Usually: millions of tiny session/cache files in one directory
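The loop above starts one ls per directory, which is slow on a tree that already holds millions of files. A faster single-pass sketch follows, assuming GNU find (for -printf). The function name inode_hogs is hypothetical, and directory names containing newlines would skew the counts.

```shell
# Hypothetical helper: count directory entries per parent directory in one pass.
# Assumes GNU find (-printf). $1 = root to scan, $2 = number of rows to show.
inode_hogs() {
    local root="${1:-/}" top="${2:-20}"
    # %h prints the directory that contains each entry, one line per entry.
    find "$root" -xdev -mindepth 1 -printf '%h\n' 2>/dev/null |
        sort | uniq -c | sort -rn | head -n "$top"
}
```

Usage: inode_hogs /var 10 prints the ten directories under /var with the most entries.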
Zombie Processes¶
1. Identify zombies:
ps aux | awk '$8 ~ /^Z/ {print}' # STAT is often "Z+", "Zs" or "Zl", so match the prefix
# Or: ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
2. Understand what they are:
- A zombie is a process that finished but its parent hasn't read the exit status
- They consume no CPU/memory, just a PID table entry
- A few zombies are usually harmless
- Thousands of zombies indicate a buggy parent process
3. Find the parent:
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
# Note the PPID column - that's the parent to investigate
4. Fix:
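- Nudge the parent with SIGCHLD: kill -s CHLD <ppid> (signal 17 on x86 and ARM; the number differs on some architectures). This only helps if the parent has a working SIGCHLD handler
- If the parent is broken, restart it; its zombies are re-parented to PID 1 and reaped
- SIGKILL on the zombie itself does nothing (it's already dead)
- If the parent is PID 1 (init/systemd), zombies will be reaped eventually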
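When there are many zombies, grouping them by parent shows which process is leaking them. A minimal sketch follows; the function name zombies_by_parent is hypothetical, and it reads "pid ppid stat" rows on stdin so the parsing can be checked without live zombies.

```shell
# Hypothetical helper: summarise zombies per parent PID.
# Expects rows of "pid ppid stat" on stdin.
zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ }
         END { for (p in n) printf "%d zombie(s) under parent PID %s\n", n[p], p }'
}

# Live usage (the trailing "=" suppresses the ps header row):
#   ps -eo pid=,ppid=,stat= | zombies_by_parent
```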
Stuck/Hung Mount (NFS, CIFS)¶
1. Symptoms:
- ls or df hangs when accessing the mount point
- Processes accessing the mount go to "D" state (uninterruptible sleep)
- Even `umount` hangs
2. Diagnose:
mount | grep nfs # Check NFS mounts
nfsstat -m # NFS mount options and server
showmount -e <nfs-server> # What the server exports
3. Identify stuck processes:
lsof +f -- /mnt/nfs-share # What's using the mount?
fuser -vm /mnt/nfs-share # Processes using mount
4. Recovery:
# Lazy unmount (detaches immediately, cleans up when idle):
umount -l /mnt/nfs-share
# Force unmount (for NFS):
umount -f /mnt/nfs-share
# Nuclear option - kill all processes using the mount:
fuser -km /mnt/nfs-share
umount -l /mnt/nfs-share
5. Prevention:
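- Consider mounting NFS with soft,timeo=30,retrans=3 so I/O errors out instead of hanging (timeo is in tenths of a second, so 30 means 3 seconds). Soft mounts can lose or corrupt data on writes if the server stalls, so they suit read-mostly shares best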
- Use autofs for on-demand mounting
- Monitor NFS server availability
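A monitoring check must not hang on the very mount it is probing. One way to sketch that is to wrap the probe in a timeout, assuming GNU coreutils timeout and stat. The function name mount_responsive is hypothetical, and a task stuck in true uninterruptible sleep may not respond even to SIGKILL.

```shell
# Hypothetical helper: succeed only if the path answers a stat within N seconds.
# Assumes GNU coreutils (timeout, stat). $1 = path, $2 = timeout in seconds.
mount_responsive() {
    local path="$1" secs="${2:-5}"
    # SIGKILL is the signal most likely to interrupt a task blocked on NFS.
    timeout -s KILL "$secs" stat -t -- "$path" >/dev/null 2>&1
}
```

Usage: mount_responsive /mnt/nfs-share 3 || echo "NFS mount is not responding"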
Log Rotation Gone Wrong¶
1. Symptoms:
- Log files growing without bound
- Disk filling up from /var/log
- Application writing to rotated file (old file descriptor)
2. Check logrotate config:
cat /etc/logrotate.conf # Global config
ls /etc/logrotate.d/ # Per-app configs
cat /etc/logrotate.d/nginx # Specific app
3. Test manually:
logrotate -d /etc/logrotate.d/nginx # Dry run (debug mode)
logrotate -f /etc/logrotate.d/nginx # Force rotation now
4. Common problems:
- Missing "copytruncate" or "postrotate" signal:
# Without these, the app keeps writing to the old (now renamed) file
# Fix: add "copytruncate" for apps that don't handle SIGHUP
# Or: add postrotate script to send SIGHUP
- Wrong permissions on rotated files:
# Add: create 0640 appuser appgroup
- logrotate state file corrupted:
cat /var/lib/logrotate/status
# Remove the entry for the problematic log and re-run
5. Sample logrotate config:
/var/log/myapp/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
copytruncate
maxsize 500M
}
Kernel Panic / System Crash¶
1. After reboot, find what happened:
journalctl -b -1 # Previous boot logs
journalctl -k -b -1 # Kernel messages from previous boot
last -x | head -20 # Reboot/shutdown history
dmesg -T | tail -100 # Current boot kernel messages
2. Common causes:
- OOM killer:
journalctl -b -1 | grep -i "out of memory"
journalctl -b -1 | grep -i "oom-killer"
# Shows which process was killed and memory state
- Kernel bug/driver crash:
journalctl -b -1 | grep -i "panic\|BUG\|oops"
- Hardware failure:
journalctl -b -1 | grep -iE "hardware error|mce|edac"
3. OOM prevention:
# Check current memory pressure:
free -h
cat /proc/meminfo | grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree"
# Check which processes use the most memory:
ps aux --sort=-%mem | head -10
# Tune OOM killer priority:
echo -1000 > /proc/<pid>/oom_score_adj # Protect this process
echo 1000 > /proc/<pid>/oom_score_adj # Kill this one first
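Before tuning oom_score_adj it helps to see who the kernel would pick right now. The sketch below ranks processes by /proc/<pid>/oom_score. The function name oom_candidates is hypothetical, and the proc root is a parameter only so the parsing can be exercised against a fake tree.

```shell
# Hypothetical helper: rank processes by the kernel's current OOM score.
# $1 = proc root (defaults to /proc), $2 = number of rows to show.
oom_candidates() {
    local proc="${1:-/proc}" top="${2:-10}" f pid
    for f in "$proc"/[0-9]*/oom_score; do
        [ -r "$f" ] || continue
        pid="${f%/oom_score}"; pid="${pid##*/}"
        # A process can exit mid-scan, hence the fallback for its name.
        printf '%s %s %s\n' "$(cat "$f" 2>/dev/null)" "$pid" \
            "$(cat "$proc/$pid/comm" 2>/dev/null || echo '?')"
    done | sort -rn | head -n "$top"
}
```

Output columns are score, PID, and command name; the highest score is the most likely victim.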
4. Enable crash dumps for future analysis:
# kdump needs memory reserved with the crashkernel= kernel parameter (reboot after adding it)
systemctl enable kdump
systemctl start kdump
Service Won't Start¶
1. Get the full picture:
systemctl status myservice -l # Status + recent logs
journalctl -u myservice --no-pager # Full log history
2. Common failure reasons:
- Port already in use:
ss -tlnp | grep :8080
# Another process is listening. Find it, decide which wins.
- Permission denied:
# Check file ownership matches the User= in the unit file
# Check SELinux: ausearch -m avc -ts recent
# Check capabilities if using non-root user
- Missing dependency:
systemctl list-dependencies myservice
# A required service (database, network) isn't up yet
- Config syntax error:
# Most services have a config test mode:
nginx -t
httpd -t
named-checkconf
- ExecStart path wrong:
systemctl cat myservice # Show the actual unit file
which myapp # Verify the binary path
file /usr/local/bin/myapp # Verify the file type (right architecture, not a broken script)
ls -l /usr/local/bin/myapp # Verify the execute bit is set
3. SELinux (RHEL/CentOS):
getenforce # Is SELinux enforcing?
ausearch -m avc -ts recent # Recent denials
sealert -a /var/log/audit/audit.log # Human-readable
restorecon -Rv /var/www/html # Reset file contexts
Gotchas & War Stories¶
The deleted-but-open file trap
You rm a 50GB log file, df still shows the disk full. The process still has the file handle open. The space isn't freed until the process closes the file or restarts. Use lsof +L1 | grep deleted to find these. Truncate instead of delete: > /path/to/file.
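The trap is easy to reproduce safely. The demo below is a sketch for Linux only (it relies on /proc); the function name demo_deleted_open is hypothetical, and file descriptor 9 stands in for the long-running application.

```shell
# Self-contained demo: hold a file open, delete it, and watch the handle survive.
demo_deleted_open() {
    local f
    f=$(mktemp) || return 1
    echo "log data" > "$f"
    exec 9<"$f"                  # fd 9 plays the role of the application
    rm -- "$f"                   # the name is gone, the inode is not
    readlink /proc/self/fd/9     # old path with " (deleted)" appended
    cat <&9                      # contents are still readable through the fd
    exec 9<&-                    # closing the last handle frees the space
}
demo_deleted_open
```

The readlink line works because the child process inherits fd 9, which is the same marker lsof reports for a real application.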
The /etc/resolv.conf overwrite
NetworkManager, systemd-resolved, and DHCP clients all fight over /etc/resolv.conf. You edit it manually, then it gets overwritten on the next DHCP renewal. Fix: configure DNS through the appropriate manager (nmcli, netplan, or resolved.conf).
sudo vs su
sudo su - and sudo -i both give you a root login shell with root's environment; sudo command runs just that one command as root. Avoid sudo su: use sudo -i when you need a root shell, and per-command sudo otherwise, because each command is then logged for auditing.
The fork bomb
:(){ :|:& };: will bring a system to its knees in seconds. Protect against it by capping processes per user: set nproc limits in /etc/security/limits.conf and check the effective value with ulimit -u. Every user should have a max process limit.
Timezone pain
A server in UTC logs events at one timestamp, an application in local time logs another. Always run servers in UTC. timedatectl set-timezone UTC.
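The mismatch is easy to see by rendering one instant in two zones. This is a small sketch, assuming GNU date (for -d @N). JST-9 is a POSIX TZ string chosen only for illustration; it describes a zone nine hours ahead of UTC.

```shell
# Render the same instant (Unix epoch 0) as two different hosts would log it.
epoch=0
utc_stamp=$(TZ=UTC date -d "@$epoch" '+%Y-%m-%d %H:%M:%S %z')
local_stamp=$(TZ=JST-9 date -d "@$epoch" '+%Y-%m-%d %H:%M:%S %z')
echo "UTC server log: $utc_stamp"    # 1970-01-01 00:00:00 +0000
echo "Local app log:  $local_stamp"  # 1970-01-01 09:00:00 +0900
```

Both lines describe the same moment; without the offset column they look nine hours apart.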
Essential Troubleshooting Commands¶
# System overview
uptime # Load average, uptime
free -h # Memory usage
vmstat 1 5 # CPU, memory, I/O snapshot
iostat -xz 1 5 # Disk I/O per device
sar -u 1 5 # CPU utilization over time
# Process investigation
strace -p <pid> -f -e trace=network # System calls (network)
strace -p <pid> -c # Syscall summary
lsof -p <pid> # Open files/sockets
pmap <pid> # Memory map
# File and search
find / -name "*.conf" -mtime -1 # Config files changed in last day
find / -perm -4000 -type f # SUID files
grep -r "error" /var/log/ --include="*.log" -l # Files containing errors
# User activity
w # Who is logged in and what they're doing
last -20 # Recent logins
lastb -20 # Failed login attempts
ausearch -m USER_LOGIN -ts today # Audit log (if auditd running)
Network Diagnostics¶
Task: Is the Interface Up and Configured?¶
# Brief view — fastest way to check link and IP state
$ ip -br link show
lo UP 00:00:00:00:00:00 <LOOPBACK,UP>
eth0 UP aa:bb:cc:dd:ee:ff <BROADCAST,MULTICAST,UP>
eth1 DOWN 11:22:33:44:55:66 <BROADCAST,MULTICAST>
$ ip -br addr show
lo UP 127.0.0.1/8 ::1/128
eth0 UP 10.0.0.5/24
eth1 DOWN
# eth1 is down — bring it up
$ ip link set eth1 up
Task: Find What Is Listening on a Port¶
# Something is already using port 8080
$ ss -tlnp | grep 8080
LISTEN 0 128 *:8080 *:* users:(("java",pid=12345,fd=42))
# It is a java process with PID 12345
$ ps -p 12345 -o pid,user,cmd
PID USER CMD
12345 appuser /usr/bin/java -jar myapp.jar
Task: Test Connectivity to a Remote Port¶
# Is the remote service reachable?
$ nc -zv 10.0.0.20 5432
Connection to 10.0.0.20 5432 port [tcp/postgresql] succeeded!
# Timeout quickly if not reachable
$ nc -zv -w 3 10.0.0.20 3306
nc: connect to 10.0.0.20 port 3306 (tcp) timed out: Operation now in progress
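On minimal hosts nc is often not installed. A fallback sketch follows, assuming bash (the /dev/tcp pseudo-device is a bash feature that some builds disable) and GNU coreutils timeout; the function name port_open is hypothetical.

```shell
# Hypothetical helper: exit 0 if a TCP connection to host:port succeeds in time.
# $1 = host, $2 = port, $3 = timeout in seconds.
port_open() {
    local host="$1" port="$2" secs="${3:-3}"
    # The inner bash receives host and port as $0 and $1.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Usage: port_open 10.0.0.20 5432 && echo reachable || echo "unreachable or timed out"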
Task: Capture Traffic to Debug Application Issues¶
# See all traffic to/from a specific host on port 443
$ tcpdump -i eth0 -nn host 10.0.0.20 and port 443 -c 20
14:23:01.001 IP 10.0.0.5.48230 > 10.0.0.20.443: Flags [S], seq 1234
14:23:01.002 IP 10.0.0.20.443 > 10.0.0.5.48230: Flags [S.], seq 5678, ack 1235
14:23:01.002 IP 10.0.0.5.48230 > 10.0.0.20.443: Flags [.], ack 5679
# SYN, SYN-ACK, ACK — three-way handshake completes. Connection works.
# Capture to file for Wireshark analysis
$ tcpdump -i eth0 -w /tmp/capture.pcap -c 5000 host 10.0.0.20
# Look for RST packets (connection resets)
$ tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-rst != 0' and host 10.0.0.20
Task: Trace the Network Path¶
# Continuous traceroute with packet loss stats
$ mtr -rw -c 50 8.8.8.8
HOST Loss% Snt Avg Best Wrst StDev
1. 10.0.0.1 0.0% 50 0.5 0.3 1.2 0.2
2. 172.16.0.1 0.0% 50 2.1 1.8 4.5 0.4
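3. ??? 100.0% 50 0.0 0.0 0.0 0.0
4. 72.14.215.65 2.0% 50 12.3 11.5 18.7 1.1
# Hop 3 does not answer probes (ICMP filtered or rate-limited). That is not real loss, because later hops still respond. Only loss that persists through to the final hop matters.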
# TCP traceroute — bypasses firewalls that block ICMP/UDP
$ traceroute -T -p 443 -n 8.8.8.8
Task: Check Which Route the Kernel Uses¶
# Where does traffic to 10.100.5.3 go?
$ ip route get 10.100.5.3
10.100.5.3 via 10.0.0.1 dev eth0 src 10.0.0.5 uid 0
# Show all routes
$ ip route show
default via 10.0.0.1 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
Task: Diagnose NIC Errors¶
# Check for physical layer problems
$ ethtool eth0 | grep -E "Speed|Duplex|Link"
Speed: 10000Mb/s
Duplex: Full
Link detected: yes
# Check error counters
$ ethtool -S eth0 | grep -iE "error|drop|crc"
rx_errors: 0
tx_errors: 0
rx_dropped: 847
rx_crc_errors: 0
# rx_dropped > 0 — kernel dropping packets. Check ring buffer:
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
Current hardware settings:
RX: 256
# Ring buffer is small — increase it
$ ethtool -G eth0 rx 4096
Task: DNS Troubleshooting¶
# Quick resolution check
$ dig +short example.com
93.184.216.34
# Full query with timing
$ dig example.com
;; Query time: 12 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
# Query a specific DNS server
$ dig @8.8.8.8 example.com A
# Reverse lookup
$ dig -x 93.184.216.34
# Check if DNS traffic is flowing
$ tcpdump -i eth0 -nn port 53 -c 5
Task: Bandwidth Test Between Two Hosts¶
# On the server side
$ iperf3 -s
# On the client side
$ iperf3 -c 10.0.0.20 -t 10
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 11.2 GBytes 9.62 Gbits/sec
Task: Find Connections in Bad States¶
# Too many TIME_WAIT connections — port exhaustion risk
$ ss -tn state time-wait | wc -l
14823
# CLOSE_WAIT accumulating — app not closing connections
$ ss -tnp state close-wait
CLOSE-WAIT 1 0 10.0.0.5:45678 10.0.0.20:5432 users:(("python",pid=9876,fd=12))
# Socket summary
$ ss -s
Total: 1284
TCP: 982 (estab 456, closed 234, orphaned 12, timewait 287)
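To spot a bad state at a glance, tally sockets per state in one pass. The sketch below reads ss -tan style output on stdin; the function name tcp_state_counts is hypothetical.

```shell
# Hypothetical helper: count TCP sockets per state, busiest state first.
# Expects `ss -tan` output on stdin; the first line is the header and is skipped.
tcp_state_counts() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Live usage:
#   ss -tan | tcp_state_counts
```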
Task: Debug HTTP with curl¶
# Verbose connection + TLS details
$ curl -vvv https://api.example.com/health 2>&1 | head -30
# Just timing and status code
$ curl -o /dev/null -s -w 'HTTP %{http_code} in %{time_total}s\n' \
https://api.example.com/health
HTTP 200 in 0.234s
# Resolve to a specific IP (bypass DNS)
$ curl --resolve api.example.com:443:10.0.0.50 https://api.example.com/health
Power One-Liners¶
Pretty-print mounted filesystems¶
mount | column -t
or the modern version:
findmnt
[!TIP] When to use: Quick filesystem overview during storage triage.
Create temporary RAM disk¶
Breakdown: tmpfs lives entirely in RAM (+ swap if needed). Lightning fast I/O. Data lost on unmount/reboot. Size can use m, g suffixes.
[!TIP] When to use: Speeding up builds, test suites, or temporary processing of large datasets.
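A minimal sketch of the tmpfs mount described above. The function name make_ramdisk, the default mount point /mnt/ramdisk, the 512m default size, and the DRY_RUN switch are all illustrative assumptions; the real mount needs root.

```shell
# Hypothetical helper: mount a tmpfs RAM disk.
# $1 = mount point, $2 = size (accepts m and g suffixes).
# Set DRY_RUN=1 to print the mount command instead of running it.
make_ramdisk() {
    local mountpoint="${1:-/mnt/ramdisk}" size="${2:-512m}"
    mkdir -p "$mountpoint" &&
        ${DRY_RUN:+echo} mount -t tmpfs -o "size=$size" tmpfs "$mountpoint"
}
```

Usage (as root): make_ramdisk /mnt/ramdisk 2g, then umount /mnt/ramdisk when finished.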
Schedule a one-off command (no cron needed)¶
echo "systemctl restart nginx" | at 02:00
at now + 30 minutes <<< "echo 'reminder: check the deploy' | mail -s 'Deploy check' ops@example.com"
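Breakdown: at is the forgotten cousin of cron: it schedules a one-time execution. Supports natural time specs: midnight, noon, now + 2 hours, teatime (4pm). The atd daemon must be running, and any job output is mailed to the submitting user. List pending jobs with atq and cancel one with atrm <job-id>.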
[!TIP] When to use: Scheduling a restart during maintenance window, delayed cleanup tasks, reminders.
Follow log with full navigation¶
Breakdown: less +F <logfile> starts less in "follow mode" (like tail -f). Press Ctrl-C to stop following and use full less navigation (search, scroll). Press F to resume following. Best of both worlds.
[!TIP] When to use: Tailing logs when you also need to search backward through them.
Emergency graceful reboot (frozen system)¶
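Breakdown: Magic SysRq key combo. Hold Alt+SysRq and press R, E, I, S, U, B in order, pausing a few seconds between keys: Raw keyboard, End all processes (SIGTERM), kIll all (SIGKILL), Sync disks, Unmount (remount read-only), reBoot. Mnemonic: "Reboot Even If System Utterly Broken." SysRq must be enabled for this to work (check with sysctl kernel.sysrq).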
[!TIP] When to use: System completely frozen, no SSH, no shell. Last resort before hard power cycle.
Find the 20 largest files on disk¶
Breakdown: -xdev stays on one filesystem (don't cross into /proc, /sys). -print0/-0 handles spaces in names. sort -rh sorts human-readable sizes in reverse.
[!TIP] When to use: Emergency disk space triage: / is 95% full and you need to find what's eating it.
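A sketch of a pipeline matching the breakdown above, assuming GNU find, xargs, du and sort. The function name largest_files is hypothetical.

```shell
# Hypothetical helper: list the N largest files under a root, largest first.
# -xdev stays on one filesystem; -print0 / -0 survive spaces in file names.
largest_files() {
    local root="${1:-/}" top="${2:-20}"
    find "$root" -xdev -type f -print0 2>/dev/null |
        xargs -0 -r du -h 2>/dev/null |
        sort -rh | head -n "$top"
}
```

Usage: largest_files / 20 as root, or largest_files /var 20 to narrow the search.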
Find files modified in the last hour¶
[!TIP] When to use: Investigating what changed after a deployment or incident.
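One way to express this with find, as a sketch. The function name recently_modified and its defaults are hypothetical; -mmin counts minutes since last modification.

```shell
# Hypothetical helper: list regular files modified within the last N minutes.
# $1 = directory to search, $2 = window in minutes.
recently_modified() {
    local root="${1:-.}" minutes="${2:-60}"
    find "$root" -xdev -type f -mmin "-$minutes" 2>/dev/null
}
```

Usage: recently_modified /etc 60 shows config files touched in the last hour.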
See Also¶
- Linux Ops — Storage — disk, filesystem, LVM, and mount operations
- Linux Performance — CPU, memory, I/O, and network bottleneck diagnosis
Quick Reference¶
- Cheatsheet: Linux-Ops
- Deep Dive: Linux Boot Sequence