Portal | Level: L0: Entry | Topics: Linux Fundamentals, Bash / Shell Scripting | Domain: Linux

Linux Operations Drills

Remember: Five essential Linux diagnostic commands: top (live CPU/memory), df -h (disk space), free -h (memory), ss -tlnp (listening ports), journalctl -p err (logged errors). Mnemonic: "TDFJ-S": Top, Disk-free, Free, Journalctl, Socket-stats. Together they give a fast first answer to "what is wrong" in most Linux incidents.

Drill 1: Find Top CPU Consumers

Difficulty: Easy

Q: Find the top 5 processes consuming the most CPU.

Answer
ps aux --sort=-%cpu | head -6

# Or with top (batch mode, single snapshot):
top -bn1 | head -12

# More detailed with pidstat:
pidstat -u 1 3    # 3 samples, 1s interval

# Live monitoring:
htop              # Interactive (filter with F4)

Drill 2: Investigate Disk Space

Difficulty: Easy

Q: A server is reporting disk full. Find which directories are consuming the most space under /var.

Answer
# Quick overview
df -h

# Find biggest directories under /var
du -sh /var/* | sort -rh | head -10

# Go deeper into the largest directory
du -sh /var/log/* | sort -rh | head -10

# Find files larger than 100MB
find /var -type f -size +100M -exec ls -lh {} \;

# Check for deleted files still held open
lsof +L1 | grep deleted
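# These consume space but don't appear in du

Key insight: `lsof +L1` lists open files with a link count below 1, that is, files deleted from disk but still held open by a process. This is a common cause of "disk full but du doesn't add up." The space is released only when the holding process closes the file or exits.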
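
To see the effect in isolation, the following sketch reproduces it with a scratch file. It assumes a Linux host with /proc mounted; the function name `demo_deleted_open` and the use of file descriptor 9 are illustrative choices, not part of the drill.

```shell
# Reproduce "deleted but still open" with a throwaway file (assumes Linux /proc).
demo_deleted_open() {
  local tmp
  tmp=$(mktemp) || return 1
  exec 9>"$tmp"                 # hold the file open on fd 9 (arbitrary choice)
  rm -f -- "$tmp"               # unlink it; the inode survives while fd 9 is open
  readlink /proc/self/fd/9      # readlink inherits fd 9 and prints its target
  exec 9>&-                     # closing the descriptor is what frees the space
}

# Usage: demo_deleted_open
```

The printed path should end in ` (deleted)`, which is the same marker that `lsof +L1 | grep deleted` matches on.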

Drill 3: Systemd Service Management

Difficulty: Easy

Q: A service keeps crashing. Check its status, read its logs, and configure it to restart automatically.

Answer
# Check status
systemctl status myapp

# Read recent logs
journalctl -u myapp --since "30 min ago"
journalctl -u myapp -f    # Follow live

# Check for restart settings in the unit file
systemctl cat myapp

# If no auto-restart, create an override:
systemctl edit myapp
# Add:
# [Service]
# Restart=on-failure
# RestartSec=5

systemctl daemon-reload   # Only required if you wrote the drop-in by hand; `systemctl edit` reloads for you
systemctl restart myapp
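
Where an interactive editor is not available (provisioning scripts, for example), the same override can be written as a drop-in file. The sketch below assumes the conventional drop-in location `/etc/systemd/system/<unit>.service.d/override.conf`; the helper name `write_restart_override` and its optional second argument (an alternate root directory, useful for dry runs) are hypothetical.

```shell
# Write a Restart=on-failure drop-in for a service without opening an editor.
# $1 = unit name without ".service"; $2 = root dir (assumed default below).
write_restart_override() {
  local unit="$1" root="${2:-/etc/systemd/system}"
  local dir="$root/$unit.service.d"
  mkdir -p "$dir" || return 1
  printf '[Service]\nRestart=on-failure\nRestartSec=5\n' > "$dir/override.conf" || return 1
  printf '%s\n' "$dir/override.conf"   # report where the drop-in landed
}

# Usage (as root):
#   write_restart_override myapp && systemctl daemon-reload && systemctl restart myapp
```

A drop-in written this way is not picked up until `systemctl daemon-reload` runs, which is the case where the reload step matters.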

Drill 4: Memory Pressure Investigation

Difficulty: Medium

Q: Applications are being OOM-killed. Investigate memory usage and identify the culprit.

Answer
# Check for OOM kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"

# Current memory overview
free -h

# Per-process memory usage (sorted)
ps aux --sort=-%mem | head -10

# Detailed memory breakdown
head -20 /proc/meminfo

# Check swap usage
swapon --show
vmstat 1 5    # Watch si/so columns for swap activity

# Find specific process memory (pgrep -o picks the oldest matching PID)
pmap -x "$(pgrep -o java)" | tail -5

Key indicators:
- `Buffers/Cached` being reclaimed = normal
- Swap usage climbing = memory pressure
- OOM score: `cat /proc/PID/oom_score` (higher = more likely to be killed)
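
Reading `oom_score` one PID at a time is tedious, so a small loop over /proc can rank every process at once. This is a minimal sketch: the function name `top_oom_scores` and the optional second argument (an alternate /proc root) are hypothetical, and it assumes each process directory also exposes a `comm` file with the command name.

```shell
# Rank processes by OOM score, highest (most likely to be killed) first.
# $1 = how many rows to print (default 5); $2 = proc root (default /proc).
top_oom_scores() {
  local n="${1:-5}" root="${2:-/proc}" dir pid score name
  for dir in "$root"/[0-9]*; do
    [ -r "$dir/oom_score" ] || continue          # skip vanished or unreadable PIDs
    pid=${dir##*/}
    read -r score < "$dir/oom_score" || continue
    name=$(cat "$dir/comm" 2>/dev/null) || name="?"
    printf '%s %s %s\n' "$score" "$pid" "${name:-?}"
  done | sort -rn | head -n "$n"
}

# Usage: top_oom_scores 10
```

Each output row is `score pid command`, so the first line names the process the kernel would most likely pick next.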

Drill 5: Network Connectivity Debugging

Difficulty: Medium

Q: A server can't reach an external API on port 443. Systematically debug the issue.

Answer
# 1. DNS resolution
dig api.example.com +short
# Or: nslookup api.example.com

# 2. Can we reach the IP?
ping -c 3 $(dig +short api.example.com | head -1)

# 3. Can we reach the port?
nc -zv -w 5 api.example.com 443
# Or: curl -v --connect-timeout 5 https://api.example.com

# 4. Check local firewall
iptables -L -n | grep 443
# Or: nft list ruleset | grep 443

# 5. Check routing
ip route get $(dig +short api.example.com | head -1)
traceroute api.example.com

# 6. Check if a proxy is needed
env | grep -i proxy
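Order: DNS → Reachability (ping) → Port → Firewall → Routing → Proxy. ICMP is often filtered, so a failed ping alone does not prove the host is unreachable.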
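
On minimal hosts where neither `nc` nor `curl` is installed, the port check in step 3 can be approximated with bash's built-in `/dev/tcp` redirection. This is a sketch under assumptions: it needs a bash built with network redirections and the coreutils `timeout` command, and the function name `check_port` is hypothetical.

```shell
# TCP reachability check using only bash and coreutils.
# $1 = host, $2 = port, $3 = timeout in seconds (default 5).
check_port() {
  local host="$1" port="$2" secs="${3:-5}"
  # The inner bash exits non-zero if the connect fails; timeout covers silent drops.
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed-or-filtered"
    return 1
  fi
}

# Usage: check_port api.example.com 443
```

A fast "closed-or-filtered" usually means the connection was refused, while one that takes the full timeout suggests packets are being dropped by a firewall along the path.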

Drill 6: File Permissions Troubleshooting

Difficulty: Medium

Q: A web server returns 403 Forbidden. The config looks correct. What do you check?

Answer
# 1. Check file permissions
ls -la /var/www/html/index.html
# Web server user (www-data/nginx) needs read access

# 2. Check directory permissions (traverse requires +x)
namei -l /var/www/html/index.html
# Every directory in the path needs execute (x) permission

# 3. Check file ownership
stat /var/www/html/index.html

# 4. Fix permissions
chmod 644 /var/www/html/index.html     # File: rw-r--r--
chmod 755 /var/www/html/                # Dir: rwxr-xr-x

# 5. Check SELinux (if enabled)
getenforce
ls -Z /var/www/html/index.html
restorecon -Rv /var/www/html/

# 6. Check AppArmor (if enabled)
aa-status
Common cause: a parent directory (e.g., `/var/www`) lacks execute permission for the web server user.
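
The parent-directory check can be automated by walking up from the file and flagging every directory that "other" users cannot traverse. This is a rough sketch, assuming GNU `stat` and assuming the web server user is neither the owner nor in the group of those directories (so the "other" bits apply); the function name `untraversable_dirs` is hypothetical.

```shell
# Print each ancestor directory of $1 that lacks execute permission for "other".
untraversable_dirs() {
  local dir mode
  dir=$(dirname -- "$1")
  while :; do
    mode=$(stat -L -c '%A' -- "$dir") || return 2   # e.g. drwxr-x---
    case "$mode" in
      d????????[xt]*) ;;                            # 10th char x or t: traversable
      *) printf '%s %s\n' "$mode" "$dir" ;;
    esac
    case "$dir" in /|.) break ;; esac               # stop at root (or a relative path)
    dir=$(dirname -- "$dir")
  done
}

# Usage: untraversable_dirs /var/www/html/index.html
```

Empty output means every directory on the path is traversable by "other"; it does not cover ACLs, SELinux, or AppArmor, which need the separate checks listed above.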

Drill 7: Cron Job Debugging

Difficulty: Medium

Q: A cron job isn't running. How do you debug it?

Answer
# 1. Is cron running?
systemctl status cron   # or crond

# 2. Check the crontab
crontab -l              # Current user
crontab -l -u www-data  # Specific user

# 3. Check cron logs
grep CRON /var/log/syslog | tail -20     # Debian/Ubuntu; RHEL-family systems log to /var/log/cron
# Or: journalctl -u cron --since "1 hour ago"

# 4. Common issues:
# - Wrong PATH (cron has minimal PATH)
#   Fix: add PATH=/usr/local/bin:/usr/bin:/bin at top of crontab
# - Missing executable permission
# - Script uses relative paths (cron runs from $HOME)
# - Environment variables not set

# 5. Test the command manually:
env -i PATH=/usr/local/bin:/usr/bin:/bin HOME=/root /opt/backup.sh
# env -i simulates cron's minimal environment
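
When the job does run but fails silently, wrapping it so that output and exit status land in a log file often reveals the cause. The wrapper below is one possible sketch; the name `cron_wrap` and the log path in the usage line are hypothetical.

```shell
# Run a command, appending its output, a timestamp, and its exit status to a log.
# $1 = log file; remaining arguments = the command to run.
cron_wrap() {
  local log="$1"; shift
  local status=0
  {
    printf '=== %s start: %s\n' "$(date '+%F %T')" "$*"
    "$@" || status=$?                      # remember a failure without aborting
    printf '=== exit status: %s\n' "$status"
  } >>"$log" 2>&1
  return "$status"
}

# Usage inside a script called from crontab:
#   cron_wrap /var/log/backup-cron.log /opt/backup.sh
```

Because stderr is captured as well, errors such as "command not found" from a too-short PATH show up in the log instead of being mailed or discarded.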

Drill 8: SSH Troubleshooting

Difficulty: Medium

Q: You can't SSH into a server. Walk through the debug process.

Answer
# 1. Verbose SSH connection
ssh -vvv user@host

# 2. Check from the server side (if you have console/BMC access)
systemctl status sshd           # Unit is named ssh on Debian/Ubuntu
ss -tlnp | grep ':22 '          # Is sshd listening? (':22 ' avoids matching PIDs or other ports)
journalctl -u sshd --since "5 min ago"

# 3. Common causes:
# - Firewall blocking port 22
iptables -L -n | grep 22
# - Wrong permissions on ~/.ssh
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chmod 600 ~/.ssh/id_ed25519
# - sshd_config denying user
grep -E "AllowUsers|DenyUsers|AllowGroups" /etc/ssh/sshd_config
# - Host key changed (MITM warning): verify the new key out of band first
ssh-keygen -R hostname           # Then remove the stale known_hosts entry

Drill 9: Log Analysis with Command Line

Difficulty: Hard

Q: Find the top 10 IPs making requests to nginx, show requests per second for the last hour, and find all 5xx errors.

Answer
# Top 10 IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

# Requests per second (last hour), busiest seconds first
# Note: timestamps are compared as strings, so this is only reliable within a single day
awk -v start="$(date -d '1 hour ago' '+%d/%b/%Y:%H:%M:%S')" \
  '$4 >= "["start {print $4}' /var/log/nginx/access.log | \
  uniq -c | sort -rn | head -10

# All 5xx errors
awk '$9 ~ /^5/' /var/log/nginx/access.log | tail -20

# 5xx count by status code
awk '$9 ~ /^5/ {count[$9]++} END {for (s in count) print s, count[s]}' \
  /var/log/nginx/access.log

# 5xx errors by URL
awk '$9 ~ /^5/ {count[$7]++} END {for (u in count) print count[u], u}' \
  /var/log/nginx/access.log | sort -rn | head -10
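
The per-second counting can also be done in a single awk pass, which avoids relying on the log being strictly ordered. This sketch assumes the default combined log format, where field 4 looks like `[10/Oct/2023:13:55:36`; the function name `peak_rps` is hypothetical.

```shell
# Print the busiest seconds in an access log as "count timestamp".
# $1 = log file; $2 = number of rows (default 10).
peak_rps() {
  local file="$1" n="${2:-10}"
  awk '{ ts = substr($4, 2); hits[ts]++ }          # drop the leading "["
       END { for (t in hits) print hits[t], t }' "$file" |
    sort -k1,1nr -k2,2 | head -n "$n"
}

# Usage: peak_rps /var/log/nginx/access.log 10
```

Unlike the time-window command above, this counts the whole file, so rotate or pre-filter the log if only the last hour matters.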

Drill 10: Performance Triage (USE Method)

Difficulty: Hard

Q: A server is "slow." Run through a systematic performance investigation.

Answer
# 1. Load average (CPU saturation)
uptime
# load > num_cpus = saturation

# 2. CPU
mpstat -P ALL 1 3        # Per-CPU utilization
# High %iowait = disk bottleneck, not CPU

# 3. Memory
free -h                   # Available memory
vmstat 1 5                # si/so = swap in/out (should be 0)

# 4. Disk I/O
iostat -xz 1 3            # %util, await, avgqu-sz (r_await/w_await and aqu-sz in newer sysstat)
# %util > 80% = saturated
# await > 10ms (SSD) or > 20ms (HDD) = slow

# 5. Network
sar -n DEV 1 3            # Interface throughput
ss -s                     # Connection counts

# 6. Per-process
pidstat -u -d -r 1 3      # CPU, disk, memory per process

# 7. Recent errors
dmesg -T | tail -20       # Kernel messages
journalctl -p err -b      # Errors since boot
Flow: Load → CPU → Memory → Disk → Network → Process → Errors
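
The first step, comparing load average with the CPU count, is easy to script so the result is unambiguous. A minimal sketch, assuming Linux's `/proc/loadavg` and the coreutils `nproc` command; the function name `load_status` and its two optional arguments (an alternate loadavg file and an explicit CPU count) are hypothetical.

```shell
# Compare the 1-minute load average against the number of CPUs.
# $1 = loadavg file (default /proc/loadavg); $2 = CPU count (default: nproc).
load_status() {
  local file="${1:-/proc/loadavg}" cpus="${2:-$(nproc)}" load1 rest
  read -r load1 rest < "$file" || return 1      # first field is the 1-minute load
  awk -v l="$load1" -v c="$cpus" 'BEGIN {
    printf "load1=%s cpus=%d ratio=%.2f %s\n", l, c, l / c, (l > c ? "SATURATED" : "ok")
  }'
}

# Usage: load_status
```

On Linux the load average also counts tasks in uninterruptible I/O wait, so a "SATURATED" result should be cross-checked against the %iowait and disk steps before blaming the CPU.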
