Linux Ops Footguns
[!WARNING] These mistakes take down servers, lose data, or leave you locked out at 3am. Every item here has caused a real outage.
1. rm -rf / or rm -rf $VAR/ when VAR is empty
[!CAUTION] This is the single most destructive command pattern in Linux. An unset variable turns a cleanup script into a wipe-everything script.
You write rm -rf $CLEANUP_DIR/ in a script. The variable is unset. Bash expands it to rm -rf /. Goodbye, everything.
Fix: Use set -u in scripts so that an unset variable is an error. Quote variables and guard the expansion: rm -rf "${CLEANUP_DIR:?}". The :? aborts the command if the variable is unset or empty, which set -u alone does not catch.
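A minimal sketch of both guards together, run against a throwaway directory. The helper name safe_cleanup and the demo paths are made up for illustration; adapt the checks to your own scripts.

```shell
#!/usr/bin/env bash
set -u  # referencing an unset variable is now an error

# safe_cleanup is a hypothetical helper, not a standard tool.
# It refuses an empty path or "/" and removes only the directory's contents.
safe_cleanup() {
    local dir="${1:-}"
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "refusing to clean up '${dir}'" >&2
        return 1
    fi
    # ${dir:?} aborts the expansion if dir is somehow empty at this point.
    rm -rf -- "${dir:?}"/*
}

# Demo on a temporary directory.
demo_dir="$(mktemp -d)"
touch "$demo_dir/a.tmp" "$demo_dir/b.tmp"
safe_cleanup "$demo_dir"
safe_cleanup "" || echo "empty path rejected"
```

The glob skips dotfiles, so hidden files survive this version. That is a deliberate trade-off in the sketch, not a recommendation.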
2. chmod -R 777 to "fix" permissions
Something doesn't work. You chmod -R 777 /var/www. Now every file is world-writable. Do the same to a home directory and sshd, with its default StrictModes setting, refuses keys because the home directory, .ssh, or authorized_keys is writable by group or others. Your web server serves .env files to anyone.
Fix: Understand what permissions are actually needed. Web files: 644 for files, 755 for directories. SSH private keys: 600, and 700 for the .ssh directory. Never use 777.
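One way to apply the 644/755 split is to let find set files and directories separately. This is a sketch: fix_web_perms is a hypothetical helper, and the demo runs on a temporary directory rather than /var/www. It assumes GNU stat for the final check.

```shell
# fix_web_perms is a hypothetical helper; pass the document root as $1.
fix_web_perms() {
    local root="${1:?usage: fix_web_perms <dir>}"
    find "$root" -type d -exec chmod 755 {} +   # directories: rwxr-xr-x
    find "$root" -type f -exec chmod 644 {} +   # regular files: rw-r--r--
}

# Demo: break a temporary tree with 777, then repair it.
demo_root="$(mktemp -d)"
mkdir "$demo_root/css"
touch "$demo_root/index.html" "$demo_root/css/site.css"
chmod -R 777 "$demo_root"
fix_web_perms "$demo_root"
stat -c '%a %n' "$demo_root/index.html" "$demo_root/css"
```

Files that must stay executable (CGI scripts, for example) need handling of their own; this sketch treats every regular file the same.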
3. Editing /etc/fstab without testing
You add a new mount to /etc/fstab, reboot, and the server doesn't come back. The entry has a typo and systemd is waiting for a mount that will never succeed. The boot stalls or drops to emergency mode, so SSH never starts and you need console access to recover.
Fix: After editing fstab, run mount -a to test before rebooting. Use the nofail option for non-critical mounts so the system boots even if the mount fails.
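To spot entries that would block boot, you can scan fstab for mounts that lack nofail. The sketch below assumes the standard six-column fstab layout; list_mounts_without_nofail is a hypothetical helper, and the sample file with its UUIDs is invented for the demo.

```shell
# list_mounts_without_nofail is a hypothetical helper. It prints the mount
# point of every fstab entry whose options lack "nofail", skipping "/" and swap.
list_mounts_without_nofail() {
    local fstab="${1:-/etc/fstab}"
    awk '
        /^[[:space:]]*(#|$)/ { next }        # skip comments and blank lines
        $2 == "/" || $3 == "swap" { next }   # root and swap are always needed
        {
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++) if (opts[i] == "nofail") found = 1
            if (!found) print $2
        }
    ' "$fstab"
}

# Demo against a sample file rather than the live /etc/fstab.
sample_fstab="$(mktemp)"
cat > "$sample_fstab" <<'EOF'
# <device>       <mount>   <type> <options>        <dump> <pass>
UUID=aaaa-1111   /         ext4   defaults         0      1
UUID=bbbb-2222   /data     ext4   defaults,nofail  0      2
UUID=cccc-3333   /backup   xfs    defaults         0      2
EOF
list_mounts_without_nofail "$sample_fstab"
```

This only flags candidates for review. Running mount -a on the host is still the real test that the entries work.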
4. kill -9 as the first option
Your process is stuck. You kill -9 it. The process had a database connection, file locks, and a temp file it was going to clean up. Now you have a corrupted PID file, a locked database table, and orphaned temp files.
Fix: Start with kill (SIGTERM, signal 15). Wait. Check. Only escalate to kill -9 (SIGKILL) if the process won't respond to SIGTERM.
Under the hood: SIGTERM allows the process to run its signal handler: close database connections, flush buffers, remove PID files, release locks. SIGKILL is enforced by the kernel and cannot be caught, so the process never gets a chance to clean up. There is also SIGQUIT (signal 3), which by default makes a program dump core and exit. A JVM handles it differently, which is useful for debugging a stuck Java process: kill -3 <pid> triggers a thread dump to stdout and the JVM keeps running.
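The escalation can be scripted so nobody reaches for -9 out of habit. A minimal sketch, assuming a grace period in whole seconds; graceful_stop is a hypothetical helper, not a standard command.

```shell
# graceful_stop is a hypothetical helper: SIGTERM first, SIGKILL only after
# the grace period (default 10 seconds) has passed.
graceful_stop() {
    local pid="$1" grace="${2:-10}" waited=0
    kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
    while kill -0 "$pid" 2>/dev/null; do        # kill -0 only tests existence
        if [ "$waited" -ge "$grace" ]; then
            echo "pid $pid ignored SIGTERM for ${grace}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Demo: a background sleep exits on SIGTERM, so SIGKILL should not be needed.
sleep 300 &
demo_pid=$!
graceful_stop "$demo_pid" 5
```

For services managed by systemd, systemctl stop already follows the same pattern, so this sketch is mainly for ad hoc processes.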
5. Running out of inodes with disk space remaining
df -h shows 40% free. But you can't create files. You have millions of tiny files (session files, cache entries) and ran out of inodes. df -i would have shown it.
Fix: Monitor inode usage, not just disk space. Clean up small file accumulations. Consider a filesystem with dynamic inodes (XFS, btrfs) for workloads that create many small files.
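Once df -i shows a filesystem running low, the next question is which directory holds the files. A rough sketch, assuming GNU find for -printf; files_per_dir is a hypothetical helper, and it counts regular files only, so it approximates inode use rather than measuring it exactly.

```shell
# files_per_dir is a hypothetical helper. It counts regular files per directory
# under a starting point, stays on one filesystem (-xdev), and lists the
# busiest directories first.
files_per_dir() {
    local start="${1:-.}" limit="${2:-10}"
    find "$start" -xdev -type f -printf '%h\n' 2>/dev/null \
        | sort | uniq -c | sort -rn | head -n "$limit"
}

# Demo on a temporary tree with an obvious hot spot.
demo_tree="$(mktemp -d)"
mkdir "$demo_tree/sessions" "$demo_tree/cache"
touch "$demo_tree/sessions/s1" "$demo_tree/sessions/s2" "$demo_tree/sessions/s3"
touch "$demo_tree/cache/c1"
files_per_dir "$demo_tree" 5
```

On a real host, start at the mount point that df -i flagged and expect the scan to take a while on trees with millions of files.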
6. echo > /dev/sda instead of /dev/sdb
You're writing a disk image or wiping a drive. You type the wrong device name. You just overwrote your boot disk.
Fix: Triple-check device names with lsblk. Use labels or UUIDs in scripts, never raw device paths. For disk operations, disconnect drives you don't want to touch.
7. Forgetting that deleted files still consume space if held open
You delete a 50GB log file with rm. df -h still shows the disk full. A process still has the file open, so the space isn't freed.
Fix: Find the holder with lsof +L1 | grep deleted. Restart that process, or have it reopen its logs, so it closes the file. Next time, truncate instead of deleting: > /var/log/bigfile.log empties the file while the process keeps its file handle.
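The effect is easy to reproduce in a shell, which also shows how to release the space when restarting the process is not an option. This is a Linux-only sketch that reads /proc; deleted_open_fds is a hypothetical helper, and fd 9 is an arbitrary choice for the demo.

```shell
# deleted_open_fds is a hypothetical helper. It lists the file descriptors of
# one process that still point at a deleted file, using /proc (Linux only).
deleted_open_fds() {
    local pid="$1" fd target
    for fd in /proc/"$pid"/fd/*; do
        target="$(readlink "$fd" 2>/dev/null)" || continue
        case "$target" in
            *' (deleted)') echo "${fd##*/} $target" ;;
        esac
    done
}

# Demo: this shell keeps fd 9 open on a file, then deletes the file.
held_log="$(mktemp)"
exec 9>"$held_log"
echo "some log data" >&9
rm -f "$held_log"
found="$(deleted_open_fds "$$")"   # fd 9 is still listed: space is not freed
echo "$found"
: > /proc/$$/fd/9                  # truncate through the descriptor
exec 9>&-                          # closing the descriptor releases the inode
```

Truncating through /proc/<pid>/fd/<n> frees the blocks immediately, but the process keeps writing to a file nobody can see, so treat it as a stopgap.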
8. iptables -F over SSH
[!CAUTION] Flushing firewall rules over SSH can permanently lock you out if the default policy is DROP.
You flush all iptables rules remotely. Your default policy is DROP. You just locked yourself out. The flush removed the rule that allowed SSH.
Fix: Before flushing, set default policy to ACCEPT: iptables -P INPUT ACCEPT. Or use at to schedule a rule restore: echo "iptables-restore < /etc/iptables/rules.v4" | at now + 5 min.
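The ordering matters more than the commands themselves, so it helps to wrap them in a function that prints before it acts. A sketch only: run and safe_flush are hypothetical helpers, and DRY_RUN is hardcoded to 1 so that executing the block prints the commands instead of touching the firewall.

```shell
DRY_RUN=1   # change to 0 only on a host where you also have console access

# run prints the command in dry-run mode and executes it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# safe_flush is a hypothetical helper: open the default policies first,
# then flush, so an existing SSH session survives the flush.
safe_flush() {
    run iptables -P INPUT ACCEPT
    run iptables -P FORWARD ACCEPT
    run iptables -P OUTPUT ACCEPT
    run iptables -F
}

safe_flush
```

After a real run the host accepts everything, so reload a known-good ruleset promptly. The same steps apply to ip6tables if you filter IPv6.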
9. Cron jobs without PATH
Your cron job works when you run it manually but fails silently from cron. Cron has a minimal PATH (/usr/bin:/bin). Your script calls kubectl, helm, or aws which live in /usr/local/bin.
Fix: Set PATH=/usr/local/bin:/usr/bin:/bin at the top of your crontab. Use full paths in scripts. Log cron output: * * * * * /opt/script.sh >> /var/log/script.log 2>&1.
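Inside the script itself, you can pin PATH and fail loudly when a tool is missing, instead of letting cron swallow a "command not found". A minimal sketch; require_cmds is a hypothetical helper, and the commands it checks here are placeholders for whatever your job needs.

```shell
#!/usr/bin/env bash
# Set PATH explicitly so the script behaves the same under cron and in a shell.
export PATH=/usr/local/bin:/usr/bin:/bin

# require_cmds is a hypothetical helper: report and fail if a tool is missing.
require_cmds() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$(date '+%F %T') missing command: $cmd" >&2
            return 1
        fi
    done
}

# Replace "sh date" with the tools your job calls, for example kubectl or aws.
require_cmds sh date && echo "all required commands found"
```

Combined with the >> logfile 2>&1 redirection in the crontab entry, a missing binary then shows up in the log with a timestamp.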
10. Upgrading the kernel without a rollback plan
You run apt upgrade and it installs a new kernel. You reboot. The new kernel lacks a working driver for your NIC or your RAID controller. No network, no disk access.
Fix: Keep the previous kernel installed. Know how to boot into the old kernel from GRUB. Test kernel upgrades on non-critical servers first. Use dnf versionlock or apt-mark hold to prevent accidental kernel upgrades.
11. nohup and backgrounding confusion
You SSH in, start a long process, close your laptop. The process dies because it got SIGHUP when your session ended. Or you use nohup but forget & so it runs in the foreground and you can't do anything else.
Fix: Use tmux or screen for long-running tasks. Or nohup command & with output redirection. For services, use systemd.
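For the nohup route, the redirections are the part people forget. A small sketch, with a one-second sh command standing in for the real long-running job and a temporary file standing in for its log.

```shell
nohup_log="$(mktemp)"

# Redirect stdin, stdout and stderr so nothing stays attached to the terminal,
# and end with & so the shell gets its prompt back.
nohup sh -c 'sleep 1; echo "long job finished"' > "$nohup_log" 2>&1 < /dev/null &
job_pid=$!
echo "started job $job_pid, output in $nohup_log"

wait "$job_pid"     # only for this demo; normally you would log out here
cat "$nohup_log"
```

Record the PID and the log path somewhere you can find them after reconnecting; with tmux or screen you would reattach to the session instead.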
12. Swap masking OOM problems
Your server has 32GB of swap. An app has a memory leak. Instead of crashing (which would alert you), it slowly consumes swap. The entire server becomes unresponsive because everything is paging to disk. By the time you notice, the server is unreachable.
Fix: Set vm.swappiness=10 or lower so the kernel swaps less eagerly. Monitor swap usage and alert when it exceeds a threshold. For containers/K8s nodes, consider disabling swap entirely (by default the kubelet has traditionally refused to start while swap is enabled).
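A swap alert can be as small as a few lines reading /proc/meminfo. This is a sketch under assumptions: swap_used_pct is a hypothetical helper, and the 20% threshold is an arbitrary starting point rather than a recommendation.

```shell
# swap_used_pct is a hypothetical helper. It reads a meminfo-format file
# (default /proc/meminfo) and prints the percentage of swap in use.
swap_used_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else print 0          # no swap configured
        }
    ' "$meminfo"
}

SWAP_ALERT_PCT=20   # assumed threshold, tune per host
used="$(swap_used_pct)"
if [ "$used" -ge "$SWAP_ALERT_PCT" ]; then
    echo "WARNING: swap usage at ${used}%" >&2
fi
```

In practice you would feed the number to your monitoring system rather than print it, and alert on the trend as well as the absolute value.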
Networking Tool Footguns
13. Using ifconfig instead of ip and missing interfaces
ifconfig without -a only shows interfaces that are UP. A down interface will not appear in the output. You conclude the interface does not exist when it is just administratively down.
Fix: Use ip link show or ip -br link show. It shows all interfaces regardless of state. ifconfig is deprecated and missing on many modern distros.
14. Running tcpdump without -nn and waiting forever for DNS
Without -n, tcpdump reverse-resolves every IP address via DNS and also maps port numbers to service names. If DNS is slow or broken (which is often why you are running tcpdump), output stalls for seconds per packet.
Fix: Always use tcpdump -nn to disable name resolution. Add -i <interface> to avoid capturing on the wrong one.
15. Interpreting intermediate hop loss in mtr as real packet loss
mtr shows 40% loss at hop 3 but 0% at the final destination. You report a problem at hop 3 to the network team. They find nothing wrong because the router at hop 3 is simply rate-limiting ICMP responses.
Fix: Only loss that persists through to the final hop indicates real packet loss. Intermediate hop loss with no loss at the destination is almost always ICMP rate limiting or deprioritization on that router.
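If you script around mtr, check the last hop rather than the worst hop. The sketch below assumes mtr --report style lines where the third field is Loss%; final_hop_loss is a hypothetical helper, and the sample report is hand-written for the demo, not a real measurement.

```shell
# final_hop_loss is a hypothetical helper. It reads mtr --report style text
# on stdin and prints the Loss% value of the last hop, without the % sign.
# Assumes hop lines look like " 3.|-- host  40.0%  10  ..." (field 3 = Loss%).
final_hop_loss() {
    awk '/\|--/ { loss = $3 } END { sub(/%/, "", loss); print loss }'
}

# Sample report text written by hand for this demo.
final_hop_loss <<'EOF'
  1.|-- 192.0.2.1        0.0%    10    0.5   0.6   0.4   0.9   0.1
  2.|-- 198.51.100.7    40.0%    10    9.8  10.1   9.5  11.0   0.4
  3.|-- 203.0.113.20     0.0%    10   12.1  12.3  11.9  13.0   0.3
EOF
```

Here hop 2 shows 40% loss but the destination shows none, which matches the rate-limiting case described above.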
16. Using netstat on a busy host and waiting minutes
netstat reads /proc/net/tcp sequentially and does DNS lookups by default. On a host with 50,000 connections, this takes minutes and produces massive output.
Fix: Use ss -tn instead. It queries the kernel over netlink instead of parsing /proc text files and is typically far faster on busy hosts. Add -p for process info, -l for listening only.
17. Running nmap scans against production without authorization
You want to check which ports are open on a production host. You run a full nmap scan. The IDS flags it as an attack, security is alerted, and you spend the afternoon explaining yourself.
Fix: For your own hosts, use ss -tlnp locally. For remote port checks, use nc -zv host port. Reserve nmap for authorized security assessments.
18. Forgetting -c on tcpdump and filling the disk
You start tcpdump -w /tmp/capture.pcap and leave it running. On a busy interface, this fills /tmp (or the root filesystem) within hours.
Fix: Always limit captures: tcpdump -c 10000 -w /tmp/capture.pcap (stop after a packet count) or tcpdump -C 100 -W 5 -w /tmp/capture.pcap (5 rotating files of roughly 100 MB each). Monitor disk while capturing.
19. Testing with ping and concluding the service is down
Ping uses ICMP. Many hosts and firewalls block ICMP. Ping fails, but HTTPS on port 443 works fine.
Fix: Test at the application layer. Use curl, nc -zv host port, or openssl s_client for TLS services. Ping only tests ICMP reachability, not service availability.
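When nc is not installed, bash can open a TCP connection on its own. This sketch swaps nc -zv for bash's /dev/tcp redirection, so it needs bash and the coreutils timeout command; check_tcp is a hypothetical helper, and port 22 on localhost is only an example target.

```shell
# check_tcp is a hypothetical helper. It tries to open a TCP connection with
# bash's /dev/tcp redirection and gives up after a timeout (default 3 seconds).
check_tcp() {
    local host="$1" port="$2" wait_s="${3:-3}"
    # Inside bash -c, $0 is the host and $1 is the port.
    timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

if check_tcp 127.0.0.1 22; then
    echo "port 22 accepts TCP connections"
else
    echo "port 22 is closed, filtered, or timed out"
fi
```

A successful connect only proves the port is open. For HTTP or TLS services, follow up with curl or openssl s_client to confirm the application answers.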
20. Not specifying the source interface for multi-homed hosts
You run traceroute 8.8.8.8 on a host with two interfaces and two default routes. The kernel picks the route based on metric, but you expected traffic to go out the other interface.
Fix: Use traceroute -s <source-ip> or traceroute -i <interface> to force the source. Check ip route get <dest> first to see which route the kernel will use.
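21. Confusing ss state filters
You run ss -t state LISTEN and do not get the listening sockets you expected, because the state filter takes the name listening, not the LISTEN abbreviation shown in the State column. Or you filter for established but spell it wrong and lose the results you were looking for.
Fix: Use ss -tln for listening TCP, ss -tn for non-listening TCP connections. The state filter uses the names from the ss man page, such as established, time-wait, close-wait, and listening, which differ from the abbreviations in the output (ESTAB, LISTEN). Test without filters first to confirm data exists.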
22. Running iperf3 without understanding what it measures
You run iperf3 between two hosts and get 9.5 Gbps. You report that the network can handle 9.5 Gbps of application traffic. But iperf3 measures raw TCP throughput — your application adds TLS overhead, protocol framing, and connection setup costs.
Fix: Use iperf3 to verify link capacity, not application throughput. For realistic benchmarks, test with your actual application or a representative workload.