Networking Troubleshooting Footguns¶
Mistakes that misdiagnose network problems, extend outages, or create new failures while fixing old ones.
1. Assuming "the network is down" without defining source and destination¶
Someone reports "the network is down." You start checking switch ports, router configs, and firewall rules. Two hours later you discover the problem is DNS resolution from one specific host. You wasted time because you never asked: down from where, to where, on what port, using what protocol?
Why people do it: "Network is down" triggers infrastructure instinct. The urgency pushes you to start fixing before you finish diagnosing.
Fix: Before touching anything, establish: source IP, destination IP, destination port, protocol. Run ping, traceroute, and curl from the affected host. Most "network is down" reports are DNS, firewall, or a single service being unavailable -- not a network outage.
Remember: The troubleshooting checklist mnemonic is "SPUD": Source, Protocol, (destination) URL/IP, Direction. Get these four facts before touching any infrastructure. In post-incident reviews, the single most common finding for extended network outages is "the engineer started fixing before finishing diagnosing."
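The SPUD checklist can be sketched as a first-response script. All values below (app-01, db.internal, port 5432) are hypothetical placeholders standing in for the details of an actual report:

```shell
# SPUD: the four facts to collect before touching any infrastructure.
# Every value here is a placeholder from a hypothetical ticket.
SRC_HOST="app-01"         # Source: down from where?
PROTO="tcp"               # Protocol: TCP, UDP, ICMP?
DST_HOST="db.internal"    # URL/IP: to where?
DST_PORT=5432             # port: on what port?

msg="Checking ${PROTO}://${DST_HOST}:${DST_PORT} from ${SRC_HOST}"
echo "$msg"

# Then run these FROM the source host, not from wherever you are logged in:
#   ping -c 3 "$DST_HOST"                                    # L3 reachability (may be ICMP-filtered)
#   traceroute "$DST_HOST"                                   # path taken
#   curl -v --max-time 5 "telnet://${DST_HOST}:${DST_PORT}"  # the actual port
```

Writing the four facts down before running anything keeps the diagnosis scoped to the reported flow instead of the whole network.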
2. Testing connectivity from the wrong host¶
The application server cannot reach the database. You SSH into the jump host and run telnet db-server 5432. It works. You close the ticket: "Network is fine." But the app server has different routing, different firewall rules, different DNS resolution, and different source IP. The problem was never visible from the jump host.
Why people do it: You test from whatever machine you are logged into. The jump host has all the tools installed. The app server is "just a container" with no debugging tools.
Fix: Always test from the affected source host. If it lacks tools, install them temporarily or use built-in alternatives: bash -c 'echo > /dev/tcp/db-server/5432' works without netcat. In containers, kubectl exec or kubectl debug with a tools image.
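The built-in fallback looks like this; `db-server` and `5432` are placeholders for the real destination:

```shell
# Port probe with no extra tools: bash's /dev/tcp pseudo-device opens a
# TCP connection when you redirect to it. Exit status tells you the result.
# db-server:5432 is a placeholder target from the scenario above.
if timeout 3 bash -c 'echo > /dev/tcp/db-server/5432' 2>/dev/null; then
    result="open"
else
    result="closed or filtered"
fi
echo "port 5432 on db-server: $result"
```

This runs on any host with bash, which covers most minimal containers that lack netcat, curl, and telnet.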
3. Flushing iptables rules to "fix" connectivity¶
Traffic is being blocked. You cannot figure out which rule is dropping it. You run iptables -F to flush all rules. Connectivity works. You celebrate. Five minutes later, every service on the host is exposed to the internet. You just removed all firewall rules including the ones protecting SSH, databases, and internal APIs.
Why people do it: iptables rules are complex and layered. Flushing is the fastest way to determine if the firewall is the problem. People intend to restore the rules but forget or do not know how.
Fix: To test if the firewall is the cause, insert a specific ACCEPT rule at the top for the traffic you need: iptables -I INPUT -s <source> -p tcp --dport <port> -j ACCEPT. If that fixes it, you have identified the problem without removing other protections. Save rules before any change: iptables-save > /tmp/iptables-backup.
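A minimal sketch of the safe sequence, assuming a hypothetical blocked flow from 10.0.0.5 to local port 5432 (requires root; substitute the real addresses):

```shell
# 1. Snapshot the current ruleset so every change is reversible.
iptables-save > "/tmp/iptables-backup-$(date +%s)"

# 2. Insert ONE specific ACCEPT at the top instead of flushing everything.
#    10.0.0.5 / 5432 are placeholders for the blocked traffic.
iptables -I INPUT 1 -s 10.0.0.5 -p tcp --dport 5432 -j ACCEPT

# 3. If traffic now flows, some rule below was dropping it. Find it:
iptables -L INPUT -n -v --line-numbers   # watch the DROP/REJECT packet counters

# 4. Remove the test rule (it is rule #1) once you have your answer:
iptables -D INPUT 1

# Full rollback if anything goes wrong:
#   iptables-restore < /tmp/iptables-backup-<timestamp>
```

The key property: at no point is the host left without its other protections.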
4. Not checking MTU when diagnosing intermittent failures¶
Small requests work. Large file transfers fail or stall. SSH works but SCP hangs. HTTPS handshakes succeed but page loads fail. The path has an MTU mismatch -- one hop has a 1500-byte MTU, another has 1400 (common with VPNs, tunnels, VXLAN). Large packets are silently dropped because "Don't Fragment" is set.
Why people do it: MTU problems are rare on flat networks. But any encapsulation (VPN, VXLAN, GRE, Docker overlay) reduces effective MTU. The symptoms mimic application bugs.
Fix: Test with ping -M do -s 1472 <destination> (1472-byte payload + 28 bytes of IP and ICMP headers = 1500). If it fails, reduce the size until it succeeds; that reveals the path MTU. Fix by lowering the MTU on the interface, or by ensuring PMTUD works (ICMP "fragmentation needed" messages must not be filtered anywhere on the path). For tunnel/overlay networks, set the interface MTU to 1400 or lower.
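A step-down probe sketch; 10.0.0.5 is a placeholder destination, and the ping line is left commented so the loop can be read and adapted before running it against a live host:

```shell
# Step the probe size down until a "Don't Fragment" ping succeeds.
# 28 = 20-byte IPv4 header + 8-byte ICMP header.
# 10.0.0.5 is a placeholder; uncomment the ping line to probe it live.
target="10.0.0.5"
for mtu in 1500 1472 1450 1400 1350; do
    payload=$((mtu - 28))
    echo "would probe path MTU $mtu with: ping -c 1 -W 1 -M do -s $payload $target"
    # ping -c 1 -W 1 -M do -s "$payload" "$target" >/dev/null 2>&1 && break
done
```

The first size that succeeds is (approximately) the path MTU; 1472 succeeding means the full 1500-byte path is clean.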
5. Changing DNS resolvers during an active incident¶
Services are failing DNS resolution. You change /etc/resolv.conf to point to 8.8.8.8 to "fix DNS." Internal service names that rely on your private DNS (db.internal, api.service.consul) stop resolving. External DNS works but internal service discovery is broken. You just expanded a partial outage into a full one.
Why people do it: Public DNS (8.8.8.8, 1.1.1.1) feels like a reliable fallback. It is -- for public names. Private DNS zones are invisible to public resolvers.
Fix: If DNS is failing, check why before changing resolvers: dig @<current-resolver> example.com. Is the resolver down? Is it a specific zone? Add a fallback resolver rather than replacing: put the public resolver second in /etc/resolv.conf. Never remove the private resolver entirely.
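A diagnosis-first sketch; 10.0.0.2 stands in for your private resolver and db.internal for an internal name:

```shell
# Diagnose the resolver before replacing it.
# 10.0.0.2 and db.internal are placeholders.
dig @10.0.0.2 example.com +time=2 +tries=1   # is the resolver answering public names?
dig @10.0.0.2 db.internal +time=2 +tries=1   # or is only one zone failing?

# If you do add a fallback, append it AFTER the private resolver:
#   nameserver 10.0.0.2    # internal -- must stay first
#   nameserver 1.1.1.1     # public fallback, resolves external names only
```

If the private resolver answers public names but not the internal zone, the problem is upstream of the resolver, and swapping in 8.8.8.8 would have fixed nothing while breaking service discovery.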
6. Using tcpdump without a capture filter and filling the disk¶
You start tcpdump -w /tmp/capture.pcap to debug a network issue. You forget the capture filter. On a busy server doing 10Gbps, the pcap file grows at hundreds of MB per second. /tmp fills up. The system becomes unstable. You are now debugging two problems.
Why people do it: "Capture everything, filter later" is the textbook approach. On a low-traffic dev box it works. On a production server it is a denial-of-service attack on yourself.
Fix: Always use a capture filter and a packet limit: tcpdump -i eth0 -c 10000 -w /tmp/capture.pcap 'host 10.0.0.5 and port 443'. Options go before the filter expression; the -c flag caps the packet count. Monitor the file size. Use ring buffers for long captures: tcpdump -i eth0 -C 100 -W 5 -w /tmp/capture.pcap 'host 10.0.0.5 and port 443' (rotate through 5 files of roughly 100MB each).
7. Forgetting that ss and netstat show local state, not remote¶
ss -tlnp shows your service listening on port 8080. You conclude "the service is up and reachable." But ss only shows local socket state. The service may be listening but behind a firewall, NATted to a different port, or on a loopback address (127.0.0.1) that is unreachable from other hosts.
Why people do it: Seeing LISTEN on the expected port feels definitive. The distinction between "bound to a socket" and "reachable from the network" is subtle.
Fix: After confirming local listen state with ss, verify remote reachability from the client: curl, nc, or telnet from the source host. Check the listen address -- 0.0.0.0:8080 is reachable from anywhere, 127.0.0.1:8080 is localhost only. Check firewall rules for the port.
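The listen-address check can be mechanized; `classify` is an illustrative helper, not a standard tool, and it takes the Local Address column from `ss -tln` output:

```shell
# "Listening" is not "reachable". Classify the Local Address column
# of `ss -tln`. classify() is a helper written for illustration.
classify() {
    case "$1" in
        127.*|\[::1\]*)           echo "loopback only -- remote clients can never connect" ;;
        0.0.0.0:*|\[::\]:*|\*:*)  echo "all interfaces -- reachable, firewall permitting" ;;
        *)                        echo "one specific interface address only" ;;
    esac
}

a=$(classify "127.0.0.1:8080")
b=$(classify "0.0.0.0:8080")
echo "$a"
echo "$b"
```

Even "all interfaces" only means the socket can accept remote connections; firewall rules and NAT still need checking from the client side.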
8. Adding static routes without persistence¶
You add a route to fix a connectivity issue: ip route add 10.20.0.0/16 via 192.168.1.1. It works. You close the ticket. The server reboots next week. The route is gone. The connectivity issue returns. Nobody connects the reboot to the route because the fix was weeks ago.
Why people do it: ip route add is the quick fix. Persisting routes varies by distro (NetworkManager, netplan, /etc/sysconfig/network-scripts/, systemd-networkd) and people do not remember which mechanism their system uses.
Fix: After adding a route with ip route, immediately persist it. On RHEL/CentOS: /etc/sysconfig/network-scripts/route-<interface>. On Ubuntu/netplan: netplan config. On systemd-networkd: .network file. Verify persistence by checking the config file, not just the running state.
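A sketch of "apply and persist in the same change", using the example route from above (needs root; file paths vary by distro and interface name):

```shell
# Apply the route now...
ip route add 10.20.0.0/16 via 192.168.1.1

# ...then persist it in the SAME change. Mechanism depends on the distro:
#
# systemd-networkd -- in the interface's .network file
# (e.g. /etc/systemd/network/10-eth0.network):
#   [Route]
#   Destination=10.20.0.0/16
#   Gateway=192.168.1.1
#
# netplan (Ubuntu) -- under the interface in /etc/netplan/*.yaml:
#   routes:
#     - to: 10.20.0.0/16
#       via: 192.168.1.1
#
# RHEL/CentOS legacy network-scripts -- /etc/sysconfig/network-scripts/route-eth0:
#   10.20.0.0/16 via 192.168.1.1

# Verify by reading the config file back, not just `ip route show`.
```

Reading the config file back is the verification step: running state proves nothing about what survives a reboot.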
9. Running traceroute and misinterpreting * * * hops¶
Traceroute shows * * * at hop 5. You conclude hop 5 is down or unreachable. Actually, the router at hop 5 is configured to not respond to ICMP TTL-exceeded (rate limiting, security policy). Traffic passes through it fine. The next hop responds normally. You escalate to the network team about a "dead router" that is working perfectly.
Why people do it: * * * looks like a failure. The visual gap in the traceroute output suggests a problem at that hop.
Fix: * * * means "no response to the probe," not "no connectivity." Look at the hops after the stars. If subsequent hops respond, the starred hop is just filtering ICMP. Use mtr for better analysis (shows packet loss per hop). Try TCP traceroute (traceroute -T -p 443) as some routers respond to TCP but not ICMP.
10. Disabling IPv6 as a troubleshooting step and leaving it disabled¶
Something is slow. You read online that "disabling IPv6 fixes it." You add net.ipv6.conf.all.disable_ipv6=1 to sysctl. The immediate problem goes away (it was probably DNS trying AAAA records first and timing out). You leave IPv6 disabled. Months later, applications that need IPv6 (dual-stack services, container networking, modern cloud APIs) break silently.
Why people do it: IPv6 is the scapegoat for DNS resolution delays (happy eyeballs algorithm) and misconfigured dual-stack setups. Disabling it is a one-line "fix" that removes the variable.
Fix: If IPv6 is causing DNS delays, fix the DNS configuration -- disable AAAA queries in the resolver or configure happy eyeballs properly. Do not disable the protocol stack. If you must disable IPv6 temporarily, set a reminder to re-enable it and document why it was disabled.
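A sketch for checking and undoing the sysctl, plus the resolver-side alternative (requires root; `options single-request` serializes the glibc resolver's A/AAAA queries):

```shell
# Is IPv6 currently disabled on this host?
sysctl net.ipv6.conf.all.disable_ipv6       # 1 = disabled

# Re-enable it (and remove the line from /etc/sysctl.conf or
# /etc/sysctl.d/ so the setting does not come back at boot):
sysctl -w net.ipv6.conf.all.disable_ipv6=0

# If the real problem was slow dual-stack DNS, serialize the A/AAAA
# lookups in the resolver instead of killing the protocol stack:
#   echo 'options single-request' >> /etc/resolv.conf
```

This keeps the fix at the layer where the problem actually lives.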
Tool-Specific Footguns¶
11. Ping works, so the service must be reachable¶
ICMP and TCP are different protocols handled by different firewall rules. A host can respond to ping while dropping all TCP traffic on port 443. Conversely, many production hosts block ICMP entirely -- ping fails, but the service is fine.
What happens: You ping a host, get replies, and tell the developer "the network is fine." They escalate because their app still cannot connect. You look foolish for 20 minutes.
Fix: Always test the actual protocol and port. nc -zv host port or curl -v for HTTP. Ping is an L3 sanity check, not a service reachability test.
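The port-level checks look like this, with `host` a placeholder for the real destination:

```shell
# An ICMP reply proves L3 reachability only. Test the real protocol and port:
nc -zv -w 3 host 443                   # TCP connect test
curl -sv --max-time 5 https://host/    # full TLS + HTTP, if that is the service

# UDP needs an application-level check; `nc -zu` alone is unreliable
# because closed UDP ports often silently drop rather than send
# an ICMP port-unreachable back.
```
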
12. tcpdump capturing on the wrong interface¶
On a multi-homed host or a container host, traffic may flow through docker0, br-xxxx, veth1234, lo, or eth1 -- not the eth0 you habitually capture on. You see nothing and conclude "no traffic is arriving."
What happens: You run tcpdump -i eth0 for ten minutes, see no relevant packets, and declare the sender is not sending. Meanwhile the traffic is flowing through br-abcdef into a container.
Fix: Start with tcpdump -i any to confirm traffic exists on the host, then narrow to the correct interface. Use ip route get <dest> to determine which interface handles traffic for a specific destination.
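A sketch of the two-step approach; 10.0.0.5 and br-abcdef are placeholders:

```shell
# Step 1: ask the kernel which interface carries traffic to this destination.
# 10.0.0.5 is a placeholder address.
ip route get 10.0.0.5

# Step 2: capture wide before narrowing:
#   tcpdump -ni any host 10.0.0.5        # does the traffic exist on the host at all?
#   tcpdump -ni br-abcdef host 10.0.0.5  # then watch the specific bridge/veth
```

`ip route get` answers the interface question authoritatively, from the same lookup the kernel performs for real traffic.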
13. Running nmap scans without authorization¶
nmap generates traffic patterns that intrusion detection systems are specifically trained to flag. A SYN scan against a production subnet triggers alerts, fills security team inboxes, and may get your IP blocked by automated response systems.
Fix: Only scan hosts you have explicit permission to scan. For checking a single port, use nc -zv instead. If you need a broad scan, coordinate with the security team first and document the window.
14. dig not using the same resolver as the system¶
dig queries DNS servers directly via the DNS protocol. Your application uses the system resolver via glibc's getaddrinfo(), which reads /etc/nsswitch.conf, checks /etc/hosts first, and may route through systemd-resolved (127.0.0.53). These can return different answers.
Fix: When debugging application DNS failures, also check:

```shell
getent hosts api.example.com      # same resolution path as the application (nsswitch + /etc/hosts)
grep api /etc/hosts               # check for overrides
resolvectl query api.example.com  # ask systemd-resolved specifically
dig @127.0.0.53 api.example.com   # query the systemd-resolved stub directly
```
15. MTU testing with wrong overhead calculation¶
When testing path MTU with ping -s, the -s flag sets the ICMP payload size. The total packet size is payload + 8 (ICMP header) + 20 (IP header) = payload + 28. For a 1500 MTU path, the maximum -s value is 1472, not 1500.
Fix: Always subtract 28 from the target MTU: ping -c 1 -s 1472 -M do target for 1500 MTU. For IPv6, the overhead is 48 bytes (40 IPv6 header + 8 ICMP6 header), so use -s 1452.
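The arithmetic, spelled out (`target` is a placeholder host):

```shell
# Header overhead for path-MTU ping tests:
#   IPv4: 20 (IP header)   + 8 (ICMP header)   = 28 bytes
#   IPv6: 40 (IPv6 header) + 8 (ICMPv6 header) = 48 bytes
mtu=1500
v4_payload=$((mtu - 28))
v6_payload=$((mtu - 48))
echo "ping    -c 1 -M do -s $v4_payload target    # IPv4, tests MTU $mtu"
echo "ping -6 -c 1 -M do -s $v6_payload target    # IPv6, tests MTU $mtu"
```

Getting this wrong by even a few bytes turns a clean 1500-MTU path into an apparent failure.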
16. Interpreting traceroute asterisks as packet loss¶
Asterisks in traceroute output mean the probe got no ICMP Time Exceeded response from that hop. Many routers rate-limit or suppress ICMP TTL exceeded messages as a matter of policy. This is not packet loss -- the data-plane traffic flows through the router just fine.
Fix: Use mtr with enough probes (100+). If loss appears at an intermediate hop but does not carry through to subsequent hops, it is ICMP rate limiting. Real loss at hop N will show equal or greater loss at hops N+1, N+2, etc.
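The rule of thumb can be mechanized. The loss figures below are illustrative sample data, not a real mtr run:

```shell
# Loss at hop N is "real" only if it persists at later hops.
# Input: one "hop loss%" pair per line (sample data, not a real trace).
out=$(printf '%s\n' "1 0" "2 0" "3 40" "4 0" "5 0" |
awk '{loss[NR] = $2} END {
    for (i = 1; i < NR; i++)
        if (loss[i] > 0 && loss[i+1] < loss[i])
            print "hop " i ": likely ICMP rate limiting, not real loss"
}')
echo "$out"
```

In the sample, 40% loss at hop 3 vanishes at hop 4, so hop 3 is just deprioritizing ICMP responses; loss that carried through to hops 4 and 5 would indicate a real problem.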
17. iperf3 testing through NAT gives misleading results¶
iperf3 requires a direct connection between client and server on port 5201 (default). When tested through NAT, source NAT, or a load balancer, the results reflect the NAT device's forwarding capacity and connection tracking overhead, not the end-to-end link bandwidth.
Fix: Run iperf3 between endpoints in the same broadcast domain as the link you are measuring. If you must test through NAT, understand that the result includes NAT overhead.
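A sketch of a clean measurement, with 10.0.0.5 a placeholder for the far endpoint (the commands are shown as comments since they require two live hosts):

```shell
# Measure the link, not the NAT box. On one endpoint, start the server:
#   iperf3 -s
# On the other endpoint -- same segment as the link under test -- run:
#   iperf3 -c 10.0.0.5 -t 10 -P 4    # 10 seconds, 4 parallel streams
# Test the reverse direction too; asymmetric paths are common:
#   iperf3 -c 10.0.0.5 -t 10 -R
```
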
18. Using ifconfig and netstat instead of ip and ss¶
ifconfig does not show secondary IP addresses added with ip addr add. It does not show all routing tables. netstat is slower, less accurate under load, and not installed by default on modern systems.
Fix: Use ip addr, ip route, ip link, ip neigh, and ss. The iproute2 suite has been the standard since the 2.6 kernel era.
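A quick translation table from the legacy tools:

```shell
# iproute2 equivalents for the legacy net-tools commands:
ip addr show             # replaces ifconfig -a (also shows secondary addresses)
ip link show             # interface state and MTU
ip route show table all  # replaces netstat -r (covers every routing table)
ip neigh show            # replaces arp -a
ss -tunap                # replaces netstat -tunap (reads netlink, not /proc)
```
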
19. Not specifying the DNS server in dig¶
When you run dig example.com without @server, dig uses the first nameserver in /etc/resolv.conf. If that file is managed by DHCP, NetworkManager, or systemd-resolved, the server may not be what you expect.
Fix: Always specify the server when debugging DNS:

```shell
dig @10.0.0.2 example.com   # test the specific resolver in question
dig @8.8.8.8 example.com    # compare against a known-good public resolver
```

Check /etc/resolv.conf first on every host you investigate, so you know which resolver a bare dig would actually use.