Networking Troubleshooting: Trivia & Interesting Facts

Surprising, historical, and little-known facts about diagnosing network problems.

The most common network problem is DNS, and the second most common is also DNS

There is a famous joke in operations: "It's always DNS." This is not entirely a joke — DNS resolution failures account for an outsized proportion of reported network outages because virtually every network operation starts with a DNS lookup. A misconfigured /etc/resolv.conf, an unreachable DNS server, or a stale DNS cache can make an otherwise healthy network appear completely down.
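
As a rough illustration of checking DNS first, the sketch below lists the resolvers configured in resolv.conf-style text and tests whether a name resolves at all. The function names `parse_nameservers` and `resolves` are made up for this example, and the parser assumes the simple `nameserver <address>` line format.

```python
import socket


def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":  # blank line or comment
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def resolves(hostname):
    """Return True if the system resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    with open("/etc/resolv.conf") as fh:
        print("configured resolvers:", parse_nameservers(fh.read()))
    print("example.com resolves:", resolves("example.com"))
```

If `resolves` returns False while connecting to a known IP address still works, the fault is likely in name resolution rather than in the network path.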

traceroute uses a clever TTL exploit that was never intended by the protocol designers

traceroute works by sending packets with incrementally increasing TTL values (1, 2, 3...). Each router along the path decrements the TTL; when it hits zero, the router discards the packet and sends back an ICMP Time Exceeded message. By collecting these ICMP responses, traceroute builds a map of the path. The TTL field was designed for loop prevention, not for path discovery — traceroute was a creative hack by Van Jacobson in 1987.
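
The TTL mechanism described above can be sketched as a small simulation. This is not a real traceroute (which needs raw sockets and usually elevated privileges); the path is a hypothetical list of hop names, and the final "port-unreachable" reply models the classic UDP-probe variant.

```python
def forward(path, ttl):
    """Send one probe with the given TTL along path.

    path is a list of hop names ending with the destination host.
    Returns (responder, icmp_message).
    """
    for router in path[:-1]:
        ttl -= 1  # each router decrements the TTL
        if ttl == 0:
            # TTL expired here: the router discards the probe and reports back
            return router, "time-exceeded"
    # The probe survived every router and reached the destination
    return path[-1], "port-unreachable"


def traceroute(path, max_hops=30):
    """Discover the hops on path by raising the TTL one step at a time."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, message = forward(path, ttl)
        hops.append((ttl, responder))
        if message != "time-exceeded":
            break  # the destination answered, so the path is complete
    return hops


if __name__ == "__main__":
    for ttl, hop in traceroute(["gw.local", "isp-edge", "core-1", "server"]):
        print(ttl, hop)
```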

ICMP "unreachable" messages quote the IP header and first 64 bits of the offending packet

When a router sends an ICMP Destination Unreachable message, it includes the IP header plus at least the first 8 bytes (64 bits) of the payload of the original packet. This is enough to identify the source/destination ports (for TCP/UDP), which is how tools like traceroute can match responses to specific probes. It's also why some ICMP-based attacks embed crafted data in the quoted bytes.
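
To show why 8 bytes are enough, here is a minimal sketch that pulls the original addresses and ports out of an ICMP error message. It assumes an IPv4 error whose body starts with the 8-byte ICMP header followed by the quoted datagram; `parse_icmp_error` is a name invented for this example.

```python
import socket
import struct

TCP, UDP = 6, 17  # IP protocol numbers


def parse_icmp_error(icmp_bytes):
    """Extract the quoted original packet's details from an ICMP error."""
    icmp_type, icmp_code = icmp_bytes[0], icmp_bytes[1]
    quoted = icmp_bytes[8:]  # original datagram follows the 8-byte ICMP header
    header_len = (quoted[0] & 0x0F) * 4  # IHL field, in 32-bit words
    protocol = quoted[9]
    info = {
        "type": icmp_type,
        "code": icmp_code,
        "protocol": protocol,
        "src": socket.inet_ntoa(quoted[12:16]),
        "dst": socket.inet_ntoa(quoted[16:20]),
        "sport": None,
        "dport": None,
    }
    if protocol in (TCP, UDP):
        # Both TCP and UDP start with source port, then destination port
        sport, dport = struct.unpack("!HH", quoted[header_len:header_len + 4])
        info["sport"], info["dport"] = sport, dport
    return info
```

A traceroute-style tool can compare `sport` and `dport` against the probes it sent to decide which probe a given reply belongs to.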

TCP RST packets are the network's way of saying "I have no idea what you're talking about"

A TCP RST (reset) is sent when a host receives a segment for a connection it has no record of, including a SYN for a port with no listener. Common causes: the server process crashed or restarted, the connection sat idle until a NAT or firewall state entry expired so later packets no longer match the flow, or a middlebox actively reset the connection. RST packets are often the first visible clue of load balancer idle timeouts, firewall state table exhaustion, or server-side crashes.
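
The easiest RST to reproduce is the one a host sends in reply to a SYN for a port with no listener, which Python surfaces as `ConnectionRefusedError`. The sketch below uses that to tell "actively refused" apart from "silently dropped"; `probe` and its return labels are illustrative names, not part of any tool.

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The far end answered, usually with a RST: nothing is listening
        return "refused"
    except socket.timeout:
        # No answer at all: packets are being dropped or filtered
        return "timeout"
    except OSError:
        # Unreachable network, resolution failure, and similar errors
        return "error"


if __name__ == "__main__":
    print(probe("127.0.0.1", 9))
```

A "refused" result shows the path to the host works, while a "timeout" points at filtering or loss somewhere along the way.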

The "ping works but the application doesn't" scenario often has a specific cause: a path MTU black hole

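When ping with its default small packets works but applications that send larger packets stall or fail, a likely cause is a path MTU problem. Somewhere along the path a link has a smaller MTU than the endpoints assume, and the ICMP "Fragmentation Needed" messages that would tell the sender to shrink its packets are being blocked by a firewall. A quick Linux check is ping -M do -s 1472 <destination>: the 1472-byte payload plus 8 bytes of ICMP header and 20 bytes of IPv4 header makes a 1500-byte packet with the Don't Fragment bit set, so if it fails while smaller sizes succeed, the path MTU is below 1500.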
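
The payload arithmetic, and a way to narrow down the actual path MTU, can be sketched as follows. The `probe` callback is an assumption: it stands in for whatever sends a Don't Fragment ping of a given payload size and reports success. The 576-byte lower bound is likewise just a starting guess for the search.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu):
    """Largest ping payload (-s value) that fits in one packet of size mtu."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low=576, high=1500):
    """Binary-search the largest MTU for which probe(payload_size) succeeds.

    probe is a caller-supplied function returning True when a Don't Fragment
    ping with that payload size gets a reply. Returns None if even the
    smallest size fails.
    """
    if not probe(ping_payload_for_mtu(low)):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(ping_payload_for_mtu(mid)):
            low = mid       # mid fits, so search upward
        else:
            high = mid - 1  # mid is too big, so search downward
    return low


if __name__ == "__main__":
    # Pretend a tunnel on the path limits packets to 1400 bytes
    def fake_probe(size):
        return size + IPV4_HEADER + ICMP_HEADER <= 1400

    print(ping_payload_for_mtu(1500))  # 1472
    print(find_path_mtu(fake_probe))   # 1400
```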

Half-open TCP connections are invisible to one side

If one side of a TCP connection disappears without sending a FIN or RST (power failure, kernel panic, cable pull), the other side has no way of knowing the connection is dead. If it tries to send data, it finds out only after its retransmissions time out, which can take many minutes. If it is only waiting to read, it finds out only when a keepalive probe goes unanswered, and with keepalives disabled it never finds out. Linux's default keepalive idle time is two hours. During this time, the surviving side holds the connection open, consuming resources. This is why TCP keepalives and application-level health checks exist.
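
A minimal sketch of turning on keepalives for a single socket, with an estimate of how long a dead peer can go unnoticed. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-specific, so they are set only where Python exposes them. The timing values are arbitrary example choices.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive on sock; return worst-case detection time in seconds.

    idle:     seconds of silence before the first probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # per-socket tuning where available
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return idle + interval * count


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("dead peer detected within about", enable_keepalive(s), "seconds")
    s.close()
```

With the Linux system-wide defaults (7200 seconds idle, 75 seconds between probes, 9 probes) the same formula gives 7875 seconds, a little over two hours.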

The "tcp_tw_recycle" sysctl was so dangerous it was removed from the Linux kernel

tcp_tw_recycle was a Linux kernel option that aggressively recycled TIME_WAIT connections using TCP timestamps. It worked fine on direct connections, but behind NAT (where multiple clients share one IP with different timestamp clocks), it caused random connection drops. The problem was so widespread and so hard to diagnose that the option was removed entirely in Linux 4.12 (2017). tcp_tw_reuse (the safer alternative) remains.
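
The NAT failure mode can be illustrated with a deliberately simplified model: the server remembers the highest TCP timestamp seen per source IP and silently drops SYNs that appear to go backwards. The class below is a toy model for illustration, not kernel code, and it omits details such as the time window the real check applied.

```python
class RecyclingServer:
    """Toy model of per-source-IP timestamp checking on incoming SYNs."""

    def __init__(self):
        self.last_ts = {}  # source IP -> highest TCP timestamp seen so far

    def accept_syn(self, src_ip, tsval):
        last = self.last_ts.get(src_ip)
        if last is not None and tsval < last:
            # Looks like a stale duplicate from the same host: drop silently
            return False
        self.last_ts[src_ip] = tsval
        return True


if __name__ == "__main__":
    server = RecyclingServer()
    nat_ip = "203.0.113.7"  # two clients share this public address
    print(server.accept_syn(nat_ip, 5_000_000))  # client A, long uptime
    print(server.accept_syn(nat_ip, 1_000))      # client B, just booted
    print(server.accept_syn(nat_ip, 5_000_100))  # client A again
```

In this model client B's SYN is dropped only because client A's unrelated clock happens to be ahead, which matches the seemingly random failures described above.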

Asymmetric routing breaks stateful firewalls silently

If traffic from A to B takes a different path than traffic from B to A (asymmetric routing), a stateful firewall on only one path sees half the connection. It sees the SYN but never the SYN-ACK, or vice versa, and drops the traffic as invalid. The symptom is intermittent connectivity that depends on which path the initial SYN takes. This is one of the hardest networking issues to diagnose because it depends on routing state that may change between tests.
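
A toy model makes the failure easy to see. The firewall below records a flow only when it sees the opening SYN and rejects anything that does not belong to a recorded flow. The class and its string-based flags are simplifications invented for this sketch.

```python
class StatefulFirewall:
    """Toy connection tracker: a flow exists only once its SYN has been seen."""

    def __init__(self):
        self.flows = set()

    def allow(self, src, dst, flags):
        if flags == "SYN":
            self.flows.add((src, dst))  # new flow, remember who initiated it
            return True
        # Any other segment must belong to a flow this firewall saw start
        return (src, dst) in self.flows or (dst, src) in self.flows


if __name__ == "__main__":
    # Symmetric routing: both directions cross the same firewall
    fw = StatefulFirewall()
    print(fw.allow("A", "B", "SYN"), fw.allow("B", "A", "SYN-ACK"))

    # Asymmetric routing: the SYN took another path, so this firewall
    # sees only the reply and treats it as invalid
    fw_return_path = StatefulFirewall()
    print(fw_return_path.allow("B", "A", "SYN-ACK"))
```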

The "arp -a" command has saved more troubleshooting hours than any monitoring tool

Checking the ARP table (arp -a or ip neigh show) reveals whether the host can reach its default gateway and local peers at Layer 2. If the gateway's MAC address is missing or shows "incomplete," the problem is Layer 2 (cable, VLAN, switch port) — no amount of IP-level debugging will help. Experienced network troubleshooters check the ARP table before anything else.
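
The gateway check can be automated by parsing the text that ip neigh show prints. The sketch assumes the common line layout (address first, optional `lladdr <mac>`, state last); the function name and the "MISSING" label for an absent entry are made up for this example.

```python
def gateway_l2_state(ip_neigh_output, gateway_ip):
    """Return (mac, state) for gateway_ip from 'ip neigh show' output."""
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != gateway_ip:
            continue
        mac = None
        if "lladdr" in fields:
            mac = fields[fields.index("lladdr") + 1]
        return mac, fields[-1]  # neighbour state is the last field
    return None, "MISSING"  # no entry at all for the gateway


if __name__ == "__main__":
    sample = (
        "192.168.1.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE\n"
        "192.168.1.50 dev eth0  INCOMPLETE\n"
    )
    print(gateway_l2_state(sample, "192.168.1.1"))
    print(gateway_l2_state(sample, "192.168.1.50"))
```

A state of INCOMPLETE or FAILED, or no MAC address at all, points at Layer 2 before any IP-level debugging is worth doing.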

tcpdump's default capture size used to truncate packets, hiding the actual problem

Before version 4.0, tcpdump's default snap length was 68 bytes (96 on builds with IPv6 support), capturing little more than the headers and discarding most of the payload. This meant that protocol-level issues in the payload (HTTP errors, TLS handshake failures, DNS response content) were invisible in the capture. Many troubleshooting sessions were wasted analyzing truncated captures. Modern tcpdump defaults to 262144 bytes, and -s 0 now simply selects that default, but the flag became muscle memory for a generation of engineers.
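
A quick calculation shows how little payload survived the old defaults. The header sizes assume Ethernet framing, an IPv4 header without options, and a 20-byte base TCP header. The 12 bytes of TCP options used in the example correspond to the timestamp option with padding.

```python
ETHERNET_HEADER = 14  # bytes
IPV4_HEADER = 20      # bytes, no IP options
TCP_HEADER = 20       # bytes, no TCP options


def payload_bytes_captured(snaplen, tcp_options=0):
    """Application payload bytes left in a capture truncated at snaplen."""
    headers = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER + tcp_options
    return max(0, snaplen - headers)


if __name__ == "__main__":
    for snaplen in (68, 96, 262144):
        print(snaplen, payload_bytes_captured(snaplen, tcp_options=12))
```

With a 68-byte snap length and TCP timestamps in use, only 2 bytes of application data remain, which is not enough to see even the start of an HTTP status line.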

Network problems follow the "five whys" pattern more than almost any other domain

A "slow application" might be caused by high latency, which is caused by packet retransmissions, which are caused by an interface dropping packets, which is caused by a duplex mismatch, which is caused by a misconfigured switch port. Each layer of the stack can mask the root cause. The discipline of systematically working through layers — physical, data link, network, transport, application — is why the OSI model remains useful for troubleshooting even though nobody implements it.