TCP/IP Deep Dive Footguns¶
Mistakes that cause mysterious connection failures, latency spikes, and throughput problems that look like application bugs.
1. TIME_WAIT is not a problem — stop trying to eliminate it¶
Operators see 30,000 TIME_WAIT sockets and panic. They set tcp_tw_recycle (now removed), reduce tcp_fin_timeout to 5 seconds, or try to flush TIME_WAIT sockets. TIME_WAIT exists to prevent old packets from a closed connection being accepted by a new connection on the same port. Eliminating it causes silent data corruption.
30,000 TIME_WAIT sockets use approximately 3MB of kernel memory. They are cheap. The only time TIME_WAIT is a real problem is when you exhaust the ephemeral port range to a single destination IP:port pair. The default Linux range (net.ipv4.ip_local_port_range = 32768 60999) gives about 28,000 ports; with a 60-second TIME_WAIT, that allows only ~470 short-lived connections per second to a single destination. Even expanded to ~64,000 ports, the ceiling is ~1,066 per second.
Fix: Use connection pooling. Set tcp_tw_reuse = 1 (safe for outbound connections; it relies on TCP timestamps, which are enabled by default). Expand the port range with ip_local_port_range. Do not touch tcp_fin_timeout for TIME_WAIT: it only controls the FIN_WAIT_2 timeout, not TIME_WAIT duration (TIME_WAIT is hardcoded at 60 seconds on Linux).
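To see how cheap TIME_WAIT actually is, count the sockets directly. A minimal sketch (not from the original text) that parses /proc/net/tcp, where the state column value 06 means TIME_WAIT:

```python
# Count TIME_WAIT sockets by reading /proc/net/tcp directly.
# Column 4 (index 3) is the connection state; 06 == TIME_WAIT.

def count_time_wait(path="/proc/net/tcp"):
    with open(path) as f:
        next(f)  # skip the header line
        return sum(1 for line in f if line.split()[3] == "06")

print(count_time_wait())
```

In practice, `ss -tan state time-wait | wc -l` gives the same number without any code.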
2. tcp_tw_recycle breaking behind NAT¶
net.ipv4.tcp_tw_recycle recycled TIME_WAIT sockets using TCP timestamp validation. When multiple clients were behind a NAT (same source IP, different timestamp clocks), connections from some clients were silently dropped. The symptom was intermittent: some users behind the same corporate firewall could connect, others could not, with no server-side errors.
This was so widely misunderstood and so frequently harmful that the kernel developers removed tcp_tw_recycle entirely in kernel 4.12 (2017). If you see it in old Ansible playbooks or sysctl configs, remove it. If you see it recommended in a blog post, the blog post is wrong.
Fix: Remove any reference to tcp_tw_recycle. Use tcp_tw_reuse = 1 instead, which only affects outbound connections and is safe behind NAT.
Debug clue: If users behind corporate firewalls report intermittent connection failures while others work fine, check for tcp_tw_recycle=1 in your sysctl config. The classic symptom is that only some clients behind the same NAT IP fail, and it appears random because it depends on per-source-IP TCP timestamp ordering. The failures appear in netstat -s as "passive connections rejected because of time stamp" under TCPExt.
3. Small socket buffers limiting throughput on high-latency links¶
Your application transfers data between US-East and EU-West (80ms RTT). With the default net.core.rmem_max of 212992 bytes (208KB), the maximum throughput for a single TCP connection is 208KB / 0.08s = 2.6 MB/s (about 21 Mbps), regardless of available bandwidth. You have a 10 Gbps link and are using roughly 0.2% of it.
This is the bandwidth-delay product problem. TCP cannot have more data in-flight than the receive window allows, and the receive window cannot exceed the socket buffer size.
Fix: Set socket buffers large enough for the BDP of your highest-latency paths.
# BDP = bandwidth * RTT
# 1 Gbps * 80ms = 125,000,000 * 0.080 = 10,000,000 bytes (10 MB)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
The third value of tcp_rmem/tcp_wmem is the auto-tuning maximum; the kernel will not automatically grow buffers beyond it. Note that rmem_max/wmem_max cap explicit setsockopt(SO_RCVBUF/SO_SNDBUF) requests, while the tcp_rmem/tcp_wmem maxima govern auto-tuning, so both need raising.
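The bandwidth-delay product arithmetic above can be sketched as a pair of helper functions (illustrative names, not from the original text):

```python
# Bandwidth-delay product: how much data must be "in flight" to keep a
# path full. A receive buffer smaller than the BDP caps throughput.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    return int(bandwidth_bps / 8 * rtt_s)

def max_throughput_bytes_per_s(buffer_bytes: int, rtt_s: float) -> float:
    # TCP cannot move more than one receive window per round trip.
    return buffer_bytes / rtt_s

print(bdp_bytes(1e9, 0.080))                      # 10000000 bytes for 1 Gbps at 80ms
print(max_throughput_bytes_per_s(212992, 0.080))  # 2662400.0, i.e. ~2.66 MB/s
```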
4. Nagle's algorithm + delayed ACK interaction¶
Nagle says "do not send small segments if there is unacknowledged data." Delayed ACK says "wait up to 40-200ms before sending a bare ACK, in case there is data to piggyback." Together they create a deadlock: the sender holds data waiting for an ACK, the receiver holds the ACK waiting for data. The result is 40-200ms of added latency on every other small write.
This destroys performance in request-response protocols. The classic example: an application sends a request header in one write() and the body in a second write(). Nagle buffers the body because the header is not yet acknowledged. The delayed ACK timer on the receiver fires after 40ms. Total added latency: 40ms per request, which turns into 25 requests/second maximum on a single connection.
Fix: Set TCP_NODELAY on any socket used for request-response communication. This disables Nagle and sends every write() immediately. Virtually every modern protocol library does this by default (HTTP/2, gRPC, database drivers), but check your application if you see mysterious 40-200ms latency bumps.
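Setting the option is a one-liner; a minimal Python sketch (any language with raw socket access looks the same):

```python
import socket

# Disable Nagle on a request-response socket. Setting it before connect()
# is fine; the option applies for the life of the socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify it took effect.
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```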
5. Not setting TCP_NODELAY for interactive protocols¶
You write a custom protocol or use a library that does not set TCP_NODELAY. Your messages are small (50-500 bytes). Under low load, everything is fast — Nagle has no unacknowledged data to trigger buffering. Under load, with multiple messages in flight, Nagle kicks in and batches your 50-byte messages. Latency jumps from sub-millisecond to 40-200ms.
This bug is load-dependent, which makes it hard to catch in testing. It only appears when multiple small messages overlap in the send buffer.
Fix: Always set TCP_NODELAY for interactive, low-latency, or request-response protocols. The "wasted bandwidth" from sending small segments is negligible compared to the latency penalty. The only time to leave Nagle enabled is for bulk data transfers (file copy, backup streams) where throughput matters more than latency.
6. TCP keepalive defaults are too long¶
Linux defaults: first probe after 7200 seconds (2 hours), 75 seconds between probes, 9 probes. Total detection time: 2 hours and 11 minutes. If a database server crashes, your application holds a dead connection for over 2 hours before noticing.
During those 2 hours, every query sent to the dead connection blocks until the TCP retransmit timeout expires (which itself is minutes). Your application appears to hang, and connection pool slots are consumed by dead connections.
Fix: Set system-wide keepalive to something reasonable (300s idle / 30s interval / 5 probes = 7.5 min detection). Also configure application-level keepalive in your database driver, HTTP client, and gRPC configuration; these override the system defaults and allow you to tune per-connection.
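The same 300s/30s/5 tuning can be applied per-socket on Linux without touching sysctls; a sketch using the Linux-specific TCP_KEEP* options:

```python
import socket

# Per-socket keepalive (Linux): first probe after 300s idle, then every
# 30s, give up after 5 failed probes -> dead peer detected in ~7.5 min
# instead of the system default of over 2 hours.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```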
7. Forgetting that MSS is less than MTU¶
You see an MTU of 1500 and assume you can send 1500-byte payloads. You cannot. The IP header is 20 bytes and the TCP header is 20 bytes (40 with options). MSS is 1460 at best, often 1448 with TCP timestamps. In tunneled environments (VPN, VXLAN, GRE), the overhead is even larger:
| Encapsulation | Overhead | Effective MSS (1500 MTU) |
|---|---|---|
| None | 40 bytes | 1460 |
| VXLAN | 50 bytes | 1410 |
| GRE | 24 bytes | 1436 |
| IPsec (tunnel) | 52-73 bytes | ~1387-1408 |
| WireGuard | 60 bytes | 1400 |
If you set a fixed MSS or MTU without accounting for encapsulation overhead, packets that are slightly too large get fragmented or dropped.
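The table's arithmetic reduces to one subtraction; a sketch (the 40-byte figure assumes no TCP options, so real-world values are often 12 bytes lower with timestamps):

```python
# Effective MSS = MTU - encapsulation overhead - IP/TCP headers (40 bytes
# minimum; TCP options such as timestamps eat another 12).

def effective_mss(mtu: int = 1500, encap_overhead: int = 0) -> int:
    return mtu - encap_overhead - 40

print(effective_mss())                    # 1460, plain Ethernet
print(effective_mss(encap_overhead=50))   # 1410, VXLAN
print(effective_mss(encap_overhead=24))   # 1436, GRE
print(effective_mss(encap_overhead=60))   # 1400, WireGuard
```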
Fix: Always check the effective MSS with ss -ti. For tunneled traffic, reduce the tunnel interface MTU to account for encapsulation overhead, or clamp the MSS on the firewall:
# Rewrite the MSS in forwarded SYNs down to the discovered path MTU
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
8. PMTU blackhole from blocking ICMP¶
Your security team blocks all ICMP at the firewall because "ICMP is a security risk." Path MTU discovery relies on ICMP Type 3, Code 4 ("Fragmentation Needed"). When a router on the path has a smaller MTU, it sends this ICMP message to tell the sender to reduce segment size. With ICMP blocked, the sender never learns about the smaller MTU. Large packets are silently dropped.
The symptoms are bizarre: small requests work (ping, DNS, small HTTP responses) but large responses hang or time out. SSH works but SCP transfers stall. Web pages partially load then stop. This is called a PMTU blackhole and it is one of the hardest network problems to diagnose because it is intermittent and path-dependent.
Fix: Never block ICMP Type 3 (Destination Unreachable). At minimum, allow Type 3 Code 4 (Fragmentation Needed) and Type 11 (Time Exceeded, needed for traceroute). Also allow Type 0 (Echo Reply) and Type 8 (Echo Request) for basic connectivity testing. The "block all ICMP" advice is outdated and harmful.
9. conntrack table full — kernel silently drops packets¶
On Linux servers doing NAT (load balancers, gateways, Kubernetes nodes with kube-proxy), every connection gets an entry in the conntrack table. When the table fills up, the kernel drops new packets with no error message — connections simply fail. The only clue is a kernel log message: nf_conntrack: table full, dropping packet.
# Check conntrack usage
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
# If count is near max, you are about to lose packets
# Watch in real-time
watch -n 1 'sysctl net.netfilter.nf_conntrack_count; sysctl net.netfilter.nf_conntrack_max'
# Check kernel log
dmesg | grep conntrack
Fix: Increase net.netfilter.nf_conntrack_max. Each entry uses ~320 bytes, so 1M entries = ~320MB RAM. For Kubernetes nodes, 1048576 (1M) is a reasonable starting point. Also reduce nf_conntrack_tcp_timeout_time_wait from the default 120 to 60 to free entries faster.
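The memory cost in that fix is simple to sanity-check before raising the limit; a sketch of the estimate (320 bytes per entry, as above):

```python
# Rough conntrack memory estimate: entries x ~320 bytes per entry.

def conntrack_mem_mb(max_entries: int, bytes_per_entry: int = 320) -> float:
    return max_entries * bytes_per_entry / 2**20

print(conntrack_mem_mb(1_048_576))  # 320.0 MB for 1M entries
```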
10. BBR requiring fq qdisc and kernel support¶
You set net.ipv4.tcp_congestion_control = bbr and expect better throughput. But BBR requires the fq (Fair Queuing) packet scheduler to work correctly. Without fq, BBR's pacing logic has no effect — packets are sent in bursts, defeating BBR's congestion model.
# Check current qdisc
tc qdisc show dev eth0
# qdisc pfifo_fast 0: root ... ← wrong, need fq
# Set fq qdisc
tc qdisc replace dev eth0 root fq
# Persistent: make fq the default qdisc for new interfaces
sysctl -w net.core.default_qdisc=fq
Also, BBR requires kernel 4.9+ and the tcp_bbr module. On kernels without it, writing bbr to the sysctl fails with an error, and an entry in sysctl.conf is quietly skipped at boot, leaving you on CUBIC.
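One way to confirm which algorithm a connection actually ends up with is to read TCP_CONGESTION on the socket itself (Linux, exposed in Python 3.6+); a sketch:

```python
import socket

# Read the congestion control algorithm in effect on a fresh socket.
# If bbr never loaded, this reports the fallback (typically "cubic").
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
raw = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
algo = raw.split(b"\x00")[0].decode()
print(algo)
```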
11. SO_REUSEADDR vs SO_REUSEPORT confusion¶
SO_REUSEADDR allows binding to an address that is in TIME_WAIT. This is needed so a server can restart immediately after a crash without waiting 60 seconds for TIME_WAIT to expire. Virtually every server should set this.
SO_REUSEPORT allows multiple sockets (typically one per worker process) to bind to the same address:port and the kernel load-balances incoming connections across them. This is a performance optimization, not a restart fix.
Confusing the two leads to either "Address already in use" errors on restart (forgot SO_REUSEADDR) or no load balancing across workers (used SO_REUSEADDR when you meant SO_REUSEPORT).
Fix: Server applications should almost always set SO_REUSEADDR. Multi-process architectures (nginx with the reuseport listen directive, Go via net.ListenConfig's Control hook) should additionally use SO_REUSEPORT for kernel-level load balancing.
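The difference is easy to demonstrate: two listeners can share a port only if both set SO_REUSEPORT before bind. A minimal sketch (Linux 3.9+; helper name is illustrative):

```python
import socket

def worker_listener(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # restart past TIME_WAIT
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # share the port across workers
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

# Two "workers" on the same port: only possible because both set SO_REUSEPORT.
a = worker_listener(0)                      # port 0 -> kernel picks a free port
b = worker_listener(a.getsockname()[1])     # second listener on the same port
print(a.getsockname()[1] == b.getsockname()[1])
```

With SO_REUSEADDR alone, the second bind() would fail with "Address already in use"; SO_REUSEADDR only bypasses TIME_WAIT, it does not allow two live listeners.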