Network Traps & Deep Debugging

The failure modes that survive basic troubleshooting and require real understanding of the stack.

MTU Blackhole Debugging

Symptoms

  • TCP connections establish successfully (SYN/SYN-ACK are small)
  • Small requests work (API health checks pass)
  • Large responses fail silently -- transfers hang or time out
  • SSH works but SCP/SFTP stalls after the initial handshake
  • TLS handshake fails (the certificate exchange exceeds MTU)

Diagnosis

# Test path MTU by sending progressively smaller packets with DF bit set
# Standard Ethernet MTU = 1500. Subtract 28 bytes for IP + ICMP headers.
ping -M do -s 1472 destination    # Should work if MTU is 1500
ping -M do -s 1473 destination    # Should fail if any link is 1500

# Binary search for the real path MTU
ping -M do -s 1400 destination    # Works?
ping -M do -s 1450 destination    # Works?
ping -M do -s 1460 destination    # Fails? MTU is between 1450+28 and 1460+28

# Capture ICMP "fragmentation needed" messages (if they exist)
tcpdump -i eth0 -nn 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'
# If nothing shows up, a firewall is blocking ICMP -- that IS the problem

# Check local interface MTU
ip link show eth0 | grep mtu

# In Kubernetes, check overlay network MTU
kubectl exec -it debug-pod -- ip link show eth0
# Overlay MTU should be host MTU minus encapsulation overhead
# VXLAN: host MTU - 50 bytes. Geneve: host MTU - 50 bytes (base header; options add more).
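The manual binary search above can be automated. A sketch, not a hardened tool: `probe` here is a mock that pretends the path passes payloads up to 1372 bytes; swap in the real ping shown in the comment (`DEST` is a placeholder for your target host).

```shell
#!/bin/sh
# Binary-search the path MTU between a known-good and a known-bad payload size.
# Real probe: probe() { ping -M do -c 1 -W 1 -s "$1" "$DEST" >/dev/null 2>&1; }
# Mock probe for illustration: path passes payloads up to 1372 bytes (MTU 1400)
probe() { [ "$1" -le 1372 ]; }

pmtu_search() {
    lo=$1; hi=$2                        # lo works, hi fails
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$mid"; then lo=$mid; else hi=$mid; fi
    done
    echo $(( lo + 28 ))                 # payload + 28-byte IP/ICMP overhead
}

pmtu_search 1200 1500                   # prints 1400 with the mock probe
```

About 8 probes cover the whole 1200-1500 range, versus dozens of manual guesses.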

Fix

# Set interface MTU to match the path
ip link set eth0 mtu 1400

# Permanent fix (Netplan example)
# /etc/netplan/01-config.yaml:
#   ethernets:
#     eth0:
#       mtu: 1400

# For Kubernetes CNI: update the CNI config to set pod MTU
# Calico (IPIP): kubectl set env daemonset/calico-node -n kube-system FELIX_IPINIPMTU=1440
#   (VXLAN-mode Calico uses FELIX_VXLANMTU instead)
# Flannel: edit the ConfigMap net-conf.json, set "MTU": 1440
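The CNI MTU values above are plain subtraction: host MTU minus the bytes each encapsulation adds. A sketch of the arithmetic (IPv4 overheads; Geneve shown at its base header size, options add more):

```shell
#!/bin/sh
# pod MTU = host MTU - encapsulation overhead (IPv4 figures)
pod_mtu() {
    case $2 in
        vxlan)  echo $(( $1 - 50 )) ;;  # inner eth 14 + vxlan 8 + udp 8 + outer ip 20
        geneve) echo $(( $1 - 50 )) ;;  # base header only; options add more
        ipip)   echo $(( $1 - 20 )) ;;  # one extra outer IPv4 header
    esac
}
pod_mtu 1500 vxlan    # prints 1450
pod_mtu 1500 ipip     # prints 1480
```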

# Alternatively, enable TCP MSS clamping on the firewall
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
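MSS clamping works because TCP advertises its maximum segment size in the SYN; the firewall rewrites that value down so segments never exceed the path MTU. The arithmetic, as a sketch (IPv4, TCP options ignored):

```shell
#!/bin/sh
# MSS = MTU - 20 (IPv4 header) - 20 (TCP header), ignoring TCP options
mss_for_mtu() { echo $(( $1 - 40 )); }
mss_for_mtu 1400    # prints 1360
mss_for_mtu 1500    # prints 1460
```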

Verification

# After fix, large transfers should complete
curl -o /dev/null http://destination/large-file
ping -M do -s 1372 destination   # Adjusted for new MTU

Asymmetric Routing

Symptoms

  • Traffic works in one direction but not the other
  • Connections from host A to B work, but B to A fail
  • Stateful firewalls drop packets (they see responses without matching requests)
  • Intermittent failures when multiple default gateways exist

Diagnosis

# Check if the kernel is dropping packets due to reverse path filtering
cat /proc/sys/net/ipv4/conf/all/rp_filter
cat /proc/sys/net/ipv4/conf/eth0/rp_filter
# 1 = strict (drops packets arriving on "wrong" interface)
# 2 = loose (accepts if any route exists back to source)
# 0 = disabled (accepts everything)

# Check for drops caused by rp_filter
grep -i "martian" /var/log/syslog
# Enable logging to see them:
sysctl -w net.ipv4.conf.all.log_martians=1

# Trace the path in both directions
traceroute -n destination        # run from the source host
# Then from the other side:
traceroute -n source             # run from the destination host
# If the two paths differ, you have asymmetric routing

# Check routing table for multiple default routes
ip route show | grep default
# Multiple defaults with same metric = unpredictable path selection
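The same-metric check can be scripted. A sketch that reads `ip route show` output on stdin; the field layout is assumed to be iproute2's `default via <gw> dev <if> ... metric <n>`, with an absent metric treated as 0:

```shell
#!/bin/sh
# Print DUPLICATE if two or more default routes share a metric, else OK
dup_defaults() {
    awk '/^default/ {
        m = 0
        for (i = 1; i < NF; i++) if ($i == "metric") m = $(i + 1)
        if (seen[m]++) dup = 1
    } END { print (dup ? "DUPLICATE" : "OK") }'
}

printf 'default via 10.0.0.1 dev eth0\ndefault via 10.0.1.1 dev eth1\n' | dup_defaults
# prints DUPLICATE (both routes have implicit metric 0)
```

In practice: `ip route show | dup_defaults`.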

Fix

# Option 1: Set rp_filter to loose mode (if asymmetry is intentional)
sysctl -w net.ipv4.conf.all.rp_filter=2
# Persist in /etc/sysctl.d/99-rp-filter.conf:
echo "net.ipv4.conf.all.rp_filter = 2" > /etc/sysctl.d/99-rp-filter.conf

# Option 2: Fix routing so return traffic uses the correct interface
# Use policy-based routing to ensure replies go out the same interface
ip rule add from 10.0.1.100 table 100
ip route add default via 10.0.1.1 table 100

# Option 3: On stateful firewalls, add rules for both directions

Verification

# Test bidirectional connectivity
ping -c 5 destination       # from source
# And from the other side:
ssh destination "ping -c 5 source"

# Confirm no martian drops
dmesg | grep -i martian     # should be clean

DNS Caching Traps

Symptoms

  • DNS record was changed but old IP is still being used
  • Different servers resolve the same name to different IPs
  • In Kubernetes: services resolve correctly from some pods but not others
  • Application works with IP address but fails with hostname

Diagnosis

# Query each layer of the DNS stack separately
dig @127.0.0.1 app.example.com          # Local resolver
dig @10.96.0.10 app.example.com         # Kubernetes CoreDNS
dig @8.8.8.8 app.example.com            # Public resolver
dig @ns1.example.com app.example.com    # Authoritative server

# Check TTL -- how long until caches expire?
dig +noall +answer +ttlunits app.example.com
# If TTL is 86400 (24h), caches won't refresh for a day

# Check systemd-resolved cache
resolvectl statistics
resolvectl query app.example.com

# Flush local caches
resolvectl flush-caches                 # systemd-resolved (older name: systemd-resolve --flush-caches)
/etc/init.d/nscd restart                # NSCD
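How long a stale answer can survive is bounded by the largest remaining TTL. A sketch that pulls it out of `dig +noall +answer` output, relying on dig's `name ttl class type rdata` column layout:

```shell
#!/bin/sh
# Largest TTL in a dig answer section = worst-case wait for caches to expire
max_ttl() { awk '{ if ($2 + 0 > t) t = $2 + 0 } END { print t + 0 }'; }

printf 'app.example.com. 86400 IN A 192.0.2.10\napp.example.com. 300 IN A 192.0.2.11\n' | max_ttl
# prints 86400
```

In practice: `dig +noall +answer app.example.com | max_ttl`.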

Kubernetes-specific DNS traps:

# The ndots problem: default ndots=5 means any name with fewer than 5 dots
# gets search domains appended FIRST, generating 4-6 extra queries
kubectl exec -it pod -- cat /etc/resolv.conf
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# Querying "api.example.com" (2 dots, < 5) generates:
#   api.example.com.default.svc.cluster.local   -> NXDOMAIN
#   api.example.com.svc.cluster.local            -> NXDOMAIN
#   api.example.com.cluster.local                -> NXDOMAIN
#   api.example.com.                             -> finally resolves

# This causes latency and hammers CoreDNS. Fix:
# 1. Use FQDNs with trailing dot: "api.example.com."
# 2. Set ndots:1 in pod DNS config (but breaks short service names)
# 3. Set ndots:2 as a compromise
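The query fan-out above is deterministic, so it can be sketched. This models the glibc stub resolver's search-list behavior, simplified (the real resolver also falls back to the search list if the absolute query fails); ndots and search domains here are the Kubernetes defaults:

```shell
#!/bin/sh
# Emit the lookups the stub resolver will try, in order.
# Names with fewer than ndots dots get each search domain appended first.
expand_queries() {
    name=$1; ndots=$2; shift 2
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    if [ "$dots" -lt "$ndots" ]; then
        for d in "$@"; do echo "$name.$d"; done
    fi
    echo "$name."
}

expand_queries api.example.com 5 \
    default.svc.cluster.local svc.cluster.local cluster.local
```

This prints the same four lookups listed above. With ndots:2, the name's two dots meet the threshold and only `api.example.com.` is tried.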

Fix

# For stale cache: flush and reduce TTL at the authoritative server
# For ndots: set pod dnsConfig in the deployment spec:
#   dnsConfig:
#     options:
#       - name: ndots
#         value: "2"

# For search domain explosion: use dnsPolicy "None" with an explicit dnsConfig,
# or dnsPolicy "Default" for pods that only need non-cluster names

Verification

dig +short app.example.com             # from host
kubectl exec debug-pod -- nslookup app.example.com   # from pod
# Both should return the correct, updated IP

PMTUD Failures

Symptoms

  • Identical to MTU blackhole (they are the same root cause)
  • Path MTU Discovery relies on ICMP "Fragmentation Needed" (type 3, code 4)
  • When firewalls block ICMP, PMTUD breaks and TCP connections hang on large payloads

Diagnosis

# Check if PMTUD is working
ip route get destination
# Should show "mtu <value>" if PMTUD has learned a smaller MTU

# Check the PMTU exception cache (may print nothing on modern kernels;
# `ip route get` above is the reliable check)
ip route show cache | grep destination

# Capture to see if ICMP frag-needed messages arrive
tcpdump -i eth0 -nn 'icmp[icmptype] == 3 and icmp[icmpcode] == 4' -c 5

# If no ICMP arrives, check intermediate firewalls
# Many security appliances block all ICMP -- this breaks the internet
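Extracting the learned value can be scripted. A sketch that scans `ip route get` output on stdin for the `mtu N` pair as iproute2 prints it; the sample line is illustrative:

```shell
#!/bin/sh
# Print the cached PMTU from `ip route get <dst>` output; no output = none learned
learned_pmtu() { awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") print $(i + 1) }'; }

printf '203.0.113.9 via 10.0.0.1 dev eth0 src 10.0.0.5 mtu 1400\n' | learned_pmtu
# prints 1400
```

In practice: `ip route get destination | learned_pmtu`.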

Fix

# Option 1: Fix the firewall to allow ICMP type 3 code 4
iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT

# Option 2: TCP MSS clamping (works without ICMP)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

# Option 3: Manually lower MTU on the endpoint
ip link set eth0 mtu 1400

Verification

# Large data transfer should complete
curl -o /dev/null http://destination/large-file
# PMTU cache should be populated
ip route show cache | grep destination

TCP TIME_WAIT Exhaustion

Symptoms

  • Connection failures under high load: "Cannot assign requested address"
  • ss shows thousands of sockets in TIME_WAIT state
  • Affects servers making many short-lived outbound connections (proxies, load balancers, API gateways)
  • The server has available CPU and memory but refuses new connections

Diagnosis

# Count sockets by state
ss -s
# Look at TIME_WAIT count. Anything over 20,000 is concerning.

# Count TIME_WAIT sockets per destination
ss -tan state time-wait | awk '{print $4}' | sort | uniq -c | sort -rn | head -20

# Check ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range
# Default: 32768 60999 = ~28,000 ports
# If you have 28,000 TIME_WAIT sockets to the same destination:port, you are exhausted

# Check tw_reuse setting
cat /proc/sys/net/ipv4/tcp_tw_reuse
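Why exhaustion hits at surprisingly modest load: each TIME_WAIT socket holds its port for 60 s (the kernel's compile-time TCP_TIMEWAIT_LEN), so the sustainable new-connection rate to one destination ip:port is just the port count divided by 60. A sketch of the arithmetic:

```shell
#!/bin/sh
# Sustainable outbound connections/sec to one destination tuple
# = usable ephemeral ports / 60 s TIME_WAIT (integer floor)
max_conn_rate() { echo $(( $1 / 60 )); }
max_conn_rate 28232    # prints 470 -- the default port range caps you near 470/s
```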

Fix

# Option 1: Enable TIME_WAIT reuse (safe for outbound connections)
sysctl -w net.ipv4.tcp_tw_reuse=1
echo "net.ipv4.tcp_tw_reuse = 1" >> /etc/sysctl.d/99-tcp-tuning.conf

# Option 2: Expand the ephemeral port range (keep it clear of your listening ports)
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Option 3: Use connection pooling in the application (best fix)
# HTTP keepalive, database connection pools, etc.
# This eliminates the problem at the source

# Option 4: There is no supported knob for the TIME_WAIT duration itself --
# the 60s value is compile-time (TCP_TIMEWAIT_LEN); tcp_tw_reuse=1 is the practical lever

# NEVER set tcp_tw_recycle=1 -- it is broken behind NAT and was removed in kernel 4.12

Verification

# After fix, monitor TIME_WAIT count under load
watch -n 2 'ss -tan state time-wait | wc -l'

# Confirm connections succeed
curl -v http://destination/health
# Should not get "Cannot assign requested address"