Network Traps & Deep Debugging

The failure modes that survive basic troubleshooting and require real understanding of the stack.

MTU Blackhole Debugging

Symptoms

  • TCP connections establish successfully (SYN/SYN-ACK are small)
  • Small requests work (API health checks pass)
  • Large responses fail silently -- transfers hang or time out
  • SSH works but SCP/SFTP stalls after the initial handshake
  • TLS handshake fails (the certificate exchange exceeds MTU)

Diagnosis

# Test path MTU by sending progressively smaller packets with DF bit set
# Standard Ethernet MTU = 1500. Subtract 28 bytes for IP + ICMP headers.
ping -M do -s 1472 destination    # Should work if MTU is 1500
ping -M do -s 1473 destination    # Should fail if any link is 1500

# Binary search for the real path MTU
ping -M do -s 1400 destination    # Works?
ping -M do -s 1450 destination    # Works?
ping -M do -s 1460 destination    # Fails? MTU is between 1450+28 and 1460+28

# Capture ICMP "fragmentation needed" messages (if they exist)
tcpdump -i eth0 -nn 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'
# If nothing shows up, a firewall is blocking ICMP -- that IS the problem

# Check local interface MTU
ip link show eth0 | grep mtu

# In Kubernetes, check overlay network MTU
kubectl exec -it debug-pod -- ip link show eth0
# Overlay MTU should be host MTU minus encapsulation overhead
# VXLAN: host MTU - 50 bytes. Geneve: host MTU - 50 bytes (base header; options add more).
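The manual binary search above can be automated. A sketch, not a hardened tool: `probe` here is a mock that pretends the path passes payloads up to 1372 bytes; swap in the real ping shown in the comment (`DEST` is a placeholder for your target host).

```shell
#!/bin/sh
# Binary-search the path MTU between a known-good and a known-bad payload size.
# Real probe: probe() { ping -M do -c 1 -W 1 -s "$1" "$DEST" >/dev/null 2>&1; }
# Mock probe for illustration: path passes payloads up to 1372 bytes (MTU 1400)
probe() { [ "$1" -le 1372 ]; }

pmtu_search() {
    lo=$1; hi=$2                        # lo works, hi fails
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$mid"; then lo=$mid; else hi=$mid; fi
    done
    echo $(( lo + 28 ))                 # payload + 28-byte IP/ICMP overhead
}

pmtu_search 1200 1500                   # prints 1400 with the mock probe
```

About 8 probes cover the whole 1200-1500 range, versus dozens of manual guesses.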

Fix

# Set interface MTU to match the path
ip link set eth0 mtu 1400

# Permanent fix (Netplan example)
# /etc/netplan/01-config.yaml:
#   ethernets:
#     eth0:
#       mtu: 1400

# For Kubernetes CNI: update the CNI config to set pod MTU
# Calico (IPIP): kubectl set env daemonset/calico-node -n kube-system FELIX_IPINIPMTU=1440
#   (VXLAN-mode Calico uses FELIX_VXLANMTU instead)
# Flannel: edit the ConfigMap net-conf.json, set "MTU": 1440
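The CNI MTU values above are plain subtraction: host MTU minus the bytes each encapsulation adds. A sketch of the arithmetic (IPv4 overheads; Geneve shown at its base header size, options add more):

```shell
#!/bin/sh
# pod MTU = host MTU - encapsulation overhead (IPv4 figures)
pod_mtu() {
    case $2 in
        vxlan)  echo $(( $1 - 50 )) ;;  # inner eth 14 + vxlan 8 + udp 8 + outer ip 20
        geneve) echo $(( $1 - 50 )) ;;  # base header only; options add more
        ipip)   echo $(( $1 - 20 )) ;;  # one extra outer IPv4 header
    esac
}
pod_mtu 1500 vxlan    # prints 1450
pod_mtu 1500 ipip     # prints 1480
```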

# Alternatively, enable TCP MSS clamping on the firewall
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
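MSS clamping works because TCP advertises its maximum segment size in the SYN; the firewall rewrites that value down so segments never exceed the path MTU. The arithmetic, as a sketch (IPv4, TCP options ignored):

```shell
#!/bin/sh
# MSS = MTU - 20 (IPv4 header) - 20 (TCP header), ignoring TCP options
mss_for_mtu() { echo $(( $1 - 40 )); }
mss_for_mtu 1400    # prints 1360
mss_for_mtu 1500    # prints 1460
```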

Verification

# After fix, large transfers should complete
curl -o /dev/null http://destination/large-file
ping -M do -s 1372 destination   # Adjusted for new MTU

Asymmetric Routing

Symptoms

  • Traffic works in one direction but not the other
  • Connections from host A to B work, but B to A fail
  • Stateful firewalls drop packets (they see responses without matching requests)
  • Intermittent failures when multiple default gateways exist

Diagnosis

# Check if the kernel is dropping packets due to reverse path filtering
cat /proc/sys/net/ipv4/conf/all/rp_filter
cat /proc/sys/net/ipv4/conf/eth0/rp_filter
# 1 = strict (drops packets arriving on "wrong" interface)
# 2 = loose (accepts if any route exists back to source)
# 0 = disabled (accepts everything)

# Check for drops caused by rp_filter
grep -i "martian" /var/log/syslog
# Enable logging to see them:
sysctl -w net.ipv4.conf.all.log_martians=1

# Trace the path in both directions
traceroute -n destination        # run from the source host
# Then from the other side:
traceroute -n source             # run from the destination host
# If the two paths differ, you have asymmetric routing

# Check routing table for multiple default routes
ip route show | grep default
# Multiple defaults with same metric = unpredictable path selection
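The same-metric check can be scripted. A sketch that reads `ip route show` output on stdin; the field layout is assumed to be iproute2's `default via <gw> dev <if> ... metric <n>`, with an absent metric treated as 0:

```shell
#!/bin/sh
# Print DUPLICATE if two or more default routes share a metric, else OK
dup_defaults() {
    awk '/^default/ {
        m = 0
        for (i = 1; i < NF; i++) if ($i == "metric") m = $(i + 1)
        if (seen[m]++) dup = 1
    } END { print (dup ? "DUPLICATE" : "OK") }'
}

printf 'default via 10.0.0.1 dev eth0\ndefault via 10.0.1.1 dev eth1\n' | dup_defaults
# prints DUPLICATE (both routes have implicit metric 0)
```

In practice: `ip route show | dup_defaults`.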

Fix

# Option 1: Set rp_filter to loose mode (if asymmetry is intentional)
sysctl -w net.ipv4.conf.all.rp_filter=2
# Persist in /etc/sysctl.d/99-rp-filter.conf:
echo "net.ipv4.conf.all.rp_filter = 2" > /etc/sysctl.d/99-rp-filter.conf

# Option 2: Fix routing so return traffic uses the correct interface
# Use policy-based routing to ensure replies go out the same interface
ip rule add from 10.0.1.100 table 100
ip route add default via 10.0.1.1 table 100

# Option 3: On stateful firewalls, add rules for both directions

Verification

# Test bidirectional connectivity
ping -c 5 destination       # from source
# And from the other side:
ssh destination "ping -c 5 source"

# Confirm no martian drops
dmesg | grep -i martian     # should be clean

DNS Caching Traps

Symptoms

  • DNS record was changed but old IP is still being used
  • Different servers resolve the same name to different IPs
  • In Kubernetes: services resolve correctly from some pods but not others
  • Application works with IP address but fails with hostname

Diagnosis

# Query each layer of the DNS stack separately
dig @127.0.0.1 app.example.com          # Local resolver
dig @10.96.0.10 app.example.com         # Kubernetes CoreDNS
dig @8.8.8.8 app.example.com            # Public resolver
dig @ns1.example.com app.example.com    # Authoritative server

# Check TTL -- how long until caches expire?
dig +noall +answer +ttlunits app.example.com
# If TTL is 86400 (24h), caches won't refresh for a day

# Check systemd-resolved cache
resolvectl statistics
resolvectl query app.example.com

# Flush local caches
resolvectl flush-caches                 # systemd-resolved (older name: systemd-resolve --flush-caches)
/etc/init.d/nscd restart                # NSCD
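How long a stale answer can survive is bounded by the largest remaining TTL. A sketch that pulls it out of `dig +noall +answer` output, relying on dig's `name ttl class type rdata` column layout:

```shell
#!/bin/sh
# Largest TTL in a dig answer section = worst-case wait for caches to expire
max_ttl() { awk '{ if ($2 + 0 > t) t = $2 + 0 } END { print t + 0 }'; }

printf 'app.example.com. 86400 IN A 192.0.2.10\napp.example.com. 300 IN A 192.0.2.11\n' | max_ttl
# prints 86400
```

In practice: `dig +noall +answer app.example.com | max_ttl`.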

Kubernetes-specific DNS traps:

# The ndots problem: default ndots=5 means any name with fewer than 5 dots
# gets search domains appended FIRST, generating 4-6 extra queries
kubectl exec -it pod -- cat /etc/resolv.conf
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# Querying "api.example.com" (2 dots, < 5) generates:
#   api.example.com.default.svc.cluster.local   -> NXDOMAIN
#   api.example.com.svc.cluster.local            -> NXDOMAIN
#   api.example.com.cluster.local                -> NXDOMAIN
#   api.example.com.                             -> finally resolves

# This causes latency and hammers CoreDNS. Fix:
# 1. Use FQDNs with trailing dot: "api.example.com."
# 2. Set ndots:1 in pod DNS config (but breaks short service names)
# 3. Set ndots:2 as a compromise
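The query fan-out above is deterministic, so it can be sketched. This models the glibc stub resolver's search-list behavior, simplified (the real resolver also falls back to the search list if the absolute query fails); ndots and search domains here are the Kubernetes defaults:

```shell
#!/bin/sh
# Emit the lookups the stub resolver will try, in order.
# Names with fewer than ndots dots get each search domain appended first.
expand_queries() {
    name=$1; ndots=$2; shift 2
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    if [ "$dots" -lt "$ndots" ]; then
        for d in "$@"; do echo "$name.$d"; done
    fi
    echo "$name."
}

expand_queries api.example.com 5 \
    default.svc.cluster.local svc.cluster.local cluster.local
```

This prints the same four lookups listed above. With ndots:2, the name's two dots meet the threshold and only `api.example.com.` is tried.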

Fix

# For stale cache: flush and reduce TTL at the authoritative server
# For ndots: set pod dnsConfig in the deployment spec:
#   dnsConfig:
#     options:
#       - name: ndots
#         value: "2"

# For search domain explosion: use dnsPolicy "None" with an explicit dnsConfig,
# or dnsPolicy "Default" for pods that only need non-cluster names

Verification

dig +short app.example.com             # from host
kubectl exec debug-pod -- nslookup app.example.com   # from pod
# Both should return the correct, updated IP

PMTUD Failures

Symptoms

  • Identical to MTU blackhole (they are the same root cause)
  • Path MTU Discovery relies on ICMP "Fragmentation Needed" (type 3, code 4)
  • When firewalls block ICMP, PMTUD breaks and TCP connections hang on large payloads

Diagnosis

# Check if PMTUD is working
ip route get destination
# Should show "mtu <value>" if PMTUD has learned a smaller MTU

# Check the PMTU exception cache (may print nothing on modern kernels;
# `ip route get` above is the reliable check)
ip route show cache | grep destination

# Capture to see if ICMP frag-needed messages arrive
tcpdump -i eth0 -nn 'icmp[icmptype] == 3 and icmp[icmpcode] == 4' -c 5

# If no ICMP arrives, check intermediate firewalls
# Many security appliances block all ICMP -- this breaks the internet
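Extracting the learned value can be scripted. A sketch that scans `ip route get` output on stdin for the `mtu N` pair as iproute2 prints it; the sample line is illustrative:

```shell
#!/bin/sh
# Print the cached PMTU from `ip route get <dst>` output; no output = none learned
learned_pmtu() { awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") print $(i + 1) }'; }

printf '203.0.113.9 via 10.0.0.1 dev eth0 src 10.0.0.5 mtu 1400\n' | learned_pmtu
# prints 1400
```

In practice: `ip route get destination | learned_pmtu`.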

Fix

# Option 1: Fix the firewall to allow ICMP type 3 code 4
iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT

# Option 2: TCP MSS clamping (works without ICMP)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

# Option 3: Manually lower MTU on the endpoint
ip link set eth0 mtu 1400

Verification

# Large data transfer should complete
curl -o /dev/null http://destination/large-file
# PMTU cache should be populated
ip route show cache | grep destination

TCP TIME_WAIT Exhaustion

Symptoms

  • Connection failures under high load: "Cannot assign requested address"
  • ss shows thousands of sockets in TIME_WAIT state
  • Affects servers making many short-lived outbound connections (proxies, load balancers, API gateways)
  • The server has available CPU and memory but refuses new connections

Diagnosis

# Count sockets by state
ss -s
# Look at TIME_WAIT count. Anything over 20,000 is concerning.

# Count TIME_WAIT sockets per destination
ss -tan state time-wait | awk '{print $4}' | sort | uniq -c | sort -rn | head -20

# Check ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range
# Default: 32768 60999 = ~28,000 ports
# If you have 28,000 TIME_WAIT sockets to the same destination:port, you are exhausted

# Check tw_reuse setting
cat /proc/sys/net/ipv4/tcp_tw_reuse
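Why exhaustion hits at surprisingly modest load: each TIME_WAIT socket holds its port for 60 s (the kernel's compile-time TCP_TIMEWAIT_LEN), so the sustainable new-connection rate to one destination ip:port is just the port count divided by 60. A sketch of the arithmetic:

```shell
#!/bin/sh
# Sustainable outbound connections/sec to one destination tuple
# = usable ephemeral ports / 60 s TIME_WAIT (integer floor)
max_conn_rate() { echo $(( $1 / 60 )); }
max_conn_rate 28232    # prints 470 -- the default port range caps you near 470/s
```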

Fix

# Option 1: Enable TIME_WAIT reuse (safe for outbound connections)
sysctl -w net.ipv4.tcp_tw_reuse=1
echo "net.ipv4.tcp_tw_reuse = 1" >> /etc/sysctl.d/99-tcp-tuning.conf

# Option 2: Expand the ephemeral port range (keep it clear of your listening ports)
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Option 3: Use connection pooling in the application (best fix)
# HTTP keepalive, database connection pools, etc.
# This eliminates the problem at the source

# Option 4: There is no supported knob for the TIME_WAIT duration itself --
# the 60s value is compile-time (TCP_TIMEWAIT_LEN); tcp_tw_reuse=1 is the practical lever

# NEVER set tcp_tw_recycle=1 -- it is broken behind NAT and was removed in kernel 4.12

Verification

# After fix, monitor TIME_WAIT count under load
watch -n 2 'ss -tan state time-wait | wc -l'

# Confirm connections succeed
curl -v http://destination/health
# Should not get "Cannot assign requested address"