TCP/IP Deep Dive - Street-Level Ops¶
What experienced network operators know from years of debugging connection issues, latency problems, and mysterious packet drops.
Diagnosing TIME_WAIT Accumulation¶
TIME_WAIT is the state operators worry about most, and most of the time it is not actually a problem. A socket stays in TIME_WAIT to absorb delayed packets from the previous connection; the spec calls for 2*MSL, which Linux hardcodes to 60 seconds (TCP_TIMEWAIT_LEN, not tunable via sysctl).
# Count TIME_WAIT sockets
ss -s
# TCP: 85432 (estab 45200, closed 12000, orphaned 340, timewait 28000)
ss -tan state time-wait | wc -l
# 28000
# See which destinations accumulate TIME_WAIT
ss -tan state time-wait | awk '{print $4}' | \
awk -F: '{print $1}' | sort | uniq -c | sort -rn | head -10
# 8450 10.0.2.100 ← backend pool member
# 6200 10.0.2.101
# 4100 10.0.3.50
# Check if you are running out of ephemeral ports
sysctl net.ipv4.ip_local_port_range
# 32768 60999 (only ~28000 ports)
# Count how many of those ports are tied up toward a single destination
ss -tan state time-wait dst 10.0.2.100 | wc -l
When TIME_WAIT is actually a problem: Only when you exhaust the ephemeral port range for a specific destination IP:port pair. 28000 TIME_WAIT sockets to 100 different backends is fine. 28000 to a single backend is a problem.
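The arithmetic behind that claim can be sketched in a few lines of Python (numbers are the defaults discussed above):

```python
# Back-of-envelope: how fast can we open connections to ONE dest IP:port
# before TIME_WAIT exhausts the ephemeral port range?

def max_conn_rate(port_lo: int, port_hi: int, time_wait_secs: int = 60) -> float:
    """Sustainable new connections/sec to a single (dst IP, dst port)."""
    usable_ports = port_hi - port_lo + 1
    return usable_ports / time_wait_secs

# Default Linux range 32768-60999 and the fixed 60 s TIME_WAIT:
rate = max_conn_rate(32768, 60999)
print(f"{rate:.1f} conn/s")  # 470.5 conn/s to one backend before ports run out
```

Above that rate, every ephemeral port for that destination is sitting in TIME_WAIT and connect() starts failing with EADDRNOTAVAIL, which is exactly the failure mode tcp_tw_reuse and connection pooling address.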
Fix:
# Enable TIME_WAIT reuse for outbound connections
# (needs TCP timestamps, on by default; do NOT confuse with the
# NAT-breaking tcp_tw_recycle, which was removed in kernel 4.12)
sysctl -w net.ipv4.tcp_tw_reuse=1
# Expand ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Reduce FIN_WAIT_2 timeout (dead half-closed connections)
sysctl -w net.ipv4.tcp_fin_timeout=15
# Application-level: use connection pooling (HTTP keep-alive, database pools)
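To see why pooling helps, here is a minimal stdlib sketch (hypothetical local server, Python's http.client): with HTTP/1.1 keep-alive, many requests ride one TCP connection, so one ephemeral port and one eventual TIME_WAIT instead of one per request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Track distinct TCP connections by the client's (ip, ephemeral port) pair
connections_seen = set()

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"              # enables keep-alive
    def do_GET(self):
        connections_seen.add(self.client_address)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):              # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
for _ in range(5):                             # five requests, one socket
    conn.request("GET", "/")
    conn.getresponse().read()
conn.close()
server.shutdown()

print(len(connections_seen))  # 1 — all five requests shared one connection
```

Without keep-alive (one HTTPConnection per request) the set would hold five distinct client ports, each destined for its own 60-second TIME_WAIT.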
Debugging Retransmissions¶
Retransmissions are the TCP response to packet loss. A few retransmissions per million segments is normal. Sustained retransmissions indicate a real problem: congestion, bad hardware, or misconfiguration.
# Quick retransmission stats
nstat -az | grep -i retrans
# TcpRetransSegs 12847 0.0
# TcpExtTCPSlowStartRetrans 0 0.0
# TcpExtTCPLossProbes 4521 0.0
# Watch retransmission rate in real-time
watch -n 1 'nstat -az | grep TcpRetransSegs'
# Per-connection retransmission info
ss -ti dst 10.0.2.100
# cubic wscale:7,7 rto:204 rtt:1.2/0.5 mss:1448 pmtu:1500
# cwnd:10 retrans:0/3 bytes_sent:1247832 bytes_acked:1247832
# ^^^
# current/total retransmissions
# Spot SYN retransmissions with tcpdump: repeated SYNs to the same
# destination with no SYN-ACK in between
tcpdump -n -i eth0 'tcp[tcpflags] & tcp-syn != 0' -c 20
# SYN retransmissions indicate the remote is not responding (down or filtered)
# See retransmit events in real-time with ss
watch -n 1 'ss -ti | grep retrans'
# Kernel retransmit stats
cat /proc/net/snmp | grep Tcp
# Tcp: ... RetransSegs ... OutSegs
# Retransmit ratio = RetransSegs / OutSegs
# (clean datacenter links sit under 0.01%; sustained rates near 1% mean real loss)
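The ratio computation is easy to script. A sketch that parses the /proc/net/snmp header/value line format (the SAMPLE below is illustrative, not from a real host):

```python
# /proc/net/snmp stores each protocol as a header line naming the fields
# followed by a value line. Pull RetransSegs and OutSegs from the Tcp pair.

SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 1000 2000 5 3 42 9000000 8000000 1200 0 10 0
"""

def retrans_ratio(snmp_text: str) -> float:
    lines = [l for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    header = lines[0].split()[1:]
    values = [int(v) for v in lines[1].split()[1:]]
    stats = dict(zip(header, values))
    return stats["RetransSegs"] / stats["OutSegs"]

# On a live Linux box, feed it open("/proc/net/snmp").read() instead
print(f"{retrans_ratio(SAMPLE):.4%}")  # 0.0150%
```

Wire that into a cron job or exporter and you get the retransmit ratio as a time series instead of a one-off spot check.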
Diagnosing the cause:
# Is it a specific host?
ss -ti dst 10.0.2.100 | grep retrans
# Is it a specific interface?
ethtool -S eth0 | grep -i error
ethtool -S eth0 | grep -i drop
# Is it congestion? Check if cwnd is small
ss -ti dst 10.0.2.100 | grep cwnd
# cwnd:3 ← congestion window is tiny, link is congested
# Is it MTU-related? Look for ICMP fragmentation needed
tcpdump -n 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'
Finding Connection Resets¶
TCP RST packets terminate connections abruptly. They can indicate application errors, firewall interference, or resource exhaustion.
# Capture RST packets
tcpdump -n -i eth0 'tcp[tcpflags] & tcp-rst != 0' -c 50
# 10:15:23.456 IP 10.0.2.100.8080 > 10.0.1.50.45678: Flags [R.], ...
# RST from server = server rejected the connection or application crashed
# RST from client = client gave up or firewall killed the connection
# Count resets per source
tcpdump -n -i eth0 'tcp[tcpflags] & tcp-rst != 0' -c 1000 2>/dev/null | \
awk '{print $3}' | awk -F. '{print $1"."$2"."$3"."$4}' | \
sort | uniq -c | sort -rn | head
# Check for connection refused (RST in response to SYN)
tcpdump -n -i eth0 '(tcp[tcpflags] & tcp-syn != 0) or (tcp[tcpflags] & tcp-rst != 0)' -c 100
# Kernel stats
nstat -az | grep -i "reset\|abort"
# TcpExtTCPAbortOnData 0
# TcpExtTCPAbortOnClose 0
# TcpExtTCPAbortOnMemory 0 ← RSTs due to memory pressure
# TcpExtTCPAbortOnTimeout 12 ← connections that timed out
# TcpExtTCPAbortOnLinger 0
Investigating Slow Connections¶
"The connection is slow" is the vaguest symptom. Here is how to narrow it down.
# 1. Measure RTT to the host
ping -c 10 10.0.2.100
# rtt min/avg/max/mdev = 0.5/0.8/1.2/0.3 ms
# 2. Check window size and scaling for the connection
ss -ti dst 10.0.2.100
# cubic wscale:7,7 rto:204 rtt:1.2/0.5 mss:1448
# rcv_space:29200 rcv_ssthresh:29200
# cwnd:10 ssthresh:7 ← small cwnd and ssthresh = congestion history
# send 96.5Mbps ← estimated send rate
# 3. Check for zero windows (receiver is full, sender is stalled)
# A zero window travels FROM the full receiver, so capture both directions
tcpdump -n -l -i eth0 'host 10.0.2.100' 2>/dev/null | grep "win 0"
# win 0 advertised by a host = its receive buffer is full; the peer must wait
# 4. Check socket buffer sizes
ss -tm dst 10.0.2.100
# skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
# rb = receive buffer, tb = send buffer
# 5. Bandwidth-delay product check
# RTT = 50ms, desired throughput = 1 Gbps
# BDP = 125,000,000 bytes/sec * 0.050 sec = 6,250,000 bytes
# Socket buffer must be at least 6.25 MB
sysctl net.core.rmem_max
# 212992 ← 208 KB — nowhere near 6.25 MB!
# Fix:
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
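The BDP check from step 5 is worth scripting, because it also tells you the throughput ceiling a too-small buffer imposes (throughput ≤ window / RTT). A small Python sketch using the numbers from this section:

```python
# Bandwidth-delay product: the buffer needed to keep a link full.
# BDP (bytes) = bandwidth (bytes/sec) * RTT (sec)

def bdp_bytes(bandwidth_bits_per_sec: float, rtt_ms: float) -> int:
    return int(bandwidth_bits_per_sec / 8 * rtt_ms / 1000)

# The example above: 1 Gbps at 50 ms RTT
print(bdp_bytes(1e9, 50))        # 6250000 → ~6.25 MB of buffer needed

# And the ceiling the default 212992-byte buffer imposes at 50 ms:
# max throughput = buffer / RTT
print(212992 / 0.050 * 8 / 1e6)  # ~34 Mbps — the "slow" transfer explained
```

So a transfer capped at roughly 34 Mbps on a gigabit cross-region link is not mysterious; it is the window math doing exactly what it must.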
TCP Keepalive for Detecting Dead Connections¶
Your application has persistent connections to a database. The database server crashes (power failure, kernel panic). Your application does not know the connection is dead — it sits waiting for a response that will never come.
# Check current keepalive defaults
sysctl net.ipv4.tcp_keepalive_time # 7200 seconds (2 hours!)
sysctl net.ipv4.tcp_keepalive_intvl # 75 seconds
sysctl net.ipv4.tcp_keepalive_probes # 9 probes
# Default: 2h + (75s * 9) = 2h 11m to detect a dead connection
# That is way too long
# Reasonable production values
sysctl -w net.ipv4.tcp_keepalive_time=300 # First probe after 5min
sysctl -w net.ipv4.tcp_keepalive_intvl=30 # Probe every 30s
sysctl -w net.ipv4.tcp_keepalive_probes=5 # Give up after 5 probes
# Detection time: 5min + (30s * 5) = 7.5 minutes
# Verify keepalive is active on a connection
ss -to dst 10.0.2.100
# timer:(keepalive,4min23sec,0) ← keepalive timer running
# Check if a socket has keepalive enabled
ss -to | grep keepalive
Note: System-wide sysctl only sets defaults. Applications override per-socket. Check your database client library's keepalive settings too (PostgreSQL: keepalives_idle, keepalives_interval, keepalives_count).
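If the library does not expose keepalive knobs, you can set them per socket yourself. A minimal Python sketch (Linux-specific socket options; the timing values mirror the sysctls above):

```python
import socket

# Per-socket keepalive, overriding the system-wide sysctl defaults.
# TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux-specific.

def enable_keepalive(sock: socket.socket,
                     idle: int = 300, interval: int = 30, count: int = 5) -> None:
    """First probe after `idle` s, then every `interval` s, give up after `count`."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
# Worst-case dead-peer detection time: idle + interval * count
print(300 + 30 * 5, "seconds")   # 450 seconds = 7.5 minutes
s.close()
```

Set these before connect() and the socket detects a dead peer on its own schedule, regardless of what the host-wide defaults say.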
MTU/MSS Issues and Path MTU Discovery¶
Debug clue: If SSH (small packets) works but SCP of large files hangs, suspect a PMTU blackhole: a router on the path is dropping oversized packets and blocking the ICMP "fragmentation needed" response. This is extremely common with VPNs, GRE tunnels, and cloud provider VPCs that filter ICMP.
Symptoms: large packets fail, small packets work. Downloads stall after the first few kilobytes. SSH works but SCP hangs. VPN connections have throughput issues.
# 1. Check your interface MTU
ip link show eth0
# mtu 1500
# 2. Test path MTU to a destination
tracepath 10.0.2.100
# 1: 10.0.0.1 0.3ms pmtu 1500
# 2: 10.0.1.1 0.5ms pmtu 1400 ← MTU drops (tunnel, VPN)
# 3: 10.0.2.100 0.8ms reached
# 3. Detect PMTU blackhole (ICMP blocked on the path)
# Send a max-size packet with DF set
ping -s 1472 -M do 10.0.2.100
# 1472 + 8 (ICMP header) + 20 (IP header) = 1500 bytes total
# If this hangs (no response), but smaller packets work:
ping -s 1400 -M do 10.0.2.100
# → PMTU blackhole between 1428 and 1500
# 4. See if the kernel knows about reduced PMTU
ip route get 10.0.2.100
# 10.0.2.100 via 10.0.0.1 dev eth0 src 10.0.1.50 ... mtu 1400
# 5. Watch for ICMP "fragmentation needed" messages
tcpdump -n 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'
# 6. Workaround: clamp MSS on the firewall/router
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --clamp-mss-to-pmtu
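The header bookkeeping behind the clamp and the ping sizes above, as a quick sketch:

```python
# MSS = MTU - IP header - TCP header (40 bytes total for IPv4,
# assuming no TCP options on data segments).

def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    return mtu - ip_header - tcp_header

print(mss_for_mtu(1500))   # 1460 — standard Ethernet
print(mss_for_mtu(1400))   # 1360 — a tunnel hop like the tracepath example

# ping payload that produces a given total: MTU - 20 (IP) - 8 (ICMP)
print(1500 - 28)           # 1472 — the -s value used above
```

IPv6 changes the numbers (40-byte IP header), and TCP options like timestamps shave another 12 bytes off each data segment; the clamp rule handles all of that for you by reading the PMTU directly.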
BBR vs CUBIC Selection¶
# Check current algorithm
sysctl net.ipv4.tcp_congestion_control
# Switch to BBR (kernel 4.9+)
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
# When to use BBR:
# - High-latency links (cross-region, transcontinental, satellite)
# - Lossy links (Wi-Fi, cellular, public internet with random loss)
# - Streaming workloads (CDN, video delivery)
# When to stay with CUBIC:
# - Datacenter east-west traffic (low latency, no loss)
# - Multiple flows competing fairly (BBR can be aggressive)
# - Compliance/predictability requirements (CUBIC is well-understood)
# Persistent:
echo "tcp_bbr" >> /etc/modules-load.d/bbr.conf
echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/99-bbr.conf
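You do not have to pick one algorithm host-wide: Linux lets an application choose per socket via the TCP_CONGESTION option, so WAN-facing sockets can run BBR while everything else stays on CUBIC. A minimal sketch (Linux-only; setting "bbr" additionally requires the tcp_bbr module and, for non-root, an entry in tcp_allowed_congestion_control):

```python
import socket

# Per-socket congestion control selection (Linux, TCP_CONGESTION).
# "cubic" is always available; swap in b"bbr" where the module is loaded.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
algo = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print(algo.rstrip(b"\x00").decode())  # cubic
s.close()
```

Set it before connect(); the choice sticks for the life of the connection.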
SYN Flood Detection¶
# Symptoms: new connections fail, established ones work
# Server CPU is low, memory is available
# 1. Check for SYN queue overflow
nstat -az | grep -i syn
# TcpExtListenOverflows 45821 ← accept queue full
# TcpExtTCPReqQFullDrop 45821 ← SYN (request) queue drops
# 2. Check SYN_RECV state count
ss -tan state syn-recv | wc -l
# 512 (if this equals tcp_max_syn_backlog, you are full)
# 3. Enable SYN cookies (allows connections even when SYN queue is full)
sysctl -w net.ipv4.tcp_syncookies=1
# 4. Increase backlogs
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.core.somaxconn=65535
# 5. Identify the source (legitimate load or attack?)
ss -tan state syn-recv | awk '{print $4}' | \
awk -F: '{print $1}' | sort | uniq -c | sort -rn | head
# 4500 203.0.113.50 ← single source = likely attack
# 12 10.0.1.20 ← internal = legitimate
# 6. If attack: use firewall rate limiting
iptables -A INPUT -p tcp --syn -m limit --limit 100/s --limit-burst 200 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP
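One trap with step 4: listen(backlog) in the application is silently capped by net.core.somaxconn, so raising the sysctl without redeploying the app (or vice versa) changes nothing. A Linux-only sketch of the interaction:

```python
import socket

# The kernel caps the accept queue at min(listen_backlog, somaxconn).
somaxconn = int(open("/proc/sys/net/core/somaxconn").read())
print("somaxconn:", somaxconn)

s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen(65535)   # the app asks for 65535...
s.close()
# ...but the kernel grants only:
print("effective backlog:", min(65535, somaxconn))
```

The same applies to frameworks: check what backlog your server library passes to listen(), because many default to a small constant.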
Half-Open Connection Cleanup¶
CLOSE_WAIT connections mean the remote side closed but your application has not. This is almost always an application bug — a connection that is never properly closed.
# Find CLOSE_WAIT connections
ss -tan state close-wait
# Recv-Q Send-Q Local Address:Port  Peer Address:Port
# 0      0      10.0.1.50:45678     10.0.2.100:5432
# Count per process (ss drops the State column when you filter by state,
# so match the process info directly instead of counting columns)
ss -tanp state close-wait | grep -oE '"[^"]+",pid=[0-9]+' | sort | uniq -c | sort -rn
#    450 "java",pid=12345
#     12 "python",pid=23456
# CLOSE_WAIT does not time out on its own (it waits for the application)
# FIN_WAIT_2 does time out (controlled by tcp_fin_timeout)
sysctl net.ipv4.tcp_fin_timeout
# 60
# Reduce if you have many orphaned FIN_WAIT_2 connections
sysctl -w net.ipv4.tcp_fin_timeout=15
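The application bug is easy to reproduce. In this Python sketch (loopback sockets, hypothetical setup), the peer closes, recv() returns b"" to signal EOF, and the local socket sits in CLOSE_WAIT until this side calls close() — the call buggy applications forget:

```python
import socket

# Minimal reproduction of how CLOSE_WAIT arises.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
peer, _ = server.accept()
peer.close()                 # remote side closes → FIN arrives

data = client.recv(1024)     # the read that buggy apps ignore
print(data)                  # b'' — EOF, the only signal the peer is gone
# `client` now sits in CLOSE_WAIT until we do this:
client.close()
server.close()
```

Any code path that reads b"" (or its language's EOF equivalent) and does not close the socket will leak a CLOSE_WAIT, which is why the per-process count above points straight at the guilty service.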
Out-of-Order Packet Analysis¶
Out-of-order packets cause duplicate ACKs and potential retransmissions. They indicate network path issues (ECMP with asymmetric paths, link aggregation hash problems).
# Check OOO stats
nstat -az | grep -i ofo
# TcpExtTCPOFOQueue 12345
# TcpExtTCPOFOMerge 6789
# Per-connection reordering info
ss -ti dst 10.0.2.100
# ... reordering:3 ... ← kernel detected reordering depth of 3
# Capture out-of-order packets with tcpdump
# Look for duplicate ACKs (same ack number repeated)
tcpdump -n -i eth0 'dst host 10.0.2.100 and tcp' -c 1000 -w /tmp/capture.pcap
# Analyze with tshark:
tshark -r /tmp/capture.pcap -Y "tcp.analysis.duplicate_ack" | wc -l
tshark -r /tmp/capture.pcap -Y "tcp.analysis.out_of_order" | wc -l
TCP Dump Analysis for Latency¶
When you need to understand exactly where time is spent in a connection:
# Capture a full TCP session
tcpdump -n -i eth0 'host 10.0.2.100 and port 8080' -w /tmp/http.pcap -s 0
# Analyze with tshark
# See handshake timing
tshark -r /tmp/http.pcap -Y "tcp.flags.syn==1" -T fields \
-e frame.time_relative -e ip.src -e ip.dst -e tcp.flags
# See request-response timing
tshark -r /tmp/http.pcap -Y "http" -T fields \
-e frame.time_relative -e http.request.method -e http.response.code
# RTT measurement (time between data and ACK)
tshark -r /tmp/http.pcap -z "io,stat,0.1,tcp.analysis.ack_rtt"
# One-liner: measure time between SYN and SYN-ACK (connection setup time)
tcpdump -n -i eth0 'tcp[tcpflags] & (tcp-syn) != 0 and host 10.0.2.100' \
-ttt -c 4 2>/dev/null
# The -ttt flag shows time delta between packets
# SYN → SYN-ACK delta = server-side TCP processing + network RTT
# Filter for slow responses (useful in production)
tshark -r /tmp/http.pcap -Y "tcp.analysis.ack_rtt > 0.5" -T fields \
-e frame.number -e tcp.analysis.ack_rtt -e ip.dst
Pattern: Connection Debugging Flowchart¶
Connection fails?
├── Cannot reach host at all
│ ├── ping works? → TCP issue (firewall, port closed)
│ └── ping fails? → L3 issue (routing, firewall, host down)
│ └── Check: ip route get, traceroute, arp table
│
├── Connection times out (SYN no response)
│ ├── tcpdump shows SYN leaving? → remote firewall or host down
│ └── tcpdump shows no SYN? → local firewall or routing
│
├── Connection refused (RST immediately)
│ └── Service not listening on that port
│ └── Check: ss -ltn, systemctl status
│
├── Connection established but slow
│ ├── High RTT? → network latency
│ ├── Small cwnd? → congestion or loss
│ ├── Small window? → buffer too small for BDP
│ └── Retransmissions? → packet loss somewhere on path
│
└── Connection drops after working
├── RST from remote? → app crash or firewall timeout
├── FIN from remote? → graceful close (check app logs)
└── No FIN/RST? → network partition, host crash
└── Keepalive will eventually detect
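The top two failure branches look like this from application code: a closed port answers the SYN with a RST (ConnectionRefusedError, instant), while a filtered or blackholed port answers nothing (timeout). A small Python sketch; the hypothetical classify() helper is just the flowchart in code:

```python
import socket

def classify(host: str, port: int, timeout: float = 2.0) -> str:
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return "open"
    except ConnectionRefusedError:
        return "closed (RST) - service not listening"
    except socket.timeout:
        return "filtered (no answer) - firewall or host down"

# Grab a port that is almost certainly closed: bind, note it, release it
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()
print(classify("127.0.0.1", port))  # closed (RST) - service not listening
```

Instant refusal versus silent timeout is the single most useful bit of triage data: the first means the host is up and the service is not, the second means your packets are dying somewhere on the path.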