Portal | Level: L2: Operations | Topics: TCP/IP Deep Dive, TCP/IP | Domain: Networking

TCP/IP Deep Dive - Primer

Why This Matters

TCP/IP is the protocol stack that everything runs on. Nearly every HTTP request, database query, API call, and file transfer rides on top of TCP. When you understand TCP deeply — not just "it is reliable" but how it achieves reliability, what the costs are, and where it breaks down — you can diagnose problems that look like application bugs but are actually transport-layer issues.

I have spent hours debugging "slow API responses" that turned out to be TCP window scaling misconfiguration. I have watched Nagle's algorithm add 200ms to every message in a chat application. I have seen production load balancers drop connections because the conntrack table was full and nobody knew to check. These are not obscure edge cases. They are the normal failure modes of systems at scale.

Understanding the IP layer matters too. When path MTU discovery breaks because someone blocked ICMP, you get mysterious failures for large packets. When ARP goes wrong, machines that are physically connected cannot talk to each other. These are layer 3 and layer 2 problems that masquerade as application-level failures.

TCP Deep Dive

1. The Three-Way Handshake

Every TCP connection begins with a three-way handshake. This is how two hosts agree on initial sequence numbers and establish state.

Client                          Server
  |                                |
  |  SYN (seq=x)                   |
  |  ----------------------------→ |  Server allocates state (SYN_RECEIVED)
  |                                |
  |  SYN-ACK (seq=y, ack=x+1)     |
  |  ←---------------------------- |
  |                                |
  |  ACK (ack=y+1)                 |
  |  ----------------------------→ |  Connection ESTABLISHED
  |                                |

Under the hood: During the handshake, each side picks a random Initial Sequence Number (ISN). Modern kernels use a combination of a timer, source/destination IP+port, and a secret key to generate ISNs. This randomness prevents TCP sequence prediction attacks, a weakness first described by Robert Morris in 1985 and famously exploited in the 1994 attack on Tsutomu Shimomura's machines.

The handshake takes one round-trip time (RTT). On a cross-region connection with 80ms RTT, the handshake alone adds 80ms before any data flows. This is why connection reuse (HTTP keep-alive, connection pooling) matters so much for performance.
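The handshake cost is easy to observe directly. A minimal Python sketch: time connect() against a local listener — connect() returns once the three-way handshake completes, so the elapsed time is roughly one RTT (microseconds on loopback, a full network round trip across regions). The address and port here are local and illustrative.

```python
import socket
import time

# Local listener; the kernel completes handshakes on its behalf even
# before accept() is called.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))               # kernel picks a free port
srv.listen(8)
port = srv.getsockname()[1]

# connect() blocks until SYN / SYN-ACK / ACK completes: one RTT.
t0 = time.perf_counter()
cli = socket.create_connection(("127.0.0.1", port))
elapsed = time.perf_counter() - t0
print(f"handshake completed in {elapsed * 1e6:.0f} us")

cli.close()
srv.close()
```

On loopback this prints a few tens of microseconds; the same call against a cross-region host pays the full RTT, which is exactly what connection pooling amortizes away.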

2. TCP Connection States

Every TCP connection passes through a state machine. Understanding these states is essential for diagnosing production issues.

                              CLOSED
                                |
                    (passive open: listen)
                                |
                              LISTEN
                                |
                    (receive SYN, send SYN+ACK)
                                |
                           SYN_RECEIVED
                                |
                    (receive ACK)
                                |
                           ESTABLISHED
                                |
              (active close: send FIN)
                                |
                           FIN_WAIT_1
                                |
                    (receive ACK)        (receive FIN, send ACK)
                       |                        |
                  FIN_WAIT_2               CLOSING
                       |                        |
                  (receive FIN)           (receive ACK)
                       |                        |
                   TIME_WAIT <------------------+
                       |
                  (2*MSL timeout: 60s default)
                       |
                     CLOSED

The passive close (server side when client initiates close):

ESTABLISHED → (receive FIN) → CLOSE_WAIT → (send FIN) → LAST_ACK → (receive ACK) → CLOSED

Key states to know:

| State | Meaning | Operational significance |
|-------|---------|--------------------------|
| LISTEN | Waiting for connections | Normal for servers; many LISTEN sockets is expected |
| ESTABLISHED | Active connection | Normal; high count is fine if expected |
| SYN_RECEIVED | Handshake in progress | High count may indicate SYN flood |
| CLOSE_WAIT | Remote closed, local has not | Application bug (not closing sockets); this state accumulates |
| TIME_WAIT | Connection closed, waiting 2*MSL | Normal; high count is expected on busy servers |
| FIN_WAIT_2 | Sent FIN, waiting for remote FIN | Remote side not closing; may indicate hung remote process |

Debug clue: A growing count of CLOSE_WAIT sockets is almost always an application bug — the remote side closed the connection but your application never called close() on its end. Check ss -tnp state close-wait to see which process is leaking sockets. Common culprits: HTTP clients not reading response bodies, database connections not being returned to the pool.

# View connection states
ss -tan state established | wc -l
ss -tan state time-wait | wc -l
ss -tan state close-wait | wc -l

# Summary of all states
ss -s
# TCP:   45892 (estab 32100, closed 5200, orphaned 120, timewait 8400)

# Per-state breakdown
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

3. Window Size and Flow Control

TCP uses a sliding window for flow control. The receiver advertises how much data it is willing to accept (receive window). The sender must not send more than the receiver's window allows.

Sender                              Receiver
  |                                    |
  |  Data (seq 1-1460) →              |  Window = 65535
  |  Data (seq 1461-2920) →           |  Window = 64075 (advertised in ACK)
  |  Data (seq 2921-4380) →           |
  |                                    |
  |  ← ACK (ack=4381, win=65535)      |  Application read the data, window reopened
  |                                    |

Window scaling (RFC 1323): The window size field in the TCP header is 16 bits, limiting it to 65535 bytes. Window scaling uses a shift count negotiated during the handshake to multiply the window by up to 2^14, allowing windows up to ~1GB.

# Check if window scaling is enabled (it should be)
sysctl net.ipv4.tcp_window_scaling
# 1

# See actual window sizes for a connection
ss -ti dst 10.0.1.50
# cubic wscale:7,7 rto:204 rtt:1.2/0.5 ... rcv_space:29200
# wscale:7,7 means both sides use scale factor 7 (window * 128)

Without window scaling, the maximum in-flight data is 65535 bytes. On a 100ms RTT link, this limits throughput to 65535 / 0.1 = ~655 KB/s regardless of bandwidth.
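The throughput ceiling is simple arithmetic worth internalizing: a quick sketch of the bound window / RTT, using the numbers from this section.

```python
# Max TCP throughput is capped at window / RTT, regardless of link speed.
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput in bytes per second."""
    return window_bytes / rtt_s

# 64 KB window, 100 ms RTT: ~655 KB/s no matter how fast the link is
print(round(max_throughput_bps(65535, 0.100)))          # 655350
# Same path with wscale:7 (window * 128): ~84 MB/s
print(round(max_throughput_bps(65535 * 128, 0.100)))    # 83884800
```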

4. Congestion Control

Flow control prevents the sender from overwhelming the receiver. Congestion control prevents the sender from overwhelming the network. They are independent mechanisms.

Slow Start

A new connection starts with a small congestion window (cwnd, typically 10 segments = ~14KB). For each ACK received, cwnd increases by one segment. This is exponential growth — cwnd doubles every RTT until either a loss occurs or the slow start threshold (ssthresh) is reached.

RTT 1: cwnd = 10 segments → send 10 segments
RTT 2: cwnd = 20 segments → send 20 segments
RTT 3: cwnd = 40 segments → send 40 segments
RTT 4: cwnd = 80 segments → send 80 segments

On a 1 Gbps link with 50ms RTT, it takes roughly 9 RTTs (~450ms) of doubling from the initial 10-segment window before the congestion window covers the bandwidth-delay product. This is why short-lived HTTP connections are inefficient — they spend most of their life in slow start.
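The ramp-up arithmetic can be sketched directly; the exact RTT count depends on the assumed segment size (1448 bytes here) and initial window (10 segments, the Linux default).

```python
# How many RTTs of slow-start doubling until cwnd covers the
# bandwidth-delay product (BDP) of the path?
def rtts_to_fill(link_bps: float, rtt_s: float,
                 mss: int = 1448, init_cwnd: int = 10) -> int:
    bdp_segments = (link_bps / 8) * rtt_s / mss   # BDP in MSS-sized segments
    cwnd, rtts = init_cwnd, 0
    while cwnd < bdp_segments:
        cwnd *= 2                                 # cwnd doubles every RTT
        rtts += 1
    return rtts

print(rtts_to_fill(1e9, 0.050))  # 1 Gbps, 50 ms RTT -> 9
```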

Congestion Avoidance

Once cwnd reaches ssthresh, growth switches to linear: cwnd increases by ~1 segment per RTT. This is additive increase.

Fast Retransmit and Fast Recovery

When the receiver detects a gap (an out-of-order segment), it sends a duplicate ACK. After 3 duplicate ACKs, the sender retransmits the missing segment immediately (fast retransmit) without waiting for the retransmission timeout. It then halves cwnd and enters fast recovery, continuing to send data rather than dropping back to slow start.

Modern Algorithms: CUBIC and BBR

CUBIC (default on Linux since 2.6.19) uses a cubic function for window growth. After a loss, it rapidly recovers to the pre-loss window size, then slows growth near that point. CUBIC is loss-based — it treats packet loss as a signal of congestion.

Remember: CUBIC reacts to loss (packet drop = slow down). BBR measures bandwidth and RTT (it probes the path). On clean links, both perform similarly. On lossy links (Wi-Fi, satellite), BBR wins because random packet loss is not congestion.

BBR (Bottleneck Bandwidth and RTT, Google, 2016) takes a fundamentally different approach. Instead of reacting to loss, it actively probes for bandwidth and RTT to build a model of the path. BBR can significantly improve throughput on lossy links (Wi-Fi, transcontinental, satellite) where CUBIC interprets random loss as congestion.

# Check current algorithm
sysctl net.ipv4.tcp_congestion_control
# cubic

# Available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# reno cubic

# Load and use BBR (requires kernel 4.9+)
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Verify
sysctl net.ipv4.tcp_congestion_control
# bbr

5. Nagle's Algorithm and TCP_NODELAY

Nagle's algorithm batches small writes into larger segments. If there is unacknowledged data in flight, the sender buffers new small writes until either an ACK arrives or the buffer fills an MSS-sized segment.

This is efficient for bulk transfers but devastating for interactive protocols. A typical pathology: an application sends a 50-byte message, Nagle holds it because there is data in flight, the receiver has delayed ACK enabled and waits 40-200ms before acknowledging. The 50-byte message sits in the sender's buffer for up to 200ms.

# Disable Nagle (send immediately regardless of outstanding ACKs)
# In application code:
# setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

# TCP_NODELAY is a per-socket option and does not appear in ss output;
# verify at the application level, e.g. by tracing the setsockopt call:
# strace -e trace=setsockopt -p <pid>

Rule of thumb: Set TCP_NODELAY for any request-response protocol (HTTP, RPC, database queries, game packets). Leave Nagle enabled for bulk transfers where latency does not matter.
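From Python the same option is one setsockopt call — a minimal sketch on an unconnected socket, just to show the API shape:

```python
import socket

# Disable Nagle: every write goes out immediately instead of being
# coalesced while ACKs are outstanding.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)  # non-zero = Nagle disabled
s.close()
```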

6. TCP Keepalive

TCP keepalive probes detect dead connections where the remote host has crashed, been disconnected, or is behind a stateful firewall that silently drops idle connections.

# Default keepalive settings
sysctl net.ipv4.tcp_keepalive_time     # 7200 (2 hours before first probe)
sysctl net.ipv4.tcp_keepalive_intvl    # 75 (seconds between probes)
sysctl net.ipv4.tcp_keepalive_probes   # 9 (probes before declaring dead)

# With defaults: 2h + (75s * 9) = ~2h 11min to detect a dead connection
# Most applications cannot tolerate this

# Reasonable values for production
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
# Now: 5min + (30s * 5) = 7.5min to detect dead connection

Applications can override these on a per-socket basis with setsockopt. Many applications (PostgreSQL, Redis, gRPC) have their own keepalive configuration that overrides the system defaults.
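A sketch of such a per-socket override in Python, using the Linux-specific option names; the values mirror the "reasonable production" sysctls above but apply only to this one socket.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)       # enable keepalive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)    # idle seconds before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)    # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)       # failed probes before declaring dead
keepidle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
print(keepidle)  # 300
s.close()
```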

7. MSS vs MTU

MTU (Maximum Transmission Unit) is the largest frame size the link layer will carry. Ethernet default is 1500 bytes. MSS (Maximum Segment Size) is the largest TCP payload, which is MTU minus IP and TCP headers: 1500 - 20 (IP) - 20 (TCP) = 1460 bytes. With TCP options (timestamps, SACK), effective MSS is typically 1448 bytes.
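The arithmetic above, as a small helper:

```python
# MSS = MTU minus IP and TCP headers, minus any TCP options in use
# (timestamps cost 12 bytes per segment).
def mss(mtu: int, ip_hdr: int = 20, tcp_hdr: int = 20, tcp_opts: int = 0) -> int:
    return mtu - ip_hdr - tcp_hdr - tcp_opts

print(mss(1500))               # 1460
print(mss(1500, tcp_opts=12))  # 1448 (with TCP timestamps)
print(mss(9000))               # 8960 (jumbo frames)
```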

# Check interface MTU
ip link show eth0 | grep mtu
# mtu 1500

# Check TCP MSS for a connection
ss -ti dst 10.0.1.50 | grep mss
# mss:1448

# Jumbo frames (9000 byte MTU) for datacenter traffic
ip link set eth0 mtu 9000
# MSS becomes: 9000 - 40 = 8960 bytes

8. TCP Timestamps and SACK

TCP timestamps (RFC 1323) serve two purposes: precise RTT measurement and protection against wrapped sequence numbers (PAWS) on high-bandwidth links.

# Check if timestamps are enabled
sysctl net.ipv4.tcp_timestamps
# 1 (enabled by default — leave it on)

SACK (Selective Acknowledgment, RFC 2018) allows the receiver to tell the sender exactly which segments arrived, rather than just acknowledging up to the last contiguous segment. Without SACK, a single lost segment causes retransmission of everything after it.

sysctl net.ipv4.tcp_sack
# 1 (enabled by default — leave it on)

9. TCP Fast Open

TCP Fast Open (TFO) allows data to be sent in the SYN packet, eliminating one RTT for repeat connections. The server gives the client a cookie on the first connection. On subsequent connections, the client includes the cookie and data in the SYN.

# Enable TFO (bitmask: 0x1 = client support, 0x2 = server support, 3 = both)
sysctl -w net.ipv4.tcp_fastopen=3

# Application must also enable TFO:
# Server: setsockopt(sock, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen))
# Client: sendto() with MSG_FASTOPEN flag

IP Deep Dive

10. IPv4 Header

The critical fields for operations:

| Field | Size | Purpose |
|-------|------|---------|
| TTL | 8 bits | Decremented by each router; prevents loops. Default 64 (Linux), 128 (Windows) |
| Protocol | 8 bits | Upper-layer protocol: 6=TCP, 17=UDP, 1=ICMP |
| Flags + Fragment Offset | 16 bits | DF (Don't Fragment) flag critical for PMTU discovery |
| Source/Dest IP | 32 bits each | The addresses |

# See IP TTL in action
traceroute -n 10.0.1.50
# Each hop decrements TTL by 1; when TTL=0, router sends ICMP Time Exceeded

# Check the Don't Fragment flag
tcpdump -v -c 5 'host 10.0.1.50'
# Flags [DF] means the packet has Don't Fragment set

11. IPv6 Header

IPv6 simplifies the header: no fragmentation by routers (only endpoints fragment), no header checksum (upper layers handle it), fixed 40-byte header with extension headers for optional features.

# Check IPv6 addresses
ip -6 addr show

# IPv6 neighbor discovery (replaces ARP)
ip -6 neigh show

# Test IPv6 connectivity
ping6 -c 3 ::1

12. ICMP

ICMP is not just for ping. It carries critical control messages:

| Type | Code | Message | Why it matters |
|------|------|---------|----------------|
| 0 | 0 | Echo Reply | ping response |
| 3 | 0 | Destination Network Unreachable | Routing failure |
| 3 | 1 | Destination Host Unreachable | ARP failure / host down |
| 3 | 3 | Destination Port Unreachable | No service listening (UDP) |
| 3 | 4 | Fragmentation Needed + DF Set | Path MTU discovery |
| 8 | 0 | Echo Request | ping |
| 11 | 0 | Time Exceeded | TTL expired (traceroute) |

Gotcha: Blocking ICMP "for security" is one of the most destructive network misconfigurations. ICMP Type 3 Code 4 (Fragmentation Needed) is essential for Path MTU Discovery. Without it, large packets are silently dropped and TCP connections hang after the handshake succeeds (because SYN/ACK packets are small enough, but data packets are not). This is called a PMTU blackhole.

Path MTU Discovery (PMTUD): When a router receives a packet larger than the next-hop MTU and the DF flag is set, it sends ICMP Type 3, Code 4 back to the sender with the next-hop MTU. The sender then reduces its segment size. If ICMP is blocked (by a misconfigured firewall), PMTUD breaks — this is a PMTU blackhole.

# Check path MTU to a host
tracepath 10.0.1.50
#  1:  10.0.0.1         0.312ms pmtu 1500
#  2:  10.0.1.50        0.645ms reached
#      Resume: pmtu 1500

# Check for PMTU blackholes
tcpdump -n 'icmp and icmp[icmptype] == 3 and icmp[icmpcode] == 4'

13. ARP and NDP

ARP (Address Resolution Protocol) maps IPv4 addresses to MAC addresses on the local network. NDP (Neighbor Discovery Protocol) does the same for IPv6.

# View ARP cache
ip neigh show
# 10.0.1.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
# 10.0.1.50 dev eth0 lladdr 00:11:22:33:44:66 STALE

# ARP states: REACHABLE, STALE, DELAY, PROBE, FAILED, INCOMPLETE

# Clear a specific ARP entry
ip neigh del 10.0.1.50 dev eth0

# Flush all ARP entries
ip neigh flush all

# Watch ARP traffic
tcpdump -n -i eth0 arp
# ARP, Request who-has 10.0.1.50 tell 10.0.1.1
# ARP, Reply 10.0.1.50 is-at 00:11:22:33:44:66

14. IP Fragmentation and Reassembly

When a packet exceeds the link MTU and the DF flag is not set, the router fragments it. Fragmentation is expensive: each fragment needs its own IP header, the receiver must reassemble, and if any fragment is lost, all fragments must be retransmitted.
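A sketch of how the splitting works: each fragment repeats the 20-byte IP header, and fragment offsets are expressed in 8-byte units, so the payload per fragment is aligned down to a multiple of 8.

```python
# Split an IP payload into (offset, size) fragments for a given MTU.
def fragments(payload_bytes: int, mtu: int, ip_hdr: int = 20):
    per_frag = (mtu - ip_hdr) // 8 * 8      # max data per fragment, 8-byte aligned
    out, offset = [], 0
    while offset < payload_bytes:
        size = min(per_frag, payload_bytes - offset)
        out.append((offset, size))
        offset += size
    return out

# 4000 ICMP data bytes + 8-byte ICMP header = 4008 bytes of IP payload
print(fragments(4008, 1500))   # [(0, 1480), (1480, 1480), (2960, 1048)]
```

This matches the ping example below: a 4000-byte ping at MTU 1500 becomes three fragments, and losing any one of them forces the sender to retransmit the whole datagram.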

# Check fragmentation stats
nstat -az | grep -i frag
# IpFragOKs      0
# IpFragFails    0
# IpReasmReqds   0

# Force a large packet to see fragmentation
ping -s 4000 -M dont 10.0.1.50
# PING 10.0.1.50: 4000 data bytes (will be fragmented if MTU < 4028)

# With DF set (default), large packets that exceed MTU are dropped
ping -s 4000 -M do 10.0.1.50
# ping: local error: message too long, mtu=1500

15. Multicast, DSCP, and ECN

Multicast: One-to-many delivery using addresses in 224.0.0.0/4. Used by VRRP (224.0.0.18), OSPF (224.0.0.5/6), and application-level multicast (video streaming, market data).

# Check multicast group membership
ip maddr show
netstat -g

DSCP (Differentiated Services Code Point): 6-bit field in the IP header for QoS marking. Routers use DSCP to prioritize traffic (voice over bulk data, for example).

ECN (Explicit Congestion Notification): IP + TCP feature that lets routers signal congestion without dropping packets. The router marks the packet; the receiver echoes the mark to the sender; the sender reduces its rate. This avoids the latency penalty of loss-based congestion detection.

sysctl net.ipv4.tcp_ecn
# 2 (default: accept ECN when requested by incoming connections,
#    but do not request it on outgoing connections)

Socket Programming Concepts

16. The Socket API

Understanding the socket system calls helps you reason about what the kernel is doing when you see connection states.

Server                              Client
  |                                    |
  socket()  → create endpoint          |
  bind()    → assign address:port      |
  listen()  → mark as passive, create  |
             accept queue (backlog)    socket() → create endpoint
  |                                    |
  |                                    connect() → send SYN
  |  SYN_RECEIVED                      |
  accept()  → dequeue connection       |  ESTABLISHED
  |  ESTABLISHED                       |
  |                                    |
  read()/write()  ←→  read()/write()   |
  |                                    |
  close()  → send FIN                  close() → send FIN

The listen() backlog parameter sets the maximum length of the accept queue, with net.core.somaxconn acting as a ceiling. If the accept queue is full, the kernel ignores the handshake-completing ACK (forcing the client to retransmit), or sends a RST if tcp_abort_on_overflow=1.

# See the accept queue for listening sockets
ss -ltn
# State    Recv-Q   Send-Q   Local Address:Port
# LISTEN   0        4096     0.0.0.0:80
#          ^        ^
#          current  max (min of backlog and somaxconn)

When Recv-Q approaches Send-Q on a LISTEN socket, the accept queue is almost full and connections will start being dropped.
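The call sequence in the diagram above can be sketched as a minimal Python echo exchange; the address is loopback and the backlog value is illustrative.

```python
import socket
import threading

# Server side: socket() / bind() / listen() / accept().
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))    # assign address:port (kernel picks a free port)
srv.listen(16)                # passive open; accept queue of up to 16
port = srv.getsockname()[1]
received = []

def client() -> None:
    c = socket.create_connection(("127.0.0.1", port))  # SYN / SYN-ACK / ACK
    c.sendall(b"ping")
    received.append(c.recv(4))                         # echoed payload
    c.close()                                          # active close: send FIN

t = threading.Thread(target=client)
t.start()
conn, _addr = srv.accept()    # dequeue an ESTABLISHED connection
conn.sendall(conn.recv(4))    # echo the 4 bytes back
conn.close()
t.join()
srv.close()
print(received[0].decode())   # ping
```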

