Portal | Level: L2: Operations | Topics: TCP/IP Deep Dive, TCP/IP | Domain: Networking

TCP/IP Deep Dive - Primer

Why This Matters

TCP/IP is the protocol stack that everything runs on. Nearly every HTTP request, database query, API call, and file transfer rides on top of TCP. When you understand TCP deeply — not just "it is reliable" but how it achieves reliability, what the costs are, and where it breaks down — you can diagnose problems that look like application bugs but are actually transport-layer issues.

I have spent hours debugging "slow API responses" that turned out to be TCP window scaling misconfiguration. I have watched Nagle's algorithm add 200ms to every message in a chat application. I have seen production load balancers drop connections because the conntrack table was full and nobody knew to check. These are not obscure edge cases. They are the normal failure modes of systems at scale.

Understanding the IP layer matters too. When path MTU discovery breaks because someone blocked ICMP, you get mysterious failures for large packets. When ARP goes wrong, machines that are physically connected cannot talk to each other. These are layer 3 and layer 2 problems that masquerade as application-level failures.

TCP Deep Dive

1. The Three-Way Handshake

Every TCP connection begins with a three-way handshake. This is how two hosts agree on initial sequence numbers and establish state.

Client                          Server
  |                                |
  |  SYN (seq=x)                   |
  |  ----------------------------→ |  Server allocates state (SYN_RECEIVED)
  |                                |
  |  SYN-ACK (seq=y, ack=x+1)     |
  |  ←---------------------------- |
  |                                |
  |  ACK (ack=y+1)                 |
  |  ----------------------------→ |  Connection ESTABLISHED
  |                                |

Under the hood: During the handshake, each side picks a random Initial Sequence Number (ISN). Modern kernels use a combination of a timer, source/destination IP+port, and a secret key to generate ISNs. This randomness prevents TCP sequence prediction attacks, a weakness first described by Robert Morris in 1985 and famously exploited in the 1994 attack on Tsutomu Shimomura's machines.

The handshake takes one round-trip time (RTT). On a cross-region connection with 80ms RTT, the handshake alone adds 80ms before any data flows. This is why connection reuse (HTTP keep-alive, connection pooling) matters so much for performance.
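The handshake cost is easy to observe directly. A minimal Python sketch: time connect() against a local listener — connect() returns once the three-way handshake completes, so the elapsed time is roughly one RTT (microseconds on loopback, a full network round trip across regions). The address and port here are local and illustrative.

```python
import socket
import time

# Local listener; the kernel completes handshakes on its behalf even
# before accept() is called.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))               # kernel picks a free port
srv.listen(8)
port = srv.getsockname()[1]

# connect() blocks until SYN / SYN-ACK / ACK completes: one RTT.
t0 = time.perf_counter()
cli = socket.create_connection(("127.0.0.1", port))
elapsed = time.perf_counter() - t0
print(f"handshake completed in {elapsed * 1e6:.0f} us")

cli.close()
srv.close()
```

On loopback this prints a few tens of microseconds; the same call against a cross-region host pays the full RTT, which is exactly what connection pooling amortizes away.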

2. TCP Connection States

Every TCP connection passes through a state machine. Understanding these states is essential for diagnosing production issues.

                              CLOSED
                                |
                    (passive open: listen)
                                |
                              LISTEN
                                |
                    (receive SYN, send SYN+ACK)
                                |
                           SYN_RECEIVED
                                |
                    (receive ACK)
                                |
                           ESTABLISHED
                                |
              (active close: send FIN)
                                |
                           FIN_WAIT_1
                                |
                    (receive ACK)        (receive FIN, send ACK)
                       |                        |
                  FIN_WAIT_2               CLOSING
                       |                        |
                  (receive FIN)           (receive ACK)
                       |                        |
                   TIME_WAIT <------------------+
                       |
                  (2*MSL timeout: 60s default)
                       |
                     CLOSED

The passive close (server side when client initiates close):

ESTABLISHED → (receive FIN) → CLOSE_WAIT → (send FIN) → LAST_ACK → (receive ACK) → CLOSED

Key states to know:

| State | Meaning | Operational significance |
|-------|---------|--------------------------|
| LISTEN | Waiting for connections | Normal for servers; many LISTEN sockets is expected |
| ESTABLISHED | Active connection | Normal; high count is fine if expected |
| SYN_RECEIVED | Handshake in progress | High count may indicate SYN flood |
| CLOSE_WAIT | Remote closed, local has not | Application bug (not closing sockets); this state accumulates |
| TIME_WAIT | Connection closed, waiting 2*MSL | Normal; high count is expected on busy servers |
| FIN_WAIT_2 | Sent FIN, waiting for remote FIN | Remote side not closing; may indicate hung remote process |

Debug clue: A growing count of CLOSE_WAIT sockets is almost always an application bug — the remote side closed the connection but your application never called close() on its end. Check ss -tnp state close-wait to see which process is leaking sockets. Common culprits: HTTP clients not reading response bodies, database connections not being returned to the pool.

# View connection states
ss -tan state established | wc -l
ss -tan state time-wait | wc -l
ss -tan state close-wait | wc -l

# Summary of all states
ss -s
# TCP:   45892 (estab 32100, closed 5200, orphaned 120, timewait 8400)

# Per-state breakdown
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

3. Window Size and Flow Control

TCP uses a sliding window for flow control. The receiver advertises how much data it is willing to accept (receive window). The sender must not send more than the receiver's window allows.

Sender                              Receiver
  |                                    |
  |  Data (seq 1-1460) →              |  Window = 65535
  |  Data (seq 1461-2920) →           |  Window = 64075 (advertised in ACK)
  |  Data (seq 2921-4380) →           |
  |                                    |
  |  ← ACK (ack=4381, win=65535)      |  Application read the data, window reopened
  |                                    |

Window scaling (RFC 1323): The window size field in the TCP header is 16 bits, limiting it to 65535 bytes. Window scaling uses a shift count negotiated during the handshake to multiply the window by up to 2^14, allowing windows up to ~1GB.

# Check if window scaling is enabled (it should be)
sysctl net.ipv4.tcp_window_scaling
# 1

# See actual window sizes for a connection
ss -ti dst 10.0.1.50
# cubic wscale:7,7 rto:204 rtt:1.2/0.5 ... rcv_space:29200
# wscale:7,7 means both sides use scale factor 7 (window * 128)

Without window scaling, the maximum in-flight data is 65535 bytes. On a 100ms RTT link, this limits throughput to 65535 / 0.1 = ~655 KB/s regardless of bandwidth.
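The throughput ceiling is simple arithmetic worth internalizing: a quick sketch of the bound window / RTT, using the numbers from this section.

```python
# Max TCP throughput is capped at window / RTT, regardless of link speed.
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput in bytes per second."""
    return window_bytes / rtt_s

# 64 KB window, 100 ms RTT: ~655 KB/s no matter how fast the link is
print(round(max_throughput_bps(65535, 0.100)))          # 655350
# Same path with wscale:7 (window * 128): ~84 MB/s
print(round(max_throughput_bps(65535 * 128, 0.100)))    # 83884800
```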

4. Congestion Control

Flow control prevents the sender from overwhelming the receiver. Congestion control prevents the sender from overwhelming the network. They are independent mechanisms.

Slow Start

A new connection starts with a small congestion window (cwnd, typically 10 segments = ~14KB). For each ACK received, cwnd increases by one segment. This is exponential growth — cwnd doubles every RTT until either a loss occurs or the slow start threshold (ssthresh) is reached.

RTT 1: cwnd = 10 segments → send 10 segments
RTT 2: cwnd = 20 segments → send 20 segments
RTT 3: cwnd = 40 segments → send 40 segments
RTT 4: cwnd = 80 segments → send 80 segments

On a 1 Gbps link with 50ms RTT, it takes roughly 9 RTTs (~450ms) of doubling from the initial 10-segment window before the congestion window covers the bandwidth-delay product. This is why short-lived HTTP connections are inefficient — they spend most of their life in slow start.
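The ramp-up arithmetic can be sketched directly; the exact RTT count depends on the assumed segment size (1448 bytes here) and initial window (10 segments, the Linux default).

```python
# How many RTTs of slow-start doubling until cwnd covers the
# bandwidth-delay product (BDP) of the path?
def rtts_to_fill(link_bps: float, rtt_s: float,
                 mss: int = 1448, init_cwnd: int = 10) -> int:
    bdp_segments = (link_bps / 8) * rtt_s / mss   # BDP in MSS-sized segments
    cwnd, rtts = init_cwnd, 0
    while cwnd < bdp_segments:
        cwnd *= 2                                 # cwnd doubles every RTT
        rtts += 1
    return rtts

print(rtts_to_fill(1e9, 0.050))  # 1 Gbps, 50 ms RTT -> 9
```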

Congestion Avoidance

Once cwnd reaches ssthresh, growth switches to linear: cwnd increases by ~1 segment per RTT. This is additive increase.

Fast Retransmit and Fast Recovery

When the receiver detects a gap (an out-of-order segment), it sends a duplicate ACK. After 3 duplicate ACKs, the sender retransmits the missing segment immediately (fast retransmit) without waiting for the retransmission timeout. It then halves cwnd and enters fast recovery, continuing to send data rather than dropping back to slow start.

Modern Algorithms: CUBIC and BBR

CUBIC (default on Linux since 2.6.19) uses a cubic function for window growth. After a loss, it rapidly recovers to the pre-loss window size, then slows growth near that point. CUBIC is loss-based — it treats packet loss as a signal of congestion.

Remember: CUBIC reacts to loss (packet drop = slow down). BBR measures bandwidth and RTT (it probes the path). On clean links, both perform similarly. On lossy links (Wi-Fi, satellite), BBR wins because random packet loss is not congestion.

BBR (Bottleneck Bandwidth and RTT, Google, 2016) takes a fundamentally different approach. Instead of reacting to loss, it actively probes for bandwidth and RTT to build a model of the path. BBR can significantly improve throughput on lossy links (Wi-Fi, transcontinental, satellite) where CUBIC interprets random loss as congestion.

# Check current algorithm
sysctl net.ipv4.tcp_congestion_control
# cubic

# Available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# reno cubic

# Load and use BBR (requires kernel 4.9+)
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Verify
sysctl net.ipv4.tcp_congestion_control
# bbr

5. Nagle's Algorithm and TCP_NODELAY

Nagle's algorithm batches small writes into larger segments. If there is unacknowledged data in flight, the sender buffers new small writes until either an ACK arrives or the buffer fills an MSS-sized segment.

This is efficient for bulk transfers but devastating for interactive protocols. A typical pathology: an application sends a 50-byte message, Nagle holds it because there is data in flight, the receiver has delayed ACK enabled and waits 40-200ms before acknowledging. The 50-byte message sits in the sender's buffer for up to 200ms.

# Disable Nagle (send immediately regardless of outstanding ACKs)
# In application code:
# setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

# TCP_NODELAY is a per-socket option and does not appear in ss output;
# verify at the application level, e.g. by tracing the setsockopt call:
# strace -e trace=setsockopt -p <pid>

Rule of thumb: Set TCP_NODELAY for any request-response protocol (HTTP, RPC, database queries, game packets). Leave Nagle enabled for bulk transfers where latency does not matter.
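From Python the same option is one setsockopt call — a minimal sketch on an unconnected socket, just to show the API shape:

```python
import socket

# Disable Nagle: every write goes out immediately instead of being
# coalesced while ACKs are outstanding.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)  # non-zero = Nagle disabled
s.close()
```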

6. TCP Keepalive

TCP keepalive probes detect dead connections where the remote host has crashed, been disconnected, or is behind a stateful firewall that silently drops idle connections.

# Default keepalive settings
sysctl net.ipv4.tcp_keepalive_time     # 7200 (2 hours before first probe)
sysctl net.ipv4.tcp_keepalive_intvl    # 75 (seconds between probes)
sysctl net.ipv4.tcp_keepalive_probes   # 9 (probes before declaring dead)

# With defaults: 2h + (75s * 9) = ~2h 11min to detect a dead connection
# Most applications cannot tolerate this

# Reasonable values for production
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
# Now: 5min + (30s * 5) = 7.5min to detect dead connection

Applications can override these on a per-socket basis with setsockopt. Many applications (PostgreSQL, Redis, gRPC) have their own keepalive configuration that overrides the system defaults.
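A sketch of such a per-socket override in Python, using the Linux-specific option names; the values mirror the "reasonable production" sysctls above but apply only to this one socket.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)       # enable keepalive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)    # idle seconds before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)    # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)       # failed probes before declaring dead
keepidle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
print(keepidle)  # 300
s.close()
```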

7. MSS vs MTU

MTU (Maximum Transmission Unit) is the largest frame size the link layer will carry. Ethernet default is 1500 bytes. MSS (Maximum Segment Size) is the largest TCP payload, which is MTU minus IP and TCP headers: 1500 - 20 (IP) - 20 (TCP) = 1460 bytes. With TCP options (timestamps, SACK), effective MSS is typically 1448 bytes.
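The arithmetic above, as a small helper:

```python
# MSS = MTU minus IP and TCP headers, minus any TCP options in use
# (timestamps cost 12 bytes per segment).
def mss(mtu: int, ip_hdr: int = 20, tcp_hdr: int = 20, tcp_opts: int = 0) -> int:
    return mtu - ip_hdr - tcp_hdr - tcp_opts

print(mss(1500))               # 1460
print(mss(1500, tcp_opts=12))  # 1448 (with TCP timestamps)
print(mss(9000))               # 8960 (jumbo frames)
```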

# Check interface MTU
ip link show eth0 | grep mtu
# mtu 1500

# Check TCP MSS for a connection
ss -ti dst 10.0.1.50 | grep mss
# mss:1448

# Jumbo frames (9000 byte MTU) for datacenter traffic
ip link set eth0 mtu 9000
# MSS becomes: 9000 - 40 = 8960 bytes

8. TCP Timestamps and SACK

TCP timestamps (RFC 1323) serve two purposes: precise RTT measurement and protection against wrapped sequence numbers (PAWS) on high-bandwidth links.

# Check if timestamps are enabled
sysctl net.ipv4.tcp_timestamps
# 1 (enabled by default — leave it on)

SACK (Selective Acknowledgment, RFC 2018) allows the receiver to tell the sender exactly which segments arrived, rather than just acknowledging up to the last contiguous segment. Without SACK, a single lost segment causes retransmission of everything after it.

sysctl net.ipv4.tcp_sack
# 1 (enabled by default — leave it on)

9. TCP Fast Open

TCP Fast Open (TFO) allows data to be sent in the SYN packet, eliminating one RTT for repeat connections. The server gives the client a cookie on the first connection. On subsequent connections, the client includes the cookie and data in the SYN.

# Enable TFO (bitmask: 0x1 = client support, 0x2 = server support, 3 = both)
sysctl -w net.ipv4.tcp_fastopen=3

# Application must also enable TFO:
# Server: setsockopt(sock, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen))
# Client: sendto() with MSG_FASTOPEN flag

IP Deep Dive

10. IPv4 Header

The critical fields for operations:

| Field | Size | Purpose |
|-------|------|---------|
| TTL | 8 bits | Decremented by each router; prevents loops. Default 64 (Linux), 128 (Windows) |
| Protocol | 8 bits | Upper-layer protocol: 6=TCP, 17=UDP, 1=ICMP |
| Flags + Fragment Offset | 16 bits | DF (Don't Fragment) flag critical for PMTU discovery |
| Source/Dest IP | 32 bits each | The addresses |

# See IP TTL in action
traceroute -n 10.0.1.50
# Each hop decrements TTL by 1; when TTL=0, router sends ICMP Time Exceeded

# Check the Don't Fragment flag
tcpdump -v -c 5 'host 10.0.1.50'
# Flags [DF] means the packet has Don't Fragment set

11. IPv6 Header

IPv6 simplifies the header: no fragmentation by routers (only endpoints fragment), no header checksum (upper layers handle it), fixed 40-byte header with extension headers for optional features.

# Check IPv6 addresses
ip -6 addr show

# IPv6 neighbor discovery (replaces ARP)
ip -6 neigh show

# Test IPv6 connectivity
ping6 -c 3 ::1

12. ICMP

ICMP is not just for ping. It carries critical control messages:

| Type | Code | Message | Why it matters |
|------|------|---------|----------------|
| 0 | 0 | Echo Reply | ping response |
| 3 | 0 | Destination Network Unreachable | Routing failure |
| 3 | 1 | Destination Host Unreachable | ARP failure / host down |
| 3 | 3 | Destination Port Unreachable | No service listening (UDP) |
| 3 | 4 | Fragmentation Needed + DF Set | Path MTU discovery |
| 8 | 0 | Echo Request | ping |
| 11 | 0 | Time Exceeded | TTL expired (traceroute) |

Gotcha: Blocking ICMP "for security" is one of the most destructive network misconfigurations. ICMP Type 3 Code 4 (Fragmentation Needed) is essential for Path MTU Discovery. Without it, large packets are silently dropped and TCP connections hang after the handshake succeeds (because SYN/ACK packets are small enough, but data packets are not). This is called a PMTU blackhole.

Path MTU Discovery (PMTUD): When a router receives a packet larger than the next-hop MTU and the DF flag is set, it sends ICMP Type 3, Code 4 back to the sender with the next-hop MTU. The sender then reduces its segment size. If ICMP is blocked (by a misconfigured firewall), PMTUD breaks — this is a PMTU blackhole.

# Check path MTU to a host
tracepath 10.0.1.50
#  1:  10.0.0.1         0.312ms pmtu 1500
#  2:  10.0.1.50        0.645ms reached
#      Resume: pmtu 1500

# Check for PMTU blackholes
tcpdump -n 'icmp and icmp[icmptype] == 3 and icmp[icmpcode] == 4'

13. ARP and NDP

ARP (Address Resolution Protocol) maps IPv4 addresses to MAC addresses on the local network. NDP (Neighbor Discovery Protocol) does the same for IPv6.

# View ARP cache
ip neigh show
# 10.0.1.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
# 10.0.1.50 dev eth0 lladdr 00:11:22:33:44:66 STALE

# ARP states: REACHABLE, STALE, DELAY, PROBE, FAILED, INCOMPLETE

# Clear a specific ARP entry
ip neigh del 10.0.1.50 dev eth0

# Flush all ARP entries
ip neigh flush all

# Watch ARP traffic
tcpdump -n -i eth0 arp
# ARP, Request who-has 10.0.1.50 tell 10.0.1.1
# ARP, Reply 10.0.1.50 is-at 00:11:22:33:44:66

14. IP Fragmentation and Reassembly

When a packet exceeds the link MTU and the DF flag is not set, the router fragments it. Fragmentation is expensive: each fragment needs its own IP header, the receiver must reassemble, and if any fragment is lost, all fragments must be retransmitted.
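A sketch of how the splitting works: each fragment repeats the 20-byte IP header, and fragment offsets are expressed in 8-byte units, so the payload per fragment is aligned down to a multiple of 8.

```python
# Split an IP payload into (offset, size) fragments for a given MTU.
def fragments(payload_bytes: int, mtu: int, ip_hdr: int = 20):
    per_frag = (mtu - ip_hdr) // 8 * 8      # max data per fragment, 8-byte aligned
    out, offset = [], 0
    while offset < payload_bytes:
        size = min(per_frag, payload_bytes - offset)
        out.append((offset, size))
        offset += size
    return out

# 4000 ICMP data bytes + 8-byte ICMP header = 4008 bytes of IP payload
print(fragments(4008, 1500))   # [(0, 1480), (1480, 1480), (2960, 1048)]
```

This matches the ping example below: a 4000-byte ping at MTU 1500 becomes three fragments, and losing any one of them forces the sender to retransmit the whole datagram.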

# Check fragmentation stats
nstat -az | grep -i frag
# IpFragOKs      0
# IpFragFails    0
# IpReasmReqds   0

# Force a large packet to see fragmentation
ping -s 4000 -M dont 10.0.1.50
# PING 10.0.1.50: 4000 data bytes (will be fragmented if MTU < 4028)

# With DF set (default), large packets that exceed MTU are dropped
ping -s 4000 -M do 10.0.1.50
# ping: local error: message too long, mtu=1500

15. Multicast, DSCP, and ECN

Multicast: One-to-many delivery using addresses in 224.0.0.0/4. Used by VRRP (224.0.0.18), OSPF (224.0.0.5/6), and application-level multicast (video streaming, market data).

# Check multicast group membership
ip maddr show
netstat -g

DSCP (Differentiated Services Code Point): 6-bit field in the IP header for QoS marking. Routers use DSCP to prioritize traffic (voice over bulk data, for example).

ECN (Explicit Congestion Notification): IP + TCP feature that lets routers signal congestion without dropping packets. The router marks the packet; the receiver echoes the mark to the sender; the sender reduces its rate. This avoids the latency penalty of loss-based congestion detection.

sysctl net.ipv4.tcp_ecn
# 2 (default: accept ECN when requested by incoming connections,
#    but do not request it on outgoing connections)

Socket Programming Concepts

16. The Socket API

Understanding the socket system calls helps you reason about what the kernel is doing when you see connection states.

Server                              Client
  |                                    |
  socket()  → create endpoint          |
  bind()    → assign address:port      |
  listen()  → mark as passive, create  |
             accept queue (backlog)    socket() → create endpoint
  |                                    |
  |                                    connect() → send SYN
  |  SYN_RECEIVED                      |
  accept()  → dequeue connection       |  ESTABLISHED
  |  ESTABLISHED                       |
  |                                    |
  read()/write()  ←→  read()/write()   |
  |                                    |
  close()  → send FIN                  close() → send FIN

The listen() backlog parameter sets the maximum length of the accept queue, with net.core.somaxconn acting as a ceiling. If the accept queue is full, the kernel ignores the handshake-completing ACK (forcing the client to retransmit), or sends a RST if tcp_abort_on_overflow=1.

# See the accept queue for listening sockets
ss -ltn
# State    Recv-Q   Send-Q   Local Address:Port
# LISTEN   0        4096     0.0.0.0:80
#          ^        ^
#          current  max (min of backlog and somaxconn)

When Recv-Q approaches Send-Q on a LISTEN socket, the accept queue is almost full and connections will start being dropped.
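The call sequence in the diagram above can be sketched as a minimal Python echo exchange; the address is loopback and the backlog value is illustrative.

```python
import socket
import threading

# Server side: socket() / bind() / listen() / accept().
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))    # assign address:port (kernel picks a free port)
srv.listen(16)                # passive open; accept queue of up to 16
port = srv.getsockname()[1]
received = []

def client() -> None:
    c = socket.create_connection(("127.0.0.1", port))  # SYN / SYN-ACK / ACK
    c.sendall(b"ping")
    received.append(c.recv(4))                         # echoed payload
    c.close()                                          # active close: send FIN

t = threading.Thread(target=client)
t.start()
conn, _addr = srv.accept()    # dequeue an ESTABLISHED connection
conn.sendall(conn.recv(4))    # echo the 4 bytes back
conn.close()
t.join()
srv.close()
print(received[0].decode())   # ping
```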

