MTU - Street-Level Ops¶
Real-world MTU diagnosis and resolution workflows for production environments.
Remember: MTU = Maximum Transmission Unit = the largest packet a link can carry. Ethernet default is 1500 bytes. Every tunnel, overlay, or VPN adds header bytes, reducing the effective MTU. The formula:
effective MTU = physical MTU - encapsulation overhead.
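The formula is plain arithmetic; as a quick sanity check (VXLAN and GRE overheads are the standard values covered later in this page):

```shell
# effective MTU = physical MTU - encapsulation overhead
vxlan_mtu=$(( 1500 - 50 ))   # VXLAN adds 50 bytes of headers
gre_mtu=$(( 1500 - 24 ))     # GRE adds 24 bytes
echo "VXLAN: $vxlan_mtu, GRE: $gre_mtu"
```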
Task: Diagnose "Small Packets Work, Large Transfers Hang"¶
# SSH works but SCP stalls. Classic MTU blackhole.
# Test with DF bit set — find where packets get too big
$ ping -M do -s 1472 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 1472(1500) bytes of data.
--- 10.0.0.20 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss
# 1500 fails. Binary search for the real path MTU:
$ ping -M do -s 1372 10.0.0.20 # works (1400-byte packet)
$ ping -M do -s 1422 10.0.0.20 # works (1450-byte packet)
$ ping -M do -s 1432 10.0.0.20 # fails (1460-byte packet)
# Path MTU is between 1450 and 1459 bytes. Likely a tunnel overhead.
# Set interface MTU to match:
$ ip link set dev eth0 mtu 1450
Under the hood: The `-M do` flag sets the "Don't Fragment" (DF) bit in the IP header. When a router along the path cannot forward the packet without fragmenting it, it should send back an ICMP "Fragmentation Needed" message. If a firewall blocks ICMP (common), you never get that message; packets just vanish. This is the "MTU blackhole" problem.
Gotcha: The `-s` value in `ping -M do -s` is the ICMP payload size, not the total packet size. Add 28 bytes for IP + ICMP headers: `-s 1472` sends a 1500-byte packet (1472 + 20 IP + 8 ICMP). This off-by-28 mistake leads people to set MTU 28 bytes too high.
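The manual bisection above can be scripted. A minimal sketch (the script name is arbitrary; it assumes the path carries at least the 576-byte IPv4 minimum and that ICMP echo itself is not filtered):

```shell
#!/bin/sh
# pmtu-probe.sh: binary-search the path MTU with DF-bit pings (sketch).
mtu_to_payload() {            # ICMP payload for a given total packet size
  echo $(( $1 - 28 ))         # 20-byte IP header + 8-byte ICMP header
}
fits() {                      # true if a packet of total size $1 passes unfragmented
  ping -M do -c 1 -W 1 -s "$(mtu_to_payload "$1")" "$2" >/dev/null 2>&1
}
if [ $# -eq 1 ]; then
  lo=576; hi=1500
  if fits "$hi" "$1"; then lo=$hi; fi       # full 1500 fits, no search needed
  while [ $(( hi - lo )) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if fits "$mid" "$1"; then lo=$mid; else hi=$mid; fi
  done
  echo "path MTU toward $1: $lo"
fi
```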
Task: Discover Path MTU Automatically¶
# tracepath discovers MTU at each hop
$ tracepath 10.0.0.20
1?: [LOCALHOST] pmtu 1500
1: gateway 0.5ms
2: 10.0.1.1 1.2ms pmtu 1450
2: 10.0.0.20 2.1ms reached
Resume: pmtu 1450
# Path MTU is 1450 — a tunnel between hop 1 and hop 2 is the bottleneck
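When you need the number in a script rather than on screen, the Resume line is easy to parse; `parse_pmtu` is a made-up helper name:

```shell
#!/bin/sh
# Pull the final pmtu out of tracepath's "Resume:" line (sketch).
parse_pmtu() { awk '/Resume:/ { print $3 }'; }
# Usage: tracepath -n 10.0.0.20 | parse_pmtu
```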
Task: Set Up Jumbo Frames on a Storage Network¶
# Verify switch supports jumbo frames (must be end-to-end)
# Set MTU on the storage interface
$ ip link set dev eth1 mtu 9000
# Verify
$ ip link show eth1 | grep mtu
mtu 9000
# Test end-to-end with DF bit
$ ping -M do -s 8972 10.100.0.20 # 8972 + 28 = 9000
PING 10.100.0.20 (10.100.0.20) 8972(9000) bytes of data.
8980 bytes from 10.100.0.20: icmp_seq=1 ttl=64 time=0.3 ms
# Make persistent (netplan)
$ cat > /etc/netplan/01-storage.yaml <<'EOF'
network:
  version: 2
  ethernets:
    eth1:
      mtu: 9000
      addresses: [10.100.0.5/24]
EOF
$ netplan apply
Scale note: Jumbo frames (MTU 9000) improve throughput for storage traffic by reducing per-packet overhead. A 1MB transfer uses ~700 packets at MTU 1500 but only ~120 at MTU 9000. The improvement is most visible on 10G+ links with large sequential I/O (NFS, iSCSI, database replication).
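The per-packet savings are simple to verify. Assuming TCP over IPv4 with no options (40 bytes of headers, so 1460 and 8960 bytes of payload per packet):

```shell
#!/bin/sh
# Packets needed to move 1 MiB at standard vs jumbo MTU
# (1500 - 40 = 1460 and 9000 - 40 = 8960 bytes of TCP payload per packet)
bytes=$(( 1024 * 1024 ))
std=$(( (bytes + 1459) / 1460 ))      # ceiling division
jumbo=$(( (bytes + 8959) / 8960 ))
echo "MTU 1500: $std packets, MTU 9000: $jumbo packets"
```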
Task: Fix MTU for VXLAN Overlay Network¶
# Kubernetes pods timing out on large HTTP responses
# Underlay MTU is 1500, VXLAN overhead is 50 bytes
# Check current pod MTU
$ kubectl exec -it debug-pod -- ip link show eth0
mtu 1500
# Pod MTU should be 1450 (1500 - 50 VXLAN overhead)
# Fix depends on CNI — for Calico:
$ kubectl -n kube-system get configmap calico-config -o yaml | grep -i mtu
veth_mtu: "1500" # Wrong!
# Update to 1450, then restart calico-node pods
# For Flannel, set in kube-flannel ConfigMap: "Backend": {"MTU": 1450}
Gotcha: After changing the CNI MTU config, you must restart all pods (not just calico-node). Existing pods keep their old MTU until they are recreated. Use `kubectl rollout restart deployment -n <ns>` for each namespace, or do a rolling node drain.
Under the hood: RFC 4821 (Packetization Layer Path MTU Discovery) describes a method that does not depend on ICMP at all: it probes with progressively larger TCP segments and uses ACKs to determine the path MTU. Linux supports this via `net.ipv4.tcp_mtu_probing=1` (enabled on blackhole detection) or `=2` (always enabled). This is the robust alternative when ICMP is blocked.
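To make RFC 4821 probing survive reboots, a sysctl drop-in works; the filename is arbitrary, and `tcp_base_mss` (the floor the prober starts from) is shown at 1024, a common choice rather than a requirement:

```
# /etc/sysctl.d/99-mtu-probing.conf
net.ipv4.tcp_mtu_probing = 1    # probe only after a blackhole is detected
net.ipv4.tcp_base_mss = 1024    # starting MSS for probes
# Apply without reboot: sysctl --system
```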
Task: Clamp TCP MSS When You Cannot Change MTU¶
# VPN tunnel with 1400 MTU, but you cannot change endpoint MTU settings
# Clamp MSS so TCP segments fit within the tunnel MTU
$ iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --set-mss 1360
# Or auto-clamp to path MTU
$ iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --clamp-mss-to-pmtu
# Verify with tcpdump — check MSS in SYN packets
$ tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0' -c 5
10.0.0.5.43210 > 10.0.0.20.443: Flags [S], mss 1360
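The 1360 above comes from subtracting the TCP/IP headers from the tunnel MTU; for IPv4 with no options that is 40 bytes (60 for IPv6):

```shell
# MSS that fits a given tunnel MTU (IPv4, no IP/TCP options)
mss_for_mtu() { echo $(( $1 - 40 )); }   # 20-byte IP + 20-byte TCP
mss_for_mtu 1400    # -> 1360, the value clamped above
```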
Task: Check for Fragmentation in Production¶
# Look for fragmented packets on the wire
$ tcpdump -i eth0 'ip[6:2] & 0x3fff != 0' -c 10
# Any output means fragmentation is happening
# Check kernel fragmentation statistics
$ grep ^Ip: /proc/net/snmp
Ip: Forwarding DefaultTTL ... FragOKs FragFails FragCreates
Ip: 2 64 ... 4236 12 8472
# FragFails > 0 means DF bit was set and packets could not be sent
# FragCreates > 0 means the kernel is fragmenting — investigate why
# Check ICMP "need to frag" messages
$ tcpdump -i eth0 'icmp[0] == 3 and icmp[1] == 4' -c 5
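Matching the Frag* columns to their values by eye is error-prone, since the names and numbers sit on separate rows. This awk sketch pairs them up (`frag_stats` is a made-up name):

```shell
#!/bin/sh
# Pair header names with values in the Ip: rows of /proc/net/snmp
# and print just the fragmentation counters.
frag_stats() {
  awk '$1 == "Ip:" && !seen { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }
       $1 == "Ip:"          { for (i = 2; i <= NF; i++) if (name[i] ~ /^Frag/) print name[i], $i }'
}
# Usage: frag_stats < /proc/net/snmp
```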
Remember: Common MTU overhead values to memorize: Ethernet = 1500, VXLAN = -50 (1450), GRE = -24 (1476), IPsec (tunnel+ESP) = -58 to -73 (depends on cipher), WireGuard = -60 (1440), PPPoE = -8 (1492). When troubleshooting, subtract the encapsulation overhead from the underlay MTU.
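The table above can be turned into a quick-reference script (a sketch; overheads are the values listed, with IPsec shown at its 73-byte worst case since the real figure depends on the cipher):

```shell
#!/bin/sh
# Effective MTU for each common encapsulation over a 1500-byte underlay.
table=$(
  for entry in 'VXLAN 50' 'GRE 24' 'IPsec 73' 'WireGuard 60' 'PPPoE 8'; do
    set -- $entry                     # $1 = name, $2 = overhead in bytes
    printf '%-10s -%2s bytes -> MTU %s\n' "$1" "$2" $(( 1500 - $2 ))
  done
)
printf '%s\n' "$table"
```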
Task: Debug MTU Mismatch Across a VPN¶
# IPsec tunnel — users report intermittent stalls
# Check the tunnel interface MTU
$ ip link show ipsec0
mtu 1438
# Check what the remote side sends
$ tcpdump -i eth0 -nn host 203.0.113.5 | grep "frag"
# Look for "need to frag" or fragmented packets
# If PMTUD is blocked (ICMP filtered), force a lower MTU
$ ip link set dev ipsec0 mtu 1400
# Test
$ ping -M do -s 1372 remote-host # 1372 + 28 = 1400
3 packets transmitted, 3 received, 0% packet loss
Task: Verify MTU Consistency Across a Fleet¶
# Quick check across all nodes
$ for host in node{01..10}; do
    mtu=$(ssh "$host" cat /sys/class/net/eth0/mtu 2>/dev/null)
    echo "$host: MTU=$mtu"
  done
node01: MTU=1500
node02: MTU=1500
node03: MTU=9000 # <-- mismatch!
node04: MTU=1500
...
# node03 has jumbo frames enabled while others do not
# Large packets from node03 to other nodes will be dropped
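To make the mismatch machine-detectable (for cron or CI), compare each host against the fleet's majority value; `check_mtus` is a hypothetical helper that reads "host mtu" pairs on stdin:

```shell
#!/bin/sh
# Flag hosts whose MTU differs from the fleet's most common value.
check_mtus() {
  awk '{ host[NR] = $1; mtu[NR] = $2; count[$2]++ }
       END {
         best = ""
         for (m in count) if (count[m] > count[best]) best = m
         for (i = 1; i <= NR; i++)
           if (mtu[i] != best)
             printf "%s: MTU=%s (fleet majority is %s)\n", host[i], mtu[i], best
       }'
}
# Usage: feed it the loop output above, reformatted as "host mtu" lines
```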
Emergency: PMTUD Blackhole — Fix Without Reboot¶
# Application connections establish but data stalls
# ICMP "need to frag" is being blocked by an intermediate firewall
# Temporary fix: enable TCP MSS clamping on the affected path
$ iptables -t mangle -A FORWARD -o eth0 -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --clamp-mss-to-pmtu
# Long-term: fix the firewall to allow ICMP type 3 code 4
# This ICMP message is essential for PMTUD and must never be blocked
War story: A "security hardened" firewall rule blocking all ICMP caused intermittent failures across an entire data center. Small API calls worked, large file uploads failed randomly. The fix was a one-line firewall rule to allow ICMP type 3 code 4 (Destination Unreachable / Fragmentation Needed). Never blanket-block ICMP.