MTU - Street-Level Ops¶
Real-world MTU diagnosis and resolution workflows for production environments.
Remember: MTU = Maximum Transmission Unit = the largest packet a link can carry. Ethernet default is 1500 bytes. Every tunnel, overlay, or VPN adds header bytes, reducing the effective MTU. The formula:
effective MTU = physical MTU - encapsulation overhead.
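The formula is plain arithmetic; as a quick sanity check (VXLAN and GRE overheads are the standard values covered later in this page):

```shell
# effective MTU = physical MTU - encapsulation overhead
vxlan_mtu=$(( 1500 - 50 ))   # VXLAN adds 50 bytes of headers
gre_mtu=$(( 1500 - 24 ))     # GRE adds 24 bytes
echo "VXLAN: $vxlan_mtu, GRE: $gre_mtu"
```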
Task: Diagnose "Small Packets Work, Large Transfers Hang"¶
# SSH works but SCP stalls. Classic MTU blackhole.
# Test with DF bit set — find where packets get too big
$ ping -M do -s 1472 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 1472(1500) bytes of data.
--- 10.0.0.20 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss
# 1500 fails. Binary search for the real path MTU:
$ ping -M do -s 1372 10.0.0.20 # works (1400-byte packet)
$ ping -M do -s 1422 10.0.0.20 # works (1450-byte packet)
$ ping -M do -s 1432 10.0.0.20 # fails (1460-byte packet)
# Path MTU is between 1450 and 1459 bytes. Likely a tunnel overhead.
# Set interface MTU to match:
$ ip link set dev eth0 mtu 1450
Under the hood: The `-M do` flag sets the "Don't Fragment" (DF) bit in the IP header. When a router along the path cannot forward the packet without fragmenting it, it should send back an ICMP "Fragmentation Needed" message. If a firewall blocks ICMP (common), you never get that message; packets just vanish. This is the "MTU blackhole" problem.
Gotcha: The `-s` value in `ping -M do -s` is the ICMP payload size, not the total packet size. Add 28 bytes for IP + ICMP headers: `-s 1472` sends a 1500-byte packet (1472 + 20 IP + 8 ICMP). This off-by-28 mistake leads people to set MTU 28 bytes too high.
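The manual bisection above can be scripted. A minimal sketch (the script name is arbitrary; it assumes the path carries at least the 576-byte IPv4 minimum and that ICMP echo itself is not filtered):

```shell
#!/bin/sh
# pmtu-probe.sh: binary-search the path MTU with DF-bit pings (sketch).
mtu_to_payload() {            # ICMP payload for a given total packet size
  echo $(( $1 - 28 ))         # 20-byte IP header + 8-byte ICMP header
}
fits() {                      # true if a packet of total size $1 passes unfragmented
  ping -M do -c 1 -W 1 -s "$(mtu_to_payload "$1")" "$2" >/dev/null 2>&1
}
if [ $# -eq 1 ]; then
  lo=576; hi=1500
  if fits "$hi" "$1"; then lo=$hi; fi       # full 1500 fits, no search needed
  while [ $(( hi - lo )) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if fits "$mid" "$1"; then lo=$mid; else hi=$mid; fi
  done
  echo "path MTU toward $1: $lo"
fi
```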
Task: Discover Path MTU Automatically¶
# tracepath discovers MTU at each hop
$ tracepath 10.0.0.20
1?: [LOCALHOST] pmtu 1500
1: gateway 0.5ms
2: 10.0.1.1 1.2ms pmtu 1450
2: 10.0.0.20 2.1ms reached
Resume: pmtu 1450
# Path MTU is 1450 — a tunnel between hop 1 and hop 2 is the bottleneck
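When you need the number in a script rather than on screen, the Resume line is easy to parse; `parse_pmtu` is a made-up helper name:

```shell
#!/bin/sh
# Pull the final pmtu out of tracepath's "Resume:" line (sketch).
parse_pmtu() { awk '/Resume:/ { print $3 }'; }
# Usage: tracepath -n 10.0.0.20 | parse_pmtu
```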
Task: Set Up Jumbo Frames on a Storage Network¶
# Verify switch supports jumbo frames (must be end-to-end)
# Set MTU on the storage interface
$ ip link set dev eth1 mtu 9000
# Verify
$ ip link show eth1 | grep mtu
mtu 9000
# Test end-to-end with DF bit
$ ping -M do -s 8972 10.100.0.20 # 8972 + 28 = 9000
PING 10.100.0.20 (10.100.0.20) 8972(9000) bytes of data.
8980 bytes from 10.100.0.20: icmp_seq=1 ttl=64 time=0.3 ms
# Make persistent (netplan)
$ cat > /etc/netplan/01-storage.yaml <<'EOF'
network:
  version: 2
  ethernets:
    eth1:
      mtu: 9000
      addresses: [10.100.0.5/24]
EOF
$ netplan apply
Scale note: Jumbo frames (MTU 9000) improve throughput for storage traffic by reducing per-packet overhead. A 1MB transfer uses ~700 packets at MTU 1500 but only ~120 at MTU 9000. The improvement is most visible on 10G+ links with large sequential I/O (NFS, iSCSI, database replication).
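The per-packet savings are simple to verify. Assuming TCP over IPv4 with no options (40 bytes of headers, so 1460 and 8960 bytes of payload per packet):

```shell
#!/bin/sh
# Packets needed to move 1 MiB at standard vs jumbo MTU
# (1500 - 40 = 1460 and 9000 - 40 = 8960 bytes of TCP payload per packet)
bytes=$(( 1024 * 1024 ))
std=$(( (bytes + 1459) / 1460 ))      # ceiling division
jumbo=$(( (bytes + 8959) / 8960 ))
echo "MTU 1500: $std packets, MTU 9000: $jumbo packets"
```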
Task: Fix MTU for VXLAN Overlay Network¶
# Kubernetes pods timing out on large HTTP responses
# Underlay MTU is 1500, VXLAN overhead is 50 bytes
# Check current pod MTU
$ kubectl exec -it debug-pod -- ip link show eth0
mtu 1500
# Pod MTU should be 1450 (1500 - 50 VXLAN overhead)
# Fix depends on CNI — for Calico:
$ kubectl -n kube-system get configmap calico-config -o yaml | grep -i mtu
veth_mtu: "1500" # Wrong!
# Update to 1450, then restart calico-node pods
# For Flannel, set in kube-flannel ConfigMap: "Backend": {"MTU": 1450}
Gotcha: After changing the CNI MTU config, you must restart all pods (not just calico-node). Existing pods keep their old MTU until they are recreated. Use `kubectl rollout restart deployment -n <ns>` for each namespace, or do a rolling node drain.
Under the hood: RFC 4821 (Packetization Layer Path MTU Discovery) describes a method that does not depend on ICMP at all: it probes with progressively larger TCP segments and uses ACKs to determine the path MTU. Linux supports this via `net.ipv4.tcp_mtu_probing=1` (enabled on blackhole detection) or `=2` (always enabled). This is the robust alternative when ICMP is blocked.
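To make RFC 4821 probing survive reboots, a sysctl drop-in works; the filename is arbitrary, and `tcp_base_mss` (the floor the prober starts from) is shown at 1024, a common choice rather than a requirement:

```
# /etc/sysctl.d/99-mtu-probing.conf
net.ipv4.tcp_mtu_probing = 1    # probe only after a blackhole is detected
net.ipv4.tcp_base_mss = 1024    # starting MSS for probes
# Apply without reboot: sysctl --system
```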
Task: Clamp TCP MSS When You Cannot Change MTU¶
# VPN tunnel with 1400 MTU, but you cannot change endpoint MTU settings
# Clamp MSS so TCP segments fit within the tunnel MTU
$ iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --set-mss 1360
# Or auto-clamp to path MTU
$ iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --clamp-mss-to-pmtu
# Verify with tcpdump — check MSS in SYN packets
$ tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0' -c 5
10.0.0.5.43210 > 10.0.0.20.443: Flags [S], mss 1360
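The 1360 above comes from subtracting the TCP/IP headers from the tunnel MTU; for IPv4 with no options that is 40 bytes (60 for IPv6):

```shell
# MSS that fits a given tunnel MTU (IPv4, no IP/TCP options)
mss_for_mtu() { echo $(( $1 - 40 )); }   # 20-byte IP + 20-byte TCP
mss_for_mtu 1400    # -> 1360, the value clamped above
```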
Task: Check for Fragmentation in Production¶
# Look for fragmented packets on the wire
$ tcpdump -i eth0 'ip[6:2] & 0x3fff != 0' -c 10
# Any output means fragmentation is happening
# Check kernel fragmentation statistics
$ grep ^Ip: /proc/net/snmp
Ip: Forwarding DefaultTTL ... FragOKs FragFails FragCreates
Ip: 2 64 ... 4236 12 8472
# FragFails > 0 means DF bit was set and packets could not be sent
# FragCreates > 0 means the kernel is fragmenting — investigate why
# Check ICMP "need to frag" messages
$ tcpdump -i eth0 'icmp[0] == 3 and icmp[1] == 4' -c 5
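Matching the Frag* columns to their values by eye is error-prone, since the names and numbers sit on separate rows. This awk sketch pairs them up (`frag_stats` is a made-up name):

```shell
#!/bin/sh
# Pair header names with values in the Ip: rows of /proc/net/snmp
# and print just the fragmentation counters.
frag_stats() {
  awk '$1 == "Ip:" && !seen { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }
       $1 == "Ip:"          { for (i = 2; i <= NF; i++) if (name[i] ~ /^Frag/) print name[i], $i }'
}
# Usage: frag_stats < /proc/net/snmp
```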
Remember: Common MTU overhead values to memorize: Ethernet = 1500, VXLAN = -50 (1450), GRE = -24 (1476), IPsec (tunnel+ESP) = -58 to -73 (depends on cipher), WireGuard = -60 (1440), PPPoE = -8 (1492). When troubleshooting, subtract the encapsulation overhead from the underlay MTU.
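The table above can be turned into a quick-reference script (a sketch; overheads are the values listed, with IPsec shown at its 73-byte worst case since the real figure depends on the cipher):

```shell
#!/bin/sh
# Effective MTU for each common encapsulation over a 1500-byte underlay.
table=$(
  for entry in 'VXLAN 50' 'GRE 24' 'IPsec 73' 'WireGuard 60' 'PPPoE 8'; do
    set -- $entry                     # $1 = name, $2 = overhead in bytes
    printf '%-10s -%2s bytes -> MTU %s\n' "$1" "$2" $(( 1500 - $2 ))
  done
)
printf '%s\n' "$table"
```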
Task: Debug MTU Mismatch Across a VPN¶
# IPsec tunnel — users report intermittent stalls
# Check the tunnel interface MTU
$ ip link show ipsec0
mtu 1438
# Check what the remote side sends
$ tcpdump -i eth0 -nn host 203.0.113.5 | grep "frag"
# Look for "need to frag" or fragmented packets
# If PMTUD is blocked (ICMP filtered), force a lower MTU
$ ip link set dev ipsec0 mtu 1400
# Test
$ ping -M do -s 1372 remote-host # 1372 + 28 = 1400
3 packets transmitted, 3 received, 0% packet loss
Task: Verify MTU Consistency Across a Fleet¶
# Quick check across all nodes
$ for host in node{01..10}; do
    mtu=$(ssh "$host" cat /sys/class/net/eth0/mtu 2>/dev/null)
    echo "$host: MTU=$mtu"
  done
node01: MTU=1500
node02: MTU=1500
node03: MTU=9000 # <-- mismatch!
node04: MTU=1500
...
# node03 has jumbo frames enabled while others do not
# Large packets from node03 to other nodes will be dropped
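To make the mismatch machine-detectable (for cron or CI), compare each host against the fleet's majority value; `check_mtus` is a hypothetical helper that reads "host mtu" pairs on stdin:

```shell
#!/bin/sh
# Flag hosts whose MTU differs from the fleet's most common value.
check_mtus() {
  awk '{ host[NR] = $1; mtu[NR] = $2; count[$2]++ }
       END {
         best = ""
         for (m in count) if (count[m] > count[best]) best = m
         for (i = 1; i <= NR; i++)
           if (mtu[i] != best)
             printf "%s: MTU=%s (fleet majority is %s)\n", host[i], mtu[i], best
       }'
}
# Usage: feed it the loop output above, reformatted as "host mtu" lines
```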
Emergency: PMTUD Blackhole — Fix Without Reboot¶
# Application connections establish but data stalls
# ICMP "need to frag" is being blocked by an intermediate firewall
# Temporary fix: enable TCP MSS clamping on the affected path
$ iptables -t mangle -A FORWARD -o eth0 -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --clamp-mss-to-pmtu
# Long-term: fix the firewall to allow ICMP type 3 code 4
# This ICMP message is essential for PMTUD and must never be blocked
War story: A "security hardened" firewall rule blocking all ICMP caused intermittent failures across an entire data center. Small API calls worked, large file uploads failed randomly. The fix was a one-line firewall rule to allow ICMP type 3 code 4 (Destination Unreachable / Fragmentation Needed). Never blanket-block ICMP.