
Runbook: MTU Mismatch

Field Value
Domain Networking
Alert Large packet drops, TCP connections working but large file transfers failing, ICMP unreachable fragmentation-needed messages in logs
Severity P2
Est. Resolution Time 30-60 minutes
Escalation Timeout 45 minutes — page if not resolved
Last Tested 2026-03-19
Prerequisites SSH access to cluster nodes, ability to run ping/tracepath on nodes, kubectl access

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
ip link show && ping -M do -s 1400 <TARGET_IP>
If output shows: ping -M do -s 1400 succeeds but ping -M do -s 1450 fails → the path MTU is between 1428 and 1477 bytes (IP packet size = payload + 28 bytes of IP/ICMP headers); continue from Step 2
If output shows: All pings fail regardless of size → This is a routing issue, not MTU; see Network Partition

Step 1: Identify the Symptom Pattern

Why: MTU mismatches produce a very specific pattern — small requests succeed, large transfers silently stall or fail. Confirming this pattern before chasing MTU saves time.

# Test that small requests work
curl -v --max-time 10 http://<SERVICE_IP>:<PORT>/

# Test that a large download stalls or fails
curl -v --max-time 30 http://<SERVICE_IP>:<PORT>/large-file -o /dev/null

# In Kubernetes: test from a pod hitting a service that does large responses
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  curl -v --max-time 30 http://<SERVICE_NAME>.<NAMESPACE>.svc.cluster.local/large-endpoint
Expected output (confirming MTU issue):
# Small request: completes in < 1s
# Large request: starts, transfers some bytes, then hangs
If this fails: If both small and large fail completely, it is not an MTU issue. Rule out firewall, DNS, or routing problems first.
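The two curl checks above can be combined into a one-shot pattern test. A minimal sketch, assuming SERVICE_IP and PORT are set and the service exposes a /large-file path as in this step:

```shell
# Sketch: confirm the "small works, large stalls" pattern in one pass.
# SERVICE_IP, PORT, and /large-file are placeholders from this step.
small=$(curl -so /dev/null --max-time 10 -w '%{http_code}' "http://$SERVICE_IP:$PORT/") || small=fail
large=$(curl -so /dev/null --max-time 30 -w '%{http_code}' "http://$SERVICE_IP:$PORT/large-file") || large=fail
echo "small=$small large=$large"
# MTU pattern: small returns 200 while large times out (fail).
# Both failing points at firewall/DNS/routing instead, not MTU.
```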

Step 2: Find the MTU Break Point with Ping

Why: PMTUD (Path MTU Discovery) relies on ICMP "fragmentation needed" messages. Testing with different packet sizes identifies the exact MTU ceiling across the path.

# Run from a node (not a pod) to find the physical MTU limit
# Start high and work down until ping succeeds
# -M do = don't fragment, -s = payload size (actual IP packet size = -s + 28)
ping -M do -s 1472 <REMOTE_NODE_IP>   # Tests 1500 byte IP packet (standard Ethernet)
ping -M do -s 1422 <REMOTE_NODE_IP>   # Tests 1450 byte IP packet
ping -M do -s 1372 <REMOTE_NODE_IP>   # Tests 1400 byte IP packet

# Run from inside a pod to find the overlay (CNI tunnel) MTU limit
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1372 <REMOTE_POD_IP>
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1422 <REMOTE_POD_IP>
Expected output:
# The largest size that succeeds tells you the effective MTU
PING 10.0.1.5 (10.0.1.5) 1400(1428) bytes of data.
1408 bytes from 10.0.1.5: icmp_seq=1 ttl=64 time=0.5 ms   # Success
PING 10.0.1.5 (10.0.1.5) 1450(1478) bytes of data.
--- 10.0.1.5 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss         # Fail
If this fails: If every size fails with ping: local error: message too long, mtu=<N> (or From <HOP_IP> Frag needed and DF set (mtu = <N>)), that N is your effective MTU ceiling. Use it in Step 4.
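Rather than stepping through sizes by hand, the sweep can be turned into a binary search that pins the ceiling to the exact byte. A sketch, assuming REMOTE_NODE_IP is set and ICMP is permitted on the path:

```shell
# Sketch: binary-search the largest payload that passes with DF set.
# Bounds are payload bytes; REMOTE_NODE_IP is a placeholder from this step.
lo=1200   # a size assumed to pass (lower it if even this fails)
hi=1473   # one above the standard Ethernet maximum payload (1472)
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if ping -M do -c 1 -W 1 -s "$mid" "$REMOTE_NODE_IP" >/dev/null 2>&1; then
    lo=$mid    # mid passed: raise the floor
  else
    hi=$mid    # mid was dropped: lower the ceiling
  fi
done
echo "largest passing payload: $lo (effective path MTU $((lo + 28)) bytes)"
```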

Step 3: Check Interface MTU on Nodes

Why: The node's physical NIC MTU sets the ceiling for everything above it. The CNI overlay MTU must be strictly lower to leave room for encapsulation headers, never equal or higher.

# SSH to affected nodes and check all interface MTUs
ssh <NODE_USERNAME>@<NODE_IP>

# Show all interfaces with MTU
ip link show

# Or just the primary and tunnel interfaces
ip link show eth0
ip link show flannel.1   # Flannel VXLAN
ip link show cilium_vxlan  # Cilium VXLAN
ip link show tunl0       # Calico IPIP
ip link show vxlan.calico  # Calico VXLAN
Expected output:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP
   ...
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue
If this fails: If the tunnel interface MTU equals the physical MTU (e.g., both at 1500), the encapsulation overhead is not being accounted for — this is the root cause. Proceed to Step 4.
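The eyeball comparison above can be automated by reading MTUs from /sys. A sketch, with eth0 and flannel.1 as example interface names; substitute your NIC and CNI tunnel interface:

```shell
# Sketch: flag a tunnel interface whose MTU leaves no room for encapsulation.
# PHYS_IF/TUN_IF defaults are examples; override them for your environment.
PHYS_IF=${PHYS_IF:-eth0}
TUN_IF=${TUN_IF:-flannel.1}
phys_mtu=$(cat "/sys/class/net/$PHYS_IF/mtu" 2>/dev/null || echo 0)
tun_mtu=$(cat "/sys/class/net/$TUN_IF/mtu" 2>/dev/null || echo 0)
if [ "$tun_mtu" -gt 0 ] && [ "$tun_mtu" -ge "$phys_mtu" ]; then
  echo "WARNING: tunnel MTU ($tun_mtu) >= physical MTU ($phys_mtu): overhead not accounted for"
else
  echo "physical=$phys_mtu tunnel=$tun_mtu"
fi
```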

Step 4: Check CNI Overlay MTU (Must Be Lower than Node MTU)

Why: VXLAN adds 50 bytes of overhead; Geneve adds 60 bytes; IPIP adds 20 bytes. The CNI overlay MTU must be node_MTU - overhead or packets will be silently dropped.

# Check Flannel CNI config
cat /run/flannel/subnet.env

# Check Calico VXLAN MTU config
kubectl get configmap calico-config -n kube-system -o yaml | grep -i mtu

# Check Cilium MTU
kubectl exec -n kube-system <CILIUM_POD> -- cilium config | grep mtu

# Check the actual MTU in the CNI config file
cat /etc/cni/net.d/<CNI_CONFIG_FILE>
Expected output:
# For a node with 1500 MTU using VXLAN (50 byte overhead):
# CNI MTU should be 1450 or lower
mtu: 1450
If this fails: If CNI MTU matches node MTU, encapsulation is causing packet drops. Calculate correct value:
VXLAN:         node_MTU - 50 = pod_MTU
Geneve:        node_MTU - 60 = pod_MTU
IPIP:          node_MTU - 20 = pod_MTU
WireGuard:     node_MTU - 60 = pod_MTU (use - 80 if the underlay is IPv6)
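Applied to a standard 1500-byte node MTU, the arithmetic works out as follows (a sketch; plug in your own node MTU, and note the WireGuard figure is this runbook's IPv4 value):

```shell
# Sketch: derive the pod MTU from the node MTU using the overheads above.
node_mtu=1500                    # substitute your node's physical MTU
vxlan_mtu=$((node_mtu - 50))     # VXLAN
ipip_mtu=$((node_mtu - 20))      # IPIP
wg_mtu=$((node_mtu - 60))        # WireGuard over IPv4
echo "VXLAN=$vxlan_mtu IPIP=$ipip_mtu WireGuard=$wg_mtu"
# → VXLAN=1450 IPIP=1480 WireGuard=1440
```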

Step 5: Update CNI Configuration MTU

Why: The fix is to set the correct MTU in the CNI configuration so that pods never generate packets too large for the overlay.

# For Flannel — edit the net-conf.json ConfigMap
kubectl edit configmap kube-flannel-cfg -n kube-flannel
# Find the "Backend" section and set "MTU": <CORRECT_MTU>

# For Calico — patch the calico-config ConfigMap
kubectl patch configmap calico-config -n kube-system \
  --type merge \
  -p '{"data":{"veth_mtu":"<CORRECT_MTU>"}}'

# For Cilium — update via cilium-config or Helm values
# (the exact key varies by Cilium version; list your current keys first with
#  kubectl get configmap cilium-config -n kube-system -o yaml)
kubectl patch configmap cilium-config -n kube-system \
  --type merge \
  -p '{"data":{"mtu":"<CORRECT_MTU>"}}'
Expected output:
configmap/kube-flannel-cfg edited
If this fails: If the CNI uses a Helm release for configuration, update the values file and re-run helm upgrade. Do not edit the ConfigMap directly on a Helm-managed release or it will be overwritten on the next upgrade.

Step 6: Restart CNI Pods to Apply New MTU

Why: CNI pods read MTU configuration at startup. A rolling restart is needed to apply the new value; it also re-creates tunnel interfaces with the correct MTU.

# Restart CNI DaemonSet pods one at a time to avoid network downtime
# Flannel
kubectl rollout restart daemonset kube-flannel-ds -n kube-flannel

# Calico
kubectl rollout restart daemonset calico-node -n calico-system

# Cilium
kubectl rollout restart daemonset cilium -n kube-system

# Watch rollout progress
kubectl rollout status daemonset <CNI_DAEMONSET_NAME> -n <CNI_NAMESPACE>

# Verify MTU on tunnel interface after restart
ssh <NODE_USERNAME>@<NODE_IP> ip link show <TUNNEL_INTERFACE>
Expected output:
daemonset.apps/calico-node successfully rolled out
If this fails: If rollout gets stuck (pods won't start), the new MTU value may be invalid. Check CNI logs: kubectl logs -n <CNI_NAMESPACE> <CNI_POD> --previous.

Verification

# Confirm the issue is resolved — test large packets between pods
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  ping -M do -s 1400 <REMOTE_POD_IP> -c 5

# Confirm large downloads work
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- \
  curl -o /dev/null --max-time 30 http://<SERVICE_NAME>/large-endpoint
Success looks like: Large ping succeeds with 0% packet loss. Large download completes without hanging.
If still broken: Escalate; see below.
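The verification can be collapsed into a single pass/fail gate, convenient for pasting into an incident channel. A sketch, assuming POD_NAME, NAMESPACE, and REMOTE_POD_IP are set as in the commands above:

```shell
# Sketch: one-shot verification gate. PASS only if a 1400-byte DF payload
# crosses the overlay; anything else (including kubectl errors) is FAIL.
if kubectl exec "$POD_NAME" -n "$NAMESPACE" -- \
     ping -M do -c 5 -s 1400 "$REMOTE_POD_IP" >/dev/null 2>&1; then
  result=PASS
else
  result=FAIL
fi
echo "verification: $result"
```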

Escalation

Condition Who to Page What to Say
Not resolved in 45 min Platform/Network on-call "MTU mismatch confirmed in cluster <CLUSTER_NAME>, large transfers failing cluster-wide, CNI MTU fix not resolving"
Data loss suspected SRE lead "MTU mismatch may have caused silent data truncation for services receiving large payloads"
Scope expanding to cloud network Infrastructure team "MTU issue may be at cloud provider level (jumbo frames not enabled), requires VPC/NIC configuration change"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Not accounting for encapsulation overhead: The single most common mistake is setting CNI MTU equal to node MTU. VXLAN adds 50 bytes; IPIP adds 20 bytes. The overlay MTU must always be lower than the physical MTU by at least the encapsulation overhead.
  2. Changing MTU without a rolling restart plan: Updating the ConfigMap has no effect until the CNI pods restart and re-create tunnel interfaces. Plan the rollout to avoid taking down pod networking on multiple nodes simultaneously.

Cross-References

  • Runbook: Network Partition (referenced in Quick Assessment)