Level: L2: Operations | Topics: MTU, Linux Networking Tools | Domain: Networking

Scenario: MTU Black Hole — Large Packets Silently Dropped

Situation

At 11:40 UTC, users report that the internal wiki (served over HTTPS) loads partially or hangs indefinitely. Small API responses work fine, but any page with substantial content never finishes loading. SSH to the wiki server works normally. The issue began after the infrastructure team migrated the wiki server into a new VLAN that traverses a VPN tunnel to reach the corporate network.

What You Know

  • SSH to the wiki server works (small packets)
  • Small HTTPS responses (health check endpoint returning {"status":"ok"}) work fine
  • Large HTTPS responses (actual wiki pages) hang mid-transfer or never complete
  • The wiki server was recently moved behind a site-to-site VPN tunnel (IPsec or WireGuard)
  • No recent application or OS changes on the wiki server itself
  • ICMP is filtered by an intermediate firewall (corporate security policy)

Investigation Steps

1. Confirm the problem is packet-size dependent

Command(s):

# Send pings of increasing size with Don't Fragment bit set
# -M do = set DF bit (do not fragment)
# -s = payload size (add 28 bytes for IP+ICMP headers)
ping -M do -s 1400 -c 3 wiki.internal.example.com
ping -M do -s 1450 -c 3 wiki.internal.example.com
ping -M do -s 1472 -c 3 wiki.internal.example.com
ping -M do -s 1473 -c 3 wiki.internal.example.com

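# If ping is blocked, try tracepath (UDP probes). It still depends on ICMP
# errors coming back, so hops behind the filtering firewall may show "no reply".
tracepath wiki.internal.example.com
What to look for: Pings at or below a certain size succeed. Above that size, with the DF bit set, they silently disappear (100% packet loss, no "need to fragment" ICMP error returned). The 1473-byte probe exceeds a 1500-byte local MTU, so ping rejects it locally with a "message too long" error; that is expected and is not a sign of a black hole. tracepath attempts to discover the path MTU and report where it drops, but it relies on the same ICMP errors, so output that stalls at "no reply" is itself a hint. A healthy network returns ICMP Type 3, Code 4 ("Fragmentation Needed"). If a firewall blocks that message, the sender never learns to reduce its packet size, creating a black hole.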
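
Stepping through sizes by hand works, but the threshold can also be found with a binary search. The following is a minimal sketch, assuming iputils ping on Linux and that ICMP echo itself passes through the path; `probe_size` and `find_max_payload` are hypothetical helper names, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: binary-search the largest ICMP payload that passes with DF set.
# probe_size is a hypothetical helper; override it to test without a network.

probe_size() {
  # $1 = host, $2 = ICMP payload size in bytes
  ping -M do -s "$2" -c 1 -W 1 "$1" >/dev/null 2>&1
}

find_max_payload() {
  # $1 = host, $2 = lower bound, $3 = upper bound (payload bytes)
  local host=$1 lo=$2 hi=$3 mid best=-1
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if probe_size "$host" "$mid"; then
      best=$mid              # this size passed, try larger
      lo=$(( mid + 1 ))
    else
      hi=$(( mid - 1 ))      # this size failed, try smaller
    fi
  done
  echo "$best"               # -1 means no size in the range passed
}

# Example (commented out so that sourcing this file sends no traffic):
# max=$(find_max_payload wiki.internal.example.com 1200 1472)
# echo "largest payload: $max bytes, path MTU is about $(( max + 28 ))"
```

Each probe sends a single ping, so one unrelated lost packet can skew the result; rerunning the search or raising `-c` is a reasonable safeguard. A result of -1 usually means echo requests are filtered outright rather than dropped by size.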

2. Check the local interface MTU and look for a tunnel

Command(s):

# Check MTU on all interfaces
ip link show
ip -d link show

# Check if there is a tunnel interface with reduced MTU
ip tunnel show
wg show 2>/dev/null

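# Check the route MTU (ip route does not resolve hostnames, so look up the address first)
WIKI_IP=$(getent hosts wiki.internal.example.com | awk '{print $1; exit}')
ip route get "$WIKI_IP"
ip route show to match "$WIKI_IP"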
What to look for: The server's ethernet interface likely shows MTU 1500. But if traffic goes through a tunnel (IPsec, GRE, WireGuard, VXLAN), the tunnel adds headers (20-60+ bytes overhead), reducing the effective MTU. If the tunnel interface shows MTU 1500 but the encapsulation overhead is not accounted for, packets above ~1420-1460 bytes (depending on tunnel type) will exceed the outer link's MTU.
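
The overhead arithmetic is simple enough to sketch as two shell helpers. The function names are illustrative, and the 60-byte figure in the example assumes WireGuard over an IPv4 outer header; substitute the overhead of the tunnel actually in use.

```shell
#!/usr/bin/env bash
# Sketch: derive the usable MTU inside a tunnel and the matching DF ping size.
# Function names are illustrative, not part of any tool.

effective_mtu() {
  # $1 = MTU of the outer (physical) link, $2 = encapsulation overhead in bytes
  echo $(( $1 - $2 ))
}

max_df_ping_payload() {
  # $1 = path MTU; subtract 20 bytes of IPv4 header and 8 bytes of ICMP header
  echo $(( $1 - 28 ))
}

# Example: about 60 bytes of tunnel overhead on a 1500-byte outer link
mtu=$(effective_mtu 1500 60)
echo "tunnel MTU: $mtu"
echo "largest ping -M do -s value that should pass: $(max_df_ping_payload "$mtu")"
```

If the largest payload that passes in practice is smaller than the computed value, some other hop on the path has a lower MTU than assumed.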

3. Capture traffic to confirm retransmits and stalled transfers

Command(s):

# On the wiki server, capture the HTTPS session
tcpdump -nn -i eth0 host 10.10.5.100 and port 443 -w /tmp/mtu_debug.pcap

# Trigger a large response
curl -v -o /dev/null https://wiki.internal.example.com/large-page

# Analyze the capture
tcpdump -nn -r /tmp/mtu_debug.pcap | head -50
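# Show the SYN and SYN-ACK to read the MSS each side advertised
tcpdump -nn -r /tmp/mtu_debug.pcap 'tcp[tcpflags] & tcp-syn != 0'
# Show only large packets; a sequence range that repeats here is a retransmission
tcpdump -nn -r /tmp/mtu_debug.pcap 'greater 1400'

# Check for TCP retransmits in kernel stats (on the wiki server, toward the client)
ss -ti dst 10.10.5.100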
netstat -s | grep -i retransmit
What to look for: In the tcpdump output, the TCP handshake completes (small packets). The server begins sending data. At some point, packets of ~1500 bytes are sent, then you see the same sequence number retransmitted repeatedly. The client never ACKs those large packets because they are being dropped in transit. You will see a pattern of: SYN/SYN-ACK (works), small data (works), large data segment (retransmit, retransmit, retransmit). ss -ti will show high retransmit counts on the socket.
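
Spotting repeated sequence ranges by eye gets tedious in a long capture. As a rough sketch, the helper below counts them from tcpdump's text output. It assumes the default `tcpdump -nn` line layout for IPv4 (timestamp, `IP`, source, `>`, destination, then `seq start:end`); `find_retransmits` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: list data segments whose sequence range appears more than once.
# Assumes plain `tcpdump -nn` text output on stdin (no -e, no -i any headers).

find_retransmits() {
  awk '
    {
      seq = ""
      for (i = 1; i < NF; i++) if ($i == "seq") { seq = $(i + 1); break }
      # Only data segments carry a start:end range, e.g. "seq 1449:2897,"
      if (seq ~ /^[0-9]+:[0-9]+,?$/) {
        sub(/,$/, "", seq)
        key = $3 " " seq          # $3 is the sender address.port
        count[key]++
      }
    }
    END { for (k in count) if (count[k] > 1) print count[k], k }
  ' | sort -rn
}

# Example (commented out; needs the capture file from this step):
# tcpdump -nn -r /tmp/mtu_debug.pcap | find_retransmits
```

Each output line is a count, the sender, and the sequence range. In a black-hole capture the top entries should be full-size segments sent by the server, while small segments appear only once and are therefore not listed.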

4. Verify PMTUD is broken by checking for ICMP unreachable messages

Command(s):

# Listen for ICMP "need to fragment" messages that should be coming back
tcpdump -nn -i eth0 icmp

# In another terminal, generate large packets
ping -M do -s 1472 -c 5 wiki.internal.example.com

# Check if the kernel has cached a lower PMTU (ip route needs an address, not a hostname)
ip route get "$(getent hosts wiki.internal.example.com | awk '{print $1; exit}')"
# Look for "mtu" in the output; if PMTUD worked, it would show a reduced MTU
What to look for: If PMTUD is working correctly, you would see ICMP Type 3, Code 4 messages arriving, and ip route get would show a cached lower MTU value. If the intermediate firewall is blocking all ICMP, you see nothing — no errors, no cached MTU reduction. This silence is the black hole.
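
The check for a cached PMTU can be scripted so that it yields a plain yes/no signal. This is a small sketch, assuming the iproute2 text format in which a learned or configured value appears as `mtu <n>` (or `mtu lock <n>`); `cached_pmtu` is a hypothetical helper name and `WIKI_IP` a placeholder for the server's resolved address.

```shell
#!/usr/bin/env bash
# Sketch: extract the MTU value, if any, from `ip route get` output on stdin.

cached_pmtu() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "mtu") {
          # A locked route prints "mtu lock <value>"
          value = ($(i + 1) == "lock") ? $(i + 2) : $(i + 1)
          print value
          found = 1
          exit
        }
      }
    }
    END { if (!found) print "none" }
  '
}

# Example (commented out; WIKI_IP is a placeholder for the resolved address):
# ip route get "$WIKI_IP" | cached_pmtu
```

A result of "none" after sending oversized DF packets is consistent with the ICMP errors never arriving, which matches the black-hole hypothesis.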

Root Cause

The wiki server was moved to a network segment that reaches users through a VPN tunnel (IPsec in this case). The tunnel adds 50-80 bytes of encapsulation overhead, reducing the effective path MTU to approximately 1420-1450 bytes. When the server sends a full-size 1500-byte packet (a 1460-byte TCP segment plus 40 bytes of IP and TCP headers) with the DF bit set, the tunnel endpoint cannot forward it without fragmenting it, which DF forbids. Normally, the endpoint would send back an ICMP "Fragmentation Needed" message so the server can reduce its segment size (Path MTU Discovery). However, the corporate firewall blocks all ICMP traffic, including these essential PMTUD messages. The server never learns to send smaller packets, so it retransmits the same oversized packet repeatedly until the connection times out. Small packets (SSH keystrokes, short API responses, TCP handshakes) fit within the reduced MTU and work fine.

Fix

Immediate:

# Option 1: Reduce the MTU on the server's interface to fit within the tunnel
ip link set dev eth0 mtu 1400

# Option 2: Clamp TCP MSS at the tunnel endpoint to avoid the problem
# On the Linux router/firewall performing the tunneling:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

# Option 3: If you control the tunnel, set the tunnel interface MTU correctly
ip link set dev tun0 mtu 1400

# Verify the fix
ping -M do -s 1372 -c 3 wiki.internal.example.com
curl -v -o /dev/null https://wiki.internal.example.com/large-page
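
Option 2's `--clamp-mss-to-pmtu` depends on the router knowing the correct outgoing MTU. Where it does not, the TCPMSS target also accepts an explicit value through `--set-mss`. The sketch below only prints such a rule for review and does not apply it; `mss_clamp_rule` is a hypothetical helper, and the 40-byte subtraction assumes IPv4 with no IP or TCP options.

```shell
#!/usr/bin/env bash
# Sketch: print (not apply) an iptables rule that clamps MSS to a fixed value.

mss_clamp_rule() {
  # $1 = effective path MTU in bytes (IPv4)
  local mtu=$1
  local mss=$(( mtu - 40 ))   # 20-byte IPv4 header + 20-byte TCP header
  echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $mss"
}

# Example: a 1400-byte tunnel MTU gives an MSS of 1360
mss_clamp_rule 1400
```

Printing the rule first makes it easy to review before running it as root on the tunnel endpoint. MSS clamping only helps TCP; large UDP datagrams still need a correct MTU or working PMTUD.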

Preventive:

  • Never block ICMP Type 3 (Destination Unreachable) at firewalls. At minimum, allow Type 3 Code 4 (Fragmentation Needed). This is the single most common cause of MTU black holes.
  • Configure --clamp-mss-to-pmtu on all tunnel endpoints. This rewrites the TCP MSS option during the handshake so endpoints agree on a safe segment size without relying on PMTUD.
  • Document the MTU for every network segment, especially those involving tunnels, VPNs, or overlay networks (VXLAN overhead is 50 bytes, IPsec is 50-80 bytes, GRE is 24 bytes, WireGuard is 60 bytes).
  • Add monitoring that tests large transfers, not just ping. A health check that downloads a 10KB payload will catch MTU issues that a simple connectivity check misses.
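
A large-transfer health check could look roughly like the sketch below. It assumes curl is available; `fetch_size` and `check_large_transfer` are hypothetical helper names, and the exit code 2 for failure follows the common Nagios-style convention rather than anything this scenario prescribes.

```shell
#!/usr/bin/env bash
# Sketch: fail the check when a large page cannot be downloaded in full.

fetch_size() {
  # $1 = URL, $2 = timeout in seconds (default 10)
  # Prints the number of body bytes downloaded; non-zero exit on error/timeout
  curl -sS -o /dev/null --max-time "${2:-10}" -w '%{size_download}' "$1"
}

check_large_transfer() {
  # $1 = URL, $2 = minimum expected body size in bytes
  local url=$1 min=$2 size
  if ! size=$(fetch_size "$url"); then
    echo "CRITICAL: transfer from $url failed or timed out"
    return 2
  fi
  if [ "$size" -lt "$min" ]; then
    echo "CRITICAL: got $size bytes from $url, expected at least $min"
    return 2
  fi
  echo "OK: got $size bytes from $url"
}

# Example (commented out so that sourcing this file makes no request):
# check_large_transfer https://wiki.internal.example.com/large-page 10240
```

The timeout matters as much as the size threshold: a black-holed transfer typically hangs rather than failing fast, so the check must give up and report instead of waiting indefinitely.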

Common Mistakes

  • Thinking "SSH works, so the network is fine." SSH interactive sessions send small packets that fit under the reduced MTU. The problem only manifests with larger payloads.
  • Reducing MTU too aggressively. Setting MTU to 1200 "to be safe" wastes bandwidth. Calculate the actual tunnel overhead and subtract it from 1500.
  • Blaming the application or TLS. The stalled transfer looks like an app hang or TLS negotiation failure, leading engineers down the wrong path for hours.
  • Not understanding why ICMP matters. Blocking all ICMP "for security" breaks Path MTU Discovery. This is one of the most impactful misconfigurations in corporate networks.
  • Forgetting to persist the MTU change. ip link set mtu is lost on reboot. Update the interface configuration file or networkd/netplan config.

Interview Angle

Q: HTTPS to a server hangs but SSH works. What do you check?

Good answer shape: Immediately identify this as a potential MTU/PMTUD issue because SSH uses small packets while HTTPS transfers large payloads. Explain Path MTU Discovery: when a packet is too large and has the DF bit set, routers should send back ICMP "Fragmentation Needed" so the sender can reduce segment size. If a firewall blocks that ICMP message, the sender never adapts and keeps retransmitting the same oversized packet, creating a black hole. Describe testing with ping -M do -s <size> to find the threshold, checking for tunnel interfaces that reduce effective MTU, and using tcpdump to confirm retransmissions of large segments. The fix is either reducing the interface MTU, clamping TCP MSS at the tunnel endpoint, or (ideally) allowing ICMP Type 3 Code 4 through the firewall.

