Incident Replay: MTU Blackhole — TLS Stalls¶
Setup¶
- System context: Web application behind an nginx reverse proxy. TLS handshakes stall intermittently. Small HTTP requests work but large responses hang.
- Time: Tuesday 08:30 UTC
- Your role: On-call SRE / network engineer
Round 1: Alert Fires¶
[Pressure cue: "Users report pages that partially load then hang. Small API calls work. Large page loads time out after 30 seconds."]
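What you see: Packet capture shows the TCP handshake completes and small packets flow, but full-size packets from the server (close to 1500 bytes, larger than the path can carry) never reach the client. Both a large TLS certificate flight and a large response body fall into this class. PMTUD is failing: ICMP Fragmentation Needed (type 3, code 4) messages are being blocked by a firewall.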
Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:¶
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):¶
[Result: Systematic investigation reveals the root cause. A VPN tunnel in the path has MTU 1400. Server sends 1500-byte packets with DF bit set. The tunnel endpoint sends ICMP Fragmentation Needed but the server's firewall blocks inbound ICMP. Server never learns about the lower MTU. Proceed to Round 2.]
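One way to run the investigation described above is to send pings with the DF bit set at two packet sizes, from the server toward a host on the far side of the tunnel. The sketch below assumes Linux iputils `ping` (where `-M do` sets Don't Fragment) and a target that answers ICMP echo. The function names and the host `app.example.com` are placeholders, not part of the original incident.

```shell
# ICMP payload size that yields an IPv4 packet of exactly $1 bytes:
# subtract the 20-byte IPv4 header and the 8-byte ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Probe a host with DF set, using packets of exactly the given size.
# Assumes Linux iputils ping (-M do forbids fragmentation).
probe_mtu() {
    probe_host=$1
    probe_size=$2
    ping -M do -c 3 -W 2 -s "$(ping_payload_for_mtu "$probe_size")" "$probe_host"
}

# Example (placeholder host, not run automatically):
#   probe_mtu app.example.com 1400   # fits the tunnel, replies expected
#   probe_mtu app.example.com 1500   # in this scenario: no replies, no ICMP error
```

If the 1500-byte probe fails without any fragmentation-needed error while the 1400-byte probe succeeds, that is consistent with the blocked-ICMP blackhole described in this round.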
If you chose C:¶
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:¶
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data¶
[Pressure cue: "Root cause identified. Apply the fix."]
What you see: A VPN tunnel in the path has MTU 1400. Server sends 1500-byte packets with DF bit set. The tunnel endpoint sends ICMP Fragmentation Needed but the server's firewall blocks inbound ICMP. Server never learns about the lower MTU.
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):¶
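[Result: Allow ICMP type 3 code 4 (Fragmentation Needed) through the server's firewall so PMTUD works again. Alternatively, clamp the TCP MSS on the gateway that forwards traffic into the VPN tunnel:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
MSS clamping only helps TCP; other protocols still depend on working PMTUD. Service restored and verified. Proceed to Round 3.]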
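The two fix options in the result above can be sketched as follows. This is a minimal sketch, not the definitive rule set: it assumes Linux with iptables, an IPv4 path, and TCP headers without options (so MSS = MTU - 40). The function names are hypothetical, and `--set-mss` is shown as the explicit-value alternative to `--clamp-mss-to-pmtu`.

```shell
# MSS that fits a given IPv4 path MTU: subtract 20 bytes of IPv4 header
# and 20 bytes of TCP header (no TCP options assumed).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Option 1, on the server: accept ICMP Fragmentation Needed (type 3, code 4).
# -I inserts at the top of INPUT so the rule precedes any later drop rule.
allow_frag_needed() {
    iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
}

# Option 2, on the gateway that forwards traffic into the tunnel:
# rewrite the MSS in forwarded SYN packets to a value derived from the tunnel MTU.
clamp_mss() {
    tunnel_mtu=$1
    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
        -j TCPMSS --set-mss "$(mss_for_mtu "$tunnel_mtu")"
}

# Example (requires root, not run automatically):
#   allow_frag_needed
#   clamp_mss 1400    # clamps the MSS to 1360
```

To verify, fetching a large page from a client behind the tunnel with a timeout (for example `curl -sS -o /dev/null -m 30` against the affected URL) should now complete instead of hanging.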
If you chose A:¶
[Result: Fix applied but not verified. May not be complete.]
If you chose C:¶
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:¶
[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
Round 3: Root Cause Identification¶
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see: Root cause confirmed: A VPN tunnel in the path has MTU 1400. Server sends 1500-byte packets with DF bit set. The tunnel endpoint sends ICMP Fragmentation Needed but the server's firewall blocks inbound ICMP. Server never learns about the lower MTU.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):¶
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
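A monitoring check for this condition could compare a small DF-bit probe with a full-size one and look at whether an ICMP error came back. The sketch below is an assumption about how such a check might look: it relies on Linux iputils `ping`, on its output containing "Frag needed" or "message too long" when PMTUD is working (wording may vary by version), and on hypothetical function names.

```shell
# Classify a path from two probe exit codes (0 = replies received) and
# whether the large probe reported a fragmentation-needed signal (yes/no).
classify_path() {
    rc_small=$1
    rc_large=$2
    pmtud_signal=$3
    if [ "$rc_small" -ne 0 ]; then
        echo "unreachable"
    elif [ "$rc_large" -eq 0 ]; then
        echo "ok"
    elif [ "$pmtud_signal" = "yes" ]; then
        echo "lower-mtu-reported"
    else
        echo "blackhole-suspected"
    fi
}

# Probe one host: 56-byte payload, then 1472-byte payload (1500 on the wire).
check_host() {
    target=$1
    ping -M do -c 2 -W 2 -s 56 "$target" >/dev/null 2>&1
    small=$?
    large_out=$(ping -M do -c 2 -W 2 -s 1472 "$target" 2>&1)
    large=$?
    if printf '%s\n' "$large_out" | grep -qi -e 'frag needed' -e 'message too long'; then
        signal=yes
    else
        signal=no
    fi
    echo "$target $(classify_path "$small" "$large" "$signal")"
}
```

A result of `blackhole-suspected` means large packets vanish without any ICMP feedback, which is the failure mode in this incident. `lower-mtu-reported` means the path MTU is below 1500 but PMTUD is functioning.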
If you chose A:¶
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:¶
[Result: Monitoring detects faster but does not prevent.]
If you chose C:¶
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
Round 4: Remediation¶
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
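For step 5, one possible approach is to measure the largest packet size that actually passes toward each remote site and flag paths below 1500 bytes. The sketch below binary-searches that size. It assumes success is monotonic in packet size, uses a single DF-bit ping per step (so random packet loss can skew the result), and the function names and the 576 to 1500 search bounds are illustrative choices.

```shell
# Largest size in [lo, hi] for which "$probe host size" succeeds.
# Prints "none" and returns 1 if even the lower bound fails.
find_path_mtu() {
    probe=$1
    host=$2
    lo=$3
    hi=$4
    if ! "$probe" "$host" "$lo"; then
        echo "none"
        return 1
    fi
    while [ "$lo" -lt "$hi" ]; do
        # Round the midpoint up so the loop always makes progress.
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$host" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# Real probe: one DF-bit ping whose IPv4 packet is exactly $2 bytes
# (payload = size - 28). Assumes Linux iputils ping.
ping_probe() {
    ping -M do -c 1 -W 2 -s $(( $2 - 28 )) "$1" >/dev/null 2>&1
}

# Example (placeholder host, not run automatically):
#   find_path_mtu ping_probe app.example.com 576 1500
```

Passing the probe as a function name keeps the search logic separate from the network call, so the search can be exercised with a stub before it is pointed at real hosts.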
Damage Report¶
- Total downtime: Varies based on path chosen
- Blast radius: Affected services and dependent systems
- Optimal resolution time: 12 minutes
- If every wrong choice was made: 90 minutes + additional damage
Cross-References¶
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking