Incident Replay: OSPF Stuck in ExStart
Setup
- System context: Two routers that should form an OSPF adjacency are stuck in ExStart state. They can ping each other but OSPF never reaches Full state.
- Time: Saturday 11:00 UTC
- Your role: Network engineer
Round 1: Alert Fires
[Pressure cue: "New router deployment — OSPF adjacency will not form. Routes are not being learned. The network segment is unreachable from the rest of the network."]
What you see:
`show ip ospf neighbor` shows state ExStart/DR for the peer. Debug output shows Database Description (DBD) packets being sent but never acknowledged. The DBD packets are sized for a 1500-byte MTU, but the link MTU between the routers is only 1400 bytes (GRE tunnel).
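The first symptom above can be checked mechanically. The sketch below is a minimal example, assuming neighbor output in the usual Cisco IOS column layout; the sample text, router IDs, and addresses are hypothetical illustrations, not captured from this incident.

```python
import re

# Hypothetical sample in the column layout of `show ip ospf neighbor`.
SAMPLE = """\
Neighbor ID     Pri   State           Dead Time   Address         Interface
192.0.2.2         1   EXSTART/DR      00:00:35    198.51.100.2    Tunnel0
192.0.2.3         1   FULL/BDR        00:00:38    198.51.100.6    GigabitEthernet0/1
"""

# States in which the DBD exchange has started but not completed.
STUCK_STATES = {"EXSTART", "EXCHANGE"}


def stuck_neighbors(output):
    """Return (neighbor_id, state, interface) for neighbors in ExStart or Exchange."""
    found = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a dotted-quad router ID; this skips the header row.
        if len(fields) < 6 or not re.fullmatch(r"\d+(\.\d+){3}", fields[0]):
            continue
        # The state column looks like "EXSTART/DR"; keep the part before the slash.
        state = fields[2].split("/")[0].upper()
        if state in STUCK_STATES:
            found.append((fields[0], state, fields[-1]))
    return found


print(stuck_neighbors(SAMPLE))
```

A single snapshot only shows that a neighbor is currently in ExStart; it takes repeated samples to tell a stuck adjacency from one that is still forming.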
Choose your action:
- A) Apply a quick workaround to restore service
- B) Investigate the root cause systematically
- C) Escalate to the vendor or upstream provider
- D) Check if a recent change caused the issue
If you chose A:
[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]
If you chose B (recommended):
[Result: Systematic investigation reveals the root cause. Each OSPF DBD packet carries the sender's interface MTU. When the two routers have mismatched MTU settings, the lower-MTU side rejects DBD packets that advertise a larger MTU than it can receive. The DBD exchange never completes, so OSPF never progresses past ExStart. Proceed to Round 2.]
If you chose C:
[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]
If you chose D:
[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
Round 2: First Triage Data
[Pressure cue: "Root cause identified. Apply the fix."]
What you see: Each OSPF DBD packet carries the sender's interface MTU. When the two routers have mismatched MTU settings, the lower-MTU side rejects DBD packets that advertise a larger MTU than it can receive. The DBD exchange never completes, so OSPF never progresses past ExStart.
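The failure mode described above can be modeled in a few lines. This is a simplified sketch of the receive-side MTU check, assuming the behavior in RFC 2328 section 10.6 (a DBD is rejected when its Interface MTU field exceeds what the receiving interface can accept). The function names are hypothetical, and real implementations may apply a stricter comparison.

```python
def accepts_dbd(advertised_mtu, local_mtu, mtu_ignore=False):
    """Return True if a router with local_mtu accepts a DBD advertising advertised_mtu."""
    if mtu_ignore:
        # Models the effect of `ip ospf mtu-ignore`: the MTU check is skipped.
        return True
    return advertised_mtu <= local_mtu


def adjacency_can_form(mtu_a, mtu_b, ignore_a=False, ignore_b=False):
    """Both routers must accept each other's DBD packets to leave ExStart."""
    a_accepts = accepts_dbd(mtu_b, mtu_a, ignore_a)
    b_accepts = accepts_dbd(mtu_a, mtu_b, ignore_b)
    return a_accepts and b_accepts


# The scenario in this replay: one side at 1500 bytes, the tunnel side at 1400.
print(adjacency_can_form(1500, 1400))  # False
print(adjacency_can_form(1400, 1400))  # True
```

Skipping the check on the lower-MTU router lets the adjacency form in this model, but the MTU difference itself remains, which is why matching the values is the preferred fix.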
Choose your action:
- A) Apply the targeted fix
- B) Apply the fix and verify with testing
- C) Apply a broader fix that addresses the class of problem
- D) Document and schedule the fix for the next maintenance window
If you chose B (recommended):
[Result: Match the MTU on both sides of the OSPF link. Either set both to 1400 or configure `ip ospf mtu-ignore` (not recommended as a permanent fix). For GRE tunnels, set `ip mtu 1400` on the tunnel interfaces. Service restored and verified. Proceed to Round 3.]
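As a sanity check on the 1400-byte value, the arithmetic behind a GRE tunnel MTU can be sketched as follows. The example assumes a 1500-byte physical MTU, an outer IPv4 header without options, and a basic 4-byte GRE header with no key, sequence number, or checksum; those assumptions are not stated in the scenario. The helper names are hypothetical.

```python
OUTER_IPV4_HEADER = 20  # bytes, IPv4 header without options
GRE_BASE_HEADER = 4     # bytes, GRE header without optional fields
ICMP_HEADER = 8         # bytes, ICMP echo header


def max_gre_tunnel_mtu(physical_mtu):
    """Largest inner packet that fits after adding GRE-over-IPv4 encapsulation."""
    return physical_mtu - OUTER_IPV4_HEADER - GRE_BASE_HEADER


def df_ping_payload(path_mtu):
    """ICMP payload size that exactly fills path_mtu for an IPv4 echo request."""
    return path_mtu - OUTER_IPV4_HEADER - ICMP_HEADER


print(max_gre_tunnel_mtu(1500))  # 1476
print(df_ping_payload(1400))     # 1372
```

Under these assumptions 1400 sits below the 1476-byte ceiling. A ping with the don't-fragment bit set and a 1372-byte payload (a 1400-byte IP packet) is one way to verify the fix: it should succeed across the tunnel, while a larger packet should not.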
If you chose A:
[Result: Fix applied but not verified. May not be complete.]
If you chose C:
[Result: Broader fix is correct long-term but takes longer to implement during an incident.]
If you chose D:
[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
Round 3: Root Cause Identification
[Pressure cue: "Service restored. Document and prevent recurrence."]
What you see: Root cause confirmed: the two ends of the GRE tunnel had mismatched MTU settings. OSPF DBD packets carry the sender's interface MTU, and the lower-MTU side rejected DBD packets advertising a larger MTU than it could receive, so the DBD exchange failed and OSPF never progressed past ExStart.
Choose your action:
- A) Document the fix in the runbook
- B) Add monitoring to detect this condition
- C) Add the fix to automation/configuration management
- D) All of the above
If you chose D (recommended):
[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]
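One possible shape for the monitoring piece is a check that flags neighbors that stay in ExStart or Exchange across several polls. The sketch below is illustrative only: the `Sample` record, the 60-second threshold, and the polling source are all assumptions, and a real deployment would feed it from whatever collector is already in place.

```python
from dataclasses import dataclass

STUCK_STATES = {"EXSTART", "EXCHANGE"}


@dataclass
class Sample:
    timestamp: float   # seconds since epoch, taken at poll time
    neighbor_id: str   # OSPF router ID of the neighbor
    state: str         # neighbor state seen in that poll, e.g. "EXSTART"


def stuck_alerts(samples, threshold_s=60.0):
    """Return neighbor IDs continuously in ExStart/Exchange for at least threshold_s."""
    first_seen = {}  # neighbor_id -> timestamp of the first consecutive stuck sample
    alerts = set()
    for s in sorted(samples, key=lambda x: x.timestamp):
        if s.state.upper() in STUCK_STATES:
            start = first_seen.setdefault(s.neighbor_id, s.timestamp)
            if s.timestamp - start >= threshold_s:
                alerts.add(s.neighbor_id)
        else:
            # Any other state resets the timer and clears a pending alert.
            first_seen.pop(s.neighbor_id, None)
            alerts.discard(s.neighbor_id)
    return sorted(alerts)


polls = [
    Sample(0, "192.0.2.2", "EXSTART"),
    Sample(30, "192.0.2.2", "EXSTART"),
    Sample(90, "192.0.2.2", "EXSTART"),
    Sample(0, "192.0.2.3", "EXSTART"),
    Sample(30, "192.0.2.3", "FULL"),
]
print(stuck_alerts(polls))
```

The threshold matters because every adjacency passes through ExStart briefly while forming; only a neighbor that remains there is worth an alert.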
If you chose A:
[Result: Documentation helps but relies on humans remembering to check it.]
If you chose B:
[Result: Monitoring detects faster but does not prevent.]
If you chose C:
[Result: Automation prevents recurrence but needs monitoring for edge cases.]
Round 4: Remediation
[Pressure cue: "Verify everything and close the incident."]
Actions:
1. Verify service is functioning correctly end-to-end
2. Verify monitoring detects the condition
3. Update runbooks and configuration management
4. Schedule post-mortem review
5. Check for similar issues across the infrastructure
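For the last action, a simple audit over a link inventory can surface other links with the same problem. This is a minimal sketch that assumes the MTU of each endpoint has already been collected into a list; the inventory format, link names, and device names are hypothetical.

```python
def mtu_mismatches(links):
    """Return the links whose two endpoints report different MTU values.

    links: iterable of (link_name, (device_a, mtu_a), (device_b, mtu_b)).
    """
    return [(name, a, b) for name, a, b in links if a[1] != b[1]]


# Hypothetical inventory; the first entry mirrors the link in this replay.
inventory = [
    ("tunnel0", ("rtr-a", 1500), ("rtr-b", 1400)),
    ("core-1", ("rtr-a", 1500), ("rtr-c", 1500)),
]

for name, a, b in mtu_mismatches(inventory):
    print(f"{name}: {a[0]} mtu {a[1]} != {b[0]} mtu {b[1]}")
```

Running a check like this from configuration management before a deployment would catch the mismatch before OSPF has a chance to stall on it.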
Damage Report
- Total downtime: Varies based on path chosen
- Blast radius: Affected services and dependent systems
- Optimal resolution time: 15 minutes
- If every wrong choice was made: 120 minutes + additional damage
Cross-References
- Primer: Networking
- Primer: Networking Troubleshooting
- Footguns: Networking