Incident Replay: BGP Peer Flapping¶
Setup¶
- System context: Edge router peering with an upstream ISP via eBGP. The BGP session drops and re-establishes every 5-10 minutes, causing route withdrawals and traffic blackholes during convergence.
- Time: Thursday 03:00 UTC
- Your role: Network engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Internet-facing services experiencing 30-second outages every 5-10 minutes. BGP monitoring shows peer flapping. ISP NOC is on the phone."]
What you see:
show bgp summary shows the peer uptime resetting every few minutes. The BGP neighbor state toggles between Established and Idle, and the route count drops to 0 during each flap, blackholing traffic until convergence completes.
Choose your action:
- A) Restart the BGP process on the edge router
- B) Check BGP neighbor logs for the specific disconnect reason
- C) Switch all traffic to the secondary ISP link
- D) Increase the BGP hold timer to tolerate longer gaps
If you chose B (recommended):¶
[Result: BGP logs show "Hold Timer Expired" as the disconnect reason. The peer is not sending keepalives within the 90-second hold timer. But the keepalive interval is 30 seconds — the peer should send 3 keepalives per hold period. Something is delaying or dropping the keepalives. Proceed to Round 2.]
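The timer arithmetic in that result can be sketched in Python. This is an illustrative simulation of the hold-timer rule (a session drops when no keepalive arrives within the hold time), not a model of any particular BGP implementation; the timer values are the ones from this session:

```python
# Illustrative sketch of the BGP hold-timer rule using this session's
# values: keepalives every 30 s, hold time 90 s (RFC 4271 semantics).

KEEPALIVE_INTERVAL = 30  # seconds
HOLD_TIME = 90           # seconds

# Keepalives the peer should deliver per hold period:
keepalives_per_hold = HOLD_TIME // KEEPALIVE_INTERVAL  # 3

def session_drops(arrival_times, hold_time=HOLD_TIME):
    """True if any gap between consecutive keepalive arrivals exceeds hold_time."""
    times = sorted(arrival_times)
    gaps = (b - a for a, b in zip(times, times[1:]))
    return any(gap > hold_time for gap in gaps)

# Healthy peer: a keepalive every 30 s never opens a 90 s gap.
healthy = list(range(0, 300, 30))
assert not session_drops(healthy)

# Lose three consecutive keepalives (at 60 s, 90 s, 120 s) and the gap
# between arrivals grows to 120 s: the hold timer expires.
degraded = [t for t in healthy if t not in (60, 90, 120)]
assert session_drops(degraded)
```

This is why the log line is suspicious: with a 30-second keepalive interval, a hold-timer expiry means three consecutive keepalives failed to arrive, which points at sustained loss or delay rather than a one-off drop.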
If you chose A:¶
[Result: BGP process restarts, session re-establishes, then flaps again in 5 minutes. Not a process issue.]
If you chose C:¶
[Result: Good mitigation — traffic moves to secondary. But you still need to fix the primary link for redundancy.]
If you chose D:¶
[Result: Longer hold timer might mask the issue but introduces slower convergence on real failures. Treating the symptom.]
Round 2: First Triage Data¶
[Pressure cue: "Primary ISP link flapping. Secondary is carrying all traffic. Need the primary back for redundancy."]
What you see: Interface counters on the uplink show CRC errors and input errors incrementing. The physical link is a 10GbE fiber connection. Error rate is ~50 errors/second — enough to occasionally corrupt a BGP keepalive packet and cause TCP retransmissions that exceed the hold timer.
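Why a modest CRC error rate can outlast a 90-second hold timer: keepalives ride inside the BGP TCP session, so a corrupted segment is retransmitted with exponential backoff, and an error burst that also hits the retransmits stacks the delays. A rough sketch, assuming a 1-second initial retransmission timeout (real stacks derive the RTO from measured RTT per RFC 6298):

```python
# Sketch: cumulative delay when one TCP segment and its retransmits are
# repeatedly corrupted. The RTO starts at an assumed 1 s and doubles per
# retry (exponential backoff); real values depend on the stack and RTT.

HOLD_TIME = 90   # seconds, the BGP hold timer from Round 1

elapsed = 0.0    # time with no keepalive delivered
rto = 1.0        # assumed initial retransmission timeout, seconds
losses = 0       # consecutive corrupted transmissions of the segment

while elapsed <= HOLD_TIME:
    elapsed += rto  # send, get corrupted, wait out one RTO
    rto *= 2        # back off

    losses += 1

# After 7 consecutive losses, ~127 s have elapsed and the hold timer
# has long since expired.
print(losses, elapsed)
```

Because TCP delivers in order, later keepalives queue behind the stuck segment, so one unlucky segment plus an error burst on its retransmits is enough; that fits the intermittent 5-10 minute flap pattern rather than a hard-down link.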
Choose your action:
- A) Replace the SFP+ optic on the router side
- B) Check the fiber patch cable and clean the connectors
- C) Request the ISP to check their side of the link
- D) Both B and C — clean your side and ask ISP to check theirs
If you chose D (recommended):¶
[Result: Fiber connectors cleaned with an IBC cleaner — error rate drops from 50/sec to 10/sec. The ISP checks their optic and finds a third-party module that is overheating. They replace it; the error rate drops to 0 and the BGP session stabilizes. Proceed to Round 3.]
If you chose A:¶
[Result: New SFP+ on your side does not help — the errors are coming from the ISP's transmit side.]
If you chose B:¶
[Result: Cleaning reduces errors but does not eliminate them. The ISP side also needs attention.]
If you chose C:¶
[Result: ISP finds and fixes their optic, but your dirty connector still contributes errors. Both sides need attention.]
Round 3: Root Cause Identification¶
[Pressure cue: "Link clean. BGP stable for 30 minutes. Document."]
What you see: Root cause identified. Dirty fiber connectors on our side and a failing third-party SFP+ on the ISP side together produced enough CRC errors to occasionally corrupt the TCP segments carrying BGP keepalives; when keepalives were lost to corruption, the hold timer expired.
Choose your action:
- A) Implement BFD (Bidirectional Forwarding Detection) for faster failure detection
- B) Add interface error rate monitoring and alerting
- C) Request ISP uses certified optics and schedule regular fiber cleaning
- D) All of the above
If you chose D (recommended):¶
[Result: BFD enables sub-second failure detection. Error monitoring catches degradation before it causes BGP flaps. Fiber maintenance schedule prevents connector-related issues. Proceed to Round 4.]
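The BFD payoff in that result is simple arithmetic: detection time is the negotiated transmit interval times the detect multiplier (RFC 5880). The interval values below are common choices, assumed for illustration:

```python
# BFD detection time = tx interval x detect multiplier (RFC 5880).
# 300 ms x 3 is a common, conservative choice; aggressive deployments
# go lower. Values here are assumptions for illustration.

tx_interval_ms = 300
detect_multiplier = 3

detection_time_ms = tx_interval_ms * detect_multiplier  # 900 ms

# Compare with relying on the plain BGP hold timer from this incident:
bgp_hold_ms = 90 * 1000
speedup = bgp_hold_ms / detection_time_ms  # 100x faster detection
```

Note the trade-off from Round 1's option D in reverse: BFD gives fast detection without loosening BGP's own timers, so convergence on real failures gets faster, not slower.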
If you chose A:¶
[Result: BFD detects failures faster but does not prevent them.]
If you chose B:¶
[Result: Monitoring catches errors early but does not fix the optics issue.]
If you chose C:¶
[Result: Prevents this specific cause but other physical layer issues could still occur.]
Round 4: Remediation¶
[Pressure cue: "BGP stable. Verify convergence and close."]
Actions:
1. Verify BGP session uptime: show bgp summary — uptime >1 hour
2. Verify zero interface errors: show interface counters errors
3. Move traffic back to primary ISP link (if it was shifted to secondary)
4. Configure BFD on the BGP session
5. Set up interface error rate alerting at 10 errors/minute
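Step 5 reduces to a counter-delta computation. A minimal sketch, with the counter source left abstract (in practice it would come from SNMP ifInErrors, gNMI telemetry, or a CLI scrape; the function names here are illustrative):

```python
# Sketch of the error-rate alert from step 5. Cumulative interface
# error counters are polled; the delta is normalized to errors/minute
# and compared against the runbook threshold.

THRESHOLD_PER_MIN = 10  # errors/minute, from step 5

def errors_per_minute(prev_count: int, curr_count: int, poll_seconds: float) -> float:
    """Normalize a cumulative-counter delta to an errors/minute rate."""
    return (curr_count - prev_count) * 60.0 / poll_seconds

def should_alert(prev_count: int, curr_count: int, poll_seconds: float) -> bool:
    return errors_per_minute(prev_count, curr_count, poll_seconds) > THRESHOLD_PER_MIN

# A clean link is quiet; the degraded link from Round 2 (~50 errors/sec
# = 3000 errors/minute) trips the alert on the first poll.
assert not should_alert(1000, 1000, 60)
assert should_alert(1000, 1000 + 50 * 60, 60)
```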
Damage Report¶
- Total downtime: Intermittent 30-second blackholes over 3 hours
- Blast radius: All internet-facing traffic via the primary ISP link
- Optimal resolution time: 20 minutes (read BGP logs -> check interface errors -> clean + ISP fix)
- If every wrong choice was made: 3+ hours of flapping plus delayed ISP coordination
Cross-References¶
- Primer: Routing
- Primer: Networking
- Footguns: Routing
- Footguns: Networking