Solution: BGP Peer Flapping

Summary

The BGP session flaps because the hold timer is set to an aggressive 15 seconds (keepalive 5 seconds), and the physical link is experiencing intermittent packet loss due to a faulty media converter. When 3 consecutive keepalives are lost, nothing arrives from the peer within the hold time, so the hold timer expires and the session drops. If the loss arrives in bursts, an average loss rate of only ~0.1% is enough to trigger this. A 90-second hold timer (the value suggested in RFC 4271; Cisco IOS defaults to 180 seconds) would tolerate this level of loss without issue.
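
As a rough check on how much loss it takes, the sketch below assumes each keepalive is lost independently with a fixed probability and estimates the mean time until every keepalive in one hold window is lost. The independence assumption, the function name, and the 30% burst figure are illustrative and were not measured on this link.

```python
def expected_seconds_to_expiry(loss_rate: float, hold_time_s: float, keepalive_s: float) -> float:
    """Mean time until all keepalives in one hold window are lost in a row.

    Assumes independent loss with probability `loss_rate` per keepalive and
    uses the standard expected waiting time for a run of k consecutive events.
    """
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    k = round(hold_time_s / keepalive_s)   # keepalives per hold window (3 for 15s/5s)
    p_run = loss_rate ** k                 # chance that k specific keepalives are all lost
    expected_keepalives = (1.0 - p_run) / ((1.0 - loss_rate) * p_run)
    return expected_keepalives * keepalive_s


uniform = expected_seconds_to_expiry(0.001, hold_time_s=15, keepalive_s=5)
in_burst = expected_seconds_to_expiry(0.30, hold_time_s=15, keepalive_s=5)  # hypothetical burst loss rate
print(f"0.1% uniform loss: ~{uniform / 31_536_000:.0f} years between expiries")
print(f"30% loss during a burst: ~{in_burst / 60:.1f} minutes between expiries")
```

Under these assumptions, 0.1% loss spread evenly would almost never drop three keepalives in a row, while a much higher loss rate sustained during a burst gets there within minutes. The burstiness of the loss matters more than its average rate.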

Senior Workflow

Step 1: Confirm BGP session status and flapping history

rtr-edge-01# show bgp summary
# Look for: neighbor state, uptime (if very short, it just recovered), number of state changes

rtr-edge-01# show bgp neighbor 192.168.100.2
# Look for: Hold time, Keepalive interval, Last reset reason

Step 2: Review BGP event log

rtr-edge-01# show log | include BGP
# Look for pattern of:
#   "NOTIFICATION sent - hold time expired"
#   "BGP neighbor 192.168.100.2 Down"
#   "BGP neighbor 192.168.100.2 Up"
# repeating every 2-5 minutes
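
If the log is exported to a file, the flap interval can be measured rather than eyeballed. The sketch below assumes a hypothetical export format in which each line starts with an ISO 8601 timestamp followed by the message text; real syslog formats differ, so the parsing would need adapting.

```python
from datetime import datetime


def down_intervals(log_lines, neighbor):
    """Return the seconds between consecutive 'Down' events for one neighbor."""
    marker = f"BGP neighbor {neighbor} Down"
    times = [
        datetime.fromisoformat(line.split(maxsplit=1)[0])  # first token is the timestamp
        for line in log_lines
        if marker in line
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Hypothetical sample lines; the timestamps are made up for illustration.
SAMPLE_LOG = [
    "2024-05-01T10:00:00 BGP neighbor 192.168.100.2 Down",
    "2024-05-01T10:00:20 BGP neighbor 192.168.100.2 Up",
    "2024-05-01T10:03:10 BGP neighbor 192.168.100.2 Down",
    "2024-05-01T10:03:35 BGP neighbor 192.168.100.2 Up",
    "2024-05-01T10:07:40 BGP neighbor 192.168.100.2 Down",
]

print(down_intervals(SAMPLE_LOG, "192.168.100.2"))
```

Intervals that cluster in the 120 to 300 second range would match the 2-5 minute pattern described above.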

Step 3: Check the physical link health

rtr-edge-01# show interface GigabitEthernet0/1
# Look for: CRC errors, input errors, output errors
# Calculate error rate: errors / time since the counters were last cleared

rtr-edge-01# show interface GigabitEthernet0/1 counters errors
# Check if errors are still incrementing:
# Run twice with 60 seconds gap, compare counts
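
The two-sample comparison comes down to simple arithmetic. A minimal sketch, with a hypothetical helper name and made-up counter values:

```python
def errors_per_hour(first_count: int, second_count: int, gap_seconds: float) -> float:
    """Convert two readings of an error counter into an hourly error rate."""
    if gap_seconds <= 0:
        raise ValueError("gap_seconds must be positive")
    if second_count < first_count:
        raise ValueError("counter went backwards; it was probably cleared between samples")
    return (second_count - first_count) * 3600 / gap_seconds


# Hypothetical readings taken 10 minutes apart.
print(errors_per_hour(1200, 1208, gap_seconds=600))
```

At a rate of around 50 errors per hour, a 60-second gap sees fewer than one new error on average, so a single unchanged pair of readings does not prove the link is clean. Repeating the check or using a longer gap gives a more trustworthy rate.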

Step 4: Correlate keepalive loss with link errors

With a 5-second keepalive interval:
- 50 CRC errors/hour = ~1 error every 72 seconds
- If errors come in bursts (common with bad media converters), 3+ errors in 15 seconds is plausible
- That means 3 consecutive keepalives could be lost within one hold period
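
To gauge how likely 3+ errors in one 15-second window would be without bursts, the sketch below treats errors as a Poisson process at the average rate. That model is an assumption made for comparison only. It also ignores that each error would have to hit a keepalive rather than some other frame.

```python
import math


def prob_at_least(k: int, errors_per_hour: float, window_s: float) -> float:
    """P(at least k errors in a window) if errors arrive as a Poisson process."""
    lam = errors_per_hour * window_s / 3600  # expected errors per window
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer


p = prob_at_least(3, errors_per_hour=50, window_s=15)
windows_per_hour = 3600 / 15
print(f"P(3+ errors in 15 s) = {p:.5f}")
print(f"Expected qualifying windows per hour = {p * windows_per_hour:.2f}")
```

With evenly scattered errors this comes out well below one qualifying window per hour, far less often than a flap every 2-5 minutes. That gap is consistent with the errors arriving in bursts.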

Step 5: Fix the physical layer

# Request ISP to investigate/replace the media converter
# Check SFP module on both sides
# Verify cable integrity

# Test after media converter replacement:
rtr-edge-01# clear counters GigabitEthernet0/1
# Wait 1 hour, check error rate

Step 6: Increase hold timer to a reasonable value

rtr-edge-01(config)# router bgp 65001
rtr-edge-01(config-router)# neighbor 192.168.100.2 timers 30 90
# keepalive 30 seconds, hold time 90 seconds

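Coordinate with the ISP to match timers on their side. The session uses the lower of the two proposed hold times, so if the ISP still proposes 15 seconds the session keeps a 15-second hold time. Timers are negotiated when the session is established, so the new values apply only after the session resets: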

rtr-isp-01(config)# router bgp 64512
rtr-isp-01(config-router)# neighbor 192.168.100.1 timers 30 90
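
The negotiation rule explains why a one-sided change has no effect. The following sketch models the hold-time selection described in RFC 4271 (the smaller of the two proposed values, each either 0 or at least 3 seconds) and the customary keepalive of one third of the hold time. The function names are illustrative, and platform-specific keepalive handling is not modelled.

```python
def negotiated_hold_time(local_hold_s: int, peer_hold_s: int) -> int:
    """Hold time a session ends up with: the smaller of the two proposals."""
    for value in (local_hold_s, peer_hold_s):
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    return min(local_hold_s, peer_hold_s)


def suggested_keepalive(hold_s: int) -> int:
    """Customary keepalive interval: one third of the hold time."""
    return hold_s // 3


# Only our side changed: the session still runs with the aggressive timer.
print(negotiated_hold_time(90, 15), suggested_keepalive(negotiated_hold_time(90, 15)))
# Both sides changed.
print(negotiated_hold_time(90, 90), suggested_keepalive(negotiated_hold_time(90, 90)))
```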

Step 7: Consider BFD for fast failover (if needed)

If sub-second failover is truly required:

rtr-edge-01(config)# interface GigabitEthernet0/1
rtr-edge-01(config-if)# bfd interval 300 min_rx 300 multiplier 3
# 300 ms interval x multiplier 3 = roughly 900 ms detection time
rtr-edge-01(config-if)# exit
rtr-edge-01(config)# router bgp 65001
rtr-edge-01(config-router)# neighbor 192.168.100.2 fall-over bfd

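On many platforms BFD is handled in the data plane or in hardware rather than by the routing process, which makes it more reliable than BGP keepalives for fast detection. BFD must also be enabled on the ISP side for the session to come up.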
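
For comparison with the BGP hold timer, the BFD detection time in asynchronous mode can be worked out from the configured values. This sketch follows the rule in RFC 5880 (the remote multiplier times the agreed interval) and assumes the ISP side mirrors the 300 ms / multiplier 3 settings, which has not been confirmed.

```python
def bfd_detection_time_ms(local_min_rx_ms: int, remote_min_tx_ms: int, remote_multiplier: int) -> int:
    """Local detection time: remote multiplier x agreed receive interval."""
    agreed_interval_ms = max(local_min_rx_ms, remote_min_tx_ms)
    return agreed_interval_ms * remote_multiplier


bfd_ms = bfd_detection_time_ms(local_min_rx_ms=300, remote_min_tx_ms=300, remote_multiplier=3)
hold_ms = 15 * 1000  # the aggressive BGP hold timer from this scenario
print(f"BFD detects failure in ~{bfd_ms} ms versus {hold_ms} ms for the 15 s hold timer")
```

With these values BFD detects a failure in under a second, so the BGP hold timer can stay at 90 seconds without giving up fast failover.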

Step 8: Verify stability

# Monitor the session for 24 hours
rtr-edge-01# show bgp neighbor 192.168.100.2 | include uptime
# Uptime should grow steadily without resets

# Check for any remaining flaps:
rtr-edge-01# show log | include BGP | include Down
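
If the uptime is sampled on a schedule during the 24-hour watch, resets can be spotted mechanically: between two samples the uptime should grow by about the sampling interval. A minimal sketch, assuming uptime has already been converted to seconds; the sample values and the tolerance are hypothetical.

```python
def find_resets(uptime_samples_s, interval_s, tolerance_s=30):
    """Return indexes of samples whose uptime grew by less than the sampling interval."""
    return [
        i
        for i in range(1, len(uptime_samples_s))
        if uptime_samples_s[i] - uptime_samples_s[i - 1] < interval_s - tolerance_s
    ]


# Hypothetical hourly samples: the session reset shortly before the fourth reading.
samples = [600, 4200, 7800, 120, 3720]
print(find_resets(samples, interval_s=3600))
```

Comparing growth against the interval, rather than only checking for a decrease, also catches a reset that happens when the previous uptime was still small.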

Common Pitfalls

Aggressive timers on imperfect links: With a 15-second hold timer, a single burst that drops 3 consecutive keepalives kills the session. This is counterproductive -- the repeated session drops cause more outage than slower detection of a real failure would.
Only fixing timers, not the physical layer: The media converter is introducing errors. Fix the underlying cause, not just the sensitivity.
Only fixing the physical layer, not the timers: Even after fixing the media converter, aggressive timers are risky. Use BFD if you need fast detection.
Not coordinating with the ISP: BGP timer changes must be agreed upon with the peer. The lower of the two proposed hold times is used.
Ignoring route dampening: If this router has downstream eBGP peers, the flapping routes will propagate. Implement dampening on downstream routers.
Confusing BGP keepalives with BFD: BGP keepalives are control-plane packets processed by the CPU. Under CPU load, they can be delayed. BFD is typically handled in the data plane or in hardware and is far more reliable for fast detection.
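
On the dampening point, a short sketch shows how quickly a peer that flaps every few minutes would be suppressed downstream. It uses the classic penalty model with exponential decay. The parameter values (1000 per flap, 15-minute half-life, suppress limit 2000) are commonly cited defaults used here as assumptions, so check the actual configuration of the downstream routers.

```python
def flaps_until_suppressed(
    flap_interval_s: float,
    penalty_per_flap: float = 1000.0,  # assumed value
    half_life_s: float = 900.0,        # assumed 15-minute half-life
    suppress_limit: float = 2000.0,    # assumed suppress threshold
    max_flaps: int = 100,
):
    """Return the flap count at which the penalty reaches the suppress limit, or None."""
    penalty = 0.0
    for flap in range(1, max_flaps + 1):
        penalty += penalty_per_flap
        if penalty >= suppress_limit:
            return flap
        # Penalty halves every half_life_s while the route stays stable.
        penalty *= 0.5 ** (flap_interval_s / half_life_s)
    return None


print(flaps_until_suppressed(180))   # a flap every 3 minutes
print(flaps_until_suppressed(3600))  # a flap once an hour
```

Under these assumed values, a route flapping every 3 minutes crosses the suppress limit within a handful of flaps, while hourly flaps decay away before they accumulate.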