
Postmortem: BGP Route Leak Sends Customer Traffic Through Monitoring VLAN

| Field | Value |
|---|---|
| ID | PM-013 |
| Date | 2025-07-22 |
| Severity | SEV-2 |
| Duration | 28m (change applied to mitigation) |
| Time to Detect | 7m |
| Time to Mitigate | 28m |
| Customer Impact | ~15% of customer sessions experienced 40% packet loss for 21 minutes; geographically concentrated in the US-East region; estimated 9,200 affected sessions |
| Revenue Impact | ~$22,000 estimated (degraded transaction completion rate × duration) |
| Teams Involved | Network Engineering, SRE, Security Operations |
| Postmortem Author | Jerome Adeyemi |
| Postmortem Date | 2025-07-25 |

Executive Summary

On July 22, 2025, a network engineer added a new BGP neighbor for the monitoring infrastructure but misconfigured the outbound route-map, causing the monitoring VLAN's /24 prefix (10.99.4.0/24) to be advertised to two transit peers instead of being restricted to the internal iBGP mesh. Transit routers began forwarding approximately 15% of customer-bound traffic through the monitoring VLAN switches, which are sized for low-bandwidth observability traffic and cannot handle production load. Packet loss on the affected path reached 40% within minutes. The misconfiguration was detected via customer-facing latency alerts 7 minutes after the change was applied. The fix — removing the misconfigured route-map statement and issuing a BGP soft reset — took 21 additional minutes due to change-authorization delays. The root cause was a copy-pasted route-map that was not correctly scoped for a transit-peer context, combined with the absence of prefix-list guardrails on outbound transit advertisements.

Timeline (All times UTC)

| Time | Event |
|---|---|
| 10:44:00 | Network engineer Yuki Nakamura applies BGP neighbor configuration for new monitoring peer (10.99.4.254) on core router cr-use1-01; configuration includes outbound route-map RM-MONITORING-OUT copied from peer RM-INTERNAL-OUT |
| 10:44:10 | BGP session to new monitoring peer establishes; RM-MONITORING-OUT incorrectly permits all prefixes matched by PL-INTERNAL, including 10.99.4.0/24 |
| 10:44:15 | cr-use1-01 advertises 10.99.4.0/24 to transit peers transit-peer-01 (AS 7018) and transit-peer-02 (AS 3356) |
| 10:44:30 | Transit peers begin routing traffic destined for 10.99.4.x addresses through cr-use1-01; monitoring VLAN switches begin receiving unexpected production-volume traffic |
| 10:45:00 | Monitoring VLAN switch mon-sw-02 CPU climbs to 78% (normal: <10%); begins tail-dropping packets on overloaded interfaces |
| 10:46:20 | Customer-facing packet loss begins; approximately 15% of US-East sessions affected |
| 10:48:00 | Packet loss on monitoring VLAN path reaches 38–42% sustained |
| 10:51:00 | SRE on-call Ingrid Svensson receives PagerDuty alert: "US-East p99 latency > 800ms for 120 seconds" |
| 10:51:45 | Ingrid begins investigation; observes latency spikes localized to US-East; rules out application-layer cause |
| 10:52:30 | Ingrid pages Network Engineering; Yuki Nakamura responds within 90 seconds |
| 10:53:15 | Yuki identifies recent BGP change as potential cause; pulls up cr-use1-01 configuration |
| 10:54:00 | Yuki confirms 10.99.4.0/24 is being advertised to transit peers; identifies route-map misconfiguration |
| 10:55:00 | Yuki prepares corrective configuration: modify RM-MONITORING-OUT to deny 10.99.4.0/24 outbound, permit only 10.0.0.0/8 to internal peers |
| 10:58:00 | Change authorization process begins; Yuki submits emergency change request to Network Change Advisory Board (NCAB) Slack channel |
| 11:02:00 | NCAB approves emergency change (4-minute review); Yuki given go-ahead |
| 11:03:30 | Corrected route-map applied to cr-use1-01; BGP soft reset (clear ip bgp * soft out) issued to transit peers |
| 11:04:00 | Transit peer transit-peer-01 withdraws 10.99.4.0/24 route from its routing table |
| 11:05:10 | Transit peer transit-peer-02 withdraws route; traffic no longer routed through monitoring VLAN |
| 11:05:30 | Packet loss drops to 0%; monitoring VLAN switch CPU returns to <5% |
| 11:12:00 | SRE confirms customer-facing metrics fully recovered; incident declared mitigated |
| 11:45:00 | Security Operations confirms no unauthorized access occurred via the misrouted traffic path |

Impact

Customer Impact

Approximately 9,200 customer sessions in the US-East region experienced 38–42% packet loss for 21 minutes. For TCP-based workloads (the majority), TCP retransmission masked some of the packet loss at the cost of increased latency — median round-trip time for affected sessions climbed from 18ms to 210ms. For any customers using UDP-based features (video streaming preview, WebRTC), the packet loss was unmasked and sessions degraded or dropped entirely. Approximately 280 payment transactions were retried due to timeout errors during the window.

Internal Impact

  • Network Engineering: 2 engineers × 2 hours (configuration correction, root cause analysis) = 4 engineering-hours
  • SRE: 2 engineers × 1.5 hours (incident response, monitoring) = 3 engineering-hours
  • Security Operations: 1 engineer × 2 hours (traffic analysis, unauthorized access review) = 2 engineering-hours
  • Two planned network change windows were postponed due to the post-incident change freeze

Data Impact

No data was lost or corrupted. The monitoring VLAN switches dropped packets via tail-drop, which discards packets cleanly without corrupting in-flight data. Security Operations' review of packet captures on the monitoring VLAN switches confirmed that no customer payload data was logged, stored, or accessible via the monitoring infrastructure — the traffic was simply forwarded (and partially dropped) without inspection.

Root Cause

What Happened (Technical)

Yuki Nakamura was adding a new BGP neighbor to cr-use1-01 to peer with the monitoring infrastructure's route reflector at 10.99.4.254. The intent was to allow the monitoring systems to receive internal BGP routing table information (for topology visualization) via iBGP, advertising only internal prefixes within the AS.

The outbound route-map RM-MONITORING-OUT was created by copying RM-INTERNAL-OUT, which is used for internal iBGP peers. RM-INTERNAL-OUT permits all prefixes matching ip prefix-list PL-INTERNAL (which includes 10.0.0.0/8) and has an implicit deny for everything else. However, Yuki modified only the neighbor statement and the route-map name, not the permit clause. The PL-INTERNAL prefix list includes 10.99.4.0/24 because monitoring infrastructure is part of the 10.0.0.0/8 internal supernet.
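
Reconstructed from the description above (the exact syntax is an assumption in IOS-XE style; sequence numbers and the le modifier are illustrative), the copied policy would have looked roughly like:

```
! PL-INTERNAL covers the entire internal supernet, so it also matches the
! monitoring VLAN's 10.99.4.0/24.
ip prefix-list PL-INTERNAL seq 10 permit 10.0.0.0/8 le 32
!
! RM-MONITORING-OUT was copied from RM-INTERNAL-OUT with only the name
! changed; the permit clause still matches everything in PL-INTERNAL.
route-map RM-MONITORING-OUT permit 10
 match ip address prefix-list PL-INTERNAL
! (implicit deny for everything else)
```

Because the permit clause was never re-scoped, the route-map was only safe in an internal iBGP context, not a transit-facing one.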

The misconfiguration manifested when cr-use1-01's route redistribution logic applied RM-MONITORING-OUT to the route table it exports to transit peers transit-peer-01 and transit-peer-02. These transit peers are configured in a different BGP peer group (TRANSIT-PEERS) than internal peers, but the route-map was attached at the neighbor level, not the peer-group level. Yuki attached RM-MONITORING-OUT to the new monitoring neighbor with the command neighbor 10.99.4.254 route-map RM-MONITORING-OUT out, which on this platform (Cisco IOS-XE) applies to all sessions in the peer group when the peer group inherits route policies from a shared template. The monitoring peer was inadvertently added to the TRANSIT-PEERS peer group because the configuration template Yuki copied already contained neighbor 10.99.4.254 peer-group TRANSIT-PEERS.
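
Based on that description, the copied neighbor stanza would have contained both lines below (a reconstruction, not the actual running config):

```
! Carried over from the copied template: this line places the monitoring
! peer in the transit peer group.
neighbor 10.99.4.254 peer-group TRANSIT-PEERS
!
! The intended change: attach the outbound route-map. With template-inherited
! route policy, this ended up affecting all TRANSIT-PEERS sessions.
neighbor 10.99.4.254 route-map RM-MONITORING-OUT out
```

The first line is the one a structural diff review would have flagged, since it was never part of the change's intent.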

The result: transit peers received the 10.99.4.0/24 advertisement, treated it as a reachable customer prefix, and began routing traffic to it. The monitoring VLAN switches at 10.99.4.0/24 were not provisioned to handle production-volume traffic — they are 1G access switches with 200Mbps of monitoring traffic at peak. Production traffic in the affected path ran at approximately 3.2 Gbps, causing immediate queue saturation and tail-drop.

Contributing Factors

  1. No outbound prefix-list on transit peer sessions: Transit peers should only ever receive prefixes that are explicitly authorized for public advertisement — typically only the organization's own BGP-announced address blocks (in this case, the public /20 allocations). An outbound prefix-list on the TRANSIT-PEERS peer group restricting advertisements to the public prefix list would have blocked 10.99.4.0/24 from ever being advertised to transit, regardless of route-map misconfiguration. No such prefix-list existed.

  2. Configuration was copy-pasted without a diff review: The configuration change was written by copying an existing peer block and modifying fields. The copy-paste carried over the peer-group TRANSIT-PEERS assignment, which is the proximate cause of the transit advertisement. A diff-based review (comparing intended config to previous running config) would have surfaced this extra line. The change was applied without a second engineer reviewing the full neighbor stanza.

  3. Change was made during business hours without peer review: The network team's change management policy requires a second engineer to review BGP configuration changes before application. This requirement applies to all planned changes but has a carve-out for "minor additions" such as adding a new neighbor — a carve-out that Yuki applied to this change. BGP neighbor additions are not minor when they involve peer-group assignments.
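
As a sketch of the guardrail described in factor 1 (hypothetical names; 203.0.113.0/24 is a documentation prefix standing in for the organization's real public /20 allocations):

```
! Last-line outbound filter on transit sessions, independent of any route-map.
! Only explicitly authorized public prefixes can reach transit peers.
ip prefix-list PL-PUBLIC-ANNOUNCE seq 10 permit 203.0.113.0/24
!
neighbor TRANSIT-PEERS prefix-list PL-PUBLIC-ANNOUNCE out
```

With this in place, 10.99.4.0/24 would have been filtered before advertisement regardless of how RM-MONITORING-OUT was scoped.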

What We Got Lucky About

  1. The monitoring VLAN switches degraded gracefully via tail-drop rather than crashing outright. Tail-drop means packets are discarded cleanly at the queue boundary, which is a detectable failure mode — latency spikes and packet loss counters rose predictably, which is what triggered the alert. If the switches had crashed, the failure would have been a hard black hole rather than a soft degradation, potentially more severe and harder to diagnose.
  2. The monitoring switches did not log or capture the customer traffic that transited them. The monitoring platform uses passive SNMP polling and IPFIX flow exports, neither of which captures payload data. A monitoring system that did packet capture (e.g., one running Wireshark sessions) could have created a data exposure incident on top of the availability incident.

Detection

How We Detected

Detection came from a PagerDuty alert on US-East p99 HTTP latency exceeding 800ms for 120 consecutive seconds. This alert fired 7 minutes after the BGP change was applied. Secondary confirmation came from a packet loss alert on the SRE synthetic monitoring probes that send ICMP pings to customer-facing endpoints every 10 seconds — this alert would have fired 2 minutes later if the latency alert had not come first.

Why We Didn't Detect Sooner

The 7-minute detection gap had two causes. First, the latency alert requires 120 seconds of sustained threshold breach before firing, a deliberate choice to reduce alert noise from brief traffic spikes. Second, the first 90 seconds of the incident were dominated by packet loss that TCP's retransmission mechanism partially masked; p99 latency did not cross the 800ms threshold until approximately T+5m as retransmit queues accumulated. A direct alert on BGP prefix advertisement changes (via the BGP Monitoring Protocol or route change alerting) would have fired within seconds of the misconfiguration.
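
Until advertisement alerting exists, what is actually being advertised to a transit session can be checked directly on the router (standard IOS show command; the peer address below is a placeholder):

```
! Lists every prefix currently advertised to the given transit peer.
! Anything outside the authorized public allocations indicates a leak.
show ip bgp neighbors 192.0.2.1 advertised-routes
```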

Response

What Went Well

  1. Yuki self-identified the probable cause (the BGP change applied 7 minutes prior) within 90 seconds of being paged, dramatically shortening the time to fix.
  2. The NCAB emergency change process worked as designed: a 4-minute review cycle approved the corrective configuration, which is fast enough to be useful during an incident without being a rubber stamp.
  3. The BGP soft reset (clear ip bgp * soft out) propagated the route withdrawal to both transit peers within 70 seconds, which is well within the BGP convergence expectations for this topology.
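
The corrective change applied at 11:03:30, as described in the timeline, would look roughly like the following (a reconstruction; the sequence numbers and the PL-MON-VLAN name are assumptions):

```
! Explicitly deny the monitoring VLAN prefix ahead of the copied permit clause.
ip prefix-list PL-MON-VLAN seq 10 permit 10.99.4.0/24
!
route-map RM-MONITORING-OUT deny 5
 match ip address prefix-list PL-MON-VLAN
!
! Re-send outbound updates to all peers without tearing down the sessions.
clear ip bgp * soft out
```

The soft reset is what propagated the withdrawal; a hard reset would have flapped every session and extended the outage.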

What Went Poorly

  1. The 4-minute NCAB authorization delay, while within policy, added friction during an incident where the root cause and fix were already known. An emergency change track with pre-authorized revert patterns (e.g., "revert BGP neighbor config to previous state") would eliminate this delay for rollback operations.
  2. The absence of BGP advertisement monitoring meant the team had no alert on the specific failure mechanism (unexpected prefix advertisement to transit). The detection path went through customer-facing latency, which adds multiple minutes of lag compared to a direct control-plane alert.
  3. Post-incident, Security Operations took 33 minutes to confirm that the monitoring VLAN switches had not logged customer traffic. The documentation for the monitoring platform's data collection capabilities was incomplete, requiring manual investigation. A data flow map documenting what monitoring systems capture would have resolved this question in under 5 minutes.

Action Items

| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM-013-01 | Add outbound prefix-list to TRANSIT-PEERS peer group on all core routers restricting transit advertisements to explicitly authorized public BGP prefixes; test in staging first | P0 | Network Engineering (Jerome Adeyemi) | In Progress | 2025-07-28 |
| PM-013-02 | Implement BGP route change alerting using RIPE NCC BGP monitoring or equivalent; alert on any new prefix advertisement to transit peers within 60 seconds | P0 | Network Engineering (Jerome Adeyemi) | Planned | 2025-08-08 |
| PM-013-03 | Update change management policy: remove "minor addition" carve-out for BGP neighbor additions; all BGP changes require peer review regardless of perceived scope | P1 | Network Engineering (Yuki Nakamura) | Done | 2025-07-23 |
| PM-013-04 | Create pre-authorized revert playbook for BGP changes: any change within 30 minutes can be reverted by on-call network engineer without NCAB approval, logged post-hoc | P1 | Network Engineering, SRE (Ingrid Svensson) | Planned | 2025-08-05 |
| PM-013-05 | Document monitoring platform data collection capabilities in a data flow map; include in Security Operations runbooks | P2 | Security Operations | Planned | 2025-08-15 |

Lessons Learned

  1. Prefix-lists on transit peers are a non-negotiable baseline control: Route-maps control what gets advertised in complex ways, but they can be misconfigured. A prefix-list on transit peer sessions acts as a final safety net that is independent of route-map logic — it enforces the invariant "only our public prefixes go to transit" at the policy level. This should be treated as a foundational BGP hygiene requirement.

  2. Copy-paste in network configuration requires a structural diff review, not just intent review: When configuring a new BGP neighbor by copying an existing block, the review must compare the resulting configuration line-by-line against the previous running configuration. Intent-level review ("I'm adding a monitoring peer") is insufficient because copied configuration carries implicit settings (like peer-group assignments) that may not be obvious.
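
On IOS platforms with the config archive feature enabled, a structural diff can be produced on-box before and after a change (a sketch; the archive file path is illustrative and depends on local archive configuration):

```
! Compare an archived pre-change snapshot against the current running config.
! Lines added by the change, including any carried-over peer-group
! assignment, appear explicitly in the diff output.
show archive config differences disk0:pre-change-cfg system:running-config
```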

  3. Control-plane alerts fire faster than data-plane alerts: A BGP route change alert would have fired within seconds of the misconfiguration. A customer latency alert fired 7 minutes later. For network incidents, the control plane (BGP advertisements, routing table changes) should be monitored directly, not only inferred from customer-facing symptoms.

Cross-References

  • Failure Pattern: BGP Route Leak; Configuration Copy-Paste Error; Missing Egress Filter
  • Topic Packs: bgp-operations (route-maps, prefix-lists, peer groups, transit peering), network-change-management (review gates, rollback procedures), network-monitoring (BGP Monitoring Protocol, RIPE NCC tools)
  • Runbook: runbook-bgp-route-leak-response.md, runbook-bgp-neighbor-add.md
  • Decision Tree: US-East latency spike → rule out application layer → check network path via traceroute → if routing anomaly, pull BGP table from core routers → compare advertised prefixes to expected prefix-list → if unexpected prefix, initiate BGP soft reset after change authorization