Investigation: API Latency Spike, BGP Route Leak, Fix Is Network ACL

Phase 1: Observability Investigation (Dead End)

Start with the application metrics:

# Prometheus query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m]))
# Result: 4.312 seconds — confirms the alert

# Check which endpoints are slow
# histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="checkout-api",handler="/v1/checkout"}[5m]))
# Result: 4.1s for /v1/checkout, 0.008s for /health, 0.012s for /v1/cart
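These queries lean on histogram_quantile, which linearly interpolates within the first bucket whose cumulative rate crosses the target quantile. A minimal sketch of that interpolation, with made-up bucket rates (not the real checkout-api data):

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    with the last bound equal to +Inf, mirroring le="..." buckets.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the +Inf bucket: return the
                # upper bound of the last finite bucket.
                return prev_bound
            # Linear interpolation within the bucket, assuming
            # observations are uniformly distributed inside it.
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical per-second bucket rates, not real checkout-api data
buckets = [(0.1, 10.0), (1.0, 12.0), (5.0, 100.0), (float("inf"), 100.0)]
print(histogram_quantile(0.99, buckets))  # p99 lands in the 1-5s bucket, ~4.95
```

Note the resolution limit this implies: a reported p99 of 4.312s is only as precise as the bucket boundaries around it.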

Only the checkout endpoint is slow. That endpoint calls the payment gateway. Check the payment gateway call:

# From a checkout-api pod
$ kubectl exec -it checkout-api-6b5d8c9f-x2k4j -n prod -- \
    curl -w "@/tmp/curl-timing" -o /dev/null -s https://payment-gateway.partner.com/v2/charge
     time_namelookup:  0.004s
        time_connect:  3.217s
     time_appconnect:  3.418s
          time_total:  3.624s

3.2 seconds for TCP connect. DNS is fast (4 ms), and TLS adds only ~0.2 s on top (time_appconnect minus time_connect). The latency is in the TCP handshake to the partner's IP: the application code is fine; the network path is slow.
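The `@/tmp/curl-timing` argument tells curl to read its `-w` write-out template from a file. The original file isn't shown; a reconstruction that would produce the output above, using curl's built-in `%{...}` timing variables:

```shell
# Write-out template for curl -w "@/tmp/curl-timing" (reconstructed, not the original file)
cat > /tmp/curl-timing <<'EOF'
   time_namelookup:  %{time_namelookup}s\n
      time_connect:  %{time_connect}s\n
   time_appconnect:  %{time_appconnect}s\n
        time_total:  %{time_total}s\n
EOF
```

Each variable is cumulative from the start of the transfer, which is why time_connect already includes the 4 ms DNS lookup.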

# Check the route from the cluster
$ kubectl exec -it checkout-api-6b5d8c9f-x2k4j -n prod -- traceroute -n payment-gateway.partner.com
traceroute to payment-gateway.partner.com (198.51.100.42), 30 hops max
 1  10.0.0.1      0.5 ms
 2  10.0.1.1      0.8 ms
 3  172.16.0.1    1.2 ms    # Our edge router
 4  203.0.113.1   2.1 ms    # ISP-A
 5  203.0.113.5   45.2 ms   # ISP-A backbone
 6  198.18.0.1    89.4 ms   # Transit AS (unexpected!)
 7  198.18.0.5    134.7 ms  # Transit AS
 8  198.18.0.9    178.2 ms  # Transit AS
 9  * * *
10  198.51.100.1  3102.4 ms # Partner's edge (finally)
11  198.51.100.42 3217.1 ms # Partner's server

The traffic is going through a transit AS (198.18.0.0/15) that is adding massive latency. Compare with a direct route:

# From the laptop (home network)
$ traceroute -n 198.51.100.42
 1  192.168.1.1   1.2 ms
 2  10.128.0.1    3.4 ms
 3  72.14.215.1   5.2 ms    # Google edge
 4  198.51.100.1  12.4 ms   # Partner's edge (direct peering)
 5  198.51.100.42 14.1 ms

The home network reaches the partner's edge in 4 hops via direct peering; the production path crosses 10 hops, through a congested transit provider, before reaching that same edge.

The Pivot

Check the BGP routing table on the edge router:

# SSH to edge router
router# show ip bgp 198.51.100.0/24
BGP routing table entry for 198.51.100.0/24
  Paths: (2 available, best #2)
    Path 1: AS_PATH 64501 64999 64700
      Next hop: 203.0.113.1 (ISP-A via transit)
      MED: 200, LOCAL_PREF: 100
      Origin: IGP, valid, best
    Path 2: AS_PATH 64502 64700
      Next hop: 203.0.114.1 (ISP-B direct peering)
      MED: 50, LOCAL_PREF: 150
      Origin: IGP, valid
      **INACTIVE: filtered by prefix-list DENY-PARTNER-ROUTES**

The direct peering path (ISP-B, 2 hops, LOCAL_PREF 150) is being filtered by a prefix-list. Traffic is forced through ISP-A's congested transit path (3 hops through AS 64999).
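The route selection above can be sketched as a simplified best-path comparison. Real BGP applies more tie-breakers (weight, origin, eBGP vs iBGP, router ID), but the attributes shown in the table — LOCAL_PREF, AS_PATH length, MED — are enough to see why removing the filter flips the decision. The dict layout here is illustrative, not a router API:

```python
# Simplified BGP best-path selection: higher LOCAL_PREF wins first,
# then shorter AS_PATH, then lower MED. Filtered paths never compete.
def best_path(paths):
    candidates = [p for p in paths if not p["filtered"]]
    return min(candidates,
               key=lambda p: (-p["local_pref"], len(p["as_path"]), p["med"]))

paths = [
    {"name": "ISP-A transit", "as_path": [64501, 64999, 64700],
     "local_pref": 100, "med": 200, "filtered": False},
    {"name": "ISP-B peering", "as_path": [64502, 64700],
     "local_pref": 150, "med": 50, "filtered": True},   # DENY-PARTNER-ROUTES
]

print(best_path(paths)["name"])   # -> ISP-A transit (only unfiltered path)
paths[1]["filtered"] = False
print(best_path(paths)["name"])   # -> ISP-B peering (LOCAL_PREF 150 > 100)
```

With the filter in place the better path never enters the comparison at all, which is why the router output marks it INACTIVE rather than merely non-best.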

Phase 2: Networking Investigation (Root Cause)

Check the prefix-list:

router# show ip prefix-list DENY-PARTNER-ROUTES
ip prefix-list DENY-PARTNER-ROUTES:
   seq 10 deny 198.51.100.0/24
   seq 100 permit 0.0.0.0/0 le 32
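The prefix-list semantics explain the trap: sequences match in order, an entry with no le/ge modifier matches only that exact prefix and length, and seq 10 catches the partner /24 before the catch-all permit at seq 100. A rough sketch of that evaluation (illustrative, not router-accurate):

```python
import ipaddress

# Simplified IOS-style prefix-list: first matching sequence wins.
PREFIX_LIST = [
    (10,  "deny",   "198.51.100.0/24", None),  # no le/ge: exact match only
    (100, "permit", "0.0.0.0/0",       32),    # "le 32": matches any prefix
]

def evaluate(prefix):
    net = ipaddress.ip_network(prefix)
    for seq, action, match, le in PREFIX_LIST:
        m = ipaddress.ip_network(match)
        if le is None:
            matched = (net == m)
        else:
            matched = (net.subnet_of(m)
                       and m.prefixlen <= net.prefixlen <= le)
        if matched:
            return action, seq
    return "deny", None  # implicit deny at the end of every prefix-list

print(evaluate("198.51.100.0/24"))  # -> ('deny', 10): the partner route
print(evaluate("203.0.113.0/24"))   # -> ('permit', 100): everything else
```

Every other route on the session falls through to seq 100, so the filter looks harmless in monitoring: only the one prefix that matters to checkout is suppressed.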

This prefix-list was added 10 days ago. Check the change log:

router# show configuration commit list 5
   Date                User       Comment
   2026-03-09 14:22    netops     "block partner routes on ISP-B per security ticket SEC-4471"

Security ticket SEC-4471 requested blocking the partner's IP range on ISP-B because a vulnerability scan flagged traffic from that range. The network team implemented it as a BGP prefix-list filter, which inadvertently blocked the preferred routing path to the partner. The security intent was to block inbound connections from the partner's range, not to filter outbound BGP routes.

Domain Bridge: Why This Crossed Domains

Key insight: the symptom was API latency, visible in observability dashboards; the root cause was a BGP route filter on the edge router (networking), applied because a security ticket was misinterpreted during implementation; and the fix is a network ACL that enforces the security intent without breaking routing. This class of failure is common because security policies implemented at the network layer can have unintended routing side effects: a BGP prefix-list affects route selection, not just packet filtering, so a "block this range" security request can be implemented in several ways, some of which break legitimate traffic.

Root Cause

A security ticket requested blocking inbound traffic from the partner's IP range. The network team implemented it as a BGP prefix-list filter on ISP-B, which blocked the preferred outbound route to the partner. Traffic was rerouted through a congested transit path via ISP-A, adding 3+ seconds of latency. The security intent was to block inbound scanning traffic, not to affect outbound routing.
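A fix matching the original intent, sketched in IOS-style syntax: remove the route filter so the direct peering path becomes best again, and enforce the security intent with an inbound packet-filter ACL instead. This assumes the prefix-list was applied inbound on the ISP-B neighbor; the local AS 64500, the ACL name, and the interface GigabitEthernet0/1 facing ISP-B are hypothetical:

```
router# configure terminal
! 1. Stop filtering the partner's routes; restores the LOCAL_PREF 150 peering path
router(config)# router bgp 64500
router(config-router)# no neighbor 203.0.114.1 prefix-list DENY-PARTNER-ROUTES in
router(config-router)# exit
! 2. Block new inbound connections from the partner range (the scan traffic)
!    while still allowing return traffic for outbound checkout calls
router(config)# ip access-list extended BLOCK-PARTNER-INBOUND
router(config-ext-nacl)# permit tcp 198.51.100.0 0.0.0.255 any established
router(config-ext-nacl)# deny   ip 198.51.100.0 0.0.0.255 any
router(config-ext-nacl)# permit ip any any
router(config-ext-nacl)# exit
router(config)# interface GigabitEthernet0/1
router(config-if)# ip access-group BLOCK-PARTNER-INBOUND in
```

The `established` match is the key difference from the prefix-list: it distinguishes return packets of our own outbound connections from new inbound connections, something a route filter cannot express.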