Symptoms: API Latency Spike, BGP Route Leak, Fix Is Network ACL

Domains: observability | networking | security
Level: L3
Estimated time: 45 min
Initial Alert
Grafana alert fires at 11:23 UTC:
CRITICAL: API p99 latency > 2000ms
service: checkout-api
environment: prod
current_value: 4,312ms
threshold: 2,000ms
duration: 10 minutes
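For reference, an alert of this shape is typically defined as a Prometheus alerting rule. The sketch below mirrors the page's threshold (2000ms), duration (10 minutes), and severity; the metric and label names are assumptions, not the team's actual configuration:

```yaml
# Hypothetical rule that would produce the page above.
groups:
  - name: checkout-latency
    rules:
      - alert: CheckoutP99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout-api",environment="prod"}[5m]))
          ) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "API p99 latency > 2000ms on checkout-api"
```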
Additional Prometheus alerts:
- WARNING: HTTP error rate > 5% — checkout-api returning 504 Gateway Timeout (8.7%)
- WARNING: Upstream response time — payment-gateway.partner.com p99 > 3000ms
- CRITICAL: SLO burn rate exceeded — checkout 99.9% SLO error budget consumed 40% in 1 hour
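The burn-rate alert can be sanity-checked with quick arithmetic: a 99.9% SLO over a 30-day window (the window length is an assumption) allows 43.2 minutes of error budget, so a 40%-per-hour burn exhausts the whole budget in 2.5 hours.

```shell
# Sanity-check the SLO burn-rate alert. The 30-day window is an
# assumption; the 99.9% target and 40%/hour burn come from the alert.
window_min=$((30 * 24 * 60))                          # 43200 minutes in the window
budget_min=$(awk "BEGIN {print $window_min * 0.001}") # 0.1% of the window: 43.2 min
hours_left=$(awk 'BEGIN {printf "%.1f", 1 / 0.4}')    # full budget gone in 2.5 h
echo "error budget: ${budget_min} min; exhausted in ${hours_left} h at this rate"
```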
Observable Symptoms
- The checkout-api p99 latency jumped from 180ms to 4,312ms at 11:13 UTC.
- 8.7% of requests return 504 Gateway Timeout, all from calls to payment-gateway.partner.com.
- The checkout-api itself is healthy — internal endpoints respond in <10ms.
- curl to payment-gateway.partner.com from within the cluster takes 3-5 seconds.
- curl to payment-gateway.partner.com from a laptop on a home network takes 90ms.
- No deployments to checkout-api in the last 48 hours.
- The partner's status page shows no incidents.
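The in-cluster vs. at-home curl comparison above can be made more precise by breaking a single request into per-phase timings with curl's standard -w variables. A minimal sketch; the partner URL is the endpoint from the incident, and running the function from both vantage points is the comparison that matters:

```shell
# Print per-phase latency for one request. High connect/ttfb with a low
# dns figure points at the network path rather than the application.
timing_breakdown() {
  curl -s -o /dev/null -w \
    'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
    "$1"
}

# Run from inside the cluster and from a machine outside it, then compare:
# timing_breakdown "https://payment-gateway.partner.com/"
```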
The Misleading Signal
The observability data clearly points at checkout-api as the source of the latency: dashboards show the service in red. The natural investigation path is to profile the application, check database queries, examine the payment gateway integration code, or contact the payment partner. Because the partner's status page is clean, engineers suspect their own code or infrastructure. But the latency originates in the network path, not the application — the observability tools only surface the symptom at the application layer.