Symptoms: API Latency Spike, BGP Route Leak, Fix Is Network ACL

Domains: observability | networking | security
Level: L3
Estimated time: 45 min
Initial Alert
Grafana alert fires at 11:23 UTC:
CRITICAL: API p99 latency > 2000ms
service: checkout-api
environment: prod
current_value: 4,312ms
threshold: 2,000ms
duration: 10 minutes
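For reference, an alert of this shape is typically defined as a Prometheus alerting rule. The sketch below mirrors the page's threshold (2000ms), duration (10 minutes), and severity; the metric and label names are assumptions, not the team's actual configuration:

```yaml
# Hypothetical rule that would produce the page above.
groups:
  - name: checkout-latency
    rules:
      - alert: CheckoutP99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout-api",environment="prod"}[5m]))
          ) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "API p99 latency > 2000ms on checkout-api"
```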
Additional Prometheus alerts:
- WARNING: HTTP error rate > 5% — checkout-api returning 504 Gateway Timeout (8.7%)
- WARNING: Upstream response time — payment-gateway.partner.com p99 > 3000ms
- CRITICAL: SLO burn rate exceeded — checkout 99.9% SLO error budget consumed 40% in 1 hour
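The burn-rate alert can be sanity-checked with quick arithmetic: a 99.9% SLO over a 30-day window (the window length is an assumption) allows 43.2 minutes of error budget, so a 40%-per-hour burn exhausts the whole budget in 2.5 hours.

```shell
# Sanity-check the SLO burn-rate alert. The 30-day window is an
# assumption; the 99.9% target and 40%/hour burn come from the alert.
window_min=$((30 * 24 * 60))                          # 43200 minutes in the window
budget_min=$(awk "BEGIN {print $window_min * 0.001}") # 0.1% of the window: 43.2 min
hours_left=$(awk 'BEGIN {printf "%.1f", 1 / 0.4}')    # full budget gone in 2.5 h
echo "error budget: ${budget_min} min; exhausted in ${hours_left} h at this rate"
```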
Observable Symptoms
- The checkout-api p99 latency jumped from 180ms to 4,312ms at 11:13 UTC.
- 8.7% of requests return 504 Gateway Timeout, all from calls to payment-gateway.partner.com.
- The checkout-api itself is healthy — internal endpoints respond in <10ms.
- curl to payment-gateway.partner.com from within the cluster takes 3-5 seconds.
- curl to payment-gateway.partner.com from a laptop on a home network takes 90ms.
- No deployments to checkout-api in the last 48 hours.
- The partner's status page shows no incidents.
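The in-cluster vs. at-home curl comparison above can be made more precise by breaking a single request into per-phase timings with curl's standard -w variables. A minimal sketch; the partner URL is the endpoint from the incident, and running the function from both vantage points is the comparison that matters:

```shell
# Print per-phase latency for one request. High connect/ttfb with a low
# dns figure points at the network path rather than the application.
timing_breakdown() {
  curl -s -o /dev/null -w \
    'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
    "$1"
}

# Run from inside the cluster and from a machine outside it, then compare:
# timing_breakdown "https://payment-gateway.partner.com/"
```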
The Misleading Signal
The observability data clearly points at checkout-api as the source of the latency: dashboards show the service in red. The natural investigation path is to profile the application, check database queries, examine the payment gateway integration code, or contact the payment partner. Because the partner's status page is clean, engineers suspect their own code or infrastructure. But the latency originates in the network path, not the application — the observability tools only surface the symptom at the application layer.