Progressive Hints

Hint 1 (after 5 min)

Look at the kubectl top output: one payment-service pod (g4h2j) is using 380m CPU while the other two use 12-15m. The Istio metrics tell the same story: that pod has handled 12,847 requests while the others have handled ~100 each. Traffic is not being distributed evenly.

Hint 2 (after 10 min)

The DestinationRule sets outlierDetection.maxEjectionPercent: 100, which lets Istio eject every pod from the load-balancing pool. Combined with consecutive5xxErrors: 3, any pod that returns 3 errors in a row is ejected for 180 seconds. When the upstream bank API is slow, the slow pod's responses exceed the caller's timeout and surface as 504s, which count as 5xx errors. If 2 of the 3 pods get ejected this way, all traffic concentrates on the remaining pod. Now also look at http1MaxPendingRequests: 1: with only 1 pending request allowed, excess requests to a busy pod are rejected immediately.
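A minimal sketch of the kind of DestinationRule described in this hint. The resource name, host, and interval are assumptions; the limit values come from the hints themselves:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service           # hypothetical resource name
spec:
  host: payment-service           # hypothetical host; resolves within the mesh
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1   # only 1 request may queue; the rest are rejected
        http2MaxRequests: 10         # caps concurrent requests to the destination at 10
    outlierDetection:
      consecutive5xxErrors: 3        # 3 errors in a row triggers ejection
      interval: 30s                  # assumed analysis interval
      baseEjectionTime: 180s         # ejected pods stay out for 180 seconds
      maxEjectionPercent: 100        # every pod may be ejected simultaneously
```

With maxEjectionPercent: 100 there is no floor on pool size, so outlier detection can empty the pool entirely instead of always leaving some capacity behind.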

Hint 3 (after 15 min)

This is a checkout system where order-service calls payment-service to charge customers. The payment service talks to an external bank API that is sometimes slow (8+ seconds). The Istio DestinationRule has overly aggressive circuit-breaker settings: only 1 pending request, only 10 concurrent HTTP/2 requests, and outlier detection that ejects pods after just 3 consecutive 5xx errors, with up to 100% of pods ejectable. When the bank API slows down, two pods get ejected and all traffic hammers the surviving pod; order-service sees 504 timeouts (its 5s deadline is exceeded), and the payment-service log confirms 8.4s processing times. The p50 looks fine (45ms) because most requests on healthy pods are fast, but the p99 is 8.7 seconds: the slow bank responses showing up in the tail.
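Once the root cause is clear, a less aggressive trafficPolicy along these lines would stop the cascade. The specific values are illustrative, not tuned recommendations; the key changes are a larger queue, a higher error threshold, and an ejection cap below 100%:

```yaml
trafficPolicy:
  connectionPool:
    http:
      http1MaxPendingRequests: 100  # illustrative: allow requests to queue during a slowdown
      http2MaxRequests: 1000        # illustrative: don't throttle normal concurrency
  outlierDetection:
    consecutive5xxErrors: 10        # tolerate transient upstream errors before ejecting
    maxEjectionPercent: 50          # always keep at least half the pods in the pool
```

Capping maxEjectionPercent at 50 means that even during a bank API outage, at least two of the three pods stay in the pool and share the load, rather than one pod absorbing everything.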