# Answer Key: The Requests That Vanish
## The System
An e-commerce checkout pipeline with two microservices communicating over an Istio service mesh:
```
[Users] --> [order-service (2 pods)]
                     |
        POST /api/v1/payments/charge (timeout=5s)
                     |
[Istio Sidecar] --> DestinationRule (circuit breaker) --> [Istio Sidecar]
                     |
        [payment-service (3 pods)]
                     |
        [External Bank API] (sometimes slow: 8+ seconds)
```
The order-service calls payment-service to charge customers during checkout, and payment-service integrates with an external bank API. All inter-service traffic flows through Istio sidecars (2/2 containers per pod: app + Envoy sidecar).
## What's Broken
Root cause: A combination of overly aggressive Istio circuit breaker settings and intermittent upstream bank API latency creates a cascading failure:
- The external bank API occasionally responds slowly (8+ seconds)
- The payment-service pods that hit slow bank responses take too long to complete
- The order-service has a 5-second timeout, so these slow requests generate 504s
- Istio's outlier detection sees 3 consecutive 5xx errors and ejects the slow pod for 180 seconds
- With `maxEjectionPercent: 100`, multiple pods can be ejected at once
- `http1MaxPendingRequests: 1` and `http2MaxRequests: 10` are far too restrictive: requests queue up and overflow immediately
- Traffic concentrates on the remaining pod, which then also gets slow and is potentially ejected
Key clue: The wildly uneven request distribution (12,847 vs 94 vs 107) combined with the DestinationRule showing `maxEjectionPercent: 100` and `http1MaxPendingRequests: 1`. One pod survived ejection and absorbed all traffic.
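The interaction between ejection and the pool cap can be illustrated with a toy model (a sketch with made-up pod names; real Envoy outlier detection tracks consecutive errors per host and re-admits ejected hosts after `baseEjectionTime`):

```python
# Toy model of Envoy-style outlier ejection: slow pods that keep returning
# 5xx get ejected, but never more than max_ejection_percent of the pool.
def surviving_pods(pods, slow, max_ejection_percent):
    """Return the pods left in rotation after ejection (order preserved)."""
    max_ejectable = int(len(pods) * max_ejection_percent / 100)
    ejected = set()
    for pod in pods:
        # A pod stuck behind the slow bank API times out repeatedly and
        # accumulates consecutive 5xx responses, so it becomes ejectable.
        if pod in slow and len(ejected) < max_ejectable:
            ejected.add(pod)
    return [p for p in pods if p not in ejected]

pods = ["payment-1", "payment-2", "payment-3"]
slow = {"payment-1", "payment-2"}  # two pods hit the slow bank API

# maxEjectionPercent: 100 -- both slow pods are ejected; all traffic
# lands on the single survivor, which then also slows down.
print(surviving_pods(pods, slow, 100))  # ['payment-3']

# maxEjectionPercent: 50 -- at most one of three pods can be ejected,
# so capacity never collapses to a single pod.
print(surviving_pods(pods, slow, 50))   # ['payment-2', 'payment-3']
```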
## The Fix
### Immediate (relax circuit breakers)
```shell
kubectl patch destinationrule payment-service -n checkout --type='merge' -p '{
  "spec": {
    "trafficPolicy": {
      "connectionPool": {
        "http": {
          "http1MaxPendingRequests": 100,
          "http2MaxRequests": 1000
        }
      },
      "outlierDetection": {
        "consecutive5xxErrors": 10,
        "maxEjectionPercent": 50
      }
    }
  }
}'
```
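Since a malformed merge patch fails only at apply time, the JSON payload can be sanity-checked locally first (a quick check, not part of the runbook itself):

```python
import json

# The same payload passed to `kubectl patch ... -p`; parsing it locally
# catches quoting and nesting mistakes before touching the cluster.
patch = '''{
  "spec": {
    "trafficPolicy": {
      "connectionPool": {
        "http": {"http1MaxPendingRequests": 100, "http2MaxRequests": 1000}
      },
      "outlierDetection": {
        "consecutive5xxErrors": 10,
        "maxEjectionPercent": 50
      }
    }
  }
}'''

spec = json.loads(patch)["spec"]["trafficPolicy"]
print(spec["outlierDetection"]["maxEjectionPercent"])  # 50
```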
### Permanent (fix the DestinationRule)
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: checkout
spec:
  host: payment-service.checkout.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 10
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50  # Never eject more than half the pool
```
Also consider:

- Increasing the order-service timeout to accommodate bank API latency
- Adding a retry with a timeout budget in the VirtualService
- Adding a circuit breaker in the payment-service application code for the bank API
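The retry-with-timeout-budget idea could look something like this (a sketch only; the 10s overall budget and per-try values are assumptions, not taken from the incident):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: checkout
spec:
  hosts:
    - payment-service.checkout.svc.cluster.local
  http:
    - route:
        - destination:
            host: payment-service.checkout.svc.cluster.local
      timeout: 10s          # overall budget; must exceed bank API worst case
      retries:
        attempts: 2
        perTryTimeout: 4s   # each attempt bounded so retries fit the budget
        retryOn: 5xx,reset,connect-failure
```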
## Verification
```shell
# Check all pods are receiving traffic
kubectl exec -n checkout deploy/order-service -c istio-proxy -- \
  curl -s localhost:15000/stats | grep payment-service | grep upstream_rq

# Monitor request distribution
istioctl dashboard kiali -n checkout

# Watch for 504s
kubectl logs -n checkout -l app=order-service -c istio-proxy --tail=50 | grep 504

# Check outlier detection status
istioctl proxy-config endpoint -n checkout deploy/order-service | grep payment
```
## Artifact Decoder
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | One pod at 380m CPU, others at 12-15m = traffic imbalance; all pods show 2/2 Running | All pods are "healthy" by Kubernetes standards; the problem is at the mesh layer |
| Metrics | 12,847 vs 94 vs 107 requests = massive imbalance; p99 at 8.7s but p50 at 45ms | The p50 looks fine, hiding the severity; the 504s come from order-service, not payment-service |
| IaC Snippet | `maxEjectionPercent: 100` + `http1MaxPendingRequests: 1` = overly aggressive circuit breaking | The config looks like standard safety settings; you need to understand the traffic volume |
| Log Lines | payment-service confirms 8.4s bank API latency; order-service confirms 5s timeout | The istio-proxy log shows a normal 200 from one of the non-ejected pods, making the mesh look healthy |
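The p50/p99 divergence is exactly what a bimodal latency mix produces. A quick sketch with made-up proportions (97% of requests at ~45 ms, 3% stuck behind the ~8.4 s bank API path):

```python
# A small fraction of very slow requests barely moves the median
# but completely dominates the tail percentiles.
latencies_ms = sorted([45] * 97 + [8400] * 3)

def percentile(sorted_vals, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    rank = max(1, round(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]

print(percentile(latencies_ms, 50))  # 45   -- the median looks healthy
print(percentile(latencies_ms, 99))  # 8400 -- the tail exposes the bank API
```

This is why dashboards keyed on median latency missed the incident: only tail percentiles (and the raw per-pod request counts) reveal it.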
## Skills Demonstrated
- Understanding Istio traffic policies and circuit breaker behavior
- Correlating resource usage (CPU) with traffic distribution anomalies
- Interpreting percentile latency metrics (p50 vs p99 divergence)
- Recognizing the interaction between upstream latency and mesh ejection policies
- Tracing multi-service request flows through a service mesh