# Answer Key: The Requests That Vanish
## The System
An e-commerce checkout pipeline with two microservices communicating over an Istio service mesh:
```
[Users] --> [order-service (2 pods)]
                     |
        POST /api/v1/payments/charge (timeout=5s)
                     |
[Istio Sidecar] --> DestinationRule (circuit breaker) --> [Istio Sidecar]
                     |
        [payment-service (3 pods)]
                     |
        [External Bank API] (sometimes slow: 8+ seconds)
```
The order-service calls payment-service to charge customers during checkout, and payment-service integrates with an external bank API. All inter-service traffic flows through Istio sidecars (2/2 containers per pod: app + Envoy sidecar).
## What's Broken
Root cause: A combination of overly aggressive Istio circuit breaker settings and intermittent upstream bank API latency creates a cascading failure:
- The external bank API occasionally responds slowly (8+ seconds)
- The payment-service pods that hit slow bank responses take too long to complete
- The order-service has a 5-second timeout, so these slow requests generate 504s
- Istio's outlier detection sees 3 consecutive 5xx errors and ejects the slow pod for 180 seconds
- With `maxEjectionPercent: 100`, multiple pods can be ejected at once
- `http1MaxPendingRequests: 1` and `http2MaxRequests: 10` are far too restrictive: requests queue up and overflow immediately
- Traffic concentrates on the remaining pod, which then also gets slow and is potentially ejected
Key clue: The wildly uneven request distribution (12,847 vs 94 vs 107) combined with the DestinationRule showing `maxEjectionPercent: 100` and `http1MaxPendingRequests: 1`. One pod survived ejection and absorbed all traffic.
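The interaction between ejection and the pool cap can be illustrated with a toy model (a sketch with made-up pod names; real Envoy outlier detection tracks consecutive errors per host and re-admits ejected hosts after `baseEjectionTime`):

```python
# Toy model of Envoy-style outlier ejection: slow pods that keep returning
# 5xx get ejected, but never more than max_ejection_percent of the pool.
def surviving_pods(pods, slow, max_ejection_percent):
    """Return the pods left in rotation after ejection (order preserved)."""
    max_ejectable = int(len(pods) * max_ejection_percent / 100)
    ejected = set()
    for pod in pods:
        # A pod stuck behind the slow bank API times out repeatedly and
        # accumulates consecutive 5xx responses, so it becomes ejectable.
        if pod in slow and len(ejected) < max_ejectable:
            ejected.add(pod)
    return [p for p in pods if p not in ejected]

pods = ["payment-1", "payment-2", "payment-3"]
slow = {"payment-1", "payment-2"}  # two pods hit the slow bank API

# maxEjectionPercent: 100 -- both slow pods are ejected; all traffic
# lands on the single survivor, which then also slows down.
print(surviving_pods(pods, slow, 100))  # ['payment-3']

# maxEjectionPercent: 50 -- at most one of three pods can be ejected,
# so capacity never collapses to a single pod.
print(surviving_pods(pods, slow, 50))   # ['payment-2', 'payment-3']
```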
## The Fix
### Immediate (relax circuit breakers)
```shell
kubectl patch destinationrule payment-service -n checkout --type='merge' -p '{
  "spec": {
    "trafficPolicy": {
      "connectionPool": {
        "http": {
          "http1MaxPendingRequests": 100,
          "http2MaxRequests": 1000
        }
      },
      "outlierDetection": {
        "consecutive5xxErrors": 10,
        "maxEjectionPercent": 50
      }
    }
  }
}'
```
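Since a malformed merge patch fails only at apply time, the JSON payload can be sanity-checked locally first (a quick check, not part of the runbook itself):

```python
import json

# The same payload passed to `kubectl patch ... -p`; parsing it locally
# catches quoting and nesting mistakes before touching the cluster.
patch = '''{
  "spec": {
    "trafficPolicy": {
      "connectionPool": {
        "http": {"http1MaxPendingRequests": 100, "http2MaxRequests": 1000}
      },
      "outlierDetection": {
        "consecutive5xxErrors": 10,
        "maxEjectionPercent": 50
      }
    }
  }
}'''

spec = json.loads(patch)["spec"]["trafficPolicy"]
print(spec["outlierDetection"]["maxEjectionPercent"])  # 50
```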
### Permanent (fix the DestinationRule)
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: checkout
spec:
  host: payment-service.checkout.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 10
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50  # Never eject more than half the pool
```
Also consider:

- Increasing the order-service timeout to accommodate bank API latency
- Adding a retry with a timeout budget in the VirtualService
- Adding a circuit breaker in the payment-service application code for the bank API
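The retry-with-timeout-budget idea could look something like this (a sketch only; the 10s overall budget and per-try values are assumptions, not taken from the incident):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: checkout
spec:
  hosts:
    - payment-service.checkout.svc.cluster.local
  http:
    - route:
        - destination:
            host: payment-service.checkout.svc.cluster.local
      timeout: 10s          # overall budget; must exceed bank API worst case
      retries:
        attempts: 2
        perTryTimeout: 4s   # each attempt bounded so retries fit the budget
        retryOn: 5xx,reset,connect-failure
```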
## Verification
```shell
# Check all pods are receiving traffic
kubectl exec -n checkout deploy/order-service -c istio-proxy -- \
  curl -s localhost:15000/stats | grep payment-service | grep upstream_rq

# Monitor request distribution
istioctl dashboard kiali -n checkout

# Watch for 504s
kubectl logs -n checkout -l app=order-service -c istio-proxy --tail=50 | grep 504

# Check outlier detection status
istioctl proxy-config endpoint -n checkout deploy/order-service | grep payment
```
## Artifact Decoder
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | One pod at 380m CPU, others at 12-15m = traffic imbalance; all pods show 2/2 Running | All pods are "healthy" by Kubernetes standards; the problem is at the mesh layer |
| Metrics | 12,847 vs 94 vs 107 requests = massive imbalance; p99 at 8.7s but p50 at 45ms | The p50 looks fine, hiding the severity; the 504s come from order-service, not payment-service |
| IaC Snippet | `maxEjectionPercent: 100` + `http1MaxPendingRequests: 1` = overly aggressive circuit breaking | The config looks like standard safety settings; you need to understand the traffic volume |
| Log Lines | payment-service confirms 8.4s bank API latency; order-service confirms 5s timeout | The istio-proxy log shows a normal 200 from one of the non-ejected pods, making the mesh look healthy |
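The p50/p99 divergence is exactly what a bimodal latency mix produces. A quick sketch with made-up proportions (97% of requests at ~45 ms, 3% stuck behind the ~8.4 s bank API path):

```python
# A small fraction of very slow requests barely moves the median
# but completely dominates the tail percentiles.
latencies_ms = sorted([45] * 97 + [8400] * 3)

def percentile(sorted_vals, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    rank = max(1, round(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]

print(percentile(latencies_ms, 50))  # 45   -- the median looks healthy
print(percentile(latencies_ms, 99))  # 8400 -- the tail exposes the bank API
```

This is why dashboards keyed on median latency missed the incident: only tail percentiles (and the raw per-pod request counts) reveal it.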
## Skills Demonstrated
- Understanding Istio traffic policies and circuit breaker behavior
- Correlating resource usage (CPU) with traffic distribution anomalies
- Interpreting percentile latency metrics (p50 vs p99 divergence)
- Recognizing the interaction between upstream latency and mesh ejection policies
- Tracing multi-service request flows through a service mesh