Answer Key: The Requests That Vanish

The System

An e-commerce checkout pipeline with two microservices communicating over an Istio service mesh:

[Users] --> [order-service (2 pods)]
                |
           POST /api/v1/payments/charge (timeout=5s)
                |
           [Istio Sidecar] --> DestinationRule (circuit breaker) --> [Istio Sidecar]
                |
           [payment-service (3 pods)]
                |
           [External Bank API] (sometimes slow: 8+ seconds)

The order-service calls payment-service to charge customers during checkout. The payment-service integrates with an external bank API. All inter-service traffic flows through Istio sidecars (2/2 containers per pod = app + envoy sidecar).

What's Broken

Root cause: A combination of overly aggressive Istio circuit breaker settings and intermittent upstream bank API latency creates a cascading failure:

  1. The external bank API occasionally responds slowly (8+ seconds)
  2. The payment-service pods that hit slow bank responses take too long to complete
  3. The order-service has a 5-second timeout, so these slow requests are cut off by the sidecar and surface as 504 Gateway Timeout errors
  4. Istio's outlier detection sees 3 consecutive 5xx errors and ejects the slow pod for 180 seconds
  5. With maxEjectionPercent: 100, multiple pods can be ejected
  6. http1MaxPendingRequests: 1 and http2MaxRequests: 10 are far too restrictive — requests queue up and overflow immediately
  7. Traffic concentrates on the remaining pod, which then also gets slow and potentially ejected

Key clue: The wildly uneven request distribution (12,847 vs 94 vs 107) combined with the DestinationRule showing maxEjectionPercent: 100 and http1MaxPendingRequests: 1. One pod survived ejection and absorbed all traffic.
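Putting the clues together, the failing trafficPolicy implied by steps 4-6 would look roughly like this. This is a reconstruction, not a verbatim copy of the exercise's IaC snippet: consecutive5xxErrors: 3 and baseEjectionTime: 180s are inferred from the "3 consecutive 5xx errors" and "180 seconds" described above.

```yaml
# Reconstructed sketch of the problematic DestinationRule settings.
# The outlierDetection values are assumptions inferred from steps 4-6.
trafficPolicy:
  connectionPool:
    http:
      http1MaxPendingRequests: 1    # overflows as soon as a single request queues
      http2MaxRequests: 10          # caps concurrent requests across the whole pool
  outlierDetection:
    consecutive5xxErrors: 3         # three 504s in a row ejects a pod
    baseEjectionTime: 180s          # ejected pods stay out for 3 minutes
    maxEjectionPercent: 100         # every pod in the pool is eligible for ejection
```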

The Fix

Immediate (relax circuit breakers)

kubectl patch destinationrule payment-service -n checkout --type='merge' -p '{
  "spec": {
    "trafficPolicy": {
      "connectionPool": {
        "http": {
          "http1MaxPendingRequests": 100,
          "http2MaxRequests": 1000
        }
      },
      "outlierDetection": {
        "consecutive5xxErrors": 10,
        "maxEjectionPercent": 50
      }
    }
  }
}'

Permanent (fix the DestinationRule)

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: checkout
spec:
  host: payment-service.checkout.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 10
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50    # Never eject more than half the pool

Also consider:

  • Increasing the order-service timeout to accommodate bank API latency
  • Adding a retry with a timeout budget in the VirtualService
  • Adding a circuit breaker in the payment-service application code for the bank API
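A retry with a timeout budget could be sketched in a VirtualService like this. Illustrative only: the resource name and host are assumed from the architecture above, and the timeout values should be tuned against the observed bank API latency.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: checkout
spec:
  hosts:
    - payment-service.checkout.svc.cluster.local
  http:
    - route:
        - destination:
            host: payment-service.checkout.svc.cluster.local
      timeout: 10s                # overall budget, above the 8s+ bank API worst case
      retries:
        attempts: 2
        perTryTimeout: 4s         # each attempt gets its own slice of the budget
        retryOn: 5xx,reset,connect-failure
```

Note that retrying a charge endpoint is only safe if the payment API is idempotent (for example, if it supports idempotency keys); otherwise a retry risks double-charging the customer.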

Verification

# Check all pods are receiving traffic
kubectl exec -n checkout deploy/order-service -c istio-proxy -- \
  curl -s localhost:15000/stats | grep payment-service | grep upstream_rq

# Monitor request distribution
istioctl dashboard kiali

# Watch for 504s
kubectl logs -n checkout -l app=order-service -c istio-proxy --tail=50 | grep 504

# Check outlier detection status
istioctl proxy-config endpoint -n checkout deploy/order-service | grep payment

Artifact Decoder

| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | One pod at 380m CPU, others at 12-15m: traffic imbalance; all pods show 2/2 Running | All pods are "healthy" by Kubernetes standards; the problem is at the mesh layer |
| Metrics | 12,847 vs 94 vs 107 requests: massive imbalance; p99 at 8.7s but p50 at 45ms | The p50 looks fine, hiding the severity; the 504s come from order-service, not payment-service |
| IaC Snippet | maxEjectionPercent: 100 plus http1MaxPendingRequests: 1: overly aggressive circuit breaking | The config looks like standard safety settings; you need to understand the traffic volume |
| Log Lines | Payment-service confirms 8.4s bank API latency; order-service confirms the 5s timeout | The istio-proxy log shows a normal 200 from one of the non-ejected pods, making the mesh look healthy |

Skills Demonstrated

  • Understanding Istio traffic policies and circuit breaker behavior
  • Correlating resource usage (CPU) with traffic distribution anomalies
  • Interpreting percentile latency metrics (p50 vs p99 divergence)
  • Recognizing the interaction between upstream latency and mesh ejection policies
  • Tracing multi-service request flows through a service mesh

Prerequisite Topic Packs