Symptoms: Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy¶
Domains: networking | kubernetes_ops | security · Level: L3 · Estimated time: 45 min
Initial Alert¶
Istio service mesh alert fires at 16:08 UTC:
CRITICAL: istio_requests_total 503 rate > 5%
destination_service: inventory-service.prod.svc.cluster.local
source_service: order-service.prod.svc.cluster.local
response_code: 503
response_flags: UF (upstream connection failure)
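The `response_flags` field is Envoy's shorthand for why a request failed. A small helper (a sketch; the abbreviations and meanings are taken from Envoy's access-log documentation, and the function name is our own) shows how a few common flags decode:

```shell
# Hypothetical helper: translate common Envoy response-flag abbreviations
# into their documented meanings (per Envoy's access-log format docs).
envoy_flag() {
  case "$1" in
    UF) echo "upstream connection failure" ;;
    UH) echo "no healthy upstream" ;;
    NR) echo "no route configured" ;;
    UO) echo "upstream overflow (circuit breaker tripped)" ;;
    *)  echo "unknown flag: $1" ;;
  esac
}

envoy_flag UF   # prints "upstream connection failure"
```

Note the distinction: `UF` means a connection to an upstream host was attempted and failed, while `UH` means Envoy had no healthy host to even try, which matters when interpreting the alerts below.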
Additional alerts:
WARNING: order-service — 503 errors on /v1/orders endpoint (depends on inventory-service)
WARNING: inventory-service — Envoy sidecar reporting "no healthy upstream"
CRITICAL: SLO breach — order completion rate dropped to 71% (target: 99.5%)
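The 503-rate alert above might be expressed as a Prometheus expression along these lines (a sketch; label names follow Istio's standard `istio_requests_total` metric, but the exact rule depends on your telemetry setup):

```promql
sum(rate(istio_requests_total{
      destination_service="inventory-service.prod.svc.cluster.local",
      response_code="503"}[5m]))
/
sum(rate(istio_requests_total{
      destination_service="inventory-service.prod.svc.cluster.local"}[5m]))
  > 0.05
```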
Observable Symptoms¶
- `order-service` calls to `inventory-service` are failing with 503 "no healthy upstream."
- Direct calls to `inventory-service` from a debug pod (without mesh) succeed: `curl http://inventory-service.prod:8080/v1/stock` returns 200.
- The `inventory-service` pods are Running and Ready (3/3 replicas).
- The Envoy sidecar on `order-service` shows 0 healthy endpoints for `inventory-service` in its cluster config.
- Other mesh-to-mesh calls work fine (e.g., `order-service` -> `payment-service`).
- This started 20 minutes ago. No deployments to either service today.
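Commands like the following are a typical first look at this state (illustrative only: they require cluster access with `istioctl` installed, and `<order-service-pod>` is a placeholder for an actual pod name):

```shell
# How does order-service's Envoy see the inventory-service cluster?
istioctl proxy-config cluster <order-service-pod> -n prod \
  --fqdn inventory-service.prod.svc.cluster.local

# Which endpoints does Envoy consider healthy for that cluster?
istioctl proxy-config endpoint <order-service-pod> -n prod \
  --cluster "outbound|8080||inventory-service.prod.svc.cluster.local"

# Are sidecars in sync with the Istio control plane?
istioctl proxy-status
```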
The Misleading Signal¶
A 503 with "no healthy upstream" and the `UF` response flag points directly at a networking or service-mesh problem. The engineer's instinct is to check the Envoy configuration, Istio DestinationRules, health-check settings, and mTLS configuration. The fact that direct pod-to-pod communication works while mesh communication fails looks like an Istio control-plane issue or a misconfigured VirtualService.
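For orientation, an Istio AuthorizationPolicy of the kind the title hints at might look like this (a hypothetical sketch; all names are illustrative, not the actual policy behind this incident). With `action: ALLOW`, any caller whose identity does not match a rule is rejected by the sidecar, while a meshless debug pod can bypass enforcement entirely:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: inventory-service-policy   # hypothetical name
  namespace: prod
spec:
  selector:
    matchLabels:
      app: inventory-service
  action: ALLOW
  rules:
    - from:
        - source:
            # A wrong or stale service-account principal here would
            # silently lock out order-service's mesh traffic.
            principals: ["cluster.local/ns/prod/sa/order-service"]
```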