Symptoms: Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy

Domains: networking | kubernetes_ops | security
Level: L3
Estimated time: 45 min

Initial Alert

Istio service mesh alert fires at 16:08 UTC:

CRITICAL: istio_requests_total 503 rate > 5%
  destination_service: inventory-service.prod.svc.cluster.local
  source_service: order-service.prod.svc.cluster.local
  response_code: 503
  response_flags: UF (upstream connection failure)

Additional alerts:

WARNING: order-service — 503 errors on /v1/orders endpoint (depends on inventory-service)
WARNING: inventory-service — Envoy sidecar reporting "no healthy upstream"
CRITICAL: SLO breach — order completion rate dropped to 71% (target: 99.5%)

Observable Symptoms

  • order-service calls to inventory-service are failing with 503 "no healthy upstream."
  • Direct calls to inventory-service from a debug pod (without mesh) succeed: curl http://inventory-service.prod:8080/v1/stock returns 200.
  • The inventory-service pods are Running and Ready (3/3 replicas).
  • Envoy sidecar on order-service shows 0 healthy endpoints for inventory-service in its cluster config.
  • Other mesh-to-mesh calls work fine (e.g., order-service -> payment-service).
  • This started 20 minutes ago. No deployments to either service today.
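The symptoms above can be confirmed from the client side of the mesh. A minimal diagnostic sketch (pod and container names are placeholders; these commands need access to the affected cluster):

```shell
# Envoy's view of inventory-service endpoints, as seen from an order-service pod.
# ORDER_POD is a placeholder for an actual pod name in prod.
istioctl proxy-config endpoints "$ORDER_POD" -n prod \
  --cluster "outbound|8080||inventory-service.prod.svc.cluster.local"
# Expectation per the symptoms: zero HEALTHY endpoints listed.

# Compare with what Kubernetes itself considers ready — should show 3 addresses,
# confirming the pods are healthy and only Envoy disagrees.
kubectl get endpoints inventory-service -n prod

# Reproduce the failing call from inside the mesh ("app" container name is an assumption).
kubectl exec "$ORDER_POD" -n prod -c app -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://inventory-service.prod:8080/v1/stock
```

The gap between the Envoy endpoint list (empty) and the Kubernetes Endpoints object (3 ready addresses) is the concrete signal that the mesh layer, not the workload, is rejecting traffic.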

The Misleading Signal

A 503 with "no healthy upstream" and the UF response flag points directly at a networking or service-mesh problem. The engineer's instinct is to check the Envoy configuration, Istio DestinationRules, health-check settings, and mTLS configuration. The fact that direct pod-to-pod communication works while mesh communication fails makes this look like an Istio control-plane issue or a misconfigured VirtualService.
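Those instinctive checks might look like the following sketch (cluster-dependent; in this scenario they would all come back clean, while the last command — prompted by the root cause named in the title — would surface the offending AuthorizationPolicy):

```shell
# Is Envoy config in sync with istiod? (rules out stale xDS / control-plane lag)
istioctl proxy-status

# Routing and outlier-detection misconfigurations
kubectl get destinationrule,virtualservice -n prod

# Strict-mTLS mismatches between client and server sidecars
kubectl get peerauthentication -n prod -o yaml

# Mesh RBAC — the layer actually responsible here
kubectl get authorizationpolicy -n prod -o yaml
```

Note that a DENY or misscoped ALLOW AuthorizationPolicy does not show up in routing config, which is why the first three checks all pass while traffic still fails.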