Investigation: Service Mesh 503s, Envoy Looks Misconfigured, Root Cause Is a Deleted RBAC ClusterRole¶
Phase 1: Networking Investigation (Dead End)¶
Check Envoy's view of the inventory-service endpoints:
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -c istio-proxy -- \
pilot-agent request GET clusters | grep "inventory-service"
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::cx_active::0
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::rq_error::47
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::health_flags::/failed_outlier_check
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.43:8080::health_flags::/failed_outlier_check
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.44:8080::health_flags::/failed_outlier_check
All three endpoints have been ejected as unhealthy. Check the Envoy config dump:
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -c istio-proxy -- \
pilot-agent request GET config_dump | jq '.configs[] | select(.["@type"] | contains("ClustersConfigDump"))' \
| grep -A5 "inventory-service"
# Shows outlier detection ejecting all endpoints
$ istioctl proxy-status
NAME CDS LDS EDS RDS ISTIOD
order-service-6b5d8c9f-x2k4j.prod SYNCED SYNCED STALE SYNCED istiod-5f8c7d6b-k2m3n
EDS (Endpoint Discovery Service) is STALE for the order-service proxy. Istiod is not pushing fresh endpoint information. Check Istiod:
$ kubectl logs -n istio-system deploy/istiod --tail=30 | grep "inventory-service\|error\|forbidden"
2026-03-19T15:48:02.112Z error ads EDS: error listing endpoints for inventory-service.prod: forbidden: User "system:serviceaccount:istio-system:istiod" cannot list resource "endpoints" in namespace "prod"
2026-03-19T15:48:32.113Z error ads EDS: error listing endpoints for inventory-service.prod: forbidden
Istiod cannot list endpoints in the prod namespace. This is a Kubernetes RBAC issue, not a mesh configuration issue.
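To see why a missing grant surfaces as a "forbidden" error, a toy RBAC evaluator helps. This is a deliberately simplified model, not the apiserver's implementation (real RBAC also handles API groups, wildcards, resourceNames, and role aggregation); all data in it is illustrative:

```python
# Toy RBAC check (simplified model, not the apiserver's implementation):
# a request is allowed iff some binding for the subject points at an
# *existing* role with a rule covering the verb and resource.

def allowed(verb, resource, sa, bindings, roles):
    for b in bindings:
        if not any(s["name"] == sa["name"] and s["namespace"] == sa["namespace"]
                   for s in b["subjects"]):
            continue
        role = roles.get(b["roleRef"]["name"])  # None if the role was deleted
        if role is None:
            continue  # a dangling binding grants nothing
        for rule in role["rules"]:
            if verb in rule["verbs"] and resource in rule["resources"]:
                return True
    return False

istiod = {"name": "istiod", "namespace": "istio-system"}
bindings = [{"roleRef": {"name": "istiod-clusterrole"}, "subjects": [istiod]}]
roles = {"istiod-clusterrole": {"rules": [{"verbs": ["list", "watch"],
                                           "resources": ["endpoints"]}]}}

print(allowed("list", "endpoints", istiod, bindings, roles))  # True
del roles["istiod-clusterrole"]  # what the cleanup controller did
print(allowed("list", "endpoints", istiod, bindings, roles))  # False -> "forbidden"
```

The key property: deleting the role flips the answer to deny even though the binding and the ServiceAccount are untouched, which is exactly the state found below.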
The Pivot¶
Why can Istiod no longer list endpoints? It could before. Check the ClusterRoleBinding:
$ kubectl get clusterrolebinding istiod-clusterrole-binding -o yaml | head -20
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: istiod-clusterrole-binding
creationTimestamp: "2025-11-15T10:00:00Z"
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: istiod-clusterrole
subjects:
- kind: ServiceAccount
name: istiod
namespace: istio-system
The binding exists. Check the ClusterRole:
$ kubectl get clusterrole istiod-clusterrole -o yaml
Error from server (NotFound): clusterroles.rbac.authorization.k8s.io "istiod-clusterrole" not found
The ClusterRole was deleted. The binding still references it, but the role no longer exists.
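This "dangling binding" state is easy to scan for across a whole cluster. A minimal sketch, using dict shapes that mirror `kubectl get ... -o json` output (the sample data below is illustrative, not from a live cluster):

```python
# Minimal sketch: flag ClusterRoleBindings whose roleRef points at a
# ClusterRole that no longer exists ("dangling" bindings).

def find_dangling_bindings(bindings, cluster_roles):
    """Return names of bindings whose referenced ClusterRole is missing."""
    existing = {role["metadata"]["name"] for role in cluster_roles}
    return [
        b["metadata"]["name"]
        for b in bindings
        if b["roleRef"]["kind"] == "ClusterRole"
        and b["roleRef"]["name"] not in existing
    ]

cluster_roles = [{"metadata": {"name": "view"}}]  # istiod-clusterrole is gone
bindings = [
    {"metadata": {"name": "istiod-clusterrole-binding"},
     "roleRef": {"kind": "ClusterRole", "name": "istiod-clusterrole"}},
    {"metadata": {"name": "viewers"},
     "roleRef": {"kind": "ClusterRole", "name": "view"}},
]

print(find_dangling_bindings(bindings, cluster_roles))  # ['istiod-clusterrole-binding']
```

Kubernetes does not garbage-collect or warn about such bindings, which is why the binding looked healthy in isolation.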
Phase 2: Kubernetes Investigation (Root Cause)¶
When was the ClusterRole deleted?
$ kubectl get events --all-namespaces --field-selector reason=Delete --sort-by='.lastTimestamp' | grep -i "clusterrole\|istiod"
# Events older than 1h are already evicted
# Check the audit log
$ kubectl logs -n kube-system kube-apiserver-master-01 --since=2h 2>/dev/null | \
grep "istiod-clusterrole" | tail -3
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","verb":"delete",
"objectRef":{"resource":"clusterroles","name":"istiod-clusterrole"},
"user":{"username":"system:serviceaccount:kube-system:rbac-cleanup-controller"},
"requestReceivedTimestamp":"2026-03-19T15:47:41Z"}
An rbac-cleanup-controller in kube-system deleted the ClusterRole 20 minutes ago. Check this controller:
$ kubectl get deployment rbac-cleanup-controller -n kube-system -o yaml | grep -A5 "image"
- image: internal/rbac-cleanup:v1.2.0
args:
- --cleanup-orphaned-roles
- --max-age=90d
- --dry-run=false
The RBAC cleanup controller removes ClusterRoles that its heuristic considers "orphaned", i.e. not referenced by any active ServiceAccount. It incorrectly flagged the Istio ClusterRole because the referencing ServiceAccount lives in istio-system while the ClusterRole is cluster-scoped, and the heuristic does not resolve cross-namespace references through ClusterRoleBindings correctly.
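The class of bug can be illustrated with a toy version of such a heuristic. The controller's real code is not visible here, so this is an assumption about the failure mode, with the expected namespace hard-coded purely for illustration:

```python
# Toy model of the orphan heuristic (assumed, not the controller's real code).
# The buggy version only accepts ServiceAccount subjects in a "home" namespace
# it expects (hard-coded to "kube-system" here for illustration); the fixed
# version treats ANY ClusterRoleBinding reference as keeping the role alive,
# regardless of which namespace the subjects live in.

def is_orphaned_buggy(role_name, bindings):
    for b in bindings:
        if b["roleRef"]["name"] != role_name:
            continue
        # Bug: subjects outside the expected namespace are ignored.
        if any(s.get("namespace") == "kube-system" for s in b["subjects"]):
            return False
    return True  # flagged orphaned -> eligible for deletion

def is_orphaned_fixed(role_name, bindings):
    # A ClusterRole is referenced if any binding points at it at all.
    return not any(b["roleRef"]["name"] == role_name for b in bindings)

bindings = [{
    "roleRef": {"kind": "ClusterRole", "name": "istiod-clusterrole"},
    "subjects": [{"kind": "ServiceAccount", "name": "istiod",
                  "namespace": "istio-system"}],
}]

print(is_orphaned_buggy("istiod-clusterrole", bindings))  # True  -> role deleted
print(is_orphaned_fixed("istiod-clusterrole", bindings))  # False -> role kept
```

Under the buggy logic the istio-system subject never counts as a reference, so the role is deleted even though it is actively in use.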
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was service mesh 503 errors (networking); the root cause was a deleted RBAC ClusterRole that broke Istiod's access to endpoint data (kubernetes_ops); and the trigger was an over-aggressive RBAC cleanup controller deployed as a security hardening measure (security). This pattern is common because service meshes depend on Kubernetes RBAC for control-plane access, and automated RBAC cleanup tools can break that dependency if they do not understand the full relationship between ClusterRoles, ClusterRoleBindings, and cross-namespace ServiceAccounts.
Root Cause¶
An automated RBAC cleanup controller, deployed as part of a security hardening initiative, incorrectly identified the Istio ClusterRole as orphaned and deleted it. Without the ClusterRole, Istiod lost permission to list endpoints in application namespaces. Envoy sidecars stopped receiving updated endpoint information, marking all endpoints as unhealthy and returning 503 errors.