Investigation: Service Mesh 503s, Envoy Looks Misconfigured, Root Cause Is a Deleted RBAC ClusterRole¶
Phase 1: Networking Investigation (Dead End)¶
Check Envoy's view of the inventory-service endpoints:
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -c istio-proxy -- \
pilot-agent request GET clusters | grep "inventory-service"
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::cx_active::0
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::rq_error::47
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::health_flags::/failed_outlier_check
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.43:8080::health_flags::/failed_outlier_check
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.44:8080::health_flags::/failed_outlier_check
All three endpoints have been ejected as unhealthy. Check the Envoy config dump:
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -c istio-proxy -- \
pilot-agent request GET config_dump | jq '.configs[] | select(.["@type"] | contains("ClustersConfigDump"))' \
| grep -A5 "inventory-service"
# Shows outlier detection ejecting all endpoints
$ istioctl proxy-status
NAME CDS LDS EDS RDS ISTIOD
order-service-6b5d8c9f-x2k4j.prod SYNCED SYNCED STALE SYNCED istiod-5f8c7d6b-k2m3n
EDS (Endpoint Discovery Service) is STALE for the order-service proxy. Istiod is not pushing fresh endpoint information. Check Istiod:
$ kubectl logs -n istio-system deploy/istiod --tail=30 | grep "inventory-service\|error\|forbidden"
2026-03-19T15:48:02.112Z error ads EDS: error listing endpoints for inventory-service.prod: forbidden: User "system:serviceaccount:istio-system:istiod" cannot list resource "endpoints" in namespace "prod"
2026-03-19T15:48:32.113Z error ads EDS: error listing endpoints for inventory-service.prod: forbidden
Istiod cannot list endpoints in the prod namespace. This is a Kubernetes RBAC issue, not a mesh configuration issue.
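To see why a missing grant surfaces as a "forbidden" error, a toy RBAC evaluator helps. This is a deliberately simplified model, not the apiserver's implementation (real RBAC also handles API groups, wildcards, resourceNames, and role aggregation); all data in it is illustrative:

```python
# Toy RBAC check (simplified model, not the apiserver's implementation):
# a request is allowed iff some binding for the subject points at an
# *existing* role with a rule covering the verb and resource.

def allowed(verb, resource, sa, bindings, roles):
    for b in bindings:
        if not any(s["name"] == sa["name"] and s["namespace"] == sa["namespace"]
                   for s in b["subjects"]):
            continue
        role = roles.get(b["roleRef"]["name"])  # None if the role was deleted
        if role is None:
            continue  # a dangling binding grants nothing
        for rule in role["rules"]:
            if verb in rule["verbs"] and resource in rule["resources"]:
                return True
    return False

istiod = {"name": "istiod", "namespace": "istio-system"}
bindings = [{"roleRef": {"name": "istiod-clusterrole"}, "subjects": [istiod]}]
roles = {"istiod-clusterrole": {"rules": [{"verbs": ["list", "watch"],
                                           "resources": ["endpoints"]}]}}

print(allowed("list", "endpoints", istiod, bindings, roles))  # True
del roles["istiod-clusterrole"]  # what the cleanup controller did
print(allowed("list", "endpoints", istiod, bindings, roles))  # False -> "forbidden"
```

The key property: deleting the role flips the answer to deny even though the binding and the ServiceAccount are untouched, which is exactly the state found below.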
The Pivot¶
Why can Istiod no longer list endpoints? It could before. Check the ClusterRoleBinding:
$ kubectl get clusterrolebinding istiod-clusterrole-binding -o yaml | head -20
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: istiod-clusterrole-binding
creationTimestamp: "2025-11-15T10:00:00Z"
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: istiod-clusterrole
subjects:
- kind: ServiceAccount
name: istiod
namespace: istio-system
The binding exists. Check the ClusterRole:
$ kubectl get clusterrole istiod-clusterrole -o yaml
Error from server (NotFound): clusterroles.rbac.authorization.k8s.io "istiod-clusterrole" not found
The ClusterRole was deleted. The binding still references it, but the role no longer exists.
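This "dangling binding" state is easy to scan for across a whole cluster. A minimal sketch, using dict shapes that mirror `kubectl get ... -o json` output (the sample data below is illustrative, not from a live cluster):

```python
# Minimal sketch: flag ClusterRoleBindings whose roleRef points at a
# ClusterRole that no longer exists ("dangling" bindings).

def find_dangling_bindings(bindings, cluster_roles):
    """Return names of bindings whose referenced ClusterRole is missing."""
    existing = {role["metadata"]["name"] for role in cluster_roles}
    return [
        b["metadata"]["name"]
        for b in bindings
        if b["roleRef"]["kind"] == "ClusterRole"
        and b["roleRef"]["name"] not in existing
    ]

cluster_roles = [{"metadata": {"name": "view"}}]  # istiod-clusterrole is gone
bindings = [
    {"metadata": {"name": "istiod-clusterrole-binding"},
     "roleRef": {"kind": "ClusterRole", "name": "istiod-clusterrole"}},
    {"metadata": {"name": "viewers"},
     "roleRef": {"kind": "ClusterRole", "name": "view"}},
]

print(find_dangling_bindings(bindings, cluster_roles))  # ['istiod-clusterrole-binding']
```

Kubernetes does not garbage-collect or warn about such bindings, which is why the binding looked healthy in isolation.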
Phase 2: Kubernetes Investigation (Root Cause)¶
When was the ClusterRole deleted?
$ kubectl get events --all-namespaces --field-selector reason=Delete --sort-by='.lastTimestamp' | grep -i "clusterrole\|istiod"
# Events older than 1h are already evicted
# Check the audit log
$ kubectl logs -n kube-system kube-apiserver-master-01 --since=2h 2>/dev/null | \
grep "istiod-clusterrole" | tail -3
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","verb":"delete",
"objectRef":{"resource":"clusterroles","name":"istiod-clusterrole"},
"user":{"username":"system:serviceaccount:kube-system:rbac-cleanup-controller"},
"requestReceivedTimestamp":"2026-03-19T15:47:41Z"}
An rbac-cleanup-controller in kube-system deleted the ClusterRole 20 minutes ago. Check this controller:
$ kubectl get deployment rbac-cleanup-controller -n kube-system -o yaml | grep -A5 "image"
- image: internal/rbac-cleanup:v1.2.0
args:
- --cleanup-orphaned-roles
- --max-age=90d
- --dry-run=false
The RBAC cleanup controller removes ClusterRoles that its heuristic considers "orphaned", i.e. not referenced by any active ServiceAccount. It incorrectly flagged the Istio ClusterRole because the referencing ServiceAccount lives in istio-system while the ClusterRole is cluster-scoped, and the heuristic does not resolve cross-namespace references through ClusterRoleBindings correctly.
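The class of bug can be illustrated with a toy version of such a heuristic. The controller's real code is not visible here, so this is an assumption about the failure mode, with the expected namespace hard-coded purely for illustration:

```python
# Toy model of the orphan heuristic (assumed, not the controller's real code).
# The buggy version only accepts ServiceAccount subjects in a "home" namespace
# it expects (hard-coded to "kube-system" here for illustration); the fixed
# version treats ANY ClusterRoleBinding reference as keeping the role alive,
# regardless of which namespace the subjects live in.

def is_orphaned_buggy(role_name, bindings):
    for b in bindings:
        if b["roleRef"]["name"] != role_name:
            continue
        # Bug: subjects outside the expected namespace are ignored.
        if any(s.get("namespace") == "kube-system" for s in b["subjects"]):
            return False
    return True  # flagged orphaned -> eligible for deletion

def is_orphaned_fixed(role_name, bindings):
    # A ClusterRole is referenced if any binding points at it at all.
    return not any(b["roleRef"]["name"] == role_name for b in bindings)

bindings = [{
    "roleRef": {"kind": "ClusterRole", "name": "istiod-clusterrole"},
    "subjects": [{"kind": "ServiceAccount", "name": "istiod",
                  "namespace": "istio-system"}],
}]

print(is_orphaned_buggy("istiod-clusterrole", bindings))  # True  -> role deleted
print(is_orphaned_fixed("istiod-clusterrole", bindings))  # False -> role kept
```

Under the buggy logic the istio-system subject never counts as a reference, so the role is deleted even though it is actively in use.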
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was service mesh 503 errors (networking); the root cause was a deleted RBAC ClusterRole that broke Istiod's access to endpoint data (kubernetes_ops); and the trigger was an over-aggressive RBAC cleanup controller deployed as a security hardening measure (security). This pattern is common because service meshes depend on Kubernetes RBAC for control-plane access, and automated RBAC cleanup tools can break that dependency if they do not understand the full relationship between ClusterRoles, ClusterRoleBindings, and cross-namespace ServiceAccounts.
Root Cause¶
An automated RBAC cleanup controller, deployed as part of a security hardening initiative, incorrectly identified the Istio ClusterRole as orphaned and deleted it. Without the ClusterRole, Istiod lost permission to list endpoints in application namespaces. Envoy sidecars stopped receiving updated endpoint information, marking all endpoints as unhealthy and returning 503 errors.