Remediation: Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy¶
Immediate Fix (Security — Domain C)¶
The fix has two parts: restore the deleted Istio ClusterRole, then configure the RBAC cleanup controller to exclude system roles so the deletion cannot recur.
Step 1: Restore the Istio ClusterRole¶
# Re-apply the Istio ClusterRole from the Istio installation manifests.
# Note: kubectl's --selector matches labels only (a kind cannot be selected
# this way), so filter on the app=istiod label.
$ istioctl manifest generate --set profile=default | \
    kubectl apply -f - --selector 'app=istiod' --dry-run=server
clusterrole.rbac.authorization.k8s.io/istiod-clusterrole created (server dry run)
$ istioctl manifest generate --set profile=default | \
    kubectl apply -f - --selector 'app=istiod'
clusterrole.rbac.authorization.k8s.io/istiod-clusterrole created
# (output truncated; other app=istiod resources already exist and report "unchanged")
Step 2: Verify Istiod recovers¶
$ kubectl logs -n istio-system deploy/istiod --tail=5
2026-03-19T16:15:02.112Z info ads EDS: successfully listed endpoints for inventory-service.prod
2026-03-19T16:15:02.115Z info ads Pushing EDS update for inventory-service.prod.svc.cluster.local
$ istioctl proxy-status
NAME CDS LDS EDS RDS ISTIOD
order-service-6b5d8c9f-x2k4j.prod SYNCED SYNCED SYNCED SYNCED istiod-5f8c7d6b-k2m3n
EDS is now SYNCED.
Step 3: Add exclusion labels to the RBAC cleanup controller¶
# Label all Istio ClusterRoles to exclude from cleanup
$ kubectl label clusterrole istiod-clusterrole \
rbac-cleanup/exclude=true \
app.kubernetes.io/managed-by=istio
# Update the cleanup controller configuration
$ kubectl patch deployment rbac-cleanup-controller -n kube-system --type=json \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--exclude-label=rbac-cleanup/exclude=true"}]'
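If the JSON patch applies cleanly, the Deployment's pod template should end up roughly as follows. This is a sketch for orientation only: the container name is assumed, and the pre-existing args are taken from the Domain C verification output later in this runbook.

```yaml
# Expected shape of the patched Deployment (unrelated fields omitted;
# container name assumed for illustration)
spec:
  template:
    spec:
      containers:
        - name: rbac-cleanup-controller
          args:
            - --cleanup-orphaned-roles
            - --max-age=90d
            - --dry-run=false
            - --exclude-label=rbac-cleanup/exclude=true
```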
Step 4: Add protection for all system ClusterRoles¶
# Label all system-critical ClusterRoles
$ for role in istiod-clusterrole cert-manager-controller system:coredns; do
kubectl label clusterrole $role rbac-cleanup/exclude=true 2>/dev/null || true
done
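To keep Step 4 honest over time, the label coverage can be audited in CI. The sketch below is a hypothetical POSIX-sh helper: the argument to `check_protected` stands in for what `kubectl get clusterrole -l rbac-cleanup/exclude=true -o name` would return on a live cluster, and the list of critical roles mirrors the loop above.

```shell
#!/bin/sh
# Report critical ClusterRoles that do not yet carry the exclusion label.
# check_protected takes `kubectl get ... -o name` style output as its argument.
check_protected() {
  critical="istiod-clusterrole cert-manager-controller system:coredns"
  labeled="$1"
  for role in $critical; do
    case "$labeled" in
      *"/$role"*) ;;                   # already labeled: nothing to report
      *) echo "UNPROTECTED: $role" ;;  # missing the exclusion label
    esac
  done
}

# Sample input: only istiod-clusterrole has been labeled so far.
check_protected "clusterrole.rbac.authorization.k8s.io/istiod-clusterrole"
```

Wiring the function's exit into the pipeline (fail when any `UNPROTECTED` line is printed) turns this into a cheap regression guard.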
Verification¶
Domain A (Networking) — Mesh traffic flowing¶
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -c istio-proxy -- \
pilot-agent request GET clusters | grep "inventory-service.*healthy"
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::healthy::1
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.43:8080::healthy::1
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.44:8080::healthy::1
# Test a request
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -- \
curl -s http://inventory-service.prod:8080/v1/stock | head -1
{"items":142,"warehouse":"us-east-1"}
Domain B (Kubernetes) — ClusterRole exists, Istiod has access¶
$ kubectl get clusterrole istiod-clusterrole
NAME CREATED AT
istiod-clusterrole 2026-03-19T16:15:00Z
$ kubectl auth can-i list endpoints --namespace prod \
--as=system:serviceaccount:istio-system:istiod
yes
Domain C (Security) — Cleanup controller configured with exclusions¶
$ kubectl get deployment rbac-cleanup-controller -n kube-system \
-o jsonpath='{.spec.template.spec.containers[0].args}'
["--cleanup-orphaned-roles","--max-age=90d","--dry-run=false","--exclude-label=rbac-cleanup/exclude=true"]
Prevention¶
- Monitoring: Add a ClusterRole existence check for critical system components. Alert immediately when a system ClusterRole is deleted.
- alert: CriticalClusterRoleDeleted
  # absent() with a regex matcher fires only when *every* matching series
  # is gone, so check each critical role individually.
  expr: |
    absent(kube_clusterrole_info{clusterrole="istiod-clusterrole"})
      or absent(kube_clusterrole_info{clusterrole="cert-manager-controller"})
      or absent(kube_clusterrole_info{clusterrole="system:coredns"})
  for: 1m
  labels:
    severity: critical
- Runbook: RBAC cleanup controllers must have an exclusion list that covers all system-component ClusterRoles. Never run RBAC cleanup in --dry-run=false mode without reviewing the deletion list first.
- Architecture: Use Kubernetes Admission Webhooks (e.g., OPA/Gatekeeper) to deny deletion of ClusterRoles with specific labels. Implement a "break-glass" label requirement for deleting system RBAC resources.
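As one concrete sketch of that admission-control idea, the same rule can be expressed without OPA/Gatekeeper using a built-in ValidatingAdmissionPolicy (GA in Kubernetes v1.30+); policy and binding names here are illustrative:

```yaml
# Sketch only: deny DELETE of any ClusterRole labeled rbac-cleanup/exclude=true
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: protect-system-clusterroles
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["rbac.authorization.k8s.io"]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["clusterroles"]
  validations:
    # On DELETE, `object` is null and `oldObject` holds the resource being removed.
    - expression: >-
        !has(oldObject.metadata.labels) ||
        !('rbac-cleanup/exclude' in oldObject.metadata.labels) ||
        oldObject.metadata.labels['rbac-cleanup/exclude'] != 'true'
      message: "ClusterRole is protected (rbac-cleanup/exclude=true); use the break-glass procedure."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: protect-system-clusterroles-binding
spec:
  policyName: protect-system-clusterroles
  validationActions: ["Deny"]
```

A break-glass path could then be modeled as a second label (checked in the same CEL expression) that a human must add explicitly before deletion is allowed.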