Remediation: Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy

Immediate Fix (Security — Domain C)

The fix requires restoring the RBAC ClusterRole and configuring the cleanup controller to exclude system roles.

Step 1: Restore the Istio ClusterRole

# Re-apply the Istio ClusterRole from the Istio installation manifests.
# Note: kubectl's --selector matches labels only (it cannot filter on kind),
# so select on the app=istiod label; apply is idempotent, so re-applying the
# other istiod-labeled resources is harmless.
$ istioctl manifest generate --set profile=default | \
    kubectl apply -f - --selector app=istiod --dry-run=server
clusterrole.rbac.authorization.k8s.io/istiod-clusterrole created (server dry run)
...

$ istioctl manifest generate --set profile=default | \
    kubectl apply -f - --selector app=istiod
clusterrole.rbac.authorization.k8s.io/istiod-clusterrole created
...

Step 2: Verify Istiod recovers

$ kubectl logs -n istio-system deploy/istiod --tail=5
2026-03-19T16:15:02.112Z  info  ads  EDS: successfully listed endpoints for inventory-service.prod
2026-03-19T16:15:02.115Z  info  ads  Pushing EDS update for inventory-service.prod.svc.cluster.local

$ istioctl proxy-status
NAME                                                  CDS     LDS     EDS     RDS     ISTIOD
order-service-6b5d8c9f-x2k4j.prod                    SYNCED  SYNCED  SYNCED  SYNCED  istiod-5f8c7d6b-k2m3n

EDS is now SYNCED.

Step 3: Add exclusion labels to the RBAC cleanup controller

# Label all Istio ClusterRoles to exclude from cleanup
$ kubectl label clusterrole istiod-clusterrole \
    rbac-cleanup/exclude=true \
    app.kubernetes.io/managed-by=istio

# Update the cleanup controller configuration
$ kubectl patch deployment rbac-cleanup-controller -n kube-system --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--exclude-label=rbac-cleanup/exclude=true"}]'

Step 4: Add protection for all system ClusterRoles

# Label all system-critical ClusterRoles (--overwrite makes the loop
# idempotent; roles absent from this cluster are silently skipped)
$ for role in istiod-clusterrole cert-manager-controller system:coredns; do
    kubectl label clusterrole "$role" rbac-cleanup/exclude=true --overwrite 2>/dev/null || true
done
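
To confirm the labels landed, the protected roles can be listed directly by the exclusion label (a quick sanity check; the label key is the one applied above):

```shell
# List every ClusterRole carrying the exclusion label
kubectl get clusterroles -l rbac-cleanup/exclude=true
```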

Verification

Domain A (Networking) — Mesh traffic flowing

$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -c istio-proxy -- \
    pilot-agent request GET clusters | grep "inventory-service.*healthy"
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.42:8080::healthy::1
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.43:8080::healthy::1
outbound|8080||inventory-service.prod.svc.cluster.local::10.244.3.44:8080::healthy::1

# Test a request
$ kubectl exec order-service-6b5d8c9f-x2k4j -n prod -- \
    curl -s http://inventory-service.prod:8080/v1/stock | head -1
{"items":142,"warehouse":"us-east-1"}

Domain B (Kubernetes) — ClusterRole exists, Istiod has access

$ kubectl get clusterrole istiod-clusterrole
NAME                  CREATED AT
istiod-clusterrole    2026-03-19T16:15:00Z

$ kubectl auth can-i list endpoints --namespace prod \
    --as=system:serviceaccount:istio-system:istiod
yes

Domain C (Security) — Cleanup controller configured with exclusions

$ kubectl get deployment rbac-cleanup-controller -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
["--cleanup-orphaned-roles","--max-age=90d","--dry-run=false","--exclude-label=rbac-cleanup/exclude=true"]

Prevention

  • Monitoring: Add a ClusterRole existence check for critical system components. Alert immediately when a system ClusterRole is deleted.
- alert: CriticalClusterRoleDeleted
  # absent() over a regex matcher only fires once *every* matching series
  # is gone, so check each critical role individually
  expr: |
    absent(kube_clusterrole_info{clusterrole="istiod-clusterrole"})
      or absent(kube_clusterrole_info{clusterrole="cert-manager-controller"})
  for: 1m
  labels:
    severity: critical
  • Runbook: RBAC cleanup controllers must maintain an exclusion list covering all system-component ClusterRoles. Never run the cleanup with --dry-run=false until the deletion list produced by a dry run has been reviewed.

  • Architecture: Use Kubernetes Admission Webhooks (e.g., OPA/Gatekeeper) to deny deletion of ClusterRoles with specific labels. Implement a "break-glass" label requirement for deleting system RBAC resources.
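
As a sketch of the admission-control approach, Kubernetes' built-in ValidatingAdmissionPolicy (stable since v1.30) can deny deletion of labeled ClusterRoles without requiring Gatekeeper. The policy name and message below are illustrative, and a ValidatingAdmissionPolicyBinding is still required to put it into effect:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: protect-system-clusterroles   # illustrative name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["rbac.authorization.k8s.io"]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["clusterroles"]
  validations:
    # On DELETE requests, the existing object is exposed to CEL as oldObject
    - expression: >-
        !has(oldObject.metadata.labels) ||
        !('rbac-cleanup/exclude' in oldObject.metadata.labels) ||
        oldObject.metadata.labels['rbac-cleanup/exclude'] != 'true'
      message: >-
        Protected ClusterRole: remove the rbac-cleanup/exclude label
        (break-glass) before deleting.
```

If an OPA/Gatekeeper deployment already exists, the same rule can be expressed as a Rego ConstraintTemplate instead; the native policy simply avoids the extra dependency.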