Istio Service Mesh — Street-Level Ops¶
Quick Diagnosis Commands¶
# Validate all Istio config in the cluster for issues
istioctl analyze
# Validate config in a specific namespace
istioctl analyze -n bookinfo
# Analyze local YAML files before applying
istioctl analyze ./my-virtualservice.yaml
# Check proxy sync status across all pods
istioctl proxy-status
# Check a specific pod's sync state (SYNCED vs STALE)
istioctl proxy-status <pod-name>.<namespace>
# Dump all routes a sidecar knows about
istioctl proxy-config routes <pod-name>.<namespace>
# Dump clusters (upstream services) visible to a sidecar
istioctl proxy-config clusters <pod-name>.<namespace>
# Dump listeners (inbound port configs) on a sidecar
istioctl proxy-config listeners <pod-name>.<namespace>
# Dump endpoints for a specific cluster
istioctl proxy-config endpoints <pod-name>.<namespace> --cluster "outbound|9080||reviews.bookinfo.svc.cluster.local"
# Full bootstrap config for a sidecar
istioctl proxy-config bootstrap <pod-name>.<namespace>
# Check istiod logs (config push errors, cert issues)
kubectl logs -l app=istiod -n istio-system --tail=100 -f
# Describe what Istio config applies to a pod (routing, policies, auth)
istioctl experimental describe pod <pod-name>.<namespace>
# Launch Kiali dashboard (requires kiali deployment)
istioctl dashboard kiali
# Launch Envoy admin UI for a specific sidecar
istioctl dashboard envoy <pod-name>.<namespace>
# Launch Jaeger tracing dashboard
istioctl dashboard jaeger
# Check installed Istio version and components
istioctl version
Gotcha: Proxy Status Shows STALE — Config Not Syncing¶
Symptom: istioctl proxy-status shows STALE for one or more pods. The VirtualService or DestinationRule you applied has no effect.
Cause: istiod could not push updated xDS config to the sidecar. Common reasons: istiod is overloaded, the sidecar lost its gRPC stream to istiod, or the new config failed validation silently.
Diagnosis and fix:
# Check for validation errors in the config you applied
istioctl analyze -n <namespace>
# Check istiod for push errors
kubectl logs -l app=istiod -n istio-system | grep -i "error\|NACK\|rejected"
# Force the sidecar to reconnect — restart the pod
kubectl rollout restart deployment/<deployment> -n <namespace>
# If istiod itself looks unhealthy
kubectl -n istio-system rollout restart deployment/istiod
kubectl -n istio-system get pods -l app=istiod -w
Rule: Always run istioctl analyze before investigating why a routing change has no effect. The most common cause of "Istio isn't working" is a config error that istiod rejected silently.
Debug clue:
istioctl proxy-status shows a sync column per xDS type: CDS, LDS, EDS, and RDS (cluster, listener, endpoint, and route discovery). If one column shows STALE while the others show SYNCED, the issue is specific to that config type. CDS stale = DestinationRule problem. LDS stale = listener or Gateway problem. RDS stale = VirtualService routing problem. EDS stale = endpoint discovery problem (usually a Service selector mismatch).
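When many pods are listed, a small filter makes the stale ones jump out. A sketch: the `filter_stale` helper name and the snapshot below are hypothetical; in practice you would pipe real `istioctl proxy-status` output through it.

```shell
# Keep the header row plus any pod whose sync state contains STALE.
# Real use: istioctl proxy-status | filter_stale
filter_stale() { awk 'NR==1 || /STALE/'; }

# Hypothetical snapshot, for illustration only:
filter_stale <<'EOF'
NAME                          CDS     LDS     EDS     RDS
productpage-v1-abc.bookinfo   SYNCED  SYNCED  SYNCED  SYNCED
reviews-v2-def.bookinfo       SYNCED  STALE   SYNCED  SYNCED
EOF
```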
Gotcha: mTLS PERMISSIVE Hiding Real Failures¶
You have PERMISSIVE mTLS. A service is accepting plaintext from an unknown caller and you don't realize it because there are no errors — just unencrypted traffic. Or you think mTLS is working because traffic flows, but the connection is actually plaintext because the caller has no sidecar.
Diagnosis:
# Check the effective mTLS mode on a workload
istioctl experimental describe pod <pod>.<ns>
# Look for "mTLS: PERMISSIVE" vs "mTLS: STRICT"
# Check if a connection is mTLS in access logs
kubectl logs <pod> -c istio-proxy | grep '"connection_security_policy"'
# mTLS shows: "mutual_tls"
# Plaintext shows: "none" or "unknown"
# Check PeerAuthentication in effect
kubectl get peerauthentication -A
Fix: Switch to STRICT once all services have sidecars. Do it namespace by namespace, not all at once:
# Enable STRICT for one namespace
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: bookinfo
spec:
  mtls:
    mode: STRICT
EOF
# Verify no traffic is breaking (watch error rates in Kiali or metrics)
Gotcha: Sidecar Resource Limits Too Low¶
Under load, the Envoy sidecar OOMs or throttles, causing request failures that look like application bugs. The sidecar container is killed and restarted; during the restart, all traffic through that pod drops.
Symptoms: Intermittent 503s on one specific pod; pod restarts; OOMKilled in pod events.
# Check if the sidecar is OOMKilling
kubectl describe pod <pod> -n <ns> | grep -A5 "istio-proxy"
# Look for: "OOMKilled", "Reason: OOMKilled"
# Current memory usage
kubectl top pod <pod> -n <ns> --containers
# Check current sidecar resource settings
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[?(@.name=="istio-proxy")].resources}'
Fix: Set proxy resource limits in IstioOperator or per-pod annotation:
# Per-pod annotation (override for resource-heavy pods)
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "200m"
    sidecar.istio.io/proxyMemory: "256Mi"
    sidecar.istio.io/proxyCPULimit: "500m"
    sidecar.istio.io/proxyMemoryLimit: "512Mi"
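The same defaults can be set mesh-wide through IstioOperator values instead of annotating individual pods. A sketch; the numbers are illustrative and should be tuned to your traffic:

```yaml
# Mesh-wide sidecar resource defaults (values are examples, not recommendations)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
```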
Gotcha: Init Container Race — App Starts Before Proxy Ready¶
Symptom: Application fails at startup with connection errors to other services (DNS resolution failure, immediate connection refused). It works fine after a restart.
Cause: The istio-init init container programs iptables before any application container starts, but the istio-proxy sidecar may not have completed its xDS sync with istiod by the time the application begins making outbound calls. Those calls are redirected into an Envoy that is not yet ready to handle them, so they fail.
Fix:
# In IstioOperator or MeshConfig
spec:
  meshConfig:
    defaultConfig:
      holdApplicationUntilProxyStarts: true
Default trap: Without holdApplicationUntilProxyStarts: true, your application container may start making outbound calls before Envoy is ready to accept the traffic that iptables already redirects to it. The first few requests fail with connection refused or DNS resolution errors. This is especially insidious because it only happens on cold starts: the app works fine after the initial failures, making it look like a flaky dependency.
This setting can also be applied per pod with the proxy.istio.io/config annotation.
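A sketch of the per-pod override, which sets a ProxyConfig fragment inside the proxy.istio.io/config annotation on the pod template:

```yaml
# Pod template metadata: applies the hold only to this workload
metadata:
  annotations:
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true
```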
Pattern: Canary Rollout Verification¶
# 1. Apply DestinationRule with v1 and v2 subsets
kubectl apply -f destinationrule-reviews.yaml
# 2. Start with 5% canary traffic
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
  namespace: bookinfo
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 95
    - destination:
        host: reviews
        subset: v2
      weight: 5
EOF
# 3. Generate steady traffic and watch responses in real time
kubectl -n bookinfo exec -it <load-gen-pod> -- watch -n2 \
"curl -s http://reviews:9080/health"
# 4. Verify traffic split in proxy metrics
istioctl proxy-config clusters <productpage-pod>.bookinfo \
| grep reviews
# 5. Check actual request distribution
kubectl logs <reviews-v2-pod> -c istio-proxy | \
grep '"response_code":"200"' | wc -l
# 6. Promote: shift to 50%, then 100%
kubectl patch virtualservice reviews -n bookinfo --type=json \
-p='[{"op":"replace","path":"/spec/http/0/route/0/weight","value":50},
{"op":"replace","path":"/spec/http/0/route/1/weight","value":50}]'
# 7. Complete rollout: delete VirtualService (traffic goes to all pods equally)
kubectl delete virtualservice reviews -n bookinfo
Pattern: mTLS Migration Checklist¶
Use this when moving a namespace from PERMISSIVE to STRICT:
# 1. Verify all pods in the namespace have sidecars
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
| grep -v istio-proxy
# 2. Check for any callers from outside the mesh (no sidecar)
kubectl logs -l app=<service> -n <ns> -c istio-proxy \
| grep '"connection_security_policy":"none"'
# Any "none" entries = plaintext callers that will break under STRICT
# 3. Check PeerAuthentication currently in effect
istioctl experimental describe pod <pod>.<ns> | grep -i mtls
# 4. Apply STRICT to namespace (not mesh-wide yet)
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: <ns>
spec:
  mtls:
    mode: STRICT
EOF
# 5. Watch for 503 errors (broken plaintext callers) immediately after
kubectl logs -l istio=ingressgateway -n istio-system -f | grep '"response_code":"5'
# 6. If all clear for 30 minutes, expand to next namespace
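Step 2 above surfaces plaintext callers; to quantify them before flipping to STRICT, a small tally over the access logs helps. A sketch: the `tally_security` helper name and the log lines below are hypothetical; in practice pipe the real sidecar logs through it.

```shell
# Count access-log entries per connection_security_policy value.
# Real use: kubectl logs -l app=<service> -n <ns> -c istio-proxy | tally_security
tally_security() { grep -o '"connection_security_policy":"[^"]*"' | sort | uniq -c; }

# Hypothetical JSON access-log sample; any "none" lines are plaintext callers
# that will break under STRICT.
tally_security <<'EOF'
{"connection_security_policy":"mutual_tls","response_code":200}
{"connection_security_policy":"none","response_code":200}
{"connection_security_policy":"mutual_tls","response_code":200}
EOF
```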
Pattern: Debug with istioctl experimental describe¶
The describe command synthesizes the effective Istio configuration for a specific pod, including which VirtualServices, DestinationRules, AuthorizationPolicies, and PeerAuthentications apply. It is the fastest way to understand why traffic to a pod behaves unexpectedly.
istioctl experimental describe pod reviews-v2-abc123-xyz.bookinfo
# Example output sections to look for:
# - "Service: reviews -> Port 9080 (http)"
# - "VirtualService: reviews/bookinfo" (which VS applies)
# - "DestinationRule: reviews/bookinfo" (which DR applies)
# - "PeerAuthentication: STRICT" (mTLS mode)
# - "AuthorizationPolicy: reviews-policy ALLOW"
# - "WARNING: ..." (config issues)
Scenario: AuthorizationPolicy Blocking Kubernetes Health Checks¶
You applied an AuthorizationPolicy. Pod readiness probes start failing, so Kubernetes removes the pod from Service endpoints; failing liveness probes restart the container. The restart loop repeats on each pod, cascading across your Deployment.
Cause: Kubernetes health check probes come from the kubelet, which has no Istio sidecar. Under a DENY-by-default AuthorizationPolicy, these probes are blocked.
# Confirm by checking sidecar logs for probe path rejections
kubectl logs <pod> -c istio-proxy | grep "403\|health\|readiness\|liveness"
# Fix: explicitly allow health check paths from any source
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-healthchecks
  namespace: <ns>
spec:
  selector:
    matchLabels:
      app: <app>
  action: ALLOW
  rules:
  - to:
    - operation:
        paths: ["/health", "/healthz", "/ready", "/readyz", "/metrics"]
EOF
Useful One-Liners¶
# List all VirtualServices and their hosts
kubectl get virtualservices -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.hosts}{"\n"}{end}'
# Find all DestinationRules
kubectl get destinationrules -A
# Check which pods have Istio sidecar injection enabled
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.containers[*].name}{"\n"}{end}' | grep istio-proxy
# Check all PeerAuthentication policies
kubectl get peerauthentication -A
# Check all AuthorizationPolicies
kubectl get authorizationpolicy -A
# Tail access logs from a specific service's sidecar
kubectl logs -l app=reviews -n bookinfo -c istio-proxy -f
# Check Envoy stats for a sidecar (queries the Envoy admin interface on localhost:15000 inside the container)
kubectl exec <pod> -c istio-proxy -- curl -s http://localhost:15000/stats | grep upstream_rq
# Verify the mTLS certificates loaded on a sidecar (chain, validity window)
istioctl proxy-config secret <pod>.<ns>
# Check Istio injection label on namespaces
kubectl get ns --show-labels | grep istio-injection
# Dump entire Envoy xDS config for a sidecar (large output)
istioctl proxy-config all <pod>.<ns> -o json > envoy-config-dump.json
Quick Reference¶
- Cheatsheet: Service Mesh