Level: L3 (Advanced) | Topics: Service Mesh | Domain: Kubernetes

Scenario: 100% 503 Errors After Mesh Rollout

The Prompt

"We enabled Istio on the grokdevops namespace. Immediately after, all requests return 503 errors. The app was working fine before. What happened?"

Initial Report

Slack: "Rolled out Istio to grokdevops namespace. All traffic is now 503. Pods show 2/2 Ready. App logs look fine - requests never reach the app."

Constraints

  • Time pressure: Production is down. Need to either fix or roll back within 10 minutes.
  • Quick rollback available: Can disable Istio injection and restart pods.
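
The quick rollback could look like the sketch below. It assumes injection was enabled through the istio-injection namespace label (a revision-based install would use istio.io/rev instead) and that the workload is a single Deployment. The rollback_mesh function name and the KUBECTL override are illustrative, not part of any tool.

```shell
# Rollback sketch: stop injecting sidecars, then recreate pods without them.
# Assumption: injection is driven by the istio-injection namespace label.
rollback_mesh() {
  ns="$1"
  deploy="$2"
  kubectl_bin="${KUBECTL:-kubectl}"   # overridable, e.g. KUBECTL=echo for a dry run

  # A trailing '-' removes the label from the namespace.
  "$kubectl_bin" label namespace "$ns" istio-injection- || return 1
  # Pods created by the restart come up without the istio-proxy container.
  "$kubectl_bin" rollout restart "deploy/$deploy" -n "$ns" || return 1
  # Block until the restart finishes or the timeout expires.
  "$kubectl_bin" rollout status "deploy/$deploy" -n "$ns" --timeout=120s
}
```

Calling rollback_mesh grokdevops grokdevops would run the three steps in order; afterwards the pods should report 1/1 containers instead of 2/2.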

Observable Evidence

  • All requests return "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
  • Pods show 2/2 containers (sidecar injected successfully)
  • App container logs show no incoming requests
  • istioctl analyze -n grokdevops shows warnings
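
When access logging is enabled, the istio-proxy log usually carries an Envoy response flag next to the 503, which narrows the cause faster than the error text alone. The lookup below is a minimal sketch: the flag meanings follow Envoy's access-log documentation, while the explain_flag helper is hypothetical and covers only a few common flags.

```shell
# Map a few Envoy access-log response flags to their meanings.
# Hypothetical helper for triage; the list is not exhaustive.
explain_flag() {
  case "$1" in
    UF)  echo "upstream connection failure" ;;
    UH)  echo "no healthy upstream hosts" ;;
    NR)  echo "no route configured for the request" ;;
    UC)  echo "upstream connection termination" ;;
    URX) echo "upstream retry limit exceeded" ;;
    UO)  echo "upstream overflow (circuit breaking)" ;;
    *)   echo "unknown flag: $1" ;;
  esac
}
```

For example, explain_flag UF prints "upstream connection failure", which would be consistent with the "connection failure" reset reason seen here.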

Expected Investigation Path

# 1. Quick check: is the sidecar injected?
kubectl get pods -n grokdevops -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}'

# 2. Check Istio proxy logs
kubectl logs deploy/grokdevops -n grokdevops -c istio-proxy --tail=50

# 3. Run istio analysis
istioctl analyze -n grokdevops

# 4. Check Service port naming
kubectl get svc grokdevops -n grokdevops -o yaml | grep -A5 ports:
# PROBLEM: port is named "web"; Istio expects <protocol>[-<suffix>], e.g. "http-web"

# 5. Check for a STRICT mTLS policy that would reject non-meshed clients
kubectl get peerauthentication -A

# 6. Fix the port name
kubectl patch svc grokdevops -n grokdevops --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/name","value":"http-web"}]'
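
Step 4 relies on spotting the bad name by eye. As a sketch, the same check can be made mechanical by piping port names through a small filter. The check_port_names function and its prefix list are illustrative assumptions; the list covers only some of the protocols Istio recognizes.

```shell
# Print port names that lack a recognized Istio protocol prefix.
# Reads one port name per line on stdin. The prefix list is partial.
check_port_names() {
  while IFS= read -r name; do
    case "$name" in
      http|http-*|http2|http2-*|https|https-*|grpc|grpc-*|tcp|tcp-*|tls|tls-*)
        ;;  # name follows <protocol>[-<suffix>]; nothing to report
      "")
        echo "unnamed port" ;;
      *)
        echo "no protocol prefix: $name" ;;
    esac
  done
}
```

One way to feed it is kubectl get svc grokdevops -n grokdevops -o jsonpath='{range .spec.ports[*]}{.name}{"\n"}{end}' | check_port_names. An empty result means every port name carries one of the listed prefixes.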

Root Cause Possibilities

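  1. Port naming: Istio selects the protocol from the Service port name, which should follow <protocol>[-<suffix>] (for example http-web, grpc-api, tcp-db), or from the appProtocol field. Ports without a recognized prefix fall back to automatic protocol detection where it is enabled; otherwise they are treated as plain TCP.
  2. Strict mTLS: PeerAuthentication is set to STRICT but some clients aren't meshed, so their plaintext connections are rejected.
  3. VirtualService misconfiguration: a route points to the wrong host or subset.
  4. Init container ordering: app init containers that need network access can fail when they run after Istio's traffic redirection is in place but before the sidecar proxy has started. This is less likely here, because the pods reached 2/2 Ready.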
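
If the strict mTLS possibility turns out to be the cause, one temporary mitigation is to relax mTLS for the namespace while the remaining clients are onboarded. The sketch below only renders the manifest; the permissive_policy function name is hypothetical, and applying the output is left to the operator.

```shell
# Render a namespace-wide PeerAuthentication that accepts both mTLS and
# plaintext traffic. Temporary mitigation only: tighten back to STRICT once
# every client is meshed.
permissive_policy() {
  ns="$1"
  cat <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ${ns}
spec:
  mtls:
    mode: PERMISSIVE
EOF
}
```

Running permissive_policy grokdevops | kubectl apply -f - would apply it. A namespace-level policy like this takes precedence over a mesh-wide policy in the root namespace, so it is worth removing once the incident is resolved.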

What a Strong Answer Includes

  • Immediate rollback option: "I'd first assess whether this is fixable in 5 minutes; otherwise, roll back by removing injection and restarting the pods"
  • Knowledge of Istio port naming convention
  • Checking istio-proxy logs for the actual error
  • Using istioctl analyze as a diagnostic tool
  • Understanding that 2/2 Ready doesn't mean the mesh is working correctly
  • Post-incident: test mesh enablement in staging first, use canary rollout per namespace
