Service Mesh Footguns¶
Mistakes that turn your mesh into a reliability liability instead of an improvement.
1. Missing port naming convention¶
Istio needs ports named with the protocol prefix (http-api, grpc-backend, tcp-db). You create a Service with port: 8080 and no name. Istio defaults to TCP and disables all L7 features — no retries, no routing, no metrics. Everything "works" but you get zero observability.
Fix: Name every port: name: http-api, name: grpc-backend. Istio uses the prefix to determine the protocol.
Debug clue: If your Istio metrics show zero request-level data (no HTTP status codes, no latency histograms) but traffic is flowing, check port naming first.
`istioctl analyze` will flag unnamed ports, and `istioctl x describe pod <name>` shows the detected protocol per port.
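A minimal Service manifest following the convention might look like this (the service name, labels, and port numbers are illustrative, not from the original incident):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend            # hypothetical service name
spec:
  selector:
    app: backend
  ports:
    - name: http-api       # "http-" prefix: Istio treats this port as HTTP, enabling L7 features
      port: 8080
      targetPort: 8080
    - name: grpc-backend   # "grpc-" prefix: gRPC-aware routing and metrics
      port: 9090
      targetPort: 9090
```

On recent Kubernetes versions you can also set the standard `appProtocol` field on each port, which Istio honors as well, but the name-prefix convention remains the most widely documented approach.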
2. Sidecar injection on jobs and migrations¶
You enable namespace-wide sidecar injection. Your database migration Job gets a sidecar. The migration container exits, but the sidecar keeps running, so the pod never terminates and the Job never completes. Your CI pipeline hangs waiting for the Job.
Fix: Disable the sidecar for Jobs with the sidecar.istio.io/inject: "false" annotation on the pod template. Or, on newer Istio and Kubernetes versions, run the proxy as a Kubernetes native sidecar container, which exits automatically once the main containers finish.
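The annotation goes on the pod template, not on the Job object itself. A sketch (Job name, image, and command are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate                       # hypothetical Job name
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false" # no sidecar, so the pod can terminate
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/migrate:latest  # hypothetical image
          command: ["./run-migrations.sh"]
```

The trade-off: without a sidecar the Job's traffic bypasses mTLS, so it may be blocked if its targets are in STRICT mode (see footgun 3).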
3. mTLS strict mode before all services have sidecars¶
You enable STRICT mTLS on a namespace. Three services don't have sidecars yet, so they can only speak plaintext. Every request they send into the namespace is now rejected, and nothing in the mesh can negotiate mTLS with them. Production is down.
Fix: Start with PERMISSIVE mode (accepts both plain and mTLS). Verify all services have sidecars with istioctl analyze. Only then switch to STRICT.
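The migration step can be expressed as a namespace-scoped PeerAuthentication (namespace name is illustrative):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments    # hypothetical namespace being migrated
spec:
  mtls:
    mode: PERMISSIVE     # accepts both plaintext and mTLS during rollout
```

Once every workload in the namespace has a sidecar and `istioctl analyze` is clean, change `mode: PERMISSIVE` to `mode: STRICT` in the same resource.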
4. VirtualService with no matching Gateway¶
You create a VirtualService with routing rules. You forget to bind it to a Gateway, or the Gateway hosts field doesn't match the VirtualService hosts. External traffic ignores your routing rules entirely. Internal traffic may work, making the bug intermittent.
Fix: Always specify gateways: in VirtualService. Match hostnames exactly between Gateway and VirtualService. Use istioctl analyze to catch mismatches.
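A correctly bound pair looks like this (gateway name, hostname, and destination are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: public-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: api-cert   # hypothetical TLS secret
      hosts:
        - "api.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-routes
spec:
  hosts:
    - "api.example.com"    # must match the Gateway's hosts entry
  gateways:
    - public-gateway       # explicit binding; omitting this is the footgun
  http:
    - route:
        - destination:
            host: api.default.svc.cluster.local
```

Note that a VirtualService with no `gateways:` field defaults to the internal `mesh` gateway only, which is exactly why internal traffic can still work while external routing is silently ignored.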
5. Retry storms¶
You configure retries: attempts: 5, perTryTimeout: 10s. When a backend is down, every incoming request is retried up to 5 times. With 3 levels of services each retrying, one client request can fan out to 5^3 = 125 attempts against the deepest backend. You've turned a partial outage into a complete one.
Fix: Limit retry attempts (2-3 max). Only retry on specific status codes (503, not 500). Combine retries with circuit breakers. Set retry budgets.
Under the hood: Retry amplification follows exponential math. With N levels of services each doing R retries, worst-case load on the deepest service is R^N. 3 retries across 4 layers = 81x amplification. Envoy's retry budget (default off) caps retries to a percentage of active requests — enable it.
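A bounded retry policy in a VirtualService might look like this (hostname is illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-retries
spec:
  hosts:
    - backend
  http:
    - retries:
        attempts: 2                    # keep low to bound amplification: 2^N, not 5^N
        perTryTimeout: 2s              # short enough that retries fit the caller's budget
        retryOn: "503,reset,connect-failure"  # retryable conditions only; never plain 500s
      route:
        - destination:
            host: backend
```

The `retryOn` list matters as much as the attempt count: a 503 usually means "try another replica", while a 500 means the request itself failed and will fail again.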
6. Ignoring sidecar resource usage¶
You didn't set resource requests on the Envoy sidecar. On a node with 50 pods, you have 50 Envoy proxies consuming ~50MB RAM each — 2.5GB unaccounted for. The node runs out of memory and pods get evicted.
Fix: Set sidecar resource requests in the global mesh config. Account for sidecar overhead in node sizing. Monitor envoy_server_memory_allocated.
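One way to set mesh-wide sidecar requests is through the IstioOperator values (the specific numbers are illustrative starting points, not recommendations for your workload):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 100m        # hypothetical baseline; measure your own proxies
            memory: 128Mi
          limits:
            memory: 256Mi    # cap memory so a leaking proxy can't take the node down
```

Individual workloads can override this with the `sidecar.istio.io/proxyCPU` and `sidecar.istio.io/proxyMemory` pod annotations if a high-traffic service needs a bigger proxy.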
7. Mutual TLS breaking health checks¶
Your cloud load balancer health checks fail because they can't speak mTLS. The LB marks all backends as unhealthy. Traffic stops flowing.
Fix: Use Istio's health check rewriting (enabled by default in recent versions). Or exclude health check paths from mTLS. Configure the LB to check a port that doesn't require mTLS.
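Excluding a dedicated health-check port from mTLS can be done with port-level overrides in a PeerAuthentication (the app label and port number are hypothetical):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: web-mtls
  namespace: web             # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: web               # hypothetical workload label
  mtls:
    mode: STRICT             # all application traffic still requires mTLS
  portLevelMtls:
    8081:                    # hypothetical port the cloud LB probes
      mode: DISABLE          # plaintext allowed on this one port only
```

Keep the exempted port limited to the health endpoint so you don't quietly reopen a plaintext path for application traffic.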
8. Traffic policy without fallback¶
You set up 90/10 canary routing. The canary pod crashes. 10% of traffic goes to a dead backend. Istio doesn't automatically reroute — it respects the weight you set.
Fix: Combine traffic policies with circuit breakers. Set outlierDetection to eject unhealthy hosts. Monitor canary health and automate rollback with Flagger or Argo Rollouts.
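Outlier detection lives in a DestinationRule; a sketch (host and thresholds are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # eject a host after 5 consecutive 5xx responses
      interval: 10s              # how often hosts are evaluated
      baseEjectionTime: 30s      # ejection duration grows with repeat offenses
      maxEjectionPercent: 100    # allow ejecting every replica of a dead canary
```

`maxEjectionPercent` defaults to a low value, so a fully dead canary subset can otherwise keep receiving its weighted share of traffic even with outlier detection on.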
9. Upgrading Istio without canary¶
You do an in-place Istio control plane upgrade. The new version has a breaking change in Envoy config. Every sidecar gets updated. Every service starts 503ing simultaneously.
Fix: Use canary upgrades (revision-based). Run two control planes side-by-side. Migrate namespaces one at a time. Roll back by relabeling namespaces to the old revision.
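With revision-based upgrades, the namespace label selects which control plane injects sidecars; rolling back is relabeling. A sketch (namespace name and revision tag are hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments             # hypothetical namespace being migrated
  labels:
    istio.io/rev: 1-20-0     # hypothetical revision of the new control plane
    # note: remove any istio-injection: enabled label when switching to
    # istio.io/rev; the two injection labels conflict
```

Pods pick up the new revision's sidecar only when they restart, so you migrate one namespace at a time with a rolling restart and watch for 503s before touching the next.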
War story: Istio 1.12 to 1.13 changed how `EnvoyFilter` resources were processed. Teams that did in-place upgrades discovered their custom filters silently stopped working, breaking auth and rate limiting across their mesh. Canary upgrades would have caught this in one namespace first.
10. Debug headers leaking to production¶
You enable Envoy debug headers (x-envoy-upstream-service-time, x-request-id) during troubleshooting. You forget to disable them. Clients see internal service topology in response headers.
Fix: Disable debug headers in production mesh config. Use them only in staging or during active debugging with a short TTL.
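Newer Istio versions expose header controls in the mesh config; a sketch along these lines, assuming the proxyHeaders API introduced around Istio 1.19 (verify the exact field names against your version's MeshConfig reference):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyHeaders:
        envoyDebugHeaders:
          disabled: true    # strip x-envoy-* debug headers from responses
        server:
          disabled: true    # don't advertise "server: istio-envoy" to clients
```

On older versions the same effect requires an EnvoyFilter that removes the headers at the gateway, which is considerably more fragile (see footgun 9).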