Service Mesh - Street-Level Ops
Real-world patterns and gotchas from production service mesh operations.
Quick Diagnosis Commands
# Is the sidecar injected?
kubectl get pods -n myapp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}'
# Istio: analyze config for issues
istioctl analyze -n myapp
# Linkerd: full health check
linkerd check --proxy -n myapp
# Proxy logs (Istio)
kubectl logs deploy/my-service -n myapp -c istio-proxy --tail=50
# Proxy logs (Linkerd)
kubectl logs deploy/my-service -n myapp -c linkerd-proxy --tail=50
# Check mTLS status
linkerd viz edges deploy -n myapp
# or
istioctl x describe pod <pod-name> -n myapp
One-liner: Quick sidecar injection check across all namespaces:
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}' | grep -v istio-proxy | grep -v linkerd-proxy — shows pods WITHOUT sidecars.
Gotcha: Pod Not Getting Sidecar
Symptoms: Pod shows 1/1 instead of 2/2 containers.
Causes:
1. Namespace not labeled/annotated for injection
2. Pod has sidecar.istio.io/inject: "false" annotation
3. Pod was created before injection was enabled (restart needed)
4. The webhook is not running
# Check namespace labels
kubectl get ns myapp --show-labels
# Check webhook
kubectl get mutatingwebhookconfigurations | grep -E 'istio|linkerd'
# Force re-injection
kubectl rollout restart deployment -n myapp
Analogy: A service mesh sidecar is like a postal clerk sitting next to every employee in an office. Every letter (request) goes through the clerk, who stamps it (mTLS), logs it (metrics), and can redirect it (routing rules). Ambient mode replaces per-desk clerks with one clerk per floor (ztunnel per node) — same security, less overhead.
Gotcha: Port Naming (Istio)
Istio requires Service port names to follow <protocol>[-<suffix>] convention:
# BAD - Istio treats as raw TCP
ports:
  - name: web
    port: 8080
# GOOD - Istio applies HTTP routing
ports:
  - name: http-web
    port: 8080
Valid prefixes: http, http2, grpc, tcp, tls, https, mongo, redis, mysql.
Debug clue: If Istio metrics show all traffic as TCP instead of HTTP, the port naming is wrong. Check istioctl proxy-config listeners <pod> — if the listener shows "transport_protocol": "raw_buffer" instead of "http", Envoy is treating it as raw TCP.
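The naming convention can also be checked mechanically before you deploy. A minimal sketch in shell — the check_port_name helper is hypothetical (not an istioctl feature), and simply encodes the valid-prefix list above:

```shell
# Hypothetical helper: validate a Service port name against Istio's
# <protocol>[-<suffix>] convention.
check_port_name() {
  case "$1" in
    # Bare protocol name is valid...
    http|http2|grpc|tcp|tls|https|mongo|redis|mysql) echo valid ;;
    # ...and so is protocol plus a dash-separated suffix.
    http-*|http2-*|grpc-*|tcp-*|tls-*|https-*|mongo-*|redis-*|mysql-*) echo valid ;;
    *) echo invalid ;;
  esac
}

check_port_name http-web   # valid: recognized prefix plus suffix
check_port_name web        # invalid: Istio would fall back to raw TCP
```

In a real audit you would feed this the port names extracted from kubectl get svc -o jsonpath output across your namespaces.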
Gotcha: Resource Budgets
Every sidecar uses resources. Plan for it:
# Istio: set sidecar resource limits
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata: {}
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 128Mi
Rule of thumb: Budget 50-100m CPU and 64-128Mi memory per pod for the sidecar.
Scale note: At 500+ pods, sidecar overhead adds up to real money. 500 pods x 100m CPU = 50 CPU cores just for proxies. Istio Ambient mode (sidecar-less) eliminates per-pod overhead by using per-node ztunnels instead — benchmarks show ~70% memory savings. Evaluate it for large clusters.
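The scale math above is easy to script into a capacity check; a minimal sketch using the rule-of-thumb numbers (pod count and per-pod figures are the assumptions, not measurements):

```shell
# Back-of-envelope sidecar budget at the rule-of-thumb upper bound:
# 100m CPU and 128Mi memory per pod.
pods=500
cpu_millicores=$((pods * 100))   # 50000m = 50 full cores just for proxies
mem_mi=$((pods * 128))           # 64000Mi, roughly 62.5Gi
echo "${pods} pods -> ${cpu_millicores}m CPU, ${mem_mi}Mi memory"
```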
Pattern: Gradual Mesh Rollout
Don't mesh everything at once:
- Start with one non-critical namespace in permissive mode
- Verify metrics and no errors in the mesh dashboard
- Move to strict mTLS for that namespace
- Expand namespace by namespace
- Last: mesh critical production services
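In Istio, steps 1 and 3 map to a namespace-scoped PeerAuthentication. A sketch of the permissive starting point (the namespace name is a placeholder):

```yaml
# Step 1: permissive mode for one non-critical namespace - pods
# accept both plaintext and mTLS while you watch the dashboards.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: myapp          # placeholder namespace
spec:
  mtls:
    mode: PERMISSIVE        # flip to STRICT once metrics look clean
```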
Pattern: Exclude Jobs and CronJobs
Sidecars in Jobs prevent completion (the proxy never exits):
# Istio: hold the app until the proxy is ready, then quit the proxy
# explicitly when the job's work is done. Note the annotations go on
# the pod template, not the Job object - the injector reads pod metadata.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        proxy.istio.io/config: '{"holdApplicationUntilProxyStarts": true}'
    spec:
      containers:
        - name: my-job
          image: busybox   # placeholder image
          # At the end of the job, signal Envoy's agent to quit:
          command: ["sh", "-c", "do_work && curl -sf -XPOST http://localhost:15020/quitquitquit"]
Linkerd (linkerd.io/inject: enabled) has the same problem: the proxy does not exit on its own when the main container finishes. Wrap the job command with linkerd-await --shutdown, which waits for the app to exit and then shuts the proxy down; recent Linkerd versions that run the proxy as a Kubernetes native sidecar container avoid the issue entirely.
Gotcha: The quitquitquit endpoint only works if you set ISTIO_QUIT_API=true in the proxy or use Istio 1.12+. Without it, curl to the quit endpoint returns 404 and your Job hangs forever.

Default trap: Istio's PeerAuthentication defaults to PERMISSIVE mode, meaning it accepts both plaintext and mTLS traffic. This is safe for rollout but provides no security guarantee: any pod can talk plaintext. After mesh rollout is complete, set it to STRICT per-namespace to enforce mTLS everywhere.
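Closing the trap is a one-field change per namespace; a sketch (the namespace name is a placeholder):

```yaml
# Enforce mTLS for every workload in the namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: myapp          # placeholder namespace
spec:
  mtls:
    mode: STRICT            # plaintext connections are now rejected
```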
Pattern: Canary with Mesh
The mesh gives you traffic splitting without duplicate Ingress rules:
- Deploy canary with different version label
- Create TrafficSplit or VirtualService (10% canary)
- Monitor error rate in mesh dashboard
- Gradually shift traffic (10 -> 25 -> 50 -> 100%)
- If errors spike, shift back to 0% canary instantly
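With Istio, steps 2 and 4 come down to editing weights on a VirtualService. A sketch assuming a DestinationRule already defines stable and canary subsets by version label (all names here are placeholders):

```yaml
# 90/10 split between stable and canary subsets of my-service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
  namespace: myapp
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: stable    # defined in a DestinationRule (not shown)
          weight: 90
        - destination:
            host: my-service
            subset: canary
          weight: 10          # shift 10 -> 25 -> 50 -> 100 as checks pass
```

Rolling back is the same edit in reverse: set the canary weight to 0 and the shift takes effect without redeploying anything.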
Emergency: Disable Mesh Fast
If the mesh is causing a production outage:
# Option 1: Disable injection and restart (recommended)
kubectl label namespace myapp istio-injection- # Istio
kubectl annotate namespace myapp linkerd.io/inject- # Linkerd
kubectl rollout restart deployment -n myapp
# Option 2: Scale down control plane (nuclear option)
kubectl scale deploy istiod -n istio-system --replicas=0
# Note: existing proxies keep working with last config
Under the hood: When you scale down istiod, existing Envoy sidecars continue to route traffic using their last-known configuration (xDS snapshot). They will not get updates, but they will not crash. This is why Option 2 is a viable emergency move: it buys you time without immediately breaking traffic.
Quick Reference
- Cheatsheet: Service-Mesh