Service Mesh - Street-Level Ops
Real-world patterns and gotchas from production service mesh operations.
Quick Diagnosis Commands
# Is the sidecar injected?
kubectl get pods -n myapp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}'
# Istio: analyze config for issues
istioctl analyze -n myapp
# Linkerd: full health check
linkerd check --proxy -n myapp
# Proxy logs (Istio)
kubectl logs deploy/my-service -n myapp -c istio-proxy --tail=50
# Proxy logs (Linkerd)
kubectl logs deploy/my-service -n myapp -c linkerd-proxy --tail=50
# Check mTLS status
linkerd viz edges deploy -n myapp
# or
istioctl x describe pod <pod-name> -n myapp
One-liner: Quick sidecar injection check across all namespaces:
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}' | grep -v istio-proxy | grep -v linkerd-proxy — shows pods WITHOUT sidecars.
Gotcha: Pod Not Getting Sidecar
Symptoms: Pod shows 1/1 instead of 2/2 containers.
Causes:
1. Namespace not labeled/annotated for injection
2. Pod has sidecar.istio.io/inject: "false" annotation
3. Pod was created before injection was enabled (restart needed)
4. The webhook is not running
# Check namespace labels
kubectl get ns myapp --show-labels
# Check webhook
kubectl get mutatingwebhookconfigurations | grep -E 'istio|linkerd'
# Force re-injection
kubectl rollout restart deployment -n myapp
Analogy: A service mesh sidecar is like a postal clerk sitting next to every employee in an office. Every letter (request) goes through the clerk, who stamps it (mTLS), logs it (metrics), and can redirect it (routing rules). Ambient mode replaces per-desk clerks with one clerk per floor (ztunnel per node) — same security, less overhead.
Gotcha: Port Naming (Istio)
Istio requires Service port names to follow <protocol>[-<suffix>] convention:
# BAD - Istio treats as raw TCP
ports:
  - name: web
    port: 8080
# GOOD - Istio applies HTTP routing
ports:
  - name: http-web
    port: 8080
Valid prefixes: http, http2, grpc, tcp, tls, https, mongo, redis, mysql.
Debug clue: If Istio metrics show all traffic as TCP instead of HTTP, the port naming is wrong. Check istioctl proxy-config listeners <pod> — if the listener shows "transport_protocol": "raw_buffer" instead of "http", Envoy is treating it as raw TCP.
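The naming convention can also be checked mechanically before you deploy. A minimal sketch in shell — the check_port_name helper is hypothetical (not an istioctl feature), and simply encodes the valid-prefix list above:

```shell
# Hypothetical helper: validate a Service port name against Istio's
# <protocol>[-<suffix>] convention.
check_port_name() {
  case "$1" in
    # Bare protocol name is valid...
    http|http2|grpc|tcp|tls|https|mongo|redis|mysql) echo valid ;;
    # ...and so is protocol plus a dash-separated suffix.
    http-*|http2-*|grpc-*|tcp-*|tls-*|https-*|mongo-*|redis-*|mysql-*) echo valid ;;
    *) echo invalid ;;
  esac
}

check_port_name http-web   # valid: recognized prefix plus suffix
check_port_name web        # invalid: Istio would fall back to raw TCP
```

In a real audit you would feed this the port names extracted from kubectl get svc -o jsonpath output across your namespaces.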
Gotcha: Resource Budgets
Every sidecar uses resources. Plan for it:
# Istio: set sidecar resource limits
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata: {}
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 128Mi
Rule of thumb: Budget 50-100m CPU and 64-128Mi memory per pod for the sidecar.
Scale note: At 500+ pods, sidecar overhead adds up to real money. 500 pods x 100m CPU = 50 CPU cores just for proxies. Istio Ambient mode (sidecar-less) eliminates per-pod overhead by using per-node ztunnels instead — benchmarks show ~70% memory savings. Evaluate it for large clusters.
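The scale math above is easy to script into a capacity check; a minimal sketch using the rule-of-thumb numbers (pod count and per-pod figures are the assumptions, not measurements):

```shell
# Back-of-envelope sidecar budget at the rule-of-thumb upper bound:
# 100m CPU and 128Mi memory per pod.
pods=500
cpu_millicores=$((pods * 100))   # 50000m = 50 full cores just for proxies
mem_mi=$((pods * 128))           # 64000Mi, roughly 62.5Gi
echo "${pods} pods -> ${cpu_millicores}m CPU, ${mem_mi}Mi memory"
```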
Pattern: Gradual Mesh Rollout
Don't mesh everything at once:
- Start with one non-critical namespace in permissive mode
- Verify metrics and no errors in the mesh dashboard
- Move to strict mTLS for that namespace
- Expand namespace by namespace
- Last: mesh critical production services
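In Istio, steps 1 and 3 map to a namespace-scoped PeerAuthentication. A sketch of the permissive starting point (the namespace name is a placeholder):

```yaml
# Step 1: permissive mode for one non-critical namespace - pods
# accept both plaintext and mTLS while you watch the dashboards.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: myapp          # placeholder namespace
spec:
  mtls:
    mode: PERMISSIVE        # flip to STRICT once metrics look clean
```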
Pattern: Exclude Jobs and CronJobs
Sidecars in Jobs prevent completion (the proxy never exits):
# Istio: hold the app until the proxy is ready, then quit the proxy
# explicitly when the job's work is done. Note the annotations go on
# the pod template, not the Job object - the injector reads pod metadata.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        proxy.istio.io/config: '{"holdApplicationUntilProxyStarts": true}'
    spec:
      containers:
        - name: my-job
          image: busybox   # placeholder image
          # At the end of the job, signal Envoy's agent to quit:
          command: ["sh", "-c", "do_work && curl -sf -XPOST http://localhost:15020/quitquitquit"]
Linkerd (linkerd.io/inject: enabled) has the same problem: the proxy does not exit on its own when the main container finishes. Wrap the job command with linkerd-await --shutdown, which waits for the app to exit and then shuts the proxy down; recent Linkerd versions that run the proxy as a Kubernetes native sidecar container avoid the issue entirely.
Gotcha: The quitquitquit endpoint only works if you set ISTIO_QUIT_API=true in the proxy or use Istio 1.12+. Without it, curl to the quit endpoint returns 404 and your Job hangs forever.

Default trap: Istio's PeerAuthentication defaults to PERMISSIVE mode, meaning it accepts both plaintext and mTLS traffic. This is safe for rollout but provides no security guarantee: any pod can talk plaintext. After mesh rollout is complete, set it to STRICT per-namespace to enforce mTLS everywhere.
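Closing the trap is a one-field change per namespace; a sketch (the namespace name is a placeholder):

```yaml
# Enforce mTLS for every workload in the namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: myapp          # placeholder namespace
spec:
  mtls:
    mode: STRICT            # plaintext connections are now rejected
```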
Pattern: Canary with Mesh
The mesh gives you traffic splitting without duplicate Ingress rules:
- Deploy canary with different version label
- Create TrafficSplit or VirtualService (10% canary)
- Monitor error rate in mesh dashboard
- Gradually shift traffic (10 -> 25 -> 50 -> 100%)
- If errors spike, shift back to 0% canary instantly
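With Istio, steps 2 and 4 come down to editing weights on a VirtualService. A sketch assuming a DestinationRule already defines stable and canary subsets by version label (all names here are placeholders):

```yaml
# 90/10 split between stable and canary subsets of my-service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
  namespace: myapp
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: stable    # defined in a DestinationRule (not shown)
          weight: 90
        - destination:
            host: my-service
            subset: canary
          weight: 10          # shift 10 -> 25 -> 50 -> 100 as checks pass
```

Rolling back is the same edit in reverse: set the canary weight to 0 and the shift takes effect without redeploying anything.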
Emergency: Disable Mesh Fast
If the mesh is causing a production outage:
# Option 1: Disable injection and restart (recommended)
kubectl label namespace myapp istio-injection- # Istio
kubectl annotate namespace myapp linkerd.io/inject- # Linkerd
kubectl rollout restart deployment -n myapp
# Option 2: Scale down control plane (nuclear option)
kubectl scale deploy istiod -n istio-system --replicas=0
# Note: existing proxies keep working with last config
Under the hood: When you scale down istiod, existing Envoy sidecars continue to route traffic using their last-known configuration (xDS snapshot). They will not get updates, but they will not crash. This is why Option 2 is a viable emergency move: it buys you time without immediately breaking traffic.
Quick Reference
- Cheatsheet: Service-Mesh