Investigation: Pod OOMKilled, Memory Leak Is in Sidecar, Fix Is Helm Values

Phase 1: Kubernetes Investigation (Dead End)

The engineer checks the OOMKill details:

$ kubectl describe pod payment-service-7f8b9c6d4-xk2nm -n prod | grep -A5 "Last State"
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 19 Mar 2026 14:28:11 +0000
      Finished:     Thu, 19 Mar 2026 14:31:47 +0000
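Exit code 137 itself already encodes the kill: POSIX shells report 128 plus the signal number, and signal 9 is SIGKILL, which is what the kernel OOM killer delivers. A quick sanity check:

```shell
# Exit code 137 = 128 + signal number; signal 9 is SIGKILL,
# the signal the kernel OOM killer sends (no SIGTERM grace period).
code=137
echo "signal number: $((code - 128))"
# -> signal number: 9
```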
$ kubectl top pod payment-service-7f8b9c6d4-xk2nm -n prod --containers
POD                                    NAME               CPU(cores)   MEMORY(bytes)
payment-service-7f8b9c6d4-xk2nm       payment-service    45m          118Mi
payment-service-7f8b9c6d4-xk2nm       istio-proxy        12m          127Mi

Wait: the payment-service container is only using 118Mi, and the pod has a 256Mi total limit. But the istio-proxy sidecar is using 127Mi. Combined: 245Mi, still within the pod's memory limit. The container-level limits tell a different story:

$ kubectl get pod payment-service-7f8b9c6d4-xk2nm -n prod -o jsonpath='{.spec.containers[*].name}'
payment-service istio-proxy

$ kubectl get pod payment-service-7f8b9c6d4-xk2nm -n prod \
  -o jsonpath='{range .spec.containers[*]}{.name}: {.resources.limits.memory}{"\n"}{end}'
payment-service: 256Mi
istio-proxy: 128Mi
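The same per-container limits can be pulled with jq, which is easier to extend than jsonpath once you start comparing limits to usage. A minimal sketch against a saved copy of the pod JSON (the heredoc below is a trimmed, hypothetical stand-in for the live `kubectl get pod ... -o json` output):

```shell
# Trimmed, hypothetical stand-in for `kubectl get pod ... -o json`.
cat > /tmp/pod.json <<'EOF'
{"spec":{"containers":[
  {"name":"payment-service","resources":{"limits":{"memory":"256Mi"}}},
  {"name":"istio-proxy","resources":{"limits":{"memory":"128Mi"}}}
]}}
EOF
# One "name: limit" line per container.
jq -r '.spec.containers[] | "\(.name): \(.resources.limits.memory)"' /tmp/pod.json
# -> payment-service: 256Mi
# -> istio-proxy: 128Mi
```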

The payment-service container has a 256Mi limit and is using 118Mi: plenty of headroom. The istio-proxy sidecar has a 128Mi limit and is using 127Mi: it is about to OOM. Yet Kubernetes reports the OOMKill against the payment-service container name, because it is the Linux kernel's OOM killer (not the kubelet) that acts when the pod-level cgroup hits its limit: it scores processes across both containers and kills whichever one happened to allocate last, and the kubelet then attributes the kill to that victim's container.

The Pivot

The key clue: kubectl top shows the sidecar at 127Mi/128Mi, not the app. Checking the sidecar's memory trend:

$ kubectl logs payment-service-7f8b9c6d4-xk2nm -c istio-proxy -n prod --tail=20
2026-03-19T14:30:12.882Z  warning  envoy config  gRPC config stream closed: 14, upstream connect error
2026-03-19T14:30:13.004Z  warning  envoy main    caught SIGTERM
# Prometheus query for sidecar memory
# container_memory_working_set_bytes{pod=~"payment-service.*", container="istio-proxy"}
# Shows a steady climb from 64Mi to 128Mi over 2 hours
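The two samples in that trend already imply the runway. A back-of-the-envelope leak rate (a sketch; the 64Mi and 128Mi figures come from the Prometheus trend above):

```shell
# Leak-rate estimate from the two working-set samples above:
# 64Mi at the start of the climb, 128Mi two hours later.
start_mib=64; end_mib=128; hours=2
awk -v s="$start_mib" -v e="$end_mib" -v h="$hours" \
    'BEGIN { printf "leak rate: %.1f MiB/hour\n", (e - s) / h }'
# -> leak rate: 32.0 MiB/hour
```

At roughly 32 MiB/hour against a 128Mi limit, each restart buys about the same two-hour window the trend shows.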

Phase 2: Observability Investigation (Root Cause)

The istio-proxy sidecar is leaking memory. But why did this start 2 hours ago? Check what changed:

$ kubectl get pod payment-service-7f8b9c6d4-xk2nm -n prod \
  -o jsonpath='{.spec.containers[?(@.name=="istio-proxy")].image}'
docker.io/istio/proxyv2:1.20.3

The Istio version did not change. But check the sidecar configuration:

$ kubectl get pod payment-service-7f8b9c6d4-xk2nm -n prod \
  -o jsonpath='{.metadata.annotations}' | jq . | grep -i proxy
  "sidecar.istio.io/proxyMemoryLimit": "128Mi",
  "proxy.istio.io/config": "{\"concurrency\":0,\"tracing\":{\"sampling\":100}}"

tracing.sampling: 100 means 100% trace sampling. The previous deployment had sampling: 1 (1%). The Helm values change bumped trace sampling to 100%, causing the Envoy proxy to buffer 100x more trace data in memory.
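At the annotation level the fix is a one-field change. As a sketch, here is the corrected proxy config (the annotation value is itself a JSON string, so jq edits it cleanly):

```shell
# Drop trace sampling from the debug value of 100% back to the previous 1%.
echo '{"concurrency":0,"tracing":{"sampling":100}}' \
  | jq -c '.tracing.sampling = 1'
# -> {"concurrency":0,"tracing":{"sampling":1}}
```

In practice this value should be changed in the Helm values file rather than patched on the pod, so the fix survives the next deployment.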

$ helm diff upgrade payment-service devops/helm/grokdevops -f devops/helm/values-prod.yaml -n prod 2>/dev/null | grep -A3 "sampling"
-  proxy.istio.io/config: '{"concurrency":0,"tracing":{"sampling":1}}'
+  proxy.istio.io/config: '{"concurrency":0,"tracing":{"sampling":100}}'

Domain Bridge: Why This Crossed Domains

Key insight: the symptom was a Kubernetes OOMKill reported against the application container, but the actual memory pressure came from the Envoy sidecar, an observability concern (trace sampling), and the trigger was a Helm values change that set trace sampling to 100%. This pattern is common because sidecar containers share the pod's resource constraints: a configuration change to an observability sidecar can starve the primary application container, and pod-level OOMKill attribution can point at the wrong container.

Root Cause

The Helm values file (values-prod.yaml) was updated to set Istio proxy trace sampling from 1% to 100%. This was intended as a temporary debug setting but was committed to prod values. The 100x increase in trace buffering caused the Envoy sidecar to consume its full 128Mi memory limit, triggering pod-level OOM kills that were misattributed to the payment-service container.
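A durable follow-up is a guardrail so a debug-only sampling value cannot be committed to prod values again. A minimal sketch, assuming the sampling value appears literally in the values file (the trimmed heredoc below is a hypothetical stand-in for devops/helm/values-prod.yaml):

```shell
# Hypothetical trimmed stand-in for devops/helm/values-prod.yaml.
cat > /tmp/values-prod.yaml <<'EOF'
podAnnotations:
  proxy.istio.io/config: '{"concurrency":0,"tracing":{"sampling":1}}'
EOF
# CI guardrail: fail the pipeline if prod values request 100% trace sampling.
if grep -q '"sampling":100' /tmp/values-prod.yaml; then
  echo "FAIL: 100% trace sampling committed to prod values"
  exit 1
fi
echo "OK: trace sampling within bounds"
# -> OK: trace sampling within bounds
```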