
Remediation: Pod OOMKilled, Memory Leak Is in Sidecar, Fix Is Helm Values

Immediate Fix (DevOps Tooling — Domain C)

The fix is in the Helm values file, not in Kubernetes directly or in the observability stack.

Step 1: Revert the trace sampling in Helm values

$ cd devops/helm
$ grep -n "sampling" values-prod.yaml
47:      sampling: 100

# Fix the value
$ sed -i 's/sampling: 100/sampling: 1/' values-prod.yaml
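If editing the tracked values file in place feels risky, the edit can be rehearsed on a scratch copy first. A minimal sketch (the fallback fixture is illustrative, for when the file isn't present; `sed -i.bak` keeps a backup):

```shell
# Rehearse the edit on a scratch copy before touching the tracked file
cp values-prod.yaml /tmp/values-check.yaml 2>/dev/null || \
  printf 'tracing:\n  sampling: 100\n' > /tmp/values-check.yaml  # illustrative fixture
sed -i.bak 's/sampling: 100/sampling: 1/' /tmp/values-check.yaml
grep -n "sampling:" /tmp/values-check.yaml    # expect: sampling: 1
diff /tmp/values-check.yaml.bak /tmp/values-check.yaml || true   # review the one-line change
```

Once the diff shows only the intended one-line change, apply the same `sed` to the real file.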

Step 2: Deploy the fix

$ helm upgrade payment-service devops/helm/grokdevops \
    -f devops/helm/values-prod.yaml \
    -n prod \
    --wait --timeout=300s
Release "payment-service" has been upgraded. Happy Helming!
NAME: payment-service
LAST DEPLOYED: Thu Mar 19 14:45:00 2026
NAMESPACE: prod
STATUS: deployed
REVISION: 14

Step 3: Verify pods stabilize

$ kubectl rollout status deployment/payment-service -n prod
deployment "payment-service" successfully rolled out

$ kubectl get pods -n prod -l app=payment-service
NAME                               READY   STATUS    RESTARTS   AGE
payment-service-6a4c8e5b3-j7mnp   2/2     Running   0          2m
payment-service-6a4c8e5b3-k9vqr   2/2     Running   0          2m
payment-service-6a4c8e5b3-m3wtx   2/2     Running   0          2m

Verification

Domain A (Kubernetes) — No more OOMKills

$ kubectl get events -n prod --field-selector reason=OOMKilling --sort-by='.lastTimestamp'
# No new events since the fix

$ kubectl top pod -n prod -l app=payment-service --containers
POD                                NAME              CPU(cores)   MEMORY(bytes)
payment-service-6a4c8e5b3-j7mnp   payment-service   42m          112Mi
payment-service-6a4c8e5b3-j7mnp   istio-proxy       8m           47Mi

Sidecar memory dropped from 127Mi to 47Mi.
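The drop matters because the proxy had been running essentially at its limit. The OOMKills imply the sidecar's memory limit sat just above the observed 127Mi working set; assuming a 128Mi limit (an inference, not a measured value — confirm with `kubectl describe pod`), the before/after utilization works out to:

```shell
# Back-of-envelope utilization vs. an inferred 128Mi sidecar limit
awk 'BEGIN { printf "pre-fix:  %.0f%% of limit\n", 127/128*100 }'   # 99%
awk 'BEGIN { printf "post-fix: %.0f%% of limit\n",  47/128*100 }'   # 37%
```

Post-fix headroom of roughly 60 percentage points is consistent with the alert threshold proposed under Prevention.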

Domain B (Observability) — Trace sampling confirmed at 1%

$ kubectl get pod payment-service-6a4c8e5b3-j7mnp -n prod \
  -o jsonpath='{.metadata.annotations}' | jq '."proxy.istio.io/config"' -r | jq .
{
  "concurrency": 0,
  "tracing": {
    "sampling": 1
  }
}

Domain C (DevOps Tooling) — Helm values correct

$ helm get values payment-service -n prod | grep -A2 "sampling"
      sampling: 1

Prevention

  • Monitoring: Add a per-container memory utilization alert that fires when any container (including sidecars) exceeds 80% of its limit for 5 minutes.
- alert: ContainerMemoryNearLimit
  expr: |
    container_memory_working_set_bytes{container!=""}
      / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is above 80% of its memory limit"
(The `container!=""` filter excludes pod-level and pause-container series, and the `> 0` guard skips containers with no memory limit, which would otherwise divide by zero and fire spuriously.)
  • Runbook: Document that Istio trace sampling changes affect sidecar memory consumption. Never set sampling: 100 in production without proportionally increasing the sidecar memory limit.
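If a temporarily high sampling rate is genuinely required, the sidecar's memory limit should be raised alongside it in the same change. A sketch using Istio's per-pod injector annotations (`sidecar.istio.io/proxyMemoryLimit` is the standard annotation; the 512Mi value is illustrative):

```yaml
# Pod template annotations: raise the proxy limit whenever sampling goes up
metadata:
  annotations:
    proxy.istio.io/config: |
      tracing:
        sampling: 100
    sidecar.istio.io/proxyMemoryLimit: "512Mi"   # illustrative value, size for your span volume
```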

  • Architecture: Add a CI/CD check that flags changes to trace sampling in production values files. Use OPA/Gatekeeper to enforce maximum trace sampling rates per environment:

# OPA/Gatekeeper constraint (requires a ConstraintTemplate defining the K8sMaxTraceSampling kind)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sMaxTraceSampling
metadata:
  name: max-prod-trace-sampling
spec:
  match:
    namespaces: ["prod"]
  parameters:
    maxSampling: 10
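Gatekeeper only accepts the constraint above if a ConstraintTemplate defines the `K8sMaxTraceSampling` kind. A minimal sketch of that template — the Rego is illustrative and assumes the sampling rate can be parsed out of the pod's `proxy.istio.io/config` annotation as JSON (the annotation is often YAML in practice, which would need `yaml.to_json`-style handling instead):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8smaxtracesampling
spec:
  crd:
    spec:
      names:
        kind: K8sMaxTraceSampling
      validation:
        openAPIV3Schema:
          type: object
          properties:
            maxSampling:
              type: integer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8smaxtracesampling

        violation[{"msg": msg}] {
          # Assumption: annotation value is JSON-formatted proxy config
          cfg := input.review.object.metadata.annotations["proxy.istio.io/config"]
          parsed := json.unmarshal(cfg)
          parsed.tracing.sampling > input.parameters.maxSampling
          msg := sprintf("trace sampling %v exceeds allowed maximum %v",
                         [parsed.tracing.sampling, input.parameters.maxSampling])
        }
```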