Remediation: Pod OOMKilled, Memory Leak Is in Sidecar, Fix Is Helm Values¶
Immediate Fix (DevOps Tooling — Domain C)¶
The fix is in the Helm values file, not in Kubernetes directly or in the observability stack.
Step 1: Revert the trace sampling in Helm values¶
$ cd devops/helm
$ grep -n "sampling" values-prod.yaml
47: sampling: 100
# Fix the value
$ sed -i 's/sampling: 100/sampling: 1/' values-prod.yaml
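`sed` on YAML is blunt: if the key ever appeared more than once, the substitution could touch the wrong line. A small guard can verify the edit landed exactly once (a sketch; the helper name is ours, and it assumes the key is literally `sampling:`):

```shell
# Sketch: confirm values-prod.yaml has exactly one `sampling:` key
# and that it now reads 1.
check_sampling() {
  file="$1"
  count=$(grep -c '^[[:space:]]*sampling:' "$file")
  if [ "$count" -ne 1 ]; then
    echo "expected one sampling key, found $count" >&2
    return 1
  fi
  grep -Eq '^[[:space:]]*sampling:[[:space:]]*1$' "$file" && echo "sampling is 1"
}
# Usage: check_sampling values-prod.yaml
```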
Step 2: Deploy the fix¶
$ helm upgrade payment-service devops/helm/grokdevops \
-f devops/helm/values-prod.yaml \
-n prod \
--wait --timeout=300s
Release "payment-service" has been upgraded. Happy Helming!
NAME: payment-service
LAST DEPLOYED: Thu Mar 19 14:45:00 2026
NAMESPACE: prod
STATUS: deployed
REVISION: 14
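Before moving on, it is worth confirming that the values Helm actually deployed carry the new rate. `helm get values` prints the release's user-supplied values; piping it through a small checker (the function name is ours) keeps the assertion scriptable:

```shell
# Sketch: assert that the deployed release's values contain `sampling: 1`.
# Reads YAML on stdin so it can be fed from `helm get values`.
assert_sampling_one() {
  grep -Eq '^[[:space:]]*sampling:[[:space:]]*1$' && echo "deployed sampling is 1"
}
# Against the live release:
#   helm get values payment-service -n prod | assert_sampling_one
```

If the check fails, `helm rollback payment-service 13 -n prod` returns to the previous revision (13, since the upgrade above produced revision 14).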
Step 3: Verify pods stabilize¶
$ kubectl rollout status deployment/payment-service -n prod
deployment "payment-service" successfully rolled out
$ kubectl get pods -n prod -l app=payment-service
NAME                              READY   STATUS    RESTARTS   AGE
payment-service-6a4c8e5b3-j7mnp   2/2     Running   0          2m
payment-service-6a4c8e5b3-k9vqr   2/2     Running   0          2m
payment-service-6a4c8e5b3-m3wtx   2/2     Running   0          2m
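A `RESTARTS` column that stays at zero over a soak period is the real stability signal; a small helper (name ours) turns the raw restart counters into a pass/fail check:

```shell
# Sketch: given the restartCount values of all payment-service containers,
# succeed only if every counter is still zero.
all_restarts_zero() {
  for r in "$@"; do
    [ "$r" -eq 0 ] || { echo "restart count $r detected" >&2; return 1; }
  done
  echo "all containers stable"
}
# Usage against the cluster:
#   all_restarts_zero $(kubectl get pods -n prod -l app=payment-service \
#     -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}')
```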
Verification¶
Domain A (Kubernetes) — No more OOMKills¶
$ kubectl get events -n prod --field-selector reason=OOMKilling --sort-by='.lastTimestamp'
# No new events since the fix
$ kubectl top pod -n prod -l app=payment-service --containers
POD                               NAME              CPU(cores)   MEMORY(bytes)
payment-service-6a4c8e5b3-j7mnp   payment-service   42m          112Mi
payment-service-6a4c8e5b3-j7mnp   istio-proxy       8m           47Mi
Sidecar memory dropped from 127Mi to 47Mi.
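Headroom can be quantified rather than eyeballed. A tiny helper (name ours) converts the Mi figures from `kubectl top` into a utilization percentage against the container's limit; the 128Mi limit below is an assumption, not shown in the output above:

```shell
# Sketch: utilization of a container's memory limit, in whole percent.
# Inputs are plain Mi numbers, e.g. from `kubectl top pod --containers`.
mem_utilization_pct() {
  used_mi="$1"; limit_mi="$2"
  echo $(( used_mi * 100 / limit_mi ))
}
# Example: the sidecar at 47Mi against a hypothetical 128Mi limit:
#   mem_utilization_pct 47 128   # → 36
```

At 36% of a 128Mi limit the sidecar sits well under the 80% alert threshold proposed in the Prevention section, versus ~99% before the fix.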
Domain B (Observability) — Trace sampling confirmed at 1%¶
$ kubectl get pod payment-service-6a4c8e5b3-j7mnp -n prod \
-o jsonpath='{.metadata.annotations}' | jq '."proxy.istio.io/config"' -r | jq .
{
"concurrency": 0,
"tracing": {
"sampling": 1
}
}
Domain C (DevOps Tooling) — Helm values correct¶
Prevention¶
- Monitoring: Add a per-container memory utilization alert that fires when any container (including sidecars) exceeds 80% of its limit for 5 minutes.
- alert: ContainerMemoryNearLimit
  # Exclude pod-level cgroup series (empty container label) and containers
  # with no memory limit (limit reported as 0, which would divide to +Inf).
  expr: |
    container_memory_working_set_bytes{container!=""}
      / (container_spec_memory_limit_bytes{container!=""} > 0)
      > 0.8
  for: 5m
  labels:
    severity: warning
- Runbook: Document that Istio trace sampling changes affect sidecar memory consumption. Never set sampling: 100 in production without proportionally increasing the sidecar memory limit.
- Architecture: Add a CI/CD check that flags changes to trace sampling in production values files, and use OPA/Gatekeeper to enforce a maximum trace sampling rate per environment.
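The CI-side check can be sketched in shell (assumptions: production values files match values-prod*.yaml, the key is literally `sampling: <integer>`, and the function name is ours; a Gatekeeper policy would enforce the same bound at admission time):

```shell
# Sketch of a CI gate: flag any values file requesting a trace sampling
# rate above the allowed maximum for the environment.
check_sampling_rates() {
  max="$1"; shift
  status=0
  for f in "$@"; do
    rates=$(grep -Eo '^[[:space:]]*sampling:[[:space:]]*[0-9]+' "$f" |
      awk '{print $2}')
    for r in $rates; do
      if [ "$r" -gt "$max" ]; then
        echo "$f: sampling $r exceeds max $max" >&2
        status=1
      fi
    done
  done
  return $status
}
# CI usage: check_sampling_rates 1 devops/helm/values-prod*.yaml || exit 1
```

Running this in the pipeline would have caught the sampling: 100 change before it reached production.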