Symptoms: Pod OOMKilled, Memory Leak Is in Sidecar, Fix Is Helm Values
Domains: kubernetes_ops | observability | devops_tooling
Level: L2
Estimated time: 30-45 min
Initial Alert
Prometheus Alertmanager fires at 14:32 UTC:
FIRING: KubePodCrashLooping
pod: payment-service-7f8b9c6d4-xk2nm
namespace: prod
container: payment-service
reason: OOMKilled
restarts: 7 in last 30 minutes
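A reasonable first step is to confirm the kill from the API server rather than from the alert alone. The commands below are a minimal sketch: the pod name comes from the alert above, and the jsonpath loop prints the last termination reason for every container in the pod, not only the one the alert names.

```
# Events and last-state details for the crash-looping pod named in the alert
kubectl -n prod describe pod payment-service-7f8b9c6d4-xk2nm

# Last termination reason per container (one line per container), so any
# sidecar terminations show up alongside the app container's
kubectl -n prod get pod payment-service-7f8b9c6d4-xk2nm \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
```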
The on-call dashboard shows:
payment-service pod restart rate: 14/hour (baseline: 0)
payment-service error rate: 23% (baseline: 0.1%)
payment-service p99 latency: 12.4s (baseline: 180ms)
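To verify the dashboard numbers straight from Prometheus, the queries below are one hedged way to do it. The Prometheus address is hypothetical, and the queries assume kube-state-metrics and cAdvisor metrics are scraped under their default names in this cluster.

```
# Hypothetical in-cluster Prometheus address; substitute the real one
PROM=http://prometheus.monitoring.svc:9090

# Container restarts for payment-service pods over the last 30 minutes
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{namespace="prod",pod=~"payment-service-.*"}[30m])'

# Working-set memory per container, which later helps show where the memory is really going
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="prod",pod=~"payment-service-.*",container!=""}'
```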
Observable Symptoms
- The payment-service container is being OOMKilled every 2-4 minutes. kubectl top pod shows the payment-service container using 245Mi of its 256Mi limit.
- The application logs show normal operation right up until the kill: no memory-related errors from the app itself.
- The Deployment has 3 replicas, and all 3 are experiencing the same OOMKill cycle.
- Recent deploy: a new version of payment-service was deployed 2 hours ago via Helm. The app team says "no memory-related changes in this release"; the sketch after this list shows one way to check what the release actually changed.
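Since the only recent change is a Helm release, it is worth seeing exactly what that release changed before trusting the release notes. A minimal sketch, assuming the release is named payment-service in the prod namespace (confirm with helm list -n prod):

```
# Revision history of the release; note which revision the 2-hour-old deploy created
helm -n prod history payment-service

# Diff the user-supplied values between the previous and current revisions
helm -n prod get values payment-service --revision <previous-revision> > /tmp/values-prev.yaml
helm -n prod get values payment-service > /tmp/values-current.yaml
diff -u /tmp/values-prev.yaml /tmp/values-current.yaml

# Diff the rendered manifests too, in case a subchart or sidecar template changed
helm -n prod get manifest payment-service --revision <previous-revision> > /tmp/manifest-prev.yaml
helm -n prod get manifest payment-service > /tmp/manifest-current.yaml
diff -u /tmp/manifest-prev.yaml /tmp/manifest-current.yaml
```

If the values or manifest diff touches a sidecar's resources, that thread is worth pulling before any application profiling.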
The Misleading Signal
The OOMKill events clearly point to the payment-service container. The Kubernetes events say OOMKilled with container name payment-service. The natural response is to investigate the application for a memory leak — check heap dumps, review recent code changes, look at garbage collection logs. The app team's release notes show only a minor API endpoint addition, nothing memory-intensive. The engineer is now deep in application profiling.
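Before committing to heap dumps and GC logs, it is cheap to check which container in the pod is actually holding the memory and what limit each container carries. A sketch, assuming metrics-server is installed and the pods carry an app=payment-service label (both assumptions about this cluster):

```
# Per-container memory for the payment-service pods (not just the pod total);
# requires metrics-server, and the label selector is an assumption
kubectl -n prod top pod --containers -l app=payment-service

# Every container in the pod spec with its memory limit, so a sidecar with a
# changed or missing limit stands out next to the app container
kubectl -n prod get pod payment-service-7f8b9c6d4-xk2nm \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
```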