Portal | Level: L2: Operations | Topics: OOMKilled | Domain: Kubernetes

Scenario: Pods OOMKilled Under Load

The Prompt

"Our application pods keep getting OOMKilled during peak traffic hours. They restart but it causes brief outages. The application worked fine until traffic increased last week. How do you approach this?"

Initial Report

PagerDuty alert: "Pod grokdevops-7f8b9c6d4-x2k9p OOMKilled (3 times in the last hour). Service is degrading during peak hours. Customers report intermittent 503 errors."

Constraints

  • Time pressure: You have 15 minutes before the next escalation. The pod is restarting every 10-15 minutes during peak load.
  • Limited access: You can view metrics and logs but modifying resource limits requires a Helm values change and a new deploy. No access to application source code.

Observable Evidence

  • Dashboard: Memory usage graph shows a sawtooth pattern — climbing to the limit then dropping to zero on restart. Container restarts counter is incrementing.
  • Pod describe: Last State: Terminated - Reason: OOMKilled - Exit Code: 137. Current memory limit is 256Mi.
  • Metrics: kubectl top pods shows the pod at 240Mi/256Mi (94% of the limit). The Prometheus metric container_memory_working_set_bytes confirms a steady climb correlated with request rate.
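The traffic correlation in the last bullet can be checked directly in Prometheus. A sketch of the queries, assuming the standard cAdvisor and kube-state-metrics series are scraped (the namespace label comes from this scenario):

```promql
# Working-set memory per container in the grokdevops namespace
container_memory_working_set_bytes{namespace="grokdevops", container!=""}

# Memory as a fraction of the configured limit (approaches 1.0 before an OOMKill)
container_memory_working_set_bytes{namespace="grokdevops", container!=""}
  / on(namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory", namespace="grokdevops"}
```

Plotting the second expression next to the request rate makes the "climbs with traffic" vs. "climbs regardless of traffic" distinction visible at a glance.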

Expected Investigation Path

# 1. Confirm OOMKill
kubectl get pods -n grokdevops
kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops | grep -A5 "Last State"

# 2. Check current limits
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.template.spec.containers[0].resources}' | python3 -m json.tool

# 3. Check actual usage at baseline
kubectl top pods -n grokdevops

# 4. Check if this correlates with traffic
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Query: container_memory_working_set_bytes{namespace="grokdevops"}
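If the climb tracks traffic rather than a leak, the short-term mitigation from the constraints section is a Helm values change and redeploy. A sketch, assuming the chart exposes a conventional resources: block (the exact key path and numbers depend on this chart and on measured usage):

```yaml
# values.yaml fragment (hypothetical key path; adjust to the chart's schema)
resources:
  requests:
    memory: 384Mi
  limits:
    memory: 512Mi   # raised from 256Mi to give peak-traffic headroom
```

Apply with helm upgrade using the updated values, then watch the restart counter and memory graph through the next peak to confirm the sawtooth is gone.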

Strong Answer

"OOMKilled means the container exceeded its memory limit and the kernel's OOM killer terminated it (exit code 137). Since the problem started with increased traffic, this is likely not a memory leak but the app legitimately needing more memory under load: more connections, larger request buffers, more in-flight objects. My approach: first, compare the current limits against actual usage via kubectl top and Prometheus metrics. If memory usage genuinely grows with traffic, I'd raise the memory limit and add an HPA on memory utilization so we scale out before hitting the limit. Short-term fix: bump the limit in the Helm values. Long-term: profile the app to understand per-request memory cost, consider connection pooling, and alert on memory usage at 80% of the limit so we're warned before OOMKills."
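The scale-out idea in the strong answer can be sketched as an autoscaling/v2 HPA targeting memory utilization. Replica counts and the target are illustrative; note that utilization here is measured against the pod's memory request, not its limit, and that memory-based scaling only helps if per-pod memory actually falls when load is spread across more replicas:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: grokdevops
  namespace: grokdevops
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grokdevops
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70   # scale out well before pods approach their limit
```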

Common Traps

  • Just increasing the limit without understanding why — could be a leak
  • Not correlating with traffic — if it happens at low traffic too, it's a leak
  • Forgetting HPA can help — more replicas = less memory per pod
  • Not mentioning monitoring/alerting — a senior should proactively set up alerts

Related Resources

  • Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
  • Runbook: training/library/runbooks/oomkilled.md
  • Drills: training/library/drills/kubectl_drills.md — Drill 11 (resource limits), Drill 12 (find OOMKilled)
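The 80%-of-limit alert from the strong answer can be expressed as a PrometheusRule. A sketch assuming kube-prometheus-stack conventions; the resource name, group name, and alert name are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: grokdevops-memory          # hypothetical name
  namespace: monitoring
spec:
  groups:
  - name: grokdevops.memory
    rules:
    - alert: PodMemoryNearLimit
      expr: |
        container_memory_working_set_bytes{namespace="grokdevops", container!=""}
          / on(namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory", namespace="grokdevops"}
          > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.pod }} is above 80% of its memory limit"
```

With this in place the team gets paged on sustained pressure instead of finding out from the OOMKill itself.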
