Portal | Level: L2: Operations | Topics: OOMKilled | Domain: Kubernetes

Scenario: Pods OOMKilled Under Load

The Prompt

"Our application pods keep getting OOMKilled during peak traffic hours. They restart but it causes brief outages. The application worked fine until traffic increased last week. How do you approach this?"

Initial Report

PagerDuty alert: "Pod grokdevops-7f8b9c6d4-x2k9p OOMKilled (3 times in the last hour). Service is degrading during peak hours. Customers report intermittent 503 errors."

Constraints

  • Time pressure: You have 15 minutes before the next escalation. The pod is restarting every 10-15 minutes during peak load.
  • Limited access: You can view metrics and logs but modifying resource limits requires a Helm values change and a new deploy. No access to application source code.

Observable Evidence

  • Dashboard: Memory usage graph shows a sawtooth pattern — climbing to the limit then dropping to zero on restart. Container restarts counter is incrementing.
  • Pod describe: Last State: Terminated - Reason: OOMKilled - Exit Code: 137. Current memory limit is 256Mi.
  • Metrics: kubectl top pods shows the pod at 240Mi/256Mi (94% of the limit). The Prometheus metric container_memory_working_set_bytes confirms a steady climb correlated with request rate.
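The traffic correlation in the last bullet can be checked directly in Prometheus. A sketch of the queries, assuming the standard cAdvisor and kube-state-metrics series are scraped (the namespace label comes from this scenario):

```promql
# Working-set memory per container in the grokdevops namespace
container_memory_working_set_bytes{namespace="grokdevops", container!=""}

# Memory as a fraction of the configured limit (approaches 1.0 before an OOMKill)
container_memory_working_set_bytes{namespace="grokdevops", container!=""}
  / on(namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory", namespace="grokdevops"}
```

Plotting the second expression next to the request rate makes the "climbs with traffic" vs. "climbs regardless of traffic" distinction visible at a glance.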

Expected Investigation Path

# 1. Confirm OOMKill
kubectl get pods -n grokdevops
kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops | grep -A5 "Last State"

# 2. Check current limits
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.template.spec.containers[0].resources}' | python3 -m json.tool

# 3. Check actual usage at baseline
kubectl top pods -n grokdevops

# 4. Check if this correlates with traffic
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Query: container_memory_working_set_bytes{namespace="grokdevops"}
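If the climb tracks traffic rather than a leak, the short-term mitigation from the constraints section is a Helm values change and redeploy. A sketch, assuming the chart exposes a conventional resources: block (the exact key path and numbers depend on this chart and on measured usage):

```yaml
# values.yaml fragment (hypothetical key path; adjust to the chart's schema)
resources:
  requests:
    memory: 384Mi
  limits:
    memory: 512Mi   # raised from 256Mi to give peak-traffic headroom
```

Apply with helm upgrade using the updated values, then watch the restart counter and memory graph through the next peak to confirm the sawtooth is gone.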

Strong Answer

"OOMKilled means the container exceeded its memory limit and the kernel's OOM killer terminated it (exit code 137). Since the problem started with increased traffic, this is likely not a memory leak but the app legitimately needing more memory under load: more connections, larger request buffers, more in-flight objects. My approach: first, compare the current limits against actual usage via kubectl top and Prometheus metrics. If memory usage genuinely grows with traffic, I'd raise the memory limit and add an HPA on memory utilization so we scale out before hitting the limit. Short-term fix: bump the limit in the Helm values. Long-term: profile the app to understand per-request memory cost, consider connection pooling, and alert on memory usage at 80% of the limit so we're warned before OOMKills."
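The scale-out idea in the strong answer can be sketched as an autoscaling/v2 HPA targeting memory utilization. Replica counts and the target are illustrative; note that utilization here is measured against the pod's memory request, not its limit, and that memory-based scaling only helps if per-pod memory actually falls when load is spread across more replicas:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: grokdevops
  namespace: grokdevops
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grokdevops
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70   # scale out well before pods approach their limit
```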

Common Traps

  • Just increasing the limit without understanding why — could be a leak
  • Not correlating with traffic — if it happens at low traffic too, it's a leak
  • Forgetting HPA can help — more replicas = less memory per pod
  • Not mentioning monitoring/alerting — a senior should proactively set up alerts

Related Resources

  • Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
  • Runbook: training/library/runbooks/oomkilled.md
  • Drills: training/library/drills/kubectl_drills.md — Drill 11 (resource limits), Drill 12 (find OOMKilled)
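The 80%-of-limit alert from the strong answer can be expressed as a PrometheusRule. A sketch assuming kube-prometheus-stack conventions; the resource name, group name, and alert name are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: grokdevops-memory          # hypothetical name
  namespace: monitoring
spec:
  groups:
  - name: grokdevops.memory
    rules:
    - alert: PodMemoryNearLimit
      expr: |
        container_memory_working_set_bytes{namespace="grokdevops", container!=""}
          / on(namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory", namespace="grokdevops"}
          > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.pod }} is above 80% of its memory limit"
```

With this in place the team gets paged on sustained pressure instead of finding out from the OOMKill itself.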
