# The Deploy That Ate Prod
Category: The Incident · Domains: kubernetes, ci-cd · Read time: ~5 min
## Setting the Scene
Mid-size e-commerce company, 200 engineers, running about 80 microservices on EKS. I was one of three platform engineers responsible for the deployment pipeline. We'd just migrated from Helm 2 to Helm 3 and were feeling pretty good about ourselves. It was a Thursday afternoon — deploy freeze started Friday at 5 PM, so teams were rushing to get their releases in. Our largest service, the order-processing monolith (yeah, it was a "microservice" in name only), was getting a routine deploy.
## What Happened
2:15 PM — A developer merges a PR that updates the order-processor's Helm values. The change looks tiny: bumping a sidecar image version. The PR got a quick LGTM. Nobody noticed that the `values.yaml` diff also included a copy-paste of resource limits from the sidecar section into the main container section. The main container went from `memory: 4Gi` to `memory: 256Mi`.
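For context, the mistake looked something like this (an illustrative reconstruction, not the actual file; the key names and sidecar tag are assumptions, the two memory values are the real ones):

```yaml
# values.yaml (illustrative reconstruction; key names are assumptions)
sidecar:
  image:
    tag: v1.28.1        # the intended change: a version bump
  resources:
    limits:
      memory: 256Mi
      cpu: 250m

orderProcessor:
  resources:
    limits:
      memory: 256Mi     # pasted from the sidecar block; this was 4Gi
      cpu: "2"
```

In a unified diff, the limit change is one line buried under the image bump, which is exactly why the review missed it.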
2:18 PM — ArgoCD picks up the change and starts rolling out. First pod comes up, passes its readiness check (which just checks if the HTTP port is listening), and the rollout continues.
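A port-only readiness probe like ours looks roughly like this (a sketch; the port and timings are assumptions). It tells Kubernetes only that something is listening on the socket, not that the application can actually serve traffic:

```yaml
# Readiness probe that only checks the listening port (sketch)
readinessProbe:
  tcpSocket:
    port: 8080          # passes as soon as the socket is open
  initialDelaySeconds: 5
  periodSeconds: 10
```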
2:19 PM — First OOMKill. Then another. Then twelve more. The order-processor needs about 2Gi at idle. At 256Mi, it starts, passes the readiness probe during its 5-second grace period, then gets OOMKilled as soon as it loads its caches.
2:21 PM — Kubernetes does what Kubernetes does. The kubelet keeps restarting the crashing containers with backoff, while the rollout keeps terminating old pods it believes have been replaced. Within three minutes, we have zero healthy pods for order-processing. PagerDuty goes berserk. Slack lights up.
2:24 PM — I'm staring at `kubectl get pods` watching a wall of `OOMKilled` and `CrashLoopBackOff` statuses. My first instinct is wrong — I think it's a memory leak in the new sidecar version. I spend four minutes looking at sidecar logs.
2:28 PM — My colleague runs `kubectl describe pod` on one of the crashing pods and spots it: `Limits: memory: 256Mi`. She says "that can't be right" and pulls up the Helm values diff. There it is. One line. Copy-paste from the sidecar block.
2:31 PM — We revert the Helm values in git, ArgoCD syncs, pods start coming up healthy. Total outage: 13 minutes. Estimated lost orders: around 1,400 based on our typical throughput.
## The Moment of Truth
It wasn't a complicated bug. It was a copy-paste error in a YAML file that a code review missed because the diff looked small and harmless. The real failure was that nothing in our pipeline compared resource limits against the running configuration and flagged dramatic changes.
## The Aftermath
We added a conftest policy that flags any resource limit change greater than 50% as requiring explicit approval from the platform team. We also fixed our readiness probes to actually exercise application functionality (hitting a `/ready` endpoint that runs a small query) instead of just checking whether the port was open. The rollout strategy was changed to require at least 60 seconds of healthy status before proceeding. The 1,400 lost orders got us executive attention, which ironically made it easier to get budget for deployment safety tooling.
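In Deployment terms, the probe and rollout fixes might look roughly like this (a sketch; the port, container name, and timings other than the 60 seconds are assumptions):

```yaml
spec:
  minReadySeconds: 60          # a new pod must stay Ready for 60s before it counts
  template:
    spec:
      containers:
        - name: order-processor
          readinessProbe:
            httpGet:
              path: /ready     # exercises real functionality, not just the socket
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
            failureThreshold: 3
```

With `minReadySeconds` set, a pod that passes its probe briefly and then gets OOMKilled never counts as available, so the rollout stalls instead of proceeding.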
## The Lessons
- Diff configs before deploy: Automated policy checks on resource changes would have caught this before a single pod was affected. Tools like conftest or OPA Gatekeeper exist for exactly this reason.
- Memory limits are not optional, but they must be correct: A wrong limit is worse than no limit — at least without limits the pod runs and you can fix it without an outage.
- Canary deploys save lives: If we'd rolled out to 1 pod and waited 5 minutes before continuing, we'd have caught the OOMKill on a single pod instead of losing the entire service.
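The 1-pod-then-wait pattern can be sketched with something like Argo Rollouts (an assumption on our part; any canary mechanism would do, and the weights and durations here are illustrative):

```yaml
# Canary strategy sketch (Argo Rollouts); step values are assumptions
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-processor
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5               # roughly one pod's worth of traffic
        - pause: {duration: 5m}      # long enough to catch a post-cache-load OOMKill
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
```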
## What I'd Do Differently
I'd implement a deployment pipeline that does a dry-run diff of the rendered manifests against what's currently running in the cluster, surfacing any resource changes prominently. I'd also set `maxUnavailable: 0` on critical services so the rollout can't terminate old pods until new ones are genuinely healthy for a meaningful duration.
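The `maxUnavailable: 0` idea is a small Deployment strategy change (a sketch; `maxSurge` value is an assumption):

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never terminate an old pod until its replacement is Ready
    maxSurge: 1         # bring up one extra pod at a time instead
```

Combined with a `minReadySeconds` window, "Ready" stops meaning "the port was open for five seconds" and starts meaning "survived long enough to load its caches."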
## The Quote
> "YAML doesn't care about your intentions. It just does exactly what you told it to do."
## Cross-References
- Topic Packs: Kubernetes Ops, CI/CD Pipelines & Patterns, OOMKilled
- Case Studies: Kubernetes Ops