Anti-Primer: Kubernetes Ops
Everything that can go wrong, will — and in this story, it does.
The Setup
A platform engineer is performing a Kubernetes cluster upgrade from 1.27 to 1.28 during a scheduled maintenance window. The cluster runs 200 production workloads. The upgrade plan was written two months ago and never reviewed.
The Timeline
Hour 0: Skipping Version Compatibility Check
The engineer upgrades the control plane without checking whether deployed manifests use deprecated API versions. The deadline was looming, and this seemed like the fastest path forward. As a result, 30 deployments fail to reconcile because they use API versions that have been removed.
Footgun #1: Skipping Version Compatibility Check. The control plane is upgraded without a check for deprecated API versions in deployed manifests, so 30 deployments fail to reconcile because they use removed API versions.
Nobody notices yet. The engineer moves on to the next task.
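The check skipped at hour 0 can be sketched in a few lines. This is a minimal illustration, not a complete tool: the `REMOVED_IN` table is a small hand-picked subset of API removals, the function name `find_removed_apis` and the sample manifests are hypothetical, and a real check should be driven by the official deprecated API migration guide for the target release.

```python
# Illustrative subset of removed APIs, keyed by (apiVersion, kind).
# Assumption: this table is incomplete; verify against the official
# Kubernetes deprecated API migration guide before relying on it.
REMOVED_IN = {
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
    ("storage.k8s.io/v1beta1", "CSIStorageCapacity"): (1, 27),
}


def find_removed_apis(manifests, target_version):
    """Return (name, apiVersion, kind) for each manifest whose API
    is no longer served at target_version, given as (major, minor)."""
    hits = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and target_version >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            hits.append((name, key[0], key[1]))
    return hits


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"apiVersion": "apps/v1", "kind": "Deployment",
     "metadata": {"name": "web"}},
    {"apiVersion": "autoscaling/v2beta2", "kind": "HorizontalPodAutoscaler",
     "metadata": {"name": "web-hpa"}},
]

blockers = find_removed_apis(manifests, (1, 28))
for name, api_version, kind in blockers:
    print(f"BLOCKER: {kind}/{name} uses removed API {api_version}")
```

Running a check like this against every manifest in version control before the maintenance window turns a silent post-upgrade failure into a pre-upgrade to-do list.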
Hour 1: No PodDisruptionBudgets
The engineer drains nodes without PodDisruptionBudgets (PDBs) in place for stateful services. Under time pressure, the team chose speed over caution. As a result, all replicas of the payment service are evicted simultaneously and transactions fail.
Footgun #2: No PodDisruptionBudgets. Nodes are drained without PDBs in place for stateful services, so all replicas of the payment service are evicted simultaneously and transactions fail.
The first mistake is still invisible, making the next shortcut feel justified.
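A PDB for the payment service might look like the sketch below, built as a Python dictionary and printed as JSON (which `kubectl apply -f` also accepts). The names `payment-pdb`, `payments`, and the `app: payment` label are assumptions for illustration, as is the choice of `minAvailable: 2`. The helper `allowed_disruptions` is a simplified model of how a budget limits voluntary evictions and only handles integer values, not percentages.

```python
import json


def pdb_manifest(name, namespace, match_labels, min_available):
    """Build a policy/v1 PodDisruptionBudget as a plain dictionary."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


def allowed_disruptions(healthy_pods, min_available):
    """Simplified model: how many pods may be evicted voluntarily
    right now without dropping below min_available healthy pods."""
    return max(0, healthy_pods - min_available)


# Hypothetical names and labels for the payment service.
pdb = pdb_manifest("payment-pdb", "payments", {"app": "payment"}, min_available=2)
print(json.dumps(pdb, indent=2))

# With 3 healthy replicas, a drain may evict one pod at a time.
print(allowed_disruptions(healthy_pods=3, min_available=2))
```

With a budget like this in place, a node drain has to wait for a replacement pod to become healthy before it can evict the next replica, instead of taking all of them down at once.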
Hour 2: Ignoring Resource Limits
The team skips setting resource requests and limits because 'the cluster has plenty of capacity'. Nobody pushed back because the shortcut looked harmless in the moment. As a result, a noisy neighbor drives its node out of memory (OOM) and 15 unrelated pods are evicted.
Footgun #3: Ignoring Resource Limits. Resource requests and limits are skipped because 'the cluster has plenty of capacity', so a noisy neighbor drives its node out of memory and 15 unrelated pods are evicted.
Pressure is mounting. The team is behind schedule and cutting more corners.
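One way to catch missing requests and limits before they reach the cluster is a small audit over pod specs. The sketch below is an assumption-laden illustration: the function name `missing_resources`, the sample pod spec, and the policy of requiring both `cpu` and `memory` under both `requests` and `limits` are choices made here, not rules from the primer. Adjust `REQUIRED` to match the team's own policy.

```python
# Assumed policy: every container declares cpu and memory
# under both requests and limits.
REQUIRED = ("cpu", "memory")


def missing_resources(pod_spec):
    """Map container name -> list of missing resource fields."""
    problems = {}
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        gaps = []
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            gaps += [f"{section}.{r}" for r in REQUIRED if r not in values]
        if gaps:
            problems[container["name"]] = gaps
    return problems


# Hypothetical pod spec: one well-behaved container, one noisy neighbor.
pod_spec = {
    "containers": [
        {"name": "api", "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"},
        }},
        {"name": "batch-job", "resources": {"requests": {"cpu": "100m"}}},
    ]
}

report = missing_resources(pod_spec)
for name, gaps in report.items():
    print(f"{name}: missing {', '.join(gaps)}")
```

An audit like this complements, rather than replaces, namespace-level enforcement through resource quotas and limit ranges.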
Hour 3: Rolling Update Without Readiness Probes
The team deploys a new version without readiness probes, so Kubernetes routes traffic to the new pods immediately. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, users hit 502 errors for 90 seconds while the new pods are still initializing.
Footgun #4: Rolling Update Without Readiness Probes. A new version is deployed without readiness probes and Kubernetes routes traffic immediately, so users hit 502 errors for 90 seconds while the new pods are still initializing.
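A pre-deploy check for missing readiness probes can be sketched as follows. Everything specific here is an assumption for illustration: the helper names, the `/healthz` path, port 8080, the timing values, and the sample Deployment. Appropriate thresholds depend on how long the application actually takes to initialize.

```python
import copy

# Hypothetical probe settings; path, port, and timings are assumptions.
EXAMPLE_PROBE = {
    "httpGet": {"path": "/healthz", "port": 8080},
    "initialDelaySeconds": 5,
    "periodSeconds": 5,
    "failureThreshold": 3,
}


def containers_without_readiness(deployment):
    """Names of containers in the pod template with no readinessProbe."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return [c["name"] for c in containers if "readinessProbe" not in c]


def add_readiness_probe(deployment, probe):
    """Return a copy of the Deployment where every container that
    lacks a readinessProbe receives the given one."""
    patched = copy.deepcopy(deployment)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        container.setdefault("readinessProbe", copy.deepcopy(probe))
    return patched


# Hypothetical Deployment, trimmed to the fields this check reads.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "example.invalid/web:2.0"},
    ]}}},
}

missing_before = containers_without_readiness(deployment)
missing_after = containers_without_readiness(
    add_readiness_probe(deployment, EXAMPLE_PROBE)
)
print(missing_before, missing_after)
```

A check like `containers_without_readiness` fits naturally into a CI step that fails the rollout when the list is not empty.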
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Skipping Version Compatibility Check | 30 deployments fail to reconcile because they use removed API versions | Primer: Run deprecation checks before upgrading |
| 2 | No PodDisruptionBudgets | All replicas of the payment service are evicted simultaneously; transactions fail | Primer: PDBs on every critical workload before maintenance |
| 3 | Ignoring Resource Limits | A noisy neighbor OOMs the node; 15 unrelated pods are evicted | Primer: Resource quotas and limit ranges enforced per namespace |
| 4 | Rolling Update Without Readiness Probes | Users hit 502 errors for 90 seconds while the new pods are still initializing | Primer: Readiness probes with appropriate thresholds |
Damage Report
- Downtime: 2-4 hours of pod-level or cluster-wide disruption
- Data loss: Risk of volume data loss if StatefulSets were affected
- Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
- Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
- Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification
What the Primer Teaches
- Footgun #1: The primer's section on version compatibility checks would have taught the engineer to run deprecation checks before upgrading.
- Footgun #2: The primer's section on PodDisruptionBudgets would have taught the engineer to put PDBs on every critical workload before maintenance.
- Footgun #3: The primer's section on resource limits would have taught the engineer to enforce resource quotas and limit ranges per namespace.
- Footgun #4: The primer's section on rolling updates would have taught the engineer to configure readiness probes with appropriate thresholds.
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice