
Kubernetes Ops Footguns

> [!WARNING]
> These will bite you in production. Every item here has caused a real outage.


1. Setting maxUnavailable: 0 and maxSurge: 0

Your deployment will never update. Kubernetes can't remove old pods (maxUnavailable: 0) and can't create new ones (maxSurge: 0). The rollout hangs forever with no error message — just Progressing status that never finishes.

Fix: Always allow at least one of these to be non-zero. A sane default is maxUnavailable: 1, maxSurge: 1.
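In Deployment terms, that fix is a few lines of rollout strategy (a sketch; tune the numbers to your replica count):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # may take down one old pod at a time
      maxSurge: 1         # may run one extra pod above replicas during rollout
```

At least one of the two must be non-zero or the rollout deadlocks as described.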


2. kubectl apply on a manifest you didn't read

Someone emails you a YAML. You kubectl apply -f it. It has namespace: default and replicas: 1 and just overwrote your 10-replica production deployment. Or worse, it has a ClusterRole granting * on everything.

Fix: Always kubectl diff -f manifest.yaml first. Or kubectl apply --dry-run=server -f manifest.yaml.


3. Forgetting that kubectl delete pod doesn't fix anything

Deleting a pod managed by a Deployment just creates another one with the same config. If the config is broken, the new pod crashes too. You've now added a restart to the outage for no reason.

Fix: Fix the Deployment/ConfigMap/Secret, not the pod.


4. PDB blocking node drains indefinitely

You set minAvailable: 100% or maxUnavailable: 0 on a PDB. Now kubectl drain hangs forever because it can't evict any pods. Your node upgrade stalls at 2am and you can't figure out why.

Fix: Use maxUnavailable: 1 for most workloads. Check PDBs before starting maintenance: kubectl get pdb -A.
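A PDB following that advice might look like this (the name and selector label are hypothetical; match your workload's labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1        # always leaves room for eviction during a drain
  selector:
    matchLabels:
      app: web             # hypothetical label
```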


5. Missing resource requests

Pods that set no requests or limits at all are QoS class BestEffort — they get evicted first when a node comes under pressure. And without requests, the scheduler packs nodes blindly, so one noisy neighbor starves your pods because Kubernetes has no idea how much they actually need.

Fix: Always set both requests and limits. Requests determine scheduling; limits prevent runaway containers.


6. Setting memory limit == request too tight

You profile your app at 256Mi and set both request and limit to 256Mi. App hits a traffic spike, allocates 258Mi, gets OOMKilled. You've turned a traffic spike into an outage.

Fix: Set limits 50-100% above requests for memory. CPU limits are more forgiving (throttling vs killing).
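Putting items 5 and 6 together, a container's resources block might look like this (the numbers are illustrative, not recommendations):

```yaml
resources:
  requests:
    cpu: 250m          # what the scheduler reserves; also required for CPU-based HPA
    memory: 256Mi      # measured baseline
  limits:
    memory: 512Mi      # ~100% headroom before OOMKill
    cpu: 500m          # optional; CPU overruns throttle rather than kill
```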


7. latest tag in production

You deploy myapp:latest. It works. A month later someone pushes a broken image with the same latest tag. A node restarts, pulls the new latest, and your prod pod is now running untested code.

Fix: Always use immutable tags (v2.1.3) or SHA digests (@sha256:abc123).
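In the pod spec that means pinning the image reference (registry and digest below are placeholders):

```yaml
containers:
  - name: myapp
    image: registry.example.com/myapp:v2.1.3
    # or, immune even to tag repointing — the digest form:
    # image: registry.example.com/myapp@sha256:4c7d...   # placeholder digest
```

A digest reference survives even a force-pushed tag, since it names the exact image content.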


8. Liveness probe on the same endpoint as readiness

Both probes hit /health. App is slow under load. Readiness probe fails (good — removes from service). Liveness probe also fails — kills the container. Now you're crashlooping instead of just shedding load.

Fix: Use separate endpoints. Readiness checks dependencies (DB, cache). Liveness just checks "is the process alive." Or use a startup probe for slow starts.
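A minimal split, assuming the app exposes the two endpoints on port 8080 (both paths are conventions, not requirements):

```yaml
livenessProbe:
  httpGet:
    path: /healthz   # trivial "process alive" check, no dependencies
    port: 8080
readinessProbe:
  httpGet:
    path: /ready     # checks DB/cache; failure sheds traffic, not the pod
    port: 8080
```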


9. terminationGracePeriodSeconds: 30 with a slow shutdown

Your app takes 45 seconds to drain connections. Kubernetes sends SIGTERM, waits 30 seconds, then sends SIGKILL. In-flight requests get dropped every deploy.

Fix: Set terminationGracePeriodSeconds longer than your drain time. And actually handle SIGTERM in your app.
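For the 45-second drain above, a sketch (the preStop sleep is a common optional addition that lets endpoint removal propagate before shutdown begins):

```yaml
spec:
  terminationGracePeriodSeconds: 60   # > the 45s drain time, with margin
  containers:
    - name: web
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # delay SIGTERM until load balancers stop sending traffic
```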


10. Hardcoding namespace in manifests

You put namespace: production in your YAML. Someone on the staging context runs kubectl apply -f deploy.yaml, assuming it will land in staging. It goes to production, because a namespace in the manifest overrides the current context. (An explicit -n staging at least fails fast with a namespace-mismatch error instead of applying silently.)

Fix: Omit namespace from manifests. Let the context or -n flag decide. Or use Kustomize to set it.
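With Kustomize, the namespace lives in one place per environment instead of inside each manifest (a minimal kustomization.yaml sketch):

```yaml
# staging/kustomization.yaml
namespace: staging
resources:
  - ../base/deploy.yaml   # hypothetical path to a namespace-free manifest
```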


11. Running kubectl exec as root in production

You exec into a production pod and run apt-get install tcpdump. It works because the container runs as root. You just installed untracked software in a production container that'll disappear on next restart — and you've proven your security posture has holes.

Fix: Use kubectl debug with an ephemeral container. Run containers as non-root. Use readOnlyRootFilesystem: true.
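A hardened container securityContext that would have blocked the apt-get in the first place (the UID is a hypothetical example):

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 10001               # hypothetical non-root UID
  readOnlyRootFilesystem: true   # nothing can be installed or written to the image
  allowPrivilegeEscalation: false
```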


12. Port-forward as a "fix"

Your ingress is broken so you kubectl port-forward to production. You close your laptop, the port-forward dies, and the "fix" disappears. Your coworker doesn't know the service is being held together by your terminal session.

Fix: Fix the actual routing. Port-forward is for debugging, not production traffic.


13. Ignoring Evicted pods

Evicted pods pile up. Hundreds of them. They don't consume node resources, but they clutter kubectl get pods output, hide real problems, and the stale objects eventually slow down the API server on large clusters.

Fix: Clean them up periodically: kubectl delete pods --field-selector status.phase=Failed -A. Investigate why evictions happen.


HPA Footguns

14. No resource requests on HPA target pods

You create an HPA targeting CPU utilization, but your pods have no resources.requests.cpu. HPA reports <unknown>/70% and never scales. HPA calculates utilization as currentUsage / request — no request means undefined.

Fix: Always set resources.requests.cpu on pods targeted by HPA.
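A minimal autoscaling/v2 HPA and the dependency it has on the pod spec (names and numbers are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # meaningless unless the target pods set requests.cpu
```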


15. Using memory as the primary HPA scaling metric

You scale on memory utilization. The app allocates memory on startup and never releases it (JVM heap, Python objects). HPA scales up but never scales down.

Fix: Use CPU as the primary metric. Use memory only as a safety backstop.


16. No stabilization window on HPA scale-down

You remove the default 5-minute scale-down stabilization. Traffic is bursty. The HPA flaps replicas up and down every minute.

Fix: Keep the default 300-second scale-down stabilization. For bursty workloads, increase to 600-900 seconds.
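The window is set under spec.behavior in an autoscaling/v2 HPA (a sketch for a bursty workload):

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # require 10 min of low load before scaling down
```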


17. HPA and manual scaling fighting

You kubectl scale deployment web --replicas=10 while an HPA is active. The HPA overwrites your change on its next 15-second sync cycle.

Fix: Delete the HPA first or set minReplicas = maxReplicas to freeze the count.


18. HPA and VPA targeting the same metric

Both HPA (scaling on CPU) and VPA (adjusting CPU requests) are active. They fight in a loop — VPA changes requests, HPA recomputes utilization, scales differently.

Fix: If HPA scales on CPU, do not let VPA adjust CPU requests. Use VPA for memory only, or in recommendation-only mode.
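A VPA configured both ways at once — recommendation-only, and scoped to memory so it can never touch the HPA's CPU signal (assumes the VPA CRDs are installed; names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"                     # recommendation-only; never rewrites requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # leave CPU requests to the HPA
```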


Probe Footguns

19. Checking dependency health in liveness probes

Your liveness endpoint calls the database. Database goes down for 30 seconds. Every pod fails liveness, restarts simultaneously, thundering-herds the database. Cascading restart storm.

Fix: Liveness should only check "is this process alive" — a trivial 200 with no dependencies. Put dependency checks in readiness only.


20. Same endpoint for liveness and readiness

Both probes point at /health which checks the database. Dependency failures trigger restarts instead of just removing traffic.

Fix: Use separate endpoints: /healthz for liveness (no dependencies), /ready for readiness (checks dependencies).


21. No startup probe on slow-starting applications

Your Java app takes 90 seconds to initialize. The liveness probe's total budget (failureThreshold × periodSeconds) is 30 seconds. The app is killed mid-startup, restarts, and is killed again. Permanent CrashLoopBackOff.

Fix: Add a startup probe with a generous budget: failureThreshold: 60, periodSeconds: 5 gives 300 seconds.
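A sketch of that startup probe (path and port are assumptions about your app):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 60   # up to 60 failures allowed...
  periodSeconds: 5       # ...at 5s intervals = 300s to finish starting
# liveness and readiness probes only begin once the startup probe succeeds
```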


22. Timeout too short for GC pauses

JVM app has timeoutSeconds: 1 on liveness. A full GC pause takes 3 seconds. Three consecutive timeouts trigger a restart.

Fix: Set timeoutSeconds to at least 5 seconds for JVM applications.
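Applied to a liveness probe (path and port are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 5      # survives multi-second GC pauses
  failureThreshold: 3    # the default: three consecutive failures restart the pod
```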


23. Removing probes to "fix" restarts

Pods keep restarting due to liveness failures. You remove the probe. Restarts stop. The underlying issue (memory leak, deadlock) is now undetected.

Fix: Fix the root cause of probe failures. Only temporarily remove a probe during active debugging.