Pattern: Rollout Hang (Zero Surge + Zero Unavailable)
ID: FP-032 Family: Configuration Landmine Frequency: Common Blast Radius: Single Service Detection Difficulty: Moderate
The Shape
A Kubernetes Deployment with maxSurge: 0 and maxUnavailable: 0 can never progress.
To update, it must either bring up a new pod (requires surge capacity) or take down an
old pod (requires allowing unavailability). With both at zero, neither is permitted.
The rollout controller sits in the "Progressing" condition indefinitely, reporting no error, just perpetual waiting. Engineers assume the deploy is slow; it is actually frozen.
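A minimal manifest that reproduces the hang might look like this (name, labels, and image tag are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp            # placeholder
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # no extra pod may be created...
      maxUnavailable: 0  # ...and no existing pod may be taken down
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2   # the new revision that can never roll out
```

Note that recent API servers may reject the literal integer 0/0 pair at admission time; percentage values that round down to zero (for example, `maxUnavailable: 10%` with five replicas, since maxUnavailable rounds down) can still reach the same deadlocked state.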
How You'll See It
In Kubernetes
```
$ kubectl rollout status deployment/myapp
Waiting for deployment "myapp" rollout to finish: 0 out of 5 new replicas have been updated...
# (never progresses)

$ kubectl describe deployment myapp | grep -A5 "Strategy"
StrategyType:           RollingUpdate
RollingUpdateStrategy:  0 max unavailable, 0 max surge
```
`kubectl get pods` shows all the original pods still running. The Deployment's revision increments; no pods actually change.
In CI/CD
Deployment pipeline reports "waiting for rollout to complete." CI timeout fires after 10 minutes. Engineers check manually: deployment is stuck. They assume the image is bad or the cluster is unhealthy when the issue is just the strategy configuration.
The Tell
`kubectl rollout status` hangs indefinitely. `kubectl describe deployment` shows `0 max unavailable, 0 max surge`. No pods are being created or deleted. `kubectl rollout history` shows the new revision exists, but no pods have it.
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Image pull failure | Rollout strategy prevents progress | No ImagePullBackOff events; no new pods were ever created |
| Resource quota blocking | Strategy configuration | Quota error would appear in events; no events at all here |
| Cluster overloaded | Configuration landmine | Cluster healthy; describe shows the 0/0 strategy |
The Fix (Generic)
- Immediate: `kubectl patch deployment myapp -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'`
- Short-term: Set `maxSurge: 1` (at minimum) for rolling updates, or use the `Recreate` strategy if you explicitly want downtime during updates.
- Long-term: Add an admission webhook or CI validation that rejects the `maxSurge: 0` + `maxUnavailable: 0` combination; document the valid strategy configurations in team runbooks.
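The long-term CI check can be small. Below is a sketch of such a validator (hypothetical helper names; it assumes manifests are already parsed into dicts, and it resolves percentages the way Kubernetes does: maxSurge rounds up, maxUnavailable rounds down, both defaulting to 25%):

```python
import math

def _resolve(value, replicas, round_up):
    """Resolve an int-or-percent field to a pod count, Kubernetes-style."""
    if isinstance(value, str) and value.endswith("%"):
        fraction = int(value[:-1]) / 100 * replicas
        return math.ceil(fraction) if round_up else math.floor(fraction)
    return int(value)

def rollout_can_progress(deployment: dict) -> bool:
    """Reject the maxSurge=0 + maxUnavailable=0 landmine (FP-032)."""
    spec = deployment.get("spec", {})
    strategy = spec.get("strategy", {})
    if strategy.get("type", "RollingUpdate") != "RollingUpdate":
        return True  # Recreate always progresses (with downtime)
    ru = strategy.get("rollingUpdate", {})
    replicas = spec.get("replicas", 1)
    surge = _resolve(ru.get("maxSurge", "25%"), replicas, round_up=True)
    unavailable = _resolve(ru.get("maxUnavailable", "25%"), replicas, round_up=False)
    # At least one of the two must be nonzero for the rollout to move
    return surge > 0 or unavailable > 0
```

Resolving percentages before checking also catches the subtler variant where a percentage rounds down to zero at small replica counts.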
Real-World Examples
- Example 1: A DevOps engineer set both values to 0, believing it would "guarantee no downtime during the update." The deploy hung for 45 minutes until it was escalated to a senior engineer, who recognized the pattern immediately.
- Example 2: A Helm chart template had `maxSurge: {{ .Values.maxSurge | default 0 }}`. The default was never overridden, so every deploy via the chart hung until a values file supplied a nonzero value.
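One hedged fix for the Example 2 template (a sketch; the value name mirrors that chart's hypothetical `maxSurge` value) is to make the default progressable instead of zero:

```yaml
# Deployment template fragment: default to a surge of 1,
# so an unset value no longer freezes every rollout
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: {{ .Values.maxSurge | default 1 }}
    maxUnavailable: 0
```

Pairing a nonzero default with CI validation covers both the chart's own deploys and any fork that reintroduces the zero.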
War Story
We deployed a critical security patch at 3pm. CI/CD said "waiting for rollout." At 3:10pm: still waiting. At 3:20pm: same. We checked the cluster: everything looked fine. We checked pod events: none at all. I ran `kubectl describe deployment` and saw `0 max unavailable, 0 max surge`. It had been like that for 6 months; we'd just never done a deploy during business hours before (we always deployed at night, when we'd accept brief downtime and used Recreate). Someone had "fixed" the strategy to be "safer" and broken it instead. The security patch took 35 minutes instead of 2.
Cross-References
- Topic Packs: k8s-ops
- Footguns: k8s-ops/footguns.md — "maxUnavailable: 0 and maxSurge: 0"
- Related Patterns: FP-011 (restart avalanche — the opposite mistake: maxSurge=100%), FP-037 (StatefulSet OrderedReady deadlock — same "configuration prevents progress" shape)