
Pattern: Rollout Hang (Zero Surge + Zero Unavailable)

ID: FP-032 · Family: Configuration Landmine · Frequency: Common · Blast Radius: Single Service · Detection Difficulty: Moderate

The Shape

A Kubernetes Deployment with maxSurge: 0 and maxUnavailable: 0 can never progress. To roll out an update, the controller must either bring up a new pod (which requires surge capacity) or take down an old pod (which requires allowing unavailability). With both set to zero, neither move is permitted. The rollout sits in a "Progressing" state indefinitely, reporting no error, just perpetual progress. Engineers assume the deploy is slow; it is actually frozen. (Note: current Kubernetes API validation rejects a literal 0/0 pair at admission time, so on recent clusters the same deadlock more often arrives via percentage values, e.g. a maxUnavailable percentage that rounds down to zero on a small replica count.)
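In a manifest, the broken configuration looks innocuous. A minimal sketch (the name myapp and the replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp              # illustrative name
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # no extra pod may be created...
      maxUnavailable: 0    # ...and no existing pod may be removed: deadlock
  # selector and pod template omitted
```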

How You'll See It

In Kubernetes

$ kubectl rollout status deployment/myapp
Waiting for deployment "myapp" rollout to finish: 0 out of 5 new replicas have been updated...
# (Never progresses)

$ kubectl describe deployment myapp | grep -A5 "Strategy"
Strategy:               RollingUpdate
RollingUpdateStrategy:  0 max unavailable, 0 max surge

No pods are terminating and no new pods are starting: kubectl get pods shows all the original pods running. The Deployment's revision number increments, but no pods actually change.

In CI/CD

Deployment pipeline reports "waiting for rollout to complete." CI timeout fires after 10 minutes. Engineers check manually: deployment is stuck. They assume the image is bad or the cluster is unhealthy when the issue is just the strategy configuration.
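To make the pipeline fail fast with a useful clue instead of hanging, kubectl rollout status accepts a --timeout flag. A sketch (deployment name and duration are illustrative):

```shell
# Fail the CI step after 2 minutes instead of hanging indefinitely
kubectl rollout status deployment/myapp --timeout=120s || {
  echo "Rollout did not progress; dumping strategy for diagnosis"
  kubectl get deployment myapp -o jsonpath='{.spec.strategy.rollingUpdate}'
  exit 1
}
```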

The Tell

kubectl rollout status hangs indefinitely. kubectl describe deployment shows 0 max unavailable, 0 max surge. No pods are being created or deleted. kubectl rollout history shows that the new revision exists, but no pods are running it.

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Image pull failure | Rollout strategy prevents progress | No ImagePullBackOff events; no new pods were ever created |
| Resource quota blocking | Strategy configuration | A quota error would appear in events; here there are no events at all |
| Cluster overloaded | Configuration landmine | Cluster is healthy; describe shows the 0/0 strategy |

The Fix (Generic)

  1. Immediate: kubectl patch deployment myapp -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'
  2. Short-term: Set maxSurge: 1 (at minimum) for rolling updates; or use Recreate strategy if you explicitly want downtime during updates.
  3. Long-term: Add admission webhook or CI validation that rejects maxSurge: 0 + maxUnavailable: 0 combinations; document the valid strategy configurations in team runbooks.
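The CI validation in step 3 can be sketched in a few lines of Python. This is a minimal illustration (the function name is invented, and a production check would also resolve percentage strings like "25%" against the replica count, since they round down):

```python
def strategy_is_deadlocked(deployment: dict) -> bool:
    """Return True if a Deployment's RollingUpdate strategy can never progress.

    Only handles integer values; percentage strings would need to be
    resolved against spec.replicas (Kubernetes rounds maxUnavailable down).
    """
    strategy = deployment.get("spec", {}).get("strategy", {})
    if strategy.get("type", "RollingUpdate") != "RollingUpdate":
        return False  # Recreate always makes progress (with downtime)
    rolling = strategy.get("rollingUpdate", {})
    # Kubernetes defaults both fields to 25% when unset
    surge = rolling.get("maxSurge", "25%")
    unavailable = rolling.get("maxUnavailable", "25%")
    return surge == 0 and unavailable == 0


broken = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 0, "maxUnavailable": 0},
        }
    }
}
print(strategy_is_deadlocked(broken))  # True
```

Run against every manifest in the repo (or as an admission webhook), this rejects the 0/0 combination before it ever reaches a cluster.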

Real-World Examples

  • Example 1: DevOps engineer set both to 0 believing it would "guarantee no downtime during the update." Deploy hung for 45 minutes until escalated to senior engineer who recognized the pattern immediately.
  • Example 2: Helm chart template had maxSurge: {{ .Values.maxSurge | default 0 }}. The default was never overridden. All deploys via the chart were permanently hung until a values file was provided.
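For the Helm case in Example 2, the durable fix is to make the chart's default safe rather than relying on every values file to override it. A sketch following the example's value names:

```yaml
# templates/deployment.yaml -- default to a strategy that can progress
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: {{ .Values.maxSurge | default 1 }}
    maxUnavailable: {{ .Values.maxUnavailable | default 0 }}
```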

War Story

We deployed a critical security patch at 3pm. CI/CD said "waiting for rollout." At 3:10pm: still waiting. At 3:20pm: same. We checked the cluster; everything looked fine. We checked pod events; no events at all. I ran kubectl describe deployment and saw 0 max unavailable, 0 max surge. It had been like that for 6 months; we'd just never deployed during business hours before (we always deployed at night, when brief downtime was acceptable, using the Recreate strategy). Someone had "fixed" the strategy to be "safer" and broken it instead. The security patch took 35 minutes to land instead of 2.

Cross-References

  • Topic Packs: k8s-ops
  • Footguns: k8s-ops/footguns.md — "maxUnavailable: 0 and maxSurge: 0"
  • Related Patterns: FP-011 (restart avalanche — the opposite mistake: maxSurge=100%), FP-037 (StatefulSet OrderedReady deadlock — same "configuration prevents progress" shape)