# Pattern: Restart Avalanche
ID: FP-011 | Family: Thundering Herd | Frequency: Common | Blast Radius: Multi-Service | Detection Difficulty: Moderate
## The Shape
Multiple pods (or services) restart at the same time: a rolling deploy with high `maxSurge`, a node restart bringing every pod back at once, or a shared dependency failure driving them all into CrashLoopBackOff. During warmup, each restarting pod opens its connections (to databases, caches, downstream services) in the same instant. The combined connection burst exhausts connection pools and can crash the very dependencies the pods are trying to reach.
## How You'll See It
### In Kubernetes
`kubectl rollout restart deployment/myapp` with `maxSurge=100%` brings up all 20 replacement pods at once. Each new pod opens 5 DB connections on startup, so 100 new connections land simultaneously and Postgres hits `max_connections=100` immediately. The new pods cannot pass their readiness probe (it requires a successful DB query); the ones that crash on the failed connection are restarted, and the deployment falls into a CrashLoopBackOff cascade.
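A minimal sketch of a deployment shaped like this, assuming a 20-replica service named `myapp` and a readiness endpoint that performs a DB query (all names, ports, and paths here are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                # assumed name
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "100%"       # all 20 replacement pods are created at once
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:1.2.3          # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz          # assumed endpoint that runs a DB query
              port: 8080
```

Every replacement pod races to build its connection pool before its readiness check can pass, so the whole burst lands on the database in a single window.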
### In Linux/Infrastructure
A host reboots. All 15 services enabled at boot come up within the same systemd activation window, and `Restart=on-failure` keeps relaunching the ones that fail. Their combined startup connection load exceeds what MySQL can handle; MySQL itself cannot start cleanly under that load.
### In CI/CD
A deploy pipeline restarts all workers simultaneously. Workers connect to a job queue (e.g., RabbitMQ) on startup. 40 simultaneous connections trigger RabbitMQ's per-vhost connection limit; workers fail to start.
## The Tell
All failing pods have termination timestamps within a few seconds of each other in `LastState.Terminated`. The failure mode is "can't connect to dependency," not "application error." The database connection count spike matches the pod restart timestamp.
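One quick way to check the clustering, assuming the pods carry an `app=myapp` label (adjust the selector to your own deployment):

```bash
# Print each pod's name and the time its last container terminated.
# In a restart avalanche these timestamps cluster within a few seconds.
kubectl get pods -l app=myapp \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}{end}'
```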
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Dependency is down | Dependency overloaded by restart | Dependency was fine before restart; connection count spiked exactly at restart |
| Application bug | Thundering connection burst | All pods fail at startup, not during request processing |
| Cascading failure from dependency | Self-inflicted restart avalanche | Dependency was healthy before the rolling deploy began |
## The Fix (Generic)
- Immediate: Reduce `maxSurge` to 1 or 2, or pause the rollout and restart pods one at a time.
- Short-term: Add `minReadySeconds` to the deployment to stagger new pod readiness; set `maxSurge=25%` for large deployments.
- Long-term: Use connection poolers; add startup probes that wait for a database ping before proceeding; implement PodDisruptionBudgets to limit simultaneous pod restarts. (See the manifest sketch below.)
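A sketch of how the short- and long-term knobs fit together, again assuming the hypothetical `myapp` deployment; `/healthz/db`, the ports, and the specific numbers are placeholders to tune against your pool sizes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 20
  minReadySeconds: 15        # each new pod must stay Ready 15s before the rollout continues
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most one extra pod opening connections at a time
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:1.2.3
          startupProbe:                 # hold the other probes until the DB actually answers
            httpGet:
              path: /healthz/db         # assumed endpoint that performs a database ping
              port: 8080
            periodSeconds: 2
            failureThreshold: 30        # allows up to ~60s of warmup
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: 1          # drains and evictions can take down at most one pod at a time
  selector:
    matchLabels:
      app: myapp
```

Note that the PodDisruptionBudget only governs voluntary disruptions (node drains, evictions), so it complements rather than replaces the rollout settings.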
## Real-World Examples
- Example 1: `kubectl rollout restart` on a 20-pod service. All 20 pods started simultaneously, each attempting to create a connection pool of 10 (total: 200) against Postgres `max_connections=150`. Fifty connection attempts were refused, pods couldn't finish starting, and the CrashLoopBackOff loop began.
- Example 2: A Kubernetes node came back after maintenance and 35 pods started on it simultaneously. Redis received 35 × 3 = 105 connections in 200 ms; Redis's `maxclients=100` triggered a connection refusal cascade.
## War Story
We did a rolling restart before a Black Friday event to clear a memory leak. We set `maxSurge=100%` to make it fast. 50 new pods all came up at once, each opening 4 Postgres connections: 200 connections in 2 seconds against a 100-connection max. The pods couldn't complete startup; they crashlooped. The "fast" restart turned into a 20-minute incident. We had to manually set `maxSurge=1` via `kubectl patch`, wait for each pod, then proceed. The careful rolling restart with `maxSurge=1` took 8 minutes total, slower than we had hoped but far shorter than the 20-minute incident the shortcut caused.
## Cross-References
- Topic Packs: k8s-ops, distributed-systems
- Footguns: k8s-ops/footguns.md — "Simultaneous pod restart (maxSurge=100%)"
- Related Patterns: FP-010 (cache stampede — same thundering herd, different trigger), FP-002 (connection pool exhaustion — the resource that gets depleted)