Pattern: Restart Avalanche

ID: FP-011 | Family: Thundering Herd | Frequency: Common | Blast Radius: Multi-Service | Detection Difficulty: Moderate

The Shape

Multiple pods (or services) restart simultaneously — either due to a rolling deploy with high maxSurge, a node restart bringing all pods back at once, or a shared dependency failure causing CrashLoopBackOff. Each restarting pod attempts to establish connections (to databases, caches, downstream services) simultaneously during warmup. This simultaneous connection burst exhausts connection pools and can crash the dependencies the pods are trying to connect to.

How You'll See It

In Kubernetes

kubectl rollout restart deployment/myapp with maxSurge=100% terminates 20 pods and starts 20 new ones simultaneously. Each new pod opens 5 DB connections on startup, so 100 new connections arrive at once and Postgres hits max_connections=100 immediately. New pods cannot complete their readiness probe (it requires a successful DB query), so they are killed and restarted: a CrashLoopBackOff cascade.
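The safe rollout settings described in the fix section can be expressed directly in the Deployment spec. A sketch, with illustrative values and the `myapp` name carried over from the example above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 20
  minReadySeconds: 10        # a new pod must stay Ready 10s before it counts
                             # as available, throttling rollout progression
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # create at most 1 extra pod at a time
      maxUnavailable: 1      # take down at most 1 old pod at a time
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: myapp:latest
```

With these values the dependency sees at most one new pod's worth of connections every 10 seconds instead of all 100 at once.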

In Linux/Infrastructure

A host reboots. All 15 services configured with Restart=on-failure start within the same systemd activation window. Their combined startup connection load exceeds what MySQL can handle; MySQL itself cannot start cleanly under that load.
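One way to spread out that activation window is a randomized pre-start delay per service. A sketch of a drop-in file (the service name is a placeholder; `$RANDOM` is a bashism, hence the explicit /bin/bash):

```ini
# /etc/systemd/system/myservice.service.d/stagger.conf
[Service]
# Sleep 0-29 seconds before the main process starts, so 15 services
# rebooting together do not hit the database in the same instant.
ExecStartPre=/bin/bash -c 'sleep $((RANDOM % 30))'
```

The same effect can be had by ordering critical services with After= dependencies, at the cost of a longer, serialized boot.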

In CI/CD

A deploy pipeline restarts all workers simultaneously. Workers connect to a job queue (e.g., RabbitMQ) on startup. 40 simultaneous connections trigger RabbitMQ's per-vhost connection limit; workers fail to start.
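Even when all workers restart together, client-side backoff with jitter de-synchronizes their connection attempts so the queue sees a spread-out trickle rather than one burst. A minimal sketch; the `connect_with_backoff` helper and the fake connection function are illustrative, not RabbitMQ's API:

```python
import random
import time

def connect_with_backoff(connect, retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry `connect` with capped exponential backoff plus full jitter."""
    for attempt in range(retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # full jitter: random delay up to the capped exponential bound
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Demo: a fake queue connection that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("connection refused")
    return "channel"

print(connect_with_backoff(flaky_connect, sleep=lambda s: None))  # → channel
```

Because each worker draws a different random delay, 40 restarts arrive at the broker over tens of seconds instead of all at once.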

The Tell

All failing pods have LastState.Terminated timestamps within a few seconds of each other. The failure mode is "can't connect to dependency," not "application error." The database connection-count spike matches the pod restart timestamp.
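The clustering check can be automated once the termination timestamps are extracted from pod status. A sketch; the function name and thresholds are hypothetical:

```python
from datetime import datetime, timedelta

def is_restart_avalanche(terminated_at, window_seconds=10, min_pods=3):
    """Return True if at least `min_pods` termination timestamps fall
    within one `window_seconds` window — the restart-avalanche tell."""
    times = sorted(terminated_at)
    window = timedelta(seconds=window_seconds)
    for i, start in enumerate(times):
        # count timestamps inside [start, start + window]
        n = sum(1 for t in times[i:] if t - start <= window)
        if n >= min_pods:
            return True
    return False

ts = [datetime(2024, 11, 29, 3, 0, s) for s in (0, 2, 3, 4)]
print(is_restart_avalanche(ts))  # → True
```

Pods that fail minutes apart for independent reasons will not trip the check; a genuine avalanche will.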

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Dependency is down | Dependency overloaded by restart | Dependency was fine before the restart; connection count spiked exactly at restart |
| Application bug | Thundering connection burst | All pods fail at startup, not during request processing |
| Cascading failure from dependency | Self-inflicted restart avalanche | Dependency was healthy before the rolling deploy began |

The Fix (Generic)

  1. Immediate: Reduce maxSurge to 1 or 2, or pause the rollout and restart pods one at a time.
  2. Short-term: Add minReadySeconds to the deployment to stagger new pod readiness; set maxSurge=25% for large deployments.
  3. Long-term: Use connection poolers; add startup probes that wait for a database ping before proceeding; implement PodDisruptionBudgets to limit simultaneous pod restarts.
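A startup probe that gates on a database ping (step 3 above) might look like this; the hostname is a placeholder, and `pg_isready` is Postgres's bundled connectivity check:

```yaml
containers:
  - name: myapp
    image: myapp:latest
    startupProbe:
      exec:
        # succeed only once the pod can actually reach the database
        command: ["pg_isready", "-h", "db.internal", "-t", "2"]
      periodSeconds: 5
      failureThreshold: 30   # allow up to 150s of warmup before giving up
```

While the startup probe is failing, Kubernetes neither marks the pod ready nor kills it, so a briefly overloaded database gets room to recover instead of triggering a restart loop.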

Real-World Examples

  • Example 1: kubectl rollout restart on a 20-pod service. All 20 pods started simultaneously, each attempting to create a connection pool of 10 (total: 200) against Postgres's max_connections=150. The last 50 connection attempts were refused, the affected pods failed their readiness probes, and a CrashLoopBackOff loop began.
  • Example 2: A Kubernetes node returned after maintenance and 35 pods were rescheduled onto it simultaneously. Redis received 35 × 3 = 105 connections in 200 ms; its maxclients=100 limit triggered a connection-refusal cascade.

War Story

We did a rolling restart before a Black Friday event to clear a memory leak, and set maxSurge=100% to make it fast. 50 new pods all came up at once, each opening 4 Postgres connections: 200 connections in 2 seconds against a 100-connection max. The pods couldn't complete startup; they crashlooped. The "fast" restart turned into a 20-minute incident. We had to manually set maxSurge=1 via kubectl patch, wait for each pod, then proceed. The maxSurge=1 rolling restart took 8 minutes total, far less than the 20-minute incident we had caused trying to save time.

Cross-References

  • Topic Packs: k8s-ops, distributed-systems
  • Footguns: k8s-ops/footguns.md — "Simultaneous pod restart (maxSurge=100%)"
  • Related Patterns: FP-010 (cache stampede — same thundering herd, different trigger), FP-002 (connection pool exhaustion — the resource that gets depleted)