Pattern: Simultaneous Timer Expiry

ID: FP-013 · Family: Thundering Herd · Frequency: Uncommon · Blast Radius: Multi-Service · Detection Difficulty: Subtle

The Shape

When many instances of a service are deployed at the same time (rolling deploy, cluster restart), time-based events (lease renewals, token refreshes, cache TTLs, cron jobs) all expire at the same time — the deploy time plus the TTL. What was designed as a background maintenance task becomes a synchronized thundering herd because all instances share the same "start time."

How You'll See It

In Kubernetes

All 30 pods deployed at 14:00. Each pod caches a session token with a 1-hour TTL. At 15:00, all 30 pods simultaneously attempt to refresh their tokens via the auth service. The auth service, which normally handles 1–2 token refreshes per minute, receives 30 in one second. If the auth service is rate-limited or has a small connection pool, all 30 refresh attempts fail at once.
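One common mitigation for this scenario is to schedule each pod's refresh at a random point before the token actually expires, rather than at the TTL boundary. A minimal sketch (the function name and the 80–95% refresh window are assumptions for illustration, not part of any specific auth client):

```python
import random


def next_refresh_delay(ttl_seconds: float, low: float = 0.8, high: float = 0.95) -> float:
    # Pick a refresh time uniformly between 80% and 95% of the TTL.
    # Pods deployed in the same minute then spread their refreshes
    # across a ~9-minute window (for a 1-hour TTL) instead of one second,
    # while still renewing safely before expiry.
    return random.uniform(ttl_seconds * low, ttl_seconds * high)
```

With a 3600-second TTL this yields a delay between 2880 and 3420 seconds, so the auth service sees a trickle rather than a spike.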

In Linux/Infrastructure

All cron jobs on a fleet of 100 servers are scheduled at the top of the hour via /etc/cron.hourly. 100 jobs hit the shared NFS mount, LDAP server, or backup destination simultaneously. The destination is designed for one job at a time.

In Networking

BGP keepalive timers, initialized to the same value at the same time, all expire simultaneously after a routing daemon restart. The BGP speaker sends a burst of keepalives that overwhelms the control-plane CPU of the peer router.

The Tell

Events or failures occur at a fixed offset from the deployment timestamp or cluster restart time. The pattern repeats every TTL interval after each deploy. Multiple instances fail simultaneously, not staggered.
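The tell above can be checked mechanically: take the failure timestamps, subtract the deploy time, and reduce modulo the TTL. A minimal sketch of that check (function names and the 60-second tolerance are assumptions):

```python
def offsets_from_deploy(deploy_ts: float, event_ts: list[float], ttl: float) -> list[float]:
    # Offset of each event from the deploy time, modulo the TTL.
    # A synchronized timer storm shows offsets clustered near 0
    # (or near ttl, which wraps around to the same boundary).
    return [(t - deploy_ts) % ttl for t in event_ts]


def looks_synchronized(offsets: list[float], ttl: float, tolerance: float = 60) -> bool:
    # True when every event lands within `tolerance` seconds of a
    # TTL boundary, i.e. at a fixed offset from the deploy timestamp.
    return all(min(o, ttl - o) <= tolerance for o in offsets)
```

Events at deploy + 60min, + 120min, + 180min (give or take seconds) all reduce to offsets near zero; a genuinely random load spike will not.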

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Random load spike | Synchronized timer expiry | Spike occurs exactly TTL minutes after the last deploy |
| Service degradation | Self-inflicted timer storm | Failure is periodic and correlates with deploy timestamps |
| Dependency failure | Timer thundering herd | Dependency handles the load fine between timer-expiry events |

The Fix (Generic)

  1. Immediate: Stagger timer values manually by adding a random offset to each instance's TTL: TTL = base_TTL + random(0, base_TTL * 0.2).
  2. Short-term: Use jitter in all time-based renewals at the application level.
  3. Long-term: Design background tasks to use independent schedules per instance (e.g., offset by pod ordinal); use distributed lock/lease systems that self-stagger renewals.
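Step 1 can be sketched in a few lines (a minimal illustration of the formula above, with a hypothetical helper name):

```python
import random


def jittered_ttl(base_ttl: float, jitter_frac: float = 0.2) -> float:
    # TTL = base_TTL + random(0, base_TTL * 0.2), as in step 1:
    # each instance draws its own jitter, so instances that start
    # together stop renewing together.
    return base_ttl + random.uniform(0, base_ttl * jitter_frac)
```

For a 3600-second base TTL this yields values in [3600, 4320], spreading renewals across a 12-minute window.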

Real-World Examples

  • Example 1: JWT tokens with 1-hour expiry issued at deploy time. At T+60min, all 50 services simultaneously tried to get new tokens from the OAuth server. OAuth server rate-limited all of them; all services returned 401 errors for 30 seconds.
  • Example 2: Kubernetes HPA scaling decision uses a 5-minute window. After a fleet restart, all HPA controllers began their 5-minute evaluation windows at the same time. At T+5min, all made scaling decisions simultaneously, triggering a resource request wave on the cluster.

War Story

We noticed our auth service got hammered every hour, on the hour. Took us two weeks to figure out why. We had done a full fleet deploy 3 weeks before. Every pod cached auth tokens with a 60-minute TTL. Because they all started within 2 minutes of each other, they all refreshed within 2 minutes of each other, every hour. The fix was one line: ttl = 3600 + random.randint(0, 600). The synchronized storm disappeared immediately on the next deploy.

Cross-References