The Leap Second Incident
Category: The Mystery | Domains: linux-ops, distributed-systems | Read time: ~5 min
Setting the Scene
It was June 30th, and I was on-call for a fleet of about 1,200 Linux servers running a real-time ad-bidding platform. We handled 50,000 bid requests per second with a hard 100 ms SLA. At 23:59:60 UTC, a leap second scheduled by the International Earth Rotation and Reference Systems Service (IERS) was inserted. I knew it was coming -- I'd even read the email from the sysadmin team saying "we're fine, NTP handles it." I went to bed early.
At 00:00:01 UTC on July 1st, my phone exploded.
What Happened
Nagios fired 347 alerts in 90 seconds. CPU utilization on roughly half the fleet -- about 580 servers -- jumped to 100% and stayed there. The other half was completely fine. top showed Java processes consuming 100% CPU with no useful work. The bidding service had effectively stopped responding on those hosts.
My first instinct was a bad deploy. I checked the deploy log -- nothing had been pushed in 12 hours. I checked for config changes via our Puppet dashboard. Nothing. I ran dmesg on an affected host and saw a line I'd never seen before: Clock: inserting leap second 23:59:60 UTC.
I started comparing affected vs. unaffected hosts. Same Java version (OpenJDK 1.7.0_25), same kernel series (3.2.0), same configs. The difference was subtle: affected hosts were running kernel 3.2.0-23, while unaffected hosts were on 3.2.0-51. The only thing separating them was a minor kernel patch version.
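The host comparison comes down to pulling the build number out of each `uname -r` string and checking it against a known-good build. Here is a minimal sketch, assuming Ubuntu-style release strings such as `3.2.0-23-generic`. It treats 51 as the threshold only because that is the build the unaffected hosts ran in this incident; the function names and the sample inventory are hypothetical.

```python
import re

# Assumption: 3.2.0-51 is treated as the first safe build only because the
# unaffected hosts in this incident ran it; the real fix may have landed earlier.
SAFE_BASE = (3, 2, 0)
SAFE_ABI = 51


def parse_release(release):
    """Split a release string like '3.2.0-23-generic' into ((3, 2, 0), 23)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch, abi = (int(part) for part in match.groups())
    return (major, minor, patch), abi


def is_suspect(release):
    """True when the host runs the 3.2.0 series below the assumed safe build."""
    base, abi = parse_release(release)
    return base == SAFE_BASE and abi < SAFE_ABI


def split_fleet(inventory):
    """Partition a {hostname: release} mapping into (suspect, ok) host lists."""
    suspect = sorted(h for h, rel in inventory.items() if is_suspect(rel))
    ok = sorted(h for h, rel in inventory.items() if not is_suspect(rel))
    return suspect, ok


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would be collected per host.
    fleet = {"bid-001": "3.2.0-23-generic", "bid-002": "3.2.0-51-generic"}
    print(split_fleet(fleet))
```

Comparing the parsed integers rather than the raw strings matters: a plain string comparison would rank `3.2.0-9` above `3.2.0-51`.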
I found a Red Hat Bugzilla entry describing a known issue: the kernel's hrtimer subsystem reacted to the leap second by causing a livelock in futex operations. Java's Thread.sleep() and Object.wait() used futexes under the hood, and when the leap second hit, those timed waits started spinning instead of sleeping. Every sleeping thread turned into a CPU-burning busy-wait loop.
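To see why a wait that stops blocking is so destructive, consider a typical wait-then-check loop. The sketch below is a simulation only: `broken_wait` is a hypothetical stand-in for a timed wait that returns at once. Nothing here touches the kernel or reproduces the actual bug.

```python
import time


def healthy_wait(timeout):
    """Timed wait that really blocks for the requested interval."""
    time.sleep(timeout)


def broken_wait(timeout):
    """Hypothetical stand-in for a timed wait that returns at once."""
    return None


def poll_loop(wait, run_for=0.2, timeout=0.05):
    """Run a wait-then-check loop for `run_for` seconds and count wakeups."""
    wakeups = 0
    deadline = time.monotonic() + run_for
    while time.monotonic() < deadline:
        wait(timeout)   # the thread is supposed to sleep here
        wakeups += 1    # then wake up and check for work
    return wakeups


if __name__ == "__main__":
    print("healthy wakeups:", poll_loop(healthy_wait))
    print("broken wakeups: ", poll_loop(broken_wait))
```

With a working wait the loop wakes a handful of times; with the broken one it iterates as fast as the CPU allows. That matches the symptom on the affected hosts: full CPU with no useful work done.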
The Moment of Truth
The fix was absurdly simple. On each affected host, I ran:
This reset the kernel's time representation: setting the clock forces the kernel to resynchronize its hrtimer state with the wall clock, which cleared the stale state the leap second had left behind. CPU dropped from 100% to 8% within seconds. We scripted it across the fleet with pssh and had all 580 hosts recovered in under four minutes.
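The exact command is not shown here. The workaround widely shared for this bug at the time was to set the system clock to its own current value, for example `date -s "$(date)"`. Assuming that is what ran in this incident, the same operation can be sketched in Python with `time.clock_settime`. The function name and the `dry_run` flag are illustrative, not part of any tool.

```python
import time


def reset_wall_clock(dry_run=True):
    """Set CLOCK_REALTIME to the value it already has.

    The time itself barely changes; the point is that the set-time call
    makes the kernel recompute its timer state. Needs root (CAP_SYS_TIME)
    on Linux when dry_run is False.
    """
    now = time.clock_gettime(time.CLOCK_REALTIME)
    if not dry_run:
        time.clock_settime(time.CLOCK_REALTIME, now)
    return now


if __name__ == "__main__":
    # Dry run by default so the sketch is safe to execute unprivileged.
    print("would set clock to", reset_wall_clock(dry_run=True))
```

Keeping the write behind `dry_run` makes it possible to test the fleet-wide rollout path without stepping any clocks.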
The Aftermath
We patched every host to kernel 3.2.0-51+ within the following week. We also upgraded Java to a version that used CLOCK_MONOTONIC instead of CLOCK_REALTIME for its internal timers. I wrote a runbook titled "Weird Time Events" that included leap seconds, DST transitions, and NTP jumps. Nobody laughed at that title after this incident.
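The difference between the two clocks is easy to show outside the JVM. This is a minimal Python sketch, not the JVM's implementation. `time.monotonic()` is unaffected by system clock updates, while `time.time()` follows the wall clock and can jump when the clock is stepped. The helper name `wait_until` is hypothetical.

```python
import time


def wait_until(predicate, timeout, poll_interval=0.01):
    """Poll `predicate` until it is true or `timeout` seconds have elapsed.

    The deadline is computed from time.monotonic(), so stepping the wall
    clock (NTP correction, manual clock set, leap second) cannot shorten
    or lengthen the wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(poll_interval, remaining))


if __name__ == "__main__":
    start = time.monotonic()
    result = wait_until(lambda: False, timeout=0.1)
    print(result, round(time.monotonic() - start, 2))
```

Had the deadline been computed from `time.time()`, a backward step of the wall clock would stretch the wait and a forward step would cut it short.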
The Lessons
- Keep systems patched: The fix for this kernel bug had been available for months before the leap second hit. We just hadn't applied it to the older fleet tier.
- Leap seconds are real edge cases: They have historically been inserted roughly every 18 months, at irregular intervals, and they expose assumptions in code and kernels that you'd never think to test. (Note: in 2022 the General Conference on Weights and Measures resolved to phase out leap seconds by 2035, but the lesson applies to all "impossible" time events.)
- Have a "weird time" runbook: DST transitions, leap seconds, NTP jumps, clock skew -- all of these cause real production incidents. Document the known failure modes and their fixes before they happen.
What I'd Do Differently
Run a leap-second simulation in staging using adjtimex before every scheduled leap second. Maintain a fleet-wide kernel version inventory with CVE tracking tied to known time-handling bugs. And never again trust "NTP handles it" without verifying which kernel version is doing the handling.
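As a starting point for such a simulation, the sketch below calls `adjtimex(2)` through `ctypes`. It assumes glibc's `struct timex` layout on 64-bit Linux. The constants come from `<sys/timex.h>`; the function names are hypothetical. `read_clock_state` is read-only, while `arm_leap_insert` needs root and belongs on staging hosts only.

```python
import ctypes
import ctypes.util
import os

# Constants from <sys/timex.h>.
ADJ_STATUS = 0x0010   # modes bit: apply the `status` field
STA_INS = 0x0010      # status bit: insert a leap second at the end of the UTC day
CLOCK_STATES = {0: "TIME_OK", 1: "TIME_INS", 2: "TIME_DEL",
                3: "TIME_OOP", 4: "TIME_WAIT", 5: "TIME_ERROR"}


class Timeval(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


class Timex(ctypes.Structure):
    # Assumed layout: glibc's struct timex on 64-bit Linux.
    _fields_ = [
        ("modes", ctypes.c_uint), ("offset", ctypes.c_long),
        ("freq", ctypes.c_long), ("maxerror", ctypes.c_long),
        ("esterror", ctypes.c_long), ("status", ctypes.c_int),
        ("constant", ctypes.c_long), ("precision", ctypes.c_long),
        ("tolerance", ctypes.c_long), ("time", Timeval),
        ("tick", ctypes.c_long), ("ppsfreq", ctypes.c_long),
        ("jitter", ctypes.c_long), ("shift", ctypes.c_int),
        ("stabil", ctypes.c_long), ("jitcnt", ctypes.c_long),
        ("calcnt", ctypes.c_long), ("errcnt", ctypes.c_long),
        ("stbcnt", ctypes.c_long), ("tai", ctypes.c_int),
        ("_reserved", ctypes.c_int * 11),
    ]


def _adjtimex(buf):
    """Call adjtimex(2) and return the clock state, raising OSError on failure."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    state = libc.adjtimex(ctypes.byref(buf))
    if state < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return state


def read_clock_state():
    """Query the kernel clock without changing it (no privileges needed)."""
    buf = Timex()            # modes == 0 means read-only
    state = _adjtimex(buf)
    return CLOCK_STATES.get(state, str(state)), buf.status


def arm_leap_insert():
    """Ask the kernel to insert a leap second at the end of the UTC day.

    Staging only. Needs root (CAP_SYS_TIME).
    """
    buf = Timex()
    _adjtimex(buf)                   # fetch the current status bits
    buf.modes = ADJ_STATUS
    buf.status |= STA_INS
    return CLOCK_STATES.get(_adjtimex(buf), "unknown")


if __name__ == "__main__":
    print(read_clock_state())
```

A full rehearsal would also set the staging clock to shortly before midnight UTC so the insertion actually fires. A running time daemon may overwrite the status flag, so it is probably worth pausing it for the test.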
The Quote
"Five hundred and eighty servers brought to their knees by a single extra second that the Earth insisted on."
Cross-References
- Topic Packs: Linux Ops, Distributed Systems, Kernel Troubleshooting, Linux Kernel Tuning