The Clock Skew Catastrophe¶
Category: The Incident · Domains: linux-ops, distributed-systems · Read time: ~5 min
Setting the Scene¶
I was a mid-level SRE at a financial data aggregation company. We ran a CockroachDB cluster — nine nodes across three availability zones — because we needed strong consistency for transaction records. The system had been rock-solid for 18 months. We also ran an internal NTP pool using three chrony servers that all our infrastructure synced against. NTP is one of those things you set up once and forget about. Which is exactly what we did.
What Happened¶
Wednesday 4:30 PM — Alerts fire: CockroachDB write latency spiking. P99 goes from 15ms to 800ms. A few transactions start failing with uncertainty interval errors. I've never seen this error before.
4:35 PM — I check CockroachDB's built-in dashboard. The clock offset chart shows nodes in AZ-C drifting. They're about 280ms ahead of the other zones. CockroachDB's default maximum clock offset is 500ms. We're not there yet, but the database is adding uncertainty windows to transactions to compensate, which is destroying latency.
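The mechanism behind those uncertainty errors, roughly: a transaction reading at timestamp t can't tell whether a value committed within max_offset after t really happened before or after it on a skewed clock, so it must retry at a later timestamp. A minimal sketch of that check (an illustrative helper, not CockroachDB's actual code):

```python
MAX_OFFSET_S = 0.5  # CockroachDB's default --max-offset is 500ms

def is_uncertain(txn_read_ts: float, value_commit_ts: float) -> bool:
    """A value committed after our read timestamp but within the
    uncertainty window may have happened 'before' us on a skewed
    clock, so the transaction must restart at a later timestamp."""
    return txn_read_ts < value_commit_ts <= txn_read_ts + MAX_OFFSET_S

# A value committed 280ms "in the future" lands inside the window:
print(is_uncertain(100.0, 100.28))  # True -> restart, latency hit
print(is_uncertain(100.0, 100.75))  # False -> unambiguously after us
```

Every restart is a retried transaction, which is why a 280ms skew shows up as an order-of-magnitude P99 regression long before the hard limit is reached.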
4:40 PM — I check chronyc tracking on the AZ-C nodes. They're syncing to our internal NTP server ntp-3.internal. I check ntp-3.internal: its upstream sources are all unreachable. The server has been free-running — drifting on its local hardware clock — for about six hours. The reason: a network ACL change that morning blocked UDP 123 outbound from the NTP server's subnet. Someone was tightening egress rules.
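The offset the dashboard showed can also be read straight off each node. A small sketch that pulls the "System time" line out of `chronyc tracking` output (the sample excerpt below is illustrative, not the real incident capture):

```python
import re

def system_offset_seconds(tracking_output: str) -> float:
    """Extract the local clock's offset from `chronyc tracking` output.
    Positive means the local clock is fast of NTP time."""
    m = re.search(r"System time\s*:\s*([\d.]+) seconds (fast|slow)",
                  tracking_output)
    if not m:
        raise ValueError("no 'System time' line found")
    offset = float(m.group(1))
    return offset if m.group(2) == "fast" else -offset

# Hypothetical excerpt of what an AZ-C node was reporting:
sample = "System time     : 0.280341012 seconds fast of NTP time"
print(round(system_offset_seconds(sample), 3))  # 0.28
```

Feeding this number into a metrics pipeline is essentially what we later built as permanent monitoring.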
4:50 PM — The clock drift is getting worse. Two AZ-C nodes are now at 340ms offset. If they cross 500ms, CockroachDB won't just slow down: a node that detects it is past the maximum offset shuts itself down to protect consistency. We'd lose a third of our cluster.
4:52 PM — I fix the network ACL to allow UDP 123 outbound. But chrony doesn't instantly correct a 340ms drift — it slews the clock slowly to avoid disrupting applications. At the default slew rate, it'll take almost an hour to correct.
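The back-of-the-envelope math on slewing, assuming an effective correction rate on the order of 100 ppm (the actual rate depends on chrony's configuration and the measured frequency error):

```python
def slew_seconds(offset_s: float, rate_ppm: float) -> float:
    """Time to slew away a clock offset at a given rate,
    expressed in parts per million."""
    return offset_s / (rate_ppm * 1e-6)

# 340ms at ~100 ppm is close to an hour:
t = slew_seconds(0.340, 100)
print(f"{t:.0f} s (~{t / 60:.0f} min)")  # 3400 s (~57 min)
```

At 100 ppm the clock gains or loses only 100 microseconds per second, which is gentle on applications but far too slow when the hard limit is minutes away.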
4:55 PM — I make a judgment call. I run chronyc makestep on the three AZ-C nodes to force an immediate clock correction. This is risky — stepping the clock on a database node can cause its own problems. But we're 160ms away from CockroachDB quarantining those nodes entirely.
4:56 PM — Clocks snap back into sync. CockroachDB uncertainty errors stop within 30 seconds. Write latency drops back to normal. I exhale for the first time in 25 minutes.
5:10 PM — I check for data integrity issues from the clock step. CockroachDB handled it gracefully — the step brought clocks back to within 5ms, well inside the uncertainty window. We got lucky.
The Moment of Truth¶
NTP was invisible infrastructure. Nobody monitored it. Nobody thought about it. But our entire distributed database's consistency model depended on clocks agreeing to within half a second. When the clocks drifted, the database did exactly what it was designed to do — slow down to preserve correctness. We experienced consistency working as intended, and it felt like an outage.
The Aftermath¶
We added clock offset monitoring on every node with alerts at 50ms and 200ms. We configured chrony with multiple upstream sources and alerting when any NTP server lost all its upstream references. The NTP servers got added to our "critical infrastructure" tier alongside DNS and the database itself. We also added a CockroachDB-specific dashboard that prominently displayed clock offsets — something we should have had from day one given the database's architecture.
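The two-tier alerting rule is simple enough to sketch in a few lines (threshold names and structure are illustrative, not tied to any particular monitoring stack):

```python
# Warning well before the database notices; critical well before
# CockroachDB's 500ms hard limit.
WARN_S, CRIT_S = 0.050, 0.200

def offset_severity(offset_s: float) -> str:
    """Classify a node's clock offset (seconds, either sign)."""
    magnitude = abs(offset_s)
    if magnitude >= CRIT_S:
        return "critical"
    if magnitude >= WARN_S:
        return "warning"
    return "ok"

print(offset_severity(0.012))   # ok
print(offset_severity(-0.280))  # critical
```

The point of the 50ms tier is headroom: during this incident, an alert at 50ms would have fired hours before the first uncertainty error.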
The Lessons¶
- NTP is critical infrastructure: In a distributed system, clock synchronization is a dependency as fundamental as networking. Monitor it, alert on it, have redundancy.
- Monitor clock skew proactively: Don't wait for your database to tell you clocks are wrong. Alert on offsets well before they hit the tolerance limits of your systems.
- Distributed systems and time are enemies: Any system that depends on synchronized clocks (which is most distributed databases) needs explicit operational attention to clock health. It's not optional.
What I'd Do Differently¶
I'd treat NTP servers as first-class infrastructure from the start — monitored, redundant, with automated failover to public NTP pools if internal servers lose their upstream sources. I'd also run a weekly synthetic check that compares the clock on every node against a public reference and alerts if any node deviates by more than 10ms.
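That synthetic check reduces to the standard NTP offset formula over four timestamps: t0 (client send), t1 (server receive), t2 (server send), t3 (client receive). A sketch of the math, with made-up timestamps:

```python
def ntp_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Clock offset of the local node relative to the reference,
    per the standard NTP formula: ((t1 - t0) + (t2 - t3)) / 2.
    Negative means the local clock is ahead of the reference."""
    return ((t1 - t0) + (t2 - t3)) / 2

# Hypothetical round trip: the reference is ~15ms behind our clock,
# with 10ms of symmetric network delay each way.
offset = ntp_offset(t0=1000.000, t1=999.995, t2=999.996, t3=1000.021)
print(f"{offset * 1000:.1f} ms")  # -15.0 ms
```

A |offset| > 10ms result from any node would page us, independent of whatever chrony itself believes.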
The Quote¶
"Our database was perfectly correct. It was time itself that was broken."
Cross-References¶
- Topic Packs: Distributed Systems, Linux Ops, Database Ops
- Case Studies: Linux Ops