Why NTP Matters More Than You Think

Topics: NTP, clock skew, TLS, Kafka, distributed locks, TOTP 2FA, leap seconds, chrony
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: None

The Mission

Your TLS certificates are valid. Your Kafka consumers are healthy. Your 2FA codes work. Your make builds correctly. Your cron jobs run on time. Your database replication is consistent.

All of these depend on one thing: your system clock being correct. When NTP breaks, none of them tell you "the clock is wrong." They just... break. In confusing ways. With error messages that point everywhere except the actual problem.

What NTP Does

NTP (Network Time Protocol) synchronizes your system clock with authoritative time sources. Without it, your server's clock drifts — typically 0.5-2 seconds per day on modern hardware, much more on VMs.
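Drift figures like these come from oscillator error, usually quoted in parts per million (ppm). Here is a minimal sketch of the arithmetic, assuming a constant frequency error (real oscillators vary with temperature and load). The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def drift_seconds_per_day(ppm: float) -> float:
    """Seconds gained or lost per day by a clock whose frequency is off by `ppm`."""
    return ppm * 1e-6 * SECONDS_PER_DAY


# Frequency errors chosen to bracket the 0.5-2 s/day range quoted above.
for ppm in (6, 12, 23):
    print(f"{ppm:>3} ppm -> {drift_seconds_per_day(ppm):.2f} s/day")
```

Under that assumption, 0.5-2 seconds per day corresponds to roughly 6-23 ppm of frequency error.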

# Check NTP status
timedatectl
# → System clock synchronized: yes
# → NTP service: active

# Check chrony (modern NTP client)
chronyc tracking
# → Reference ID:    169.254.169.123 (Amazon NTP)
# → System time:     0.000023 seconds fast of NTP time
# → Last offset:     +0.000012 seconds

# Check NTP sources
chronyc sources
# → ^* 169.254.169.123    1   6  377   12   +23us   45us
#   ↑ * = current source  ↑ stratum 1  ↑ 23 microsecond offset
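If you want to act on the offset in a script, the "System time" line can be parsed. This is a rough sketch that assumes the line reads the way it does in the sample output above. `parse_offset_seconds` is a hypothetical helper, not part of chrony.

```python
import re
from typing import Optional

# Matches e.g. "System time     : 0.000023 seconds fast of NTP time"
_SYSTEM_TIME = re.compile(
    r"System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time"
)


def parse_offset_seconds(tracking_output: str) -> Optional[float]:
    """Return the signed offset in seconds (positive means the local clock is fast)."""
    match = _SYSTEM_TIME.search(tracking_output)
    if match is None:
        return None  # line missing or worded differently
    magnitude = float(match.group(1))
    return magnitude if match.group(2) == "fast" else -magnitude


sample = "System time     : 0.000023 seconds fast of NTP time"
print(parse_offset_seconds(sample))
```

To use it on a live host, one option is to pass in the text captured from `chronyc tracking` (for example via `subprocess.run`).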

Name Origin: NTP is one of the oldest internet protocols still in active use. It was designed by David Mills at the University of Delaware and first specified in 1985 (RFC 958). Mills maintained and evolved NTP for over 30 years, one of the longest single-maintainer runs in internet history. The current version, NTPv4 (RFC 5905), keeps the same core design because the problem (keeping clocks synchronized over a network) hasn't changed.

What Breaks When the Clock Is Wrong

TLS certificates "expire" early (or aren't valid yet)

Certificates have "Not Before" and "Not After" timestamps. If your clock is 5 minutes ahead, a certificate that expires in 4 minutes appears expired NOW.

Server clock:  2026-03-23 14:05:00 (5 minutes fast)
Certificate:   Not After: 2026-03-23 14:02:00
Reality:       Certificate is valid (it's actually 14:00:00)
Server sees:   Certificate EXPIRED (clock says 14:05, cert says 14:02)

Your HTTPS connections fail. curl returns SSL certificate problem: certificate has expired. Nothing is actually expired — the clock is wrong.
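The validity check itself is only a comparison against the local clock, which is why skew is enough to break it. The sketch below reuses the timestamps from the example. The `not_before` date and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def cert_status(not_before: datetime, not_after: datetime, now: datetime) -> str:
    """Classify a certificate the way a validity-window check would."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


not_before = datetime(2026, 1, 1, tzinfo=timezone.utc)  # assumed issue date
not_after = datetime(2026, 3, 23, 14, 2, tzinfo=timezone.utc)
true_time = datetime(2026, 3, 23, 14, 0, tzinfo=timezone.utc)
skewed_time = true_time + timedelta(minutes=5)  # server clock is 5 minutes fast

print(cert_status(not_before, not_after, true_time))    # valid
print(cert_status(not_before, not_after, skewed_time))  # expired
```

A slow clock fails the other comparison: a freshly issued certificate is reported as "not yet valid".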

War Story: A datacenter's NTP server drifted 5 minutes ahead after a firmware update disabled synchronization. Over 72 hours, BMC management certificates "expired" one by one. Monitoring flagged the warnings but they landed in a Slack channel with 800 alerts per day. By the time someone investigated, 30% of server management interfaces were unreachable.

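Kafka rejects messages

Kafka uses record timestamps for time-based retention, compaction timing, and timestamp-based offset lookups. By default those timestamps are set by the producer, so clock skew on producers or brokers causes:

Messages with future timestamps (from hosts with fast clocks) that don't get compacted or deleted on schedule
Time-based consumer lag calculations that show negative numbers (consumer "ahead" of producer)
Retention policies that fire too early or too late
Outright rejection of records when the broker or topic is configured with a maximum allowed timestamp difference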
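To see why a future timestamp delays retention, here is a simplified model of two time-based checks. It is a sketch of the idea, not Kafka's implementation. The function names and the 5-minute limit are made up for illustration.

```python
HOUR_MS = 3_600_000
DAY_MS = 24 * HOUR_MS
RETENTION_MS = 7 * DAY_MS


def retention_due(newest_record_ts_ms: int, broker_now_ms: int, retention_ms: int) -> bool:
    """A segment becomes deletable once its newest record is older than the retention period."""
    return broker_now_ms - newest_record_ts_ms > retention_ms


def timestamp_acceptable(record_ts_ms: int, broker_now_ms: int, max_diff_ms: int) -> bool:
    """Accept a record only if its timestamp is close enough to the broker's clock."""
    return abs(broker_now_ms - record_ts_ms) <= max_diff_ms


written_at = 0                      # true time of the write, in ms
future_ts = written_at + HOUR_MS    # stamped by a producer whose clock is 1 hour fast
check_time = written_at + RETENTION_MS + 1

print(retention_due(written_at, check_time, RETENTION_MS))  # True: deleted on schedule
print(retention_due(future_ts, check_time, RETENTION_MS))   # False: lingers an extra hour
print(timestamp_acceptable(future_ts, written_at, max_diff_ms=5 * 60_000))  # False: rejected
```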

TOTP 2FA codes fail

TOTP (Time-based One-Time Password — what Google Authenticator uses) generates codes based on the current 30-second window. If your server's clock is off by more than 30 seconds, the code your user entered is "expired" (or "not yet valid") from the server's perspective.

User's phone: 14:00:00 → code: 123456 (valid 14:00:00–14:00:30)
Server clock: 14:00:45 → expects code for 14:00:30–14:01:00
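A strict server rejects 123456 because it's from the "previous" window

Most TOTP implementations accept ±1 window, which tolerates roughly 30 seconds of skew in either direction, so the code above would still pass. Beyond about a minute of drift, every code falls outside the accepted windows and 2FA fails for all users.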
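The window logic can be sketched with the standard library. The `totp` function below follows the RFC 6238 construction (HMAC-SHA1 over a 30-second counter). The `verify` helper and its `window` parameter are illustrative, not any particular library's API.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute the TOTP code for the time step that contains `unix_time`."""
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation
    binary = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(binary % 10**digits).zfill(digits)


def verify(secret: bytes, code: str, server_time: int, window: int = 1, step: int = 30) -> bool:
    """Accept codes from the server's current step and `window` steps on either side."""
    return any(
        hmac.compare_digest(totp(secret, server_time + k * step, step), code)
        for k in range(-window, window + 1)
    )


secret = b"12345678901234567890"   # test secret from RFC 6238
phone_time = 1_700_000_010         # start of a 30-second step
code = totp(secret, phone_time)

print(verify(secret, code, phone_time + 45))            # server 45 s fast, ±1 window
print(verify(secret, code, phone_time + 45, window=0))  # same skew, strict server
print(verify(secret, code, phone_time + 150))           # server 2.5 minutes fast
```

With 45 seconds of skew the code only passes because of the extra window. At 2.5 minutes no accepted window covers it.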

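`make` rebuilds everything (or nothing)

make compares file timestamps to decide what needs rebuilding. If your clock jumps backward, files you edit afterward are stamped "older" than the existing build artifacts, so make thinks nothing changed and skips them. If source files carry timestamps from the future (saved while the clock was fast, or on a file server with a fast clock), every target looks out of date and make rebuilds it on every run, typically with a "Clock skew detected" warning.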
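The rule make applies is simple: a target is stale if any prerequisite has a newer modification time. Here is a toy version of that comparison, with made-up timestamps in seconds, showing both failure modes.

```python
from typing import Iterable, Optional


def needs_rebuild(target_mtime: Optional[float], source_mtimes: Iterable[float]) -> bool:
    """A target is stale if it is missing or any source is newer than it."""
    if target_mtime is None:
        return True
    return any(src > target_mtime for src in source_mtimes)


# Artifact built at t=1000. The clock then steps back 300 s, so an edit made
# at true time 1100 is stamped 800 and looks older than the artifact.
print(needs_rebuild(1000, [800]))   # False: the edit is silently ignored

# Source saved while the clock was fast (stamped 5000); after the clock is
# corrected, a rebuild at t=1200 still leaves the target "older" than it.
print(needs_rebuild(1200, [5000]))  # True: rebuilt on every run
```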

Distributed locks expire early

Distributed locks (Redis, etcd, ZooKeeper) have TTLs. If the clock that enforces the TTL runs fast or is stepped forward, the lock expires early, before the work is done. Another process acquires the "expired" lock while the original holder still believes it owns it, and now two processes act as lock holders simultaneously.
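The failure is easy to reproduce with a toy lock service whose deadline is an absolute wall-clock time. `FakeClock` and `LockServer` are hypothetical classes for illustration and do not model any real lock service.

```python
class FakeClock:
    """Stand-in for the wall clock so the test can step it."""

    def __init__(self, now: float = 0.0) -> None:
        self.now = now

    def __call__(self) -> float:
        return self.now


class LockServer:
    """Toy lock service: a lock is held until an absolute wall-clock deadline."""

    def __init__(self, clock: FakeClock) -> None:
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client: str, ttl: float) -> bool:
        if self.holder is not None and self.clock() < self.expires_at:
            return False  # someone else still holds an unexpired lock
        self.holder = client
        self.expires_at = self.clock() + ttl
        return True


clock = FakeClock(now=1000.0)
server = LockServer(clock)

print(server.acquire("A", ttl=30))  # True: A holds the lock for 30 s
print(server.acquire("B", ttl=30))  # False: lock is still held
clock.now += 60                     # clock is stepped forward; no real time has passed
print(server.acquire("B", ttl=30))  # True: A's lock "expired" while A is still working
```

Common mitigations are to measure TTLs with a monotonic clock and to guard the protected resource with fencing tokens.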

Database replication anomalies

Statement-based replication uses timestamps from the primary. If the primary's clock is wrong, the wrong NOW() values are copied faithfully to the replica. Audit trails and created_at fields are then wrong everywhere, and anything evaluated against each server's local clock (time-based queries, replication-lag estimates) produces different results on primary vs replica.

The Leap Second: When Time Itself Is Wrong

Earth's rotation isn't constant. It slows down slightly over time. To keep UTC aligned with solar time, a leap second is occasionally inserted — 23:59:59 is followed by 23:59:60 instead of 00:00:00.

Most software doesn't handle 23:59:60 well.

Trivia: On June 30, 2012, a leap second took down parts of Reddit, Mozilla, and Yelp, and triggered a Linux kernel bug on servers worldwide. The bug: handling the leap second triggered a futex (fast userspace mutex) problem in the kernel, causing high CPU usage in Java and MySQL processes. The fix was a kernel patch, but many teams discovered the problem at midnight UTC on a Saturday night.

Google's solution: leap smear. Instead of inserting one extra second, Google's NTP servers spread the adjustment over 24 hours: each second is slightly longer (by ~11.6 microseconds). No 23:59:60, no software bugs. AWS, Azure, and Cloudflare now do the same. Smeared and unsmeared servers disagree while a smear is in progress, so don't mix the two kinds in one client configuration.

The last leap second was in 2016. The next one might never happen: in 2022, the General Conference on Weights and Measures voted to abolish leap seconds by 2035.
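The ~11.6 microsecond figure comes from spreading one second evenly over 86,400 seconds. The sketch below assumes a purely linear 24-hour smear. The function name is illustrative.

```python
SMEAR_WINDOW_S = 86_400  # 24-hour smear window


def smear_applied(elapsed_s: float) -> float:
    """Fraction of the leap second absorbed `elapsed_s` seconds into a linear smear."""
    clamped = min(max(elapsed_s, 0.0), SMEAR_WINDOW_S)
    return clamped / SMEAR_WINDOW_S


stretch_us = 1 / SMEAR_WINDOW_S * 1e6  # extra length of each smeared second
print(f"{stretch_us:.1f} us per second")                       # 11.6 us per second
print(f"{smear_applied(43_200):.2f} s absorbed at the midpoint")  # 0.50 s
```

Halfway through the window a smeared clock is half a second away from an unsmeared one, which is why the two kinds of server should not be combined.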

Configuring NTP Properly

chrony (recommended for modern Linux)

# Install
sudo apt install chrony

# /etc/chrony/chrony.conf
server 169.254.169.123 prefer iburst   # AWS NTP (if on AWS)
server time.google.com iburst           # Google NTP
server time.cloudflare.com iburst       # Cloudflare NTP

# Step the clock if the offset exceeds 1 second, but only during the
# first 3 updates after startup (for VMs that boot with the wrong time)
makestep 1.0 3

# After that, chrony only slews (gradually adjusts); cap the slew rate at 500 ppm
maxslewrate 500

# Log statistics for monitoring
logdir /var/log/chrony
log measurements statistics tracking

# Start and enable
sudo systemctl enable --now chronyd

# Check status
chronyc tracking
chronyc sources -v

Gotcha: VMs and containers are especially vulnerable to clock drift. VMs don't have direct access to hardware clocks — they rely on the hypervisor or NTP. Containers inherit the host's clock. If the host's NTP is broken, every container on that host has wrong time — and there's nothing the container can do about it.

Monitoring Clock Health

# Prometheus node_exporter provides:
# node_timex_offset_seconds — current offset from NTP
# node_timex_sync_status — 1 if synchronized, 0 if not

# Alert when clock drifts
- alert: ClockSkew
  expr: abs(node_timex_offset_seconds) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Clock offset {{ $value }}s on {{ $labels.instance }}"

# Alert when NTP is not synchronized
- alert: NTPNotSynced
  expr: node_timex_sync_status != 1
  for: 10m
  labels:
    severity: critical
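The `for:` clause means the condition has to hold continuously before the alert fires, so a single bad sample does not page anyone. The following is a rough model of that behaviour, not Prometheus's actual evaluator. The sample data and function name are invented for illustration.

```python
from typing import Iterable, Tuple


def alert_fires(samples: Iterable[Tuple[float, float]], threshold: float, hold_s: float) -> bool:
    """samples: (timestamp_s, offset_s) pairs in time order.

    Fires once |offset| has exceeded `threshold` at every sample for `hold_s` seconds.
    """
    breach_started = None
    for ts, offset in samples:
        if abs(offset) > threshold:
            if breach_started is None:
                breach_started = ts  # condition just became true
            if ts - breach_started >= hold_s:
                return True
        else:
            breach_started = None    # condition cleared; start over
    return False


sustained = [(t, 0.6) for t in range(0, 301, 60)]
blip = [(0, 0.6), (60, 0.6), (120, 0.1), (180, 0.6), (240, 0.6)]

print(alert_fires(sustained, threshold=0.5, hold_s=300))  # True
print(alert_fires(blip, threshold=0.5, hold_s=300))       # False
```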

Flashcard Check

Q1: TLS certificate "expired" but it shouldn't be. First thing to check?

System clock. timedatectl — is NTP synchronized? A fast clock makes valid certificates appear expired.

Q2: TOTP 2FA codes rejected for all users. Most likely cause?

Server clock drifted more than 30-60 seconds from actual time. TOTP codes are time-based with a 30-second window.

Q3: What is leap smear?

Instead of inserting a leap second (23:59:60), NTP servers spread the extra second over 24 hours. Each second is slightly longer. No 23:59:60, no software crashes.

Q4: Why are VMs/containers especially vulnerable to clock drift?

VMs don't have direct hardware clock access. Containers inherit the host's clock. If the host's NTP breaks, every container on it has wrong time.

Q5: chrony vs ntpd — which should you use?

chrony. It handles intermittent connections (laptops, VMs), converges faster after boot, and has better VM support. ntpd is older and designed for always-connected servers.

Cheat Sheet

| Task | Command |
|------|---------|
| Check NTP status | `timedatectl` |
| Check chrony status | `chronyc tracking` |
| Check NTP sources | `chronyc sources -v` |
| Force time sync | `chronyc makestep` |
| Enable NTP | `timedatectl set-ntp true` |
| Check offset | `chronyc tracking \| grep "System time"` |

What Breaks at Each Drift Level

| Drift | What breaks |
|-------|-------------|
| >1 second | Distributed locks may expire early |
| >30 seconds | TOTP 2FA fails for all users |
| >5 minutes | TLS certificates appear expired |
| >1 hour | Kafka retention/compaction anomalies |
| >1 day | Everything: databases, logs, cron, backups |

Takeaways

NTP is invisible infrastructure. You never think about it until it breaks. Then everything breaks and nothing says "the clock is wrong."
Certificate "expiry" is often clock skew. Before blaming the cert, check timedatectl. A fast clock makes valid certs appear expired.
VMs and containers can't fix their own clocks. They inherit the host's time. Monitor NTP on every host, not just "the NTP server."
Leap seconds crash software. Google solved this with leap smear (gradual adjustment). Use Google, AWS, or Cloudflare NTP servers that implement smear.
Monitor clock offset, not just "NTP is running." NTP can be running and still drifting if all upstream sources are unreachable.

What Happens When Your Certificate Expires — TLS failures from clock skew
The Split-Brain Nightmare — distributed consensus depends on time
The Mysterious Latency Spike — clock issues in performance monitoring