## Solution

### Triage
- Compare system times across all nodes:
- Check chrony status on the affected node:
- Check if the chrony service is running:
- Check network connectivity to NTP servers:
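These checks can be sketched as follows, assuming SSH access between nodes and a chronyd-managed clock; the node names and the NTP pool hostname are placeholders:

```shell
# 1. Compare system times across all nodes (hostnames are placeholders).
for host in node1 node2 node3; do
  printf '%s: ' "$host"
  ssh "$host" date -u +%Y-%m-%dT%H:%M:%S.%3NZ
done

# 2. Chrony tracking state on the affected node; the "System time" line
#    shows the current offset from NTP.
chronyc tracking

# 3. Is the chrony service running?
systemctl status chronyd

# 4. Can the node reach its configured NTP servers? (pool name is a placeholder)
chronyc sources -v
nc -uzv ntp.internal.example.com 123
```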
### Root Cause
During the hardware maintenance event, the node was powered off for an extended period. The hardware clock (RTC) drifted while the machine was off. After reboot, the system clock was set from the drifted RTC. The chrony service started but could not reach its configured NTP servers because:
- The NTP server addresses in `/etc/chrony.conf` pointed to an internal NTP pool that requires the node's network to be fully configured.
- A firewall rule added during maintenance blocks outbound UDP 123 traffic.
Without NTP synchronization, the system clock remained 4 minutes and 37 seconds ahead. etcd's Raft consensus protocol detects clock skew beyond 1 second and rejects messages from the affected node, causing leader election instability. The `notBefore` timestamps on freshly issued TLS certificates appear to be in the future from the perspective of the other nodes.
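To see the skew from the other nodes' perspective, compare a freshly issued certificate's `notBefore` with the local clock. The certificate path below is illustrative; substitute the certificate your cluster actually issues:

```shell
# Print the notBefore of a certificate issued by the skewed node
# (path is an assumption; point it at the relevant cert).
openssl x509 -in /etc/etcd/pki/server.crt -noout -startdate

# Compare with this node's idea of "now" (UTC).
date -u
# A notBefore later than "now" means the issuer's clock runs ahead.
```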
### Fix
Immediate (correct the clock):
- Fix network/firewall access to NTP:
- Force an immediate time step:
- Verify synchronization: confirm the `System time` offset is < 1 ms and `Leap status` is Normal.
- Restart etcd on the affected node (it may need a clean restart after the clock jump):
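A sketch of the immediate fix, assuming firewalld manages the host firewall and etcd runs as a systemd unit named `etcd`:

```shell
# Open outbound NTP at runtime (firewalld example; persist it only after
# confirming this is what the maintenance change should have left in place).
firewall-cmd --add-service=ntp

# Step the clock now instead of slewing the 4m37s offset over hours.
chronyc makestep

# Verify: "System time" offset < 1 ms and "Leap status : Normal".
chronyc tracking

# Restart etcd so it rejoins consensus cleanly after the jump.
systemctl restart etcd
```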
Permanent fix:
- Ensure chrony starts on boot and can reach NTP servers:
- Configure chrony to allow large initial corrections:
- Persist the firewall rule or fix the firewall configuration that blocks NTP.
- Sync hardware clock to system time:
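On a systemd host with firewalld, the permanent fix might look like the following; the chrony directive `makestep 1.0 3` tells chronyd to step (rather than slew) the clock if the offset exceeds 1 second during its first 3 updates after startup:

```shell
# Start chronyd at boot.
systemctl enable --now chronyd

# Allow a large initial correction (append to /etc/chrony.conf).
echo 'makestep 1.0 3' | tee -a /etc/chrony.conf
systemctl restart chronyd

# Persist the NTP firewall opening (firewalld example).
firewall-cmd --permanent --add-service=ntp
firewall-cmd --reload

# Write the now-correct system time back to the RTC.
hwclock --systohc
```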
### Rollback / Safety
- A sudden time jump can affect running applications (timers, cron jobs, database transactions). Plan the `makestep` during low-traffic periods if possible.
- etcd may need to be restarted after a clock correction to clear stale leader election state.
- Monitor the cluster for 15-30 minutes after correction to ensure consensus stability.
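One way to watch consensus stability during that window, assuming etcdctl v3 with placeholder endpoints (TLS cert flags omitted for brevity):

```shell
# Endpoints are placeholders for your cluster members.
ENDPOINTS=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379

# Re-check health and leadership every 30 s; a stable LEADER column
# across refreshes indicates consensus has settled.
watch -n 30 "etcdctl --endpoints=$ENDPOINTS endpoint status -w table; \
  etcdctl --endpoints=$ENDPOINTS endpoint health"
```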
### Common Traps
- Using `date --set` instead of `chronyc makestep`. Setting the time manually does not integrate with chrony and can cause chrony to fight the manual change.
- Not checking the hardware clock. If the RTC has drifted and chrony is configured without `rtcsync`, the drift will return after a reboot.
- Assuming NTP is always running. Some cloud instances and VMs use hypervisor-based time sync instead of NTP. Check `timedatectl` for the actual sync method.
- Ignoring sub-second drift. Even 500 ms of skew can cause issues with distributed databases and certificate validation.
- Not monitoring clock skew. Export `node_timex_offset_seconds` from node_exporter and alert on values exceeding 100 ms.
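As a quick spot check of that metric, assuming node_exporter listens on its default port 9100 and using the 100 ms threshold from above:

```shell
# Print OK/ALERT based on the current kernel clock offset exported
# by node_exporter (port and threshold are assumptions).
curl -s http://localhost:9100/metrics |
  awk '$1 == "node_timex_offset_seconds" { print (($2 < 0.1 && $2 > -0.1) ? "OK" : "ALERT"), $2 }'
```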