Solution

Triage

  1. Compare system times across all nodes:
    # On each node:
    date -u +%Y-%m-%dT%H:%M:%S.%NZ
    
  2. Check chrony status on the affected node:
    timedatectl
    chronyc tracking
    chronyc sources -v
    
  3. Check if the chrony service is running:
    systemctl status chronyd
    
  4. Check network connectivity to NTP servers:
    chronyc ntpdata
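    # UDP is connectionless, so nc can report success even when replies are blocked; treat this as a hint only
    nc -u -zv pool.ntp.org 123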
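
Step 1 can be scripted when there are more than a few nodes. The following is a minimal sketch, not a definitive tool: the NODES list and passwordless SSH access are assumptions, and SSH round-trip latency is included in the reported value, so only differences well above network latency are meaningful.

```shell
# Hypothetical node list; replace with the real hostnames.
NODES="${NODES:-node1 node2 node3}"

# Print (remote - local) in seconds, given two epoch timestamps.
skew_seconds() {
    awk -v r="$1" -v l="$2" 'BEGIN { printf "%.3f\n", r - l }'
}

# Query each node over SSH and print its offset relative to this machine.
check_nodes() {
    for node in $NODES; do
        local_ts=$(date -u +%s.%N)
        remote_ts=$(ssh -o ConnectTimeout=5 "$node" date -u +%s.%N) || {
            echo "$node unreachable"
            continue
        }
        echo "$node $(skew_seconds "$remote_ts" "$local_ts")"
    done
}
```

Run check_nodes from a host that can reach every node. A node that is 4 minutes and 37 seconds ahead would show an offset of roughly 277 seconds.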
    

Root Cause

During the hardware maintenance event, the node was powered off for an extended period. The hardware clock (RTC) drifted while the machine was off. After reboot, the system clock was set from the drifted RTC. The chrony service started but could not reach its configured NTP servers because:

  • The NTP server addresses in /etc/chrony.conf pointed to an internal NTP pool that requires the node's network to be fully configured.
  • A firewall rule added during maintenance blocked outbound UDP 123 traffic.

Without NTP synchronization, the system clock remained 4 minutes and 37 seconds (277 seconds) ahead. etcd's Raft consensus protocol detects clock skew beyond 1 second and rejects messages from the affected node, causing leader election instability. In addition, certificates freshly issued on the affected node carry notBefore timestamps that lie in the future from the perspective of the other nodes, so those nodes treat them as not yet valid.
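
The certificate symptom can be checked directly. The sketch below compares a certificate's notBefore with the local clock; check_cert is a hypothetical helper, the certificate path is whatever file you pass in, and parsing the openssl timestamp relies on GNU date.

```shell
# Return 0 if notBefore (epoch seconds, $1) is still in the future at NOW (epoch seconds, $2).
not_yet_valid() {
    [ "$1" -gt "$2" ]
}

# Hypothetical helper: read notBefore from a PEM file and compare it with this node's clock.
check_cert() {
    nb=$(openssl x509 -in "$1" -noout -startdate | cut -d= -f2)
    nb_epoch=$(date -u -d "$nb" +%s)   # GNU date is assumed here
    if not_yet_valid "$nb_epoch" "$(date -u +%s)"; then
        echo "$1: notBefore is in the future on this node"
    else
        echo "$1: notBefore is in the past on this node"
    fi
}
```

Running check_cert on a healthy node against a certificate issued by the skewed node should report a future notBefore until the healthy node's clock passes that timestamp.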

Fix

Immediate (correct the clock):

  1. Fix network/firewall access to NTP:
    iptables -I OUTPUT -p udp --dport 123 -j ACCEPT
    
  2. Force an immediate time step:
    chronyc makestep
    
  3. Verify synchronization:

    chronyc tracking
    
    Confirm System time offset is < 1ms and Leap status is Normal.

  4. Restart etcd on the affected node (it may need a clean restart after the clock jump):

    systemctl restart etcd
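
The verification in step 3 can be automated before restarting etcd. This is a sketch that assumes the usual human-readable layout of chronyc tracking ("System time : <seconds> seconds fast/slow of NTP time" and "Leap status : <state>"); tracking_ok is a hypothetical name.

```shell
# Read `chronyc tracking` output on stdin.
# Exit 0 only if the System time offset is below 1 ms and Leap status is Normal.
tracking_ok() {
    awk -F' *: *' '
        /^System time/ { split($2, f, " "); offset = f[1] + 0; seen_offset = 1 }
        /^Leap status/ { leap = $2 }
        END {
            sub(/[ \t]+$/, "", leap)
            if (seen_offset && offset < 0.001 && leap == "Normal") {
                print "in sync"
                exit 0
            }
            print "NOT in sync"
            exit 1
        }
    '
}
```

Possible usage: chronyc tracking | tracking_ok && systemctl restart etcd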
    

Permanent fix:

  1. Ensure chrony starts on boot and can reach NTP servers:
    systemctl enable chronyd
    
  2. Configure chrony to allow large initial corrections:
    # /etc/chrony.conf
    # Step the clock during the first 3 updates if the offset exceeds 1 second
    makestep 1.0 3
    
  3. Persist the firewall rule or fix the firewall configuration that blocks NTP.
  4. Sync hardware clock to system time:
    hwclock --systohc
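
To confirm the permanent settings survive later edits, the configuration can be audited. A minimal sketch, assuming the config lives at /etc/chrony.conf (some distributions use /etc/chrony/chrony.conf); has_directive and audit_chrony are hypothetical names, and rtcsync is included because it keeps the RTC updated from the system clock.

```shell
# Return 0 if FILE ($1) contains an uncommented directive NAME ($2).
has_directive() {
    grep -Eq "^[[:space:]]*$2([[:space:]]|\$)" "$1"
}

# Report whether the directives of interest are present in the chrony config.
audit_chrony() {
    conf="${1:-/etc/chrony.conf}"
    for d in makestep rtcsync; do
        if has_directive "$conf" "$d"; then
            echo "$d: present"
        else
            echo "$d: missing"
        fi
    done
}
```

Possible usage: audit_chrony /etc/chrony.conf, alongside systemctl is-enabled chronyd to confirm the service starts on boot.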
    

Rollback / Safety

  • A sudden time jump can affect running applications (timers, cron jobs, database transactions). Plan the makestep during low-traffic periods if possible.
  • etcd may need to be restarted after a clock correction to clear stale leader election state.
  • Monitor the cluster for 15-30 minutes after correction to ensure consensus stability.

Common Traps

  • Using date --set instead of chronyc makestep. Setting the time manually does not integrate with chrony and can cause chrony to fight the manual change.
  • Not checking the hardware clock. If the RTC has drifted and nothing writes the corrected time back to it (rtcsync in chrony.conf, or hwclock --systohc), the wrong time will return after a reboot.
  • Assuming NTP is always running. Some cloud instances and VMs use hypervisor-based time sync instead of NTP. Use timedatectl to confirm whether an NTP service is active and whether the system clock is reported as synchronized.
  • Ignoring sub-second drift. Even 500ms of skew can cause issues with distributed databases and certificate validation.
  • Not monitoring clock skew. Export node_timex_offset_seconds from node_exporter and alert when its absolute value exceeds 100ms (0.1, since the metric is in seconds).
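
Until an alert rule is in place, the same threshold can be spot-checked from the shell. A sketch only: offset_alert is a hypothetical name, and the usage line assumes node_exporter listens on its default port 9100 on localhost.

```shell
# Read Prometheus text-format metrics on stdin.
# $1 is the alert limit in seconds (default 0.1, i.e. 100 ms).
# Exit 0 if within the limit, 1 if exceeded, 2 if the metric is absent.
offset_alert() {
    awk -v limit="${1:-0.1}" '
        $1 == "node_timex_offset_seconds" {
            v = $2 + 0
            if (v < 0) v = -v          # compare the absolute value
            found = 1
            if (v > limit) {
                printf "ALERT offset=%ss\n", $2
                bad = 1
            } else {
                printf "OK offset=%ss\n", $2
            }
        }
        END {
            if (!found) {
                print "metric not found"
                exit 2
            }
            exit bad + 0
        }
    '
}
```

Possible usage: curl -s http://localhost:9100/metrics | offset_alert 0.1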