Incident Replay: Time Sync Skew Breaks Application¶
Setup¶
- System context: Distributed application with 5 API servers behind a load balancer. One server's clock has drifted 3 minutes into the future, causing JWT token validation failures and cache inconsistencies.
- Time: Monday 11:30 UTC
- Your role: On-call SRE
Round 1: Alert Fires¶
[Pressure cue: "20% of API requests are returning 401 Unauthorized. Users report being randomly logged out. Auth team says tokens are valid. 5 minutes to Sev-1 declaration."]
What you see: Error rate is 20% — suspiciously close to 1/5 of the server pool. Load balancer shows all 5 servers healthy. Application logs on most servers show normal auth processing. One server (api-03) has elevated auth failure logs.
Choose your action:
- A) Remove api-03 from the load balancer pool
- B) Check the authentication flow on api-03 specifically
- C) Restart the auth service on all servers
- D) Check if a recent deploy changed the auth configuration
If you chose B (recommended):¶
[Result: date on api-03 shows the time is 3 minutes ahead of the other servers. Tokens issued by api-03 therefore carry iat/nbf claims 3 minutes in the future, so the other servers reject them as "Token not yet valid" (reproducible with curl localhost:8080/auth/validate on a peer), and api-03 itself expires otherwise-valid tokens 3 minutes early. Clock skew, not the auth code, is rejecting the tokens. Proceed to Round 2.]
If you chose A:¶
[Result: Removing api-03 stops the 401 errors. Good mitigation. But you need to find and fix the root cause.]
If you chose C:¶
[Result: Restarting does not fix a clock skew issue. The auth service is working correctly — it is rejecting tokens based on its (wrong) clock.]
If you chose D:¶
[Result: No recent deploys. The issue is environmental (clock), not code.]
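The nbf failure mode above can be reproduced offline. Below is a minimal sketch of the check a validator performs: a token stamped by a clock 3 minutes fast is rejected by a correctly synced peer. The helper names and the sample token (payload {"nbf":1700000180}, dummy header and signature) are fabricated for illustration.

```shell
#!/bin/sh
# Decode a JWT's nbf ("not before") claim and compare it to the clock.

b64url_decode() {
  # Convert base64url to base64 and restore padding before decoding.
  s=$(printf '%s' "$1" | tr '_-' '/+')
  while [ $(( ${#s} % 4 )) -ne 0 ]; do s="${s}="; done
  printf '%s' "$s" | base64 -d
}

check_nbf() {
  # check_nbf JWT NOW_EPOCH -> "OK" or a "Token not yet valid" message
  payload=$(printf '%s' "$1" | cut -d. -f2)
  nbf=$(b64url_decode "$payload" | sed -n 's/.*"nbf":\([0-9][0-9]*\).*/\1/p')
  if [ -n "$nbf" ] && [ "$2" -lt "$nbf" ]; then
    echo "Token not yet valid (nbf=$nbf, now=$2)"
  else
    echo "OK"
  fi
}

# Token stamped by a clock 180 s fast, checked by a correct clock:
tok='hdr.eyJuYmYiOjE3MDAwMDAxODB9.sig'
check_nbf "$tok" 1700000000   # rejected: nbf is still in the future
check_nbf "$tok" 1700000200   # accepted once real time catches up
```

The same arithmetic explains why only cross-server traffic fails: a server validating its own freshly issued tokens never notices its clock is wrong.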
Round 2: First Triage Data¶
[Pressure cue: "api-03 has a 3-minute clock skew. Why is NTP not fixing this?"]
What you see:
timedatectl status on api-03 shows "NTP synchronized: no." systemctl status chronyd shows the NTP daemon crashed 2 days ago and was never restarted. Without NTP discipline, the system clock has drifted.
Choose your action:
- A) Manually set the correct time with date -s and restart chronyd
- B) Restart chronyd and let it gradually slew the clock
- C) Use chronyc makestep to force an immediate time correction
- D) Restart chronyd then use chronyc makestep for immediate sync
If you chose D (recommended):¶
[Result: systemctl start chronyd followed by chronyc makestep forces an immediate time correction. The clock jumps to the correct time, NTP is syncing again, and JWT validation immediately starts working. Proceed to Round 3.]
If you chose A:¶
[Result: date -s sets the clock, but NTP will not be running to prevent future drift. You need both the fix and the daemon.]
If you chose B:¶
[Result: chronyd's default slew rate for a 3-minute offset would take hours. Users cannot wait hours for the clock to gradually correct.]
If you chose C:¶
[Result: chronyc makestep requires a running chronyd, and the daemon has crashed. Start it first.]
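The recommended Round 2 sequence can be sketched as a small runbook. The 0.5-second tolerance and the parsing of chronyc tracking output are assumptions for illustration (field positions may vary by chrony version); the helper itself is pure arithmetic.

```shell
#!/bin/sh
# Decide from a reported offset whether a forced step is still needed.

needs_step() {
  # needs_step OFFSET_SECONDS -> exit 0 when |offset| exceeds 0.5 s
  awk -v o="$1" 'BEGIN { if (o < 0) o = -o; exit !(o > 0.5) }'
}

# On api-03 (requires root and chrony installed):
#   systemctl start chronyd    # the daemon had crashed
#   chronyc makestep           # step the clock immediately
#   offset=$(chronyc tracking | awk '/System time/ {print $4}')
#   needs_step "$offset" && echo "still skewed" || echo "in sync"

needs_step 180  && echo "3-minute skew: step needed"
needs_step 0.02 || echo "within tolerance"
```

Stepping is the right call here because slewing a 3-minute offset is too slow for a live outage; on a database host you would weigh the risks of a backward or forward jump first.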
Round 3: Root Cause Identification¶
[Pressure cue: "Clock fixed. Why did chronyd crash?"]
What you see:
Root cause: chronyd crashed due to a DNS resolution failure: it could not resolve the NTP pool hostname. The daemon was never restarted because the systemd unit had Restart=no (the default). Two days without NTP allowed 3 minutes of drift to accumulate.
Choose your action:
- A) Change the systemd unit to Restart=on-failure and add NTP health monitoring
- B) Use IP addresses instead of hostnames for NTP servers
- C) Configure multiple NTP sources for redundancy
- D) All of the above
If you chose D (recommended):¶
[Result: systemd unit updated with Restart=on-failure. Multiple NTP sources configured (both hostnames and IPs). NTP sync monitoring alert added. Proceed to Round 4.]
If you chose A:¶
[Result: Restart and monitoring are good but a single NTP source is still fragile.]
If you chose B:¶
[Result: IPs avoid DNS issues but miss pool rotation benefits.]
If you chose C:¶
[Result: Multiple sources add redundancy but chronyd still will not restart if it crashes.]
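The Round 3 hardening can be sketched as follows. The drop-in is written to a scratch directory so the sketch is inert; in production it belongs in /etc/systemd/system/chronyd.service.d/. The RestartSec value and the example chrony sources are assumptions (192.0.2.10 is a documentation address standing in for an internal NTP server).

```shell
#!/bin/sh
# Write a systemd drop-in that restarts chronyd after a crash.
DEST=$(mktemp -d)
cat > "$DEST/10-restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5s
EOF

# Redundant time sources for /etc/chrony.conf (illustrative):
#   pool pool.ntp.org iburst
#   server 192.0.2.10 iburst    # IP literal survives DNS outages
#
# Apply with: systemctl daemon-reload && systemctl restart chronyd

cat "$DEST/10-restart.conf"
```

A drop-in is preferable to editing the vendor unit directly: it survives package upgrades and keeps the local policy change in one obvious file.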
Round 4: Remediation¶
[Pressure cue: "NTP syncing. API healthy. Close."]
Actions:
1. Verify time is correct across all 5 servers: for s in api-0{1..5}; do ssh $s date; done
2. Verify NTP is syncing: chronyc tracking
3. Add api-03 back to the load balancer pool (if it was removed)
4. Verify 401 error rate returned to baseline
5. Apply NTP config improvements to all servers in the fleet
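Step 1 above can be automated: gather epoch timestamps from the fleet and report the spread between the fastest and slowest clocks. The ssh collection is shown as a comment; the demo feeds fabricated values so the sketch runs anywhere.

```shell
#!/bin/sh
# Report the worst-case clock spread across a set of epoch timestamps.

max_spread() {
  # max_spread EPOCH... -> max minus min, in seconds
  printf '%s\n' "$@" | sort -n |
    awk 'NR==1 {min=$1} {max=$1} END {print max - min}'
}

# Real collection (serial ssh, so allow a second or two of sampling slop):
#   set -- $(for s in api-0{1..5}; do ssh "$s" date +%s; done)
#   max_spread "$@"

# Fabricated sample: a server still 180 s ahead shows up immediately.
max_spread 1700000000 1700000001 1700000180 1700000000 1700000001
```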
Damage Report¶
- Total downtime: 0 (4 of 5 servers served traffic correctly)
- Blast radius: 20% of API requests returning 401 for ~2 days
- Optimal resolution time: 8 minutes (identify clock skew -> restart NTP -> force sync)
- If every wrong choice was made: 2+ hours with gradual clock slew and continued auth failures
Cross-References¶
- Primer: Linux Ops
- Primer: TLS & Certificates
- Footguns: Linux Ops