Investigation: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config¶
Phase 1: Kubernetes Investigation (Dead End)¶
Check the HPA configuration:
$ kubectl describe hpa frontend-hpa -n prod
Name:          frontend-hpa
Namespace:     prod
Reference:     Deployment/frontend
Metrics:       ( current / target )
  resource cpu on pods:  fluctuating / 50%
Min replicas:  3
Max replicas:  15
Behavior:
  Scale Up:
    Stabilization Window: 60s
    Policies:
      Type: Pods  Value: 4  Period: 60s
  Scale Down:
    Stabilization Window: 300s
    Policies:
      Type: Percent  Value: 25  Period: 60s
Events:
  Type    Reason             Age  Message
  ----    ------             ---  -------
  Normal  SuccessfulRescale  2m   New size: 12; reason: cpu resource utilization above target
  Normal  SuccessfulRescale  5m   New size: 3; reason: All metrics below target
  Normal  SuccessfulRescale  7m   New size: 11; reason: cpu resource utilization above target
  Normal  SuccessfulRescale  10m  New size: 3; reason: All metrics below target
The scale-down stabilization window is 300s, which should prevent rapid oscillation, yet the events show the HPA bouncing between 3 and roughly 12 replicas every few minutes. Check the metrics directly:
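For reference, the same policy expressed as an autoscaling/v2 manifest stanza (field names from the HPA v2 API; this is a sketch reconstructed from the describe output, not the cluster's actual manifest):

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
```

Nothing here is misconfigured, which is the first hint that the problem is upstream of the HPA.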
$ kubectl top pods -n prod -l app=frontend
NAME                       CPU(cores)   MEMORY(bytes)
frontend-7c6d5e4f3-a1b2c   478m         128Mi
frontend-7c6d5e4f3-d3e4f   12m          126Mi
frontend-7c6d5e4f3-g5h6i   451m         130Mi
Wildly different CPU values for pods running the same workload. But wait — run it again 30 seconds later:
$ kubectl top pods -n prod -l app=frontend
NAME                       CPU(cores)   MEMORY(bytes)
frontend-7c6d5e4f3-a1b2c   15m          128Mi
frontend-7c6d5e4f3-d3e4f   462m         126Mi
frontend-7c6d5e4f3-g5h6i   18m          130Mi
The CPU values flipped completely. Pod d3e4f went from 12m to 462m while the others dropped. This is not real CPU usage — the metrics are wrong.
The Pivot¶
Check the Metrics Server:
$ kubectl get pods -n kube-system -l k8s-app=metrics-server
NAME                              READY   STATUS    RESTARTS   AGE
metrics-server-6d94bc8694-r7x2m   1/1     Running   0          14d
$ kubectl logs -n kube-system metrics-server-6d94bc8694-r7x2m --tail=20
E0319 13:44:12.482 scraper.go:140] "Failed to scrape node" err="time skew detected: node time differs from server time by 127s" node="worker-node-05"
E0319 13:44:12.483 scraper.go:140] "Failed to scrape node" err="time skew detected: node time differs from server time by -89s" node="worker-node-02"
W0319 13:44:42.100 scraper.go:119] "Stale metrics from kubelet" node="worker-node-05" age="2m7s"
Clock skew detected. The Metrics Server is seeing time differences of over 2 minutes between nodes. This corrupts the CPU usage calculation because CPU utilization is computed as (cpu_time_delta / wall_time_delta). If the wall time delta is wrong due to clock skew, the computed utilization is meaningless.
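A back-of-the-envelope sketch of how badly this distorts the number (hypothetical values: a pod consumed 5 CPU-seconds over a real 10-second scrape interval, but the node's clock jumped 127s ahead between scrapes):

```shell
cpu_delta_ms=5000    # cumulative CPU-time delta between scrapes, in milliseconds
true_window=10       # real seconds between scrapes -> 500m, the true usage
skewed_window=137    # apparent window after a +127s clock jump
echo "true:   $((cpu_delta_ms / true_window))m"
echo "skewed: $((cpu_delta_ms / skewed_window))m"
```

The skewed reading of 36m is what makes a busy pod look idle; a clock jumping backwards does the reverse, shrinking the window and inflating usage.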
Phase 2: Observability Investigation (Root Cause)¶
The Metrics Server computes CPU utilization using timestamps from the kubelet. If a node's clock is ahead, the wall time delta appears larger, making CPU usage appear lower. If the clock is behind, the delta appears smaller, making usage appear higher. Check the node clocks:
$ for node in worker-node-{01..08}; do
echo -n "$node: "
ssh $node "date -u +%Y-%m-%dT%H:%M:%S"
done
worker-node-01: 2026-03-19T13:46:12
worker-node-02: 2026-03-19T13:44:43 # 89 seconds behind
worker-node-03: 2026-03-19T13:46:15
worker-node-04: 2026-03-19T13:46:11
worker-node-05: 2026-03-19T13:48:19 # 127 seconds ahead
worker-node-06: 2026-03-19T13:46:14
worker-node-07: 2026-03-19T13:46:10
worker-node-08: 2026-03-19T13:46:13
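Eyeballing timestamps is error-prone; the offsets can be confirmed with epoch arithmetic (GNU date assumed, using worker-node-01 as the reference):

```shell
ref=$(date -u -d '2026-03-19T13:46:12' +%s)            # worker-node-01
for t in 2026-03-19T13:44:43 2026-03-19T13:48:19; do   # node-02, node-05
  echo "$t offset: $(( $(date -u -d "$t" +%s) - ref ))s"
done
```

This prints offsets of -89s and 127s, exactly the skew the Metrics Server reported in its scrape errors.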
Nodes 02 and 05 have significant clock drift. Check NTP:
$ ssh worker-node-05 "timedatectl status"
Local time: Thu 2026-03-19 13:48:21 UTC
Universal time: Thu 2026-03-19 13:48:21 UTC
RTC time: Thu 2026-03-19 13:48:21 UTC
Time zone: UTC (UTC, +0000)
System clock synchronized: no
NTP service: inactive
RTC in local TZ: no
$ ssh worker-node-05 "systemctl status chronyd"
● chronyd.service - NTP client/server
Active: failed (Result: exit-code) since Sun 2026-03-15 08:14:22 UTC; 4 days ago
Process: 1847 ExecStart=/usr/sbin/chronyd (code=exited, status=1/FAILURE)
$ ssh worker-node-05 "journalctl -u chronyd --since '4 days ago' | tail -5"
Mar 15 08:14:22 worker-node-05 chronyd[1847]: Fatal error : Could not open NTP socket : Address already in use
Chronyd crashed 4 days ago because another process was already bound to UDP port 123. Check what is holding the port:
$ ssh worker-node-05 "ss -ulnp | grep 123"
UNCONN 0 0 0.0.0.0:123 0.0.0.0:* users:(("ntpd",pid=9482,fd=4))
An old ntpd process is running on the same port, preventing chronyd from starting. This node was imaged from an older AMI that had ntpd installed. The bootstrap script installs chronyd but does not disable ntpd first. Without NTP synchronization for 4 days, the clock has drifted 127 seconds.
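The immediate remediation and the bootstrap-script fix are the same few commands (a sketch for systemd hosts; it assumes masking the stale ntpd unit is acceptable rather than uninstalling the package):

```shell
# Stop and permanently mask the stale ntpd so it can never grab UDP 123 again
systemctl stop ntpd 2>/dev/null || true
systemctl mask ntpd
# Start chronyd and step the clock immediately instead of slewing for hours
systemctl enable --now chronyd
chronyc makestep
```

Stepping matters here: at the default slew rate, correcting a 127-second offset gradually would leave the metrics corrupted for a long time, whereas makestep jumps the clock at once.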
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was HPA flapping in Kubernetes (kubernetes_ops); the mechanism was the Metrics Server computing wrong CPU utilization due to clock skew (observability); the root cause was NTP service failure on two nodes (linux_ops). This chain is common because:

- Kubernetes metrics computation depends on accurate timestamps.
- Clock skew between nodes corrupts any rate-based metric (CPU utilization, request rate).
- NTP is a foundational Linux service that, when broken, causes cascading failures in every system that depends on time accuracy.
Root Cause¶
Two nodes had their NTP service (chronyd) fail because a stale ntpd process from an older base image was occupying UDP port 123. Without NTP synchronization for 4 days, the node clocks drifted by 89 and 127 seconds respectively. The Metrics Server, which computes CPU utilization as a rate over time, produced wildly inaccurate values — making pods appear to jump between 12% and 95% CPU. The HPA acted on these erroneous metrics, scaling up and down in rapid succession.