Investigation: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config¶
Phase 1: Kubernetes Investigation (Dead End)¶
Check the HPA configuration:
$ kubectl describe hpa frontend-hpa -n prod
Name:          frontend-hpa
Namespace:     prod
Reference:     Deployment/frontend
Metrics:       ( current / target )
  resource cpu on pods:  fluctuating / 50%
Min replicas:  3
Max replicas:  15
Behavior:
  Scale Up:
    Stabilization Window: 60s
    Policies:
      Type: Pods  Value: 4  Period: 60s
  Scale Down:
    Stabilization Window: 300s
    Policies:
      Type: Percent  Value: 25  Period: 60s
Events:
  Type    Reason             Age  Message
  ----    ------             ---  -------
  Normal  SuccessfulRescale  2m   New size: 12; reason: cpu resource utilization above target
  Normal  SuccessfulRescale  5m   New size: 3; reason: All metrics below target
  Normal  SuccessfulRescale  7m   New size: 11; reason: cpu resource utilization above target
  Normal  SuccessfulRescale  10m  New size: 3; reason: All metrics below target
The scale-down stabilization window is 300s, which should prevent rapid oscillation, yet the events show the HPA bouncing between 3 and roughly 12 replicas every few minutes. Check the metrics directly:
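For reference, the same policy expressed as an autoscaling/v2 manifest stanza (field names from the HPA v2 API; this is a sketch reconstructed from the describe output, not the cluster's actual manifest):

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
```

Nothing here is misconfigured, which is the first hint that the problem is upstream of the HPA.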
$ kubectl top pods -n prod -l app=frontend
NAME                       CPU(cores)   MEMORY(bytes)
frontend-7c6d5e4f3-a1b2c   478m         128Mi
frontend-7c6d5e4f3-d3e4f   12m          126Mi
frontend-7c6d5e4f3-g5h6i   451m         130Mi
Wildly different CPU values for pods running the same workload. But wait — run it again 30 seconds later:
$ kubectl top pods -n prod -l app=frontend
NAME                       CPU(cores)   MEMORY(bytes)
frontend-7c6d5e4f3-a1b2c   15m          128Mi
frontend-7c6d5e4f3-d3e4f   462m         126Mi
frontend-7c6d5e4f3-g5h6i   18m          130Mi
The CPU values flipped completely. Pod d3e4f went from 12m to 462m while the others dropped. This is not real CPU usage — the metrics are wrong.
The Pivot¶
Check the Metrics Server:
$ kubectl get pods -n kube-system -l k8s-app=metrics-server
NAME                              READY   STATUS    RESTARTS   AGE
metrics-server-6d94bc8694-r7x2m   1/1     Running   0          14d
$ kubectl logs -n kube-system metrics-server-6d94bc8694-r7x2m --tail=20
E0319 13:44:12.482 scraper.go:140] "Failed to scrape node" err="time skew detected: node time differs from server time by 127s" node="worker-node-05"
E0319 13:44:12.483 scraper.go:140] "Failed to scrape node" err="time skew detected: node time differs from server time by -89s" node="worker-node-02"
W0319 13:44:42.100 scraper.go:119] "Stale metrics from kubelet" node="worker-node-05" age="2m7s"
Clock skew detected. The Metrics Server is seeing time differences of over 2 minutes between nodes. This corrupts the CPU usage calculation because CPU utilization is computed as (cpu_time_delta / wall_time_delta). If the wall time delta is wrong due to clock skew, the computed utilization is meaningless.
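A back-of-the-envelope sketch of how badly this distorts the number (hypothetical values: a pod consumed 5 CPU-seconds over a real 10-second scrape interval, but the node's clock jumped 127s ahead between scrapes):

```shell
cpu_delta_ms=5000    # cumulative CPU-time delta between scrapes, in milliseconds
true_window=10       # real seconds between scrapes -> 500m, the true usage
skewed_window=137    # apparent window after a +127s clock jump
echo "true:   $((cpu_delta_ms / true_window))m"
echo "skewed: $((cpu_delta_ms / skewed_window))m"
```

The skewed reading of 36m is what makes a busy pod look idle; a clock jumping backwards does the reverse, shrinking the window and inflating usage.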
Phase 2: Observability Investigation (Root Cause)¶
The Metrics Server computes CPU utilization using timestamps from the kubelet. If a node's clock is ahead, the wall time delta appears larger, making CPU usage appear lower. If the clock is behind, the delta appears smaller, making usage appear higher. Check the node clocks:
$ for node in worker-node-{01..08}; do
echo -n "$node: "
ssh $node "date -u +%Y-%m-%dT%H:%M:%S"
done
worker-node-01: 2026-03-19T13:46:12
worker-node-02: 2026-03-19T13:44:43 # 89 seconds behind
worker-node-03: 2026-03-19T13:46:15
worker-node-04: 2026-03-19T13:46:11
worker-node-05: 2026-03-19T13:48:19 # 127 seconds ahead
worker-node-06: 2026-03-19T13:46:14
worker-node-07: 2026-03-19T13:46:10
worker-node-08: 2026-03-19T13:46:13
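Eyeballing timestamps is error-prone; the offsets can be confirmed with epoch arithmetic (GNU date assumed, using worker-node-01 as the reference):

```shell
ref=$(date -u -d '2026-03-19T13:46:12' +%s)            # worker-node-01
for t in 2026-03-19T13:44:43 2026-03-19T13:48:19; do   # node-02, node-05
  echo "$t offset: $(( $(date -u -d "$t" +%s) - ref ))s"
done
```

This prints offsets of -89s and 127s, exactly the skew the Metrics Server reported in its scrape errors.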
Nodes 02 and 05 have significant clock drift. Check NTP:
$ ssh worker-node-05 "timedatectl status"
Local time: Thu 2026-03-19 13:48:21 UTC
Universal time: Thu 2026-03-19 13:48:21 UTC
RTC time: Thu 2026-03-19 13:48:21 UTC
Time zone: UTC (UTC, +0000)
System clock synchronized: no
NTP service: inactive
RTC in local TZ: no
$ ssh worker-node-05 "systemctl status chronyd"
● chronyd.service - NTP client/server
Active: failed (Result: exit-code) since Sun 2026-03-15 08:14:22 UTC; 4 days ago
Process: 1847 ExecStart=/usr/sbin/chronyd (code=exited, status=1/FAILURE)
$ ssh worker-node-05 "journalctl -u chronyd --since '4 days ago' | tail -5"
Mar 15 08:14:22 worker-node-05 chronyd[1847]: Fatal error : Could not open NTP socket : Address already in use
Chronyd crashed 4 days ago because another process was already bound to UDP port 123. Check what is holding the port:
$ ssh worker-node-05 "ss -ulnp | grep 123"
UNCONN 0 0 0.0.0.0:123 0.0.0.0:* users:(("ntpd",pid=9482,fd=4))
An old ntpd process is running on the same port, preventing chronyd from starting. This node was imaged from an older AMI that had ntpd installed. The bootstrap script installs chronyd but does not disable ntpd first. Without NTP synchronization for 4 days, the clock has drifted 127 seconds.
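The immediate remediation and the bootstrap-script fix are the same few commands (a sketch for systemd hosts; it assumes masking the stale ntpd unit is acceptable rather than uninstalling the package):

```shell
# Stop and permanently mask the stale ntpd so it can never grab UDP 123 again
systemctl stop ntpd 2>/dev/null || true
systemctl mask ntpd
# Start chronyd and step the clock immediately instead of slewing for hours
systemctl enable --now chronyd
chronyc makestep
```

Stepping matters here: at the default slew rate, correcting a 127-second offset gradually would leave the metrics corrupted for a long time, whereas makestep jumps the clock at once.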
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was HPA flapping in Kubernetes (kubernetes_ops); the mechanism was the Metrics Server computing wrong CPU utilization due to clock skew (observability); the root cause was NTP service failure on two nodes (linux_ops). This chain is common because:

- Kubernetes metrics computation depends on accurate timestamps.
- Clock skew between nodes corrupts any rate-based metric (CPU utilization, request rate).
- NTP is a foundational Linux service that, when broken, causes cascading failures in every system that depends on time accuracy.
Root Cause¶
Two nodes had their NTP service (chronyd) fail because a stale ntpd process from an older base image was occupying UDP port 123. Without NTP synchronization for 4 days, the node clocks drifted by 89 and 127 seconds respectively. The Metrics Server, which computes CPU utilization as a rate over time, produced wildly inaccurate values — making pods appear to jump between 12% and 95% CPU. The HPA acted on these erroneous metrics, scaling up and down in rapid succession.