Remediation: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config¶
Immediate Fix (Linux Ops — Domain C)¶
The fix is entirely on the Linux nodes: stop the rogue ntpd, restart chronyd, and patch the base image so new nodes come up clean.
Step 1: Fix the affected nodes¶
# On worker-node-05
$ ssh worker-node-05
$ sudo systemctl stop ntpd
$ sudo systemctl disable ntpd
$ sudo systemctl restart chronyd
$ sudo chronyc makestep
200 OK
$ timedatectl status
System clock synchronized: yes
NTP service: active
# On worker-node-02
$ ssh worker-node-02
$ sudo systemctl stop ntpd
$ sudo systemctl disable ntpd
$ sudo systemctl restart chronyd
$ sudo chronyc makestep
200 OK
Step 2: Verify clock synchronization¶
$ for node in worker-node-{01..08}; do
echo -n "$node: "
ssh $node "chronyc tracking | grep 'System time'"
done
worker-node-01: System time : 0.000012 seconds fast of NTP time
worker-node-02: System time : 0.000087 seconds slow of NTP time
worker-node-03: System time : 0.000009 seconds fast of NTP time
worker-node-04: System time : 0.000015 seconds slow of NTP time
worker-node-05: System time : 0.000042 seconds fast of NTP time
worker-node-06: System time : 0.000011 seconds fast of NTP time
worker-node-07: System time : 0.000008 seconds slow of NTP time
worker-node-08: System time : 0.000013 seconds fast of NTP time
All nodes within microseconds of NTP time.
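The eyeball check above can be automated. A minimal sketch that parses the `System time` line from `chronyc tracking` and verifies the offset stays under a threshold (the 0.1 s threshold is an assumption, not a value from this incident):

```shell
# Parse the "System time" offset from `chronyc tracking` output and
# return 0 when it is within the threshold, 1 otherwise.
check_offset() {
  local line="$1" threshold="${2:-0.1}"
  # Example line: "System time     : 0.000042 seconds fast of NTP time"
  local offset
  offset=$(echo "$line" | awk '{print $4}')
  awk -v o="$offset" -v t="$threshold" 'BEGIN { exit (o + 0 <= t + 0) ? 0 : 1 }'
}

check_offset "System time     : 0.000042 seconds fast of NTP time" && echo "within bounds"
```

Feeding each node's line from the verification loop above through `check_offset` turns the manual scan into a pass/fail step.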
Step 3: Update the node bootstrap script¶
# In the AMI build or user-data script, add:
systemctl stop ntpd 2>/dev/null || true
systemctl disable ntpd 2>/dev/null || true
yum remove -y ntp 2>/dev/null || true
systemctl enable --now chronyd
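The bootstrap script could also block until chrony reports a synchronized clock before the kubelet joins the cluster. A hypothetical guard (the retry count and sleep interval are assumptions, not values from the incident):

```shell
# Wait until chronyd reports a synchronized clock, retrying up to
# $tries times with a 5 s pause between attempts.
wait_for_ntp_sync() {
  local tries="${1:-12}" i
  for ((i = 0; i < tries; i++)); do
    # "Leap status : Normal" means chrony considers the clock synchronized
    if chronyc tracking 2>/dev/null | grep -q 'Leap status.*Normal'; then
      return 0
    fi
    sleep 5
  done
  return 1
}
```

Calling `wait_for_ntp_sync || exit 1` just before the kubelet starts would keep a skewed node from registering in the first place.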
Verification¶
Domain A (Kubernetes) — HPA stable¶
$ kubectl get hpa frontend-hpa -n prod
NAME           REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
frontend-hpa   Deployment/frontend   32%/50%   3         15        3          47d
$ kubectl get hpa frontend-hpa -n prod -w
# (stable at 3 replicas, 30-35% CPU, no oscillation)
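To make "no oscillation" checkable rather than eyeballed, the watch can be replaced by sampling the replica count. A minimal sketch; the `kubectl` sampling loop in the comment assumes cluster access, while the HPA name and namespace come from this incident:

```shell
# Return 0 when every sampled replica count is identical (no flapping).
replicas_stable() {
  local distinct
  distinct=$(printf '%s\n' "$@" | sort -u | wc -l)
  test "$distinct" -eq 1
}

# Against the live cluster (illustrative):
#   samples=()
#   for i in {1..10}; do
#     samples+=("$(kubectl get hpa frontend-hpa -n prod \
#         -o jsonpath='{.status.currentReplicas}')")
#     sleep 30
#   done
#   replicas_stable "${samples[@]}" && echo "HPA stable"
```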
Domain B (Observability) — Metrics consistent¶
$ kubectl top pods -n prod -l app=frontend
NAME                        CPU(cores)   MEMORY(bytes)
frontend-7c6d5e4f3-a1b2c    158m         128Mi
frontend-7c6d5e4f3-d3e4f    162m         126Mi
frontend-7c6d5e4f3-g5h6i    155m         130Mi
Consistent CPU values across pods — no wild fluctuations.
$ kubectl logs -n kube-system metrics-server-6d94bc8694-r7x2m --tail=10 | grep skew
# (no clock skew errors)
Domain C (Linux Ops) — NTP running on all nodes¶
$ for node in worker-node-{01..08}; do
echo -n "$node: "
ssh $node "systemctl is-active chronyd"
done
worker-node-01: active
worker-node-02: active
...
worker-node-08: active
Prevention¶
- Monitoring: Add a clock skew alert that fires when any node's clock differs from NTP by more than 5 seconds:

  - alert: NodeClockSkew
    expr: abs(node_timex_offset_seconds) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} clock is {{ $value }}s off from NTP"
- Runbook: All node bootstrap scripts must ensure ntpd is removed before chronyd is started. Add a health check for NTP synchronization status (chronyc tracking).
- Architecture: Add NTP synchronization verification as a node admission check. Use a DaemonSet or node-problem-detector to continuously monitor clock accuracy and taint nodes with excessive clock drift.
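The clock-drift DaemonSet idea can be sketched as a small host-level check. Everything here is hypothetical: the taint key `node.example.com/clock-drift`, the reuse of the alert's 5 s threshold, and the `kubectl` call in the comment are all assumptions, not part of any existing tooling:

```shell
# Decide whether a node's NTP offset (seconds, may be negative) exceeds
# the threshold in absolute value; exit 0 means "drift too large".
drift_exceeds() {
  local offset="$1" threshold="${2:-5}"
  awk -v o="$offset" -v t="$threshold" 'BEGIN {
    if (o + 0 < 0) o = -o              # compare absolute drift
    exit (o + 0 > t + 0) ? 0 : 1
  }'
}

# Inside the DaemonSet container (illustrative):
#   offset=$(chronyc tracking | awk '/System time/ {print $4}')
#   if drift_exceeds "$offset"; then
#     kubectl taint node "$NODE_NAME" \
#       node.example.com/clock-drift=true:NoSchedule --overwrite
#   fi
```

Tainting with `NoSchedule` keeps new pods off a drifting node until its clock is fixed, without evicting what is already running.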