Remediation: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config

Immediate Fix (Linux Ops — Domain C)

The fix lives on the Linux nodes: stop the rogue ntpd, restart chronyd, and correct the base image so new nodes come up clean.

Step 1: Fix the affected nodes

# On worker-node-05
$ ssh worker-node-05
$ sudo systemctl stop ntpd
$ sudo systemctl disable ntpd
$ sudo systemctl restart chronyd
$ sudo chronyc makestep
200 OK
$ timedatectl status
System clock synchronized: yes
              NTP service: active

# On worker-node-02
$ ssh worker-node-02
$ sudo systemctl stop ntpd
$ sudo systemctl disable ntpd
$ sudo systemctl restart chronyd
$ sudo chronyc makestep
200 OK
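
The per-node fix above can be scripted for any number of affected nodes. A minimal sketch, assuming the same ssh access as the transcript; remediate_node and the DRY_RUN guard are our inventions, and with DRY_RUN=1 the script only prints what it would run:

```shell
#!/bin/sh
# Affected nodes from this incident; extend the list as needed.
NODES="worker-node-02 worker-node-05"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute over ssh

# remediate_node: stop/disable the rogue ntpd, restart chronyd, step the clock.
remediate_node() {
  node="$1"
  cmds="sudo systemctl stop ntpd; sudo systemctl disable ntpd; sudo systemctl restart chronyd; sudo chronyc makestep"
  if [ "$DRY_RUN" = 1 ]; then
    echo "ssh $node \"$cmds\""
  else
    ssh "$node" "$cmds"
  fi
}

for n in $NODES; do
  remediate_node "$n"
done
```

Run it once in dry-run mode, eyeball the printed commands, then rerun with DRY_RUN=0.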

Step 2: Verify clock synchronization

$ for node in worker-node-{01..08}; do
    echo -n "$node: "
    ssh $node "chronyc tracking | grep 'System time'"
done
worker-node-01: System time     : 0.000012 seconds fast of NTP time
worker-node-02: System time     : 0.000087 seconds slow of NTP time
worker-node-03: System time     : 0.000009 seconds fast of NTP time
worker-node-04: System time     : 0.000015 seconds slow of NTP time
worker-node-05: System time     : 0.000042 seconds fast of NTP time
worker-node-06: System time     : 0.000011 seconds fast of NTP time
worker-node-07: System time     : 0.000008 seconds slow of NTP time
worker-node-08: System time     : 0.000013 seconds fast of NTP time

All eight nodes are within tens of microseconds of NTP time.
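
The eyeball check above can be turned into a pass/fail gate. A sketch; offset_ok is a name we made up, and the 0.1-second threshold is an arbitrary example, not a value from the incident:

```shell
# offset_ok: given one "System time : X seconds fast/slow of NTP time" line
# from `chronyc tracking`, succeed if the offset magnitude is under 0.1 s.
offset_ok() {
  off=$(echo "$1" | awk '{print $4}')          # 4th field is the offset in seconds
  awk -v o="$off" 'BEGIN { exit (o < 0.1) ? 0 : 1 }'
}

offset_ok "System time     : 0.000042 seconds fast of NTP time" && echo "in sync"
```

Wired into the ssh loop above, a nonzero exit from any node fails the verification step instead of relying on a human to read eight lines.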

Step 3: Update the node bootstrap script

# In the AMI build or user-data script, add:
systemctl stop ntpd 2>/dev/null || true
systemctl disable ntpd 2>/dev/null || true
yum remove -y ntp 2>/dev/null || true
systemctl enable --now chronyd
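
After bootstrap, a node should prove its clock is actually synchronized before taking workloads. A sketch of such a check; clock_synced is our helper name, parsing the timedatectl output shown in Step 1:

```shell
# clock_synced: succeed only if timedatectl output reports a synchronized clock.
clock_synced() {
  echo "$1" | grep -q "System clock synchronized: yes"
}

# On a live node this would be: clock_synced "$(timedatectl status)" || exit 1
clock_synced "System clock synchronized: yes
              NTP service: active" && echo "bootstrap clock check passed"
```

Failing the bootstrap script here keeps a skewed node from ever registering with the cluster.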

Verification

Domain A (Kubernetes) — HPA stable

$ kubectl get hpa frontend-hpa -n prod
NAME           REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
frontend-hpa   Deployment/frontend   32%/50%   3         15        3          47d

$ kubectl get hpa frontend-hpa -n prod -w
# (stable at 3 replicas, 30-35% CPU, no oscillation)
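
"No oscillation" can be quantified from a captured watch. A sketch that counts replica transitions in saved `kubectl get hpa -w` output; flap_count is our name, and the field position assumes the default column layout shown above:

```shell
# flap_count: read `kubectl get hpa <name> -w` output on stdin and count how
# many times the REPLICAS column (field 6) changed. 0 over a capture window
# means the HPA is stable; the pre-fix flapping would show many transitions.
flap_count() {
  awk 'NR > 1 { if (prev != "" && $6 != prev) n++; prev = $6 } END { print n + 0 }'
}
```

For example: `timeout 900 kubectl get hpa frontend-hpa -n prod -w | flap_count` prints the number of replica changes over 15 minutes.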

Domain B (Observability) — Metrics consistent

$ kubectl top pods -n prod -l app=frontend
NAME                        CPU(cores)   MEMORY(bytes)
frontend-7c6d5e4f3-a1b2c   158m         128Mi
frontend-7c6d5e4f3-d3e4f   162m         126Mi
frontend-7c6d5e4f3-g5h6i   155m         130Mi

CPU values are consistent across pods, with no wild fluctuations.
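
"Consistent" can also be checked mechanically. A sketch that computes the CPU spread (max minus min, in millicores) across pods from `kubectl top pods` output; cpu_spread is our helper name:

```shell
# cpu_spread: read `kubectl top pods` output on stdin and print the difference
# between the highest and lowest pod CPU, in millicores. awk's numeric
# coercion turns "158m" into 158; the header line is skipped.
cpu_spread() {
  awk 'NR > 1 { v = $2 + 0
                if (min == "" || v < min) min = v
                if (v > max) max = v }
       END { print max - min }'
}
```

A spread of a few millicores, as in the output above, is healthy; the pre-fix clock skew produced wildly divergent readings.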

$ kubectl logs -n kube-system metrics-server-6d94bc8694-r7x2m --tail=10 | grep skew
# (no clock skew errors)

Domain C (Linux Ops) — NTP running on all nodes

$ for node in worker-node-{01..08}; do
    echo -n "$node: "
    ssh $node "systemctl is-active chronyd"
done
worker-node-01: active
worker-node-02: active
...
worker-node-08: active

Prevention

  • Monitoring: Add a clock skew alert that fires when any node's clock differs from NTP by more than 5 seconds:

- alert: NodeClockSkew
  expr: abs(node_timex_offset_seconds) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} clock is {{ $value }}s off from NTP"

  • Runbook: All node bootstrap scripts must ensure ntpd is removed before chronyd is started. Add a health check for NTP synchronization status (chronyc tracking).

  • Architecture: Add NTP synchronization verification as a node admission check. Use a DaemonSet or node-problem-detector to continuously monitor clock accuracy and taint nodes with excessive clock drift.
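
The tainting idea can be sketched in plain shell as a DaemonSet entrypoint. Everything here is illustrative: should_taint, the taint key node.example.com/clock-skew, and the APPLY_TAINT guard are our inventions; the 5-second threshold matches the alert above, and the live path assumes chronyc plus a kubectl with node-taint RBAC are available in the pod.

```shell
#!/bin/sh
# should_taint: succeed if the absolute clock offset (seconds) exceeds 5 s.
should_taint() {
  awk -v o="$1" 'BEGIN { if (o < 0) o = -o; exit (o > 5) ? 0 : 1 }'
}

# Live path, guarded so the sketch can be sourced without chronyc/kubectl.
if [ "${APPLY_TAINT:-0}" = 1 ]; then
  node="$(hostname)"
  # chronyc prints "System time : X seconds fast|slow of NTP time"; field 6 is
  # the direction, field 4 the magnitude.
  off="$(chronyc tracking | awk '/System time/ {print ($6 == "slow") ? -$4 : $4}')"
  if should_taint "$off"; then
    kubectl taint node "$node" node.example.com/clock-skew=true:NoSchedule --overwrite
  else
    # Clear the taint once the clock recovers; ignore "taint not found" errors.
    kubectl taint node "$node" node.example.com/clock-skew:NoSchedule- 2>/dev/null || true
  fi
fi
```

Run on a loop inside a privileged DaemonSet, this keeps the scheduler from placing new pods on a drifting node until its clock recovers.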