Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config

Domains: kubernetes_ops | observability | linux_ops
Level: L2 | Estimated time: 30-45 min

Initial Alert

Prometheus alert fires at 13:45 UTC:

WARNING: KubeHpaReplicasMismatch
  hpa: frontend-hpa
  namespace: prod
  desired: oscillating between 3 and 12
  message: "HPA has been unable to maintain target replica count"

Additional alerts:

WARNING: PodScaleUpThrottled — frontend pods scaling up/down every 2 minutes
WARNING: frontend-service — response latency variance 3x normal (p99 bouncing 80ms-800ms)

Observable Symptoms

  • The HPA for frontend is scaling between 3 and 12 replicas every 2-3 minutes.
  • kubectl get hpa frontend-hpa -n prod -w shows TARGETS fluctuating between 12%/50% and 95%/50%.
  • Pods are being created and terminated in rapid succession. New pods take 30s to be ready, then the HPA scales down, then scales up again.
  • The Metrics Server is running and returning data, but the CPU utilization values seem inconsistent — the same pod shows 12% one minute and 95% the next.
  • Application-level metrics (requests per second, latency) show stable traffic. No traffic spikes.
  • This started approximately 4 hours ago. No deployments or HPA configuration changes in the past 24 hours.
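The inconsistent readings follow from how rate-based CPU metrics are computed: the Metrics Server derives a pod's CPU usage as the delta in cumulative CPU-seconds between two samples, divided by the wall-clock window between their timestamps. If a node's clock drifts and then gets stepped by a broken NTP setup, the reported window no longer matches the real elapsed time, and a perfectly steady workload appears to swing. A minimal sketch with illustrative numbers (the 500m request, 60s window, and skew magnitudes are assumptions, not values from this incident):

```python
# Sketch: why clock skew distorts rate-based CPU metrics.
# Utilization as the HPA sees it = (CPU-seconds delta / window) / request.
# A mis-stepped clock changes the apparent window, not the actual work done.

def cpu_utilization(delta_cpu_seconds: float,
                    reported_window_seconds: float,
                    cpu_request_cores: float) -> float:
    """Utilization (%) relative to the pod's CPU request."""
    rate_cores = delta_cpu_seconds / reported_window_seconds
    return 100.0 * rate_cores / cpu_request_cores

REQUEST = 0.5        # hypothetical: pod requests 500m CPU
TRUE_RATE = 0.125    # pod steadily burns 125m -> really 25% of request
REAL_WINDOW = 60.0   # samples are genuinely 60s apart
DELTA = TRUE_RATE * REAL_WINDOW  # 7.5 CPU-seconds of actual work

# Healthy clock: reported window matches reality.
steady = cpu_utilization(DELTA, REAL_WINDOW, REQUEST)    # -> 25.0

# Clock stepped backwards between samples: the window looks much
# shorter, so the same CPU delta yields an inflated rate.
inflated = cpu_utilization(DELTA, 15.0, REQUEST)         # -> 100.0

# Clock stepped forwards: the window looks longer, deflating the rate.
deflated = cpu_utilization(DELTA, 120.0, REQUEST)        # -> 12.5

print(steady, inflated, deflated)
```

The same steady 125m of work reads as anywhere from ~12% to 100% depending only on which way the clock was stepped between samples, which is exactly the 12%-one-minute, 95%-the-next pattern in the symptoms.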

The Misleading Signal

HPA flapping looks like a Kubernetes autoscaling configuration problem — maybe the target utilization is too aggressive, the stabilization window is too short, or the CPU requests are misconfigured. Engineers investigate HPA behavior parameters, scaling policies, and resource requests. The fluctuating CPU metrics seem like an application issue — maybe the frontend has a periodic background task that spikes CPU.
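The 3-to-12 oscillation falls directly out of the HPA's documented core formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), once the skewed readings are plugged in. A sketch using the alert's numbers (50% target, readings near 95% and 12%), ignoring stabilization windows and scale-up/scale-down policies:

```python
import math

def desired_replicas(current: int, current_util_pct: float,
                     target_util_pct: float) -> int:
    """Kubernetes HPA core scaling formula:
    desired = ceil(current * currentUtilization / targetUtilization)."""
    return math.ceil(current * current_util_pct / target_util_pct)

# Target is 50%, as in the TARGETS column (x%/50%).
# A skew-inflated ~95% reading ratchets the fleet up in two steps...
up1 = desired_replicas(3, 95, 50)    # -> 6
up2 = desired_replicas(6, 95, 50)    # -> 12
# ...and a skew-deflated ~12% reading collapses it straight back.
down = desired_replicas(12, 12, 50)  # -> 3
print(up1, up2, down)
```

With readings that bad, no tuning of target utilization or stabilization windows can make the controller behave: the formula is doing exactly what it should with garbage input, which is why the configuration-focused investigation stalls.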