Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config
Domains: kubernetes_ops | observability | linux_ops
Level: L2
Estimated time: 30-45 min
Initial Alert
Prometheus alert fires at 13:45 UTC:
WARNING: KubeHpaReplicasMismatch
hpa: frontend-hpa
namespace: prod
desired: oscillating between 3 and 12
message: "HPA has been unable to maintain target replica count"
Additional alerts:
WARNING: PodScaleUpThrottled — frontend pods scaling up/down every 2 minutes
WARNING: frontend-service — response latency variance 3x normal (p99 bouncing 80ms-800ms)
Observable Symptoms
- The HPA for `frontend` is scaling between 3 and 12 replicas every 2-3 minutes. `kubectl get hpa frontend-hpa -n prod -w` shows `TARGETS` fluctuating between `12%/50%` and `95%/50%`.
- Pods are being created and terminated in rapid succession. New pods take 30s to become ready, then the HPA scales down, then scales up again.
- The Metrics Server is running and returning data, but the CPU utilization values seem inconsistent — the same pod shows 12% one minute and 95% the next.
- Application-level metrics (requests per second, latency) show stable traffic. No traffic spikes.
- This started approximately 4 hours ago. No deployments or HPA configuration changes in the past 24 hours.
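The combination above (stable traffic, wildly swinging per-pod CPU) is consistent with the clock skew named in the title. CPU utilization is not sampled directly: it is computed as a rate, the delta of cumulative CPU-seconds divided by the delta of the sample timestamps. If a node's clock is stepping, the apparent sampling window shrinks or stretches and the percentage is distorted. A minimal sketch with illustrative numbers (the function and values are hypothetical, but the rate-from-two-samples form mirrors how kubelet-derived CPU usage is calculated):

```python
def cpu_utilization_pct(cpu_delta_s, window_s, cpu_request_cores=1.0):
    """Utilization as a percent of requested CPU over one sampling window.

    cpu_delta_s: cumulative CPU-seconds consumed between the two samples.
    window_s:    apparent wall-clock time between the samples' timestamps.
    """
    return 100.0 * cpu_delta_s / (window_s * cpu_request_cores)

# A pod steadily using 0.3 cores burns 18 CPU-seconds in a true 60 s window.
steady = cpu_utilization_pct(18.0, 60.0)    # 30% -- the real load

# Clock stepped backward between samples: apparent window shrinks to 19 s.
spiked = cpu_utilization_pct(18.0, 19.0)    # ~95% -- the high reading

# Clock stepped forward: apparent window stretches to 150 s.
dipped = cpu_utilization_pct(18.0, 150.0)   # 12% -- the low reading
```

The same constant workload produces the 12% and 95% readings from the symptoms purely from timestamp error, which is why application-level metrics stay flat while the resource metric oscillates.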
The Misleading Signal
HPA flapping looks like a Kubernetes autoscaling configuration problem — maybe the target utilization is too aggressive, the stabilization window is too short, or the CPU requests are misconfigured. Engineers investigate HPA behavior parameters, scaling policies, and resource requests. The fluctuating CPU metrics seem like an application issue — maybe the frontend has a periodic background task that spikes CPU.
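Before tuning any of those parameters, note that the fluctuating readings alone fully explain the replica range. Kubernetes documents the HPA algorithm as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); feeding it alternating 95% and 12% readings against the 50% target reproduces the observed 3-to-12 oscillation. A sketch (the reading sequence is illustrative, the formula is the documented one; default scale-up/stabilization policies are ignored for simplicity):

```python
import math

def desired_replicas(current, metric_pct, target_pct=50.0):
    # Documented HPA formula: ceil(currentReplicas * currentMetric / target)
    return math.ceil(current * metric_pct / target_pct)

replicas = 3
history = [replicas]
# Alternating skewed readings: two high samples, then one low one.
for metric in (95.0, 95.0, 12.0, 95.0, 95.0, 12.0):
    replicas = desired_replicas(replicas, metric)
    history.append(replicas)

# history: 3 -> 6 -> 12 -> 3 -> 6 -> 12 -> 3, the exact flapping range from the alert
```

No HPA setting is wrong here: the controller is behaving correctly on bad input, which is why tuning stabilization windows or scaling policies only dampens the symptom instead of fixing it.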