Diagnostic Questions

Before revealing the investigation path:

  1. The HPA is scaling between 3 and 12 replicas every 2-3 minutes despite stable traffic. The stabilization window is 300s. What could cause the HPA to override its own stabilization policy?

  2. `kubectl top pods` shows wildly different CPU values (12m vs 478m) for pods running the same workload, and the values flip on the next measurement. Is this more likely an application issue, a metric collection issue, or a resource limit issue?

  3. The Metrics Server logs show "time skew detected: node time differs from server time by 127s." How does clock skew cause the CPU utilization metric to be inaccurate?

  4. On two nodes the NTP service has crashed because ntpd and chronyd are competing for the same port (UDP 123). Why is the correct fix a Linux ops change (stop ntpd, restart chronyd, fix the AMI) rather than a Kubernetes change (adjust HPA settings)?

  5. The HPA flapping started 4 days ago, which is when the NTP service crashed. Why did nobody notice for 4 days? What monitoring would catch clock drift before it affects metrics?
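A hint for question 1: in the `autoscaling/v2` API, the stabilization window applies per direction, and the defaults are asymmetric: 300s for scale-down, 0s for scale-up. So a noisy metric can still drive an immediate scale-up every sync cycle even though scale-downs are dampened, producing exactly the 2-3 minute flapping described. A sketch (resource names are hypothetical, not from the scenario):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # hypothetical name
  minReplicas: 3
  maxReplicas: 12
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # the 300s window only dampens scale-down
    scaleUp:
      stabilizationWindowSeconds: 0    # default: scale-up reacts to the latest metric immediately
```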
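For question 5, one common answer is alerting on node_exporter's `node_timex_offset_seconds`, which reports each node's offset from its NTP reference. A sketch of a Prometheus alerting rule (threshold and names are illustrative, not from the scenario):

```yaml
groups:
  - name: clock
    rules:
      - alert: NodeClockSkewDetected
        expr: abs(node_timex_offset_seconds) > 0.05   # 50 ms; a 127 s skew is far past this
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Clock on {{ $labels.instance }} is drifting from its NTP source"
```

An alert like this fires within minutes of NTP failing, long before the drift reaches the 127 s that corrupted the CPU metrics here.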
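For questions 2 and 3, the key intuition is that CPU "usage" is reported as a cumulative counter plus a timestamp, and the utilization rate is derived as delta(usage) / delta(time). If clock skew corrupts the timestamps, the denominator is wrong and the same real workload produces wildly different rates. A toy model (not the actual Metrics Server code) of that arithmetic:

```python
def cpu_rate_millicores(usage_start_ns: int, usage_end_ns: int,
                        window_start_s: float, window_end_s: float) -> float:
    """Toy model: rate = delta(cumulative CPU, ns) / delta(window, s).

    1 core = 1e9 ns of CPU time per second, so dividing ns/s by 1e6
    yields millicores.
    """
    window = window_end_s - window_start_s
    return (usage_end_ns - usage_start_ns) / (window * 1e6)

# A pod that burned 1.5e9 ns of CPU over a true 15 s scrape window: 100m.
steady = cpu_rate_millicores(0, 1_500_000_000, 0.0, 15.0)

# If skewed timestamps shrink the apparent window to 3 s, the identical
# usage reads as 500m -- a 5x inflation the HPA will happily act on.
skewed = cpu_rate_millicores(0, 1_500_000_000, 0.0, 3.0)
print(steady, skewed)
```

This also answers why the values "flip": the error depends on which pair of skewed timestamps bounds each window, not on anything the application is doing.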