Diagnostic Questions¶
Before revealing the investigation path:¶
-
You receive 47 alerts in 3 minutes. All are KubePodNotReady firing and resolving for the same service. Is the right first step to tune Alertmanager to suppress the noise, or to investigate why the pods are flapping?
-
The health endpoint responds in 0.8-1.2 seconds and the probe timeout is 1 second. The health check calls an Elasticsearch dependency. Is the application broken, or is the probe misconfigured?
-
DNS resolution for
catalog-search.prodtakes 723ms from the pod namespace but only 45ms from the monitoring namespace. What could cause namespace-specific DNS latency? -
The platform team added extra search domains to pod DNS config. With
ndots:5, how many DNS queries does a lookup forcatalog-search.prodgenerate? Why does this cause intermittent probe failures? -
The liveness probe checks dependency health (database, cache, Elasticsearch). Why is this a design mistake? What is the correct split between liveness and readiness probes?