Diagnostic Questions
Before revealing the investigation path:
- The initial alerts show both DNS latency spikes and SSL certificate errors. Which symptom would you investigate first, and why? What is the risk of starting with the wrong one?
- The Datadog synthetic monitor reports "dns_resolution_time > 5s" but manual `dig` queries return in under 50ms. What could explain this discrepancy? How would you determine whether the DNS latency metric is trustworthy?
- You discover the TLS certificate expired today. What is the first thing you check to understand why automatic renewal failed? What commands would you run?
- The root cause is that Helm deleted the cert-manager secret during an upgrade. The fix is to re-issue the certificate and annotate the secret. Why is the fix a Kubernetes/Helm configuration change rather than a security (cert rotation) or networking (DNS) change?
- What monitoring would you add to detect this failure mode earlier? Specifically, how would you catch the 30-day window between "renewal started failing" and "certificate actually expired"?
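
For comparison once you have attempted the last question, here is a minimal sketch of one possible expiry check. It assumes renewal is expected to begin 30 days before expiry (matching the window described above) and allows a 2-day grace period for retries. The names `RENEW_BEFORE`, `GRACE`, and `renewal_status` are hypothetical, and the thresholds are illustrative rather than taken from the incident.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration; adjust to the issuer's actual renewal policy.
RENEW_BEFORE = timedelta(days=30)  # renewal is expected to start this long before expiry
GRACE = timedelta(days=2)          # tolerance for renewal retries before alerting


def renewal_status(not_after, now, renew_before=RENEW_BEFORE, grace=GRACE):
    """Classify a certificate by how much validity remains at `now`."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining < renew_before - grace:
        # A healthy renewal would already have replaced this certificate.
        return "renewal_overdue"
    return "ok"


if __name__ == "__main__":
    # Example dates are made up for the demonstration.
    not_after = datetime(2024, 6, 30, tzinfo=timezone.utc)
    for day in (datetime(2024, 5, 1, tzinfo=timezone.utc),
                datetime(2024, 6, 10, tzinfo=timezone.utc),
                datetime(2024, 7, 1, tzinfo=timezone.utc)):
        print(day.date(), renewal_status(not_after, day))
```

The point of the middle state is that an alert on "renewal_overdue" fires weeks before users see errors, whereas an alert on "expired" only confirms the outage. In practice `not_after` would come from the live certificate or the cluster's certificate status rather than a hard-coded date.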