Diagnostic Questions¶

Before revealing the investigation path:¶

The initial alerts show both DNS latency spikes and SSL certificate errors. Which symptom would you investigate first, and why? What is the risk of starting with the wrong one?
The Datadog synthetic monitor reports "dns_resolution_time > 5s" but manual dig queries return in under 50ms. What could explain this discrepancy? How would you determine whether the DNS latency metric is trustworthy?
You discover the TLS certificate expired today. What is the first thing you check to understand why automatic renewal failed? What commands would you run?
The root cause is that Helm deleted the cert-manager secret during an upgrade. The fix is to re-issue the certificate and annotate the secret. Why is the fix a Kubernetes/Helm configuration change rather than a security (cert rotation) or networking (DNS) change?
What monitoring would you add to detect this failure mode earlier? Specifically, how would you catch the 30-day window between "renewal started failing" and "certificate actually expired"?