Symptoms: DNS Looks Broken, TLS Is Expired, Fix Is in Cert-Manager
Domains: networking | security | kubernetes_ops
Level: L2
Estimated time: 30-45 min
Initial Alert
PagerDuty fires at 03:17 UTC:
CRITICAL: api.prod.example.com — SSL certificate verification failed
Source: synthetic-monitor (Datadog)
Status: CRITICAL for 5 minutes
Within 2 minutes, a cascade of alerts follows:
CRITICAL: api.prod.example.com — connection refused on port 443
WARNING: dns_resolution_time for api.prod.example.com > 5s
WARNING: external_dns_lookup_failures spike — 47% failure rate
Observable Symptoms
- External users report "Your connection is not private" (NET::ERR_CERT_DATE_INVALID) when hitting https://api.prod.example.com.
- `dig api.prod.example.com` from an external resolver returns the correct A record but takes 4-6 seconds.
- `curl -v https://api.prod.example.com` shows `SSL certificate problem: certificate has expired`.
- The Datadog DNS resolution check shows a spike in latency from 50ms to 5200ms.
- Internal services calling the API via the Kubernetes Service name (`api-svc.prod.svc.cluster.local`) work fine over HTTP but fail over HTTPS.
The Misleading Signal
The DNS latency spike and the 47% failure rate on external lookups make this look like a DNS infrastructure problem, so a reasonable engineer would start with DNS resolution: CoreDNS, the external DNS provider, and propagation. The error "certificate has expired" is easy to dismiss as a secondary symptom of DNS resolving to a stale endpoint. The DNS latency is real, but it is caused by TLS handshake timeouts that the monitoring config misattributes to DNS, and that reinforces the misdirection.
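The misattribution is easy to demonstrate by timing each connection phase separately instead of reporting one end-to-end duration, which is effectively what a coarse synthetic check does when it folds handshake stalls into a "DNS" bucket. A hedged sketch (function names are illustrative):

```python
import socket
import ssl
import time


def phase_timings(host: str, port: int = 443, timeout: float = 5.0) -> dict[str, float]:
    """Break one connection attempt into DNS, TCP, and TLS phases (ms).

    The TLS phase is timed even when the handshake fails (e.g. an
    expired cert), because the time spent before the failure is exactly
    what a coarse monitor can misattribute to DNS.
    """
    t0 = time.monotonic()
    addr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4]
    t1 = time.monotonic()
    sock = socket.create_connection(addr[:2], timeout=timeout)
    t2 = time.monotonic()
    ctx = ssl.create_default_context()
    try:
        ctx.wrap_socket(sock, server_hostname=host).close()
    except ssl.SSLError:
        pass  # handshake failed; the elapsed time is still the TLS phase
    finally:
        sock.close()
    t3 = time.monotonic()
    return {
        "dns_ms": (t1 - t0) * 1000,
        "tcp_ms": (t2 - t1) * 1000,
        "tls_ms": (t3 - t2) * 1000,
    }


def slow_phase(timings: dict[str, float]) -> str:
    """Name the dominant phase, e.g. 'tls_ms' in this incident."""
    return max(timings, key=timings.get)
```

The same breakdown is available from the command line with `curl -o /dev/null -s -w '%{time_namelookup} %{time_connect} %{time_appconnect}\n' https://api.prod.example.com`: a large gap between `time_connect` and `time_appconnect` with a small `time_namelookup` points at TLS, not DNS.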