Symptoms: DNS Looks Broken, TLS Is Expired, Fix Is in Cert-Manager

Domains: networking | security | kubernetes_ops
Level: L2
Estimated time: 30-45 min

Initial Alert

PagerDuty fires at 03:17 UTC:

CRITICAL: api.prod.example.com — SSL certificate verification failed
Source: synthetic-monitor (Datadog)
Status: CRITICAL for 5 minutes

Within 2 minutes, a cascade of alerts follows:

CRITICAL: api.prod.example.com — connection refused on port 443
WARNING: dns_resolution_time for api.prod.example.com > 5s
WARNING: external_dns_lookup_failures spike — 47% failure rate

Observable Symptoms

  • External users report "Your connection is not private" (NET::ERR_CERT_DATE_INVALID) when hitting https://api.prod.example.com.
  • dig api.prod.example.com from an external resolver returns the correct A record but takes 4-6 seconds.
  • curl -v https://api.prod.example.com shows SSL certificate problem: certificate has expired.
  • The Datadog DNS resolution check shows a spike in latency from 50ms to 5200ms.
  • Internal services calling the API via the Kubernetes Service name (api-svc.prod.svc.cluster.local) work fine over HTTP but fail over HTTPS.
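The symptoms above can be triaged by testing the two suspects in isolation. A minimal sketch, assuming the hostname from the alert (substitute your own endpoint): dig measures resolution alone, and openssl s_client reads the served certificate's validity window straight off the handshake, independent of any monitoring pipeline.

```shell
# Hostname from the alert -- a placeholder; substitute your own endpoint.
HOST=api.prod.example.com

# DNS in isolation: query an external resolver directly.
# If this returns quickly, the 4-6s "DNS" latency is coming from elsewhere.
dig +short "$HOST" @1.1.1.1

# TLS in isolation: print the served certificate's validity window.
# -servername sends SNI so an SNI-routed load balancer returns the right cert.
echo | openssl s_client -connect "$HOST:443" -servername "$HOST" 2>/dev/null |
  openssl x509 -noout -dates
```

If notAfter is in the past while dig answers promptly, the incident is a certificate expiry, not a DNS outage.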

The Misleading Signal

The DNS latency spike and the 47% failure rate on external lookups make this look like a DNS infrastructure problem. A reasonable engineer would start investigating DNS resolution — checking CoreDNS, external DNS providers, and DNS propagation. The fact that the error says "certificate has expired" might be dismissed as a secondary symptom caused by DNS resolving to a stale endpoint. The DNS latency is real (caused by TLS handshake timeouts being misattributed to DNS in the monitoring config), which reinforces the misdirection.
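One way to expose the misattribution directly is curl's timing breakdown, which reports name lookup and TLS handshake times as separate variables. A sketch against the hostname from the alert; -k skips verification so the handshake completes even with the expired certificate:

```shell
# time_namelookup covers DNS only; time_appconnect runs through the end of
# the TLS handshake. The gap between the two is handshake time -- exactly
# what a naive monitor can misreport as DNS latency.
curl -o /dev/null -sk \
  -w 'dns=%{time_namelookup}s  tls_done=%{time_appconnect}s  total=%{time_total}s\n' \
  https://api.prod.example.com/
```

A small dns= value paired with a large tls_done= value puts the latency in the handshake, not in resolution.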