Investigation: DNS Looks Broken, TLS Is Expired, Fix Is in Cert-Manager

Phase 1: Networking Investigation (Dead End)

The engineer starts with DNS, since the alerts suggest resolution failures.

$ dig api.prod.example.com @8.8.8.8

; <<>> DiG 9.18.18 <<>> api.prod.example.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41832
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
api.prod.example.com.   300     IN      A       203.0.113.42

;; Query time: 47 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)

DNS resolves correctly: the A record matches and the 300 s TTL is normal. Try multiple resolvers to rule out inconsistent answers:

$ dig api.prod.example.com @1.1.1.1 +short
203.0.113.42

$ dig api.prod.example.com @9.9.9.9 +short
203.0.113.42

All resolvers agree. DNS is fine. Check if the latency alert is real:

$ dig api.prod.example.com +stats | grep "Query time"
;; Query time: 43 msec

43 ms is normal. The "DNS latency" alert was actually measuring end-to-end HTTP check latency, which includes the stalled TLS handshake. The Datadog check reports under dns_resolution_time, but its probe type is http, so when TLS fails the full connection time lands under the DNS metric.
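The conflation is easy to reproduce offline. A minimal sketch (the timings are invented for illustration, not taken from the incident) of how an http-type probe that records one end-to-end timer under a DNS metric turns a TLS stall into "DNS latency":

```python
# Hypothetical phase timings: DNS is healthy, TLS is hitting a 10 s timeout.
def report_dns_latency(phase_timings_ms: dict, probe_type: str) -> float:
    if probe_type == "http":
        # The bug: the whole connection time, TLS stall included,
        # is reported under the DNS metric.
        return sum(phase_timings_ms.values())
    # A dns-type probe reports only the resolution phase.
    return phase_timings_ms["dns"]

timings = {"dns": 43, "tcp": 2, "tls": 9955}
print(report_dns_latency(timings, "http"))  # 10000 -- what the alert saw
print(report_dns_latency(timings, "dns"))   # 43 -- the real DNS cost
```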

The Pivot

The DNS investigation is clean, but curl is explicit about the real problem:

$ curl -vI https://api.prod.example.com 2>&1 | grep -A2 "SSL"
* SSL certificate problem: certificate has expired
* Closing connection 0
curl: (60) SSL certificate problem: certificate has expired

Check the certificate directly:

$ echo | openssl s_client -connect api.prod.example.com:443 -servername api.prod.example.com 2>/dev/null | openssl x509 -noout -dates
notBefore=Dec 19 00:00:00 2025 GMT
notAfter=Mar 19 00:00:00 2026 GMT

The certificate expired today. This is not a DNS problem — it is a TLS certificate expiration problem.
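The openssl output above is easy to turn into an automated expiry check. A minimal sketch, assuming the notAfter line uses openssl's default date format (the function name and the frozen "now" are illustrative, not from the incident):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after_line: str, now: datetime) -> int:
    # openssl prints e.g. "notAfter=Mar 19 00:00:00 2026 GMT"
    value = not_after_line.split("=", 1)[1]
    expiry = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    return (expiry.replace(tzinfo=timezone.utc) - now).days

# Frozen at an assumed incident time of Mar 19 2026, 12:00 UTC:
now = datetime(2026, 3, 19, 12, 0, tzinfo=timezone.utc)
print(days_until_expiry("notAfter=Mar 19 00:00:00 2026 GMT", now))  # -1: expired
```

Wiring something like this into a daily cron would have alerted on the approaching expiry weeks before the outage.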

Phase 2: Security Investigation (Root Cause)

Now investigate why the certificate expired. Check cert-manager in the cluster:

$ kubectl get certificates -n prod
NAME              READY   SECRET                AGE
api-prod-tls      False   api-prod-tls-secret   91d

$ kubectl describe certificate api-prod-tls -n prod
Name:         api-prod-tls
Namespace:    prod
...
Status:
  Conditions:
    Type:                  Ready
    Status:                False
    Reason:                Renewing
    Message:               Renewing certificate as renewal was scheduled at 2026-02-17T00:00:00Z
  Not After:               2026-03-19T00:00:00Z
  Renewal Time:            2026-02-17T00:00:00Z
Events:
  Type     Reason           Age   From          Message
  ----     ------           ----  ----          -------
  Warning  ErrRenewCert     30d   cert-manager  Failed to renew certificate: error getting keypair: secret "api-prod-tls-secret" not found
  Warning  ErrRenewCert     23d   cert-manager  Failed to renew certificate: error getting keypair: secret "api-prod-tls-secret" not found
  Warning  ErrRenewCert     16d   cert-manager  Failed to renew certificate: error getting keypair: secret "api-prod-tls-secret" not found
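
A sanity check on the schedule: with no renewBefore configured, cert-manager's default is to renew at two-thirds of the certificate lifetime, which lines up exactly with the Renewal Time in the status above.

```python
from datetime import datetime

# Lifetime from the certificate: Dec 19 2025 to Mar 19 2026 (90 days).
not_before = datetime(2025, 12, 19)
not_after = datetime(2026, 3, 19)

# Default renewal point: notBefore + 2/3 of the lifetime.
renewal = not_before + (not_after - not_before) * 2 / 3
print(renewal)  # 2026-02-17 00:00:00 -- matches the scheduled Renewal Time
```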

Cert-manager has been trying to renew for 30 days but failing. The secret it needs was deleted. Check why:

$ kubectl get events -n prod --field-selector reason=Killing --sort-by='.lastTimestamp' | head -5
# (no relevant events)

$ kubectl logs -n cert-manager deploy/cert-manager --since=1h | grep "api-prod"
E0319 03:00:12.482910  1 controller.go:167] cert-manager/certificates-key-manager: "msg"="error getting keypair" "error"="secret \"api-prod-tls-secret\" not found" "key"="prod/api-prod-tls"

The secret api-prod-tls-secret was deleted. Check audit logs:

$ kubectl get events -n prod --sort-by='.lastTimestamp' | grep "api-prod-tls-secret"
# Nothing recent — the deletion happened 30+ days ago

$ kubectl logs -n kube-system -l component=kube-apiserver --since=720h 2>/dev/null | grep "api-prod-tls-secret" | head -3
# Audit log shows deletion by helm-controller at 2026-02-16T14:22:31Z

A Helm upgrade 30 days ago removed the secret: because it was not marked as cert-manager-managed, Helm's three-way merge treated it as an orphaned resource and deleted it.

Domain Bridge: Why This Crossed Domains

Key insight: The symptom was DNS latency (networking), the root cause was an expired TLS certificate (security), but the certificate expired because Helm deleted the renewal secret (Kubernetes ops). The monitoring system conflated TLS handshake timeouts with DNS resolution latency. This is common because: TLS certificate lifecycle management sits at the intersection of security (cert issuance), Kubernetes (secret storage), and networking (TLS termination). A break in any link manifests as a connectivity failure.

Root Cause

A Helm upgrade in the prod namespace 30 days ago deleted the api-prod-tls-secret Secret. The secret was created by cert-manager but was not annotated with helm.sh/resource-policy: keep, so when Helm performed its three-way merge during the upgrade, it treated the secret as unmanaged and removed it. Cert-manager's renewal process depends on the existing secret to perform key rotation; without it, renewals failed for 30 days, surfacing only as Warning events that no alert was watching, until the certificate expired.
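
One way to prevent a recurrence, sketched under two assumptions: the cluster runs cert-manager v1.5 or later (which introduced secretTemplate), and the issuer name below is a placeholder, since the incident output does not show it. The Certificate resource can stamp the Helm keep-policy annotation onto the Secret it manages:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-prod-tls
  namespace: prod
spec:
  secretName: api-prod-tls-secret
  # Propagated by cert-manager onto the Secret it creates and renews.
  secretTemplate:
    annotations:
      helm.sh/resource-policy: keep   # Helm will skip this Secret on upgrade/uninstall
  dnsNames:
    - api.prod.example.com
  issuerRef:
    name: letsencrypt-prod   # placeholder: the real issuer is not shown above
    kind: ClusterIssuer
```

With the annotation in place, a Helm three-way merge that would otherwise garbage-collect the Secret leaves it alone, and cert-manager keeps the keypair it needs for rotation.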