
Thinking Out Loud: TLS Certificate Ops

A senior SRE's internal monologue while working through a real TLS certificate incident. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

On-call page at 6 AM: "TLS certificate for api.example.com expires in 24 hours." The cert-manager renewal should have handled this automatically, but it didn't. I need to figure out why auto-renewal failed and either fix it or manually renew before the cert expires tomorrow morning.

The Monologue

Cert expiring in 24 hours. Cert-manager should renew a 90-day cert 30 days before expiry (the default renewBefore works out to a third of the cert's lifetime). If we're down to 24 hours, either it's been trying and failing for about 29 days, or it hasn't been trying at all. Let me check.

kubectl get certificate api-tls -n commerce -o yaml

Status shows Ready: False, Message: "Issuing certificate as Secret does not exist". Wait, the secret doesn't exist? Let me check.

kubectl get secret api-tls -n commerce

The secret exists. It's right there. So cert-manager says it doesn't exist but it does? Let me check the Certificate resource more carefully.

kubectl get certificate api-tls -n commerce -o jsonpath='{.spec.secretName}'

api-tls-cert. The Certificate spec references api-tls-cert but the actual secret is named api-tls. Someone renamed the secret or the Certificate spec was updated without updating the secret name. Cert-manager is looking for api-tls-cert and can't find it.

Mental Model: Follow the Name Chain

In Kubernetes, TLS automation is a chain of name references: Ingress -> Certificate -> Secret -> cert-manager Issuer. If any name in the chain is wrong, the automation silently fails. When cert renewal breaks, trace every name reference in the chain before looking at cert-manager logs.
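Concretely, a healthy chain for this incident looks like the following sketch (names taken from this incident; annotations and most fields omitted):

```yaml
# Certificate: cert-manager writes the signed cert into spec.secretName
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: commerce
spec:
  secretName: api-tls            # must match the Ingress's spec.tls[].secretName
  dnsNames:
    - api.example.com
  issuerRef:
    name: letsencrypt-prod       # must match an existing Issuer/ClusterIssuer
    kind: ClusterIssuer
---
# Ingress: serves TLS from the same Secret the Certificate writes to
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: commerce
spec:
  tls:
    - hosts: [api.example.com]
      secretName: api-tls        # same name as Certificate.spec.secretName
```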

But wait — if the secret name is wrong, how is the current cert working at all? The Ingress must be referencing the old secret name.

kubectl get ingress api-ingress -n commerce -o jsonpath='{.spec.tls[0].secretName}'

api-tls. The Ingress uses api-tls (the actual secret). The Certificate resource points to api-tls-cert (which doesn't exist). So the Ingress is using a manually-created or old cert that's about to expire, and cert-manager is trying to write to a different secret name.

This is a naming mismatch. The fix is straightforward — update the Certificate resource to use the correct secret name.

kubectl patch certificate api-tls -n commerce --type merge -p '{"spec":{"secretName":"api-tls"}}'

Now cert-manager should detect the existing secret, check its expiry, and start a renewal. Let me watch.

kubectl get certificate api-tls -n commerce -w

Status changed to Ready: False, Message: "Renewing certificate". Good, it's trying. Let me check the CertificateRequest and Order resources to see if the ACME challenge is working.

kubectl get certificaterequest -n commerce --sort-by=.metadata.creationTimestamp | tail -3
kubectl get order -n commerce --sort-by=.metadata.creationTimestamp | tail -3

There's a new CertificateRequest and an Order in pending state. Let me check the Challenge.

kubectl get challenge -n commerce

Challenge created, type http-01, state pending. Let me check if the ACME solver can complete.

kubectl describe -n commerce $(kubectl get challenge -n commerce -o name | tail -1)

"Waiting for HTTP-01 challenge propagation: wrong status code '404', expected '200'." The HTTP-01 challenge is failing because the solver pod's response isn't being routed correctly. This is probably an Ingress issue — the solver needs its own Ingress to serve the challenge token.

Mental Model: ACME HTTP-01 Behind Ingress Controllers

HTTP-01 challenges require the ACME server to reach /.well-known/acme-challenge/<token> on your domain. Cert-manager creates a temporary pod and Service to serve this, but it also needs the Ingress controller to route the challenge request to the solver. If the Ingress has authentication middleware, rate limiting, or path restrictions, the challenge will fail because the ACME server can't reach the solver.
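For reference, the temporary solver Ingress that cert-manager generates looks roughly like this (names are placeholders and details vary by cert-manager version); the key point is that it routes only the challenge path to the temporary solver Service:

```yaml
# Illustrative shape of an auto-generated HTTP-01 solver Ingress.
# Generated names, labels, and the solver port vary by cert-manager version.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  generateName: cm-acme-http-solver-
  namespace: commerce
  labels:
    acme.cert-manager.io/http01-solver: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /.well-known/acme-challenge/<token>   # <token> is per-challenge
            pathType: ImplementationSpecific
            backend:
              service:
                name: cm-acme-http-solver-xxxxx         # temporary solver Service
                port:
                  number: 8089
```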

Let me check if there's middleware blocking the challenge path.

kubectl get ingress -n commerce -o yaml | grep -A 5 "well-known\|auth\|middleware"

The Ingress has an annotation for auth middleware: nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify". That's blocking the ACME challenge because the Let's Encrypt server can't authenticate. The cert-manager solver Ingress should be created without this auth middleware.

Let me check cert-manager's solver config in the ClusterIssuer.

kubectl get clusterissuer letsencrypt-prod -o yaml

The solver config uses http01.ingress.class: nginx but doesn't specify any annotations or ingress template. So the solver Ingress gets the same Ingress class but doesn't inherit the auth annotations — that should be fine. Unless the auth annotation is applied by a mutating admission webhook...

kubectl get mutatingwebhookconfigurations

There's a default-auth-webhook that adds auth annotations to all Ingress resources. THAT's the problem. The webhook is mutating the cert-manager solver Ingress to add authentication, which blocks the ACME challenge.

I need to either exclude cert-manager's solver Ingress from the webhook or switch to DNS-01 challenges. The webhook exclusion is faster.

kubectl get mutatingwebhookconfiguration default-auth-webhook -o yaml | grep -A 10 namespaceSelector

The webhook applies to all namespaces except those labeled auth-skip: "true". Cert-manager creates solver Ingresses in the same namespace as the Certificate. I can't label the commerce namespace to skip auth — that would skip auth for all Ingresses in the namespace.

Alternative: switch to DNS-01 validation. It doesn't need HTTP access at all — it validates via a TXT record in DNS. Let me update the ClusterIssuer.

kubectl edit clusterissuer letsencrypt-prod

Actually, I don't want to edit the ClusterIssuer — other teams use it. Let me create a separate Issuer for this namespace that uses DNS-01.
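A sketch of what that Issuer could look like, assuming Let's Encrypt and (purely for illustration) Cloudflare-managed DNS; the real DNS provider and credentials for example.com aren't known here:

```yaml
# Sketch: namespace-scoped Issuer using DNS-01.
# The DNS provider, email, and secret names below are illustrative.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-dns01
  namespace: commerce
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: sre@example.com              # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key
    solvers:
      - dns01:
          cloudflare:                   # illustrative provider choice
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```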

But that's a bigger change. For the immediate problem (cert expires in 24 hours), let me just manually generate a cert and patch the secret. I can fix the automation afterward.

# Manual cert as emergency fix
kubectl cert-manager renew api-tls -n commerce 2>/dev/null || echo "cert-manager CLI not available"

The cert-manager kubectl plugin isn't installed. Let me try a different approach — temporarily remove the webhook's effect on the solver Ingress by adding an annotation that the webhook checks.

Actually, the fastest path: temporarily take the auth webhook out of the request path, trigger the renewal, wait for it to complete, then restore everything. One subtlety: failurePolicy: Ignore by itself doesn't stop a healthy webhook from mutating; it only tells the API server to admit objects unmutated when the webhook call fails. So I'll flip the policy to Ignore and also scale the webhook backend to zero, so every call fails and is ignored.

kubectl patch mutatingwebhookconfiguration default-auth-webhook --type=json -p='[{"op":"add","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
kubectl scale deployment default-auth-webhook -n webhook-system --replicas=0   # deployment name/namespace assumed

With the backend down and failures ignored, new Ingresses are created without the auth mutation. Now let me delete the failed challenge so cert-manager creates a new one.

kubectl delete challenge -n commerce --all

Cert-manager will re-create the challenge with a new solver Ingress. Let me watch.

kubectl get challenge -n commerce -w

Challenge state: pending... valid. It worked. Let me check the certificate.

kubectl get certificate api-tls -n commerce

Ready: True. Certificate renewed. Let me verify the new expiry.

kubectl get secret api-tls -n commerce -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate

notAfter=Jun 17 2026. 90 days from now. Perfect. Let me restore the webhook: scale its backend back up and set the failure policy back to Fail.

kubectl scale deployment default-auth-webhook -n webhook-system --replicas=1   # or its previous replica count
kubectl patch mutatingwebhookconfiguration default-auth-webhook --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'

Webhook restored. Now I need to fix this for the next renewal cycle. I'll file a ticket to either switch to DNS-01 or add an exemption for cert-manager solver Ingresses in the webhook configuration.
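The webhook exemption could be sketched as an objectSelector on the existing configuration, assuming (as in recent cert-manager versions) that solver Ingresses carry the acme.cert-manager.io/http01-solver label:

```yaml
# Fragment of the follow-up fix: skip mutation for cert-manager solver Ingresses.
# The webhook entry name here is hypothetical; only the objectSelector is new.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: default-auth-webhook
webhooks:
  - name: add-auth.example.com          # existing webhook entry
    objectSelector:
      matchExpressions:
        - key: acme.cert-manager.io/http01-solver
          operator: DoesNotExist
    # ...remaining fields (clientConfig, rules, failurePolicy) unchanged
```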

What Made This Senior-Level

| Junior would... | Senior does... | Why |
| --- | --- | --- |
| Manually generate a cert with openssl and call it done | Fix the automation (name mismatch) so future renewals work | Manual certs are a ticking time bomb — they'll expire again in 90 days |
| Only check cert-manager logs | Trace the entire name chain: Ingress -> Certificate -> Secret -> Issuer -> Challenge | The failure can be at any link in the chain |
| Not think about admission webhooks affecting cert-manager | Check for mutating webhooks that modify the solver Ingress | Admission webhooks are invisible infrastructure that can break cert-manager in non-obvious ways |
| Spend hours debugging the automation while the cert expires | Balance: quick manual fix to extend the deadline, then fix automation properly | You need to buy time before you can fix the root cause |

Key Heuristics Used

  1. Follow the Name Chain: TLS automation is a chain of name references. A single name mismatch breaks the entire chain silently.
  2. Admission Webhooks Are Invisible Infrastructure: Mutating webhooks can modify cert-manager solver resources without anyone realizing it. Always check for webhooks when cert-manager challenges fail.
  3. Buy Time, Then Fix: When a cert is expiring soon, get a valid cert by any means first, then fix the automation. Don't let the perfect be the enemy of the non-expired.
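Heuristic 3 is easy to turn into a standing check. A minimal sketch using openssl's -checkend, with a throwaway self-signed cert standing in for the real tls.crt you'd pull from the Secret:

```shell
# Generate a throwaway 90-day self-signed cert as a stand-in for the real one
openssl req -x509 -newkey rsa:2048 -keyout /tmp/tls.key -out /tmp/tls.crt \
  -days 90 -nodes -subj "/CN=api.example.com" 2>/dev/null

# Human-readable expiry
openssl x509 -in /tmp/tls.crt -noout -enddate

# -checkend exits 0 if the cert is still valid N seconds from now.
# 2592000s = 30 days, the same window cert-manager renews within by default.
if openssl x509 -in /tmp/tls.crt -noout -checkend 2592000 >/dev/null; then
  echo "OK: more than 30 days of validity left"
else
  echo "ALERT: renew now"
fi
```

Wire the check against the live Secret instead of /tmp and page on the ALERT branch, and "it's automated" stops meaning "it's unmonitored."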

Cross-References

  • Primer — TLS certificate lifecycle, ACME protocol, and cert-manager architecture
  • Street Ops — Certificate debugging commands and renewal procedures
  • Footguns — Secret name mismatches, admission webhooks blocking challenges, and the "it's automated so I don't need to monitor it" trap