cert-manager: Street-Level Ops

Real-world workflows for installing cert-manager, issuing certificates, diagnosing failures, and operating at scale.

Quick Diagnosis Commands

# Are all cert-manager pods running?
kubectl get pods -n cert-manager

# List all certificates and their ready status
kubectl get certificate -A
# NAMESPACE   NAME          READY   SECRET            AGE
# default     myapp-tls     True    myapp-tls         5d
# default     broken-cert   False   broken-cert-tls   2h   ← investigate this

# Check expiry of all certs (requires jq)
kubectl get certificate -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.notAfter)"'

# Check expiry from Secret directly
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -enddate
# notAfter=Apr  1 12:00:00 2024 GMT

# Check days until expiry (assumes GNU date and an awk with systime(), such as gawk)
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -enddate | \
  awk -F= '{cmd="date -d\""$2"\" +%s"; cmd | getline exp; close(cmd); print int((exp - systime()) / 86400)" days remaining"}'

# Tail cert-manager controller logs (most debugging starts here)
kubectl logs -n cert-manager deployment/cert-manager -f --tail=200

# Watch ACME challenge resources
kubectl get challenge -A -w
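
The days-until-expiry calculation can also be done on the `.status.notAfter` value directly, without decoding the Secret. The following is a minimal sketch, assuming GNU `date` (for `-d` with RFC 3339 input); the helper name `days_remaining` is hypothetical and not part of cert-manager.

```shell
# days_remaining ISO_TIMESTAMP [NOW_EPOCH]
# Prints the whole days between NOW_EPOCH (default: the current time) and
# ISO_TIMESTAMP. Assumes GNU date, which parses strings such as
# 2024-04-01T12:00:00Z. Integer division truncates toward zero, so a cert
# with less than 24h left (or expired less than 24h ago) prints 0.
days_remaining() (
  expiry_epoch=$(date -u -d "$1" +%s) || exit 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
)

# Usage against a live cluster (not executed here):
#   days_remaining "$(kubectl get certificate myapp-tls -n default \
#     -o jsonpath='{.status.notAfter}')"
```

The optional second argument exists only so the function can be checked against a fixed point in time.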

Gotcha: HTTP-01 Challenge Fails on Ingress with Authentication

Rule: The ACME HTTP-01 challenge path (/.well-known/acme-challenge/) must be publicly accessible without authentication. If your Ingress has auth middleware (oauth2-proxy, basic auth, IP allowlist), the Let's Encrypt validation server cannot reach the challenge and the cert issuance fails.

# WRONG — auth middleware blocks ACME challenge
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.svc/oauth2/auth"
    cert-manager.io/cluster-issuer: letsencrypt-prod  # ← will fail

# RIGHT — exclude the challenge path from auth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.svc/oauth2/auth"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      location ^~ /.well-known/acme-challenge/ {
        auth_request off;
        proxy_pass http://$service_name.$namespace.svc.cluster.local;
      }
    cert-manager.io/cluster-issuer: letsencrypt-prod

Alternative: use DNS-01 to avoid the HTTP reachability requirement entirely.
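
As an illustration of the DNS-01 alternative, the sketch below renders a ClusterIssuer manifest that can be piped into `kubectl apply -f -`. It assumes Cloudflare as the DNS provider and a Secret named `cloudflare-api-token` (key `api-token`) in the cert-manager namespace; both are placeholders, and the function name `render_dns01_issuer` is hypothetical. Other providers use different solver fields.

```shell
# render_dns01_issuer [ISSUER_NAME] [EMAIL]
# Prints a ClusterIssuer manifest with a DNS-01 solver to stdout.
# Assumption: Cloudflare DNS with an API token stored in a Secret.
render_dns01_issuer() (
  issuer_name=${1:-letsencrypt-dns01}
  email=${2:-ops@example.com}
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${issuer_name}
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: ${issuer_name}-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # placeholder Secret name
              key: api-token               # placeholder key within the Secret
EOF
)

# Usage (not executed here):
#   render_dns01_issuer | kubectl apply -f -
```

Rendering to stdout first makes it easy to review the manifest, or run `kubectl diff -f -`, before applying it.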

Gotcha: Wildcard Certs Require DNS-01

Rule: Let's Encrypt will not issue wildcard certificates via HTTP-01. If you try, the Challenge will immediately fail with "Wildcard domain names (*.example.com) require dns01".

# WRONG — HTTP-01 cannot issue wildcards
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
  dnsNames:
    - "*.example.com"
  issuerRef:
    name: letsencrypt-http01  # ← will fail for wildcard

# RIGHT — use a DNS-01 issuer for wildcards
spec:
  dnsNames:
    - "*.example.com"
    - example.com             # apex domain — wildcard doesn't cover it
  issuerRef:
    name: letsencrypt-dns01
    kind: ClusterIssuer
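
A quick pre-flight check can catch this before a Certificate is applied. The sketch below is a hypothetical helper (`needs_dns01` is not a cert-manager command) that reports whether any name in a list is a wildcard and therefore needs a DNS-01 issuer.

```shell
# needs_dns01 NAME...
# Returns 0 (true) if any argument is a wildcard name such as *.example.com,
# which Let's Encrypt only validates via DNS-01. Returns 1 otherwise.
needs_dns01() {
  for name in "$@"; do
    case "$name" in
      '*.'*) return 0 ;;
    esac
  done
  return 1
}

# Example: choose an issuer name based on the requested dnsNames.
if needs_dns01 '*.example.com' example.com; then
  echo "use letsencrypt-dns01"
else
  echo "letsencrypt-http01 is sufficient"
fi
```

The issuer names printed here are the ones used in the examples above; substitute whatever your cluster defines.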

Pattern: Debugging a Stuck Certificate

Walk down the resource chain: Certificate → CertificateRequest → Order → Challenge.

Remember: The debugging chain is always Cert → CR → Order → Challenge. If you jump straight to the Challenge, you may miss that the CertificateRequest was never approved (e.g., a policy controller like cert-manager-approver-policy denied it).

# Step 1 — Check the Certificate
kubectl describe certificate myapp-tls -n default
# Look for: Status.Conditions, Events at bottom of output

# Common status messages:
# "Issuing certificate as Secret does not exist" → normal initial state
# "Waiting for CertificateRequest to be issued" → check CertificateRequest
# "Certificate is up to date and has not expired" → healthy

# Step 2 — Check CertificateRequest
kubectl get certificaterequest -n default
# NAME                APPROVED   DENIED   READY   ISSUER   AGE
# myapp-tls-5xk8j    True                False            45m  ← not ready

kubectl describe certificaterequest myapp-tls-5xk8j -n default
# Status.Conditions → message tells you why it's stuck

# Step 3 — Check Order (ACME only)
kubectl get order -n default
kubectl describe order myapp-tls-5xk8j-xxxxx -n default
# Status.State: pending / ready / invalid / errored

# Step 4 — Check Challenge (ACME only)
kubectl get challenge -n default
kubectl describe challenge myapp-tls-xxxxx-0 -n default
# Status.State: pending / valid / invalid
# Events: "Waiting for DNS record to propagate"
#         "Error presenting challenge: AccessDenied"
#         "Error: 403 urn:ietf:params:acme:error:unauthorized"

# Step 5 — Check DNS-01 record manually (if using DNS-01)
dig _acme-challenge.myapp.example.com TXT @8.8.8.8

# Step 6 — Check controller logs filtered to the domain
kubectl logs -n cert-manager deployment/cert-manager --since=1h | grep myapp.example.com

# Step 7 — Force re-issuance by deleting the Secret
kubectl delete secret myapp-tls -n default
# cert-manager detects the missing Secret within ~30s and re-issues
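
The first four steps can be wrapped into one helper that lists each resource kind in chain order for a namespace. This is a minimal sketch; `cm_chain` is a hypothetical name, and it only lists resources, so you still need `kubectl describe` on whichever object looks wrong.

```shell
# cm_chain [NAMESPACE]
# Lists Certificate, CertificateRequest, Order, and Challenge resources in the
# order they should be inspected. Order and Challenge only exist for ACME issuers.
cm_chain() (
  ns=${1:-default}
  for kind in certificate certificaterequest order challenge; do
    echo "== ${kind} =="
    kubectl get "$kind" -n "$ns" || echo "(failed to list ${kind})"
  done
)

# Usage (not executed here):
#   cm_chain default
```

Reading the output top to bottom shows where the chain stops: the first kind with no resources, or with a resource that is not ready, is usually the one to describe next.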

Pattern: Rotating a Certificate Immediately

# Option 1: Force renewal via the cert-manager CLI, which sets the Issuing
# condition on the Certificate. Older releases ship it as a kubectl plugin;
# newer releases ship the standalone cmctl binary.
kubectl cert-manager renew myapp-tls -n default
# or: cmctl renew myapp-tls -n default

# Option 2: Annotations do not work here. cert-manager.io/renew-before is an
# ingress-shim annotation that takes a duration (for example 720h) on an Ingress;
# setting it on a Certificate does not trigger renewal. Use Option 1 or 3 instead.

# Option 3: Delete the Secret (most reliable — forces full re-issuance)
kubectl delete secret myapp-tls -n default

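# Option 4: Delete the CertificateRequest to trigger a new one
# (mainly useful when an in-flight request is stuck; the Certificate name is
# recorded as an annotation, so a label selector may not match; xargs -r is GNU)
kubectl get certificaterequest -n default -o json | \
  jq -r '.items[] | select(.metadata.annotations["cert-manager.io/certificate-name"] == "myapp-tls") | .metadata.name' | \
  xargs -r kubectl delete certificaterequest -n default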

# Verify new cert was issued with expected dates
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -dates

Pattern: Migrating Manually Managed Certs to cert-manager

You have an existing TLS Secret created manually. You want cert-manager to manage renewal going forward.

# 1. Check what the current cert looks like
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -text | grep -E "Subject|DNS|Not After"

# 2. Create the Certificate resource (cert-manager will adopt the Secret)
kubectl apply -f - << 'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: default
spec:
  secretName: myapp-tls         # same name as existing Secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com
  duration: 2160h
  renewBefore: 720h
EOF

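# 3. cert-manager inspects the existing Secret
# If the cert matches the spec and has more than renewBefore remaining, it may keep it and wait
# If less than renewBefore remains, or the Secret lacks cert-manager's issuer
# annotations (typical for hand-made Secrets), expect an immediate re-issuance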
kubectl describe certificate myapp-tls -n default

Scenario: Let's Encrypt Rate Limit Hit

Default trap: when renewBefore is unset, cert-manager renews once two thirds of the certificate's lifetime has passed. With Let's Encrypt's 90-day certificates, that puts the first renewal attempt at day 60 and leaves 30 days (720h) of retry buffer. If your DNS provider has a slow API or intermittent failures, that buffer disappears faster than you'd expect.
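
The arithmetic behind that window is simple enough to script when comparing duration and renewBefore settings. The helper below is a hypothetical sketch (`renewal_window` is not a cert-manager command) that takes both values in hours.

```shell
# renewal_window DURATION_HOURS RENEW_BEFORE_HOURS
# Prints the day on which renewal attempts begin and how many days of
# retries remain before the certificate expires.
renewal_window() (
  duration_h=$1
  renew_before_h=$2
  start_day=$(( (duration_h - renew_before_h) / 24 ))
  buffer_days=$(( renew_before_h / 24 ))
  echo "renewal starts on day ${start_day}, leaving ${buffer_days} days of retries"
)

renewal_window 2160 720
# prints: renewal starts on day 60, leaving 30 days of retries
```
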

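Let's Encrypt limits issuance to 50 certificates per registered domain per week, and to 5 per week for the exact same set of hostnames (the duplicate certificate limit). Repeatedly deleting and re-issuing the same Certificate usually trips the second limit first. When you hit it:

# Symptom in cert-manager logs (duplicate certificate limit):
# E0318 ... "too many certificates already issued for exact set of domains"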

# 1. Switch to Let's Encrypt staging for testing
kubectl apply -f - << 'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
EOF

# 2. Use staging for all test/dev environments
# Staging certs are not browser-trusted but have much higher rate limits

# 3. Check current rate limit usage at:
# https://crt.sh/?q=example.com   (shows all issued certs)

# 4. To reduce cert count: use wildcard certs where possible
# *.example.com covers every single-level subdomain (app.example.com, but not
# a.b.example.com or example.com itself) → counts as 1 certificate

# 5. Share certs across services using Subject Alternative Names
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
  dnsNames:
    - app1.example.com
    - app2.example.com
    - app3.example.com   # multiple SANs in one cert

Emergency: Certificate Expired

A certificate has expired. Users are seeing TLS errors.

# 1. Confirm expiry
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -enddate
# notAfter=Mar 01 12:00:00 2024 GMT   ← in the past

# 2. Check why auto-renewal failed
kubectl describe certificate myapp-tls -n default
kubectl get events -n default --field-selector reason=Failed | grep myapp-tls

# 3. Force immediate re-issuance
kubectl delete secret myapp-tls -n default

# 4. Watch for issuance to complete (should take 30–120s for HTTP-01)
kubectl get certificate myapp-tls -n default -w
# NAME        READY   SECRET       AGE
# myapp-tls   False   myapp-tls    5s
# myapp-tls   True    myapp-tls    47s   ← issued

# 5. Verify new cert dates
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -dates

# 6. If Ingress is not picking up the new cert, restart the Ingress controller
# (nginx-ingress caches TLS certs in memory and may not detect the Secret update)
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx

# 7. Verify from outside
curl -v https://myapp.example.com 2>&1 | grep -E "SSL|expire|subject"
echo | openssl s_client -connect myapp.example.com:443 -servername myapp.example.com 2>/dev/null | \
  openssl x509 -noout -dates

Useful One-Liners

# List all certificates expiring in the next 30 days
kubectl get certificate -A -o json | \
  jq -r '.items[] | select(.status.notAfter != null) |
    select((.status.notAfter | fromdateiso8601) < (now + 30 * 86400)) |
    "\(.metadata.namespace)/\(.metadata.name): expires \(.status.notAfter | split("T")[0])"' | \
  sort

# Inspect the ACME account key Secret (this is the account's private key, not a CA bundle; avoid sharing the output)
kubectl get secret -n cert-manager letsencrypt-prod-account-key -o yaml

# List all Issuers and ClusterIssuers and their status
kubectl get clusterissuer -o wide
kubectl get issuer -A -o wide

# Decode and inspect a certificate Secret
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -text | \
  grep -E "Issuer|Subject|Not Before|Not After|DNS:"

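# Count challenges that are not valid (non-zero = something is pending or failing)
# -w matches "valid" as a whole word, so "invalid" challenges are still counted
kubectl get challenge -A --no-headers | grep -vw "valid" | wc -l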

# Verify cert-manager can resolve DNS (useful for DNS-01 debugging)
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
  nslookup _acme-challenge.myapp.example.com

# Check the ACME account key is present
kubectl get secret letsencrypt-prod-account-key -n cert-manager

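# Force a re-sync of a Certificate (useful if cert-manager missed an update)
# The annotation key is arbitrary: any metadata change makes the controller
# reconcile the object again. It does not by itself force a renewal.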
kubectl annotate certificate myapp-tls -n default \
  cert-manager.io/force-renewal="$(date +%s)" --overwrite

Quick Reference

Runbook: Cert Renewal Failed