cert-manager - Street-Level Ops
Real-world workflows for installing cert-manager, issuing certificates, diagnosing failures, and operating at scale.
Quick Diagnosis Commands
# Are all cert-manager pods running?
kubectl get pods -n cert-manager
# List all certificates and their ready status
kubectl get certificate -A
# NAMESPACE NAME READY SECRET AGE
# default myapp-tls True myapp-tls 5d
# default broken-cert False broken-cert-tls 2h ← investigate this
# Check expiry of all certs (requires jq)
kubectl get certificate -A -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.notAfter)"'
# Check expiry from Secret directly
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate
# notAfter=Apr 1 12:00:00 2024 GMT
# Check days until expiry (assumes GNU date and an awk with systime(), e.g. gawk)
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate | \
awk -F= '{cmd="date -d\""$2"\" +%s"; cmd | getline exp; close(cmd); print int((exp - systime()) / 86400)" days remaining"}'
# Tail cert-manager controller logs (most debugging starts here)
kubectl logs -n cert-manager deployment/cert-manager -f --tail=200
# Watch ACME challenge resources
kubectl get challenge -A -w
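The days-until-expiry arithmetic above can be wrapped in a small function so it is easier to reuse in scripts. This is a minimal sketch, assuming GNU date; the function name `days_remaining` and the optional "now" argument are illustrative, not part of cert-manager or kubectl.

```shell
# days_remaining: whole days between "now" and an openssl-style end date.
# Assumes GNU date (for `date -d`). Name and arguments are hypothetical.
#   $1  end date string, e.g. "Apr  1 12:00:00 2024 GMT"
#   $2  optional "now" as epoch seconds (defaults to the current time)
days_remaining() {
  exp=$(date -u -d "$1" +%s) || return 1
  now=${2:-$(date -u +%s)}
  # Integer division truncates toward zero; a negative result means expired.
  echo $(( (exp - now) / 86400 ))
}

# Usage against a live Secret:
#   enddate=$(kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
#     base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
#   days_remaining "$enddate"
```

Passing a fixed epoch as the second argument makes the result reproducible, which helps when testing alert thresholds.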
Gotcha: HTTP-01 Challenge Fails on Ingress with Authentication
Rule: The ACME HTTP-01 challenge path (/.well-known/acme-challenge/) must be publicly accessible without authentication. If your Ingress has auth middleware (oauth2-proxy, basic auth, IP allowlist), the Let's Encrypt validation server cannot reach the challenge and the cert issuance fails.
# WRONG — auth middleware blocks ACME challenge
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.svc/oauth2/auth"
cert-manager.io/cluster-issuer: letsencrypt-prod # ← will fail
# RIGHT — exclude the challenge path from auth
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.svc/oauth2/auth"
nginx.ingress.kubernetes.io/configuration-snippet: |
location ^~ /.well-known/acme-challenge/ {
auth_request off;
proxy_pass http://$service_name.$namespace.svc.cluster.local;
}
cert-manager.io/cluster-issuer: letsencrypt-prod
Alternative: use DNS-01 to avoid the HTTP reachability requirement entirely.
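One way to check the rule above is to request the challenge path from outside the cluster and look at the status code. The sketch below only interprets that code; the mapping from status to meaning is an assumption for illustration, and the hostname and token in the usage comment are placeholders.

```shell
# classify_challenge_probe: interpret the HTTP status returned for a request
# to /.well-known/acme-challenge/<anything> made from outside the cluster.
# The mapping is an illustrative assumption, not cert-manager behavior.
classify_challenge_probe() {
  case "$1" in
    200|404)             echo "reachable" ;;   # request got past any auth layer
    301|302|303|307|308) echo "redirected" ;;  # check the Location header: a login page will not serve the token
    401|403)             echo "blocked" ;;     # auth middleware or allowlist is in the way
    *)                   echo "unknown" ;;
  esac
}

# Usage (hypothetical hostname and token):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     http://myapp.example.com/.well-known/acme-challenge/probe)
#   classify_challenge_probe "$code"
```

Run the probe from a network outside your allowlist, since that is where the validation request originates.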
Gotcha: Wildcard Certs Require DNS-01
Rule: Let's Encrypt will not issue wildcard certificates via HTTP-01. If you try, the Challenge will immediately fail with "Wildcard domain names (*.example.com) require dns01".
# WRONG — HTTP-01 cannot issue wildcards
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
dnsNames:
- "*.example.com"
issuerRef:
name: letsencrypt-http01 # ← will fail for wildcard
# RIGHT — use a DNS-01 issuer for wildcards
spec:
dnsNames:
- "*.example.com"
- example.com # apex domain — wildcard doesn't cover it
issuerRef:
name: letsencrypt-dns01
kind: ClusterIssuer
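A pre-flight check can catch this before a Certificate is applied. The following is a small sketch; `requires_dns01` is a hypothetical helper name, and it only inspects the names you pass to it.

```shell
# requires_dns01: exit 0 if any given DNS name is a wildcard (starts with "*."),
# meaning the Certificate needs a DNS-01 issuer. Helper name is hypothetical.
requires_dns01() {
  for name in "$@"; do
    case "$name" in
      '*.'*) return 0 ;;
    esac
  done
  return 1
}

# Usage:
#   if requires_dns01 '*.example.com' example.com; then
#     echo "use a DNS-01 issuer"
#   fi
```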
Pattern: Debugging a Stuck Certificate
Walk down the resource chain: Certificate → CertificateRequest → Order → Challenge.
Remember: The debugging chain is always Cert → CR → Order → Challenge. If you jump straight to the Challenge, you may miss that the CertificateRequest was never approved (e.g., a policy controller like cert-manager-approver-policy denied it).
# Step 1 — Check the Certificate
kubectl describe certificate myapp-tls -n default
# Look for: Status.Conditions, Events at bottom of output
# Common status messages:
# "Issuing certificate as Secret does not exist" → normal initial state
# "Waiting for CertificateRequest to be issued" → check CertificateRequest
# "Certificate is up to date and has not expired" → healthy
# Step 2 — Check CertificateRequest
kubectl get certificaterequest -n default
# NAME APPROVED DENIED READY ISSUER AGE
# myapp-tls-5xk8j True False 45m ← not ready
kubectl describe certificaterequest myapp-tls-5xk8j -n default
# Status.Conditions → message tells you why it's stuck
# Step 3 — Check Order (ACME only)
kubectl get order -n default
kubectl describe order myapp-tls-5xk8j-xxxxx -n default
# Status.State: pending / ready / invalid / errored
# Step 4 — Check Challenge (ACME only)
kubectl get challenge -n default
kubectl describe challenge myapp-tls-xxxxx-0 -n default
# Status.State: pending / valid / invalid
# Events: "Waiting for DNS record to propagate"
# "Error presenting challenge: AccessDenied"
# "Error: 403 urn:ietf:params:acme:error:unauthorized"
# Step 5 — Check DNS-01 record manually (if using DNS-01)
dig _acme-challenge.myapp.example.com TXT @8.8.8.8
# Step 6 — Check controller logs filtered to the domain
kubectl logs -n cert-manager deployment/cert-manager --since=1h | grep myapp.example.com
# Step 7 — Force re-issuance by deleting the Secret
kubectl delete secret myapp-tls -n default
# cert-manager detects the missing Secret within ~30s and re-issues
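The walk down the chain can be scripted as a first pass before running `describe` by hand. This is a sketch under assumptions: the function name and output format are made up for illustration, and it suppresses kubectl's stderr, so RBAC or connection errors will look like "no objects".

```shell
# walk_cert_chain: list each resource kind in the issuance chain for one
# namespace, stopping at the first kind that has no objects.
# Order and Challenge exist only for ACME issuers.
# Function name and output format are hypothetical.
walk_cert_chain() {
  ns=${1:-default}
  for kind in certificate certificaterequest order challenge; do
    # stderr is dropped, so permission errors also show up as "no objects"
    out=$(kubectl get "$kind" -n "$ns" --no-headers 2>/dev/null)
    if [ -z "$out" ]; then
      echo "no $kind objects in $ns; inspect the previous step"
      return 0
    fi
    echo "== $kind =="
    echo "$out"
  done
}

# Usage:
#   walk_cert_chain default
```

The first kind with no objects tells you which `kubectl describe` to run next on the step before it.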
Pattern: Rotating a Certificate Immediately
# Option 1: Force renewal via plugin
# (newer releases ship the standalone CLI instead: cmctl renew myapp-tls -n default)
kubectl cert-manager renew myapp-tls -n default
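# Option 2: Annotate the Certificate (unreliable; prefer Option 1 or 3)
# cert-manager does not document a Certificate annotation that triggers renewal.
# cert-manager.io/renew-before is an Ingress annotation that takes a duration
# (e.g. 720h), not a timestamp, so the commands below may have no effect.
kubectl annotate certificate myapp-tls -n default \
cert-manager.io/issuer-kind- # remove any existing annotation (may be needed)
kubectl annotate certificate myapp-tls -n default \
cert-manager.io/renew-before="$(date -u +%Y-%m-%dT%H:%M:%SZ)"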
# Option 3: Delete the Secret (most reliable — forces full re-issuance)
kubectl delete secret myapp-tls -n default
# Option 4: Delete the CertificateRequest to trigger a new one
kubectl delete certificaterequest -n default -l cert-manager.io/certificate-name=myapp-tls
# Verify new cert was issued with expected dates
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -dates
Pattern: Migrating Manually Managed Certs to cert-manager
You have an existing TLS Secret created manually. You want cert-manager to manage renewal going forward.
# 1. Check what the current cert looks like
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -text | grep -E "Subject|DNS|Not After"
# 2. Create the Certificate resource (cert-manager will adopt the Secret)
kubectl apply -f - << 'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: myapp-tls
namespace: default
spec:
secretName: myapp-tls # same name as existing Secret
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- myapp.example.com
duration: 2160h
renewBefore: 720h
EOF
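# 3. cert-manager will check the existing Secret
# If the cert matches the spec and has > renewBefore remaining: it adopts the cert and waits
# If < renewBefore remaining: it immediately renews
# Note: a hand-made Secret lacks cert-manager's issuer annotations, so an
# immediate re-issuance is likely; plan the migration as if one will happen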
kubectl describe certificate myapp-tls -n default
Scenario: Let's Encrypt Rate Limit Hit
Default trap: with Let's Encrypt's 90-day certificates, cert-manager's default renewBefore works out to 30 days (720h), so renewal attempts start at day 60. If your DNS provider has a slow API or intermittent failures, 30 days of retry buffer disappears faster than you'd expect.
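The day-60 figure is plain arithmetic on `duration` and `renewBefore`, and it is worth recomputing whenever either value changes. A minimal sketch, with a hypothetical helper name and both inputs given in hours:

```shell
# renewal_start_day: day of the certificate's life on which renewal begins.
#   $1  duration in hours      (e.g. 2160 for 90 days)
#   $2  renewBefore in hours   (e.g. 720 for 30 days)
# Helper name is hypothetical; this only does the arithmetic.
renewal_start_day() {
  echo $(( ($1 - $2) / 24 ))
}

renewal_start_day 2160 720   # prints 60
```

Raising `renewBefore` moves the start day earlier and widens the retry buffer, at the cost of issuing certificates more often.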
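Let's Encrypt has a limit of 50 certificates per registered domain per week, plus a tighter limit of 5 certificates per week for an identical set of domain names. The "exact set of domains" message below comes from the second limit, which repeated delete-and-reissue loops hit quickly. When you hit it: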
# Symptom in cert-manager logs:
# E0318 ... "too many certificates already issued for exact set of domains"
# 1. Switch to Let's Encrypt staging for testing
kubectl apply -f - << 'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-staging
spec:
acme:
server: https://acme-staging-v02.api.letsencrypt.org/directory
email: ops@example.com
privateKeySecretRef:
name: letsencrypt-staging-account-key
solvers:
- http01:
ingress:
class: nginx
EOF
# 2. Use staging for all test/dev environments
# Staging certs are not browser-trusted but have much higher rate limits
# 3. Check current rate limit usage at:
# https://crt.sh/?q=example.com (shows all issued certs)
# 4. To reduce cert count: use wildcard certs where possible
# *.example.com covers all single-level subdomains (app.example.com, not a.b.example.com) → counts as 1 certificate
# 5. Share certs across services using Subject Alternative Names
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
dnsNames:
- app1.example.com
- app2.example.com
- app3.example.com # multiple SANs in one cert
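To see how much SAN sharing helps, it can be useful to estimate how many certificates count against each registered domain. The sketch below is a rough approximation: it treats the last two labels as the registered domain, which is wrong for suffixes such as co.uk (Let's Encrypt uses the Public Suffix List). The function name and the input format (one certificate per line, names separated by spaces) are assumptions.

```shell
# count_certs_by_domain: read one certificate per line (DNS names separated
# by spaces) and print "<count> <registered-domain>", highest count first.
# Each certificate counts once per distinct registered domain it contains.
# Naive: registered domain = last two labels. Helper name is hypothetical.
count_certs_by_domain() {
  awk '{
         split("", seen)                      # reset per-certificate set
         for (i = 1; i <= NF; i++) {
           k = split($i, p, ".")
           if (k < 2) continue
           d = p[k-1] "." p[k]
           if (!(d in seen)) { seen[d] = 1; n[d]++ }
         }
       }
       END { for (d in n) print n[d], d }' | sort -k1,1nr -k2,2
}

# Usage (requires jq):
#   kubectl get certificate -A -o json | \
#     jq -r '.items[] | .spec.dnsNames // [] | join(" ")' | count_certs_by_domain
```

Three names in one Certificate count once here, which is the point of sharing a cert across services.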
Emergency: Certificate Expired
A certificate has expired. Users are seeing TLS errors.
# 1. Confirm expiry
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate
# notAfter=Mar 01 12:00:00 2024 GMT ← in the past
# 2. Check why auto-renewal failed
kubectl describe certificate myapp-tls -n default
kubectl get events -n default --field-selector reason=Failed | grep myapp-tls
# 3. Force immediate re-issuance
kubectl delete secret myapp-tls -n default
# 4. Watch for issuance to complete (should take 30–120s for HTTP-01)
kubectl get certificate myapp-tls -n default -w
# NAME READY SECRET AGE
# myapp-tls False myapp-tls 5s
# myapp-tls True myapp-tls 47s ← issued
# 5. Verify new cert dates
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -dates
# 6. If Ingress is not picking up the new cert, restart the Ingress controller
# (nginx-ingress caches TLS certs in memory and may not detect the Secret update)
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx
# 7. Verify from outside
curl -v https://myapp.example.com 2>&1 | grep -E "SSL|expire|subject"
echo | openssl s_client -connect myapp.example.com:443 -servername myapp.example.com 2>/dev/null | \
openssl x509 -noout -dates
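In a scripted recovery, step 4 can block on the Ready condition instead of watching by eye. A minimal sketch using `kubectl wait`; the wrapper name and the 180s default timeout are assumptions chosen to sit just above the 30-120s range mentioned above.

```shell
# wait_for_cert: block until the Certificate reports Ready, or fail on timeout.
#   $1  Certificate name
#   $2  namespace (default: default)
#   $3  timeout   (default: 180s, an arbitrary choice)
# Wrapper name is hypothetical; kubectl wait does the actual work.
wait_for_cert() {
  kubectl wait --for=condition=Ready "certificate/$1" \
    -n "${2:-default}" --timeout="${3:-180s}"
}

# Usage:
#   kubectl delete secret myapp-tls -n default
#   wait_for_cert myapp-tls default || echo "still not Ready; walk the chain"
```

A non-zero exit here is the cue to go back to the debugging chain rather than retry the delete.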
Useful One-Liners
# List all certificates expiring in the next 30 days
kubectl get certificate -A -o json | \
jq -r '.items[] | select(.status.notAfter != null) |
select((.status.notAfter | fromdateiso8601) < (now + 30 * 86400)) |
. as $c | ($c.status.notAfter | split("T")[0]) as $exp |
"\($c.metadata.namespace)/\($c.metadata.name): expires \($exp)"' | \
sort
# Inspect the ACME account key Secret (output contains a private key; do not paste it anywhere)
kubectl get secret -n cert-manager letsencrypt-prod-account-key -o yaml
# List all Issuers and ClusterIssuers and their status
kubectl get clusterissuer -o wide
kubectl get issuer -A -o wide
# Decode and inspect a certificate Secret
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -text | \
grep -E "Issuer|Subject|Not Before|Not After|DNS:"
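# Count challenges that are not valid (non-zero = something is pending or failing)
# -w matches "valid" as a whole word, so "invalid" challenges are still counted
kubectl get challenge -A --no-headers | grep -vw "valid" | wc -l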
# Verify cert-manager can resolve DNS (useful for DNS-01 debugging)
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
nslookup _acme-challenge.myapp.example.com
# Check the ACME account key is present
kubectl get secret letsencrypt-prod-account-key -n cert-manager
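# Force re-read of a Certificate (useful if cert-manager missed an update)
# Any metadata change triggers a re-sync; the annotation key is arbitrary and does not by itself renew the cert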
kubectl annotate certificate myapp-tls -n default \
cert-manager.io/force-renewal="$(date +%s)" --overwrite
Quick Reference
- Runbook: Cert Renewal Failed