TLS & Certificates Ops Footguns
Mistakes that cause outages, security incidents, or painful debugging sessions with certificates and TLS.
1. Certificate Expiry Causing Production Outage
The most common TLS incident. Certificate expires at 2 AM, every client starts failing, and nobody noticed because there's no monitoring.
# Check expiry of a remote cert
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | openssl x509 -noout -dates
# Check all certs on a host
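find /etc/ssl /etc/pki \( -name "*.pem" -o -name "*.crt" \) 2>/dev/null | while IFS= read -r cert; do
  # Files that are not certificates (e.g. private keys) produce an empty expiry
  expiry=$(openssl x509 -enddate -noout -in "$cert" 2>/dev/null)
  echo "$cert: $expiry"
done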
Fix: Monitor certificate expiry with Prometheus (ssl_exporter), Nagios, or a simple cron script. Alert at 30, 14, and 7 days before expiry. Use cert-manager with auto-renewal in Kubernetes.
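As a rough sketch of the "simple cron script" option, the functions below turn a certificate's notAfter date into days remaining and map that onto the 30/14/7-day thresholds. The function names and level names (notice, warning, critical) are illustrative assumptions, and days_until_expiry assumes GNU date and openssl are on the PATH.

```shell
# Map days-until-expiry to an alert level using the 30/14/7-day thresholds.
# Level names are illustrative, not taken from any monitoring tool.
expiry_level() {
  local days=$1
  if   [ "$days" -lt 0 ];  then echo "expired"
  elif [ "$days" -le 7 ];  then echo "critical"
  elif [ "$days" -le 14 ]; then echo "warning"
  elif [ "$days" -le 30 ]; then echo "notice"
  else                          echo "ok"
  fi
}

# Print whole days (truncated toward zero) until the PEM certificate in $1 expires.
# Assumes GNU date (for `date -d`) and openssl.
days_until_expiry() {
  local end_date end_epoch
  end_date=$(openssl x509 -enddate -noout -in "$1") || return 1
  end_epoch=$(date -d "${end_date#notAfter=}" +%s) || return 1
  echo $(( (end_epoch - $(date +%s)) / 86400 ))
}
```

A cron job could then call something like expiry_level "$(days_until_expiry /path/to/cert.pem)" (the path is a placeholder) and page when the result is anything other than ok.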
War story: In 2020, Microsoft Teams went down for multiple hours because an authentication certificate expired. In September 2021, the DST Root CA X3 root, which had cross-signed Let's Encrypt's ISRG Root X1, expired, breaking TLS on millions of devices running older OS versions that had never added ISRG Root X1 to their root store. Certificate expiry is the most predictable outage category, and still one of the most common.
2. Missing Intermediate Certificate
The leaf cert is valid, browsers work (they cache intermediates), but API clients, curl, and other services fail with "unable to verify the first certificate."
# Test - this will fail if intermediates are missing
curl -v https://your-service.example.com
# Verify chain completeness
openssl s_client -connect your-service.example.com:443 -servername your-service.example.com </dev/null 2>&1 | grep -A2 "Certificate chain"
Fix: Always serve the full chain (leaf + intermediates, NOT root). Concatenate in order: cat leaf.crt intermediate.crt > fullchain.crt. Test with openssl verify -untrusted intermediate.crt leaf.crt.
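A cheap pre-deploy sanity check is to count the certificates in the bundle you are about to serve. The sketch below only counts PEM blocks; it does not validate the chain, so it complements openssl verify rather than replacing it. The function names are hypothetical.

```shell
# Count the PEM certificates in bundle file $1.
count_certs() {
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Succeed only if the bundle holds at least two certificates
# (leaf plus at least one intermediate).
looks_like_full_chain() {
  [ "$(count_certs "$1")" -ge 2 ]
}
```

For example, looks_like_full_chain fullchain.crt || echo "bundle has no intermediate" can run in a deploy script before the web server is reloaded.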
3. Wildcard Certificate Not Covering Apex Domain
*.example.com matches www.example.com and api.example.com but does NOT match example.com (the bare domain).
Fix: Include both *.example.com and example.com as SANs in the certificate. Let's Encrypt supports this in a single cert.
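The matching rule can be modelled in a few lines of shell. This is a simplified sketch, assuming the wildcard stands for exactly one left-most DNS label; real TLS libraries apply further restrictions. name_matches is a hypothetical helper, not part of any tool.

```shell
# Return 0 if hostname $2 is covered by certificate name $1.
# Simplified model: "*" stands for exactly one left-most DNS label.
name_matches() {
  local pattern host suffix label rest
  pattern=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  host=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  if [[ $pattern == \*.* ]]; then
    suffix=${pattern#\*}     # e.g. ".example.com"
    label=${host%%.*}        # left-most label of the hostname
    rest=${host#"$label"}    # remainder, starting with "."
    [[ -n $label && $rest == "$suffix" ]]
  else
    [[ $host == "$pattern" ]]
  fi
}
```

Under this model, name_matches '*.example.com' www.example.com succeeds, while both example.com and a.b.example.com fail, which is why the apex needs its own SAN.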
4. Self-Signed Certificates in Production
Started as "temporary for testing," now it's been 3 years and clients have verify=False scattered everywhere. You've effectively disabled TLS security.
Fix: Use Let's Encrypt (free, automated). For internal services, set up an internal CA with step-ca or Vault PKI. Never disable certificate verification in production code.
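To find out how far disabled verification has spread, a recursive grep is a reasonable first pass. The patterns below are a small illustrative set (Python-style verify=False, Go's InsecureSkipVerify: true, and curl -k/--insecure), chosen as assumptions about what a codebase might contain; extend them for your own languages. find_disabled_verification is a hypothetical helper name.

```shell
# List source lines under directory $1 that appear to disable TLS verification.
# The pattern set is illustrative, not exhaustive.
find_disabled_verification() {
  grep -rnE \
    -e 'verify[[:space:]]*=[[:space:]]*False' \
    -e 'InsecureSkipVerify:[[:space:]]*true' \
    -e 'curl[^|;]*[[:space:]](-k|--insecure)([[:space:]]|$)' \
    -- "$1"
}
```

Running it in CI and failing the build on any match is one way to stop new occurrences while the existing ones are cleaned up.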
5. Not Testing Certificate Renewal
Auto-renewal is set up, but nobody tested it. The renewal fails because DNS validation changed, the ACME account expired, or the web server doesn't pick up the new cert without a reload.
# Test Let's Encrypt renewal
certbot renew --dry-run
# Test cert-manager (check events)
kubectl describe certificate my-cert -n my-namespace
kubectl get events --field-selector reason=Issuing -n my-namespace
Fix: Test renewal in staging first. Monitor cert-manager Certificate resources for Ready status. Set up alerts for renewal failures. Ensure post-renewal hooks reload the service.
6. cert-manager Failing Silently in Kubernetes
Certificate resource exists but TLS secret is empty or stale. The Ingress serves a default/expired cert. Nobody checks cert-manager logs.
# Check certificate status
kubectl get certificates -A
kubectl describe certificate my-cert -n my-namespace
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
Fix: Monitor cert-manager Certificate objects for Ready=False. Set up alerts on cert-manager error logs. Check ClusterIssuer/Issuer status regularly.
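If you do not yet have metrics-based alerting, a small filter over the kubectl output can serve as a stopgap. This sketch assumes the default column layout of kubectl get certificates -A (NAMESPACE, NAME, READY, SECRET, AGE); not_ready_certificates is a hypothetical helper name.

```shell
# Read `kubectl get certificates -A` output on stdin and print "namespace/name"
# for every row whose READY column is not "True".
# Assumes the default columns: NAMESPACE NAME READY SECRET AGE
not_ready_certificates() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2 }'
}
```

Usage would look like kubectl get certificates -A | not_ready_certificates. A row with an empty READY column, such as a Certificate that has no status yet, is also reported, because the third field is then something other than True.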
7. HSTS with Long max-age Before Testing
You enable Strict-Transport-Security: max-age=31536000; includeSubDomains before confirming HTTPS works everywhere. Now browsers refuse plain HTTP for up to a year, and you cannot easily undo it for users who have already received the header.
Fix: Start with max-age=300 (5 minutes). Test thoroughly. Gradually increase to 3600, then 86400, then the final value. Only add includeSubDomains after verifying ALL subdomains support HTTPS.
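Default trap: Once a browser has received an HSTS header, the only server-side way to undo it is to serve Strict-Transport-Security: max-age=0 over a valid HTTPS connection, and that takes effect only when each affected browser next visits the site over HTTPS. Otherwise the entry persists until max-age expires. If you set max-age=31536000 (1 year) and then realize a subdomain needs plain HTTP, those users can be locked out for up to a year. The only client-side fix is manually clearing the HSTS entry in browser internals (e.g., chrome://net-internals/#hsts).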
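During a staged rollout it helps to check what a host is actually sending at each step. The helpers below are a minimal sketch that parses a Strict-Transport-Security header value passed as a string; the function names are hypothetical, and fetching the header (for example with curl -sI) is left to the caller.

```shell
# Print the max-age value (in seconds) from a Strict-Transport-Security header value.
# Prints nothing if no max-age directive is present.
hsts_max_age() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -nE 's/.*max-age="?([0-9]+)"?.*/\1/p'
}

# Succeed if the header value carries the includeSubDomains directive.
hsts_has_include_subdomains() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | grep -q 'includesubdomains'
}
```

A rollout script could refuse to proceed to the next stage unless hsts_max_age reports the value expected for the current one (300, then 3600, then 86400) and includeSubDomains is still absent.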
8. Certificate Pinning Preventing Rotation
Mobile app or client pins a specific certificate or public key. When you rotate the cert (even for renewal), all pinned clients break.
Fix: If you must pin, pin the public key of your CA (not the leaf cert). Pin backup keys. Have an unpinning mechanism. Better yet: don't pin at all — proper certificate validation is sufficient for most use cases.
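If you do pin, the value to pin is a hash of the public key rather than of the whole certificate. The pipeline below is the commonly used openssl recipe for a base64 SHA-256 hash of the SubjectPublicKeyInfo; spki_pin is a hypothetical wrapper name, and it works on any PEM certificate, including a CA certificate.

```shell
# Print the base64-encoded SHA-256 hash of the public key (SPKI)
# contained in the PEM certificate file $1.
spki_pin() {
  openssl x509 -in "$1" -pubkey -noout \
    | openssl pkey -pubin -outform DER \
    | openssl dgst -sha256 -binary \
    | openssl enc -base64
}
```

Two certificates issued for the same key pair yield the same pin, so a renewal that reuses the key does not break pinned clients, whereas a renewal that generates a new key does. Computing the pin of the backup key ahead of time lets you ship it before you need it.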
9. TLS 1.0 and 1.1 Still Enabled
Old protocols with known vulnerabilities are still accepted. Compliance scanners flag it, and it provides a false sense of security.
# Check which protocols are accepted
nmap --script ssl-enum-ciphers -p 443 example.com
# Test specific protocol
openssl s_client -connect example.com:443 -tls1_1
Fix: Disable TLS 1.0 and 1.1. Only allow TLS 1.2 and 1.3. In nginx: ssl_protocols TLSv1.2 TLSv1.3;. Test with ssllabs.com.
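Besides scanning from the outside, you can lint the configuration itself before it ships. The sketch below looks for nginx ssl_protocols directives that still list TLSv1 or TLSv1.1; legacy_tls_lines is a hypothetical helper, and it only inspects the single file it is given (it does not follow include directives).

```shell
# Print (with line numbers) the ssl_protocols directives in nginx config file $1
# that still enable TLSv1 or TLSv1.1. Commented-out lines are ignored.
legacy_tls_lines() {
  grep -nE '^[[:space:]]*ssl_protocols[^;#]*[[:space:]]TLSv1(\.1)?([[:space:]]|;)' "$1"
}
```

The function exits non-zero when nothing matches, so a check such as legacy_tls_lines nginx.conf && exit 1 can fail a CI job whenever a legacy protocol creeps back in.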
10. Different Certificate Between Load Balancer and Backend
TLS terminates at the load balancer with a valid cert, but backend connections use a different (often self-signed or expired) cert. Internal mTLS breaks, or monitoring tools report certificate errors.
Fix: Decide on your TLS architecture: terminate at LB (simpler) or end-to-end (more secure). If end-to-end, ensure backend certs are managed with the same rigor as frontend certs. For internal traffic, use a shared internal CA.
11. SNI Required But Client Doesn't Send It
Old clients or certain tools don't send the Server Name Indication extension. The server returns the wrong certificate (default/first configured vhost).
# Test with explicit SNI
openssl s_client -connect server:443 -servername correct.example.com
# Test without SNI (what old clients see); OpenSSL 1.1.1+ sends SNI by default, so disable it explicitly
openssl s_client -connect server:443 -noservername
Fix: Ensure all clients support SNI (virtually everything modern does). For the rare legacy client, consider a dedicated IP per certificate. Configure a sensible default certificate on your server.
12. Weak Cipher Suites Still Accepted
Server accepts export-grade ciphers, RC4, 3DES, or other weak ciphers. Attackers can downgrade connections.
# Check cipher suites
nmap --script ssl-enum-ciphers -p 443 example.com | grep -E "least strength|weak|grade"
Fix: Configure strong cipher suites explicitly. Use Mozilla's SSL Configuration Generator (ssl-config.mozilla.org) for recommended settings per web server. Regularly scan with testssl.sh.
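When you have a list of suites a server accepts (from a scanner or a config file), a name-based filter gives a quick first cut. This is a rough sketch: it flags suites whose names contain tokens such as RC4, 3DES, EXPORT, or NULL, in either OpenSSL or IANA naming. The token list and the weak_ciphers name are assumptions for illustration, and a name match is no substitute for a proper scan.

```shell
# Read cipher suite names on stdin (one per line) and print those whose
# names contain a token from a small list of weak families.
# Tokens are matched between "-", "_", whitespace, or line boundaries.
weak_ciphers() {
  grep -E '(^|[-_[:space:]])(RC4|3DES|DES|EXPORT|EXP|NULL|MD5|anon|ADH|AECDH)([-_[:space:]]|$)'
}
```

Anything this prints is worth removing from the server's cipher list before running a full scan to confirm.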