
The 3 AM Cert Expiry

Category: The Incident · Domains: tls, monitoring · Read time: ~5 min


Setting the Scene

I was the sole SRE at a 40-person fintech startup. We had about a dozen microservices sitting behind an Nginx reverse proxy, fronted by a Cloudflare CDN for most traffic. But our mobile app hit our API gateway directly — no CDN, just a Let's Encrypt cert on a bare ALB. I was pretty proud of our monitoring stack: Prometheus, Grafana, PagerDuty, the works. We had a cert expiry check and everything. Life was good.

What Happened

2:47 AM — PagerDuty wakes me up. Not for the cert. For a spike in 5xx errors from the mobile API. I stumble to my laptop, open Grafana, and sure enough, our mobile error rate is at 100%. Web is fine. Internal services are fine. Just mobile.

2:53 AM — I SSH into the API gateway box. Nginx is running. I curl localhost:8080 — response is fine. I curl the public endpoint — SSL handshake failure. My stomach drops.

3:01 AM — openssl s_client -connect api.ourfintechapp.com:443 tells me the cert expired 47 minutes ago. But wait — we have a cert monitor. I check it. The check is green. Has been green for months. Because the check is hitting www.ourfintechapp.com, which is behind Cloudflare, which has its own cert. Nobody ever pointed the check at the API endpoint.
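That outside-in check is worth scripting rather than eyeballing. A minimal sketch, assuming GNU date (the helper name cert_days_left is mine, not from our tooling):

```shell
# cert_days_left: read a PEM cert on stdin, print whole days until it expires.
cert_days_left() {
  local end
  end=$(openssl x509 -noout -enddate | cut -d= -f2)   # e.g. "notAfter=Jun  1 ..."
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}

# Check the cert mobile clients actually see (-servername makes it SNI-aware):
#   echo | openssl s_client -connect api.ourfintechapp.com:443 \
#            -servername api.ourfintechapp.com 2>/dev/null | cert_days_left
```

Had this been pointed at the API hostname instead of the www one, it would have been printing a shrinking number for weeks.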

3:12 AM — I try to run certbot renew. It fails because port 80 is blocked by a security group rule someone added three weeks ago for "compliance." The ACME HTTP-01 challenge can't complete.

3:24 AM — I open the security group, run certbot, reload Nginx. Certs are valid. Mobile app recovers. I close the security group back up and open a beer at 3:30 AM on a Tuesday.

3:45 AM — I check our Slack. The on-call support person had been fielding customer complaints since 2:15 AM. The cert actually expired at 2:00 AM. We had 47 minutes of customer impact before alerting fired — and even then, it was the symptom, not the cause.

The Moment of Truth

The monitor was green the entire time. It was checking the wrong endpoint. We had cert monitoring that gave us confidence without giving us coverage. That's worse than no monitoring at all, because at least with no monitoring you know you're flying blind.

The Aftermath

I added per-endpoint cert checks for every public-facing hostname within 24 hours. We set up certbot with DNS-01 challenges so port 80 blocks couldn't break renewals. I also added a cron job that would alert 30, 14, and 7 days before any cert expired. The real postmortem action item that stuck: we created a "what does the user actually hit?" document for every service and made sure monitoring reflected that list.
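The 30/14/7-day check can be sketched with openssl's -checkend flag, which exits non-zero when a cert expires within a given window. This is a hedged sketch, not our exact cron job; the helper name and the page_oncall hook are placeholders for whatever alerting you actually use:

```shell
# cert_expires_within FILE DAYS: succeed when the PEM cert in FILE expires
# within DAYS days. -checkend takes seconds and exits 1 inside the window.
cert_expires_within() {
  ! openssl x509 -noout -checkend $(( $2 * 86400 )) -in "$1" >/dev/null
}

# Cron body (sketch): pull the cert each public hostname actually serves,
# then fire the alert hook at each threshold:
#   echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
#     | openssl x509 > "$host.pem"
#   for days in 30 14 7; do
#     cert_expires_within "$host.pem" "$days" && page_oncall "$host: <$days days"
#   done
```

The important design choice is that the loop runs over the "what does the user actually hit?" list, not over whatever hostnames happen to be convenient.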

The Lessons

  1. Monitor what users hit, not what's convenient: A cert check on the wrong endpoint is theater, not monitoring. Every public hostname needs its own check.
  2. Automate renewal AND test the automation: Certbot was set up but couldn't actually run because of a firewall change. Renewal automation must be tested regularly.
  3. Have a runbook for cert emergencies: At 3 AM, you don't want to be figuring out which security groups to modify. Write it down when you're awake.

What I'd Do Differently

I'd run a monthly "cert fire drill" — intentionally verify that certbot can complete a renewal for every domain, not just trust that the cron is there. I'd also set up synthetic monitoring from an external provider (like Checkly or Uptime Robot) hitting every public endpoint over HTTPS, so we get an outside-in view that doesn't share our blind spots.
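Certbot has a built-in way to run exactly this drill: renew --dry-run exercises the full challenge flow against Let's Encrypt's staging environment without touching live certs. A crontab sketch, with the schedule and alert hook as placeholders:

```shell
# Crontab fragment: renewal fire drill on the 1st of each month at 04:00.
# --dry-run talks to the staging CA, so failures surface without risking live certs.
0 4 1 * * certbot renew --dry-run || echo "cert renewal drill FAILED" | mail -s "cert drill" oncall@ourfintechapp.com
```

This would have caught the blocked port 80 the month the "compliance" rule landed, instead of at 3:12 AM.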

The Quote

"The most dangerous monitoring is the kind that's green when it shouldn't be."

Cross-References