The Secret Rotation We Postponed

Category: The Hard Lesson
Domains: secrets-management, security
Read time: ~5 min


Setting the Scene

Our main API key for the Stripe payment integration was 26 months old. I know because I ran git log --all -S 'sk_live_' during the incident and found the commit that introduced it. The key had been committed directly to the repo in a .env.production file in 2021. The .gitignore entry for .env* was added three months later, but .gitignore only affects untracked files, and nobody ran git rm --cached on the file that was already tracked. So the file stayed in the repo and in every clone.
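A check for this gap is small enough to sketch. The version below is a minimal illustration, not what we ran at the time: the function names are hypothetical, and the matching is deliberately simplified to compare only the file name against a bare glob like .env*, which is not full .gitignore semantics.

```python
import fnmatch
import subprocess
from pathlib import PurePosixPath


def tracked_paths(repo_dir="."):
    """Return every path Git currently tracks in repo_dir (via `git ls-files`)."""
    result = subprocess.run(
        ["git", "ls-files"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def tracked_despite_ignore(paths, ignore_globs=(".env*",)):
    """Return tracked paths whose file name matches one of the ignore globs.

    Simplification: only the final path component is compared, which is
    enough for a bare pattern such as `.env*` but not for every .gitignore rule.
    """
    hits = []
    for path in paths:
        name = PurePosixPath(path).name
        if any(fnmatch.fnmatchcase(name, glob) for glob in ignore_globs):
            hits.append(path)
    return hits
```

Calling tracked_despite_ignore(tracked_paths()) inside a checkout lists the files that still need git rm --cached. Git can answer the same question with full ignore semantics through git ls-files -c -i --exclude-standard. Either way, untracking the file does not remove the secret from history, so the key still has to be rotated.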

We had 15 services that used this key. Twelve of them loaded it from environment variables set during deploy. Three of them — and this is the part that still makes me cringe — had it hardcoded as a string literal in their source code.

What Happened

On a Thursday at 4:30 PM, our security team got an automated alert from GitHub's secret scanning: a Stripe live API key was exposed in a public repository. It was not our repository. It was one under a contractor's personal GitHub account. They had cloned our repo six months earlier to work on the payment integration, and the copy on their account was public.

The clock started. Stripe's incident response documentation says to rotate immediately. Simple enough if you have one service with the key in an environment variable. We had 15.

I started the rotation by generating a new secret key in the Stripe dashboard. Then I hit the first wall: which services use this key? There was no registry and no secrets inventory. I spent 40 minutes running grep -r "sk_live" . in each of our 15 repos. The key turned up in environment variable configs, Docker Compose files, Terraform tfvars, and three Go source files.
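In hindsight, those 40 minutes of per-repo grepping could have been one script run from a directory holding all the checkouts. The sketch below is an assumption-laden illustration: the regular expression only assumes that live keys start with sk_live_ followed by letters and digits, and SKIP_DIRS and find_key_references are names I made up for the example.

```python
import os
import re
from pathlib import Path

# Assumption: a live key looks like "sk_live_" followed by letters and digits.
KEY_PATTERN = re.compile(r"sk_live_[0-9A-Za-z]+")
# Hypothetical list of directories that are not worth scanning.
SKIP_DIRS = {".git", "node_modules", "vendor"}


def find_key_references(root):
    """Yield (path, line_number, redacted_match) for key-like strings under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            path = Path(dirpath) / filename
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            for line_number, line in enumerate(text.splitlines(), start=1):
                for match in KEY_PATTERN.finditer(line):
                    # Keep only a short prefix so the report does not leak the key.
                    yield str(path), line_number, match.group()[:12] + "..."
```

The output is a list of file and line locations grouped by whatever directory layout the checkouts have, which is roughly the inventory I was building by hand. It only sees the working tree, not history or secrets already deployed to running hosts.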

The environment variable services were straightforward — update the variable in our deployment platform (a mix of AWS Parameter Store and Kubernetes Secrets), redeploy. That took about 3 hours because each service had its own deploy pipeline and some needed approval gates.

The three hardcoded services were the nightmare. One was a batch processor that ran on a bare EC2 instance with no CI/CD pipeline — deploys were done by SSH and git pull. One was a Lambda function packaged as a zip file, and the deployment tooling had been broken for months. The third was a legacy cron job running on the same server as the batch processor, started via a screen session that would die if the process was restarted.

At 2:00 AM — 9.5 hours into the rotation — we had 14 of 15 services updated. The Lambda function was the holdout. The deployment tooling required an IAM role that had been deleted during an access review. We ended up manually uploading a zip through the AWS console.

The full rotation took 72 hours. We burned the entire weekend.

The Moment of Truth

On the Saturday morning of the rotation, sitting in a conference room surrounded by pizza boxes and empty coffee cups, our CISO asked a simple question: "How often do we rotate our secrets?" The answer was never. Not once. In two years and three months, not a single secret across the organization had been rotated.

The Aftermath

We adopted HashiCorp Vault. Every service got a Vault-aware client that requested short-lived credentials at startup. The Stripe key became a Vault secret with a 90-day rotation policy. We eliminated every hardcoded secret — all three services were refactored to read from Vault or environment variables within a month.
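For the environment-variable half of that refactor, the pattern is roughly the one below. This is a minimal sketch rather than our actual client code: the variable name STRIPE_API_KEY, the load_secret helper, and the MissingSecretError type are all hypothetical, and the Vault path is not shown.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def load_secret(name, env=None):
    """Return the secret stored under `name`, or fail loudly at startup.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real configuration.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Failing at startup beats discovering a missing key on the first charge.
        raise MissingSecretError(f"required secret {name} is not set")
    return value
```

A service would call load_secret("STRIPE_API_KEY") once at startup. The point of the design is that the key never appears in source, so rotating it becomes a config change plus a redeploy instead of a code change.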

We also ran trufflehog against every repo's full git history and found 23 additional secrets that had been committed and "removed" without history rewriting. Each one got rotated.
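To see why a "removed" secret is still exposed, it helps to look at what a history scan actually reads. The sketch below is not how trufflehog works internally; it is a simplified illustration that assumes its input is the text output of git log --all -p and that secrets match a single hypothetical pattern.

```python
import re

# Assumption: a live key looks like "sk_live_" followed by letters and digits.
KEY_PATTERN = re.compile(r"sk_live_[0-9A-Za-z]+")


def secrets_added_in_history(log_text, pattern=KEY_PATTERN):
    """Return (commit, file, redacted_match) for every added line that matches.

    `log_text` is expected to be the output of `git log --all -p`.
    """
    findings = []
    commit = None
    path = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+++ "):
            # "+++ b/<path>" names the file on the new side of the diff.
            path = line[6:] if line.startswith("+++ b/") else line[4:]
        elif line.startswith("+"):
            # Added lines stay in history even if a later commit deletes them.
            for match in pattern.finditer(line):
                findings.append((commit, path, match.group()[:12] + "..."))
    return findings
```

The commit that deletes the file shows the key only on a line starting with a minus sign, so it adds nothing to the findings, but the original commit is still reported. That is the situation all 23 of those secrets were in.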

The Lessons

  1. Rotate secrets regularly: Unrotated secrets are a ticking clock. The longer they live, the more places they spread, and the harder they are to change when you need to change them fast.
  2. Use secret management tools: Vault, AWS Secrets Manager, even sealed Kubernetes Secrets — anything is better than strings in source code or static environment files.
  3. Hardcoded secrets are a time bomb: Every hardcoded secret is a rotation you can't automate and an incident you'll handle manually at 2 AM.

What I'd Do Differently

I'd deploy Vault (or Secrets Manager) before deploying the first service that needs a secret. I'd add trufflehog or gitleaks to every CI pipeline as a required check. I'd maintain a secrets inventory — a list of every external credential, which services use it, where it's stored, when it was last rotated, and how to rotate it. And I'd set calendar reminders for rotation deadlines, because "we'll rotate it quarterly" turns into "we'll rotate it eventually" without enforcement.
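The inventory does not need to be elaborate. Here is a minimal sketch of the record I have in mind, with a check for overdue rotations; the field names, the SecretRecord type, and the 90-day default are illustrative assumptions modeled on the policy we set for the Stripe key.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SecretRecord:
    name: str            # which external credential this is
    services: tuple      # every service that uses it
    stored_in: str       # where the value lives (Vault path, parameter name, ...)
    last_rotated: date   # when it was last rotated
    runbook: str         # how to rotate it
    max_age_days: int = 90

    def due_date(self):
        """Date by which the next rotation must have happened."""
        return self.last_rotated + timedelta(days=self.max_age_days)


def overdue(records, today):
    """Return records past their rotation deadline, most overdue first."""
    late = [record for record in records if record.due_date() < today]
    return sorted(late, key=lambda record: record.due_date())
```

Run from a scheduled CI job with today set to date.today(), a non-empty result can fail the build or open a ticket, which is closer to enforcement than a calendar reminder. It also answers the question that cost me 40 minutes during the incident: which services use this key.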

The Quote

"The key was 26 months old, committed to git, copied to a contractor's public repo, and hardcoded in three services. Other than that, our secrets management was fine."

Cross-References