Thinking Out Loud: Secrets Management

A senior SRE's internal monologue while working through a real secrets incident. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

Security team sends an urgent Slack message: "An AWS access key was found in a public GitHub repository. The key belongs to the deploy-bot IAM user. We need to rotate it immediately and determine the blast radius."

The Monologue

Leaked AWS key in a public repo. This is a Priority Zero. Let me think about the order of operations here. The instinct is to revoke the key immediately, but if I do that without knowing what depends on it, I might cause an outage on top of a security incident. But the key is PUBLIC — every scanner bot has already harvested it. Speed matters more than caution here.

Actually, no. AWS has probably already detected this. Let me check.

aws iam list-access-keys --user-name deploy-bot --output json

Two access keys. Let me find the leaked one.

aws iam get-access-key-last-used --access-key-id AKIA5EXAMPLE12345678

Last used 2 hours ago from IP 54.x.x.x. That's our CI server. Okay, so the key is actively in use by our infrastructure. If I revoke it, CI breaks. If I don't revoke it, attackers can use it.

Mental Model: Revoke First, Fix the Blast Radius Second

When a credential is publicly exposed, the window for exploitation is measured in MINUTES. Automated scanners constantly scrape GitHub for AWS keys. The risk of an attacker using the key is far greater than the risk of a CI outage. Revoke immediately, then deal with the fallout. A CI outage is annoying; an AWS account compromise is catastrophic.
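If I wanted "revoke first" to be a reflex rather than a command I reconstruct under pressure, a small helper could wrap it. This is only a sketch: deactivate_keys and the AWS_CLI override are names I'm making up here, and the override exists so the function can be dry-run with echo before it touches a real account.

```shell
#!/usr/bin/env bash
# Sketch: deactivate one or more access keys for an IAM user.
# AWS_CLI is a hypothetical override; set AWS_CLI=echo to print the
# commands instead of running them.
AWS_CLI="${AWS_CLI:-aws}"

deactivate_keys() {
  local user="$1"
  shift
  local key_id
  for key_id in "$@"; do
    # Inactive (not deleted) leaves room to reactivate after a mistake
    "$AWS_CLI" iam update-access-key \
      --user-name "$user" \
      --access-key-id "$key_id" \
      --status Inactive || return 1
  done
}

# Example (dry run):
#   AWS_CLI=echo deactivate_keys deploy-bot AKIA5EXAMPLE12345678
```

Stopping at the first failure is a deliberate choice in this sketch: if one deactivation fails, I want to see that immediately rather than assume every key is dead.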

Revoking the key now.

aws iam update-access-key --access-key-id AKIA5EXAMPLE12345678 --status Inactive --user-name deploy-bot

Key is now inactive. Not deleted — inactive. I can reactivate it if I made a mistake, but no one can use it. Now let me create a new key.

aws iam create-access-key --user-name deploy-bot --output json | jq '{AccessKeyId: .AccessKey.AccessKeyId, SecretAccessKey: .AccessKey.SecretAccessKey}'

New key created. I need to update every system that uses the old key. Let me find them.

# Check where the deploy-bot credentials are used
aws iam get-access-key-last-used --access-key-id AKIA5EXAMPLE12345678 | jq '.AccessKeyLastUsed'

This call only reports the last-used date, service, and region. The service is STS: the key is used to assume a deployment role (deploy-role). Let me check what services and CI pipelines use this.

I know the deploy-bot key is stored in three places: GitHub Actions secrets, HashiCorp Vault, and a legacy Jenkins instance. Let me update all three.

# Update GitHub Actions secret
gh secret set AWS_ACCESS_KEY_ID -b "AKIA5NEWKEY123456789" -R our-org/our-repo
gh secret set AWS_SECRET_ACCESS_KEY -b "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" -R our-org/our-repo

Wait — I shouldn't paste the secret key in the command line. It'll be in bash history. Let me do this properly.

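# Pass values on stdin so the secret never appears in argv or shell history
# (the access key ID is an identifier, not a secret, so a here-string is fine)
gh secret set AWS_ACCESS_KEY_ID -R our-org/our-repo <<< "AKIA5NEWKEY123456789"
# Prompt silently for the secret key, pipe it in, then drop it from the shell
read -rs secret_key
printf '%s' "$secret_key" | gh secret set AWS_SECRET_ACCESS_KEY -R our-org/our-repo
unset secret_key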

Mental Model: Secrets Hygiene During Rotation

When rotating secrets, don't create new exposure points. Don't paste secrets in command lines (bash history), don't write them to log files, don't put them in Slack messages. Use stdin, environment variables, or files with 600 permissions. The goal is zero exposure surface during the rotation.
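The "files with 600 permissions" option can be sketched as a helper that reads the secret from stdin and parks it in an owner-only temp file. write_secret_file is a hypothetical name, and the sketch assumes mktemp is available on the box I'm working from.

```shell
#!/usr/bin/env bash
# Sketch: store a secret read from stdin in a private temp file
# and print the file's path.
write_secret_file() {
  local path
  # umask 077 inside the subshell so the file is owner-only (600)
  path="$(umask 077 && mktemp "${TMPDIR:-/tmp}/secret.XXXXXX")" || return 1
  # Copy stdin into the file; the secret never appears in argv
  cat > "$path"
  printf '%s\n' "$path"
}

# Example:
#   read -rs secret_key
#   f="$(printf '%s' "$secret_key" | write_secret_file)"
#   ...hand "$f" to the tool that needs it, then: rm -f "$f"; unset secret_key
```

The file still exists on disk until it is removed, so the cleanup step matters as much as the permissions.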

Now Vault.

# kv put replaces the whole secret, so write both fields in one command.
# The secret key is read from a file created beforehand with 600 permissions.
vault kv put secret/deploy-bot/aws \
  access_key_id="AKIA5NEWKEY123456789" \
  secret_access_key=@/tmp/newkey.txt
rm -f /tmp/newkey.txt

Now the legacy Jenkins. This is the hardest one because Jenkins stores credentials in its own credential store and there's no clean CLI for it.

# Jenkins credential update via CLI
# Arguments: store, domain ("_" is the global domain), credential ID
java -jar jenkins-cli.jar -s http://jenkins.internal update-credentials-by-xml system::system::jenkins _ deploy-bot-aws < /tmp/jenkins-cred.xml

All three updated. Let me verify CI is working with the new key by triggering a test build.

gh workflow run ci.yml -R our-org/our-repo

Now, blast radius assessment. What could an attacker have done with this key in the time it was exposed?

# Check CloudTrail for any unauthorized usage
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA5EXAMPLE12345678 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[*].EventName' \
  --output json | jq -r '.[]' | sort | uniq -c | sort -rn

Let me look for events from unexpected IPs.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA5EXAMPLE12345678 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --output json | jq -r '.Events[] | .CloudTrailEvent' | jq -r '.sourceIPAddress' | sort | uniq -c | sort -rn
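With more than a handful of source IPs, eyeballing the list stops working. A sketch of filtering them against a known-good allowlist follows; unexpected_ips and the allowlist file (one IP per line) are assumptions of mine, not something that already exists in our tooling.

```shell
#!/usr/bin/env bash
# Sketch: print source IPs from stdin that are NOT on an allowlist.
# Usage: unexpected_ips ALLOWLIST_FILE < ips.txt
unexpected_ips() {
  local allowlist="$1"
  # sort -u   de-duplicate the input
  # grep -F   fixed strings, -x whole-line match, -v invert, -f patterns from file
  # || true   "nothing unexpected" (grep exit 1) is not an error here
  sort -u | grep -v -x -F -f "$allowlist" || true
}

# Example, fed by the jq pipeline that extracts .sourceIPAddress:
#   ... | jq -r '.sourceIPAddress' | unexpected_ips known-ips.txt
```

Whole-line fixed-string matching is intentional: a regex match would let 10.0.0.1 accidentally excuse 10.0.0.10.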

Two IPs: our CI server IP (expected) and... a Tor exit node IP. Someone already used the key. Let me check what they did.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA5EXAMPLE12345678 \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --output json | jq -r '.Events[] | .CloudTrailEvent' | jq -r 'select(.sourceIPAddress == "185.x.x.x") | .eventName'

ListBuckets, GetCallerIdentity, DescribeInstances, CreateAccessKey. They listed our buckets, checked the identity, enumerated EC2 instances, and... tried to create a new access key. Let me check if they succeeded.

aws iam list-access-keys --user-name deploy-bot

Three keys now. The attacker created a new key for persistence. I need to revoke ALL keys except the one I just created.

aws iam delete-access-key --user-name deploy-bot --access-key-id AKIA5ATTACKER999999
aws iam delete-access-key --user-name deploy-bot --access-key-id AKIA5EXAMPLE12345678
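Typing key IDs by hand under pressure is how the wrong key gets deleted. A sketch of computing "everything except the key I'm keeping" from a list of key IDs is below; keys_to_remove is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: from access key IDs on stdin (one per line),
# print every key except the one to keep.
keys_to_remove() {
  local keep="$1"
  # -F fixed string, -x whole-line match, -v invert; "--" guards odd input
  grep -v -x -F -- "$keep" || true
}

# Example (review the output before piping it into delete-access-key):
#   aws iam list-access-keys --user-name deploy-bot \
#     --query 'AccessKeyMetadata[].AccessKeyId' --output json \
#     | jq -r '.[]' | keys_to_remove AKIA5NEWKEY123456789
```

The helper only prints candidates. Keeping the destructive step separate means I can read the list before anything is deleted.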

Both deleted. Now let me check if they did anything with the assumed role. The deploy-bot can assume the deploy-role, which has broader permissions.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=deploy-bot \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --output json | jq -r '.Events[] | .CloudTrailEvent' | jq -r 'select(.sourceIPAddress == "185.x.x.x") | "\(.eventTime) \(.eventName) \(.requestParameters)"' | head -20

They assumed the deploy-role and used it to... let me check. AssumeRole succeeded. Then they called ListBuckets and GetObject on a bucket. Let me check which bucket.

This is now a security incident requiring the full incident response process. I need to:

  1. Notify the security team with the CloudTrail evidence
  2. Check what data was accessed from the bucket
  3. Restrict the deploy-bot IAM policy to the minimum required permissions
  4. Add an SCP that requires MFA for the deploy-bot

But the immediate crisis is handled: all compromised keys are revoked, the attacker's persistence key is deleted, and the new key is in place.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
|---|---|---|
| Research what the key is used for before revoking it | Revoke immediately, then fix the dependencies | Public credential exposure is a P0: minutes matter more than avoiding a CI outage |
| Paste the new secret in the command line | Use stdin, files, and environment variables during rotation | Rotating a secret while creating new exposure points defeats the purpose |
| Revoke the old key and create a new one | Check CloudTrail for unauthorized usage and look for attacker persistence (extra keys) | Attackers create new credentials for persistence, so revoking the leaked key isn't enough |
| Only check for the leaked key's usage | Check for assumed role usage from the compromised key | If the key can assume roles, the blast radius extends to everything those roles can access |

Key Heuristics Used

  1. Revoke First, Fix Later: When a credential is publicly exposed, revoke it within minutes. A CI outage is cheaper than a data breach.
  2. Secrets Hygiene During Rotation: Never create new exposure during rotation. Use stdin, files, and secure channels.
  3. Check for Persistence: Attackers create backdoors. After revoking a leaked key, check CloudTrail for new keys, IAM changes, Lambda functions, or EC2 instances created by the attacker.
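
The persistence check in heuristic 3 can be roughed out as a filter over CloudTrail event names. This is a sketch under assumptions: flag_persistence_events is a hypothetical name, and the event list is illustrative rather than exhaustive, so a clean result does not prove the attacker left nothing behind.

```shell
#!/usr/bin/env bash
# Sketch: flag CloudTrail event names (one per line on stdin) that
# commonly indicate an attacker establishing persistence.
flag_persistence_events() {
  # Prefix match so versioned names such as CreateFunction20150331 are caught.
  # || true: finding nothing is a valid result, not a failure.
  grep -E '^(CreateAccessKey|CreateUser|CreateLoginProfile|AttachUserPolicy|PutUserPolicy|CreateRole|RunInstances|CreateFunction)' || true
}

# Example, fed by a jq pipeline that extracts .eventName:
#   ... | jq -r '.eventName' | flag_persistence_events | sort | uniq -c
```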

Cross-References

  • Primer — Secret storage models, rotation patterns, and zero-trust principles
  • Street Ops — Vault operations, AWS IAM key rotation, and GitHub secret management
  • Footguns — Committing secrets to git, not checking CloudTrail after a leak, and long-lived IAM access keys