The Permissions Avalanche¶
Category: The Incident · Domains: iam, security · Read time: ~5 min
Setting the Scene¶
Large media company, about 3,000 employees. I was on the cloud platform team — seven of us managing AWS for the whole org. We had around 40 AWS accounts in an Organization, with IAM policies managed through Terraform. Our CI/CD ran on GitHub Actions using an IAM role with fairly broad permissions. The security team had been pressuring us for months to implement least-privilege. Fair enough. We started on a Friday afternoon.
What Happened¶
Friday 3:30 PM — I open a PR to restrict the CI/CD role's IAM policy. The old policy had s3:*, ecr:*, ecs:*, and iam:PassRole on *. I scoped it down to specific S3 buckets, specific ECR repos, specific ECS clusters. Felt great about it. My teammate reviewed it, approved it. I merged and applied the Terraform.
Friday 3:45 PM — Nothing breaks. I close my laptop. Weekend.
Monday 8:30 AM — Slack is on fire. The deploy pipeline has been broken since Saturday morning. A team tried to deploy their service Saturday at 10 AM for a scheduled release. GitHub Actions couldn't push to ECR because I'd scoped the policy to specific repo ARNs — and their service used a repo naming pattern I didn't account for. The ECR repo was app-services/payment-gateway and my policy only allowed app-services/*-service. The word "gateway" didn't match *-service.
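The mismatch is easy to reproduce with shell-style glob matching, which behaves like IAM's `*` wildcard for this case. The repo names are from the incident; the snippet itself is just an illustration, not how IAM evaluates policies internally:

```python
from fnmatch import fnmatch

# The resource pattern from the scoped-down policy.
pattern = "app-services/*-service"

# A repo that follows the naming convention matches...
assert fnmatch("app-services/user-service", pattern)

# ...but "payment-gateway" doesn't end in "-service", so it falls
# outside the pattern and the ECR push is denied.
assert not fnmatch("app-services/payment-gateway", pattern)
```

This is exactly why reading the policy diff wasn't enough: the pattern looks reasonable until you check it against every real repo name.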
Monday 8:45 AM — I go to fix the IAM policy. My Terraform apply fails. Why? Because the CI/CD role also runs Terraform, and I'd removed the IAM write permissions that Terraform needs to modify the role's own policy. The CI/CD pipeline can't fix itself because I removed its ability to do so.
Monday 9:00 AM — I try to apply Terraform locally with my admin credentials. MFA prompt. My YubiKey is at the office. I'm remote. My teammate's YubiKey is at his house. He's on PTO.
Monday 9:30 AM — I find our break-glass procedure document. It's a Google Doc from 2019. It references an IAM user called emergency-admin with a password in our Vault instance. I log into Vault. The emergency-admin credentials expired 8 months ago.
Monday 9:45 AM — I call our AWS TAM. They can't modify our IAM policies (obviously). They suggest using the root account. I find the root account email in our wiki, try to log in, and hit the MFA challenge, which goes to a phone number belonging to our former CTO.
Monday 10:15 AM — Our current CTO calls the former CTO. She finds the old phone in a drawer. She reads us the MFA code. We log in with root, revert the IAM policy.
The Moment of Truth¶
I'd locked the CI/CD pipeline out of its own ability to deploy — including deploying the fix. And every break-glass path was broken too. We had three layers of emergency access, and all three had rotted. The IAM change was correct in principle but catastrophic in execution.
The Aftermath¶
We implemented IAM changes in a staged rollout: first in the dev account for a week, then staging, then production. We created a dedicated "IAM management" role separate from the CI/CD execution role, so restricting deploy permissions couldn't lock out infrastructure changes. The break-glass procedure got completely rebuilt with quarterly rotation drills. And we added a hard rule: no IAM changes after Wednesday, ever.
The Lessons¶
- Test IAM changes in staging first: IAM policy changes can have blast radius you can't predict from reading the policy document. Apply them in lower environments and let them bake.
- Have working break-glass procedures: Emergency access mechanisms that aren't tested regularly will be broken when you need them. Drill quarterly.
- Never make IAM changes on Friday: Or any day when you won't be around to deal with the fallout for 48 hours. IAM changes are high-risk, low-visibility — problems surface only when something tries to use the affected permissions.
What I'd Do Differently¶
I'd separate the IAM management plane from the CI/CD execution plane completely — different roles, different pipelines. I'd also implement a "shadow mode" for IAM changes: apply the new policy alongside the old one and log which requests would be denied by the new policy, without actually denying them. AWS Access Analyzer can help with this. Only after a week of clean logs would I cut over.
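For real AWS policies you'd lean on IAM Access Analyzer and CloudTrail for this, but the shadow-mode idea itself is simple enough to sketch. Everything below is hypothetical (the policy representation, the function names, the example ARN) and only illustrates the concept: serve traffic from the old policy while logging what the new one would deny.

```python
from fnmatch import fnmatch

# Hypothetical simplified policies: lists of (action pattern, resource pattern).
OLD_POLICY = [("ecr:*", "*")]
NEW_POLICY = [("ecr:PutImage", "arn:aws:ecr:*:*:repository/app-services/*-service")]

def allowed(policy, action, resource):
    """True if any statement's action and resource patterns both match."""
    return any(fnmatch(action, a) and fnmatch(resource, r) for a, r in policy)

def shadow_check(action, resource):
    """Enforce the old policy, but log requests the new policy would break."""
    if allowed(OLD_POLICY, action, resource) and not allowed(NEW_POLICY, action, resource):
        print(f"SHADOW-DENY: {action} on {resource}")
    return allowed(OLD_POLICY, action, resource)

# The Saturday deploy would have surfaced in the shadow logs, not as an outage.
shadow_check("ecr:PutImage",
             "arn:aws:ecr:us-east-1:123456789012:repository/app-services/payment-gateway")
```

A week of zero `SHADOW-DENY` lines is the signal that it's safe to cut over; any hit is a naming pattern you missed, found without breaking a pipeline.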
The Quote¶
"We made the system more secure by making it completely unusable. Mission accomplished, I guess."
Cross-References¶
- Topic Packs: AWS IAM, Security Basics, CI/CD Pipelines & Patterns
- Case Studies: Cross-Domain