# Cloud Ops Footguns
Mistakes that blow up your bill, lock you out, or take down your infrastructure.
## 1. Admin IAM keys on a developer laptop
You created an IAM user with AdministratorAccess and put the keys in ~/.aws/credentials. Your laptop gets stolen, or the keys end up in a git commit. The attacker now has full control of your AWS account.
Fix: Use SSO/federation. No long-lived access keys. If you must use keys, scope them to minimum permissions and rotate quarterly. Use aws-vault for encrypted credential storage.
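One cheap guardrail against the git-commit leak: scan text (a staged diff, a config file, CI logs) for access key IDs before it leaves your machine. A minimal sketch; the helper name is illustrative, but the `AKIA` prefix format is the real shape of AWS long-lived access key IDs.

```python
import re

# Long-lived AWS access key IDs start with "AKIA" followed by
# 16 uppercase alphanumeric characters.
ACCESS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_leaked_keys(text: str) -> list[str]:
    """Return any access key IDs found in a blob of text."""
    return ACCESS_KEY_RE.findall(text)

# Hypothetical staged diff about to be committed:
diff = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
print(find_leaked_keys(diff))  # ['AKIAIOSFODNN7EXAMPLE']
```

Wiring this into a pre-commit hook catches the leak before it reaches the remote; tools like gitleaks do the same job more thoroughly.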
## 2. Public S3 buckets
You create an S3 bucket and set it to public because "it's just static assets." Six months later, someone uploads a database backup to the same bucket. It's now public. Bots scan for this constantly.
Fix: Enable S3 Block Public Access at the account level. Use bucket policies, not ACLs. Audit bucket permissions with AWS Config rules.
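Part of that audit can be a script over your bucket policies: any `Allow` statement whose principal is `*` makes the bucket world-readable. A sketch against the standard bucket-policy JSON shape; the policy content below is a made-up example.

```python
def public_statements(policy: dict) -> list[dict]:
    """Flag policy statements that grant access to everyone."""
    flagged = []
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        is_public = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*"
        )
        if stmt.get("Effect") == "Allow" and is_public:
            flagged.append(stmt)
    return flagged

# Hypothetical policy: public read on the whole bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-assets/*",
    }],
}
print(len(public_statements(policy)))  # 1
```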
## 3. Deleting the only NAT Gateway
You're cleaning up unused resources to save money. You delete the NAT Gateway because "nothing uses it." Actually, every pod in your EKS private subnets uses it for outbound internet access (pulling images, calling APIs). Everything breaks.
Fix: Before deleting any networking resource, check route tables to see what depends on it. Tag resources with their purpose.
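The route-table check can be automated: anything with a route whose target is the NAT gateway depends on it. A minimal sketch over data in the shape `aws ec2 describe-route-tables` returns; the IDs below are hypothetical.

```python
def tables_using_nat(route_tables: list[dict], nat_id: str) -> list[str]:
    """Return route table IDs that route traffic through the given NAT gateway."""
    return [
        rt["RouteTableId"]
        for rt in route_tables
        if any(r.get("NatGatewayId") == nat_id for r in rt.get("Routes", []))
    ]

# Hypothetical data: private subnets route 0.0.0.0/0 via the NAT gateway.
tables = [
    {"RouteTableId": "rtb-private", "Routes": [
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"}]},
    {"RouteTableId": "rtb-public", "Routes": [
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-123"}]},
]
print(tables_using_nat(tables, "nat-0abc"))  # ['rtb-private']
```

A non-empty result means deleting the gateway will strand every subnet behind those route tables.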
## 4. Security Group with 0.0.0.0/0 on port 22
You add SSH access from anywhere "temporarily" for debugging. It's still there three years later. Every server in that security group is exposed to the internet for SSH brute-force attacks.
Fix: Use SSM Session Manager or a bastion host with MFA. Never allow SSH from 0.0.0.0/0. Use VPN or IP allowlisting if direct SSH is required. Set up AWS Config to detect and alert on open security groups.
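Pending a Config rule, the same detection is a short script: flag any ingress rule that covers port 22 and allows `0.0.0.0/0`. A sketch over the `aws ec2 describe-security-groups` output shape; sample data is hypothetical.

```python
def open_ssh_rules(security_groups: list[dict]) -> list[str]:
    """Return group IDs that allow port 22 from 0.0.0.0/0."""
    flagged = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            from_p, to_p = perm.get("FromPort"), perm.get("ToPort")
            # Missing ports means "all traffic" (IpProtocol -1).
            covers_22 = from_p is None or (from_p <= 22 <= to_p)
            open_cidr = any(
                r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])
            )
            if covers_22 and open_cidr:
                flagged.append(sg["GroupId"])
                break
    return flagged

# Hypothetical group: the "temporary" debugging rule from three years ago.
sgs = [{
    "GroupId": "sg-0abc",
    "IpPermissions": [{
        "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
}]
print(open_ssh_rules(sgs))  # ['sg-0abc']
```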
## 5. Single-AZ deployment
Your entire workload runs in us-east-1a. AWS has an AZ outage (this happens). You're completely down while multi-AZ customers keep running.
Fix: Deploy across at least 2 AZs, preferably 3. Use multi-AZ RDS. Spread EKS node groups across AZs. Test AZ failure by draining nodes in one AZ.
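A quick way to see whether you have this problem: count instances per AZ. If the result has one key, one outage takes everything down. A sketch over the `Placement` field from `aws ec2 describe-instances`; the fleet below is hypothetical.

```python
from collections import Counter

def az_spread(instances: list[dict]) -> Counter:
    """Count instances per availability zone."""
    return Counter(i["Placement"]["AvailabilityZone"] for i in instances)

# Hypothetical fleet: everything in one AZ.
fleet = [{"Placement": {"AvailabilityZone": "us-east-1a"}} for _ in range(4)]
spread = az_spread(fleet)
print(len(spread))  # 1 -> single point of failure
```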
## 6. No billing alerts
You spin up a p4d.24xlarge (GPU instance, ~$32/hour) for testing and forget about it. You find out when the $23,000 bill arrives. Or someone's crypto-mining in your account and you don't know for two weeks.
Fix: Set up billing alerts at $100, $500, $1000, and your monthly budget. Use AWS Budgets with SNS notifications. Review Cost Explorer weekly.
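The arithmetic behind that bill is worth internalizing: hourly rate times hours in a month. At roughly $32/hour for 30 days, the forgotten instance costs about $23,000, which the earliest alert tier would have caught within hours.

```python
def monthly_cost(hourly_rate: float, hours: int = 24 * 30) -> float:
    """Cost of leaving an instance running for a 30-day month."""
    return hourly_rate * hours

# p4d.24xlarge at ~$32/hour, forgotten for a month:
print(round(monthly_cost(32.0)))  # 23040 -- the ~$23,000 bill above
```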
## 7. Hardcoding region everywhere
You hardcode us-east-1 in your Terraform, your application config, and your scripts. When you need to deploy to a second region for disaster recovery, you have to find and update hundreds of hardcoded references.
Fix: Use variables for region. In Terraform: var.region. In apps: environment variables. In the AWS SDKs: read the AWS_REGION environment variable or query the instance metadata service (IMDS) for the current region.
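In application code, the pattern is a single resolution function instead of a scattered literal. A minimal sketch: `AWS_REGION` and `AWS_DEFAULT_REGION` are the standard variables the SDKs and CLI read; the fallback default is an example.

```python
import os

def resolve_region(default: str = "us-east-1") -> str:
    """Resolve the region from the environment instead of hardcoding it."""
    return (
        os.environ.get("AWS_REGION")
        or os.environ.get("AWS_DEFAULT_REGION")
        or default
    )

os.environ["AWS_REGION"] = "eu-west-1"  # e.g. set by the deploy environment
print(resolve_region())  # eu-west-1
```

Deploying to a second region then means changing one environment variable, not hundreds of references.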
## 8. Not using OIDC for CI/CD
You create a static IAM user with long-lived access keys for your CI pipeline. Those keys are in environment variables on your CI system. A compromised CI runner or a supply chain attack gives the attacker persistent access to your AWS account.
Fix: Use OIDC federation: GitHub Actions → AWS, GitLab → AWS. Short-lived tokens, no stored keys, automatically scoped per repo and branch.
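The scoping lives in the IAM role's trust policy: the `sub` claim from GitHub's OIDC provider has the form `repo:ORG/REPO:ref:refs/heads/BRANCH`, so the role can only be assumed from that repo and branch. A sketch that builds the trust-policy document; the account ID and repo names are placeholders.

```python
def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """IAM role trust policy for GitHub Actions OIDC federation,
    scoped to one repository and branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": (
                    f"arn:aws:iam::{account_id}:oidc-provider/"
                    "token.actions.githubusercontent.com"
                )
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
                    "token.actions.githubusercontent.com:sub":
                        f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }

policy = github_oidc_trust_policy("123456789012", "my-org/my-app", "main")
cond = policy["Statement"][0]["Condition"]["StringEquals"]
print(cond["token.actions.githubusercontent.com:sub"])
# repo:my-org/my-app:ref:refs/heads/main
```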
## 9. Deleting CloudFormation stack / Terraform state
You delete the CloudFormation stack or Terraform state file to "start fresh." The resources still exist in AWS, but Terraform no longer knows about them. Now you have orphaned resources you can't manage, and the next terraform apply creates duplicates.
Fix: Never delete state. Use terraform import to bring existing resources under management. Back up state to versioned S3.
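The versioned-S3 setup is a few lines of backend configuration. A sketch; bucket, key, and table names are placeholders, and the bucket itself must have versioning enabled so an accidentally deleted or corrupted state file can be restored from an earlier object version.

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state"           # versioning enabled on this bucket
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock"          # prevents concurrent state writes
  }
}
```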
## 10. EBS volumes outliving their instances
You terminate EC2 instances but the EBS volumes are set to DeleteOnTermination: false. Orphaned volumes accumulate silently — $0.10/GB/month adds up to thousands per year across a large account.
Fix: Set DeleteOnTermination: true for non-persistent volumes. Run regular orphaned resource audits. Use AWS Cost Explorer to track unattached EBS spend.
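The audit itself is simple: a volume in state `available` is attached to nothing, and its cost is size times the per-GB rate. A sketch over the `aws ec2 describe-volumes` output shape; the fleet and $0.10/GB/month rate mirror the example above.

```python
def orphaned_ebs_cost(volumes: list[dict], price_per_gb_month: float = 0.10) -> float:
    """Monthly spend on unattached ("available") EBS volumes."""
    return sum(
        v["Size"] * price_per_gb_month
        for v in volumes
        if v.get("State") == "available"
    )

# Hypothetical account: 40 orphaned 500 GB volumes.
volumes = [
    {"VolumeId": f"vol-{i}", "Size": 500, "State": "available"}
    for i in range(40)
]
print(round(orphaned_ebs_cost(volumes), 2))  # 2000.0 per month
```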
## 11. VPC IP exhaustion
You create a /24 subnet (256 IPs, 251 usable — AWS reserves 5 addresses per subnet). Your EKS cluster assigns one IP per pod. At 50 pods per node and 5 nodes, you need 250+ IPs before counting the nodes themselves. Your subnet is exhausted. New pods can't schedule.
Fix: Size subnets appropriately. EKS needs large subnets — use /18 or /19 for pod subnets. Use VPC CNI prefix delegation to reduce IP consumption. Monitor with aws ec2 describe-subnets.
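The capacity math can be checked before provisioning with the stdlib `ipaddress` module. A sketch assuming default VPC CNI behavior (one IP per pod plus one primary IP per node, no prefix delegation); the CIDRs are examples.

```python
import ipaddress

AWS_RESERVED = 5  # AWS reserves the first 4 and the last IP in every subnet

def usable_ips(cidr: str) -> int:
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED

def subnet_can_fit(cidr: str, nodes: int, pods_per_node: int) -> bool:
    """One IP per pod plus one primary IP per node."""
    demand = nodes * (pods_per_node + 1)
    return demand <= usable_ips(cidr)

print(usable_ips("10.0.1.0/24"))             # 251
print(subnet_can_fit("10.0.1.0/24", 5, 50))  # False -- the exhaustion above
print(subnet_can_fit("10.0.0.0/19", 5, 50))  # True
```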