
AWS Troubleshooting Footguns

Mistakes that cause outages, security incidents, or billing surprises on AWS.


1. Not checking sts get-caller-identity first

You spend 30 minutes debugging an S3 access denied error. Turns out you are authenticated as the wrong role because AWS_PROFILE was set to a different account in your shell. Every subsequent command ran against the wrong account.

Fix: Always start troubleshooting with aws sts get-caller-identity. Confirm account, role, and region before doing anything else.

One-liner: Make this your muscle memory: aws sts get-caller-identity && echo "Region: $AWS_REGION". Pin it as the first line of every troubleshooting runbook. Most "Access Denied" investigations begin with the wrong identity or the wrong account.
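The identity check can be wrapped in a small pre-flight guard at the top of a runbook script. A minimal sketch, assuming 111122223333 is your expected account ID (a placeholder):

```shell
# Pre-flight guard: refuse to continue unless the caller identity matches
# the expected account. The account ID below is a placeholder.
EXPECTED_ACCOUNT="111122223333"

check_identity() {
  # $1: account ID, e.g. from:
  #   aws sts get-caller-identity --query Account --output text
  if [ "$1" != "$EXPECTED_ACCOUNT" ]; then
    echo "Wrong account: $1 (expected $EXPECTED_ACCOUNT)" >&2
    return 1
  fi
  echo "Identity OK: account $1"
}

# Real usage at the top of a runbook script:
#   check_identity "$(aws sts get-caller-identity --query Account --output text)" || exit 1
```

Failing closed here is the point: the script exits before any command can run against the wrong account.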


2. Security group allows 0.0.0.0/0 on port 22

You add a temporary SSH rule for debugging and forget to remove it. A bot scans it within minutes. If the key is weak or compromised, you have an intruder. SSH open to the world is one of the most common findings in AWS security audits.

Fix: Use SSM Session Manager or EC2 Instance Connect instead of opening port 22. If you must allow SSH, restrict to your IP. Set a calendar reminder to remove the rule.
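If SSH really must stay open temporarily, the rule can at least be scoped to your current public IP. A sketch (the security group and instance IDs are placeholders; requires live AWS credentials to run):

```shell
# Replace the open rule with one scoped to your current public IP.
# sg-0123456789abcdef0 is a placeholder group ID.
MY_IP=$(curl -s https://checkip.amazonaws.com)

aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr "${MY_IP}/32"

# Better still: skip port 22 entirely and use Session Manager.
# aws ssm start-session --target i-0123456789abcdef0
```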


3. Forgetting NACLs are stateless

Your security group allows inbound port 443 and all outbound. Traffic still fails. You forgot about the NACL on the subnet — NACLs are stateless, so you need explicit allow rules for both inbound AND the ephemeral port range outbound.

Fix: Check both security groups and NACLs when traffic is blocked. VPC Flow Logs confirm traffic is being REJECTed, but both layers log as REJECT, so use VPC Reachability Analyzer (or compare the SG and NACL rules directly) to pinpoint which layer is blocking. NACLs need return-traffic rules for the ephemeral range (ports 1024-65535).
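Adding the missing return-traffic rule looks like this. A sketch with placeholder NACL ID and rule number; requires live AWS credentials:

```shell
# Allow return traffic on the ephemeral range in the outbound direction.
# acl-0123456789abcdef0 and rule number 120 are placeholders.
aws ec2 create-network-acl-entry \
  --network-acl-id acl-0123456789abcdef0 \
  --rule-number 120 \
  --protocol tcp \
  --port-range From=1024,To=65535 \
  --egress \
  --cidr-block 0.0.0.0/0 \
  --rule-action allow

# Inspect the existing entries first to see what is actually missing:
# aws ec2 describe-network-acls --network-acl-ids acl-0123456789abcdef0 \
#   --query 'NetworkAcls[].Entries'
```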


4. Running aws ec2 terminate-instances instead of stop-instances

You meant to stop an instance for maintenance. You typed terminate. The instance is gone. If it was backed by an instance store, the data is irrecoverable. EBS volumes with DeleteOnTermination=true are also gone.

Fix: Enable termination protection on critical instances: aws ec2 modify-instance-attribute --instance-id i-abc123 --disable-api-termination. Use --dry-run for destructive commands.
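Enabling protection and then verifying it took effect looks like this (placeholder instance ID; requires live AWS credentials):

```shell
# Turn on termination protection, then confirm it took effect.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --disable-api-termination

aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute disableApiTermination \
  --query 'DisableApiTermination.Value'

# With protection on, a terminate call fails. Separately, --dry-run checks
# permissions without touching the instance at all:
# aws ec2 terminate-instances --instance-ids i-0123456789abcdef0 --dry-run
```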


5. Overly broad IAM policies in production

Your deploy role has "Action": "*", "Resource": "*" because it was "temporary." An application bug or compromised credential now has admin access to every service in the account.

Fix: Follow least privilege. Use aws iam simulate-principal-policy to verify the role actually has (and needs) each permission. Scope actions to specific resources. Apply permission boundaries to roles.
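Simulating a role's permissions before relying on them looks like this. A sketch with a placeholder role ARN and bucket name; requires live AWS credentials:

```shell
# Ask IAM how it would evaluate specific actions for a role.
# The role ARN and bucket are placeholders.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::111122223333:role/deploy-role \
  --action-names s3:PutObject s3:DeleteObject \
  --resource-arns 'arn:aws:s3:::my-deploy-bucket/*' \
  --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
  --output table
```

An "implicitDeny" in the output means nothing grants the action; "allowed" on actions the role never uses is your signal to trim the policy.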

War story: The 2019 Capital One breach exposed 100 million credit applications. The root cause: an EC2 instance role with overly broad S3 permissions. An SSRF vulnerability in a misconfigured WAF allowed the attacker to steal the role's temporary credentials from the EC2 metadata service and dump S3 data. Least privilege on IAM roles would have limited the blast radius to a single bucket.


6. Ignoring VPC endpoint policies

Traffic from your private subnet goes through a VPC endpoint to S3. The endpoint policy restricts access to specific buckets. Your app tries to access a different bucket and gets Access Denied. You spend an hour checking IAM before discovering the endpoint policy.

Fix: When debugging S3 Access Denied from private subnets, check the VPC endpoint policy in addition to IAM and bucket policies: aws ec2 describe-vpc-endpoints.
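Pulling the endpoint policies for a VPC in one shot looks like this (placeholder VPC ID; requires live AWS credentials):

```shell
# List the VPC's endpoints and dump each one's policy document.
# vpc-0123456789abcdef0 is a placeholder.
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'VpcEndpoints[].[VpcEndpointId,ServiceName,PolicyDocument]' \
  --output json
```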


7. Not monitoring EBS burst credits

Your gp2 volume runs out of burst I/O credits during a traffic spike. Disk performance drops to baseline (100 IOPS for a 33GB volume). The database slows to a crawl. The application times out. No alert fired because nobody monitors BurstBalance.

Fix: Monitor BurstBalance in CloudWatch for all gp2 volumes. Migrate to gp3 for consistent performance without burst credits. Alert when BurstBalance drops below 20%.
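The 20% alert can be a standard CloudWatch alarm. A sketch, with placeholder volume ID and SNS topic ARN; requires live AWS credentials:

```shell
# Alarm when BurstBalance averages below 20% for two 5-minute periods.
# Volume ID and SNS topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name ebs-burst-balance-low \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 20 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts
```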

Default trap: A 100GB gp2 volume gets 300 IOPS baseline and 3000 IOPS burst. Once burst credits are exhausted, you drop to 300 IOPS — a 10x performance cliff. gp3 gives you 3000 IOPS baseline at a lower price with no burst credit mechanics. There is almost no reason to use gp2 for new volumes.


8. Using --output text without --query in scripts

Your script parses aws ec2 describe-instances text output by column position. AWS adds a new field to the output. Column positions shift. Your script starts passing the wrong value to terminate-instances.

Fix: Always use --query with JMESPath to extract specific fields. Never rely on column position in text output.
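The failure mode is easy to reproduce with pure shell and no AWS account. A sketch using fabricated tab-separated output:

```shell
# Fabricated tab-separated output illustrating the trap: a new field
# shifts column positions and silently changes what $2 means.
OLD_OUTPUT="$(printf 'i-abc123\trunning\tt3.micro')"
NEW_OUTPUT="$(printf 'i-abc123\tus-east-1a\trunning\tt3.micro')"   # field added

state_by_column() { printf '%s\n' "$1" | awk -F'\t' '{print $2}'; }

state_by_column "$OLD_OUTPUT"   # prints: running
state_by_column "$NEW_OUTPUT"   # prints: us-east-1a (the wrong field)

# The robust version names the field instead of counting columns:
# aws ec2 describe-instances \
#   --query 'Reservations[].Instances[].State.Name' --output text
```

JMESPath extracts by field name, so new fields in the response are simply ignored.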


9. Editing security groups shared across services

You modify a security group to fix connectivity for Service A. That same SG is attached to 15 other instances. You just opened a port that should only be open for Service A.

Fix: Use per-service or per-tier security groups. Audit SG membership before editing: aws ec2 describe-network-interfaces --filters Name=group-id,Values=sg-abc123.
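The membership audit can show what each attached interface belongs to, not just that attachments exist. A sketch (placeholder group ID; requires live AWS credentials):

```shell
# Before editing, see every network interface (and therefore instance,
# load balancer, Lambda, etc.) attached to the group.
aws ec2 describe-network-interfaces \
  --filters Name=group-id,Values=sg-0123456789abcdef0 \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Description,Attachment.InstanceId]' \
  --output table
```

If anything other than Service A shows up, clone the group and edit the copy instead.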


10. CloudTrail not enabled in all regions

CloudTrail is enabled in us-east-1 but not in ap-southeast-1. An attacker launches crypto miners in an unmonitored region. You have no audit trail of what happened.

Fix: Enable a multi-region trail: aws cloudtrail create-trail --name org-trail --is-multi-region-trail --s3-bucket-name my-trail-bucket. Verify with aws cloudtrail describe-trails.
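Verification is worth scripting, since a trail can exist but not be multi-region, or exist but not be logging. A sketch (the trail name org-trail matches the create command above; requires live AWS credentials):

```shell
# Confirm every trail is multi-region, and where it lives.
aws cloudtrail describe-trails \
  --query 'trailList[].[Name,IsMultiRegionTrail,HomeRegion]' \
  --output table

# Confirm the trail is actually delivering logs right now:
# aws cloudtrail get-trail-status --name org-trail --query 'IsLogging'
```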

War story: Attackers deliberately launch resources in unmonitored regions. Crypto mining in ap-southeast-1 while CloudTrail only covers us-east-1 is a common pattern. A multi-region trail with an organization trail (for multi-account setups) eliminates this blind spot. The storage cost is negligible compared to one undetected crypto mining incident.