AWS EC2 Footguns¶
Mistakes that cause outages, data loss, performance degradation, or surprise bills with EC2.
1. Using t2/t3 without understanding CPU credits (baseline vs burst)¶
You deploy a production database on a t3.medium. During low-traffic hours, CPU credits accumulate. During peak traffic, the instance bursts to 100% CPU. The credits run out. The instance is now throttled to its baseline (20% for t3.medium). Your database response times go from 5ms to 500ms. Monitoring shows CPU at 20% — it looks "idle" but is actually throttled to its maximum allowed baseline.
# Check CPU credit balance
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 --metric-name CPUCreditBalance \
--dimensions Name=InstanceId,Value=i-abc123 \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 --statistics Minimum
# If this hits 0, you are being throttled
Fix: For sustained workloads, use non-burstable instance types (m, c, r families). If you must use T-family, enable unlimited mode (aws ec2 modify-instance-credit-specification --instance-credit-specifications InstanceId=i-abc123,CpuCredits=unlimited) — but be aware this charges you for surplus credits. Monitor CPUSurplusCreditBalance to catch runaway costs.
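The arithmetic behind the throttle is worth internalizing. A back-of-envelope sketch — the per-type numbers (24 credits earned per hour, 576-credit maximum balance for t3.medium) come from AWS's published credit tables:
```shell
# How long can a t3.medium run at 100% CPU before throttling?
EARN_PER_HOUR=24      # credits earned per hour (t3.medium)
MAX_BALANCE=576       # maximum accrued credit balance (t3.medium)
VCPUS=2
BURN_PER_HOUR=$((VCPUS * 60))                 # one vCPU at 100% burns 60 credits/hour
NET_DRAIN=$((BURN_PER_HOUR - EARN_PER_HOUR))  # 96 credits/hour net drain
HOURS=$((MAX_BALANCE / NET_DRAIN))            # ~6 hours from a full bucket
echo "A full credit bucket lasts ~${HOURS}h at 100% CPU"
```
Six hours of runway sounds generous until the traffic spike lasts all day.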
2. Instance store data loss on stop¶
You store temporary data on the instance's NVMe instance store for performance. You stop the instance for a resize or maintenance. When you start it again, the instance store volumes are empty. All data is gone — no recovery possible.
Instance store data is also lost when:
- The instance is terminated
- The underlying hardware fails
- The instance is stopped (even briefly)

Instance store data is preserved through:
- Reboots only
Fix: Never store data you cannot afford to lose on instance store. Use EBS for persistent data and instance store only for caches, scratch space, and buffers. If using instance store for performance, replicate data to EBS or S3 asynchronously. Document which data lives where.
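To audit which instance types in a region even carry instance store volumes, a command sketch (output flags optional):
```shell
# List instance types that ship with instance store, and how much
aws ec2 describe-instance-types \
  --filters Name=instance-storage-supported,Values=true \
  --query 'InstanceTypes[].{Type:InstanceType,TotalGB:InstanceStorageInfo.TotalSizeInGB}' \
  --output table
```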
3. Not using IMDSv2 (SSRF risk)¶
IMDSv1 is a simple HTTP GET to 169.254.169.254 — no authentication. If your application has a server-side request forgery (SSRF) vulnerability, an attacker can access the metadata service and steal IAM credentials from the instance role. This is how Capital One was breached in 2019.
# IMDSv1 (vulnerable): anyone can do this
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/role-name
# IMDSv2 (secure): requires a PUT request for a session token first
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/role-name
Fix: Enforce IMDSv2 on all instances:
aws ec2 modify-instance-metadata-options \
--instance-id i-abc123 \
--http-tokens required \
--http-endpoint enabled
Use the AWS Config managed rule ec2-imdsv2-check to detect non-compliant instances.
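A quick fleet audit can also be done straight from the CLI — this sketch lists any instance whose metadata options still allow IMDSv1:
```shell
# Instances still accepting IMDSv1 (HttpTokens should be "required" everywhere)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{Id:InstanceId,Tokens:MetadataOptions.HttpTokens}' \
  --output table
```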
4. Security group changes applying immediately to running instances¶
You update a security group in production to tighten access. The change takes effect instantly on all running instances using that group. If you accidentally remove a rule that your application depends on (say, allowing database traffic on port 5432), all active database connections are killed immediately. There is no "pending" state, no deployment window, no rollback button.
Fix: Make security group changes through IaC with code review. When making manual changes, add the new rule first, verify, then remove the old rule. Never rely on a single security group for critical path traffic — use layered groups so removing one rule does not break connectivity.
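The add-first, remove-second sequence looks like this (group ID and CIDRs are illustrative):
```shell
# 1. Add the replacement (tighter) rule first
aws ec2 authorize-security-group-ingress --group-id sg-abc123 \
  --protocol tcp --port 5432 --cidr 10.0.1.0/24
# 2. Verify the application still connects
# 3. Only then revoke the old, broader rule
aws ec2 revoke-security-group-ingress --group-id sg-abc123 \
  --protocol tcp --port 5432 --cidr 10.0.0.0/16
```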
5. EBS throughput limits per instance type¶
You attach a high-performance io2 volume (64,000 IOPS) to a t3.large instance. The volume can deliver 64,000 IOPS, but the t3.large caps EBS throughput at 2,780 Mbps (and only in short bursts — the sustained baseline is lower) and roughly 15,000 IOPS. You are paying for io2 performance you can never use because the instance is the bottleneck.
# Check instance type EBS limits
aws ec2 describe-instance-types --instance-types t3.large \
--query 'InstanceTypes[].EbsInfo.{MaxBandwidth:EbsBandwidthInfo.MaximumBandwidthInMbps,MaxIOPS:EbsBandwidthInfo.MaximumIops,MaxThroughput:EbsBandwidthInfo.MaximumThroughputInMBps}'
Fix: Match your EBS volume performance to your instance type's EBS limits. There is no point provisioning more IOPS on the volume than the instance can deliver. For I/O-intensive workloads, use i-family (storage optimized) or r-family instances with higher EBS limits.
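The effective limit is simply the smaller of the two caps — a trivial sketch using the numbers above:
```shell
# Effective IOPS = min(volume provisioned IOPS, instance EBS IOPS cap)
VOLUME_IOPS=64000        # io2 volume, as provisioned
INSTANCE_MAX_IOPS=15000  # t3.large cap (see describe-instance-types above)
EFFECTIVE=$(( VOLUME_IOPS < INSTANCE_MAX_IOPS ? VOLUME_IOPS : INSTANCE_MAX_IOPS ))
echo "Effective IOPS: $EFFECTIVE"
```
You pay for 64,000 and get 15,000.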
6. Not tagging instances (cost allocation nightmare)¶
You launch 50 instances across 3 environments. None of them are tagged. The monthly bill is $15,000. Finance asks "which team owns this?" and "how much does production cost vs dev?" You have no way to answer without manually checking each instance.
Fix: Enforce tagging on launch. Required tags: Environment, Team, Service, CostCenter. Use tag policies in AWS Organizations to enforce required tags. Use AWS Config rule required-tags to detect untagged resources. Set up cost allocation tags in Billing to enable per-tag cost reporting.
# Find untagged instances
aws ec2 describe-instances \
--query 'Reservations[].Instances[?!not_null(Tags[?Key==`Environment`])].{Id:InstanceId,Type:InstanceType,State:State.Name}'
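Tags can be applied atomically at launch so an instance never exists untagged, even for a moment (AMI ID and tag values are placeholders):
```shell
aws ec2 run-instances --image-id ami-abc123 --instance-type m6i.large \
  --tag-specifications \
    'ResourceType=instance,Tags=[{Key=Environment,Value=prod},{Key=Team,Value=payments},{Key=Service,Value=api},{Key=CostCenter,Value=cc-100}]' \
    'ResourceType=volume,Tags=[{Key=Environment,Value=prod},{Key=Team,Value=payments}]'
```
Tagging the volumes in the same call matters: orphaned untagged EBS volumes are their own cost-allocation headache.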
7. Stopping an instance with instance store (data gone)¶
You have a c5d.xlarge with a 100 GB NVMe instance store drive. It contains a local cache that took hours to build. Someone stops the instance for "maintenance." The instance store is wiped. The cache is gone. The application takes hours to rebuild it.
This also catches people during auto-scaling events: the ASG terminates an instance, and any data on instance store is lost forever.
Fix: Design for instance store ephemerality. Pre-warm caches from S3 or a shared store on boot. Use EBS for anything that must survive stop/start. Document which instance types have instance store volumes and ensure teams understand the implications.
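A pre-warm step in the boot sequence might look like this (bucket name and mount point are hypothetical):
```shell
# In user data / a systemd unit: rebuild the ephemeral cache from S3 on every boot
mkdir -p /mnt/nvme0/cache
aws s3 sync s3://example-cache-bucket/cache/ /mnt/nvme0/cache/ --quiet
```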
8. Public IP auto-assign in public subnet¶
Your VPC has a public subnet with "auto-assign public IPv4 address" enabled. A developer launches a backend service in this subnet, not realizing it gets a public IP. The service is now directly addressable from the internet. Combined with a permissive security group, this is an accidental exposure.
# Check subnet auto-assign setting
aws ec2 describe-subnets --subnet-ids subnet-abc123 \
--query 'Subnets[].MapPublicIpOnLaunch'
Fix: Disable auto-assign public IP on all subnets except those specifically designated for public-facing resources. Use private subnets for backend services. In launch templates, explicitly set AssociatePublicIpAddress: false. Use Elastic IPs only when you intentionally need a static public address.
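Turning the setting off is one command per subnet (subnet ID is illustrative):
```shell
# Disable auto-assign public IPv4 on a subnet
aws ec2 modify-subnet-attribute --subnet-id subnet-abc123 \
  --no-map-public-ip-on-launch
```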
9. Oversized instances "just in case"¶
Someone provisions r6g.4xlarge instances (128 GiB RAM) for a service that uses 8 GiB RAM. "Better safe than sorry." The monthly cost is $800/instance instead of the $100/instance that would suffice. Across a fleet of 20 instances, that is $14,000/month wasted.
Fix: Start small and scale up based on data. Use CloudWatch metrics to monitor actual CPU, memory (requires CloudWatch agent), and network utilization. Use AWS Compute Optimizer for right-sizing recommendations:
aws compute-optimizer get-ec2-instance-recommendations \
--instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-abc123 \
--query 'instanceRecommendations[].{Current:currentInstanceType,Recommended:recommendationOptions[0].instanceType,Savings:recommendationOptions[0].estimatedMonthlySavings.value}'
10. EBS DeleteOnTermination=true (the default)¶
The root EBS volume has DeleteOnTermination set to true by default. When the instance is terminated (even accidentally), the root volume is deleted too. All data on it is gone. Combined with "I ran terminate-instances instead of stop-instances," this means total data loss.
# Check DeleteOnTermination setting
aws ec2 describe-instances --instance-ids i-abc123 \
--query 'Reservations[].Instances[].BlockDeviceMappings[].{Device:DeviceName,DeleteOnTerm:Ebs.DeleteOnTermination}'
Fix: Set DeleteOnTermination=false on volumes containing important data:
aws ec2 modify-instance-attribute --instance-id i-abc123 \
--block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"DeleteOnTermination":false}}]'
11. Ignoring maintenance events until the instance is force-rebooted¶
AWS schedules maintenance events (hardware degradation, host retirement) and sends notifications. If you ignore them, AWS will eventually force-reboot or force-stop your instance during the maintenance window. If the instance has instance store data, it is lost. If it is a single point of failure, you have an unplanned outage.
# Check for scheduled events
aws ec2 describe-instance-status \
--filters "Name=event.code,Values=instance-reboot,instance-stop,instance-retirement" \
--query 'InstanceStatuses[].{Instance:InstanceId,Event:Events[0].Code,Before:Events[0].NotBefore}'
Fix: Monitor Health Dashboard and EC2 maintenance events. Proactively stop/start instances with upcoming events to migrate them to new hardware. For ASG instances, simply terminate and let the ASG replace them. Set up EventBridge rules for EC2 Instance State-change Notification to automate responses.
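A minimal EventBridge rule for the state-change notifications mentioned above (rule name is arbitrary; wire it to a target with put-targets):
```shell
aws events put-rule --name ec2-state-change \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"]}'
```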
12. Running EBS-backed instances without snapshots¶
Your critical database runs on an EBS volume with no snapshots. The volume fails (rare but possible — EBS annual failure rate is 0.1-0.2%). Or an operator runs dd if=/dev/zero of=/dev/xvda by mistake. The data is gone with no recovery path.
Fix: Automate EBS snapshots with AWS Backup or Data Lifecycle Manager:
aws dlm create-lifecycle-policy \
--description "Daily snapshots, retain 7" \
--state ENABLED \
--execution-role-arn arn:aws:iam::123456789012:role/dlm-role \
--policy-details '{
"PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
"ResourceTypes": ["VOLUME"],
"TargetTags": [{"Key": "Backup", "Value": "true"}],
"Schedules": [{
"Name": "Daily",
"CreateRule": {"Interval": 24, "IntervalUnit": "HOURS"},
"RetainRule": {"Count": 7}
}]
}'