AWS EC2 - Street-Level Ops
Real-world EC2 workflows for production environments. These are the procedures you reach for during incidents, capacity issues, and daily operations.
Debugging Launch Failures
When an instance refuses to launch, the error message tells you what to do — if you know where to look.
# InsufficientInstanceCapacity — AWS doesn't have capacity for this type in this AZ
# Try a different AZ, a different instance type, or use a capacity reservation
# Note: --dry-run validates permissions and parameters only; capacity errors
# surface only on a real launch attempt
aws ec2 run-instances --instance-type m5.xlarge --subnet-id subnet-abc123 \
--image-id ami-xxx --dry-run 2>&1
# InstanceLimitExceeded — you've hit the account limit for this instance type
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A # Running On-Demand Standard instances
# Request an increase (this quota is measured in vCPUs, not instance count):
aws service-quotas request-service-quota-increase \
--service-code ec2 --quota-code L-1216C47A --desired-value 128
# Check current running instance count by type
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].InstanceType' --output text | \
tr '\t' '\n' | sort | uniq -c | sort -rn
For InvalidParameterValue or Unsupported errors, check that the AMI, instance type, and subnet are compatible (e.g., Nitro-only AMIs on Nitro instances, arm64 AMIs on Graviton types).
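A quick way to check compatibility is to compare the AMI's properties against what the instance type supports (a sketch; ami-xxx and m5.xlarge are placeholders):

```shell
# What the AMI provides
aws ec2 describe-images --image-ids ami-xxx \
  --query 'Images[0].{Arch:Architecture,Virt:VirtualizationType,Ena:EnaSupport}'
# What the instance type accepts
aws ec2 describe-instance-types --instance-types m5.xlarge \
  --query 'InstanceTypes[0].{Archs:ProcessorInfo.SupportedArchitectures,Hypervisor:Hypervisor}'
# The AMI's Architecture must appear in SupportedArchitectures; Nitro types
# (Hypervisor=nitro) additionally need EnaSupport=true on the AMI
```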
Recovering Data from a Stopped/Broken Instance
The instance will not boot. You need data off its root volume. This is a standard operating procedure.
# Step 1: Stop the broken instance (if not already stopped)
aws ec2 stop-instances --instance-ids i-broken123
aws ec2 wait instance-stopped --instance-ids i-broken123
# Step 2: Identify and detach the root volume
ROOT_VOL=$(aws ec2 describe-instances --instance-ids i-broken123 \
--query 'Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName==`/dev/xvda` || DeviceName==`/dev/sda1`].Ebs.VolumeId' \
--output text)
echo "Root volume: $ROOT_VOL"
aws ec2 detach-volume --volume-id "$ROOT_VOL"
aws ec2 wait volume-available --volume-ids "$ROOT_VOL"
# Step 3: Attach to a rescue instance as a secondary volume
aws ec2 attach-volume --volume-id "$ROOT_VOL" --instance-id i-rescue456 --device /dev/xvdf
# On the rescue instance:
# sudo mkdir /mnt/recovery
# sudo mount /dev/xvdf1 /mnt/recovery   # partition number may vary
# (on Nitro instances the volume appears as /dev/nvme1n1; run lsblk to confirm)
# cp -a /mnt/recovery/important-data /home/ec2-user/
# Step 4: When done, detach and optionally reattach to original instance
aws ec2 detach-volume --volume-id "$ROOT_VOL"
aws ec2 wait volume-available --volume-ids "$ROOT_VOL"
aws ec2 attach-volume --volume-id "$ROOT_VOL" --instance-id i-broken123 --device /dev/xvda
# Reattach under the same device name it had originally (/dev/xvda or /dev/sda1)
SSH Connectivity Troubleshooting Chain
Cannot SSH into an instance. Work through this in order — do not skip steps.
# Step 1: Is the instance running and passing status checks?
aws ec2 describe-instance-status --instance-ids i-abc123 \
--query 'InstanceStatuses[0].{State:InstanceState.Name,System:SystemStatus.Status,Instance:InstanceStatus.Status}'
# If System=impaired → underlying host issue; AWS may auto-recover it, or stop/start to move hosts
# If Instance=impaired → OS-level issue, check console output
# Step 2: Does it have a reachable IP?
aws ec2 describe-instances --instance-ids i-abc123 \
--query 'Reservations[0].Instances[0].{Public:PublicIpAddress,Private:PrivateIpAddress,SubnetId:SubnetId}'
# Step 3: Is port 22 open in the security group?
SG=$(aws ec2 describe-instances --instance-ids i-abc123 \
--query 'Reservations[0].Instances[0].SecurityGroups[0].GroupId' --output text)
aws ec2 describe-security-groups --group-ids "$SG" \
--query 'SecurityGroups[0].IpPermissions[?FromPort==`22`]'
# Step 4: Is the NACL blocking SSH?
SUBNET=$(aws ec2 describe-instances --instance-ids i-abc123 \
--query 'Reservations[0].Instances[0].SubnetId' --output text)
aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=$SUBNET" \
--query 'NetworkAcls[0].Entries[?PortRange.From<=`22` && PortRange.To>=`22`]'
# Note: allow-all entries (Protocol -1) have no PortRange and won't match this
# filter; an empty result does not prove SSH is blocked, so eyeball Entries[] too
# Step 5: Does the route table have a path to you?
# If public subnet, needs igw route. If private, you need a bastion/VPN/SSM.
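Checking the route table is one describe call (reuses $SUBNET from Step 4):

```shell
# If this returns null, the subnet has no explicit association and uses the
# VPC's main route table instead
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET" \
  --query 'RouteTables[0].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId,State]' \
  --output table
# Public subnet: expect 0.0.0.0/0 -> igw-xxxx. A nat-xxxx route is outbound-only,
# so you still need a bastion, VPN, or SSM to reach the instance
```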
# Step 6: Check the system log for boot failures
aws ec2 get-console-output --instance-id i-abc123 --latest --output text | tail -50
# Look for: kernel panic, filesystem errors, sshd not starting, disk full
If all networking checks pass, the problem is inside the instance — disk full, sshd crashed, firewall rules (iptables), or wrong SSH key. Use SSM Session Manager as a backup path:
# SSM Session Manager — no SSH key or open port needed
aws ssm start-session --target i-abc123
# Requires SSM agent running and instance profile with AmazonSSMManagedInstanceCore
Performance Diagnosis
When an instance is slow, the cause is almost always one of four things: CPU credits, EBS throughput, network bandwidth, or memory pressure.
# CPU credit balance (t2/t3 burstable instances) — if 0, you're throttled
# T2 instances have a hard limit; T3 instances enter "unlimited" mode by default and accrue charges
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 --metric-name CPUCreditBalance \
--dimensions Name=InstanceId,Value=i-abc123 \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
--period 300 --statistics Minimum
# EBS throughput — check if you're hitting the per-volume or per-instance limit
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS --metric-name VolumeReadOps \
--dimensions Name=VolumeId,Value=vol-abc123 \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
--period 300 --statistics Sum
# Also check: VolumeThroughputPercentage, VolumeQueueLength
# QueueLength > 1 consistently = your disk is the bottleneck
# Network performance — check if you're hitting the instance bandwidth limit
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 --metric-name NetworkOut \
--dimensions Name=InstanceId,Value=i-abc123 \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
--period 300 --statistics Sum
# Examples: t3.micro and t3.xlarge burst up to 5 Gbps but sustain a much lower
# baseline; m5.xlarge bursts up to 10 Gbps; m5.8xlarge gets 10 Gbps sustained.
# Check your type's published figures before blaming the application.
On the instance itself:
# Quick triage
top -bn1 | head -20 # CPU and memory
iostat -x 1 3 # Disk I/O (iowait, await, %util)
sar -n DEV 1 3 # Network throughput
free -h # Memory pressure
dmesg | tail -30 # Kernel messages (OOM, disk errors)
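On ENA-based instances (most Nitro types), driver statistics show directly whether the hypervisor is shaping your traffic, which is more conclusive than inferring it from CloudWatch. The device name eth0 is an assumption; newer AMIs often use ens5:

```shell
# Non-zero *_allowance_exceeded counters mean an instance-level limit was hit
ethtool -S eth0 | grep allowance_exceeded
# bw_in/bw_out_allowance_exceeded -> bandwidth cap
# pps_allowance_exceeded          -> packets-per-second cap
# conntrack_allowance_exceeded    -> connection-tracking table cap
```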
AMI Baking Pipeline with Packer
Building golden AMIs for consistent deployments. This is the standard pattern.
# Packer template essentials (packer.pkr.hcl)
# 1. Start from a hardened base AMI (Amazon Linux 2023 or Ubuntu LTS)
# 2. Install packages, agents, configs
# 3. Run security hardening (CIS benchmarks)
# 4. Clean up (remove SSH keys, bash history, tmp files)
# 5. Create AMI with meaningful name and tags
# Build the AMI
packer build -var "version=$(git rev-parse --short HEAD)" template.pkr.hcl
# List your AMIs sorted by creation date
aws ec2 describe-images --owners self \
--query 'sort_by(Images, &CreationDate)[-10:].[ImageId,Name,CreationDate,State]' \
--output table
# Deregister old AMIs (keep last 5 per application)
OLD_AMIS=$(aws ec2 describe-images --owners self \
--filters "Name=tag:Application,Values=myapp" \
--query 'sort_by(Images, &CreationDate)[:-5].[ImageId]' --output text)
for ami in $OLD_AMIS; do
echo "Deregistering $ami"
# Get snapshot IDs before deregistering
SNAPS=$(aws ec2 describe-images --image-ids "$ami" \
--query 'Images[0].BlockDeviceMappings[].Ebs.SnapshotId' --output text)
aws ec2 deregister-image --image-id "$ami"
for snap in $SNAPS; do
aws ec2 delete-snapshot --snapshot-id "$snap"
done
done
Spot Instance Interruption Handling
Spot instances save 60-90% over On-Demand but can be reclaimed with two minutes' notice. You must handle this.
# Check for interruption notice from instance metadata (poll every 5s)
# This endpoint returns a 404 normally, 200 when interrupted
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/spot/instance-action
# Returns: {"action": "terminate", "time": "2024-01-15T12:00:00Z"}
# In your application or init script, run this as a background check
# (re-request $TOKEN before its TTL expires; 21600s is the maximum)
while true; do
ACTION=$(curl -s -o /dev/null -w "%{http_code}" \
-H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/spot/instance-action)
if [ "$ACTION" = "200" ]; then
echo "SPOT INTERRUPTION — draining"
# Deregister from load balancer
# Finish in-flight requests
# Checkpoint state to S3/EFS
# Signal ASG to launch replacement
break
fi
sleep 5
done
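EC2 also emits a rebalance recommendation when a Spot instance is at elevated risk of interruption; it often arrives earlier than the two-minute notice, so polling it the same way buys extra drain time (reuses $TOKEN from above):

```shell
# 404 = no recommendation yet; 200 = consider draining proactively
curl -s -o /dev/null -w "%{http_code}" \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/events/recommendations/rebalance
```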
For auto scaling groups with spot:
# Mixed instances policy — spread across types and AZs for resilience
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg \
--query 'AutoScalingGroups[0].MixedInstancesPolicy'
# Check spot interruption history
aws ec2 describe-spot-instance-requests \
--filters "Name=status-code,Values=instance-terminated-by-price,instance-terminated-no-capacity" \
--query 'SpotInstanceRequests[].{Type:LaunchSpecification.InstanceType,AZ:LaunchedAvailabilityZone,Time:Status.UpdateTime}' \
--output table
Instance Metadata Service v2 (IMDSv2)
IMDSv2 is the secure way to access instance metadata. IMDSv1 is a known SSRF attack vector.
# Get a session token (required for IMDSv2)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# Common metadata queries
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Get the role credentials (for debugging "who am I")
ROLE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
"http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE" | jq .
# Enforce IMDSv2-only on an instance (disable v1)
aws ec2 modify-instance-metadata-options --instance-id i-abc123 \
--http-tokens required --http-endpoint enabled
# Find instances still allowing IMDSv1
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].[InstanceId,MetadataOptions.HttpTokens,Tags[?Key==`Name`].Value|[0]]' \
--output table
EBS Snapshot and Restore
Snapshots are your backup and migration tool. Know these procedures cold.
# Create a snapshot with meaningful description
aws ec2 create-snapshot --volume-id vol-abc123 \
--description "pre-upgrade-$(date +%Y%m%d-%H%M)" \
--tag-specifications 'ResourceType=snapshot,Tags=[{Key=Purpose,Value=pre-upgrade},{Key=Expiry,Value=7d}]'
# Wait for snapshot completion
SNAP_ID=snap-abc123
aws ec2 wait snapshot-completed --snapshot-ids "$SNAP_ID"
# Restore: create a volume from snapshot in any AZ
aws ec2 create-volume --snapshot-id "$SNAP_ID" \
--availability-zone us-east-1a --volume-type gp3 \
--iops 3000 --throughput 125 \
--tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=restored-vol}]'
# Copy snapshot to another region (DR)
aws ec2 copy-snapshot --source-region us-east-1 \
--source-snapshot-id "$SNAP_ID" \
--destination-region us-west-2 \
--description "DR copy of $SNAP_ID"
# Find and clean up old snapshots (>30 days)
CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S)
aws ec2 describe-snapshots --owner-ids self \
--query "Snapshots[?StartTime<'$CUTOFF'].[SnapshotId,VolumeId,StartTime,Description]" \
--output table
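To delete what that listing finds, loop over the IDs (a sketch; deleting a snapshot that backs a registered AMI fails with InvalidSnapshot.InUse, which acts as a built-in safety net):

```shell
for snap in $(aws ec2 describe-snapshots --owner-ids self \
    --query "Snapshots[?StartTime<'$CUTOFF'].SnapshotId" --output text); do
  echo "Deleting $snap"
  aws ec2 delete-snapshot --snapshot-id "$snap" \
    || echo "Skipped $snap (likely referenced by an AMI)"
done
```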
Maintenance Event Handling
AWS notifies you of scheduled maintenance. Here is what to do.
# Check for scheduled events on your instances
aws ec2 describe-instance-status --filters "Name=event.code,Values=*" \
--query 'InstanceStatuses[?Events].[InstanceId,Events[].{Code:Code,Before:NotBefore,After:NotAfter,Description:Description}]' \
--output table
# For "instance-retirement" — your hardware is dying
# Stop and start the instance to migrate to new hardware (reboot is not enough)
Gotcha: Stop/start changes the instance's public IP address (unless you have an Elastic IP). Any DNS records, firewall rules, or partner allowlists referencing the old public IP will break. EIPs survive stop/start; dynamic public IPs do not.
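A quick pre-flight check for whether the instance has an Elastic IP:

```shell
# Empty output = no EIP, so the public IP will change on stop/start
aws ec2 describe-addresses \
  --filters "Name=instance-id,Values=i-abc123" \
  --query 'Addresses[].[PublicIp,AllocationId]' --output table
```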
aws ec2 stop-instances --instance-ids i-abc123
aws ec2 wait instance-stopped --instance-ids i-abc123
aws ec2 start-instances --instance-ids i-abc123
# This moves the instance to new physical hardware
# For "system-reboot" — AWS needs to reboot the host
# You can reboot proactively before the scheduled window:
aws ec2 reboot-instances --instance-ids i-abc123
# For "system-maintenance" — AWS needs to do host work
# Stop/start migrates you off the host before the window
Scaling Event Debugging
When your Auto Scaling Group is not behaving as expected:
# Check scaling activities (most recent first)
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg \
--query 'Activities[:10].[StartTime,StatusCode,Description,Cause]' \
--output table
# Check if instances are failing health checks
aws autoscaling describe-auto-scaling-instances \
--query 'AutoScalingInstances[?AutoScalingGroupName==`my-asg`].[InstanceId,HealthStatus,LifecycleState]' \
--output table
# Check scaling policies and current alarms
aws autoscaling describe-policies --auto-scaling-group-name my-asg \
--query 'ScalingPolicies[].{Policy:PolicyName,Type:PolicyType,Alarms:Alarms[].AlarmName}'
# Check if cooldown is preventing scaling
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg \
--query 'AutoScalingGroups[0].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity,Cooldown:DefaultCooldown,Instances:Instances[].InstanceId}'
# Common failures:
# - Launch failures: wrong AMI, SG in wrong VPC, subnet full, instance limit
# - Termination protection: instance has termination protection enabled
# - Lifecycle hooks: stuck in Pending:Wait or Terminating:Wait
Default trap: ASG lifecycle hooks default to a 3600-second (1 hour) timeout. If your hook's Lambda or SSM document fails silently, the instance sits in Pending:Wait for a full hour before the ASG abandons it and tries again. Set the timeout to match your actual hook duration (e.g., 300 seconds) and always send a CONTINUE or ABANDON signal explicitly.
# Inspect hook names, timeouts, and default results
aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name my-asg
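From inside the hook's script or Lambda, completing the action explicitly looks like this (the hook name is illustrative):

```shell
aws autoscaling complete-lifecycle-action \
  --auto-scaling-group-name my-asg \
  --lifecycle-hook-name my-launch-hook \
  --instance-id i-abc123 \
  --lifecycle-action-result CONTINUE   # or ABANDON to terminate/roll back
```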