AWS EC2: The Virtual Server You Never See

Topics: EC2 instance types, EBS storage, security groups, instance metadata, spot instances, auto scaling, networking, monitoring, troubleshooting
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's Tuesday morning. PagerDuty fires: your production API server on EC2 is unreachable. Customers are getting timeouts. The last deploy was three days ago. Nothing has changed — or so everyone says.
You need to figure out why an EC2 instance stopped responding, fix it, and then make sure it never happens this way again. Along the way, you'll learn how EC2 actually works — from the naming convention that tells you everything about an instance, to the storage that vanishes when you least expect it, to the metadata service that caused one of the biggest cloud breaches in history.
By the end of this lesson you'll understand:
- How to decode an instance type name and pick the right one
- Why your data disappeared when you stopped an instance (and how to prevent it)
- The systematic troubleshooting ladder for unreachable instances
- How security groups, IMDSv2, and spot instances actually work
- How to build a self-healing fleet with auto scaling
We'll build up from the basics, then use an incident to tie everything together.
Part 1: Decoding the Instance — Every Character Means Something¶
Before you can troubleshoot anything, you need to understand what you're running. Every EC2 instance type encodes its purpose in its name. Let's decode one.
c 7 g n . xlarge
│ │ │ │   └────── Size: xlarge (4 vCPU, 8 GiB RAM)
│ │ │ └────────── Additional capability: n = network-enhanced
│ │ └──────────── Processor: g = Graviton (ARM-based)
│ └────────────── Generation: 7th
└──────────────── Family: c = compute-optimized
Every character is a decision. Here's how to read the whole alphabet:
| Position | What it means | Common values |
|---|---|---|
| Family | What the instance is optimized for | m = general, c = compute, r = RAM, t = burstable, i = I/O, p = GPU |
| Generation | Hardware generation (higher = newer, cheaper) | 5, 6, 7 — always pick the latest available |
| Processor | CPU architecture | g = Graviton (ARM), a = AMD, i = Intel (or no letter = Intel) |
| Capabilities | Extra features | n = network-enhanced, d = local NVMe disk, e = extra memory |
| Size | How much CPU and RAM | nano to metal (each step roughly doubles) |
Remember: The family letter tells you the optimization: Most workloads, Compute, RAM, Tiny-burst, I/O, Parallel GPU. Mnemonic: My Computers Run Things In Parallel.
# See exactly what a c7gn.xlarge gives you
aws ec2 describe-instance-types --instance-types c7gn.xlarge \
--query 'InstanceTypes[0].{
vCPU: VCpuInfo.DefaultVCpus,
MemoryMiB: MemoryInfo.SizeInMiB,
Network: NetworkInfo.NetworkPerformance,
EBSBandwidth: EbsInfo.EbsBandwidthInfo.MaximumBandwidthInMbps
}' --output table
| Size | vCPUs | Memory (GiB) | Use when... |
|---|---|---|---|
| nano | 2 | 0.5 | Testing, tiny services |
| small | 2 | 2 | Dev environments, small APIs |
| large | 2 | 8 | Single-purpose production services |
| xlarge | 4 | 16 | App servers, moderate databases |
| 2xlarge | 8 | 32 | Heavier workloads |
| 4xlarge+ | 16+ | 64+ | When you've proven you need it |
| metal | All cores | All RAM | Bare metal — no hypervisor |
Trivia: EC2 launched on August 25, 2006 with a single instance type: m1.small (1.7 GB RAM, one virtual CPU). There was no SLA, one data center, and no load balancer integration. Today AWS offers over 750 instance types. That original m1.small cost $0.10/hour — the modern equivalent (t3.small) costs roughly $0.02/hour and is dramatically more powerful.
The Graviton decision¶
Instance types with a g suffix run on AWS Graviton processors — ARM chips designed
in-house by AWS (from the Annapurna Labs acquisition in 2015). The pitch is real: 20–40%
better price-performance for most workloads. Use Graviton unless your software specifically
requires x86 (some older compiled binaries, certain Oracle or Windows workloads).
# Compare pricing: x86 vs Graviton for general-purpose
# m7i.large (Intel) = ~$0.1008/hr in us-east-1
# m7g.large (Graviton) = ~$0.0816/hr in us-east-1
# Same specs, 19% cheaper — and Graviton3 uses 60% less energy
Interview Bridge: "When would you choose Graviton instances?" is a common AWS interview question. Answer: any workload that's ARM-compatible (most Linux workloads, containerized apps, Java, Python, Go). Exceptions: Windows, x86-specific binaries, or software with no ARM builds.
Flashcard Check #1¶
Cover the answers and test yourself.
| Question | Answer |
|---|---|
| What does c7gn.xlarge mean? | Compute-optimized, 7th gen, Graviton, network-enhanced, extra-large |
| Which family letter means "memory-optimized"? | r (think: RAM) |
| Why pick Graviton over Intel? | 20–40% better price-performance for ARM-compatible workloads |
| What does the d suffix mean? | Local NVMe instance store disk attached |
Part 2: The Storage That Vanishes — Instance Store vs. EBS¶
This is where people lose data. It's the most critical distinction in EC2, and the source of countless war stories.
Two kinds of disk, two completely different promises¶
EBS (Elastic Block Store) is network-attached storage. Think of it as a virtual SAN LUN — it persists independently of the instance. Stop the instance, start it again, your data is still there. You can snapshot it, resize it, detach it, and reattach it to a different instance.
Instance store is a physical NVMe SSD bolted to the host machine your VM runs on. Blazing fast. Free (included in the instance price). And completely ephemeral — data is lost when the instance stops, terminates, or when the underlying hardware fails.
Instance Store: EBS:
├── Blazing fast (local NVMe) ├── Persistent (survives stop/start)
├── Free (included in price) ├── Costs per GB/month + IOPS
├── DATA LOST on stop/terminate ├── Snapshots for backup
├── Cannot be detached ├── Can resize, change type online
└── Fixed size per instance type └── Up to 64 TiB per volume
War Story: A team ran a self-managed Elasticsearch cluster on i3 instances for the local NVMe performance. When AWS performed scheduled maintenance and stopped the instances, all data on the instance store volumes vanished. They had no replicas configured because "we had three nodes." All three were on the same maintenance schedule. The cluster came back empty. Recovery took days of re-indexing from source systems. The lesson: instance store data is not "probably safe" — it is guaranteed to disappear on any stop event.
Here's the survival rule:
Instance store data survives:      Instance store data is LOST on:
└── Reboots (only)                 ├── Stop
                                   ├── Terminate
                                   ├── Hardware failure
                                   └── Hibernate
Gotcha: Instance types with a d suffix (like c5d.xlarge, i3.2xlarge) include instance store volumes. If someone launches a c5d for the CPU and doesn't realize the d means "ephemeral NVMe attached," they might store data on it by accident. Check lsblk after launch — instance store volumes show up as extra NVMe devices.
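A quick way to make that check concrete — run on the instance itself. This is a sketch: the model strings below are what Nitro instances report, and `nvme list` assumes the nvme-cli package is installed.

```shell
# On Nitro instances, both EBS and instance store appear as /dev/nvme*.
# The MODEL column tells them apart.
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT

# EBS volumes report "Amazon Elastic Block Store";
# instance store volumes report "Amazon EC2 NVMe Instance Storage".
sudo nvme list | grep -qi "Instance Storage" \
  && echo "WARNING: ephemeral instance store attached — do not keep state here"
```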
EBS volume types — the decision tree¶
Not all EBS volumes are created equal. Here's how to choose:
Need persistent block storage?
│
├── Yes → Is it a database needing consistent, high IOPS?
│ ├── Yes → io2 (up to 64,000 IOPS, $$$)
│ └── No → gp3 (3,000 baseline IOPS, tune up to 16,000)
│
├── Sequential reads? (logs, big data, streaming)
│ └── st1 (throughput-optimized HDD, up to 500 MiB/s)
│
└── Cold archive? (backups, rarely accessed)
└── sc1 (cold HDD, cheapest, up to 250 MiB/s)
| Type | IOPS | Throughput | Cost model | Best for |
|---|---|---|---|---|
| gp3 | 3,000 base → 16,000 | 125 → 1,000 MiB/s | GB + provisioned IOPS/throughput | 90% of workloads |
| io2 | up to 64,000 | up to 1,000 MiB/s | GB + provisioned IOPS | Databases needing guarantees |
| st1 | N/A (throughput-based) | 40 MiB/s per TiB → 500 | GB only | Sequential big data |
| sc1 | N/A (throughput-based) | 12 MiB/s per TiB → 250 | GB only (cheapest) | Cold storage |
Under the Hood: gp3 decoupled IOPS and throughput from volume size — a major improvement over gp2, where you got 3 IOPS per GB and had to over-provision volume size just to get more IOPS. With gp3, a 100 GB volume can deliver 16,000 IOPS if you provision them. Check your existing gp2 volumes — migrating to gp3 almost always saves money.
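The gp2-to-gp3 migration mentioned above is a single online call — a sketch with a placeholder volume ID:

```shell
# Convert an existing gp2 volume to gp3 in place — no detach, no downtime.
# The gp3 baseline (3,000 IOPS / 125 MiB/s) applies unless you provision more.
aws ec2 modify-volume \
  --volume-id vol-0a1b2c3d4e5f \
  --volume-type gp3

# Track the modification until it completes
aws ec2 describe-volumes-modifications \
  --volume-ids vol-0a1b2c3d4e5f \
  --query 'VolumesModifications[0].ModificationState'
```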
# Create a gp3 volume with custom IOPS and throughput
aws ec2 create-volume \
--volume-type gp3 \
--size 100 \
--iops 6000 \
--throughput 400 \
--availability-zone us-east-1a
# Modify an existing volume (online — no downtime)
aws ec2 modify-volume --volume-id vol-0a1b2c3d4e5f --size 200 --iops 8000
Gotcha: Your EBS volume might be capable of 16,000 IOPS, but your instance type caps EBS throughput. A t3.large maxes out at 15,000 IOPS regardless of what the volume can do. You're paying for performance you can never use. Always check describe-instance-types for EBS limits before provisioning expensive io2 volumes.
Part 3: The Stateful Firewall — Security Groups¶
Security groups are EC2's firewall, and they have one property that changes everything: they are stateful.
Stateful means: if you allow traffic in, the response is automatically allowed out. If you allow traffic out, the response is automatically allowed in. You don't need to write rules for both directions.
Compare this to NACLs (Network Access Control Lists), which are stateless — you must explicitly allow both the request and the response, including ephemeral port ranges.
Security Group (stateful): NACL (stateless):
├── Allow rules only (no Deny) ├── Allow AND Deny rules
├── Return traffic auto-allowed ├── Must allow return traffic explicitly
├── Operates at ENI level ├── Operates at subnet level
└── All rules evaluated together └── Rules evaluated by number (lowest first)
# Create a security group for a web server
aws ec2 create-security-group \
--group-name web-server-sg \
--description "Allow HTTP/HTTPS and SSH" \
--vpc-id vpc-0a1b2c3d
# Allow inbound HTTP from anywhere
aws ec2 authorize-security-group-ingress \
--group-id sg-0a1b2c3d \
--protocol tcp --port 80 --cidr 0.0.0.0/0
# Allow inbound SSH from your office only
aws ec2 authorize-security-group-ingress \
--group-id sg-0a1b2c3d \
--protocol tcp --port 22 --cidr 203.0.113.0/24
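Security groups can also reference other security groups instead of CIDR ranges — a sketch of the common tiered pattern (group IDs are placeholders):

```shell
# Allow the database SG to accept MySQL traffic only from instances
# that carry the web-server SG — no IP ranges to maintain.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db0a1b2c3 \
  --protocol tcp --port 3306 \
  --source-group sg-0a1b2c3d
```

When instances scale in and out, membership in the referenced group is what matters, so the rule never goes stale.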
Gotcha: Security group changes take effect immediately on all instances using that group. There is no "pending" state, no deployment window, no rollback button. Remove the wrong rule in production and every active database connection using that SG drops instantly. Always add the new rule first, verify, then remove the old one.
Mental Model: Think of a security group as a bouncer at a club with a guest list (allow rules only). Anyone on the list gets in, and once inside, they can leave freely. A NACL is more like a border checkpoint — they check you on the way in AND on the way out, and they have a "banned" list (deny rules).
Part 4: IMDSv2 — Why the Metadata Service Matters¶
Every EC2 instance can query a special HTTP endpoint at 169.254.169.254 to learn about
itself: its instance ID, IP address, IAM role credentials, user data script, and more.
This is the Instance Metadata Service (IMDS).
Under the Hood: 169.254.169.254 is a link-local address. It's not a real server on the network — the Nitro hypervisor intercepts packets to this IP and responds directly. That's why it works even without a default gateway configured.
Why v1 is dangerous¶
IMDSv1 is a simple unauthenticated GET request:
# IMDSv1 — any process on the instance can do this
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role
# Returns: temporary AWS credentials (access key, secret key, session token)
If your application has a Server-Side Request Forgery (SSRF) vulnerability — where an attacker can make the application send HTTP requests to arbitrary URLs — they can steal IAM credentials from the metadata service and use them from anywhere.
War Story: This is exactly what happened in the 2019 Capital One breach. An SSRF vulnerability in a misconfigured WAF allowed an attacker to query the metadata service, retrieve IAM role credentials, and access over 100 million customer records in S3. This single breach led AWS to develop IMDSv2 and eventually make it the default for new instance types. The attack vector was well-known in the security community for years before it was exploited at scale.
IMDSv2: the fix¶
IMDSv2 requires a two-step process: first a PUT request to get a session token, then use that token in subsequent requests. SSRF attacks typically can't make PUT requests, which blocks the attack vector.
# Step 1: Get a session token (PUT request — SSRF can't do this)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# Step 2: Use the token for metadata queries
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-type
# Check IAM role credentials (for debugging "who am I?")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/
The hop limit¶
IMDSv2 also enforces a hop limit. By default, the IP TTL on the PUT response packet is set to 1, so the response can't traverse a network hop. This means containers running on the instance can't reach the metadata service through a NAT (such as Docker's default bridge network) — which is exactly the point. You control the hop limit:
# Enforce IMDSv2 and set hop limit (1 = instance only, 2 = containers can reach it)
aws ec2 modify-instance-metadata-options \
--instance-id i-0a1b2c3d4e5f \
--http-tokens required \
--http-endpoint enabled \
--http-put-response-hop-limit 2
# Find instances still allowing IMDSv1
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].[InstanceId,Tags[?Key==`Name`].Value|[0]]' \
--output table
Remember: Set --http-tokens required in all your launch templates. New instances should never allow IMDSv1. For existing instances, audit with the query above and remediate. Use the AWS Config rule ec2-imdsv2-check for continuous monitoring.
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What happens to instance store data when you stop an instance? | It's lost. Permanently. No recovery. |
| Which EBS volume type should you use for 90% of workloads? | gp3 — tune IOPS and throughput independently of size |
| Why are security groups "stateful"? | Return traffic is automatically allowed — no need to write rules for responses |
| Why is IMDSv1 dangerous? | Simple GET request — SSRF vulnerabilities can steal IAM credentials |
| What does --http-put-response-hop-limit 2 do? | Allows containers on the instance to reach the metadata service (token survives one extra hop) |
Part 5: User Data and Cloud-Init¶
When an instance boots, you can hand it a script that runs automatically. This is user data — and cloud-init is the engine that executes it.
#!/bin/bash
# user-data.sh — runs once at first boot (by default)
yum update -y
yum install -y nginx
systemctl enable nginx
systemctl start nginx
# Pull config from metadata service (IMDSv2)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/placement/availability-zone)
echo "INSTANCE_ID=$INSTANCE_ID" >> /etc/app.conf
echo "AZ=$AZ" >> /etc/app.conf
# Launch an instance with user data
aws ec2 run-instances \
--image-id ami-0a1b2c3d4e5f67890 \
--instance-type m7g.large \
--security-group-ids sg-web \
--subnet-id subnet-pub1a \
--user-data file://user-data.sh \
--iam-instance-profile Name=ec2-app-profile \
--metadata-options "HttpTokens=required,HttpEndpoint=enabled"
Gotcha: User data runs as root but its output goes to /var/log/cloud-init-output.log, not to your terminal. When an instance launches but your app isn't working, this log is the first place to check. The second most common issue: user data must be base64-encoded at the API level. The console and --user-data file:// handle the encoding for you, but raw API calls and some SDK usage do not — if your script never executes, check the encoding.
# Debug: retrieve and decode user data from a running instance
aws ec2 describe-instance-attribute \
--instance-id i-0a1b2c3d4e5f \
--attribute userData \
--query 'UserData.Value' --output text | base64 -d
Part 6: Placement Groups — Controlling Physical Topology¶
Sometimes you need to control where your instances physically sit in the data center. Placement groups give you three strategies:
| Type | What it does | Limit | Use when |
|---|---|---|---|
| Cluster | Packs instances close together (same rack) for low latency, high throughput | Single AZ only | HPC, tightly-coupled workloads |
| Spread | Each instance on different hardware | Max 7 per AZ | Critical instances that must survive hardware failure |
| Partition | Groups on separate racks | Up to 7 partitions per AZ | Large distributed systems (HDFS, Cassandra, Kafka) |
Mental Model: Cluster = huddle together for speed. Spread = scatter for survival. Partition = organized groups on separate foundations.
The tradeoff is clear: cluster gives you sub-millisecond latency between nodes, but if the rack loses power, everything goes down. Spread guarantees hardware isolation, but limits you to 7 instances per AZ.
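Creating a placement group and launching into it is a two-step sketch — the group name, AMI, and instance type are placeholders:

```shell
# Create a spread placement group for hardware isolation
aws ec2 create-placement-group \
  --group-name critical-spread \
  --strategy spread

# Launch an instance into the group
aws ec2 run-instances \
  --image-id ami-0a1b2c3d4e5f67890 \
  --instance-type m7g.large \
  --placement "GroupName=critical-spread"
```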
Part 7: Spot Instances — 90% Cheaper, With a Catch¶
Spot instances are spare EC2 capacity sold at a steep discount — up to 90% off On-Demand pricing. The catch: AWS can reclaim them with two minutes' notice.
Trivia: Spot instances launched in December 2009, modeled after wholesale electricity markets where utilities bid on excess generating capacity. The original model was a literal auction — you set a bid price, and if the spot price exceeded your bid, your instance was terminated. AWS changed to a flat discount model in November 2017 — prices still fluctuate with supply and demand, but you no longer bid. The --spot-price parameter still exists as a price ceiling, not a bid.
When to use spot (and when absolutely not)¶
Good for spot: batch processing, CI/CD runners, data pipelines, stateless web servers behind a load balancer, dev/test environments, any workload that can handle interruption.
Never use spot for: single-instance databases, stateful services without replication, anything where 2 minutes isn't enough to drain gracefully.
Handling interruption¶
When AWS reclaims a spot instance, it posts a notice to the instance metadata 2 minutes before termination:
# Check for interruption notice (poll this in a background loop)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# Returns 404 normally, 200 with action details when interrupted
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
-H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/spot/instance-action)
if [ "$RESPONSE" = "200" ]; then
echo "SPOT INTERRUPTION — initiating graceful shutdown"
# 1. Deregister from load balancer
# 2. Stop accepting new requests
# 3. Finish in-flight work
# 4. Checkpoint state to S3 or EFS
fi
Mixed ASG — the production pattern¶
The real-world approach to spot: use an Auto Scaling Group with a mixed instances policy that combines On-Demand and spot across multiple instance types and AZs.
# Create ASG with mixed instances (70% spot, 30% on-demand)
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name api-fleet \
--mixed-instances-policy '{
"LaunchTemplate": {
"LaunchTemplateSpecification": {
"LaunchTemplateName": "api-template",
"Version": "$Latest"
},
"Overrides": [
{"InstanceType": "m7g.large"},
{"InstanceType": "m6g.large"},
{"InstanceType": "m7i.large"},
{"InstanceType": "c7g.large"}
]
},
"InstancesDistribution": {
"OnDemandBaseCapacity": 2,
"OnDemandPercentageAboveBaseCapacity": 30,
"SpotAllocationStrategy": "capacity-optimized"
}
}' \
--min-size 4 --max-size 20 --desired-capacity 6 \
--vpc-zone-identifier "subnet-priv1a,subnet-priv1b,subnet-priv1c"
Why multiple instance types? If you only request m7g.large spot instances, and that
specific pool runs out in one AZ, your capacity drops. Diversifying across types and AZs
dramatically reduces interruption probability.
Remember: capacity-optimized is the recommended spot allocation strategy. It picks the pools with the most available capacity, reducing the chance of interruption. The older lowest-price strategy saved pennies but got interrupted far more often.
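To see how the diversified pools are priced right now, you can pull spot price history — the instance types here match the Overrides above, and the output columns are illustrative:

```shell
# Current spot prices for the diversified types, per AZ
aws ec2 describe-spot-price-history \
  --instance-types m7g.large m6g.large m7i.large c7g.large \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].[InstanceType,AvailabilityZone,SpotPrice]' \
  --output table
```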
Part 8: Auto Scaling — The Self-Healing Fleet¶
An Auto Scaling Group (ASG) maintains a fleet of instances. It launches new ones when demand rises, terminates excess when demand falls, and replaces unhealthy instances automatically.
Scaling policy types¶
| Policy | How it works | Best for |
|---|---|---|
| Target tracking | "Keep CPU at 60%" — ASG adjusts automatically | Most workloads (start here) |
| Step scaling | "At 70% CPU add 2, at 90% add 5" — tiered response | Fine-grained control |
| Scheduled | "Scale to 10 at 8am, back to 3 at 10pm" | Predictable traffic patterns |
| Predictive | ML-based forecasting from historical patterns | Recurring spikes (daily, weekly) |
# Target tracking: keep average CPU at 60%
aws autoscaling put-scaling-policy \
--auto-scaling-group-name api-fleet \
--policy-name cpu-target-60 \
--policy-type TargetTrackingScaling \
--target-tracking-configuration '{
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ASGAverageCPUUtilization"
},
"TargetValue": 60.0,
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
Under the Hood: Notice the asymmetric cooldowns — scale out waits only 60 seconds (respond to demand quickly) but scale in waits 300 seconds (avoid thrashing). This asymmetry is intentional: it's better to have one extra instance for 5 minutes than to keep adding and removing instances every minute.
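The scheduled policy from the table above is its own API call. A sketch of the business-hours pattern against the api-fleet group — the recurrence times, time zone, and capacities are illustrative:

```shell
# Pre-scale before the morning ramp (cron syntax, evaluated in --time-zone)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name api-fleet \
  --scheduled-action-name morning-scale-out \
  --recurrence "0 8 * * MON-FRI" \
  --time-zone "America/New_York" \
  --desired-capacity 10

# Scale back down after business hours
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name api-fleet \
  --scheduled-action-name evening-scale-in \
  --recurrence "0 22 * * MON-FRI" \
  --time-zone "America/New_York" \
  --desired-capacity 3
```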
Health checks — the self-healing mechanism¶
ASGs can use two types of health checks:
- EC2 health checks: Is the VM running and passing system/instance status checks?
- ELB health checks: Is the application responding to the load balancer's health endpoint?
ELB health checks are what you want in production. An instance can be "running" (passing EC2 checks) while the application on it is crashed, deadlocked, or out of memory. ELB health checks catch this.
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name api-fleet \
--health-check-type ELB \
--health-check-grace-period 300
Gotcha: The health-check-grace-period (default: 300 seconds) is how long the ASG waits before checking a new instance's health. If your app takes 4 minutes to boot and this is set to 60 seconds, the ASG will terminate the instance for being "unhealthy" before it finishes starting — then launch a new one, which also gets terminated. You're stuck in a launch-terminate loop. Set the grace period longer than your worst-case boot time.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| How much notice do you get before a spot instance is reclaimed? | 2 minutes, via the instance metadata endpoint |
| What spot allocation strategy should you use? | capacity-optimized — picks pools with most available capacity |
| Why use asymmetric cooldowns in ASG scaling? | Scale out fast (respond to demand), scale in slow (avoid thrashing) |
| What's the risk of setting health-check-grace-period too low? | New instances get terminated before they finish booting — launch-terminate loop |
Part 9: EC2 Networking — ENIs, Enhanced Networking, and EFA¶
Every EC2 instance has at least one Elastic Network Interface (ENI). Think of it as a virtual network card. It carries:
- One primary private IP (mandatory)
- Optionally, secondary private IPs
- Optionally, a public IP or Elastic IP
- One or more security groups
- A MAC address
You can attach multiple ENIs to an instance (up to the limit for its type) for multi-homed configurations, management networks, or failover patterns.
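Attaching a second ENI is a two-call sketch — the subnet, security group, and instance IDs are placeholders:

```shell
# Create an ENI in a management subnet with its own security group
ENI_ID=$(aws ec2 create-network-interface \
  --subnet-id subnet-mgmt1a \
  --groups sg-mgmt \
  --description "management interface" \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)

# Attach it as the instance's second interface (device index 1)
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id i-0a1b2c3d4e5f \
  --device-index 1
```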
Enhanced networking¶
Modern instance types support Enhanced Networking, which bypasses the host CPU for network processing. You get higher bandwidth, lower latency, and lower jitter. On Nitro instances, this is enabled by default via the Elastic Network Adapter (ENA) driver.
# Verify enhanced networking is enabled
aws ec2 describe-instances --instance-ids i-0a1b2c3d4e5f \
--query 'Reservations[0].Instances[0].EnaSupport'
# Should return: true
Elastic Fabric Adapter (EFA)¶
For HPC workloads that need MPI (Message Passing Interface) or NCCL (GPU-to-GPU communication), EFA provides OS-bypass networking — the application talks directly to the network hardware, skipping the kernel network stack entirely. This cuts inter-node latency to single-digit microseconds for tightly-coupled parallel computing.
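Before designing around EFA, it's worth confirming the instance type supports it — a sketch using the EfaSupported field from the describe-instance-types response:

```shell
# Does this instance type support the Elastic Fabric Adapter?
aws ec2 describe-instance-types \
  --instance-types c7gn.xlarge \
  --query 'InstanceTypes[0].NetworkInfo.EfaSupported'
```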
Part 10: Monitoring — Seeing What's Happening¶
CloudWatch basics¶
CloudWatch collects EC2 metrics automatically, but the default resolution is 5 minutes. For production troubleshooting, you want detailed monitoring (1-minute resolution).
# Enable detailed monitoring (1-minute metrics)
aws ec2 monitor-instances --instance-ids i-0a1b2c3d4e5f
# Key metrics to watch
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 --metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0a1b2c3d4e5f \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
--period 60 --statistics Average,Maximum
Status checks — the two-tier system¶
EC2 runs two types of automated status checks:
| Check | What it tests | Failure means |
|---|---|---|
| System status | Underlying host hardware and hypervisor | AWS problem — stop/start to migrate to new host |
| Instance status | Guest OS networking, kernel, filesystem | Your problem — check console output, fix config |
# Check both status checks at once
aws ec2 describe-instance-status --instance-ids i-0a1b2c3d4e5f \
--query 'InstanceStatuses[0].{
System: SystemStatus.Status,
Instance: InstanceStatus.Status,
State: InstanceState.Name
}'
If SystemStatus is impaired, stop and start the instance to move it to healthy hardware.
If InstanceStatus is impaired, check the console output for OS-level errors:
aws ec2 get-console-output --instance-id i-0a1b2c3d4e5f --latest \
--output text | tail -50
# Look for: kernel panic, disk full, sshd failure, fsck errors
Gotcha: describe-instance-status only returns results for running instances by default. If your instance is stopped, add --include-all-instances or you'll get an empty response and think the instance doesn't exist.
Part 11: The Troubleshooting Ladder — Your Instance Is Unreachable¶
Back to our mission. The instance is unreachable. Here's the systematic approach — work through it in order, don't skip steps. Each step rules out one layer.
Step 1: Is the instance actually running?¶
aws ec2 describe-instance-status --instance-ids i-0a1b2c3d4e5f \
--query 'InstanceStatuses[0].{
State: InstanceState.Name,
SystemCheck: SystemStatus.Status,
InstanceCheck: InstanceStatus.Status
}'
"OK, it says running and both status checks pass. Good — the VM is alive. Let me check if it has an IP I can reach."
Step 2: Does it have a reachable IP?¶
aws ec2 describe-instances --instance-ids i-0a1b2c3d4e5f \
--query 'Reservations[0].Instances[0].{
PublicIP: PublicIpAddress,
PrivateIP: PrivateIpAddress,
SubnetId: SubnetId,
VpcId: VpcId
}'
"Public IP is present. The subnet is in our production VPC. Let me check if the security group allows my traffic."
Step 3: Does the security group allow the traffic?¶
SG=$(aws ec2 describe-instances --instance-ids i-0a1b2c3d4e5f \
--query 'Reservations[0].Instances[0].SecurityGroups[0].GroupId' --output text)
aws ec2 describe-security-groups --group-ids "$SG" \
--query 'SecurityGroups[0].IpPermissions[?FromPort==`443`]'
"Port 443 is allowed from 0.0.0.0/0. Security group looks fine. What about the NACL?"
Step 4: Is the NACL blocking traffic?¶
SUBNET=$(aws ec2 describe-instances --instance-ids i-0a1b2c3d4e5f \
--query 'Reservations[0].Instances[0].SubnetId' --output text)
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=$SUBNET" \
--query 'NetworkAcls[0].Entries[?PortRange.From<=`443` && PortRange.To>=`443`]'
"NACL allows the traffic. Networking looks clean from the AWS side. Let me check the route table."
Step 5: Does the route table have a path?¶
# For public subnet: needs 0.0.0.0/0 -> igw-xxx
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=$SUBNET" \
--query 'RouteTables[0].Routes'
Step 6: Check the console output for OS-level problems¶
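This step reuses the console-output command from Part 10:

```shell
# Read the instance's serial console output — works even when SSH is dead
aws ec2 get-console-output --instance-id i-0a1b2c3d4e5f --latest \
  --output text | tail -50
```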
"And there it is — No space left on device. The root disk filled up. Sshd can't
write to its temp files, so it's refusing connections. The application is probably
running but can't write to logs."
Step 7: Get in through the back door¶
If SSH is dead, use SSM Session Manager — it doesn't need SSH, open ports, or key pairs:
aws ssm start-session --target i-0a1b2c3d4e5f
# Requires: SSM agent running + instance profile with AmazonSSMManagedInstanceCore
Once in, clean up the disk:
# Find what's eating space
df -h # See which filesystem is full
du -sh /var/log/* | sort -rh | head -10 # Find the largest log files
journalctl --disk-usage # Check systemd journal size
The troubleshooting ladder (summary)¶
1. Instance state + status checks → Is the VM alive?
2. IP address + subnet → Can I route to it?
3. Security group rules → Is the firewall allowing traffic?
4. NACL rules → Is the subnet-level ACL allowing traffic?
5. Route table → Does a path exist?
6. Console output → Is the OS healthy?
7. SSM Session Manager → Get in without SSH
Mental Model: Troubleshoot outside-in: start from the AWS control plane (can I even see the instance?), then networking layer by layer (SG, NACL, routes), then the OS itself (console output, SSM). Each step eliminates one possible failure domain.
Part 12: The Instance Lifecycle — What Happens on Stop/Start¶
This is the part that catches people. Stop/start is not the same as reboot.
Stop/Start: Reboot:
├── Instance moves to new host ├── Same host
├── Public IP changes ├── Same IP
├── Instance store wiped ├── Instance store preserved
├── New hardware underneath ├── Same hardware
└── EBS data preserved └── EBS data preserved
# Reboot (safe — same host, same everything)
aws ec2 reboot-instances --instance-ids i-0a1b2c3d4e5f
# Stop (instance store data LOST, public IP changes)
aws ec2 stop-instances --instance-ids i-0a1b2c3d4e5f
aws ec2 wait instance-stopped --instance-ids i-0a1b2c3d4e5f
# Start (new host, potentially new public IP)
aws ec2 start-instances --instance-ids i-0a1b2c3d4e5f
Gotcha: The root EBS volume has DeleteOnTermination set to true by default. If someone runs terminate-instances instead of stop-instances, the root volume is deleted too. Protect critical instances:
# Enable termination protection
aws ec2 modify-instance-attribute \
  --instance-id i-0a1b2c3d4e5f \
  --disable-api-termination
# Set root volume to persist on termination
aws ec2 modify-instance-attribute \
  --instance-id i-0a1b2c3d4e5f \
  --block-device-mappings \
  '[{"DeviceName":"/dev/xvda","Ebs":{"DeleteOnTermination":false}}]'
Exercises¶
Exercise 1: Decode Instance Types (2 minutes)¶
Without looking at the reference tables, decode these instance types. Write down what each character means:
1. `r6g.2xlarge`
2. `t3.micro`
3. `i4i.metal`
4. `p5.48xlarge`
Answers
1. `r6g.2xlarge` — Memory-optimized, 6th gen, Graviton, 2x large (8 vCPU, 64 GiB)
2. `t3.micro` — Burstable, 3rd gen, Intel, micro (2 vCPU, 1 GiB)
3. `i4i.metal` — Storage-optimized, 4th gen, Intel, bare metal (all cores/RAM on host)
4. `p5.48xlarge` — GPU/accelerated, 5th gen, 48x large (massive GPU training instance)
An EC2 instance is unreachable. Without looking at Part 11, write down the 7-step troubleshooting ladder in order. For each step, write the AWS CLI command you'd use.
Check your answer
1. Check instance state + status checks (`describe-instance-status`)
2. Verify IP address + subnet (`describe-instances`)
3. Check security group rules (`describe-security-groups`)
4. Check NACL rules (`describe-network-acls`)
5. Check route table (`describe-route-tables`)
6. Read console output (`get-console-output`)
7. Use SSM Session Manager as backup access (`ssm start-session`)

Exercise 3: Design a Spot-Safe Architecture (10 minutes)¶
You're running a stateless API behind an ALB. Traffic is predictable during business hours (8am–6pm) but has occasional spikes. Current fleet is 6 x m7g.large on-demand.
Design a mixed ASG that: - Keeps at least 2 on-demand instances for baseline - Uses spot for the rest - Handles spot interruptions gracefully - Diversifies across instance types
What instance types would you include in `Overrides`? What would your `OnDemandBaseCapacity` and `OnDemandPercentageAboveBaseCapacity` be? What `SpotAllocationStrategy` would you choose?
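You can sanity-check candidate numbers by computing the on-demand/spot split they imply. A sketch only: the rounding direction for fractional counts is an assumption here (verify against the EC2 Auto Scaling documentation), and the function name is hypothetical:

```python
import math

def capacity_split(desired, on_demand_base, pct_above_base):
    """Split a desired capacity into on-demand and spot counts,
    following a MixedInstancesPolicy.

    Assumption: fractional on-demand counts round up (check the
    ASG docs for the exact rounding behavior).
    """
    base = min(on_demand_base, desired)
    above = desired - base
    on_demand_above = math.ceil(above * pct_above_base / 100)
    return {"on_demand": base + on_demand_above,
            "spot": above - on_demand_above}
```

For example, a desired capacity of 12 with a base of 2 and 20% above base yields 4 on-demand and 8 spot instances under this model.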
Suggested approach
- `OnDemandBaseCapacity`: 2 (guaranteed baseline)
- `OnDemandPercentageAboveBaseCapacity`: 20 (80% of additional instances are spot)
- `SpotAllocationStrategy`: `capacity-optimized`
- Overrides: `m7g.large`, `m6g.large`, `c7g.large`, `m7i.large` (mix Graviton and Intel, mix general and compute — all similar specs, different spot pools)
- Health check type: ELB (catch application-level failures)
- Grace period: match your app's boot time + buffer

For the predictable pattern, add a scheduled scaling action to pre-scale before 8am.

Cheat Sheet¶
| Task | Command |
|---|---|
| Decode instance type specs | aws ec2 describe-instance-types --instance-types TYPE |
| Check instance status | aws ec2 describe-instance-status --instance-ids ID |
| View console output | aws ec2 get-console-output --instance-id ID --latest |
| Inspect security group | aws ec2 describe-security-groups --group-ids SG-ID |
| Check EBS limits for instance type | aws ec2 describe-instance-types --instance-types TYPE --query 'InstanceTypes[].EbsInfo' |
| Enforce IMDSv2 | aws ec2 modify-instance-metadata-options --instance-id ID --http-tokens required |
| Find IMDSv1 instances | aws ec2 describe-instances --query "Reservations[].Instances[?MetadataOptions.HttpTokens!='required'].InstanceId" |
| Enable termination protection | aws ec2 modify-instance-attribute --instance-id ID --disable-api-termination |
| Emergency access (no SSH) | aws ssm start-session --target ID |
| Check spot interruption | curl -s -H "X-aws-ec2-metadata-token: $TOKEN" .../spot/instance-action |
| Get IMDSv2 token | curl -s -X PUT ".../api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" |
EBS quick reference:
| Need | Volume type | Key spec |
|---|---|---|
| General purpose | gp3 | 3,000–16,000 IOPS, 125–1,000 MiB/s |
| High IOPS database | io2 | Up to 64,000 IOPS |
| Sequential throughput | st1 | Up to 500 MiB/s |
| Cold archive | sc1 | Cheapest per GB |
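The gp3 row's ranges can be captured in a small validator. A hypothetical helper using only the limits from the table above:

```python
GP3_IOPS_RANGE = (3_000, 16_000)      # baseline to max provisionable IOPS
GP3_THROUGHPUT_RANGE = (125, 1_000)   # MiB/s

def validate_gp3(iops, throughput_mibs):
    """Return a list of problems with a requested gp3 configuration.

    Based only on the published gp3 ranges; does not model the
    coupling AWS enforces between IOPS and throughput.
    """
    problems = []
    lo, hi = GP3_IOPS_RANGE
    if not lo <= iops <= hi:
        problems.append(f"IOPS {iops} outside gp3 range {lo}-{hi}")
    lo, hi = GP3_THROUGHPUT_RANGE
    if not lo <= throughput_mibs <= hi:
        problems.append(f"throughput {throughput_mibs} MiB/s outside gp3 range {lo}-{hi}")
    return problems
```

A request like 20,000 IOPS at 50 MiB/s fails both checks; the gp3 defaults (3,000 IOPS, 125 MiB/s) pass cleanly.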
Instance family decoder:
| Letter | Optimized for | Mnemonic |
|---|---|---|
| m | General purpose | Most workloads |
| c | CPU-intensive | Compute |
| r | Memory-intensive | RAM |
| t | Variable/burstable | Tiny-burst |
| i | Storage I/O | I/O |
| p/g | GPU acceleration | Parallel / GPU |
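The decoder table maps directly onto a tiny parser. A sketch with deliberate simplifications: it handles single-letter families only (real multi-letter families like `inf` or `trn` need more care), and it assumes that when a processor letter is present it comes immediately after the generation digit:

```python
import re

FAMILIES = {"m": "general purpose", "c": "compute-optimized",
            "r": "memory-optimized", "t": "burstable",
            "i": "storage-optimized", "p": "GPU/accelerated",
            "g": "GPU/accelerated"}
PROCESSORS = {"g": "Graviton (ARM)", "i": "Intel", "a": "AMD"}

def decode(instance_type):
    """Split an instance type like 'c7gn.xlarge' into its parts."""
    prefix, size = instance_type.split(".")
    family, generation, extras = re.match(r"^([a-z])(\d+)([a-z]*)$", prefix).groups()
    # Processor letter, if present, is assumed to lead the suffix;
    # anything after it is an additional capability (e.g. n = network).
    if extras[:1] in PROCESSORS:
        processor, capabilities = PROCESSORS[extras[0]], extras[1:]
    else:
        processor, capabilities = "Intel (default)", extras
    return {"family": FAMILIES.get(family, "unknown"),
            "generation": int(generation),
            "processor": processor,
            "capabilities": capabilities,
            "size": size}
```

Run it against the exercise answers above to check your decoding; `decode("c7gn.xlarge")` reports a 7th-generation, compute-optimized, Graviton, network-enhanced xlarge.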
Takeaways¶
- Every character in an instance type name is a decision — family, generation, processor, capabilities, size. Learn to read them and you'll never pick blindly.
- Instance store is ephemeral. Period. If you can't afford to lose it, it doesn't belong on instance store. Stop/start wipes it. Hardware failure wipes it. Reboot is the only safe operation.
- IMDSv2 is not optional — it prevents the exact attack vector that caused the Capital One breach. Enforce `--http-tokens required` on every instance and in every launch template.
- Troubleshoot outside-in — start from the AWS control plane (state, status checks), then networking layer by layer (SG, NACL, routes), then the OS (console output, SSM).
- Spot instances work when you diversify — across instance types, across AZs, with capacity-optimized allocation. Don't put your spot eggs in one pool.
- ASG `health-check-grace-period` must exceed boot time — otherwise you get a launch-terminate loop that looks like scaling is broken.
Related Lessons¶
- AWS VPC — The Network You Can't See — VPC subnets, routing, NACLs, and NAT gateways in depth
- The Disk That Filled Up — When storage problems cascade through logging, containers, and PVCs
- Connection Refused — Differential diagnosis across firewall, DNS, process, container, and Kubernetes layers
- AWS IAM — The Permissions Puzzle — Debugging access denied errors through IAM's evaluation logic
- The Cloud Bill Surprise — Finding and eliminating wasted cloud spend