# AWS VPC: The Network You Can't See

Topics: AWS VPC, subnets, route tables, security groups, NACLs, NAT gateway, internet gateway, VPC peering, Transit Gateway, VPC endpoints, flow logs, ENIs, elastic IPs
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (CIDR math is explained; Linux networking parallels are drawn)
## The Mission
Your company just landed a contract. The app is a classic 3-tier web application — load balancer, application servers, and a database. It needs to run on AWS. Your job: build the production VPC from scratch. Public-facing load balancer, private application servers, database in an isolated subnet, outbound internet access for package updates, and flow logs for the security team.
You could click around the console. But you need to understand what you're building. Because three months from now, at 2am, when the app can't reach the database and the on-call page says "connection timed out," you'll need to know which of the seven layers between the internet and your EC2 instance is broken.
This lesson builds that VPC piece by piece, then traces a packet through every hop.
## Part 1: The Empty VPC — Your Private Data Center
A VPC is a virtual network. It exists inside one AWS region. It is completely isolated from every other VPC by default — no traffic in, no traffic out, no routes between them. You own the address space, the routing, and the firewall rules.
aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
--tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=prod-vpc}]'
That /16 gives you 65,536 IP addresses. The VPC exists now, but it's an empty room — no doors, no windows, no wiring.
Name Origin: VPC stands for Virtual Private Cloud. The "private" is the key word. Before VPC launched in 2009, all EC2 instances ran in a shared flat network called EC2-Classic — no isolation between customers beyond security groups. VPC gave every account its own isolated network. EC2-Classic wasn't fully retired until August 2022.
### CIDR math — the 30-second version
If you've read The Subnet Calculator in Your Head, this is review. If not, here's the minimum you need:
- /16 = 65,536 addresses (10.0.0.0 – 10.0.255.255)
- /24 = 256 addresses (10.0.1.0 – 10.0.1.255)
- /28 = 16 addresses (smallest AWS allows)
- /32 = 1 address (a single host)
The CIDR number is how many bits are "network." The rest are "host." Bigger CIDR number =
smaller network. /16 is the largest VPC AWS allows; /28 is the smallest.
AWS reserves 5 IPs per subnet: the network address (.0), the VPC router (.1), the DNS server (.2), one held for future use (.3), and the last address in the subnet for broadcast (.255 in a /24). So a /24 gives you 251 usable addresses, not 256.
Remember: Mnemonic for AWS's 5 reserved IPs: "Never Very Dull, Future Broadcast" — Network (.0), VPC router (.1), DNS (.2), Future (.3), Broadcast (.255). A /28 subnet (16 IPs) gives you only 11 usable addresses.
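The reserved-IP math is easy to script. A minimal sketch in bash (the function name `usable_ips` is ours, not an AWS tool):

```shell
# Usable addresses in a subnet = 2^(32 - prefix) minus AWS's 5 reserved IPs.
usable_ips() { echo $(( (1 << (32 - $1)) - 5 )); }

usable_ips 24   # -> 251
usable_ips 28   # -> 11   (the smallest subnet AWS allows)
usable_ips 16   # -> 65531
```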
### Two things to enable immediately
aws ec2 modify-vpc-attribute --vpc-id vpc-abc123 \
--enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id vpc-abc123 \
--enable-dns-hostnames '{"Value":true}'
Without these, VPC endpoints won't resolve, private hosted zones won't work, and you'll spend an hour debugging DNS failures that have nothing to do with DNS.
## Part 2: Subnets — Carving Up the Space
A subnet lives in exactly one Availability Zone. A VPC spans the region, but subnets don't. This is how you get multi-AZ redundancy — same VPC, different subnets in different AZs.
For our 3-tier app, we need six subnets:
prod-vpc (10.0.0.0/16)
├── us-east-1a
│ ├── public-1a (10.0.1.0/24) ← load balancer
│ ├── private-1a (10.0.10.0/24) ← app servers
│ └── data-1a (10.0.20.0/24) ← database
├── us-east-1b
│ ├── public-1b (10.0.2.0/24) ← load balancer
│ ├── private-1b (10.0.11.0/24) ← app servers
│ └── data-1b (10.0.21.0/24) ← database
# Public subnets
aws ec2 create-subnet --vpc-id vpc-abc123 --cidr-block 10.0.1.0/24 \
--availability-zone us-east-1a \
--tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=public-1a}]'
aws ec2 create-subnet --vpc-id vpc-abc123 --cidr-block 10.0.2.0/24 \
--availability-zone us-east-1b \
--tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=public-1b}]'
# Private subnets (app tier)
aws ec2 create-subnet --vpc-id vpc-abc123 --cidr-block 10.0.10.0/24 \
--availability-zone us-east-1a \
--tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=private-1a}]'
# ... same pattern for private-1b, data-1a, data-1b
### Public vs. private — it's just routing
There is no "public subnet" checkbox. The difference is entirely in the route table. A public subnet has a route to an internet gateway. A private subnet doesn't. That's it.
Mental Model: Think of subnets like rooms in a building. A "public" room has a door to the street (internet gateway). A "private" room has no door to the street — but it might have a mail slot (NAT gateway) to send letters out without letting anyone in.
Linux parallel: This is the same concept as `ip route` on a Linux box. Your default route determines where packets go when there's no more specific match. A public subnet's "default route" points to the internet gateway; a private subnet's points to a NAT gateway (or nowhere).
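Route tables pick the most specific matching route, just like the kernel's routing table. Here is a toy longest-prefix-match in plain bash; the function names and example routes are illustrative only, since the real lookup happens inside AWS's network fabric:

```shell
ip_to_int() {                       # "10.0.5.9" -> 32-bit integer
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

route_lookup() {                    # route_lookup <dest-ip> "<cidr>=<target>" ...
  local dest=$(ip_to_int "$1"); shift
  local best_len=-1 best_target="blackhole"
  for route in "$@"; do
    local cidr=${route%%=*} target=${route#*=}
    local net=${cidr%/*} len=${cidr#*/}
    local mask=$(( len == 0 ? 0 : 0xFFFFFFFF << (32 - len) & 0xFFFFFFFF ))
    # Most specific matching prefix wins, exactly like a VPC route table.
    if (( (dest & mask) == ($(ip_to_int "$net") & mask) && len > best_len )); then
      best_len=$len best_target=$target
    fi
  done
  echo "$best_target"
}

# A public subnet's route table: the local VPC route plus a default route to the IGW.
route_lookup 10.0.5.9 "10.0.0.0/16=local" "0.0.0.0/0=igw-abc123"   # -> local
route_lookup 8.8.8.8  "10.0.0.0/16=local" "0.0.0.0/0=igw-abc123"   # -> igw-abc123
```

Swap the `0.0.0.0/0` target for a NAT gateway ID and you have a private subnet's table.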
## Part 3: The Internet Gateway — The Front Door
An internet gateway (IGW) provides bidirectional internet access. One per VPC. It's free — no hourly charge, no data processing fee.
# Create and attach
aws ec2 create-internet-gateway \
--tag-specifications 'ResourceType=internet-gateway,Tags=[{Key=Name,Value=prod-igw}]'
aws ec2 attach-internet-gateway \
--internet-gateway-id igw-abc123 --vpc-id vpc-abc123
The IGW exists, but no traffic flows through it yet. You need route tables.
### Route tables — the wiring
Every subnet gets associated with a route table. If you don't explicitly associate one, it uses the main route table (which by default only has the local VPC route).
# Create public route table
RTB_PUB=$(aws ec2 create-route-table --vpc-id vpc-abc123 \
--query 'RouteTable.RouteTableId' --output text)
# Add default route to IGW
aws ec2 create-route --route-table-id $RTB_PUB \
--destination-cidr-block 0.0.0.0/0 \
--gateway-id igw-abc123
# Associate public subnets
aws ec2 associate-route-table --route-table-id $RTB_PUB --subnet-id subnet-pub1a
aws ec2 associate-route-table --route-table-id $RTB_PUB --subnet-id subnet-pub1b
Now public-1a and public-1b can reach the internet. The route table says: "for
10.0.0.0/16, stay local; for everything else, go to the IGW."
Gotcha: A new subnet with no explicit route table association uses the main route table. The main route table has no IGW route by default. If you create a subnet and wonder why nothing can reach the internet, check the route table association first — it's the most common VPC misconfiguration.
# Check which route table a subnet uses
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-abc123" \
--query 'RouteTables[].RouteTableId'
# Empty result = using main route table (implicit association)
## Part 4: NAT Gateway — The One-Way Door
Your private subnets need outbound internet access (package updates, API calls, pulling container images) but should never be directly reachable from the internet. That's what a NAT gateway does — outbound only.
A NAT gateway must live in a public subnet (because it needs to reach the IGW) and needs an Elastic IP.
# Allocate Elastic IPs (one per AZ for HA)
EIP_1A=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
EIP_1B=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
# Create NAT gateways in public subnets
NAT_1A=$(aws ec2 create-nat-gateway --subnet-id subnet-pub1a \
--allocation-id $EIP_1A --query 'NatGateway.NatGatewayId' --output text)
NAT_1B=$(aws ec2 create-nat-gateway --subnet-id subnet-pub1b \
--allocation-id $EIP_1B --query 'NatGateway.NatGatewayId' --output text)
# Wait for them (takes 1-2 minutes)
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_1A $NAT_1B
Now wire the private subnets:
# Private route table for AZ-a
RTB_PRIV_1A=$(aws ec2 create-route-table --vpc-id vpc-abc123 \
--query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $RTB_PRIV_1A \
--destination-cidr-block 0.0.0.0/0 --nat-gateway-id $NAT_1A
aws ec2 associate-route-table --route-table-id $RTB_PRIV_1A --subnet-id subnet-priv1a
aws ec2 associate-route-table --route-table-id $RTB_PRIV_1A --subnet-id subnet-data1a
# Same for AZ-b with NAT_1B
RTB_PRIV_1B=$(aws ec2 create-route-table --vpc-id vpc-abc123 \
--query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $RTB_PRIV_1B \
--destination-cidr-block 0.0.0.0/0 --nat-gateway-id $NAT_1B
aws ec2 associate-route-table --route-table-id $RTB_PRIV_1B --subnet-id subnet-priv1b
aws ec2 associate-route-table --route-table-id $RTB_PRIV_1B --subnet-id subnet-data1b
### Why one NAT gateway per AZ?
A NAT gateway is a single-AZ resource. If you put one NAT gateway in us-east-1a and route
all private subnets through it, an AZ-a outage kills outbound internet for every private
subnet — even the ones in AZ-b that are running fine.
War Story: A team deployed a single NAT gateway to save $32/month. During an AZ degradation event, all private subnets across three AZs lost outbound connectivity. Container images couldn't be pulled, health checks failed, ECS tasks stopped launching, and the application went down hard. The incident lasted 47 minutes. The postmortem math: $32/month saved, $15,000 estimated revenue lost. The second NAT gateway was deployed the next morning. (This pattern is documented repeatedly in AWS architecture reviews and re:Invent talks on multi-AZ resilience.)
Linux parallel: A NAT gateway is conceptually identical to a Linux box doing `iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE`. It rewrites the source IP of outbound packets to its own public IP, and tracks connections to route responses back. AWS just made it a managed service.
### The cost reality
NAT gateway charges: ~$0.045/hour ($32/month just for existing) + $0.045/GB processed. A busy cluster pulling container images and shipping logs can easily push $500+/month through NAT.
Mitigation: use VPC endpoints for S3 and DynamoDB (free, covered later in this lesson).
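A back-of-envelope sketch of that bill, using the rates quoted above (730 hours is roughly one month; the traffic volumes are hypothetical):

```shell
# NAT gateway monthly cost = hours * hourly rate + GB processed * per-GB rate.
nat_monthly_cost() {   # nat_monthly_cost <gb-processed-per-month>
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", 730 * 0.045 + gb * 0.045 }'
}

nat_monthly_cost 0      # idle gateway: 32.85 -- you pay just for it existing
nat_monthly_cost 1000   # 1 TB through NAT: 77.85
```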
## Flashcard Check #1
Cover the answers. Test yourself.
| Question | Answer |
|---|---|
| What makes a subnet "public"? | Its route table has a route to an internet gateway |
| How many IPs does AWS reserve per subnet? | 5 (network, router, DNS, future, broadcast) |
| Can you attach multiple IGWs to one VPC? | No. One IGW per VPC. |
| Why deploy one NAT gateway per AZ? | Single-AZ NAT gateway creates a cross-AZ dependency; AZ failure kills all private subnets |
| NAT gateway monthly base cost? | ~$32/month ($0.045/hour) before data transfer |
## Part 5: Security Groups and NACLs — The Two Firewalls
Your VPC has two independent firewall layers. They look similar but work differently.
### Security groups — the bouncer at each door
A security group is attached to an ENI (elastic network interface) — which means it's per-instance, per-interface. It's stateful: if you allow inbound traffic on port 443, the return traffic is automatically allowed. You don't write a separate outbound rule for the response.
# Web tier: allow HTTPS from anywhere, HTTP for redirect
aws ec2 create-security-group --group-name web-alb-sg \
--description "ALB security group" --vpc-id vpc-abc123
aws ec2 authorize-security-group-ingress --group-id sg-alb \
--protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-alb \
--protocol tcp --port 80 --cidr 0.0.0.0/0
The real power: security groups can reference other security groups. Instead of allowing a CIDR range, you allow traffic from anything in a specific SG:
# App tier: only allow traffic FROM the ALB security group
aws ec2 create-security-group --group-name app-sg \
--description "App server SG" --vpc-id vpc-abc123
aws ec2 authorize-security-group-ingress --group-id sg-app \
--protocol tcp --port 8080 --source-group sg-alb
# Database tier: only allow traffic FROM the app security group
aws ec2 create-security-group --group-name db-sg \
--description "Database SG" --vpc-id vpc-abc123
aws ec2 authorize-security-group-ingress --group-id sg-db \
--protocol tcp --port 5432 --source-group sg-app
This is the pattern you want. The database only talks to the app tier. The app tier only receives traffic from the ALB. If someone compromises the ALB, they can reach the app servers on port 8080 — but not the database directly.
Linux parallel: Security groups are the AWS equivalent of `iptables` with connection tracking (`-m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT`). You define what's allowed in, and the kernel's connection tracker automatically allows the responses out. Same principle, managed service.
### NACLs — the gate around each subnet
A Network Access Control List operates at the subnet level and is stateless. Every packet is evaluated against the rules independently — the NACL doesn't know or care that this packet is a response to an earlier request.
| Feature | Security Group | NACL |
|---|---|---|
| Level | ENI (instance) | Subnet |
| State | Stateful | Stateless |
| Rules | Allow only | Allow AND Deny |
| Evaluation | All rules checked | First match wins (by rule number) |
| Default | Deny all inbound | Allow all both directions |
Gotcha: NACLs are stateless. If you add an inbound rule allowing port 443, you ALSO need an outbound rule allowing ephemeral ports (1024–65535) for the response. Forget this and traffic appears to be accepted (flow logs show ACCEPT on inbound) but the connection hangs because the response is silently dropped.
# NACL: allow HTTPS inbound
aws ec2 create-network-acl-entry --network-acl-id acl-abc123 \
--rule-number 100 --protocol tcp --port-range From=443,To=443 \
--cidr-block 0.0.0.0/0 --rule-action allow --ingress
# NACL: allow ephemeral ports outbound (for responses!)
aws ec2 create-network-acl-entry --network-acl-id acl-abc123 \
--rule-number 100 --protocol tcp --port-range From=1024,To=65535 \
--cidr-block 0.0.0.0/0 --rule-action allow --egress
In practice, many production teams leave NACLs at their defaults (allow all) and do all filtering at the security group level. NACLs shine when you need to deny specific traffic (something security groups can't do) — like blocking a known-bad IP range at the subnet level.
Linux parallel: NACLs are like stateless `iptables` rules without connection tracking — pure packet-by-packet evaluation. Security groups are like stateful `iptables` with `conntrack`. If you've ever wondered why `iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT` exists — that's the same problem NACLs have: without state, you need explicit rules for return traffic.

Interview Bridge: "Explain the difference between security groups and NACLs" is one of the most common AWS interview questions. The answer in one sentence: security groups are stateful and per-instance; NACLs are stateless and per-subnet.
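The statefulness difference can be made concrete with a toy model in bash. This is purely illustrative, not how AWS implements either firewall:

```shell
# Security groups: remember allowed flows; return traffic rides the tracked entry.
declare -A conntrack
sg_inbound() { conntrack[$1]=1; echo ACCEPT; }   # an inbound rule allowed this flow
sg_return()  { [ "${conntrack[$1]:-0}" = 1 ] && echo ACCEPT || echo DROP; }

# NACLs: no memory. Each packet is judged against numbered rules; first match wins.
nacl_eval() {   # nacl_eval <dest-port> "<from>-<to>=<action>" ...
  local port=$1; shift
  local rule range action from to
  for rule in "$@"; do
    range=${rule%%=*}; action=${rule#*=}
    from=${range%-*};  to=${range#*-}
    (( port >= from && port <= to )) && { echo "$action"; return; }
  done
  echo DROP   # the implicit deny (rule *)
}

sg_inbound "198.51.100.42:49152"      # -> ACCEPT
sg_return  "198.51.100.42:49152"      # -> ACCEPT (no outbound rule needed)

nacl_eval 49152 "1024-65535=ACCEPT"   # response to client's ephemeral port -> ACCEPT
nacl_eval 49152 "443-443=ACCEPT"      # forgot the ephemeral rule -> DROP
```

The last line is the classic NACL trap from the Gotcha above: the inbound 443 rule exists, but the response to the client's ephemeral port is dropped on the way out.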
## Part 6: The Packet Journey — Internet to EC2, Every Hop
A user at IP 198.51.100.42 opens https://app.example.com. DNS resolves to the ALB's
public IP 203.0.113.10. Here's every hop, in order:
Internet (198.51.100.42)
│
▼
┌─────────────────────┐
│ Internet Gateway │ 1. Receives packet, maps public IP → private
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ Route Table │ 2. 10.0.1.0/24 → local (ALB is in public-1a)
│ (public subnet) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ NACL (inbound) │ 3. Rule 100: TCP 443 from 0.0.0.0/0 → ALLOW
│ (public subnet) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ Security Group │ 4. sg-alb: TCP 443 from 0.0.0.0/0 → ALLOW
│ (ALB ENI) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ ALB │ 5. Terminates TLS, inspects HTTP, picks target
│ (Application LB) │ based on path/host rules
└─────────┬───────────┘
│
│ ALB opens NEW connection to app server
│ Source IP is now ALB's private IP (10.0.1.x)
▼
┌─────────────────────┐
│ Route Table │ 6. 10.0.10.0/24 → local (app server subnet)
│ (private subnet) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ NACL (inbound) │ 7. Evaluated against private subnet rules
│ (private subnet) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ Security Group │ 8. sg-app: TCP 8080 from sg-alb → ALLOW
│ (app server ENI) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ ENI │ 9. Packet delivered to the virtual NIC
│ (Elastic Network │
│ Interface) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ EC2 Instance │ 10. Application receives the request
│ (app server) │
└─────────────────────┘
Ten hops. Every one is a potential failure point. When debugging connectivity, work from the outside in:
- Is the instance running? (`describe-instance-status`)
- Does it have the right IP/subnet? (`describe-instances`)
- Security group allowing the port? (`describe-security-groups`)
- NACL blocking anything? (`describe-network-acls`)
- Route table correct? (`describe-route-tables`)
- IGW attached? (`describe-internet-gateways`)
# The full debugging sequence, one command at a time
aws ec2 describe-instance-status --instance-ids i-abc123 \
--query 'InstanceStatuses[].{State:InstanceState.Name,System:SystemStatus.Status}'
aws ec2 describe-instances --instance-ids i-abc123 \
--query 'Reservations[].Instances[].{PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress,SubnetId:SubnetId,SGs:SecurityGroups[].GroupId}'
aws ec2 describe-security-groups --group-ids sg-app \
--query 'SecurityGroups[].IpPermissions[].{Proto:IpProtocol,From:FromPort,To:ToPort,Sources:IpRanges[].CidrIp,SGSources:UserIdGroupPairs[].GroupId}'
Under the Hood: At hop 5, the ALB does something important — it terminates the original TCP connection and opens a new one to the target. The app server sees the ALB's private IP as the source, not the client's IP. To see the real client IP, check the `X-Forwarded-For` header. This is why target security groups must allow traffic from the ALB's security group, not from `0.0.0.0/0`.
## Part 7: Elastic Network Interfaces and Elastic IPs
### ENIs — the virtual network card
Every EC2 instance has at least one ENI. An ENI carries:

- One primary private IP (and optionally secondary IPs)
- One Elastic IP (optional)
- One or more security groups
- A MAC address
- A source/destination check flag
The powerful trick: ENIs can be detached from one instance and attached to another, carrying their IP address, security groups, and MAC address with them. This enables failover patterns where a "virtual IP" follows the active instance.
# Create a standalone ENI
aws ec2 create-network-interface --subnet-id subnet-priv1a \
--groups sg-app \
--description "Failover ENI for app-primary"
# Attach to an instance
aws ec2 attach-network-interface \
--network-interface-id eni-abc123 \
--instance-id i-primary \
--device-index 1
Trivia: Fargate creates a dedicated ENI for every task it runs, each with its own private IP within the VPC. This is why Fargate tasks consume subnet IP addresses — and why undersized subnets run out of IPs under load.
### Elastic IPs — the static public address
An Elastic IP is a static public IPv4 address that persists until you release it. It stays with your account even when the associated instance is stopped.
# Allocate and associate
aws ec2 allocate-address --domain vpc
aws ec2 associate-address --instance-id i-abc123 --allocation-id eipalloc-abc
# Find unattached EIPs (you're being charged for these)
aws ec2 describe-addresses --query 'Addresses[?AssociationId==null].{EIP:PublicIp,AllocId:AllocationId}'
Gotcha: Unattached Elastic IPs cost $0.005/hour ($3.60/month). Since February 2024, AWS charges for ALL public IPv4 addresses — even attached ones — at the same rate. This change was designed to push adoption of IPv6 and reflects the genuine scarcity of IPv4 space. AWS holds over 100 million IPv4 addresses, one of the largest allocations in the world.
## Flashcard Check #2
| Question | Answer |
|---|---|
| Security groups are stateful. What does that mean? | Return traffic for allowed inbound connections is automatically permitted — no outbound rule needed |
| NACLs are stateless. What must you remember? | You need explicit rules for both directions, including ephemeral ports (1024–65535) for return traffic |
| What does the ALB do to the source IP? | Terminates the connection — the app server sees the ALB's private IP, not the client's. Client IP is in X-Forwarded-For |
| Can security groups have deny rules? | No. Security groups are allow-only. Use NACLs for deny rules. |
| What happens when a subnet runs out of IPs? | New tasks/instances can't launch. Fargate tasks and VPC-attached Lambda functions consume subnet IPs and can exhaust small subnets. |
## Part 8: VPC Peering and Transit Gateway — Connecting VPCs
### VPC peering — the direct line
VPC peering creates a point-to-point connection between two VPCs. Traffic stays on AWS's backbone — never touches the public internet.
# Create peering connection
aws ec2 create-vpc-peering-connection \
--vpc-id vpc-prod \
--peer-vpc-id vpc-shared-services
# Accept (can be in a different account)
aws ec2 accept-vpc-peering-connection \
--vpc-peering-connection-id pcx-abc123
# Add routes in BOTH VPCs (this is where people forget)
aws ec2 create-route --route-table-id rtb-prod \
--destination-cidr-block 10.1.0.0/16 \
--vpc-peering-connection-id pcx-abc123
aws ec2 create-route --route-table-id rtb-shared \
--destination-cidr-block 10.0.0.0/16 \
--vpc-peering-connection-id pcx-abc123
Gotcha: VPC peering is not transitive. If VPC-A peers with VPC-B and VPC-B peers with VPC-C, A cannot reach C through B. You need a separate peering connection between A and C — or use Transit Gateway.
Gotcha: You cannot peer two VPCs with overlapping CIDR ranges. If both VPCs used 10.0.0.0/16, peering is impossible. Plan your CIDR allocation across all VPCs and accounts before you build. The default VPC CIDR is `172.31.0.0/16` in every account — dozens of accounts with the default VPC means dozens of unpeerable networks.
### Transit Gateway — the hub-and-spoke router
Before Transit Gateway launched in 2018, connecting N VPCs required N*(N-1)/2 peering connections. Ten VPCs meant 45 connections. Transit Gateway acts as a central hub:
Before (mesh): After (hub):
VPC-A ←→ VPC-B VPC-A ──┐
VPC-A ←→ VPC-C VPC-B ──┤── Transit Gateway
VPC-B ←→ VPC-C VPC-C ──┘
(3 peering connections) (3 attachments, scales to 5,000)
aws ec2 create-transit-gateway \
--description "Central network hub" \
--options DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable
aws ec2 create-transit-gateway-vpc-attachment \
--transit-gateway-id tgw-abc123 \
--vpc-id vpc-abc123 \
--subnet-ids subnet-priv1a subnet-priv1b
Trivia: The scaling difference is dramatic. VPC peering is O(N^2) connections. Transit Gateway is O(N) attachments. It also connects VPN tunnels, Direct Connect links, and other Transit Gateways for inter-region peering — a single place to manage all your network connectivity.
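The scaling claim is just the handshake formula. A one-liner to check it (the function name is ours):

```shell
# Full-mesh peering needs N*(N-1)/2 connections; Transit Gateway needs N attachments.
mesh_links() { echo $(( $1 * ($1 - 1) / 2 )); }

mesh_links 3    # -> 3    (the three-VPC diagram above)
mesh_links 10   # -> 45
mesh_links 50   # -> 1225 (why nobody full-meshes 50 VPCs)
```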
## Part 9: VPC Endpoints — Staying Off the Internet
When your private EC2 instance calls S3 to pull a config file, that request would normally go: instance → NAT gateway → IGW → internet → S3. You're paying for NAT gateway data processing, and the traffic leaves your VPC (even though it comes right back to AWS).
VPC endpoints solve this. Two types:
### Gateway endpoints (S3 and DynamoDB only)
Free. Added to your route table. Traffic goes directly from your VPC to S3 or DynamoDB over AWS's internal network.
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids $RTB_PRIV_1A $RTB_PRIV_1B
This is the single biggest NAT gateway cost reduction you can make. S3 usually accounts for the majority of a workload's outbound traffic: logs, artifacts, backups, and container image layers (ECR stores layers in S3 under the hood).
### Interface endpoints (everything else)
Creates an ENI in your subnet. Costs ~$0.01/hour + $0.01/GB. Useful for services like Secrets Manager, STS, ECR, CloudWatch Logs.
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.secretsmanager \
--subnet-ids subnet-priv1a subnet-priv1b \
--security-group-ids sg-endpoint
Gotcha: The default VPC endpoint policy allows full access to all resources of that service type — including resources in other accounts. A compromised instance can exfiltrate data to an attacker-controlled S3 bucket, and the traffic never touches the internet (bypassing network-level DLP). Restrict endpoint policies to specific buckets and actions.
## Part 10: Flow Logs — Seeing the Invisible Network
VPC Flow Logs capture metadata about every network flow — source/destination IPs, ports, protocol, packet count, byte count, and whether it was ACCEPTED or REJECTED.
aws ec2 create-flow-logs \
--resource-ids vpc-abc123 \
--resource-type VPC \
--traffic-type ALL \
--log-destination-type cloud-watch-logs \
--log-group-name /vpc/flow-logs \
--deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
A flow log record looks like this:
2 123456789012 eni-abc123 10.0.1.5 10.0.2.10 49152 3306 6 20 4000 1620000000 1620000060 ACCEPT OK
2 123456789012 eni-abc123 203.0.113.5 10.0.1.5 12345 22 6 3 180 1620000000 1620000060 REJECT OK
The fields that matter most for debugging:
| Field | Meaning |
|---|---|
| `srcaddr` / `dstaddr` | Who's talking to whom |
| `srcport` / `dstport` | Which service (3306 = MySQL, 22 = SSH, 443 = HTTPS) |
| `action` | ACCEPT or REJECT — did the packet pass? |
| `protocol` | 6 = TCP, 17 = UDP, 1 = ICMP |
### Finding blocked traffic
# CloudWatch Logs Insights query: find rejected traffic
# fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, action
# | filter action = "REJECT"
# | sort @timestamp desc
# | limit 50
Under the Hood: Flow logs capture the decision, not the payload. They tell you whether a packet was accepted or rejected, but not what was in it. They're your first tool for answering "is my traffic even reaching the instance?" versus "the traffic reaches the instance but the application isn't responding." If flow logs show ACCEPT but the connection still hangs, the problem is above the network layer — look at the application.
## Part 11: Cross-AZ Traffic Costs — The Hidden Tax
Traffic between AZs costs $0.01/GB in each direction ($0.02 round-trip). Same-AZ traffic is free.
Same AZ: instance in 1a → instance in 1a = $0.00/GB
Cross AZ: instance in 1a → instance in 1b = $0.01/GB each way
This sounds trivial until you do the math. A microservices architecture with 10 services making cross-AZ calls at 10 TB/day of internal traffic: $200/day, $6,000/month. Just for services talking to each other.
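That math, scripted (the 10 TB/day figure is the hypothetical from above):

```shell
# Cross-AZ transfer is billed $0.01/GB out of the source AZ + $0.01/GB into the other.
cross_az_daily_cost() {   # cross_az_daily_cost <gb-per-day>
  awk -v gb="$1" 'BEGIN { printf "%.0f\n", gb * 0.02 }'
}

cross_az_daily_cost 10000   # 10 TB/day -> 200 ($/day), ~$6,000/month
```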
War Story: A team running a data-intensive ETL pipeline discovered a $4,200/month cross-AZ data transfer line item that nobody could explain. Investigation with VPC flow logs and Cost Explorer (filter by usage type
DataTransfer-Regional-Bytes) revealed that their application's chatty internal communication pattern — fetching data from a cache in AZ-b while running compute in AZ-a — was generating 14 TB/month of cross-AZ traffic. The fix was enabling topology-aware routing and colocating the cache replica with the compute tier. Bill dropped to $800/month. (Pattern documented in AWS Well-Architected cost optimization pillar.)
Mitigation strategies:
- AZ-aware routing: ALB cross-zone load balancing is enabled by default — consider disabling it if your services handle routing themselves
- Keep coupled services together: cache + compute in the same AZ
- Monitor: Cost Explorer, filtered by usage type DataTransfer-Regional-Bytes
## Flashcard Check #3
| Question | Answer |
|---|---|
| Is VPC peering transitive? | No. A-B and B-C does not mean A can reach C through B. |
| What problem does Transit Gateway solve? | O(N^2) peering connections → O(N) hub-and-spoke attachments. Scales to 5,000 VPCs. |
| Gateway endpoints vs. interface endpoints — which is free? | Gateway endpoints (S3, DynamoDB only). Interface endpoints cost ~$0.01/hour + $0.01/GB. |
| What do VPC flow logs capture? | Metadata: source/dest IP, ports, protocol, packet/byte count, ACCEPT/REJECT decision. Not payload. |
| Cross-AZ traffic cost? | $0.01/GB each direction ($0.02 round-trip). Same-AZ is free. |
## Exercises
### Exercise 1: Read a VPC (5 minutes)
Inspect an existing VPC and its components. No changes, just reading.
# List VPCs
aws ec2 describe-vpcs --query 'Vpcs[].{ID:VpcId,CIDR:CidrBlock,Name:Tags[?Key==`Name`].Value|[0]}'
# Pick one and explore its subnets
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-YOUR-ID" \
--query 'Subnets[].{Name:Tags[?Key==`Name`].Value|[0],CIDR:CidrBlock,AZ:AvailabilityZone,Public:MapPublicIpOnLaunch}'
# Check its route tables
aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-YOUR-ID" \
--query 'RouteTables[].{Name:Tags[?Key==`Name`].Value|[0],Routes:Routes[]}'
What to look for:

- Which subnets have a route to an IGW? Those are public.
- Which subnets route through a NAT gateway? Those are private with outbound internet.
- Which subnets have no default route beyond `local`? Those are fully isolated.
- Is there one NAT gateway per AZ, or a single one?

### Exercise 2: Trace a connection failure (15 minutes)
An instance in a private subnet can't reach the internet for yum update. Walk through
the debugging checklist:
- Does the subnet's route table have 0.0.0.0/0 → nat-xxx?
- Is the NAT gateway in a public subnet?
- Does the NAT gateway's subnet route table have 0.0.0.0/0 → igw-xxx?
- Does the NAT gateway have an Elastic IP?
- Does the security group allow outbound traffic?
- Does the NACL allow outbound AND inbound ephemeral ports?
Hint: Work from the instance outward. The most common cause: the subnet was never associated with a route table that has a NAT gateway route, so it's using the main route table, which only has the local route.

### Exercise 3: Design a CIDR plan (10 minutes)
Your company needs 4 VPCs: prod, staging, dev, and shared-services. All must be peerable. Design a non-overlapping CIDR allocation using /16 blocks from 10.0.0.0/8.
Solution: One clean allocation: prod 10.0.0.0/16, staging 10.1.0.0/16, dev 10.2.0.0/16, shared-services 10.3.0.0/16. Four non-overlapping /16 blocks from 10.0.0.0/8, all mutually peerable. Avoid `172.31.0.0/16` (default VPC), `172.17.0.0/16` (Docker default), and `192.168.0.0/16` (common on-prem, conflicts with VPN).

## Cheat Sheet
### VPC Components at a Glance
| Component | Scope | Cost | Key Fact |
|---|---|---|---|
| VPC | Region | Free | /16 to /28 CIDR. 5 reserved IPs per subnet. |
| Subnet | Single AZ | Free | Public vs. private = route table, not a checkbox |
| Internet Gateway | VPC | Free | One per VPC. Bidirectional internet. |
| NAT Gateway | Single AZ | $0.045/hr + $0.045/GB | Deploy one per AZ. Lives in public subnet. |
| Security Group | ENI | Free | Stateful. Allow only. Up to 5 per ENI. |
| NACL | Subnet | Free | Stateless. Allow + deny. First match wins. |
| VPC Endpoint (GW) | Region | Free | S3 and DynamoDB only |
| VPC Endpoint (IF) | AZ | $0.01/hr + $0.01/GB | All other services. Creates an ENI. |
| Elastic IP | Region | $0.005/hr | Charged when unattached OR attached (since 2024) |
| VPC Peering | Region/cross-region | Free (data transfer applies) | Not transitive. CIDRs can't overlap. |
| Transit Gateway | Region | $0.05/hr + $0.02/GB | Hub-and-spoke. Scales to 5,000 VPCs. |
| Flow Logs | VPC/subnet/ENI | CloudWatch/S3 storage costs | Metadata only. ACCEPT/REJECT decisions. |
### Quick Debug Commands
# Instance status
aws ec2 describe-instance-status --instance-ids i-XXX
# Security group rules
aws ec2 describe-security-groups --group-ids sg-XXX
# Subnet's route table
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-XXX"
# NACL for a subnet
aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=subnet-XXX"
# Flow logs — rejected traffic (CloudWatch Logs Insights)
# filter action = "REJECT" | sort @timestamp desc | limit 50
# Unattached Elastic IPs (you're paying for these)
aws ec2 describe-addresses --query 'Addresses[?AssociationId==null]'
# NAT gateway health
aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=vpc-XXX"
### The Packet Path (memorize this order)
IGW → Route Table → NACL (inbound) → Security Group → ENI → Instance
Instance → Security Group → NACL (outbound) → Route Table → IGW
## Takeaways
- A VPC is a region-scoped isolated network. Subnets are AZ-scoped. Public vs. private is determined by the route table, not by any setting on the subnet itself.
- Security groups are stateful; NACLs are stateless. Know which is which, because when you forget to add ephemeral port rules to a NACL, the symptoms are maddening — traffic appears to arrive but connections hang.
- NAT gateway is single-AZ and expensive. Deploy one per AZ for production. Use VPC gateway endpoints for S3/DynamoDB to cut your NAT bill significantly.
- VPC peering is not transitive. For more than a few VPCs, Transit Gateway is the answer. Plan non-overlapping CIDRs from day one — you can't fix overlapping ranges after the fact.
- Flow logs are your X-ray vision. They show every accept/reject decision at the network level. When debugging connectivity, check flow logs before assuming the application is broken.
- Cross-AZ traffic has a real cost. $0.01/GB each direction adds up fast in chatty architectures. Monitor it, architect around it.
## Related Lessons
- The Subnet Calculator in Your Head — CIDR math from scratch, the mental model this lesson assumes you'll eventually internalize
- What Happens When You Click a Link — the full end-to-end trace from browser to server, covering DNS, TCP, TLS, and HTTP in detail
- iptables: Following a Packet Through the Chains — the Linux-native equivalent of security groups and NACLs, with connection tracking explained
- Why DNS Is Always the Problem — deep dive into DNS resolution failures, including VPC DNS edge cases
- The Cloud Bill Surprise — cost optimization across all AWS services, including the network cost patterns covered here