Anti-Primer: AWS Networking
Everything that can go wrong, will — and in this story, it does.
The Setup
A cloud architect is building a multi-VPC networking topology for a compliance-sensitive workload. The design was sketched on a whiteboard but never formally documented. Three engineers are implementing different pieces simultaneously.
The Timeline
Hour 0: Overlapping CIDR Blocks
An engineer creates the application VPC and the shared services VPC with overlapping /16 ranges. The deadline was looming, and this seemed like the fastest path forward. The result: the two VPCs cannot be peered, because AWS rejects peering connections between VPCs whose CIDR blocks overlap, and the architecture has to be redesigned.
Footgun #1: Overlapping CIDR Blocks. Creating VPCs with overlapping /16 ranges makes peering fail: the application VPC cannot be peered with the shared services VPC, and an architecture redesign is required.
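A pre-flight check on the planned ranges is cheap to run before any VPC exists. The following is a minimal sketch using only Python's standard `ipaddress` module; the function name `find_overlaps`, the VPC names, and the ranges are illustrative assumptions, not part of the original design.

```python
import ipaddress
from itertools import combinations

def find_overlaps(vpc_cidrs):
    """Return pairs of VPC names whose planned CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    overlaps = []
    # Compare every pair once, in name order, so the output is stable.
    for (name_a, net_a), (name_b, net_b) in combinations(sorted(networks.items()), 2):
        if net_a.overlaps(net_b):
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical plan: names and ranges are made up for illustration.
planned = {
    "application": "10.0.0.0/16",
    "shared-services": "10.0.0.0/16",
    "data": "10.1.0.0/16",
}
print(find_overlaps(planned))  # [('application', 'shared-services')]
```

An empty list means the plan is at least free of overlaps; it says nothing about whether the ranges are sized sensibly.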
Nobody notices yet. The engineer moves on to the next task.
Hour 1: Missing Route Table Entries
An engineer peers two VPCs but adds the peering route in only one direction, leaving the other VPC's route tables untouched. Under time pressure, the team chose speed over caution. The result: traffic flows one way; health checks pass but responses never arrive.
Footgun #2: Missing Route Table Entries. Peering two VPCs without adding routes to the route tables on both sides lets traffic flow only one way: health checks pass but responses never arrive.
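A symmetry check over exported route tables can catch the missing direction. This is a sketch under stated assumptions: the dictionary layout, the field names (`requester`, `accepter`, `destination`, `target`), and the ID `pcx-example` are all hypothetical stand-ins for whatever inventory format the team actually has.

```python
import ipaddress

def vpcs_missing_peering_route(peering, route_tables):
    """Return names of VPCs with no route toward their peer over the peering connection."""
    sides = [
        (peering["requester"], peering["accepter"]),
        (peering["accepter"], peering["requester"]),
    ]
    missing = []
    for local, remote in sides:
        remote_net = ipaddress.ip_network(remote["cidr"])
        # A side is covered if any of its routes targets the peering
        # connection and reaches at least part of the peer's range.
        has_route = any(
            route["target"] == peering["id"]
            and ipaddress.ip_network(route["destination"]).overlaps(remote_net)
            for route in route_tables.get(local["name"], [])
        )
        if not has_route:
            missing.append(local["name"])
    return missing

# Hypothetical inventory: only the application side has the peering route.
peering = {
    "id": "pcx-example",
    "requester": {"name": "application", "cidr": "10.0.0.0/16"},
    "accepter": {"name": "shared-services", "cidr": "10.1.0.0/16"},
}
route_tables = {
    "application": [
        {"destination": "10.0.0.0/16", "target": "local"},
        {"destination": "10.1.0.0/16", "target": "pcx-example"},
    ],
    "shared-services": [
        {"destination": "10.1.0.0/16", "target": "local"},
    ],
}
print(vpcs_missing_peering_route(peering, route_tables))  # ['shared-services']
```

The check is deliberately loose: it accepts any route that reaches part of the peer's range, so it flags a missing direction but not a route that is too narrow.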
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: NAT Gateway Single AZ
The team deploys one NAT Gateway in a single AZ to save cost. Nobody pushed back because the shortcut looked harmless in the moment. The result: a failure of that one AZ takes down all outbound internet traffic for the private subnets in every AZ.
Footgun #3: NAT Gateway Single AZ. Deploying one NAT Gateway in a single AZ for cost savings means a failure of that AZ takes down all outbound internet traffic for private subnets.
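The cross-AZ dependency is easy to make visible once subnets, their default-route NAT Gateway, and AZ placement are listed side by side. A minimal sketch follows, assuming three hypothetical mappings (`subnet_az`, `subnet_nat`, `nat_az`) that would normally be built from the team's own inventory; all IDs are invented.

```python
def cross_az_nat_dependencies(subnet_az, subnet_nat, nat_az):
    """Return private subnets whose default route uses a NAT Gateway in another AZ."""
    return sorted(
        subnet
        for subnet, nat in subnet_nat.items()
        if nat_az[nat] != subnet_az[subnet]
    )

# Hypothetical layout: three private subnets, one shared NAT Gateway.
subnet_az = {
    "subnet-a": "us-east-1a",
    "subnet-b": "us-east-1b",
    "subnet-c": "us-east-1c",
}
nat_az = {"nat-1": "us-east-1a"}
subnet_nat = {"subnet-a": "nat-1", "subnet-b": "nat-1", "subnet-c": "nat-1"}

print(cross_az_nat_dependencies(subnet_az, subnet_nat, nat_az))
# ['subnet-b', 'subnet-c']
```

Every subnet in the output loses outbound internet access if the NAT Gateway's AZ fails, even though its own AZ is healthy.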
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Security Group Self-Reference Loop
An engineer creates circular security group references across peered VPCs. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: neither security group can be deleted or modified, and both are stuck in a dependency deadlock.
Footgun #4: Security Group Self-Reference Loop. Creating circular security group references across peered VPCs leaves neither security group deletable or modifiable; both are stuck in a dependency deadlock.
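Circular references can be found mechanically by treating security groups as nodes and rule references as directed edges. The sketch below is a plain depth-first search over a hypothetical `references` mapping (group ID to the set of group IDs its rules point at); the IDs are invented, and skipping a group's reference to itself is an assumption made here because that pattern is commonly used on purpose.

```python
def find_reference_cycle(references):
    """Return one cycle of security group IDs as a list, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(group):
        state[group] = IN_PROGRESS
        path.append(group)
        for target in sorted(references.get(group, ())):
            if target == group:
                continue  # assumption: a group referencing itself is intentional
            if state.get(target, UNSEEN) == IN_PROGRESS:
                # Found a back edge: slice the current path into the cycle.
                return path[path.index(target):] + [target]
            if state.get(target, UNSEEN) == UNSEEN:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        state[group] = DONE
        return None

    for group in sorted(references):
        if state.get(group, UNSEEN) == UNSEEN:
            cycle = visit(group)
            if cycle:
                return cycle
    return None

# Hypothetical rules: sg-app and sg-shared reference each other.
references = {
    "sg-app": {"sg-shared"},
    "sg-shared": {"sg-app"},
    "sg-db": {"sg-app"},
}
print(find_reference_cycle(references))  # ['sg-app', 'sg-shared', 'sg-app']
```

The returned list names the groups whose mutual rules would have to be untangled first; it reports only the first cycle it finds.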
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Overlapping CIDR Blocks | Cannot peer the application VPC with the shared services VPC; architecture redesign required | Primer: Plan CIDR allocations centrally before creating any VPC |
| 2 | Missing Route Table Entries | Traffic flows one way; health checks pass but responses never arrive | Primer: Peering requires route table updates in both VPCs |
| 3 | NAT Gateway Single AZ | AZ failure takes down all outbound internet traffic for private subnets | Primer: NAT Gateway per AZ for high availability |
| 4 | Security Group Self-Reference Loop | Cannot delete or modify either security group; stuck in a dependency deadlock | Primer: Plan security group dependencies; avoid cross-VPC SG references |
Damage Report
- Downtime: 3-6 hours of degraded or unavailable cloud services
- Data loss: Possible if storage or database resources were affected
- Customer impact: API errors, failed transactions, or service unavailability for end users
- Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
- Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required
What the Primer Teaches
- Footgun #1: If the engineer had read the primer's section on overlapping CIDR blocks, they would have learned: plan CIDR allocations centrally before creating any VPC.
- Footgun #2: If the engineer had read the primer's section on missing route table entries, they would have learned: peering requires route table updates in both VPCs.
- Footgun #3: If the engineer had read the primer's section on a single-AZ NAT Gateway, they would have learned: deploy a NAT Gateway per AZ for high availability.
- Footgun #4: If the engineer had read the primer's section on the security group self-reference loop, they would have learned: plan security group dependencies and avoid cross-VPC SG references.
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice