AWS Networking Footguns¶
Mistakes that cause outages, security exposure, or surprise bills in AWS networking.
1. Default VPC with everything public¶
Every AWS account comes with a default VPC in each region. All subnets in the default VPC are public — they have a route to the internet gateway and instances get public IPs by default. Many tutorials and quick-start guides use the default VPC. If you launch production workloads there, your databases, internal APIs, and cache servers are all on public subnets with public IPs.
Fix: Never use the default VPC for production. Create a custom VPC with explicit public and private subnets. Consider deleting the default VPC in each region to prevent accidental use: aws ec2 delete-vpc --vpc-id <default-vpc-id>. If you must keep it, remove the IGW route from subnet route tables.
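To find where default VPCs still exist before deleting them, a quick sweep across regions works — a minimal sketch, assuming your credentials can describe VPCs in every region:

```shell
# List default VPCs in every region -- candidates for deletion
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  vpc=$(aws ec2 describe-vpcs --region "$region" \
    --query 'Vpcs[?IsDefault==`true`].VpcId' --output text)
  if [ -n "$vpc" ]; then
    echo "$region: $vpc"
  fi
done
```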
Default trap: The default VPC also has auto-assign public IPv4 address enabled on all subnets. Any EC2 instance, RDS instance, or ECS task launched in the default VPC gets a public IP by default. Combined with a permissive security group, this is the most common path to accidental public exposure of internal services on AWS.
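If you keep the default VPC around, the auto-assign behavior can be turned off per subnet; the subnet ID below is a placeholder:

```shell
# Disable auto-assign public IPv4 on a subnet
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-abc123 \
  --no-map-public-ip-on-launch
```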
2. Security group allowing 0.0.0.0/0 on all ports¶
You open all ports to the world "temporarily" for debugging. You forget to remove it. Now every service on that instance is exposed to the internet — databases, admin panels, debug endpoints, metrics ports. Automated scanners find these within minutes.
# Find security groups with wide-open ingress. Note: "all traffic" rules use
# IpProtocol "-1" and carry no FromPort, so checking FromPort alone misses them.
aws ec2 describe-security-groups \
  --query "SecurityGroups[?IpPermissions[?IpRanges[?CidrIp=='0.0.0.0/0'] && (IpProtocol=='-1' || FromPort==\`0\`)]].[GroupId,GroupName]"
Fix: Never allow 0.0.0.0/0 on all ports. Allow specific ports only. Use SSM Session Manager instead of opening SSH. Set up AWS Config rule restricted-common-ports to auto-detect. Use security group references (SG-to-SG rules) for internal traffic instead of CIDR ranges.
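As a sketch of the SG-to-SG pattern (group IDs and the port are placeholders), an app-tier group granted access to a database group on its service port only:

```shell
# Allow the app tier SG to reach the database SG on PostgreSQL only
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db0000000000000d \
  --protocol tcp --port 5432 \
  --source-group sg-0app000000000000a
```

Because the rule references a security group rather than a CIDR, it keeps working as instances are added or replaced and their IPs change.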
3. NACLs are stateless (must allow return traffic)¶
You add a NACL rule allowing inbound HTTPS (port 443). Traffic still fails. You forgot that NACLs are stateless — unlike security groups, they do not automatically allow return traffic. The response from your server uses an ephemeral port (1024-65535) on the outbound direction, and if the NACL does not have an outbound rule allowing that port range, the response is dropped.
Fix: For every NACL inbound allow rule, add a corresponding outbound rule for ephemeral ports (1024-65535). For every outbound allow rule, add a corresponding inbound rule for ephemeral ports. In practice, many teams leave NACLs at their default (allow all) and rely entirely on security groups for filtering.
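The inbound/outbound pairing looks like this with the CLI — a sketch with a placeholder NACL ID; rule numbers and the 0.0.0.0/0 source are illustrative:

```shell
# Inbound HTTPS allow rule...
aws ec2 create-network-acl-entry --network-acl-id acl-abc123 \
  --ingress --rule-number 100 --protocol tcp \
  --port-range From=443,To=443 --cidr-block 0.0.0.0/0 --rule-action allow

# ...and the matching outbound rule for ephemeral return ports
aws ec2 create-network-acl-entry --network-acl-id acl-abc123 \
  --egress --rule-number 100 --protocol tcp \
  --port-range From=1024,To=65535 --cidr-block 0.0.0.0/0 --rule-action allow
```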
4. NAT Gateway is single-AZ (not HA by default)¶
You deploy one NAT Gateway in us-east-1a. Your private subnets in us-east-1b and 1c all route through it. When us-east-1a has an outage, all private subnets in all AZs lose internet access — even though their instances are running fine.
Fix: Deploy one NAT Gateway per AZ. Route each AZ's private subnet through its local NAT Gateway. Yes, this triples the hourly cost (roughly $100/month for three NAT Gateways at $0.045/hour each, before data processing charges). But a single-AZ NAT Gateway defeats the purpose of multi-AZ deployment. For non-production environments, one NAT Gateway is acceptable.
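A per-AZ provisioning sketch — the public subnet IDs and route table IDs are placeholders:

```shell
# One NAT Gateway per AZ, each in that AZ's public subnet
for subnet in subnet-pub-1a subnet-pub-1b subnet-pub-1c; do
  alloc=$(aws ec2 allocate-address --domain vpc \
    --query AllocationId --output text)
  aws ec2 create-nat-gateway --subnet-id "$subnet" --allocation-id "$alloc"
done

# Then point each AZ's private route table at its local gateway, e.g.:
# aws ec2 create-route --route-table-id rtb-priv-1a \
#   --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <that AZ's nat-...>
```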
5. VPC CIDR overlap blocking peering¶
You create VPC-A with CIDR 10.0.0.0/16 and VPC-B with CIDR 10.0.0.0/16. You try to peer them. AWS rejects the peering because the CIDR ranges overlap. This also blocks Transit Gateway connectivity. If you have 20 VPCs that all used 10.0.0.0/16 because someone copied a tutorial, you cannot peer any of them.
Fix: Plan your CIDR allocation before creating VPCs. Use a central IPAM (AWS VPC IPAM or a spreadsheet at minimum). Standard pattern: /16 per VPC, non-overlapping. Example: 10.0.0.0/16, 10.1.0.0/16, 10.2.0.0/16. Avoid 172.17.0.0/16 (Docker default). Avoid 192.168.0.0/16 (commonly used on-prem, will conflict with VPN).
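Before peering, an inventory of every VPC CIDR across regions makes overlaps obvious — a minimal sketch assuming describe permissions everywhere:

```shell
# Inventory VPC CIDRs across all regions to spot overlaps
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  aws ec2 describe-vpcs --region "$region" \
    --query 'Vpcs[].[VpcId,CidrBlock]' --output text |
    while read -r vpc cidr; do echo "$region $vpc $cidr"; done
done
```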
6. Forgetting route table association¶
You create a new subnet and add instances to it. Nothing can reach the internet. You check security groups, NACLs, IGW — all fine. The problem: the subnet is not associated with any custom route table, so it uses the main route table, which has no route to the IGW. Or you intended it to use a private route table but never associated it.
# Check which route table a subnet uses
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-abc123" \
  --query 'RouteTables[].RouteTableId'
# Empty result = using main route table (implicit association)
Fix: Always explicitly associate subnets with route tables. Never rely on the main route table for production traffic. Terraform and CloudFormation make this explicit, which is one reason to use IaC for networking.
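The explicit association is a single call — both IDs below are placeholders:

```shell
# Explicitly associate the subnet with the intended route table
aws ec2 associate-route-table \
  --route-table-id rtb-abc123 \
  --subnet-id subnet-abc123
```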
7. Security group referencing itself without understanding¶
You see a security group rule that allows all traffic from itself (sg-abc123 allows sg-abc123). You think "this allows instances in the group to talk to each other." That is correct, but it also means any new instance added to this security group gets full network access to every other instance in the group — on every port. If one instance is compromised, the attacker has unrestricted lateral movement to all others.
Fix: Use specific port rules even for self-referencing groups. Instead of "allow all from sg-abc123," use "allow TCP 8080 from sg-abc123." This limits the blast radius if one instance is compromised. Separate security groups by tier (web, app, data) and only allow the specific ports each tier needs.
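Tightening an existing self-referencing group might look like this — a sketch, assuming the existing rule is the all-traffic self-reference and port 8080 is the only port the tier actually needs:

```shell
# Replace "all traffic from self" with a single application port
aws ec2 revoke-security-group-ingress --group-id sg-abc123 \
  --protocol all --source-group sg-abc123
aws ec2 authorize-security-group-ingress --group-id sg-abc123 \
  --protocol tcp --port 8080 --source-group sg-abc123
```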
8. Elastic IP charges when not attached¶
You allocate an Elastic IP for a project. The project is decommissioned. The EC2 instance is terminated. The EIP is not released. AWS charges $0.005/hour (about $3.65/month) per unattached EIP — and since February 2024, every public IPv4 address, attached or not, incurs the same charge. This is small per address, but it adds up across accounts and regions — and it is the kind of cost that nobody notices until someone reviews the bill months later.
# Find unattached EIPs across all regions
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  unattached=$(aws ec2 describe-addresses --region "$region" \
    --query 'Addresses[?AssociationId==`null`].PublicIp' --output text)
  if [ -n "$unattached" ]; then
    echo "$region: $unattached"
  fi
done
Fix: Release EIPs when they are no longer needed. Set up a Lambda function to scan for unattached EIPs weekly. Use AWS Config rule eip-attached for continuous monitoring.
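Releasing is a one-liner; the allocation ID below is a placeholder:

```shell
# Release an EIP that is no longer needed
aws ec2 release-address --allocation-id eipalloc-abc123
```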
9. Subnet sizing too small¶
You create subnets with /28 (11 usable IPs) or /27 (27 usable IPs) thinking "we only need a few instances." Then the team grows, more services are deployed, and you hit the IP limit. Worse: Lambda functions, EKS pods, and interface VPC endpoints all consume IPs from your subnets. A /27 private subnet can be exhausted by a single Lambda function under load.
Fix: Use /24 (251 usable IPs) as the minimum for private subnets. For EKS worker node subnets, use /20 or /19 — EKS pods consume IPs from the subnet via the VPC CNI plugin. You cannot resize a subnet after creation — you must create a new one. Plan for 3-5x your current needs.
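Subnet capacity is visible directly in the API, so exhaustion can be monitored before it bites:

```shell
# Check remaining IP capacity per subnet
aws ec2 describe-subnets \
  --query 'Subnets[].[SubnetId,CidrBlock,AvailableIpAddressCount]' \
  --output table
```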
10. VPC endpoint policy left wide open¶
You create a VPC endpoint for S3 to avoid NAT Gateway costs. The default endpoint policy allows full access to all S3 buckets in all accounts. An application bug or compromised instance in your VPC can now exfiltrate data to any S3 bucket in any account — including an attacker-controlled bucket — and the traffic never touches the internet, bypassing your network-level data loss prevention.
Fix: Restrict the VPC endpoint policy to specific buckets:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowSpecificBuckets",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-app-bucket",
      "arn:aws:s3:::my-app-bucket/*"
    ]
  }]
}
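Applying a policy like the one above to an existing endpoint — the endpoint ID and file path are placeholders:

```shell
# Apply the restricted policy to the endpoint
aws ec2 modify-vpc-endpoint \
  --vpc-endpoint-id vpce-abc123 \
  --policy-document file://s3-endpoint-policy.json
```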
11. Cross-AZ traffic costs ignored in architecture¶
You deploy a microservices architecture across 3 AZs. Service A in us-east-1a calls Service B in us-east-1b, which calls Service C in us-east-1c. Each call transfers data cross-AZ at $0.01/GB each direction. At scale, a system processing 10 TB/day of internal traffic accumulates $200/day in data transfer fees — $6,000/month just for internal communication.
Fix: Profile your cross-AZ traffic with VPC flow logs and Cost Explorer. Use AZ-aware routing where possible (Kubernetes topology-aware routing, ALB AZ affinity). Keep tightly coupled services in the same AZ. Disable ALB cross-zone load balancing if your services handle it at the application layer. Monitor the DataTransfer-Regional-Bytes usage type in Cost Explorer.
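One way to surface the charge from the CLI — a sketch, with example dates; the `Regional-Bytes` usage types carry a region prefix (e.g. `USE1-`), so matching on the suffix is the safer filter:

```shell
# Break out cross-AZ transfer cost by usage type
aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-06-01 \
  --granularity MONTHLY --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --query 'ResultsByTime[].Groups[?contains(Keys[0], `Regional-Bytes`)]'
```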
12. Security group changes apply immediately with no rollback¶
You update a security group rule in production — say, changing the allowed CIDR for port 443. The change takes effect immediately on all instances using that security group. There is no "apply" button, no deployment window, no automatic rollback. If you accidentally remove a rule, connections using that rule start failing within seconds.
Fix: Treat security group changes as infrastructure changes — use IaC (Terraform, CloudFormation) with code review and staged rollouts. If you must make manual changes, add the new rule first, verify it works, then remove the old rule. Never edit security group rules directly in production without a rollback plan.
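The add-then-remove sequence for a manual change — group ID is a placeholder and 203.0.113.0/24 is an example CIDR:

```shell
# Add the replacement rule first
aws ec2 authorize-security-group-ingress --group-id sg-abc123 \
  --protocol tcp --port 443 --cidr 203.0.113.0/24

# Verify traffic still flows, then remove the old rule
aws ec2 revoke-security-group-ingress --group-id sg-abc123 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
```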
Remember: Security groups are stateful (return traffic is automatically allowed), but NACLs are stateless (you must explicitly allow return traffic). When debugging connectivity, check both layers. A common pattern: the security group allows the traffic, but a NACL denies the ephemeral port range for the return path.
VPC Flow Logs with action REJECT will show you which traffic is being dropped — and a rejected response to an accepted request points at a stateless NACL rather than a security group.