AWS Networking - Street-Level Ops

Real-world networking debugging and operational workflows for AWS production environments.

Debugging "Can't Reach Instance" — The Checklist

When an instance is unreachable, work through the layers in order. Do not skip ahead.

# Step 1: Is the instance running?
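aws ec2 describe-instance-status --instance-ids i-abc123 --include-all-instances \
  --query 'InstanceStatuses[].{State:InstanceState.Name,System:SystemStatus.Status,Instance:InstanceStatus.Status}'
# Without --include-all-instances, a stopped instance returns an empty list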
# If SystemStatus=impaired → hardware issue, AWS side
# If InstanceStatus=impaired → OS/config issue, your side

# Step 2: Does the instance have a reachable IP?
aws ec2 describe-instances --instance-ids i-abc123 \
  --query 'Reservations[].Instances[].{PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress,SubnetId:SubnetId,VpcId:VpcId,SGs:SecurityGroups[].GroupId}'

# Step 3: Security group — is the port open?
SG_ID=sg-abc123
aws ec2 describe-security-groups --group-ids $SG_ID \
  --query 'SecurityGroups[].IpPermissions[].{Proto:IpProtocol,FromPort:FromPort,ToPort:ToPort,Sources:IpRanges[].CidrIp,SGSources:UserIdGroupPairs[].GroupId}'

# Step 4: NACL — is the subnet blocking traffic?
SUBNET_ID=subnet-abc123
NACL_ID=$(aws ec2 describe-network-acls \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'NetworkAcls[0].NetworkAclId' --output text)
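aws ec2 describe-network-acls --network-acl-ids $NACL_ID \
  --query 'NetworkAcls[].Entries[?RuleAction==`deny`]'
# Rules are evaluated in ascending rule-number order and the first match wins.
# Every NACL ends with a catch-all deny (rule 32767), so that entry always appears here.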

# Step 5: Route table — does traffic know where to go?
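RTB_ID=$(aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'RouteTables[0].RouteTableId' --output text)
# "None" means no explicit association: the subnet uses the VPC's main route table.
# Look that up with: --filters "Name=vpc-id,Values=vpc-abc123" "Name=association.main,Values=true"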
aws ec2 describe-route-tables --route-table-ids $RTB_ID \
  --query 'RouteTables[].Routes[]'
# Public subnet should have: 0.0.0.0/0 → igw-xxx
# Private subnet should have: 0.0.0.0/0 → nat-xxx (if internet needed)

# Step 6: Internet Gateway — is it attached?
aws ec2 describe-internet-gateways \
  --filters "Name=attachment.vpc-id,Values=vpc-abc123" \
  --query 'InternetGateways[].{IGW:InternetGatewayId,State:Attachments[0].State}'

# Step 7: NAT Gateway — is it healthy? (for private subnets)
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-abc123" \
  --query 'NatGateways[].{Id:NatGatewayId,State:State,SubnetId:SubnetId,PublicIp:NatGatewayAddresses[0].PublicIp}'
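
The two status-check rules from Step 1 can be sketched as a small shell function. `interpret_status` is a hypothetical helper, not part of the AWS CLI. It encodes only the two rules stated above and treats every other combination as inconclusive.

```shell
# Hypothetical helper: classify the two status-check values from Step 1.
# Arguments: <system-status> <instance-status>, i.e. the values of
# SystemStatus.Status and InstanceStatus.Status.
interpret_status() {
  system_status=$1
  instance_status=$2
  if [ "$system_status" = "impaired" ]; then
    echo "aws-side"        # hardware or host problem
  elif [ "$instance_status" = "impaired" ]; then
    echo "os-side"         # OS or configuration problem
  elif [ "$system_status" = "ok" ] && [ "$instance_status" = "ok" ]; then
    echo "healthy"         # both checks pass, continue with Step 2
  else
    echo "inconclusive"    # e.g. initializing or insufficient-data
  fi
}
```

Feed it the `System` and `Instance` values from the Step 1 output, for example `interpret_status impaired ok`.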

VPC Flow Log Analysis

Flow logs tell you what traffic was accepted or rejected at the network layer.

# Query flow logs with CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /vpc/flow-logs \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string '
    fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action, bytes
    | filter action = "REJECT"
    | sort @timestamp desc
    | limit 50
  '

# Get query results (wait a few seconds after starting)
aws logs get-query-results --query-id <query-id-from-above>
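
Rather than guessing how long to wait, you can poll the query status until it finishes. The sketch below assumes the `status` field returned by `get-query-results`. The function name `wait_for_query`, the attempt limit, and the `POLL_INTERVAL` variable are illustrative choices, not AWS features.

```shell
# Poll a Logs Insights query until it completes or fails.
# Usage: wait_for_query <query-id> [max-attempts]
# POLL_INTERVAL (seconds between polls) is a hypothetical knob, default 2.
wait_for_query() {
  query_id=$1
  max_attempts=${2:-30}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text)
    case "$status" in
      Complete)
        return 0 ;;
      Failed|Cancelled|Timeout)
        echo "query $query_id ended with status $status" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "query $query_id still not complete after $max_attempts polls" >&2
  return 1
}
```

Once it returns successfully, run `aws logs get-query-results --query-id <query-id>` to fetch the rows.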

# Find top rejected source IPs (potential scanning or misconfiguration)
# Query:
# fields srcAddr
# | filter action = "REJECT"
# | stats count(*) as rejections by srcAddr
# | sort rejections desc
# | limit 20

# Find traffic between two specific IPs
# fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, action
# | filter srcAddr = "10.0.1.50" and dstAddr = "10.0.2.100"
# | sort @timestamp desc

# Filter by rejected traffic to a specific port (e.g., database port)
# fields @timestamp, srcAddr, action
# | filter dstPort = 5432 and action = "REJECT"
# | sort @timestamp desc
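
If the flow logs are delivered to S3 instead of CloudWatch, the same "top rejected sources" question can be answered on the raw records. This is a minimal sketch that assumes the default (version 2) record format, where field 4 is the source address and field 13 is the action. Custom log formats need different field numbers. `top_rejected_sources` is a hypothetical name.

```shell
# Count REJECT records per source address from raw flow log lines on stdin.
# Assumes the default version 2 format:
#   version account-id interface-id srcaddr dstaddr srcport dstport
#   protocol packets bytes start end action log-status
# Optional argument: how many sources to print (default 20).
top_rejected_sources() {
  awk '$13 == "REJECT" { count[$4]++ }
       END { for (ip in count) print count[ip], ip }' |
    sort -k1,1nr | head -n "${1:-20}"
}
```

Typical use would be something like `zcat flow-log-file.log.gz | top_rejected_sources 10`, where the file name is a placeholder.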

Setting Up Private Subnets with NAT

Production pattern: private subnets for workloads, NAT gateway per AZ for outbound.

# Create Elastic IPs for NAT gateways (one per AZ)
EIP_1A=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
EIP_1B=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)

# Create NAT gateways in public subnets
NAT_1A=$(aws ec2 create-nat-gateway --subnet-id subnet-pub1a \
  --allocation-id $EIP_1A --query 'NatGateway.NatGatewayId' --output text)
NAT_1B=$(aws ec2 create-nat-gateway --subnet-id subnet-pub1b \
  --allocation-id $EIP_1B --query 'NatGateway.NatGatewayId' --output text)

# Wait for NAT gateways to become available
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_1A $NAT_1B

# Create route tables for private subnets (one per AZ)
RTB_PRIV_1A=$(aws ec2 create-route-table --vpc-id vpc-abc123 \
  --query 'RouteTable.RouteTableId' --output text)
RTB_PRIV_1B=$(aws ec2 create-route-table --vpc-id vpc-abc123 \
  --query 'RouteTable.RouteTableId' --output text)

# Add default routes through respective NAT gateways
aws ec2 create-route --route-table-id $RTB_PRIV_1A \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $NAT_1A
aws ec2 create-route --route-table-id $RTB_PRIV_1B \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $NAT_1B

# Associate private subnets with their AZ-specific route table
aws ec2 associate-route-table --route-table-id $RTB_PRIV_1A --subnet-id subnet-priv1a
aws ec2 associate-route-table --route-table-id $RTB_PRIV_1B --subnet-id subnet-priv1b

VPC Peering Route Troubleshooting

Default trap: VPC peering does NOT support transitive routing. If VPC-A peers with VPC-B and VPC-B peers with VPC-C, VPC-A cannot reach VPC-C through VPC-B. Use Transit Gateway for hub-and-spoke topologies.

Peering is established but traffic is not flowing. The route tables are almost always the problem.

# Check peering connection status
aws ec2 describe-vpc-peering-connections \
  --vpc-peering-connection-ids pcx-abc123 \
  --query 'VpcPeeringConnections[].{Status:Status.Code,Requester:RequesterVpcInfo.CidrBlock,Accepter:AccepterVpcInfo.CidrBlock}'

# Check route tables in BOTH VPCs
# Requester VPC: needs route to accepter CIDR via peering connection
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-requester" \
  --query 'RouteTables[].{RTB:RouteTableId,Routes:Routes[?VpcPeeringConnectionId!=null]}'

# Accepter VPC: needs route to requester CIDR via peering connection
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-accepter" \
  --query 'RouteTables[].{RTB:RouteTableId,Routes:Routes[?VpcPeeringConnectionId!=null]}'

# Common problems:
# 1. Routes added to main route table but subnets use custom route tables
# 2. Routes exist but point to wrong peering connection
# 3. Security groups in accepter VPC don't allow traffic from requester CIDR
# 4. NACLs blocking the peered traffic

# Verify security groups allow cross-VPC traffic
aws ec2 describe-security-groups --group-ids sg-accepter-app \
  --query "SecurityGroups[].IpPermissions[?IpRanges[?starts_with(CidrIp, '10.0.')]]"
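
When checking whether a peering route actually covers the address you are trying to reach, it helps to test an IP against a route's destination CIDR. The sketch below does that with plain shell arithmetic for IPv4. `ip_to_int` and `ip_in_cidr` are hypothetical helper names, and the addresses in the usage note are examples.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# Runs in a subshell, so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed (exit 0) if <ip> falls inside <cidr>, e.g. 10.0.0.0/16.
ip_in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)
```

For example, `ip_in_cidr 10.0.2.100 10.0.0.0/16 && echo covered` confirms that a route to `10.0.0.0/16` covers that target.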

DNS Resolution Failures in VPC

When DNS queries fail inside a VPC, the issue is usually configuration, not infrastructure.

# Check if DNS support is enabled on the VPC
aws ec2 describe-vpc-attribute --vpc-id vpc-abc123 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-abc123 --attribute enableDnsHostnames

# If either attribute is false, enable it:
aws ec2 modify-vpc-attribute --vpc-id vpc-abc123 --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id vpc-abc123 --enable-dns-hostnames '{"Value":true}'

# Check DHCP options set (custom DNS servers can cause issues)
aws ec2 describe-dhcp-options \
  --dhcp-options-ids $(aws ec2 describe-vpcs --vpc-ids vpc-abc123 \
    --query 'Vpcs[0].DhcpOptionsId' --output text) \
  --query 'DhcpOptions[].DhcpConfigurations[]'

# Test DNS from inside an instance
# The VPC DNS resolver is at VPC CIDR base + 2 (e.g., 10.0.0.2)
dig @10.0.0.2 my-service.internal
nslookup my-rds-endpoint.abc123.us-east-1.rds.amazonaws.com

# Check Route 53 private hosted zone association
aws route53 list-hosted-zones-by-vpc --vpc-id vpc-abc123 --vpc-region us-east-1

# If a private hosted zone is not associated with the VPC, queries will fail:
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z123456 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-abc123
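
The "VPC CIDR base + 2" rule for the resolver address can be computed instead of worked out by hand. This is a small sketch for IPv4 that assumes the argument is the VPC's primary CIDR written with its network address. `vpc_resolver_ip` is a hypothetical helper name.

```shell
# Print the VPC DNS resolver address (network base + 2) for a VPC CIDR.
# Runs in a subshell so the IFS change stays local.
vpc_resolver_ip() (
  base=${1%/*}            # strip the /prefix length
  IFS=.
  set -- $base
  n=$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 + 2 ))
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
)
```

From inside an instance you could then run `dig @"$(vpc_resolver_ip 10.0.0.0/16)" my-service.internal`.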

ALB 502/504 Debugging

Gotcha: ALB target security groups must allow traffic from the ALB's security group, not from client IPs. The ALB terminates the client connection and opens a new one to the target, so the source IP the target sees is the ALB's private IP, not the client's.

502 (Bad Gateway) and 504 (Gateway Timeout) from an ALB usually mean the targets are unhealthy or unresponsive.

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/app-tg/... \
  --query 'TargetHealthDescriptions[].{Target:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}'

# Common unhealthy reasons:
# Elb.InitialHealthChecking: initial health checks still in progress
# Target.ResponseCodeMismatch: health check returned an unexpected HTTP code
# Target.Timeout: health check request timed out
# Target.FailedHealthChecks: connection to the target failed or the response was malformed

# Check ALB access logs (if enabled)
# Logs are in S3: s3://alb-logs/AWSLogs/{account-id}/elasticloadbalancing/{region}/...
aws s3 ls s3://alb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/$(date -u +%Y/%m/%d)/

# For 502:
# - Is the target application running and listening on the configured port?
# - Is the security group between ALB and targets allowing health check traffic?
# - Is the target returning a valid HTTP response (not just a TCP connection)?
# - Check: target SG must allow traffic from ALB SG on the target port

# For 504:
# - Is the target taking too long to respond?
# - ALB idle timeout default is 60 seconds — increase if needed
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/web-alb/... \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

# - Check if the target application has its own timeout set lower than ALB's
# - Verify no NACL is blocking return traffic on ephemeral ports
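
If access logs are enabled, counting 502s and 504s per target shows quickly whether one backend is responsible. The sketch below assumes the standard ALB access log layout, where field 5 is `target:port` and field 9 is the status code the load balancer returned. `summarize_gateway_errors` is a hypothetical name.

```shell
# Count 502/504 responses per target from ALB access log lines on stdin.
# Assumes the standard ALB access log field order:
#   type time elb client:port target:port request_processing_time
#   target_processing_time response_processing_time elb_status_code ...
# Output lines: <count> <elb_status_code> <target:port>
summarize_gateway_errors() {
  awk '$9 == "502" || $9 == "504" { count[$9 " " $5]++ }
       END { for (key in count) print count[key], key }' |
    sort -k1,1nr
}
```

The log objects in S3 are gzip-compressed, so a typical invocation would pipe `zcat` output into the function.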

Cross-Region Connectivity

Connecting VPCs across regions for disaster recovery or multi-region architectures.

# Option 1: Transit Gateway peering (recommended for hub-and-spoke)
# Create TGW in each region, then peer them
aws ec2 create-transit-gateway-peering-attachment \
  --transit-gateway-id tgw-us-east-1 \
  --peer-transit-gateway-id tgw-eu-west-1 \
  --peer-region eu-west-1 \
  --peer-account-id 123456789012

# Option 2: VPC peering (simpler for one-to-one)
aws ec2 create-vpc-peering-connection \
  --vpc-id vpc-us-east-1 \
  --peer-vpc-id vpc-eu-west-1 \
  --peer-region eu-west-1

# Verify inter-region latency
# From an instance in us-east-1:
ping -c 10 <private-ip-in-eu-west-1>
# Typical: 70-90ms US East to EU West

Cost Optimization

Scale note: A single NAT Gateway costs ~$32/month idle plus $0.045/GB processed. A busy cluster pulling container images and sending logs can easily push $500+/month through NAT. S3 and DynamoDB gateway endpoints are free and eliminate a huge chunk of that traffic.
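
The arithmetic behind those figures, as a rough estimator. It uses the rates quoted in this section ($0.045/hour and $0.045/GB) and assumes a 730-hour month. Rates vary by region, so the `NAT_HOURLY_USD` and `NAT_PER_GB_USD` overrides are provided as hypothetical knobs, and `nat_monthly_cost` is an illustrative name.

```shell
# Estimate monthly cost in USD for one NAT Gateway.
# Usage: nat_monthly_cost <GB processed per month>
# Defaults are the rates quoted in this section; 730 hours/month is an assumption.
nat_monthly_cost() {
  awk -v gb="${1:-0}" \
      -v hourly="${NAT_HOURLY_USD:-0.045}" \
      -v per_gb="${NAT_PER_GB_USD:-0.045}" \
      'BEGIN { printf "%.2f\n", hourly * 730 + gb * per_gb }'
}
```

With these defaults, `nat_monthly_cost 0` gives the idle cost and `nat_monthly_cost 10000` shows what 10 TB of processed traffic adds on top.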

NAT Gateway and cross-AZ traffic are common surprise costs.

# Find NAT Gateway usage (expensive: $0.045/hour + $0.045/GB)
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[].{Id:NatGatewayId,Subnet:SubnetId,Created:CreateTime}'

# Check NAT Gateway data processing via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-abc123 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Sum

# Alternatives to reduce NAT costs:
# 1. VPC endpoints for S3 and DynamoDB (free gateway endpoints)
# 2. Interface endpoints for frequently accessed AWS services
# 3. NAT instance (t4g.nano ~$3/month) for low-traffic workloads
# 4. Consolidate internet-bound traffic through fewer NAT gateways

# Find cross-AZ data transfer costs
# Check Cost Explorer: filter by Usage Type containing "DataTransfer-Regional"
aws ce get-cost-and-usage \
  --time-period Start=$(date -u -d '30 days ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["USE1-DataTransfer-Regional-Bytes"]
    }
  }'

# Check for unassociated Elastic IPs (idle addresses are still billed hourly)
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].{EIP:PublicIp,AllocId:AllocationId}'
# Release them:
# aws ec2 release-address --allocation-id eipalloc-abc123