Networking Footguns¶
Mistakes that cause outages, silent packet drops, or hours of debugging the wrong thing.
1. NetworkPolicy that blocks DNS¶
You write a default-deny egress policy. Everything stops working because pods can't resolve DNS. You spend an hour debugging the application when it's a network issue.
Fix: Always allow DNS egress in every NetworkPolicy:
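A minimal sketch (the policy name is a placeholder; assumes CoreDNS runs in kube-system, as in stock clusters) that default-denies egress but keeps DNS working:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-dns
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS to kube-system (CoreDNS), both UDP and TCP port 53
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Note TCP 53 as well: large responses and some resolvers fall back to TCP, and allowing only UDP produces exactly the kind of intermittent failure this footgun describes.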
2. Assuming Services are load-balanced evenly¶
You create a Service with 3 backends. One pod handles 80% of traffic. Why? kube-proxy in iptables mode uses random selection, but long-lived connections (gRPC, WebSockets) stick to one pod. Or one pod is slower so connections queue up there.
Fix: Use sessionAffinity: None (default). For gRPC, use client-side load balancing or a service mesh. Consider IPVS mode for better distribution algorithms.
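If you do switch kube-proxy to IPVS mode, the scheduling algorithm is configurable. A sketch of the relevant KubeProxyConfiguration fields, assuming kube-proxy reads its config from this object (as in stock kubeadm clusters):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  # "rr" (round-robin) is the default; "lc" (least connection) helps
  # when one slow backend accumulates connections
  scheduler: lc
```

This only balances at connection-establishment time; it still won't rebalance long-lived gRPC or WebSocket streams.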
3. Headless service returning pod IPs that don't exist¶
You use a headless service (clusterIP: None) for StatefulSet DNS. A pod gets rescheduled to a new node with a new IP. DNS caching means clients still try the old IP for minutes.
Fix: Set a low TTL for DNS records. Use readiness gates. Understand that `ndots:5` (the K8s default) sends every short hostname through the cluster search path, costing several extra DNS queries before the literal name is tried (see footgun 4).
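The TTL on cluster DNS records is set in CoreDNS's kubernetes plugin (it defaults to 5 seconds). A Corefile sketch making it explicit, assuming a stock CoreDNS deployment:

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        ttl 5            # seconds clients may cache cluster DNS answers
    }
    forward . /etc/resolv.conf
    cache 30
}
```

Remember the client side caches too: a stale pod IP can outlive this TTL if the application or a sidecar resolver caches resolved addresses itself.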
4. ndots: 5 killing DNS performance¶
Kubernetes sets `ndots: 5` in pod resolv.conf by default. `api.example.com` has only 2 dots, so the resolver walks the search list before trying the name as-is: `api.example.com.default.svc.cluster.local`, then `api.example.com.svc.cluster.local`, then `api.example.com.cluster.local`, plus any node-level search domains, and only then the actual domain. Several failed queries before the real one, doubled again if the pod asks for both A and AAAA records.
Fix: For external domains, use FQDN with trailing dot: api.example.com.. Or reduce ndots in the pod spec:
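A pod-spec sketch (pod name and image are placeholders) reducing ndots so any name containing a dot is tried as-is first:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots
spec:
  containers:
    - name: app
      image: nginx          # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "1"          # names with at least one dot skip the search path
```

The trade-off: with `ndots: 1`, cluster-internal short names like `myservice.othernamespace` no longer go through the search path either, so use the full `myservice.othernamespace.svc.cluster.local` form for cross-namespace lookups.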
Debug clue: To see the DNS query amplification in action, run `tcpdump -i any port 53` inside a pod while curling an external domain. You'll see 4-5 queries for `.svc.cluster.local`, `.cluster.local`, etc. before the real query. At scale, this can overwhelm CoreDNS: a cluster with 1,000 pods making 10 external DNS queries/second generates 50,000 wasted queries/second.
5. Security Groups that allow everything from VPC¶
You set the security group to allow all traffic from 10.0.0.0/16 because "it's internal." A compromised pod in any namespace can now reach your database, your Redis, your internal APIs.
Fix: Use least-privilege security groups. Allow specific ports from specific sources. Use separate security groups for databases, caches, and application tiers.
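A CloudFormation sketch (resource names and the Postgres port are illustrative): the database security group admits traffic only from the application tier's security group, not the whole VPC CIDR:

```yaml
DatabaseSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Postgres from the app tier only
    VpcId: !Ref Vpc                     # hypothetical VPC resource
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 5432
        ToPort: 5432
        # Referencing another SG instead of a CIDR means membership,
        # not IP range, defines who can connect
        SourceSecurityGroupId: !GetAtt AppSecurityGroup.GroupId
```

SG-to-SG references also survive IP churn: instances and ENIs come and go, but the rule keeps following the group.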
6. CIDR overlap between VPCs¶
You create VPC A with 10.0.0.0/16 and VPC B with 10.0.0.0/16. Now you need them to communicate via peering or Transit Gateway. They can't — overlapping CIDRs are un-routable.
Fix: Plan CIDR allocations upfront. Use a CIDR allocation spreadsheet. Give each VPC a unique range. /16 per VPC with /20 subnets is a common pattern.
Gotcha: This is not just a peering problem. Transit Gateway, VPN connections, and even DNS resolution across VPCs all fail with overlapping CIDRs. And once workloads are deployed, re-IPing a VPC is essentially impossible without downtime. The `10.0.0.0/8` space gives you 256 `/16` networks; plan allocation upfront even if you think you'll "only ever have 2 VPCs."
7. NAT Gateway as a single point of failure¶
You put one NAT Gateway in one AZ. That AZ has an outage. All private subnets across all AZs lose internet access because they all route through the one NAT Gateway.
Fix: Deploy one NAT Gateway per AZ. Route each AZ's private subnet through its own NAT Gateway. Yes, it costs more. The alternative is an outage.
8. Ingress path matching surprises¶
You set `path: /api` with `pathType: Prefix`. Per the Ingress spec, prefix matching is done element by element, so `/api` should match `/api` and `/api/v1` but not `/api-docs` or `/api-keys-leaked`. Some ingress controllers implement it as a plain string prefix, though, and will happily match all four.
Fix: Check your ingress controller's behavior. Use pathType: Exact for exact matches. Use pathType: Prefix with trailing slash awareness: /api/ is safer than /api.
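A sketch of the safer shape (hostnames, paths, and service names are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /healthz
            pathType: Exact        # matches /healthz and nothing else
            backend:
              service:
                name: health
                port:
                  number: 80
          - path: /api/
            pathType: Prefix       # trailing slash: can't match /api-docs
            backend:
              service:
                name: api
                port:
                  number: 80
```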
9. Forgetting that NodePort exposes on ALL nodes¶
You create a NodePort service for debugging. It opens port 30080 on every node in the cluster, including your public-facing nodes. If your security groups allow that port range from the internet, you just exposed your service publicly.
Fix: NodePort range (30000-32767) should be blocked from internet in security groups. Use LoadBalancer or Ingress for public access. Only use NodePort for dev/testing.
10. Health check timing disasters¶
Your ALB health check interval is 5 seconds with a healthy threshold of 2, and the check path returns 200 as soon as the server socket opens. Your app takes 30 seconds to actually be ready. The ALB marks the target healthy after ~10 seconds, sends traffic, the pod can't serve it yet, 502s everywhere. With a stricter check path the failure inverts: the pod fails health checks during startup and gets deregistered before it's fully up.
Fix: Set the health check path to the readiness probe path. Increase initialDelaySeconds. Match the ALB deregistration delay with the pod's graceful shutdown time.
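A sketch of the pod side (paths, ports, and timings are illustrative); the ALB target group's health check should hit the same `/ready` endpoint, and its deregistration delay should not exceed the grace period below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 60   # >= ALB deregistration delay
  containers:
    - name: app
      image: myapp:latest             # placeholder image
      readinessProbe:
        httpGet:
          path: /ready                # same path the ALB should check
          port: 8080
        initialDelaySeconds: 30       # app needs ~30s before first check
        periodSeconds: 5
```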
11. MTU mismatch causing silent packet drops¶
You're in a VPC with jumbo frames (MTU 9001). Your VPN tunnel has MTU 1500. Large packets get dropped silently. Small requests work, large responses fail. SSH works, SCP doesn't. Health checks pass, real traffic fails.
Fix: Test with `ping -s 8973 -M do target` (ICMP payload = MTU minus 28 bytes of IP and ICMP headers, so 8973 probes a 9001-byte path). Configure your CNI plugin's MTU appropriately. For VPN/overlay networks, account for encapsulation overhead.
Under the hood: The reason small packets work and large ones fail is Path MTU Discovery (PMTUD). When a packet exceeds a link's MTU and has the Don't Fragment (DF) bit set, the router should return an ICMP "Fragmentation Needed" message. But many firewalls and security groups block ICMP, so the sender never learns about the MTU limit. This is called a "PMTU black hole" — the TCP connection hangs on large transfers while small packets flow fine.
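The payload arithmetic for a clean MTU probe is just the link MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header); a small shell helper to compute it:

```shell
# Payload for `ping -s` at a given link MTU: subtract the 20-byte IPv4
# header and the 8-byte ICMP header.
payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# AWS jumbo frames (MTU 9001) and classic Ethernet (MTU 1500):
echo "ping -s $(payload_for_mtu 9001) -M do <target>"   # probes a 9001-byte path
echo "ping -s $(payload_for_mtu 1500) -M do <target>"   # probes a 1500-byte path
```

If the 1472-byte probe succeeds but larger ones hang rather than returning "Frag needed", you're likely looking at a PMTU black hole rather than a plain MTU limit.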
Pages that link here¶
- Anti-Primer: Networking
- Incident Replay: ARP Flux — Duplicate IP Detection
- Incident Replay: BGP Peer Flapping
- Incident Replay: Cable Plugged Into Wrong Port
- Incident Replay: DHCP Relay Broken
- Incident Replay: DNS Split-Horizon Confusion
- Incident Replay: Duplex Mismatch Symptoms
- Incident Replay: Firewall Shadow Rule
- Incident Replay: Jumbo Frames Partial Deployment
- Incident Replay: LACP Mismatch — One Link Hot
- Incident Replay: Link Flaps from Bad Optic
- Incident Replay: MTU Blackhole — TLS Stalls
- Incident Replay: Multicast Not Crossing Router
- Incident Replay: NAT Exhaustion — Intermittent Connectivity
- Incident Replay: Network Bonding Failover Not Working