Anti-Primer: Kubernetes Networking

Everything that can go wrong, will — and in this story, it does.

The Setup

A platform team is migrating from a flat network to Kubernetes NetworkPolicies to satisfy a compliance audit. The engineer writes policies over a weekend and deploys them Monday morning, assuming 'deny all then allow' is straightforward.

The Timeline

Hour 0: Default Deny Blocks DNS

The engineer applies a default-deny policy for ingress and egress without allowing DNS (port 53) to kube-dns. The deadline was looming, and this seemed like the fastest path forward. As a result, every pod in the namespace loses DNS resolution, and all services fail simultaneously.

Footgun #1: Default Deny Blocks DNS. Applying a default-deny ingress and egress policy without allowing DNS (port 53) to kube-dns cuts every pod in the namespace off from DNS resolution, so all services fail simultaneously.
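For reference, here is a minimal sketch of a default-deny policy paired with the DNS egress rule that was missing. The namespace name `payments` and the policy names are hypothetical. The sketch also assumes that the cluster DNS pods run in `kube-system` with the conventional `k8s-app: kube-dns` label and that namespaces carry the automatic `kubernetes.io/metadata.name` label; clusters using a node-local DNS cache or a different DNS deployment would likely need different selectors.

```yaml
# Deny all ingress and egress for every pod in the (hypothetical) namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all        # hypothetical name
  namespace: payments           # hypothetical namespace
spec:
  podSelector: {}               # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow DNS lookups so pods can still resolve service names.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # hypothetical name
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector and podSelector in the SAME list item are ANDed:
        # only kube-dns pods inside kube-system are matched.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # assumed label
          podSelector:
            matchLabels:
              k8s-app: kube-dns                          # assumed label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP         # DNS falls back to TCP for large responses
          port: 53
```

Applying both policies together, rather than the deny policy first, avoids the window in which DNS is cut off.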

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Label Selector Typo

The NetworkPolicy selector has app: paymnet instead of app: payment. Under time pressure, the team chose speed over caution. As a result, the policy matches no pods, and the payment service remains wide open despite the 'restriction'.

Footgun #2: Label Selector Typo. A NetworkPolicy selector with app: paymnet instead of app: payment matches no pods, so the payment service remains wide open despite the 'restriction'.
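A sketch of what the corrected policy might look like is shown below. Only the `app: payment` label comes from the story; the policy name, the `payments` namespace, the `app: checkout` client label, and port 8080 are hypothetical placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-restrict-ingress   # hypothetical name
  namespace: payments              # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: payment                 # the typo "paymnet" silently selected zero pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout        # hypothetical client allowed to call payment
      ports:
        - protocol: TCP
          port: 8080               # hypothetical service port
```

The API server accepts a selector that matches nothing, so the typo produces no error. One way to catch it is to run `kubectl get pods -n payments -l app=payment` with the same label the policy uses; an empty result means the policy protects no pods.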

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Forgetting Cross-Namespace Traffic

The policies only allow traffic within the namespace, so monitoring in a different namespace is blocked. Nobody pushed back because the shortcut looked harmless in the moment. As a result, Prometheus cannot scrape metrics, alerting goes dark, and the team does not notice for days.

Footgun #3: Forgetting Cross-Namespace Traffic. Policies that only allow traffic within the namespace block monitoring from a different namespace, so Prometheus cannot scrape metrics, alerting goes dark, and the team does not notice for days.
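The missing cross-namespace rule could be sketched roughly as follows. It assumes Prometheus runs in a namespace named `monitoring`, that its pods carry the label `app.kubernetes.io/name: prometheus`, and that workloads expose metrics on TCP port 9090; all three are assumptions for illustration, as are the policy and namespace names.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape    # hypothetical name
  namespace: payments              # hypothetical namespace being scraped
spec:
  podSelector: {}                  # applies to every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        # One list item with both selectors: the source must be a
        # Prometheus pod AND live in the monitoring namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring    # assumed namespace
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus         # assumed pod label
      ports:
        - protocol: TCP
          port: 9090               # hypothetical metrics port
```

Writing `namespaceSelector` and `podSelector` as two separate list items would instead mean "any pod in the monitoring namespace OR any matching pod in this namespace", which is a broader rule than intended.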

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: CIDR Rule Blocks Cloud Metadata

The egress policy allows 0.0.0.0/0, but a more specific rule carves out the cloud metadata endpoint. (Standard NetworkPolicy has no explicit deny, so in practice this is an ipBlock except entry or a CNI-specific deny policy.) The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, IAM roles for service accounts stop working, and pods cannot authenticate to cloud APIs.

Footgun #4: CIDR Rule Blocks Cloud Metadata. Allowing egress to 0.0.0.0/0 while a more specific rule excludes the cloud metadata endpoint breaks IAM roles for service accounts, so pods cannot authenticate to cloud APIs.
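An explicit allow for the metadata endpoint might look like the sketch below. The address 169.254.169.254 comes from the postmortem; the policy name, the namespace, the `needs-cloud-credentials` label, and port 80 are assumptions, and how link-local addresses are treated can vary by CNI plugin and cloud provider.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metadata-egress      # hypothetical name
  namespace: payments              # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      needs-cloud-credentials: "true"   # hypothetical opt-in label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 169.254.169.254/32    # cloud metadata endpoint
      ports:
        - protocol: TCP
          port: 80                      # assumed metadata service port
```

Standard NetworkPolicies are additive: a connection is permitted if any policy selecting the pod allows it. An allow like this one therefore takes effect even when another policy excludes the same address through an `except` entry. Scoping it to labelled pods, rather than the whole namespace, keeps the metadata endpoint closed to workloads that do not need cloud credentials.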

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Default Deny Blocks DNS | Every pod in the namespace loses DNS resolution; all services fail simultaneously | Primer: Always include a DNS egress allow rule when implementing default-deny |
| 2 | Label Selector Typo | Policy does not match any pods; the payment service remains wide open despite the 'restriction' | Primer: Test policies in a staging namespace; verify with kubectl describe networkpolicy |
| 3 | Forgetting Cross-Namespace Traffic | Prometheus cannot scrape metrics; alerting goes dark; team does not notice for days | Primer: Include namespaceSelector rules for cross-namespace dependencies |
| 4 | CIDR Rule Blocks Cloud Metadata | IAM roles for service accounts stop working; pods cannot authenticate to cloud APIs | Primer: Explicitly allow cloud metadata endpoint (169.254.169.254) in egress rules |

Damage Report

  • Downtime: 2-4 hours of pod-level or cluster-wide disruption
  • Data loss: Risk of volume data loss if StatefulSets were affected
  • Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
  • Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
  • Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on "Default Deny Blocks DNS", they would have learned to always include a DNS egress allow rule when implementing default-deny.
  • Footgun #2: Had the engineer read the primer's section on "Label Selector Typo", they would have learned to test policies in a staging namespace and verify them with kubectl describe networkpolicy.
  • Footgun #3: Had the engineer read the primer's section on "Forgetting Cross-Namespace Traffic", they would have learned to include namespaceSelector rules for cross-namespace dependencies.
  • Footgun #4: Had the engineer read the primer's section on "CIDR Rule Blocks Cloud Metadata", they would have learned to explicitly allow the cloud metadata endpoint (169.254.169.254) in egress rules.

Cross-References