Anti-Primer: Kubernetes Networking

Everything that can go wrong, will — and in this story, it does.

The Setup

A platform team is migrating from a flat network to Kubernetes NetworkPolicies to satisfy a compliance audit. The engineer writes policies over a weekend and deploys them Monday morning, assuming 'deny all then allow' is straightforward.

The Timeline

Hour 0: Default Deny Blocks DNS

The engineer applies a default-deny ingress and egress policy without allowing DNS egress (port 53) to kube-dns. The deadline was looming, and this seemed like the fastest path forward. As a result, every pod in the namespace loses DNS resolution, and all services fail simultaneously.

Footgun #1: Default Deny Blocks DNS. Applying a default-deny ingress and egress policy without allowing DNS (port 53) to kube-dns means every pod in the namespace loses DNS resolution, and all services fail simultaneously.
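
For contrast, here is a minimal sketch of a default-deny policy paired with a DNS egress allow rule. The namespace name `payments`, the policy names, and the `k8s-app: kube-dns` label are assumptions for illustration; check the labels on your cluster's DNS pods before relying on them.

```yaml
# Default-deny for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all        # hypothetical name
  namespace: payments           # assumption: the story does not name the namespace
spec:
  podSelector: {}               # empty selector = all pods in this namespace
  policyTypes:
    - Ingress
    - Egress
---
# Companion policy: allow DNS egress to the cluster DNS pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # hypothetical name
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns # assumption: common label on kube-dns/CoreDNS pods
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP         # DNS falls back to TCP for large responses
          port: 53
```

NetworkPolicies are additive, so the allow rule takes effect alongside the default deny. Applying both in the same `kubectl apply` avoids a window where DNS is blocked. Clusters that run a node-local DNS cache may need a different egress target.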

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Label Selector Typo

The NetworkPolicy selector reads app: paymnet instead of app: payment. Under time pressure, the team chose speed over caution. As a result, the policy does not match any pods, and the payment service remains wide open despite the 'restriction'.

Footgun #2: Label Selector Typo. A NetworkPolicy selector that reads app: paymnet instead of app: payment does not match any pods, so the payment service remains wide open despite the 'restriction'.
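
A sketch of what the intended policy might look like with the selector spelled correctly. The namespace, the policy name, the `app: checkout` client label, and port 8080 are hypothetical; only the `app: payment` label comes from the story.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-restrict-ingress  # hypothetical name
  namespace: payments             # assumption: namespace is not named in the story
spec:
  podSelector:
    matchLabels:
      app: payment                # must match the pod labels exactly; "paymnet" selects nothing
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout       # hypothetical client allowed to call payment
      ports:
        - protocol: TCP
          port: 8080              # hypothetical service port
```

The API server accepts a selector that matches zero pods without complaint, so the typo produces no error. Running `kubectl get pods -n payments -l app=payment` with the same label is one way to confirm the selector matches real pods; an empty result means the policy protects nothing.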

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Forgetting Cross-Namespace Traffic

The policies only allow traffic within the namespace, so monitoring in a different namespace is blocked. Nobody pushed back because the shortcut looked harmless in the moment. As a result, Prometheus cannot scrape metrics, alerting goes dark, and the team does not notice for days.

Footgun #3: Forgetting Cross-Namespace Traffic. Policies that only allow traffic within the namespace block monitoring in a different namespace, so Prometheus cannot scrape metrics, alerting goes dark, and the team does not notice for days.
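
A minimal sketch of a cross-namespace allow rule for metrics scraping. The namespace names `payments` and `monitoring`, the Prometheus pod label, and the metrics port are all assumptions; substitute whatever your monitoring stack actually uses.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape   # hypothetical name
  namespace: payments             # assumption: the namespace being locked down
spec:
  podSelector: {}                 # all pods in this namespace may be scraped
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector and podSelector in the SAME list item are ANDed:
        # only Prometheus pods inside the monitoring namespace match.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # assumption: monitoring namespace name
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus        # assumption: label on Prometheus pods
      ports:
        - protocol: TCP
          port: 9090              # hypothetical metrics port
```

Writing the two selectors as separate list items (each with its own leading dash) would OR them instead, which allows every pod in the monitoring namespace plus any pod in the local namespace carrying the Prometheus label. If the monitoring namespace also runs default-deny egress, it needs a matching egress rule on its side.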

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: CIDR Rule Blocks Cloud Metadata

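The policy allows egress to 0.0.0.0/0, but a more specific exclusion blocks the cloud metadata endpoint (169.254.169.254). Standard NetworkPolicy has no deny rules, so in practice this is an ipBlock except entry or a CNI-specific deny policy. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, IAM roles for service accounts stop working, and pods cannot authenticate to cloud APIs.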

Footgun #4: CIDR Rule Blocks Cloud Metadata. Allowing egress to 0.0.0.0/0 while a more specific exclusion blocks the cloud metadata endpoint means IAM roles for service accounts stop working, and pods cannot authenticate to cloud APIs.
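
A sketch of an explicit egress allow for the metadata endpoint, scoped to the pods that need cloud credentials. The namespace, the policy name, the `app: payment` scoping, and port 80 are assumptions; the exact address and port depend on the cloud provider and its credential mechanism.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metadata-egress   # hypothetical name
  namespace: payments           # assumption: namespace is not named in the story
spec:
  podSelector:
    matchLabels:
      app: payment              # assumption: only these pods need cloud credentials
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 169.254.169.254/32   # cloud metadata endpoint from the postmortem table
      ports:
        - protocol: TCP
          port: 80              # assumption: plain HTTP metadata service
```

Because standard NetworkPolicies are additive, this allow applies even if another policy excludes the same address through an ipBlock except entry. A CNI-specific deny policy may still take precedence, and enforcement for link-local addresses can vary by network plugin, so it is worth testing from inside a pod. Scoping the rule with a podSelector rather than `{}` limits which workloads can reach instance credentials.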

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Default Deny Blocks DNS | Every pod in the namespace loses DNS resolution; all services fail simultaneously | Primer: Always include a DNS egress allow rule when implementing default-deny |
| 2 | Label Selector Typo | Policy does not match any pods; the payment service remains wide open despite the 'restriction' | Primer: Test policies in a staging namespace; verify with `kubectl describe networkpolicy` |
| 3 | Forgetting Cross-Namespace Traffic | Prometheus cannot scrape metrics; alerting goes dark; team does not notice for days | Primer: Include namespaceSelector rules for cross-namespace dependencies |
| 4 | CIDR Rule Blocks Cloud Metadata | IAM roles for service accounts stop working; pods cannot authenticate to cloud APIs | Primer: Explicitly allow cloud metadata endpoint (169.254.169.254) in egress rules |

Damage Report

Downtime: 2-4 hours of pod-level or cluster-wide disruption
Data loss: Risk of volume data loss if StatefulSets were affected
Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

Footgun #1: Had the engineer read the primer's section on Default Deny Blocks DNS, they would have learned to always include a DNS egress allow rule when implementing default-deny.
Footgun #2: Had the engineer read the primer's section on Label Selector Typo, they would have learned to test policies in a staging namespace and verify them with `kubectl describe networkpolicy`.
Footgun #3: Had the engineer read the primer's section on Forgetting Cross-Namespace Traffic, they would have learned to include namespaceSelector rules for cross-namespace dependencies.
Footgun #4: Had the engineer read the primer's section on CIDR Rule Blocks Cloud Metadata, they would have learned to explicitly allow the cloud metadata endpoint (169.254.169.254) in egress rules.

Cross-References

Primer — The right way
Footguns — The mistakes catalogued
Street Ops — How to do it in practice