Thinking Out Loud: Kubernetes Networking¶
A senior SRE's internal monologue while working through a real Kubernetes networking issue. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
A developer reports that their new microservice order-validator can reach the inventory-service from their local laptop via port-forward, but when deployed to the cluster, the service-to-service call times out. Both services are in the commerce namespace. It "worked in staging" and this is the production cluster.
The Monologue¶
Okay, classic "works locally but not in-cluster" networking issue. These are usually one of three things: NetworkPolicy blocking traffic, DNS resolution failure, or a service/selector mismatch. Let me figure out which one.
First, let me verify the basics — are both services actually running and healthy?
kubectl get pods -n commerce -l app=order-validator -o wide
kubectl get pods -n commerce -l app=inventory-service -o wide
Both running. Good. Now let me check if the Service object for inventory-service is actually selecting the right pods.
kubectl get svc inventory-service -n commerce -o yaml | grep -A 5 selector
kubectl get endpoints inventory-service -n commerce
Service selector is app: inventory-service, and the endpoints show 3 IPs. Those match the pod IPs I saw above. So the Service -> Pod wiring is correct. Not a selector mismatch.
Now let me test DNS from inside the order-validator pod.
kubectl exec -it order-validator-5d8f7b-k9x2m -n commerce -- nslookup inventory-service.commerce.svc.cluster.local
DNS resolves fine. Returns the ClusterIP 10.96.118.42. So DNS is working.
Let me try the actual connection from inside the pod.
kubectl exec -it order-validator-5d8f7b-k9x2m -n commerce -- curl -v --connect-timeout 5 http://inventory-service:8080/health
Connection timed out. So the pod can resolve the DNS name but can't reach the service on port 8080. Let me try reaching the pod IP directly to bypass the Service.
kubectl exec -it order-validator-5d8f7b-k9x2m -n commerce -- curl -v --connect-timeout 5 http://10.244.2.15:8080/health
That also times out. So it's not a kube-proxy/iptables issue — the pod network itself can't reach the target. This narrows it down to either a NetworkPolicy or a CNI issue.
Mental Model: The Network Debugging Ladder¶
Test connectivity in layers: DNS resolution -> ClusterIP -> Pod IP -> same-node pod vs cross-node pod. Each layer that fails tells you where the problem is. If ClusterIP fails but Pod IP works, it's kube-proxy. If Pod IP fails, it's CNI or NetworkPolicy. If same-node works but cross-node fails, it's the overlay or underlay network.
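The ladder's decision logic can be sketched as a small shell helper. This is illustrative only (not real tooling): you feed it a pass/fail result for each layer you tested, and it names the suspect component.

```shell
#!/bin/sh
# diagnose DNS CLUSTERIP PODIP SAMENODE -- each argument is 1 (pass) or 0 (fail).
# Encodes the debugging ladder: the first failing layer names the suspect.
diagnose() {
  dns=$1; clusterip=$2; podip=$3; samenode=$4
  if [ "$dns" = 0 ]; then
    echo "DNS fails: check CoreDNS and the pod's /etc/resolv.conf"
  elif [ "$clusterip" = 1 ]; then
    echo "all layers pass: look above the network (app, TLS, timeouts)"
  elif [ "$podip" = 1 ]; then
    echo "ClusterIP fails but Pod IP works: suspect kube-proxy / Service routing"
  elif [ "$samenode" = 1 ]; then
    echo "same-node works but cross-node fails: suspect overlay/underlay network"
  else
    echo "Pod IP unreachable even same-node: suspect NetworkPolicy or CNI"
  fi
}

# This incident: DNS resolves, but ClusterIP and direct Pod IP both fail
diagnose 1 0 0 0
```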
Let me check if there are NetworkPolicies in this namespace.
kubectl get networkpolicy -n commerce
Ah ha. There are two policies: default-deny-ingress and allow-frontend-to-api. So there's a default deny policy in production. This is the "worked in staging" clue — staging probably doesn't have the deny-all policy.
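A default-deny ingress policy typically looks like the following. This is a sketch of the common pattern, not the actual production manifest:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: commerce
spec:
  podSelector: {}     # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress         # no ingress rules listed, so all ingress is denied
```

The empty `podSelector` is the key: it matches every pod, and because no `ingress` rules are listed, no traffic is allowed in unless some other policy explicitly permits it.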
The allow-frontend-to-api policy probably only allows traffic from the frontend pods. The new order-validator service needs its own ingress rule on inventory-service. Let me look at the policy spec to confirm.
kubectl get networkpolicy allow-frontend-to-api -n commerce -o yaml
Yep. The policy allows ingress on port 8080 from pods with label role: frontend. The order-validator pod doesn't have that label. So the NetworkPolicy is blocking the traffic.
I almost just slapped the role: frontend label on the order-validator pod, but that's wrong — it's not a frontend service, and that label might grant access to other things too. I need to create a proper policy.
Mental Model: Principle of Least Privilege in Network Policies¶
Never widen an existing policy to fit a new service. Create a specific policy that grants exactly the access needed. Reusing labels like "role: frontend" for services that aren't frontends creates implicit trust relationships that will bite you during the next security audit.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-validator-to-inventory
  namespace: commerce
spec:
  podSelector:
    matchLabels:
      app: inventory-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: order-validator
      ports:
        - port: 8080
          protocol: TCP
EOF
Let me verify it worked.
kubectl exec -it order-validator-5d8f7b-k9x2m -n commerce -- curl -v --connect-timeout 5 http://inventory-service:8080/health
200 OK. Traffic is flowing. Let me also verify the developer's actual use case — the gRPC call their service makes, not just the health endpoint.
Good — no more connection timeout errors. Orders are being validated.
Now, before I call this done, I need to think about what else might be affected. Is there a CI/CD pipeline that deploys NetworkPolicies? I don't want my manual policy to get blown away on the next deploy.
kubectl get networkpolicy allow-order-validator-to-inventory -n commerce -o jsonpath='{.metadata.annotations}'
No Helm or ArgoCD annotations. The other policies do have Helm annotations. So the commerce namespace policies are managed by Helm but I just created one manually. That's going to cause drift.
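For reference, Helm-managed resources carry ownership metadata like the fragment below (the release name here is illustrative); its absence on my policy is what confirms the drift:

```yaml
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: commerce-netpol   # illustrative release name
    meta.helm.sh/release-namespace: commerce
```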
I need to add this policy to the Helm chart. Let me note that. For now the policy is in place and working, but tomorrow I'll PR the chart update. I've been burned too many times by "I'll add it to the chart later" becoming "three months later during an incident we discover a manually-applied policy that nobody remembers."
Actually, one more thing — let me make sure there isn't a similar issue for the order-validator's own ingress. Other services might need to call it too.
kubectl get networkpolicy -n commerce -o custom-columns='NAME:.metadata.name,POD-SELECTOR:.spec.podSelector.matchLabels'
There's no ingress policy that targets order-validator pods. If anything needs to call order-validator, it'll fail the same way. I'll flag this to the developer.
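If callers do materialize, the fix would mirror the policy I applied for inventory-service. A sketch, where the caller label and port are hypothetical placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-callers-to-order-validator   # hypothetical name
  namespace: commerce
spec:
  podSelector:
    matchLabels:
      app: order-validator
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout-gateway        # hypothetical caller
      ports:
        - port: 8080                       # hypothetical port
          protocol: TCP
```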
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Check only DNS and assume the network is fine | Test each layer: DNS, ClusterIP, Pod IP, and check for NetworkPolicies | Systematic elimination identifies the blocked layer quickly |
| Add role: frontend label to the new service to match the existing policy | Create a specific NetworkPolicy for the new service-to-service path | Reusing labels creates implicit trust — principle of least privilege |
| Apply the fix and close the ticket | Check how existing policies are managed (Helm, ArgoCD) and flag the drift risk | Manual policies get blown away by the next automated deploy |
| Not think about the reverse direction | Check whether the new service also needs ingress policies for callers | NetworkPolicies are directional — you need to think about both sides |
Key Heuristics Used¶
- Network Debugging Ladder: Test DNS -> ClusterIP -> Pod IP -> same-node vs cross-node to isolate which layer is broken.
- Default Deny Means Explicit Allow: When a namespace has a default-deny NetworkPolicy, every new service-to-service path needs its own allow rule.
- Check the Management Layer: Before manually applying resources, check how existing resources are managed (Helm, ArgoCD, Terraform) to avoid drift.
Cross-References¶
- Primer — Kubernetes networking model, CNI, and how Services route to Pods
- Street Ops — NetworkPolicy debugging commands and connectivity testing
- Footguns — The "worked in staging" trap when staging lacks production NetworkPolicies