Kubernetes Networking - Street-Level Ops

Real-world workflows for diagnosing and fixing networking issues in production clusters.

DNS Troubleshooting

# Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# NAME                       READY   STATUS    RESTARTS   AGE
# coredns-5d78c9869d-4x7k2   1/1     Running   0          45d
# coredns-5d78c9869d-r8m3p   1/1     Running   0          45d

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Test DNS from inside a pod
kubectl exec -it debug-pod -- nslookup kubernetes.default
kubectl exec -it debug-pod -- nslookup backend-api.production.svc.cluster.local

# Check the pod's resolv.conf
kubectl exec -it debug-pod -- cat /etc/resolv.conf
# nameserver 10.96.0.10
# search production.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# If DNS is slow, ndots:5 forces 3 extra lookups (one per search domain)
# for external names, per address family
# Fix: add dnsConfig to the pod spec to lower ndots

Under the hood: With ndots:5 (the Kubernetes default), any name with fewer than 5 dots walks the search path first. A lookup for api.stripe.com tries api.stripe.com.production.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, and only then the absolute name — 4 DNS queries instead of 1, doubled again when both A and AAAA records are requested. Set ndots:2 via dnsConfig, or append a trailing dot (api.stripe.com.) to skip the search path entirely.
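
The dnsConfig fix looks roughly like this (a minimal sketch — the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app            # hypothetical pod
spec:
  containers:
    - name: app
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"     # names with >= 2 dots are tried as absolute first
```

With ndots:2, api.stripe.com (2 dots) goes straight to the upstream resolver, while single-label service names like backend-api still use the cluster search path.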

Debug clue: If DNS lookups are slow but CoreDNS itself is healthy, check what CoreDNS forwards to. Run kubectl get configmap coredns -n kube-system -o yaml and look at the forward directive — it should point to a fast upstream, not a corporate resolver behind a VPN.
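
For reference, the ConfigMap holds the Corefile under data — roughly this shape (a sketch of a common default; your plugins and upstream will differ):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf    # upstream resolver: keep it fast and close
        cache 30
    }
```

The forward line is the one to scrutinize: forward . /etc/resolv.conf inherits the node's resolvers, which is where a slow corporate DNS chain usually sneaks in.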

Service Connectivity

# Check service endpoints (are backend pods registered?)
kubectl get endpoints backend-api -n production
# NAME          ENDPOINTS                                         AGE
# backend-api   10.244.1.15:8080,10.244.2.22:8080,10.244.3.8:8080   30d

# No endpoints? Check selector match
kubectl get svc backend-api -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production -l app=backend
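
For endpoints to populate, the Service selector must match the pod labels exactly — a sketch with illustrative names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend-api
  namespace: production
spec:
  selector:
    app: backend        # must match metadata.labels on the backend pods
  ports:
    - port: 80          # the Service (ClusterIP) port
      targetPort: 8080  # the containerPort the pods actually listen on
```

A typo in either the selector or the pod labels (app: backend vs app: backend-api) silently produces an empty endpoints list — no error, just no traffic. Also remember that pods must be Ready to be added as endpoints, so a failing readiness probe empties the list too.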

# Test connectivity from another pod
kubectl exec -it debug-pod -- curl -v --connect-timeout 5 http://backend-api.production.svc:80

# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# View iptables rules for a service (on the node)
iptables -t nat -L KUBE-SERVICES -n | grep backend-api

# IPVS mode: view virtual servers
ipvsadm -Ln | grep -A 3 10.96.45.123

Gotcha: kube-proxy in iptables mode creates O(n) rules — at 10,000+ Services, iptables rule programming takes minutes and causes connection drops during updates. Switch to IPVS mode (mode: ipvs in kube-proxy ConfigMap) for O(1) lookups at scale.

Remember: K8s networking debug ladder mnemonic: D-E-P-N — DNS resolves? Endpoints populated? Pod IP reachable? NetworkPolicy blocking? Work through these four in order and you will find the problem 90% of the time.

Scale note: In IPVS mode, each Service adds a virtual server entry in the kernel's IPVS table with O(1) lookup cost. In iptables mode, every packet traverses the full chain linearly. At 5,000+ Services, iptables rule sync alone can take 30+ seconds and cause noticeable packet loss during updates.
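
Switching modes means editing the kube-proxy ConfigMap — the relevant fragment looks roughly like this (a sketch; scheduler choice is illustrative):

```yaml
# kube-proxy ConfigMap (kube-system), inside the config.conf key
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; alternatives include lc (least connection), sh (source hashing)
```

IPVS mode needs the kernel modules (ip_vs, ip_vs_rr, nf_conntrack) loaded on every node, and the kube-proxy pods must be restarted to pick up the change. If the modules are missing, kube-proxy falls back to iptables mode and logs a warning.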

Pod-to-Pod Connectivity

# Get pod IPs
kubectl get pods -n production -o wide
# NAME                    READY   IP            NODE
# frontend-abc123         1/1     10.244.1.15   worker-01
# backend-def456          1/1     10.244.2.22   worker-02

# Test cross-node pod connectivity
kubectl exec -it frontend-abc123 -- ping -c 3 10.244.2.22

# If pods cannot reach each other, check the CNI plugin
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=cilium

# Check CNI pod logs on the affected node
kubectl logs -n kube-system calico-node-x7k2m --tail=50

Debug clue: If cross-node pod traffic fails but same-node pods communicate fine, the CNI overlay (VXLAN/IPIP/WireGuard) is broken. Check that the overlay port (standard VXLAN: UDP 4789; Flannel and Cilium default to the Linux kernel port UDP 8472; Calico IPIP: IP protocol 4) is not blocked by a cloud security group or host firewall. tcpdump -i eth0 udp port 4789 (or 8472) on both nodes will confirm whether encapsulated packets are flowing.

NetworkPolicy Debugging

# List all network policies in a namespace
kubectl get networkpolicies -n production

# Check if a default-deny policy is blocking traffic
kubectl get networkpolicies -n production -o yaml | grep -A5 "policyTypes"

# Common issue: egress policy blocks DNS
# Fix: explicitly allow UDP/TCP port 53 in egress rules
# Target kube-system namespace for DNS egress — not 0.0.0.0/0

# Describe a specific policy to see its rules
kubectl describe networkpolicy allow-frontend-to-backend -n production

# Quick test: temporarily delete the policy and see if traffic flows
# (save the manifest first so you can restore it immediately — risky in production)
kubectl get networkpolicy allow-frontend-to-backend -n production -o yaml > policy-backup.yaml
kubectl delete networkpolicy allow-frontend-to-backend -n production

Default trap: NetworkPolicies are additive allow-lists. A namespace with no NetworkPolicies allows all traffic. The moment a pod is selected by any policy, traffic in that policy's policyTypes directions is denied unless explicitly allowed. This catches teams who list Egress in policyTypes while only writing ingress rules — and accidentally block all outbound traffic, including DNS.

Gotcha: A NetworkPolicy with an empty podSelector: {} and no allow rules acts as a default-deny for every pod in the namespace, in the directions listed under policyTypes. This is the recommended security baseline, but teams forget to add explicit allow rules for DNS egress (UDP/TCP port 53 to kube-system), breaking all name resolution.
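
Here is a sketch of that baseline done right — default-deny for the whole namespace with DNS egress explicitly carved out (the policy name is illustrative; kubernetes.io/metadata.name is the label Kubernetes sets on namespaces automatically since v1.21):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-allow-dns
  namespace: production
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes: [Ingress, Egress]   # deny both directions by default
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Everything else stays denied until you add further allow policies (frontend-to-backend, egress to external APIs, and so on) on top of this one.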

Ingress Debugging

# List all ingress resources
kubectl get ingress -A

# Check ingress controller pods
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# View ingress controller logs for routing errors
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100

# Check if the ingress has an external IP/hostname assigned
kubectl get ingress app-ingress -n production -o jsonpath='{.status.loadBalancer.ingress[0]}'

# Test from inside the cluster
kubectl exec -it debug-pod -- curl -H "Host: app.example.com" http://ingress-nginx-controller.ingress-nginx.svc

Under the hood: When EXTERNAL-IP stays <pending>, the cloud controller manager (CCM) is responsible for provisioning the load balancer. On EKS, the AWS Load Balancer Controller handles this; on GKE, it is built-in. Check CCM logs, not kube-proxy logs. The most common cause is missing subnet tags (kubernetes.io/role/elb=1 for public, kubernetes.io/role/internal-elb=1 for internal).
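
For completeness, a minimal Ingress matching the Host-header test above might look like this (a sketch — the host, class name, and backend service are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: production
spec:
  ingressClassName: nginx      # must match an installed IngressClass, or the controller ignores it
  rules:
    - host: app.example.com    # the Host header the curl test sends
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: backend-api
                port:
                  number: 80
```

A missing or wrong ingressClassName is a classic silent failure: the resource is accepted by the API server, but no controller ever programs a route for it.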

Packet Capture

# Ephemeral debug container with network tools (K8s 1.23+)
kubectl debug -it problem-pod --image=nicolaka/netshoot --target=app-container

# Inside the debug container:
tcpdump -i eth0 -nn port 8080 -c 50
tcpdump -i eth0 -nn host 10.244.2.22

# Capture and save for Wireshark analysis
tcpdump -i eth0 -nn -w /tmp/capture.pcap port 8080 -c 1000

Gotcha: Ephemeral debug containers (kubectl debug) share the target pod's network namespace but get their own filesystem. Tools you install inside the debug container can see all the same network interfaces, IPs, and connections as the target container — making tcpdump, ss, and ip addr work as expected.

One-liner: Spin up a disposable network debug pod: kubectl run netdebug --rm -it --image=nicolaka/netshoot -- bash — has tcpdump, dig, curl, nmap, iperf3, and every other network tool you need.

NodePort and LoadBalancer

# Check NodePort allocation
kubectl get svc -n production -o wide | grep NodePort

# Test NodePort from outside the cluster
curl -v http://<node-ip>:31080

# Check LoadBalancer external IP
kubectl get svc public-api -n production
# NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)
# public-api   LoadBalancer   10.96.45.123   203.0.113.50     443:31443/TCP

# If EXTERNAL-IP is <pending>, check cloud provider LB controller logs
kubectl logs -n kube-system -l app=cloud-controller-manager --tail=50

Quick Network Diagnosis Checklist

# 1. Can the pod resolve DNS?
kubectl exec -it $POD -- nslookup kubernetes.default

# 2. Can the pod reach the service ClusterIP?
kubectl exec -it $POD -- curl -v --connect-timeout 5 http://$SVC_CLUSTER_IP:$PORT

# 3. Can the pod reach the backend pod IP directly?
kubectl exec -it $POD -- curl -v --connect-timeout 5 http://$POD_IP:$TARGET_PORT

# 4. Are endpoints populated?
kubectl get endpoints $SVC_NAME -n $NS

# 5. Are network policies blocking traffic?
kubectl get networkpolicies -n $NS

Quick Reference