Kubernetes Networking - Street-Level Ops¶
Real-world workflows for diagnosing and fixing networking issues in production clusters.
DNS Troubleshooting¶
# Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# NAME READY STATUS RESTARTS AGE
# coredns-5d78c9869d-4x7k2 1/1 Running 0 45d
# coredns-5d78c9869d-r8m3p 1/1 Running 0 45d
# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Test DNS from inside a pod
kubectl exec -it debug-pod -- nslookup kubernetes.default
kubectl exec -it debug-pod -- nslookup backend-api.production.svc.cluster.local
# Check the pod's resolv.conf
kubectl exec -it debug-pod -- cat /etc/resolv.conf
# nameserver 10.96.0.10
# search production.svc.cluster.local svc.cluster.local cluster.local
# ndots:5
# If DNS is slow, ndots:5 forces every external name through the cluster search path first
# Fix: add dnsConfig to pod spec to lower ndots
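The `dnsConfig` fix can look like this — a minimal sketch, with placeholder pod and image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                # placeholder name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"         # names with >= 2 dots skip the cluster search path
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```

With `ndots:2`, a name like `api.stripe.com` (two dots) is queried absolutely on the first attempt, while short in-cluster names like `backend-api` still go through the search path.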
Under the hood: With `ndots:5` (the Kubernetes default), a lookup for `api.stripe.com` tries `api.stripe.com.production.svc.cluster.local`, then `.svc.cluster.local`, then `.cluster.local`, and only then the absolute name — 4 DNS queries instead of 1 (doubled again for A and AAAA records). Set `ndots:2` or append a trailing dot (`api.stripe.com.`) to skip the search path.

Debug clue: If DNS lookups are slow but CoreDNS is healthy, check whether CoreDNS is forwarding through a chain of resolvers. Run `kubectl get configmap coredns -n kube-system -o yaml` and look at the `forward` directive — it should point to a fast upstream, not a corporate resolver behind a VPN.
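The search-path expansion described above can be sketched in a few lines of shell. This only simulates the query order a pod's resolver would use (with the search domains from the `resolv.conf` shown earlier); it sends no DNS traffic:

```shell
# Simulate the resolver's search-path expansion for ndots:5.
# Prints the FQDNs a pod would try, in order.
expand() {
  name="$1"
  search="production.svc.cluster.local svc.cluster.local cluster.local"
  case "$name" in
    *.) printf '%s\n' "$name"; return ;;   # trailing dot: absolute name, 1 query
  esac
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt 5 ]; then               # fewer dots than ndots: search path first
    for d in $search; do printf '%s.%s\n' "$name" "$d"; done
  fi
  printf '%s.\n' "$name"                   # the absolute query, tried last
}

expand api.stripe.com      # 4 candidate names instead of 1
```

Running `expand api.stripe.com.` (trailing dot) prints a single name, which is exactly why the trailing-dot trick works.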
Service Connectivity¶
# Check service endpoints (are backend pods registered?)
kubectl get endpoints backend-api -n production
# NAME ENDPOINTS AGE
# backend-api 10.244.1.15:8080,10.244.2.22:8080,10.244.3.8:8080 30d
# No endpoints? Check selector match
kubectl get svc backend-api -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production -l app=backend
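For endpoints to populate, the Service's `spec.selector` must match the pods' labels key for key. A minimal sketch, with names mirroring the examples above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend-api
  namespace: production
spec:
  selector:
    app: backend        # must equal the pods' metadata.labels exactly
  ports:
    - port: 80          # the ClusterIP port clients connect to
      targetPort: 8080  # the containerPort the pods actually listen on
```

A selector typo or a `targetPort` that doesn't match the container's listening port are the two most common reasons the ENDPOINTS column comes up empty or connections hang.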
# Test connectivity from another pod
kubectl exec -it debug-pod -- curl -v --connect-timeout 5 http://backend-api.production.svc:80
# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
# View iptables rules for a service (on the node)
iptables -t nat -L KUBE-SERVICES -n | grep backend-api
# IPVS mode: view virtual servers
ipvsadm -Ln | grep -A 3 10.96.45.123
Gotcha: `kube-proxy` in iptables mode creates O(n) rules — at 10,000+ Services, iptables rule programming takes minutes and causes connection drops during updates. Switch to IPVS mode (`mode: ipvs` in the kube-proxy ConfigMap) for O(1) lookups at scale.

Remember the K8s networking debug ladder mnemonic, D-E-P-N: DNS resolves? Endpoints populated? Pod IP reachable? NetworkPolicy blocking? Work through these four in order and you will find the problem 90% of the time.
Scale note: In IPVS mode, each Service adds a virtual server entry in the kernel's IPVS table with O(1) lookup cost. In iptables mode, every packet traverses the full chain linearly. At 5,000+ Services, iptables rule sync alone can take 30+ seconds and cause noticeable packet loss during updates.
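The mode switch itself is a one-line change in the `KubeProxyConfiguration` embedded in the kube-proxy ConfigMap — a sketch of the relevant fields (kube-proxy pods must be restarted afterwards to pick it up):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; lc (least connection) and sh (source hash) also exist
```

Note that IPVS mode requires the `ip_vs` kernel modules to be loadable on every node; kube-proxy falls back to iptables mode if they are missing.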
Pod-to-Pod Connectivity¶
# Get pod IPs
kubectl get pods -n production -o wide
# NAME READY IP NODE
# frontend-abc123 1/1 10.244.1.15 worker-01
# backend-def456 1/1 10.244.2.22 worker-02
# Test cross-node pod connectivity
kubectl exec -it frontend-abc123 -- ping -c 3 10.244.2.22
# If pods cannot reach each other, check the CNI plugin
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=cilium
# Check CNI pod logs on the affected node
kubectl logs -n kube-system calico-node-x7k2m --tail=50
Debug clue: If cross-node pod traffic fails but same-node pods communicate fine, the CNI overlay (VXLAN/IPIP/WireGuard) is broken. Check that the overlay port (VXLAN: UDP 4789, Calico IPIP: IP protocol 4) is not blocked by a cloud security group or host firewall. `tcpdump -i eth0 udp port 4789` on both nodes will confirm whether encapsulated packets are flowing.
NetworkPolicy Debugging¶
# List all network policies in a namespace
kubectl get networkpolicies -n production
# Check if a default-deny policy is blocking traffic
kubectl get networkpolicies -n production -o yaml | grep -A5 "policyTypes"
# Common issue: egress policy blocks DNS
# Fix: explicitly allow UDP/TCP port 53 in egress rules
# Target kube-system namespace for DNS egress — not 0.0.0.0/0
# Describe a specific policy to see its rules
kubectl describe networkpolicy allow-frontend-to-backend -n production
# Quick test: temporarily delete the policy and see if traffic flows
# (non-production only — export the manifest first so you can restore it)
kubectl get networkpolicy allow-frontend-to-backend -n production -o yaml > netpol-backup.yaml
kubectl delete networkpolicy allow-frontend-to-backend -n production
Default trap: NetworkPolicies are additive (whitelist-only). A namespace with no NetworkPolicies allows all traffic. The moment you add ONE policy, all non-matching traffic is denied. This catches teams who deploy a single ingress policy and accidentally block all egress.
Gotcha: A `NetworkPolicy` with an empty `podSelector: {}` in a namespace acts as a default-deny for all pods in that namespace. This is the recommended security baseline, but teams forget to add explicit allow rules for DNS egress (UDP/TCP port 53 to `kube-system`), breaking all name resolution.
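The baseline described above — default-deny plus an explicit DNS-egress allowance — can be sketched as two policies (the `kubernetes.io/metadata.name` namespace label is set automatically on Kubernetes 1.21+):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Deploy the DNS-egress policy in the same change as the default-deny — applying the deny first leaves a window where nothing in the namespace can resolve names.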
Ingress Debugging¶
# List all ingress resources
kubectl get ingress -A
# Check ingress controller pods
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# View ingress controller logs for routing errors
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100
# Check if the ingress has an external IP/hostname assigned
kubectl get ingress app-ingress -n production -o jsonpath='{.status.loadBalancer.ingress[0]}'
# Test from inside the cluster
kubectl exec -it debug-pod -- curl -H "Host: app.example.com" http://ingress-nginx-controller.ingress-nginx.svc
Under the hood: When `EXTERNAL-IP` stays `<pending>`, the cloud controller manager (CCM) is responsible for provisioning the load balancer. On EKS, the AWS Load Balancer Controller handles this; on GKE, it is built-in. Check CCM logs, not kube-proxy logs. The most common cause is missing subnet tags (`kubernetes.io/role/elb=1` for public, `kubernetes.io/role/internal-elb=1` for internal).
Packet Capture¶
# Ephemeral debug container with network tools (K8s 1.23+)
kubectl debug -it problem-pod --image=nicolaka/netshoot --target=app-container
# Inside the debug container:
tcpdump -i eth0 -nn port 8080 -c 50
tcpdump -i eth0 -nn host 10.244.2.22
# Capture and save for Wireshark analysis
tcpdump -i eth0 -nn -w /tmp/capture.pcap port 8080 -c 1000
Gotcha: Ephemeral debug containers (`kubectl debug`) share the target pod's network namespace but get their own filesystem. Tools you install inside the debug container see all the same network interfaces, IPs, and connections as the target container — making `tcpdump`, `ss`, and `ip addr` work as expected.

One-liner: Spin up a disposable network debug pod with `kubectl run netdebug --rm -it --image=nicolaka/netshoot -- bash` — it has `tcpdump`, `dig`, `curl`, `nmap`, `iperf3`, and every other network tool you need.
NodePort and LoadBalancer¶
# Check NodePort allocation
kubectl get svc -n production -o wide | grep NodePort
# Test NodePort from outside the cluster
curl -v http://<node-ip>:31080
# Check LoadBalancer external IP
kubectl get svc public-api -n production
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# public-api LoadBalancer 10.96.45.123 203.0.113.50 443:31443/TCP
# If EXTERNAL-IP is <pending>, check cloud provider LB controller logs
kubectl logs -n kube-system -l app=cloud-controller-manager --tail=50
Quick Network Diagnosis Checklist¶
# 1. Can the pod resolve DNS?
kubectl exec -it $POD -- nslookup kubernetes.default
# 2. Can the pod reach the service ClusterIP?
kubectl exec -it $POD -- curl -v --connect-timeout 5 http://$SVC_CLUSTER_IP:$PORT
# 3. Can the pod reach the backend pod IP directly?
kubectl exec -it $POD -- curl -v --connect-timeout 5 http://$POD_IP:$TARGET_PORT
# 4. Are endpoints populated?
kubectl get endpoints $SVC_NAME -n $NS
# 5. Are network policies blocking traffic?
kubectl get networkpolicies -n $NS
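The checklist above wraps naturally into a small shell helper. This is a hypothetical convenience function, not part of any tool — it assumes a debug pod (default name `debug-pod`) already exists in the target namespace:

```shell
# Hypothetical helper: walk the D-E-P-N ladder for one Service in one namespace.
# Usage: k8s_net_triage <namespace> <service> [debug-pod-name]
k8s_net_triage() {
  ns="$1"; svc="$2"; pod="${3:-debug-pod}"

  echo "== 1. DNS resolves? =="
  kubectl exec -n "$ns" "$pod" -- nslookup "$svc.$ns.svc.cluster.local"

  echo "== 2. Endpoints populated? =="
  kubectl get endpoints "$svc" -n "$ns"

  echo "== 3. Service reachable from a pod? =="
  kubectl exec -n "$ns" "$pod" -- \
    curl -sv --connect-timeout 5 -o /dev/null "http://$svc.$ns.svc"

  echo "== 4. NetworkPolicies in play? =="
  kubectl get networkpolicies -n "$ns"
}

# Example: k8s_net_triage production backend-api
```

Each step's failure point maps back to a section above: step 1 to DNS Troubleshooting, step 2 to Service Connectivity, step 3 to Pod-to-Pod Connectivity, step 4 to NetworkPolicy Debugging.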
Quick Reference¶
- Deep Dive: Kubernetes Networking