- networking
- l1
- runbook
- dns
- networking-troubleshooting

Portal | Level: L1: Foundations | Topics: DNS, Kubernetes Networking, Networking Troubleshooting | Domain: Networking
Runbook: DNS Resolution Failure
| Field | Value |
|---|---|
| Domain | Networking |
| Alert | dns_lookup_failures_total > 0 for >2 min or service name not resolving |
| Severity | P1 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, ability to exec into pods, cluster-admin role |
Quick Assessment (30 seconds)
# Run this first — it tells you the scope of the problem
kubectl get pods -n kube-system -l k8s-app=kube-dns
If output shows: all pods Running → CoreDNS itself is healthy; skip to Step 4 (check ConfigMap or service selectors)
If output shows: pods in CrashLoopBackOff or Pending → this is a CoreDNS crash; continue from Step 2
Step 1: Test DNS from Within a Pod
Why: DNS failures outside the cluster mean nothing — Kubernetes DNS only applies inside the cluster network.
# Exec into any running pod in the affected namespace
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- sh
# Once inside the pod, test lookup of an internal service
nslookup kubernetes.default.svc.cluster.local
# Also test an external name
nslookup google.com
# If nslookup is not available, use dig or curl
dig kubernetes.default.svc.cluster.local @<KUBE_DNS_SERVICE_IP>
Expected output:

Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      kubernetes.default.svc.cluster.local
Address:   10.96.0.1
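If no suitable pod exists in the affected namespace, one option is a throwaway test pod. The pod name and image below are illustrative (note that busybox's nslookup is known to misbehave on some image versions, so a dedicated DNS-tools image may be preferable):

```shell
# Launch a temporary pod just for DNS testing; it is deleted automatically
# when the command exits (--rm). Any image that ships nslookup/dig will do.
kubectl run dns-test -n <NAMESPACE> --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
```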
Step 2: Check CoreDNS Pod Status
Why: If CoreDNS pods are not running, all DNS resolution inside the cluster fails.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# Check events for crash reasons
kubectl describe pods -n kube-system -l k8s-app=kube-dns
Expected output (healthy):

NAME                      READY   STATUS    RESTARTS   AGE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          2d
coredns-xxxxxxxxx-yyyyy   1/1     Running   0          2d

If pods are Pending, inspect the node they failed to schedule on:

kubectl describe node <NODE_NAME>
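When pods are stuck Pending, the usual culprits are node taints or resource pressure. A quick triage, sketched with standard kubectl flags:

```shell
# Show taints on every node; Pending CoreDNS pods often trace back to these
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Check for recent scheduling failures in kube-system
kubectl get events -n kube-system --field-selector reason=FailedScheduling
```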
Step 3: Check CoreDNS Logs for Errors
Why: Logs reveal whether CoreDNS is crashing due to a config error, upstream failure, or resource exhaustion.
# Get logs from the first CoreDNS pod
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
# If there are multiple pods, check a specific one
kubectl logs -n kube-system <COREDNS_POD_NAME> --previous
Healthy startup output looks like:

[INFO] plugin/reload: Running configuration...
.:53
[INFO] plugin/ready: https://localhost:8181/ready
[INFO] Reloading complete

Look for lines marked [ERROR] or FATAL. A config parse error means the ConfigMap is broken: go to Step 4. An upstream timeout means the forwarding target is unreachable.
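To cut through noisy logs, filtering for error-class lines can help. One extra signature worth knowing is the loop plugin, which deliberately crashes CoreDNS when it detects a forwarding loop:

```shell
# Surface only error-class lines across all CoreDNS pods
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=500 \
  | grep -E '\[ERROR\]|\[FATAL\]|plugin/loop|i/o timeout'
# A "plugin/loop ... detected" message means the node's /etc/resolv.conf
# points back at cluster DNS, creating a forwarding loop (a common
# symptom on nodes running systemd-resolved).
```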
Step 4: Check CoreDNS ConfigMap
Why: A misconfigured Corefile will cause CoreDNS to crash or silently drop queries.
kubectl get configmap coredns -n kube-system -o yaml

Expected output:

data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
If the kubernetes block is missing or the forward directive points to a bad IP, edit the ConfigMap:
kubectl edit configmap coredns -n kube-system
# After saving, CoreDNS picks up the change automatically via the reload plugin
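If the reload plugin does not pick up the change (or the reload line is missing from the Corefile), a rollout restart forces CoreDNS to re-read the ConfigMap:

```shell
# Restart CoreDNS pods so they load the edited ConfigMap
kubectl -n kube-system rollout restart deployment coredns

# Wait until the new pods are Ready before re-testing DNS
kubectl -n kube-system rollout status deployment coredns --timeout=120s
```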
Step 5: Check Service Selectors Match Pod Labels
Why: If a service's selector does not match any pod labels, the service has no endpoints and DNS resolves to a name with no addresses.
# Get the service selector for the service that is failing
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}'
# List pods and their labels in the same namespace
kubectl get pods -n <NAMESPACE> --show-labels
# Check that endpoints exist for the service
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
If ENDPOINTS shows <none>, the selector does not match any pods. Fix the service selector or the pod labels.
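To compare the two directly, the selector JSON printed by the jsonpath query above can be converted into the key=value form that `kubectl get pods -l` accepts. `selector_to_labels` is a hypothetical helper written for this runbook; it assumes a flat selector with no spaces in keys or values:

```shell
# Hypothetical helper: turn {"app":"web","tier":"frontend"} (the jsonpath
# output format) into app=web,tier=frontend (the -l label-selector format).
selector_to_labels() {
  printf '%s' "$1" | tr -d '{}"' | sed 's/:/=/g'
}

selector_to_labels '{"app":"web","tier":"frontend"}'
```

Against a live cluster, feeding the helper's output back into kubectl shows exactly which pods (if any) the service selects:

```shell
sel=$(kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}')
kubectl get pods -n <NAMESPACE> -l "$(selector_to_labels "$sel")"
```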
Step 6: Verify kube-dns Service Exists and Has a ClusterIP
Why: Pods resolve DNS by querying the kube-dns ClusterIP. If this service is missing or has the wrong IP, all DNS breaks.
kubectl get svc kube-dns -n kube-system
# Verify the IP matches what pods use for DNS
kubectl get pods -n <NAMESPACE> <POD_NAME> -o jsonpath='{.spec.dnsConfig}'
# Also check the resolv.conf inside a pod
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- cat /etc/resolv.conf
If the nameserver in /etc/resolv.conf does not match the kube-dns ClusterIP, something modified the kube-dns service. Recreate it from the cluster's original manifests or restore from backup.
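A quick way to do that comparison is to parse the pod's resolv.conf and check it against the kube-dns ClusterIP. `first_nameserver` is a small illustrative helper; the resolv.conf content below is a typical sample:

```shell
# Print the first nameserver entry from resolv.conf content on stdin
first_nameserver() {
  awk '/^nameserver/ {print $2; exit}'
}

# Sample pod resolv.conf; in a real check, pipe in:
#   kubectl exec <POD_NAME> -n <NAMESPACE> -- cat /etc/resolv.conf
printf 'search default.svc.cluster.local svc.cluster.local cluster.local\nnameserver 10.96.0.10\noptions ndots:5\n' \
  | first_nameserver
# Compare the printed IP with:
#   kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```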
Verification
# Confirm the issue is resolved — exec into an affected pod and test
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- nslookup <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local
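To verify more than one resolution path at once, a small loop over internal and external names (placeholders as in the rest of this runbook) can confirm the fix end to end:

```shell
# Test internal service DNS, cluster API DNS, and external DNS in one pass
for name in \
  kubernetes.default.svc.cluster.local \
  <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local \
  google.com
do
  if kubectl exec <POD_NAME> -n <NAMESPACE> -- nslookup "$name" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
done
```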
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Networking on-call | "Cluster-wide DNS failure, CoreDNS investigation ongoing, all service discovery down" |
| CoreDNS pods cannot be scheduled | Cluster Infrastructure team | "CoreDNS pods unschedulable, possible node or tainting issue, all DNS resolution down" |
| Scope expanding to multiple clusters | SRE lead | "Multi-cluster DNS failure, possible shared infrastructure issue (CNI/node networking)" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes

- Testing DNS from outside the cluster: `nslookup my-service` run on your laptop will fail because cluster DNS is internal-only. Always exec into a pod first.
- Checking the wrong namespace: a service in `namespace-a` is not visible as `my-service` from a pod in `namespace-b`; the FQDN must include the namespace: `my-service.namespace-a.svc.cluster.local`.
- Confusing the FQDN format: the correct format is `<service>.<namespace>.svc.cluster.local`. Missing the `svc` segment or using the wrong domain suffix causes lookup failures that look like CoreDNS bugs.
Tips and Gotchas

- The full internal DNS name is `<svc>.<namespace>.svc.cluster.local`. Short names work because of `ndots:5` and the search domains in `/etc/resolv.conf`. `ndots:5` means any name with fewer than 5 dots triggers search-domain expansion first; this can cause surprising lookup chains and latency.
- CoreDNS caches responses for 30s by default; after fixing a Service, DNS may still return stale data for up to 30 seconds.
- A NetworkPolicy with egress rules that omits port 53/UDP silently blocks DNS for affected pods.
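The ndots behavior above can be sketched as a toy resolver: under `ndots:5`, a name with fewer than five dots is tried against each search domain in order before being tried as an absolute name. `expand_search` is a hypothetical illustration written for this runbook, not part of any real resolver:

```shell
# Print the order in which a short name is looked up under ndots:5
expand_search() {
  name=$1; shift
  # Count the dots in the name
  dots=$(printf '%s' "$name" | awk -F. '{print NF-1}')
  if [ "$dots" -lt 5 ]; then
    # Fewer than 5 dots: search domains are tried first
    for domain in "$@"; do
      echo "$name.$domain"
    done
  fi
  echo "$name."   # finally tried as an absolute name
}

expand_search my-service default.svc.cluster.local svc.cluster.local cluster.local
```

Even `google.com` has only one dot, so external lookups from a pod are first expanded through every search domain, which is exactly the latency the tip above warns about.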
Cross-References

- Topic Pack: Kubernetes DNS and Service Discovery (deep background)
- Related Runbook: Network Policy Block
- Exercise: training/interactive/exercises/levels/level-23/k8s-dns/
- Incident Scenario: training/interactive/incidents/scenarios/dns-bad-service-name.sh
Wiki Navigation

Related Content
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2) — DNS, Kubernetes Networking
- Ops Archaeology: The 5% That Can't Resolve (Case Study, L2) — DNS, Kubernetes Networking
- API Gateways & Ingress (Topic Pack, L2) — Kubernetes Networking
- AWS Route 53 (Topic Pack, L2) — DNS
- Case Study: CNI Broken After Restart (Case Study, L2) — Kubernetes Networking
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Networking
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — DNS
- Case Study: DNS Resolution Slow (Case Study, L1) — DNS
- Case Study: DNS Split Horizon Confusion (Case Study, L2) — DNS
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Kubernetes Networking
Pages that link here
- DHCP & IP Address Management
- DNS Operations
- DNS Operations - Street-Level Ops
- DNS Split-Horizon Confusion
- Decision Tree: Latency Has Increased
- Kubernetes Ops Domain
- Level 3: Production Kubernetes
- Operational Runbooks
- Ops Archaeology: The 5% That Can't Resolve
- Runbook: Load Balancer Health Check Failure
- Runbook: MTU Mismatch
- Runbook: Network Partition (Split Brain / Partial Connectivity)
- Runbook: NetworkPolicy Blocking Traffic
- Symptoms