
Runbook: DNS Resolution Failure

Domain: Networking
Alert: dns_lookup_failures_total > 0 for >2 min, or service name not resolving
Severity: P1
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 20 minutes; page if not resolved
Last Tested: 2026-03-19
Prerequisites: kubectl access, ability to exec into pods, cluster-admin role

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get pods -n kube-system -l k8s-app=kube-dns
If all CoreDNS pods show Running → skip to Step 4 (check the ConfigMap or service selectors).
If pods are in CrashLoopBackOff or Pending → CoreDNS itself is failing; continue from Step 2.
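For scripting or dashboards, the same triage can be collapsed into one non-interactive count; a sketch, assuming the standard k8s-app=kube-dns label used by most distributions:

```shell
# Count CoreDNS pods whose STATUS column is not "Running" (0 = all healthy).
# CrashLoopBackOff and Pending both show up in the STATUS column of
# `kubectl get pods`, so they are caught here.
kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers \
  | grep -cv ' Running ' || true
```

A nonzero count means at least one CoreDNS pod is unhealthy and Step 2 applies.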

Step 1: Test DNS from Within a Pod

Why: Testing DNS from outside the cluster proves nothing: cluster DNS (CoreDNS) is only reachable from inside the cluster network, so every test must run from a pod.

# Exec into any running pod in the affected namespace
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- sh

# Once inside the pod, test lookup of an internal service
nslookup kubernetes.default.svc.cluster.local

# Also test an external name
nslookup google.com

# If nslookup is not available in the image, try dig (query kube-dns directly)
dig kubernetes.default.svc.cluster.local @<KUBE_DNS_SERVICE_IP>
Expected output:
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1
If this fails:
  • Both internal and external failing → CoreDNS is down or unreachable. Proceed to Step 2.
  • External works but internal fails → Service or namespace issue. Skip to Step 5.
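When the pod image has no interactive shell, or you want this in a script, kubectl exec can run the lookups directly in one shot; a sketch:

```shell
# One-shot internal and external lookups, no interactive shell needed
kubectl exec <POD_NAME> -n <NAMESPACE> -- nslookup kubernetes.default.svc.cluster.local
kubectl exec <POD_NAME> -n <NAMESPACE> -- nslookup google.com
```

The exit status of each command tells you which class of lookup is failing.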

Step 2: Check CoreDNS Pod Status

Why: If CoreDNS pods are not running, all DNS resolution inside the cluster fails.

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

# Check events for crash reasons
kubectl describe pods -n kube-system -l k8s-app=kube-dns
Expected output:
NAME                       READY   STATUS    RESTARTS   AGE
coredns-xxxxxxxxx-xxxxx    1/1     Running   0          2d
coredns-xxxxxxxxx-yyyyy    1/1     Running   0          2d
If this fails:
  • Pods in CrashLoopBackOff → proceed to Step 3 for logs.
  • Pods in Pending → check node resources with kubectl describe node <NODE_NAME>.

Step 3: Check CoreDNS Logs for Errors

Why: Logs reveal whether CoreDNS is crashing due to a config error, upstream failure, or resource exhaustion.

# Get logs from the first CoreDNS pod
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# For a specific pod, --previous shows logs from the last crashed container
kubectl logs -n kube-system <COREDNS_POD_NAME> --previous
Expected output:
[INFO] plugin/reload: Running configuration...
.:53
[INFO] plugin/ready: https://localhost:8181/ready
[INFO] Reloading complete
If this fails:
  • Look for lines containing [ERROR] or FATAL.
  • A config parse error means the ConfigMap is broken → go to Step 4.
  • An upstream timeout means the forwarding target (the forward directive) is unreachable.
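To cut straight to the failure lines, the log output can be filtered. The filter_dns_errors helper name below is just for illustration; in practice you would pipe the kubectl logs command above into the same grep:

```shell
# Keep only error-level lines; `|| true` keeps the exit status at 0
# even when no errors are found (grep exits 1 on no match).
filter_dns_errors() {
  grep -E '\[ERROR\]|FATAL' || true
}

# In practice:
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100 | filter_dns_errors
# Demonstrated here against sample log text:
printf '[INFO] Reloading complete\n[ERROR] plugin/errors: read udp: i/o timeout\n' | filter_dns_errors
# -> [ERROR] plugin/errors: read udp: i/o timeout
```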

Step 4: Check CoreDNS ConfigMap

Why: A misconfigured Corefile will cause CoreDNS to crash or silently drop queries.

kubectl get configmap coredns -n kube-system -o yaml
Expected output:
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
If this fails: When the kubernetes block is missing or the forward directive points to a bad IP, edit the ConfigMap:
kubectl edit configmap coredns -n kube-system
# After saving, CoreDNS picks up the change automatically via the reload plugin
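Before hand-editing DNS config on a live cluster, snapshot the current state so a bad change can be reverted; and if the reload plugin is absent (or the pods are crash-looping and never load the fix), a rollout restart forces the new config. A sketch:

```shell
# Back up the current Corefile before editing
kubectl get configmap coredns -n kube-system -o yaml > /tmp/coredns-backup.yaml

# Force all CoreDNS pods to pick up the config if reload does not fire
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns
```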

Step 5: Check Service Selectors Match Pod Labels

Why: If a service's selector does not match any pod labels, the service has no endpoints and DNS resolves to a name with no addresses.

# Get the service selector for the service that is failing
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}'

# List pods and their labels in the same namespace
kubectl get pods -n <NAMESPACE> --show-labels

# Check that endpoints exist for the service
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
Expected output:
NAME         ENDPOINTS                         AGE
my-service   10.244.1.5:8080,10.244.2.3:8080   5d
If this fails: ENDPOINTS showing <none> means the selector matches no pods. Fix the service selector or the pod labels so they match.
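The selector JSON printed by the jsonpath query can be fed back into kubectl to list exactly the pods the service would select. The selector_to_labels helper below is a hypothetical convenience for this runbook, not part of kubectl:

```shell
# Convert selector JSON like {"app":"web","tier":"front"} into the
# app=web,tier=front form accepted by kubectl's -l flag.
# (Naive: assumes simple label values with no braces/quotes/colons.)
selector_to_labels() {
  echo "$1" | tr -d '{}" ' | tr ':' '='
}

selector_to_labels '{"app":"web","tier":"front"}'
# -> app=web,tier=front

# In practice:
#   kubectl get pods -n <NAMESPACE> \
#     -l "$(selector_to_labels "$(kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}')")"
```

If that pod list is empty, the selector is the problem, not DNS.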

Step 6: Verify kube-dns Service Exists and Has a ClusterIP

Why: Pods resolve DNS by querying the kube-dns ClusterIP. If this service is missing or has the wrong IP, all DNS breaks.

kubectl get svc kube-dns -n kube-system

# Verify the IP matches what pods use for DNS (dnsConfig is usually empty
# unless the pod overrides cluster DNS; resolv.conf is the source of truth)
kubectl get pods -n <NAMESPACE> <POD_NAME> -o jsonpath='{.spec.dnsConfig}'
# Also check the resolv.conf inside a pod
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- cat /etc/resolv.conf
Expected output:
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP   30d
If this fails: If the ClusterIP does not match the nameserver in /etc/resolv.conf, something modified the kube-dns service. Recreate it from the cluster's original manifests or restore from backup.
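The ClusterIP vs. resolv.conf comparison can be automated; a sketch, assuming a POSIX shell and that the pod image has awk:

```shell
# Compare the kube-dns ClusterIP with the nameserver the pod actually uses
DNS_IP=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
POD_NS=$(kubectl exec <POD_NAME> -n <NAMESPACE> -- awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)

if [ "$DNS_IP" = "$POD_NS" ]; then
  echo "OK: pod resolves via kube-dns ($DNS_IP)"
else
  echo "MISMATCH: kube-dns=$DNS_IP pod nameserver=$POD_NS"
fi
```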

Verification

# Confirm the issue is resolved — exec into an affected pod and test
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- nslookup <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local
Success looks like: the lookup returns the correct ClusterIP for the service with no errors.
If still broken: escalate (see Escalation below).
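Because CoreDNS caches responses for up to 30 seconds, a single failed lookup right after a fix is not conclusive. A short retry loop (sketch) avoids a false negative:

```shell
# Retry for up to ~60s; DNS may serve stale answers for the 30s cache TTL
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  kubectl exec <POD_NAME> -n <NAMESPACE> -- \
    nslookup <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local && break
  sleep 5
done
```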

Escalation

  • Not resolved in 20 min → page Platform/Networking on-call: "Cluster-wide DNS failure, CoreDNS investigation ongoing, all service discovery down"
  • CoreDNS pods cannot be scheduled → page Cluster Infrastructure team: "CoreDNS pods unschedulable, possible node or tainting issue, all DNS resolution down"
  • Scope expanding to multiple clusters → page SRE lead: "Multi-cluster DNS failure, possible shared infrastructure issue (CNI/node networking)"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Testing DNS from outside the cluster: nslookup my-service run on your laptop will fail because cluster DNS is internal-only. Always exec into a pod first.
  2. Checking the wrong namespace: A service in namespace-a is not visible as my-service from a pod in namespace-b — the FQDN must include the namespace: my-service.namespace-a.svc.cluster.local.
  3. Confusing the FQDN format: The correct format is <service>.<namespace>.svc.cluster.local. Missing the svc segment or using the wrong domain suffix causes lookup failures that look like CoreDNS bugs.

Tips and Gotchas

  • The full internal DNS name is <svc>.<namespace>.svc.cluster.local. Short names work because of ndots:5 and the search domains in /etc/resolv.conf.
  • ndots:5 means any name with fewer than 5 dots triggers search-domain expansion first — this can cause surprising lookup chains and latency.
  • CoreDNS caches responses for 30s by default; after fixing a Service, DNS may still return stale data for up to 30 seconds.
  • A NetworkPolicy with egress rules that forgets port 53/UDP silently blocks DNS for affected pods.
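The NetworkPolicy pitfall in the last bullet can be avoided with an explicit DNS-egress rule; a minimal sketch (the policy name and namespace are placeholders, and this must coexist with whatever egress rules the namespace already has):

```shell
# Allow DNS egress (53/UDP and 53/TCP) for all pods in the namespace.
# An egress rule with only `ports` and no `to` permits those ports to
# any destination.
cat <<'EOF' | kubectl apply -n <NAMESPACE> -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
```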
