On-Call Survival: Networking

Print this. Pin it. Read it at 3 AM.


Alert: DNS Resolution Failure

Severity: P1 (service-wide) / P2 (single service)

First command:

kubectl exec -n <ns> <pod> -- nslookup <failing-hostname>
What you're looking for: NXDOMAIN (name does not exist), connection timed out (DNS server unreachable), or SERVFAIL.

Decision tree:

Is it NXDOMAIN (name not found)?
├── Yes  Wrong hostname in config? Check service name and namespace.
         kubectl get svc -n <ns> | grep <name>
         Internal: use <svc>.<ns>.svc.cluster.local
└── No  Is DNS server unreachable (timeout)?
    ├── Yes  kubectl get pods -n kube-system -l k8s-app=kube-dns
             CoreDNS pods down?  kubectl rollout restart deploy/coredns -n kube-system
    └── No  Is it SERVFAIL?
        ├── Yes  kubectl logs -n kube-system -l k8s-app=kube-dns | tail -30
                 Upstream DNS issue?  Check /etc/resolv.conf on node
        └── No  Escalate to platform team: "DNS failing for <hostname> from pod <pod>"
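The three branch tests above can be wrapped in one helper for scripting or paging notes. A minimal sketch, assuming BusyBox/bind-style nslookup wording (the function name is hypothetical and the match patterns may need adjusting for your base image):

```shell
# Classify a DNS failure from nslookup's raw output text.
# Patterns are assumptions: resolver wording varies by image.
classify_dns_failure() {
  case "$1" in
    *NXDOMAIN*)               echo "nxdomain" ;;  # wrong name: check svc/namespace
    *"connection timed out"*) echo "timeout" ;;   # DNS server unreachable: check CoreDNS
    *SERVFAIL*)               echo "servfail" ;;  # check CoreDNS logs / upstream
    *)                        echo "unknown" ;;   # escalate with the raw output
  esac
}

# Usage (feeding it the first command above):
# classify_dns_failure "$(kubectl exec -n "$ns" "$pod" -- nslookup "$host" 2>&1)"
```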

Escalation trigger: CoreDNS pods cannot be restarted; node-level DNS broken; external DNS unreachable across all nodes.

Safe actions: nslookup, dig, check CoreDNS logs and pod status.

Dangerous actions: Restart CoreDNS deployment (brief DNS blip for all pods in cluster).


Alert: TLS Certificate Error (503/SSL handshake failure)

Severity: P1 (user-facing HTTPS broken)

First command:

echo | openssl s_client -connect <hostname>:443 2>/dev/null | openssl x509 -noout -dates
What you're looking for: the notAfter= date (is the cert expired?). For the Verify return code (0 = ok vs an error), drop the second openssl pipe and read the full s_client output.
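For a scriptable check, `openssl x509 -checkend <seconds>` exits nonzero if the cert expires within that window. The helper below instead computes seconds remaining from the notAfter= line (a sketch only: `seconds_until_expiry` is a hypothetical name and the date parsing assumes GNU date):

```shell
# Seconds until the cert expires (negative = already expired).
# Input is the "notAfter=..." line from `openssl x509 -noout -dates`.
seconds_until_expiry() {
  expiry=$(date -d "${1#notAfter=}" +%s)   # GNU date; BSD date needs -j -f instead
  now=$(date +%s)
  echo $(( expiry - now ))
}

# Or let openssl decide: exit 1 if expiring within 7 days (604800 s).
# echo | openssl s_client -connect <host>:443 2>/dev/null \
#   | openssl x509 -noout -checkend 604800
```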

Decision tree:

Is the certificate expired?
├── Yes  Is cert-manager managing it?
         kubectl get cert,certificaterequest -n <ns>
         kubectl describe cert <name> -n <ns> | grep -A 10 Status
         Cert not renewing?  kubectl delete certificaterequest <name> -n <ns> (forces reissue)
└── No  Is the cert for the wrong hostname (CN mismatch)?
    ├── Yes  Wrong cert attached to ingress? Check ingress TLS spec.
             kubectl get ingress -n <ns> -o yaml | grep -A 5 tls
    └── No  Is the CA not trusted?
        ├── Yes  Self-signed CA? Add to trust store or use public CA.
        └── No  Escalate: "TLS handshake failing on <hostname>, openssl output: <paste>"

Escalation trigger: Cert expired and cert-manager cannot renew (ACME challenge failing, DNS not delegated); load balancer TLS termination broken.

Safe actions: Check cert dates, describe cert-manager resources, read ingress TLS config.

Dangerous actions: Delete CertificateRequest (triggers re-issue, brief gap), modify ingress TLS spec.


Alert: Ingress 404 / 502 / 503

Severity: P1 (all traffic affected) / P2 (partial)

First command:

kubectl get ingress -n <ns>
kubectl describe ingress <name> -n <ns>
What you're looking for: Backend service name and port, any misconfigured path rules, ingress class.

Decision tree:

Is it 404 (not found)?
├── Yes  Path mismatch in ingress rules? Compare URL to ingress path spec.
         Service exists? kubectl get svc <backend-svc> -n <ns>
└── No  Is it 502/503 (bad gateway / service unavailable)?
    ├── Yes  Are backend pods running?
             kubectl get pods -n <ns> -l <selector from svc>
             No pods ready?  Fix pods first (see Kubernetes guide)
    └── No (pods running but still 502)  Is it a timeout?
        ├── Yes  App too slow? kubectl top pod; check app metrics.
                 Readiness probe failing? kubectl describe pod | grep Readiness
        └── No  Check ingress controller logs:
                 kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | tail -50
                 Escalate if controller errors: "Ingress 502 on <host>, pods healthy, controller logs: <paste>"

Escalation trigger: Ingress controller pods down or crashing; all backends reporting unhealthy; load balancer health checks failing.

Safe actions: Describe ingress, check pod readiness, read ingress controller logs.

Dangerous actions: Edit ingress rules (may break other routes), restart ingress controller.


Alert: Load Balancer / Service Unreachable

Severity: P1

First command:

kubectl get svc -n <ns> <svc-name>
What you're looking for: EXTERNAL-IP — is it <pending> or assigned? Is the port correct?

Decision tree:

Is EXTERNAL-IP <pending>?
├── Yes  Cloud load balancer not provisioned.
         kubectl describe svc <svc-name> -n <ns> | grep -A 5 Events
         Cloud API quota hit? Check cloud console. Escalate to infra.
└── No  IP assigned but unreachable:
    ├── Network policy blocking?  kubectl get netpol -n <ns>
       kubectl describe netpol <name> -n <ns>
    └── Security group / firewall rule missing?
         Check cloud console for LB security group inbound rules.
         Escalate to infra: "LB IP <ip> assigned but port <port> unreachable"
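The <pending> check can be made machine-readable with jsonpath instead of scraping the table: `{.status.loadBalancer.ingress[0].ip}` is empty until the cloud LB is provisioned. A sketch (function names are hypothetical):

```shell
# Empty output = still provisioning (EXTERNAL-IP shows <pending>).
# Some clouds populate .hostname instead of .ip (e.g. AWS ELBs).
lb_ip() {
  kubectl get svc -n "$1" "$2" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Interpret the result without re-reading the table.
lb_state() {
  case "$1" in
    "") echo "pending" ;;   # check Events and cloud quota, escalate to infra
    *)  echo "assigned" ;;  # if unreachable, check netpol / LB security group
  esac
}
```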

Escalation trigger: Cloud provider API errors; network policy or firewall rule changes needed.

Safe actions: Get/describe service, check network policies, check endpoints.

Dangerous actions: Delete and recreate service (new IP, breaks DNS/firewall rules), edit network policies.


Quick Reference

Most Useful Commands

# DNS lookup from inside a pod
kubectl exec -n <ns> <pod> -- nslookup <hostname>
kubectl exec -n <ns> <pod> -- dig <hostname>

# Check certificate dates
echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 -noout -dates

# Check ingress rules
kubectl get ingress -A
kubectl describe ingress <name> -n <ns>

# Check service endpoints (are pods actually registered?)
kubectl get endpoints <svc> -n <ns>

# Check network policies
kubectl get netpol -n <ns>

# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Cert-manager certificate status
kubectl get cert -A
kubectl describe cert <name> -n <ns>

# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Trace a packet (from pod perspective)
kubectl exec -n <ns> <pod> -- curl -v http://<service>:<port>/health
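curl's exit code alone narrows down which layer failed, using curl's documented codes (6 = could not resolve host, 7 = could not connect, 28 = timeout, 35 = TLS handshake error). A sketch; the function name is hypothetical:

```shell
# Map a curl exit code to the layer that failed.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "dns" ;;      # could not resolve host: see DNS section
    7)  echo "connect" ;;  # connection refused / no route
    28) echo "timeout" ;;  # app too slow or traffic blackholed
    35) echo "tls" ;;      # handshake failure: see TLS section
    *)  echo "other" ;;    # see EXIT CODES in man curl
  esac
}

# Usage:
# kubectl exec -n "$ns" "$pod" -- curl -sS -m 5 "http://$svc:$port/health"
# classify_curl_exit $?
```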

Escalation Contacts

Situation                            Team                  Channel
CoreDNS down                         Platform              #infra-oncall
Cert-manager broken / ACME failure   Platform / Security   #infra-oncall
Cloud LB / firewall                  Infra                 #infra-oncall
Ingress controller crash             Platform              PagerDuty: platform-critical

Safe vs Dangerous Actions

Safe (do without asking)       Dangerous (get approval)
Describe ingress / service     Edit ingress routing rules
Read cert-manager status       Delete CertificateRequest
Check network policies         Edit network policies
Read ingress controller logs   Restart ingress controller
nslookup / dig from pods       Restart CoreDNS
Check endpoints                Delete and recreate LB service

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]