On-Call Survival: Networking

Print this. Pin it. Read it at 3 AM.


Alert: DNS Resolution Failure

Severity: P1 (service-wide) / P2 (single service)

First command:

kubectl exec -n <ns> <pod> -- nslookup <failing-hostname>
What you're looking for: NXDOMAIN (name does not exist), connection timed out (DNS server unreachable), or SERVFAIL.

Decision tree:

Is it NXDOMAIN (name not found)?
├── Yes  Wrong hostname in config? Check service name and namespace.
         kubectl get svc -n <ns> | grep <name>
         Internal: use <svc>.<ns>.svc.cluster.local
└── No  Is DNS server unreachable (timeout)?
    ├── Yes  kubectl get pods -n kube-system -l k8s-app=kube-dns
             CoreDNS pods down?  kubectl rollout restart deploy/coredns -n kube-system
    └── No  Is it SERVFAIL?
        ├── Yes  kubectl logs -n kube-system -l k8s-app=kube-dns | tail -30
                 Upstream DNS issue?  Check /etc/resolv.conf on node
        └── No  Escalate to platform team: "DNS failing for <hostname> from pod <pod>"
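The three branch tests above can be wrapped in one helper for scripting or paging notes. A minimal sketch, assuming BusyBox/bind-style nslookup wording (the function name is hypothetical and the match patterns may need adjusting for your base image):

```shell
# Classify a DNS failure from nslookup's raw output text.
# Patterns are assumptions: resolver wording varies by image.
classify_dns_failure() {
  case "$1" in
    *NXDOMAIN*)               echo "nxdomain" ;;  # wrong name: check svc/namespace
    *"connection timed out"*) echo "timeout" ;;   # DNS server unreachable: check CoreDNS
    *SERVFAIL*)               echo "servfail" ;;  # check CoreDNS logs / upstream
    *)                        echo "unknown" ;;   # escalate with the raw output
  esac
}

# Usage (feeding it the first command above):
# classify_dns_failure "$(kubectl exec -n "$ns" "$pod" -- nslookup "$host" 2>&1)"
```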

Escalation trigger: CoreDNS pods cannot be restarted; node-level DNS broken; external DNS unreachable across all nodes.

Safe actions: nslookup, dig, check CoreDNS logs and pod status.

Dangerous actions: Restart CoreDNS deployment (brief DNS blip for all pods in cluster).


Alert: TLS Certificate Error (503/SSL handshake failure)

Severity: P1 (user-facing HTTPS broken)

First command:

echo | openssl s_client -connect <hostname>:443 2>/dev/null | openssl x509 -noout -dates
What you're looking for: the notAfter= date (is the cert expired?). For the Verify return code (0 = ok vs an error), drop the second openssl pipe and read the full s_client output.
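For a scriptable check, `openssl x509 -checkend <seconds>` exits nonzero if the cert expires within that window. The helper below instead computes seconds remaining from the notAfter= line (a sketch only: `seconds_until_expiry` is a hypothetical name and the date parsing assumes GNU date):

```shell
# Seconds until the cert expires (negative = already expired).
# Input is the "notAfter=..." line from `openssl x509 -noout -dates`.
seconds_until_expiry() {
  expiry=$(date -d "${1#notAfter=}" +%s)   # GNU date; BSD date needs -j -f instead
  now=$(date +%s)
  echo $(( expiry - now ))
}

# Or let openssl decide: exit 1 if expiring within 7 days (604800 s).
# echo | openssl s_client -connect <host>:443 2>/dev/null \
#   | openssl x509 -noout -checkend 604800
```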

Decision tree:

Is the certificate expired?
├── Yes  Is cert-manager managing it?
         kubectl get cert,certificaterequest -n <ns>
         kubectl describe cert <name> -n <ns> | grep -A 10 Status
         Cert not renewing?  kubectl delete certificaterequest <name> -n <ns> (forces reissue)
└── No  Is the cert for the wrong hostname (CN mismatch)?
    ├── Yes  Wrong cert attached to ingress? Check ingress TLS spec.
             kubectl get ingress -n <ns> -o yaml | grep -A 5 tls
    └── No  Is the CA not trusted?
        ├── Yes  Self-signed CA? Add to trust store or use public CA.
        └── No  Escalate: "TLS handshake failing on <hostname>, openssl output: <paste>"

Escalation trigger: Cert expired and cert-manager cannot renew (ACME challenge failing, DNS not delegated); load balancer TLS termination broken.

Safe actions: Check cert dates, describe cert-manager resources, read ingress TLS config.

Dangerous actions: Delete CertificateRequest (triggers re-issue, brief gap), modify ingress TLS spec.


Alert: Ingress 404 / 502 / 503

Severity: P1 (all traffic affected) / P2 (partial)

First command:

kubectl get ingress -n <ns>
kubectl describe ingress <name> -n <ns>
What you're looking for: Backend service name and port, any misconfigured path rules, ingress class.

Decision tree:

Is it 404 (not found)?
├── Yes  Path mismatch in ingress rules? Compare URL to ingress path spec.
         Service exists? kubectl get svc <backend-svc> -n <ns>
└── No  Is it 502/503 (bad gateway / service unavailable)?
    ├── Yes  Are backend pods running?
             kubectl get pods -n <ns> -l <selector from svc>
             No pods ready?  Fix pods first (see Kubernetes guide)
    └── No (pods running but still 502)  Is it a timeout?
        ├── Yes  App too slow? kubectl top pod; check app metrics.
                 Readiness probe failing? kubectl describe pod | grep Readiness
        └── No  Check ingress controller logs:
                 kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | tail -50
                 Escalate if controller errors: "Ingress 502 on <host>, pods healthy, controller logs: <paste>"

Escalation trigger: Ingress controller pods down or crashing; all backends reporting unhealthy; load balancer health checks failing.

Safe actions: Describe ingress, check pod readiness, read ingress controller logs.

Dangerous actions: Edit ingress rules (may break other routes), restart ingress controller.


Alert: Load Balancer / Service Unreachable

Severity: P1

First command:

kubectl get svc -n <ns> <svc-name>
What you're looking for: EXTERNAL-IP — is it <pending> or assigned? Is the port correct?

Decision tree:

Is EXTERNAL-IP <pending>?
├── Yes  Cloud load balancer not provisioned.
         kubectl describe svc <svc-name> -n <ns> | grep -A 5 Events
         Cloud API quota hit? Check cloud console. Escalate to infra.
└── No  IP assigned but unreachable:
    ├── Network policy blocking?  kubectl get netpol -n <ns>
       kubectl describe netpol <name> -n <ns>
    └── Security group / firewall rule missing?
         Check cloud console for LB security group inbound rules.
         Escalate to infra: "LB IP <ip> assigned but port <port> unreachable"
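The <pending> check can be made machine-readable with jsonpath instead of scraping the table: `{.status.loadBalancer.ingress[0].ip}` is empty until the cloud LB is provisioned. A sketch (function names are hypothetical):

```shell
# Empty output = still provisioning (EXTERNAL-IP shows <pending>).
# Some clouds populate .hostname instead of .ip (e.g. AWS ELBs).
lb_ip() {
  kubectl get svc -n "$1" "$2" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Interpret the result without re-reading the table.
lb_state() {
  case "$1" in
    "") echo "pending" ;;   # check Events and cloud quota, escalate to infra
    *)  echo "assigned" ;;  # if unreachable, check netpol / LB security group
  esac
}
```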

Escalation trigger: Cloud provider API errors; network policy or firewall rule changes needed.

Safe actions: Get/describe service, check network policies, check endpoints.

Dangerous actions: Delete and recreate service (new IP, breaks DNS/firewall rules), edit network policies.


Quick Reference

Most Useful Commands

# DNS lookup from inside a pod
kubectl exec -n <ns> <pod> -- nslookup <hostname>
kubectl exec -n <ns> <pod> -- dig <hostname>

# Check certificate dates
echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 -noout -dates

# Check ingress rules
kubectl get ingress -A
kubectl describe ingress <name> -n <ns>

# Check service endpoints (are pods actually registered?)
kubectl get endpoints <svc> -n <ns>

# Check network policies
kubectl get netpol -n <ns>

# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Cert-manager certificate status
kubectl get cert -A
kubectl describe cert <name> -n <ns>

# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Trace a packet (from pod perspective)
kubectl exec -n <ns> <pod> -- curl -v http://<service>:<port>/health
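curl's exit code alone narrows down which layer failed, using curl's documented codes (6 = could not resolve host, 7 = could not connect, 28 = timeout, 35 = TLS handshake error). A sketch; the function name is hypothetical:

```shell
# Map a curl exit code to the layer that failed.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "dns" ;;      # could not resolve host: see DNS section
    7)  echo "connect" ;;  # connection refused / no route
    28) echo "timeout" ;;  # app too slow or traffic blackholed
    35) echo "tls" ;;      # handshake failure: see TLS section
    *)  echo "other" ;;    # see EXIT CODES in man curl
  esac
}

# Usage:
# kubectl exec -n "$ns" "$pod" -- curl -sS -m 5 "http://$svc:$port/health"
# classify_curl_exit $?
```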

Escalation Contacts

Situation                            Team                  Channel
CoreDNS down                         Platform              #infra-oncall
Cert-manager broken / ACME failure   Platform / Security   #infra-oncall
Cloud LB / firewall                  Infra                 #infra-oncall
Ingress controller crash             Platform              PagerDuty: platform-critical

Safe vs Dangerous Actions

Safe (do without asking)       Dangerous (get approval)
Describe ingress / service     Edit ingress routing rules
Read cert-manager status       Delete CertificateRequest
Check network policies         Edit network policies
Read ingress controller logs   Restart ingress controller
nslookup / dig from pods       Restart CoreDNS
Check endpoints                Delete and recreate LB service

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]