Portal | Level: L2: Operations | Topics: Kubernetes Debugging, Kubernetes Core | Domain: Kubernetes

Kubernetes Debugging Playbook - Primer

Why This Matters

Kubernetes does not tell you what is wrong. It tells you what it tried and where it stopped. Your job is to read the signals and trace back to root cause. Most K8s debugging follows predictable patterns once you know where to look.

Remember: The K8s debugging order: "SIRCN" — Scheduling, Image pull, Runtime, Config, Network. Each stage produces distinct error statuses. If the pod is Pending, it is stuck at Scheduling. If it is ImagePullBackOff, it is stuck at Image. If it is CrashLoopBackOff, the container starts but the Runtime/Config is wrong. If the pod is Running but not working, it is a Network or application issue.
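The status-to-stage mapping above can be sketched as a small lookup. This is a minimal illustration, not an exhaustive list of pod statuses; the function name stage_for_status is hypothetical.

```shell
# Hypothetical helper: map a value from the STATUS column of
# `kubectl get pods` to the SIRCN stage where the pod is stuck.
stage_for_status() {
  case "$1" in
    Pending)                       echo "Scheduling" ;;
    ErrImagePull|ImagePullBackOff) echo "Image pull" ;;
    CrashLoopBackOff)              echo "Runtime/Config" ;;
    Running)                       echo "Network or application" ;;
    *)                             echo "Unmapped status: $1" ;;
  esac
}

# Example: stage_for_status CrashLoopBackOff
```

Statuses not listed here (for example init container failures) fall through to the last branch and still need a look at the describe output.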

Core Concepts

1. The Failure Cascade

Kubernetes failures flow through layers:

Scheduling -> Image Pull -> Container Start
  -> Runtime -> Readiness/Liveness -> Networking

Each layer produces distinct signals. Start at the beginning and work forward.

2. Pod Not Starting: ImagePullBackOff

The kubelet cannot pull the container image.

Under the hood: Kubernetes uses exponential backoff when retrying image pulls: 10s, 20s, 40s, up to 5 minutes. The status cycles between ErrImagePull (active failure) and ImagePullBackOff (waiting before retry). If you fix the cause in place (push the missing tag, create the Secret the pod already references, restore registry access), the pod recovers automatically on the next retry and you do not need to delete it. A fix made in the Deployment spec (a corrected tag, added imagePullSecrets) rolls out replacement pods instead.

kubectl describe pod <name> -n <ns>
# Look at Events section for the pull error

Common causes: wrong image name or tag, private registry without imagePullSecrets, registry down or rate-limited, network policy blocking egress.
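To get a feel for how long a pod may sit in ImagePullBackOff before the next attempt, the retry schedule described above (start at 10s, double, cap at 5 minutes) can be printed with a few lines of shell. This is a sketch of the arithmetic only; the function name is hypothetical and the kubelet's actual timing may include extra delays.

```shell
# Print the first N retry delays: start at 10s, double each time, cap at 300s.
pull_backoff_schedule() {
  attempts="${1:-7}"
  delay=10
  i=0
  while [ "$i" -lt "$attempts" ]; do
    printf '%ss\n' "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
}

# Example: pull_backoff_schedule 7
```

After the sixth attempt the delay stays at the cap, so a fix applied late in the cycle can take up to 5 minutes to be picked up.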

3. Pod Not Starting: CrashLoopBackOff

The container starts, exits, and Kubernetes keeps restarting it with exponential backoff.

kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns>

Common causes and exit codes:

  • 137 = SIGKILL, most often OOMKilled (memory limit exceeded)
  • 1 = application error (read the logs)
  • 126 = command found but not executable (usually a permissions problem)
  • 127 = command not found

Remember: Exit code mnemonic: "137 = OOM, 143 = TERM, 1 = app, 127 = not found." 137 = 128 + signal 9 (SIGKILL, typically from the OOM killer). 143 = 128 + signal 15 (SIGTERM, sent on graceful shutdown). By convention, an exit code above 128 means the process was killed by a signal: subtract 128 to get the signal number.
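The subtract-128 rule is easy to script. The helper below is a minimal sketch with a hypothetical name, covering only the codes discussed in this section.

```shell
# Hypothetical helper: explain a container exit code.
# Codes above 128 are treated as 128 + signal number.
decode_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL, often the OOM killer)" ;;
      15) echo "signal 15 (SIGTERM)" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    case "$code" in
      0)   echo "clean exit" ;;
      126) echo "command found but not executable" ;;
      127) echo "command not found" ;;
      *)   echo "application error, read the logs" ;;
    esac
  fi
}

# The last exit code of a pod's first container can be read with:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```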

If the container exits too fast for logs:

kubectl run debug --image=<image> --command \
  -- sleep 3600
kubectl exec -it debug -- sh

4. Pod Not Starting: Pending

The scheduler cannot place the pod on any node.

kubectl describe pod <pod> -n <ns>
kubectl describe node <node>

Common causes: insufficient CPU/memory for requests, node selector or affinity matching no nodes, taints without tolerations, unbound PVC, resource quotas exceeded.
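The scheduler records its reason for rejecting a pod as a FailedScheduling event, so filtering events for that reason is often faster than reading the full describe output. The wrapper below is a sketch; the function name why_pending is hypothetical, and it assumes kubectl is already pointed at the right cluster.

```shell
# Hypothetical helper: show why the scheduler cannot place a pod.
# Usage: why_pending <pod> <namespace>
why_pending() {
  pod="$1"
  ns="$2"
  # Scheduler decisions are recorded as FailedScheduling events on the pod.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
  # Unbound PVCs and exhausted quotas also keep pods Pending.
  kubectl get pvc -n "$ns"
  kubectl describe resourcequota -n "$ns"
}
```

The event message usually names the failing predicate directly, such as insufficient memory or an untolerated taint, which tells you which of the causes above to chase.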

5. The kubectl Debug Workflow

The universal first three commands:

kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp

These answer most questions. Read the Events section of describe output first.

Deeper investigation:

kubectl logs <pod> -c <container> -n <ns>
kubectl logs <pod> -n <ns> --previous --all-containers
kubectl logs -f <pod> -n <ns>

6. Exec and Ephemeral Debug Containers

Fun fact: Ephemeral containers (used by the kubectl debug command) first appeared as an alpha feature in Kubernetes v1.16, became beta in v1.23, and reached GA in v1.25. Before this feature, debugging distroless containers required building a new image with shell tools baked in, defeating the purpose of minimal images. Ephemeral containers share the pod's network namespace and, when started with --target, the process namespace of the target container, but they have their own filesystem, so they can carry debugging tools without polluting the production image.

Get a shell in a running container:

kubectl exec -it <pod> -n <ns> -- /bin/sh

Inside, check: ps aux, env | sort, curl localhost:<port>/health, nslookup <service-name>.

For distroless images without a shell:

kubectl debug -it <pod> -n <ns> \
  --image=busybox --target=<container>

For node-level debugging:

kubectl debug node/<node> -it --image=ubuntu
# Host filesystem is at /host

7. Resource Limits and OOMKill

When a container exceeds its memory limit, the kernel OOM killer terminates it (exit code 137).

kubectl describe pod <pod> -n <ns>
# Look for: Reason: OOMKilled
kubectl top pod <pod> -n <ns>

Fix: increase memory limits, fix memory leaks, or check that requests match actual needs. Requests affect scheduling, limits affect runtime enforcement.
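To judge how close a container is to its limit, compare the figure from kubectl top with the configured limit. The sketch below does that arithmetic for whole-number Ki, Mi, and Gi quantities only; both function names are hypothetical, and other quantity formats (such as 1.5Gi or plain bytes) are deliberately rejected.

```shell
# Convert a whole-number Kubernetes memory quantity to MiB (rounded down).
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print memory usage as an integer percentage of the limit.
# Usage: mem_usage_percent <used> <limit>, e.g. mem_usage_percent 412Mi 512Mi
mem_usage_percent() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# The limit of a pod's first container can be read with:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.spec.containers[0].resources.limits.memory}'
```

A container that regularly sits near 100% is a candidate for either a higher limit or a leak investigation.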

8. Network Debugging

Gotcha: Kubernetes DNS failures are one of the sneakiest debugging targets. If CoreDNS pods are not running (or are crashing), every service-to-service call fails with DNS resolution errors. Always check kubectl get pods -n kube-system -l k8s-app=kube-dns early in your debugging flow. If CoreDNS is unhealthy, fix that first — everything else is a symptom.

DNS resolution:

nslookup <service>.<namespace>.svc.cluster.local
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
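The nslookup above has to run from inside the cluster. When the application image has no DNS tools, a throwaway pod does the job. This is a sketch: the function names are hypothetical, the pod name dns-check is arbitrary, and the cluster domain is assumed to be cluster.local as in the command above.

```shell
# Build the fully qualified in-cluster name for a Service.
svc_fqdn() {
  echo "$1.$2.svc.cluster.local"
}

# Resolve a Service name from a temporary busybox pod in the same namespace.
# Usage: dns_check <service> <namespace>
dns_check() {
  kubectl run dns-check --rm -i --restart=Never --image=busybox \
    -n "$2" -- nslookup "$(svc_fqdn "$1" "$2")"
}
```

If the lookup fails here as well as in the application pod, the problem is cluster DNS rather than the application.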

Service connectivity:

kubectl get endpoints <service> -n <ns>
# Empty = no Ready pods match the selector
kubectl get svc <svc> -n <ns> -o yaml
kubectl get pods -n <ns> --show-labels
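Comparing a selector against pod labels by eye is error-prone. The check can be sketched in plain shell: every key=value pair in the Service selector must appear among the pod's labels. The function name is hypothetical, and the selector is assumed to be typed by hand in the same comma-separated key=value form that --show-labels prints.

```shell
# Return 0 if every key=value pair in the selector appears in the labels.
# Both arguments are comma-separated lists, e.g. "app=web,tier=frontend".
selector_matches() {
  selector="$1"
  labels=",$2,"
  old_ifs="$IFS"
  IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                      # this pair is present, keep going
      *) IFS="$old_ifs"; return 1 ;;       # missing pair: selector does not match
    esac
  done
  IFS="$old_ifs"
  return 0
}

# Example:
#   selector_matches "app=web" "app=web,pod-template-hash=abc" && echo match
```

Extra labels on the pod are fine; only a missing or differing pair breaks the match.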

NetworkPolicy:

kubectl get networkpolicy -n <ns>

Once any policy selects a pod, traffic in the directions that policy covers (ingress, egress, or both) is denied unless a rule explicitly allows it.

9. Node-Level Debugging

kubectl describe node <node>
# Check: MemoryPressure, DiskPressure, PIDPressure
# Run this one on the node itself, not through kubectl
journalctl -u kubelet --since "10 minutes ago"
kubectl top node

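DiskPressure triggers pod eviction. NotReady means the control plane has stopped receiving a healthy status from the node's kubelet: the kubelet is down, cannot reach the API server, or reports an unhealthy container runtime.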
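Node conditions can also be pulled out with jsonpath and filtered so that only the problematic ones remain. The filter below is a sketch with a hypothetical name; it reads Type=Status lines on standard input and assumes that Ready should be True and every other condition should be False.

```shell
# Print only the node conditions that indicate trouble:
# Ready that is not True, or any other condition that is not False.
unhealthy_conditions() {
  while IFS='=' read -r type status; do
    case "$type" in
      "")    ;;                                              # skip blank lines
      Ready) [ "$status" = "True" ] || echo "$type=$status" ;;
      *)     [ "$status" = "False" ] || echo "$type=$status" ;;
    esac
  done
}

# Feed it from kubectl:
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

No output means every reported condition looks healthy.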

What Experienced People Know

  • describe Events is the single most useful output. Read it before anything else.
  • --previous shows logs from the last crash. Without it you see current (possibly empty) logs.
  • Exit code 137 = OOMKill. 143 = SIGTERM. 1 = app error.
  • Empty Endpoints = no Ready pods match the Service selector. Check labels and readiness probes carefully.
  • DNS problems usually trace to CoreDNS, not app code.
  • kubectl debug with ephemeral containers is the proper way to debug distroless images.
  • Requests affect scheduling. Limits affect runtime. Set both. Requests without limits let pods consume unbounded node resources.
  • kubectl get events --sort-by=.lastTimestamp shows the failure timeline.
  • Pod stuck in Terminating = finalizer blocking deletion, an unreachable node, or a container ignoring SIGTERM until the grace period expires.
  • Increase verbosity with kubectl get pods -v=6 to see the API calls being made.
