Portal | Level: L2: Operations | Topics: Kubernetes Debugging, Kubernetes Core | Domain: Kubernetes

Kubernetes Debugging Playbook - Primer¶

Why This Matters¶

Kubernetes does not tell you what is wrong. It tells you what it tried and where it stopped. Your job is to read the signals and trace back to root cause. Most K8s debugging follows predictable patterns once you know where to look.

Remember: The K8s debugging order: "SIRCN" — Scheduling, Image pull, Runtime, Config, Network. Each stage produces distinct error statuses. If the pod is Pending, it is stuck at Scheduling. If it is ImagePullBackOff, it is stuck at Image. If it is CrashLoopBackOff, the container starts but the Runtime/Config is wrong. If the pod is Running but not working, it is a Network or application issue.

Core Concepts¶

1. The Failure Cascade¶

Kubernetes failures flow through layers:

Scheduling -> Image Pull -> Container Start
  -> Runtime -> Readiness/Liveness -> Networking

Each layer produces distinct signals. Start at the beginning and work forward.

2. Pod Not Starting: ImagePullBackOff¶

Under the hood: Kubernetes uses exponential backoff when retrying image pulls: 10s, 20s, 40s, up to 5 minutes. The status cycles between ErrImagePull (active failure) and ImagePullBackOff (waiting before retry). If you fix the issue (add imagePullSecrets, fix the tag), the pod will recover automatically on the next retry — you do not need to delete it.

The kubelet cannot pull the container image.

kubectl describe pod <name> -n <ns>
# Look at Events section for the pull error

Common causes: wrong image name or tag, private registry without imagePullSecrets, registry down or rate-limited, network policy blocking egress.

3. Pod Not Starting: CrashLoopBackOff¶

The container starts, exits, and Kubernetes keeps restarting it with exponential backoff.

kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns>

Common causes and exit codes: - 137 = OOMKilled (memory limit exceeded) - 1 = application error (read the logs) - 126 = permission denied on binary - 127 = binary not found

Remember: Exit code mnemonic: "137 = OOM, 143 = TERM, 1 = app, 127 = not found." 137 = 128 + signal 9 (SIGKILL from OOM killer). 143 = 128 + signal 15 (SIGTERM from graceful shutdown). Any exit code above 128 means the process was killed by a signal — subtract 128 to get the signal number.

If the container exits too fast for logs:

kubectl run debug --image=<image> --command \
  -- sleep 3600
kubectl exec -it debug -- sh

4. Pod Not Starting: Pending¶

The scheduler cannot place the pod on any node.

kubectl describe pod <pod> -n <ns>
kubectl describe node <node>

Common causes: insufficient CPU/memory for requests, node selector or affinity matching no nodes, taints without tolerations, unbound PVC, resource quotas exceeded.

5. The kubectl Debug Workflow¶

The universal first three commands:

kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp

These answer 80% of questions. Read the Events section of describe output first.

Deeper investigation:

kubectl logs <pod> -c <container> -n <ns>
kubectl logs <pod> --previous --all-containers
kubectl logs -f <pod> -n <ns>

6. Exec and Ephemeral Debug Containers¶

Fun fact: Ephemeral debug containers (the kubectl debug command) were added in Kubernetes v1.23 (GA in v1.25). Before this feature, debugging distroless containers required building a new image with shell tools baked in — defeating the purpose of minimal images. Ephemeral containers share the PID and network namespace of the target pod but have their own filesystem, so they can carry debugging tools without polluting the production image.

Get a shell in a running container:

kubectl exec -it <pod> -n <ns> -- /bin/sh

Inside, check: ps aux, env | sort, curl localhost:<port>/health, nslookup <service-name>.

For distroless images without a shell:

kubectl debug -it <pod> -n <ns> \
  --image=busybox --target=<container>

For node-level debugging:

kubectl debug node/<node> -it --image=ubuntu
# Host filesystem is at /host

7. Resource Limits and OOMKill¶

When a container exceeds its memory limit, the kernel OOM killer terminates it (exit code 137).

kubectl describe pod <pod> -n <ns>
# Look for: Reason: OOMKilled
kubectl top pod <pod> -n <ns>

Fix: increase memory limits, fix memory leaks, or check that requests match actual needs. Requests affect scheduling, limits affect runtime enforcement.

8. Network Debugging¶

Gotcha: Kubernetes DNS failures are one of the sneakiest debugging targets. If CoreDNS pods are not running (or are crashing), every service-to-service call fails with DNS resolution errors. Always check kubectl get pods -n kube-system -l k8s-app=kube-dns early in your debugging flow. If CoreDNS is unhealthy, fix that first — everything else is a symptom.

DNS resolution:

nslookup <service>.<namespace>.svc.cluster.local
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Service connectivity:

kubectl get endpoints <service> -n <ns>
# Empty = no pods match the selector
kubectl get svc <svc> -n <ns> -o yaml
kubectl get pods -n <ns> --show-labels

NetworkPolicy:

kubectl get networkpolicy -n <ns>

If any policy selects a pod, all non-matching traffic is denied by default.

9. Node-Level Debugging¶

kubectl describe node <node>
# Check: MemoryPressure, DiskPressure, PIDPressure
journalctl -u kubelet --since "10 minutes ago"
kubectl top node

DiskPressure triggers pod eviction. NotReady means kubelet lost contact with the API server.

What Experienced People Know¶

describe Events is the single most useful output. Read it before anything else.
--previous shows logs from the last crash. Without it you see current (possibly empty) logs.
Exit code 137 = OOMKill. 143 = SIGTERM. 1 = app error.
Empty Endpoints = no pods match the Service selector. Check labels carefully.
DNS problems usually trace to CoreDNS, not app code.
kubectl debug with ephemeral containers is the proper way to debug distroless images.
Requests affect scheduling. Limits affect runtime. Set both. Requests without limits let pods consume unbounded node resources.
kubectl get events --sort-by=.lastTimestamp shows the failure timeline.
Pod stuck in Terminating = finalizer blocking or container ignoring SIGTERM.
Increase verbosity with kubectl get pods -v=6 to see the API calls being made.

Prerequisites¶

Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1)

Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core

Kubernetes Debugging Playbook - Primer¶

Why This Matters¶

Core Concepts¶

1. The Failure Cascade¶

2. Pod Not Starting: ImagePullBackOff¶

3. Pod Not Starting: CrashLoopBackOff¶

4. Pod Not Starting: Pending¶

5. The kubectl Debug Workflow¶

6. Exec and Ephemeral Debug Containers¶

7. Resource Limits and OOMKill¶

8. Network Debugging¶

9. Node-Level Debugging¶

What Experienced People Know¶

Wiki Navigation¶

Prerequisites¶

Pages that link here¶

Kubernetes Debugging Playbook - Primer¶

Why This Matters¶

Core Concepts¶

1. The Failure Cascade¶

2. Pod Not Starting: ImagePullBackOff¶

3. Pod Not Starting: CrashLoopBackOff¶

4. Pod Not Starting: Pending¶

5. The kubectl Debug Workflow¶

6. Exec and Ephemeral Debug Containers¶

7. Resource Limits and OOMKill¶

8. Network Debugging¶

9. Node-Level Debugging¶

What Experienced People Know¶

Wiki Navigation¶

Prerequisites¶

Related Content¶

Pages that link here¶