Kubernetes Debugging -- Street Ops¶
The Pod Failure Triage Flowchart¶
Pod not working?
|
+-- What phase is it in? (kubectl get pod)
|
+-- Pending
| +-- Check events: scheduling failures?
| +-- Insufficient CPU/memory? -> Check resource requests vs node capacity
| +-- No matching nodes? -> Check nodeSelector, affinity, taints/tolerations
| +-- PVC not bound? -> Check StorageClass, PV availability
|
+-- ImagePullBackOff / ErrImagePull
| +-- Wrong image name/tag? -> Check spelling, registry URL
| +-- Auth failure? -> Check imagePullSecrets, registry credentials
| +-- Image doesn't exist? -> Verify in registry directly
|
+-- CrashLoopBackOff
| +-- Check logs: kubectl logs <pod> --previous
| +-- App crashing on startup? -> Config error, missing env var, bad mount
| +-- OOMKilled? -> Check last state in describe output, raise memory limit
| +-- Liveness probe failing? -> Check probe config, app health endpoint
|
+-- Running but not Ready
| +-- Readiness probe failing? -> Check probe endpoint, port, timing
| +-- App slow to start? -> Adjust initialDelaySeconds or use startupProbe
|
+-- Running and Ready but not working
+-- Service selector mismatch? -> Compare labels
+-- Wrong port in service? -> Check targetPort vs containerPort
+-- NetworkPolicy blocking traffic? -> Check policies in namespace
+-- DNS issues? -> Test with nslookup from inside pod
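The last branch (Running and Ready but not working) usually comes down to wiring between a Service and its pods. A minimal sketch, with hypothetical names (`web`, port `8080`), showing the three fields that must line up: the Service `selector` must match the pod's labels, and the Service `targetPort` must match a `containerPort` on the pod:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web             # hypothetical name
spec:
  selector:
    app: web            # must match the pod's labels
  ports:
    - port: 80          # port clients hit on the Service
      targetPort: 8080  # must match a containerPort on the pod
---
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web            # matched by the Service selector above
spec:
  containers:
    - name: app
      image: example.com/web:v1   # placeholder image
      ports:
        - containerPort: 8080     # matched by targetPort above
```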
What to Check First -- Decision Tree¶
Step 1: Get the status.
The STATUS and RESTARTS columns tell you the category of problem.
Step 2: Read events.
Events are chronological. Read from the bottom (newest) up.
Step 3: Branch based on status.
Pending Pods¶
Pending means the scheduler cannot place the pod.
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check PVC status
kubectl get pvc -n <namespace>
Common causes:
- Resource requests exceed available node capacity
- All nodes tainted, pod has no matching toleration
- PVC references a nonexistent StorageClass
- nodeSelector matches no nodes
- Pod affinity/anti-affinity rules are unsatisfiable
Heuristic: If a pod has been Pending for more than 5 minutes, the scheduler has given up retrying quickly. Check events for the FailedScheduling reason.
Debug clue: When `describe pod` shows "0/3 nodes are available: 3 Insufficient cpu," the bottleneck is resource requests, not actual usage. Nodes might be at 20% real CPU but 100% allocated. Check the gap between allocated and actual with `kubectl top nodes` vs `kubectl describe node | grep -A5 "Allocated"`. If there is a big gap, your resource requests are over-provisioned.
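Two of the Pending causes above are fixed in the pod spec itself. A hedged sketch (the taint key `dedicated=batch` and the request sizes are made up for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker        # hypothetical
spec:
  tolerations:              # lets the pod land on nodes tainted dedicated=batch:NoSchedule
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: worker
      image: example.com/worker:v1   # placeholder
      resources:
        requests:           # what the scheduler reserves; must fit on some node
          cpu: 250m
          memory: 256Mi
```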
CrashLoopBackOff¶
The container starts and exits repeatedly. Backoff timer increases: 10s, 20s, 40s, up to 5 minutes.
# Current logs (if container is momentarily running)
kubectl logs <pod> -c <container>
# Previous crash logs (the money command)
kubectl logs <pod> -c <container> --previous
# Check exit code
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Exit code cheat sheet:
- 0: App exited cleanly but K8s expected it to keep running
- 1: Generic app error
- 2: Shell misuse (bad command in entrypoint)
- 126: Command not executable
- 127: Command not found (wrong entrypoint/cmd)
- 137: SIGKILL (OOMKilled or external kill)
- 139: SIGSEGV (segfault)
- 143: SIGTERM (graceful shutdown triggered)
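The signal-based codes follow the shell convention 128 + signal number, and several entries can be reproduced locally with no cluster at all — run a command, then read `$?`:

```shell
# Reproduce cheat-sheet exit codes locally: run a command, then read $?.
sh -c 'exit 0'          ; echo "clean exit:        $?"   # 0
sh -c 'exit 1'          ; echo "generic error:     $?"   # 1
sh -c 'no-such-command' ; echo "command not found: $?"   # 127
sh -c 'kill -TERM $$'   ; echo "SIGTERM:           $?"   # 143 (128+15)
sh -c 'kill -KILL $$'   ; echo "SIGKILL:           $?"   # 137 (128+9)
```

137 is the one to memorize: it is what a container reports after the kernel OOM killer (or any external SIGKILL) takes it down.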
ImagePullBackOff¶
# Check the exact image reference
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].image}'
# Check pull secrets
kubectl get pod <pod> -o jsonpath='{.spec.imagePullSecrets}'
# Verify secret exists and has correct data
kubectl get secret <secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Common causes:
- Typo in image name or tag
- Tag does not exist (someone pushed to latest, you reference v1.2.3)
- Private registry, no imagePullSecret configured
- Secret references wrong registry URL
- Image was deleted from registry
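For the private-registry case, the secret has to exist in the pod's namespace and be referenced by name. A sketch with hypothetical names (`regcred`, `registry.example.com`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app           # hypothetical
spec:
  imagePullSecrets:
    - name: regcred   # must exist as a kubernetes.io/dockerconfigjson secret in this namespace
  containers:
    - name: app
      image: registry.example.com/team/app:v1.2.3   # placeholder private image
```

The referenced secret is typically created with `kubectl create secret docker-registry regcred --docker-server=... --docker-username=... --docker-password=...`.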
OOMKilled¶
# Confirm OOM
kubectl describe pod <pod> | grep -A 3 "Last State"
# Look for: Reason: OOMKilled
# Check limits
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
Fix approach:
1. Check if app has a memory leak (monitor over time)
2. If legitimate usage, raise memory limit
3. Check if JVM/runtime has its own memory limit conflicting with container limit
4. For JVMs: -XX:MaxRAMPercentage=75.0 to respect cgroup limits
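Steps 2 and 4 land in the pod spec. A sketch with made-up sizes and a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: java-app      # hypothetical
spec:
  containers:
    - name: app
      image: example.com/java-app:v1   # placeholder
      resources:
        requests:
          memory: 512Mi   # what the scheduler reserves
        limits:
          memory: 1Gi     # exceeding this gets the container OOMKilled (exit 137)
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:MaxRAMPercentage=75.0"   # JVM heap tracks the cgroup limit, not host RAM
```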
Eviction¶
# Check for eviction events
kubectl get events --field-selector reason=Evicted -n <namespace>
# Check node pressure
kubectl describe node <node> | grep -A 5 Conditions
Eviction happens when node runs low on disk, memory, or PIDs. Fix the resource hog or add node capacity.
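On the node side, eviction thresholds live in the kubelet configuration. A hedged fragment (the values shown are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                  # kubelet evicts pods immediately past these
  memory.available: "200Mi"
  nodefs.available: "10%"
  pid.available: "5%"
evictionSoft:
  memory.available: "500Mi"    # soft threshold tolerated for a grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"
```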
Reading Events Effectively¶
Events decay after 1 hour by default. If you need history, use a monitoring stack.
# All events, newest first
kubectl get events -n <ns> --sort-by='.lastTimestamp'
# Filter by pod
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>
# Watch live
kubectl get events -n <ns> --watch
Key event reasons to watch for:
- FailedScheduling -- scheduling problem
- FailedMount / FailedAttachVolume -- storage issue
- Pulling / Pulled / Failed -- image pull lifecycle
- Unhealthy -- probe failures
- BackOff -- crash loop or image pull backoff
- Evicted -- resource pressure
- OOMKilling -- memory limit breach
Using Ephemeral Containers¶
When a container has no shell (distroless, scratch), you cannot exec into it. Ephemeral containers solve this.
# Attach a debug container to a running pod
kubectl debug -it <pod> --image=busybox --target=<container>
# Debug with a full toolkit
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>
# Copy the pod with a different image (for debugging crashed pods)
kubectl debug <pod> -it --copy-to=debug-pod --container=debug --image=busybox
# Debug a node
kubectl debug node/<node-name> -it --image=busybox
The --target flag shares the process namespace with the specified container, so you can see its processes and filesystem at /proc/1/root/.
Debugging Without Exec Access¶
When RBAC blocks exec or the container has no shell:
- Logs are your primary tool. Increase app log verbosity if you can (env var, configmap change).
- Ephemeral containers (GA since Kubernetes 1.25; behind a feature gate in earlier versions).
- Port-forward to test connectivity: `kubectl port-forward <pod> 8080:8080`, then `curl localhost:8080`.
- Temporary debug pod in the same namespace: `kubectl run tmp-debug -n <ns> --rm -it --image=busybox -- sh`
- Check configmaps and secrets the pod uses: `kubectl get configmap <name> -o yaml` and `kubectl get secret <name> -o yaml`
Node-Level Debugging with crictl¶
When kubectl is not giving you enough info, SSH to the node.
# List containers on this node
crictl ps -a
# Inspect a container
crictl inspect <container-id>
# Get container logs
crictl logs <container-id>
# List pods
crictl pods
# Check images
crictl images
# Check runtime status
crictl info
Use crictl when:
- Kubelet is unhealthy and kubectl cannot reach the node
- You need container-level details not exposed by the API
- Debugging runtime-level issues (containerd/CRI-O)
Common Pitfalls¶
- Not checking `--previous` logs. Current logs are empty because the container just restarted. The crash info is in the previous logs.
- Confusing resource requests and limits. Requests affect scheduling. Limits affect runtime killing. A pod can be scheduled (requests fit) but OOMKilled (exceeds limit at runtime).
- Forgetting the namespace. `kubectl get pods` shows the default namespace only. Always use `-n` or `-A`.
- Ignoring init containers. If an init container fails, the main container never starts. Check init container status and logs separately.
- Readiness vs liveness confusion. Readiness gates traffic. Liveness restarts the pod. A bad liveness probe causes restart loops. A bad readiness probe causes the pod to be excluded from service endpoints.
- Not checking service selectors. Pod is Running and Ready but the service returns 503. The service selector labels do not match the pod labels.
- DNS caching. If you changed a service, old DNS records may be cached in pods. The default TTL is 30s but some apps cache longer.
- Assuming the problem is in the pod. Sometimes the issue is the node (disk pressure, network, kubelet crash), the control plane (API server overloaded), or a network policy.
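The readiness-vs-liveness pitfall is easiest to see side by side in a container spec. A sketch with hypothetical endpoints and timings:

```yaml
containers:
  - name: app
    image: example.com/app:v1   # placeholder
    readinessProbe:             # failing => pod removed from Service endpoints (no restart)
      httpGet:
        path: /readyz
        port: 8080
      periodSeconds: 5
    livenessProbe:              # failing => kubelet restarts the container
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    startupProbe:               # holds off the other probes while the app boots
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30      # 30 * 10s = up to 5 minutes to start
      periodSeconds: 10
```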
One-liner: The two most frequently missed debugging steps: (1) checking `--previous` logs on a CrashLoopBackOff pod, and (2) checking whether a service's label selector actually matches the pod's labels. Together these account for roughly half of all "I can't figure out why it's broken" escalations.
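The selector rule itself is simple set logic: every key=value pair in the Service selector must appear in the pod's labels, and extra pod labels do no harm. A local sketch of that check, with made-up labels:

```shell
# A Service selects a pod iff its selector is a subset of the pod's labels.
selector="app=web,tier=frontend"
pod_labels="app=web,tier=frontend,version=v2"   # extra labels are fine

match=yes
for kv in $(echo "$selector" | tr ',' ' '); do
  case ",$pod_labels," in
    *",$kv,"*) ;;      # this selector pair is present on the pod
    *) match=no ;;     # missing pair: the Service will not route here
  esac
done
echo "match: $match"
```

Drop `tier=frontend` from `pod_labels` and this prints `match: no` — the 503-from-Service symptom in miniature.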
Decision Matrix: Which Tool When¶
| Symptom | First Tool | Second Tool |
|---|---|---|
| Pod not starting | `describe pod` | `get events` |
| Pod crashing | `logs --previous` | `describe pod` |
| Pod running, not working | `port-forward` + `curl` | `exec` or `debug` |
| Service unreachable | `get endpoints` | run debug pod |
| Node issues | `describe node` | `crictl` (SSH) |
| Slow pod | `exec` + top/strace | `kubectl top pod` |
| Intermittent failures | `get events --watch` | logs with timestamps |
The 5-Minute Incident Checklist¶
When paged about a K8s issue:
1. `kubectl get pods -n <ns> -o wide` -- scope the blast radius
2. `kubectl describe pod <broken-pod>` -- read the Events section
3. `kubectl logs <pod> --previous` -- if CrashLoopBackOff
4. `kubectl get events -n <ns> --sort-by='.lastTimestamp'` -- broader context
5. `kubectl top pods -n <ns>` -- resource usage (if metrics-server installed)
6. `kubectl get nodes` -- any nodes NotReady?
If none of that explains it, escalate to network/storage/control-plane investigation.