
Kubernetes Debugging: When Pods Won't Behave

  • lesson
  • kubernetes-pod-lifecycle
  • container-debugging
  • probes
  • scheduling
  • resource-management
  • dns
  • storage
  • rbac
  • network-policies

Topics: Kubernetes pod lifecycle, container debugging, probes, scheduling, resource management, DNS, storage, RBAC, network policies
Level: L1-L2 (Foundations to Operations)
Time: 75-90 minutes
Prerequisites: None (everything is explained inline)


The Mission

You just joined the on-call rotation for a Kubernetes cluster running five microservices. It's Monday morning. You open your terminal and run kubectl get pods -n production:

NAME                              READY   STATUS             RESTARTS      AGE
cart-service-7f8b9c4d5-x2k9m     1/1     Running            0             3d
inventory-api-5d6e7f8a9-q4r7p    0/1     Pending            0             47m
order-processor-8a9b0c1d2-m3n4   0/1     CrashLoopBackOff   7 (3m ago)    22m
payment-gateway-3e4f5a6b7-h8j9   0/1     ImagePullBackOff   0             15m
shipping-tracker-9c0d1e2f3-k5l6  1/1     Running            0             6h

Five pods. Three of them are in trouble. One is stuck and never scheduled. One keeps crashing. One can't even pull its image. And somewhere in this cluster, two more problems are hiding that kubectl get pods won't tell you about.

Your job: triage all five failures. Systematically.

By the end of this lesson, you'll have a mental framework for looking at any sick pod and knowing exactly what to check, in what order, and why.


Part 1: The Debugging Ladder

Before we touch a single pod, let's talk about sequence. Kubernetes debugging has a natural order, and fighting the order wastes time.

Step 1: kubectl get pods -o wide          → What's broken, where is it?
Step 2: kubectl describe pod <name>       → Why is it broken? (Events section)
Step 3: kubectl logs <pod> --previous     → What did the app say before it died?
Step 4: kubectl exec / kubectl debug      → Get inside and look around
Step 5: kubectl get events --sort-by=...  → What else happened in this namespace?
Step 6: kubectl describe node <name>      → Is the node sick?

Remember: The debugging ladder mnemonic is GDLEEN: Get, Describe, Logs, Exec, Events, Node. Start at the top. Most problems resolve by step 2 or 3. If you're at step 6, something unusual is happening.

The instinct is to jump straight to logs. Resist it. kubectl describe pod gives you the infrastructure story: scheduling decisions, image pulls, probe failures, mount errors. Logs give you the application story. You need the infrastructure story first because if the container never started, there are no logs.

Let's climb the ladder for each broken pod.


Part 2: The Pending Pod — inventory-api

kubectl describe pod inventory-api-5d6e7f8a9-q4r7p -n production

Go straight to the Events section at the bottom:

Warning  FailedScheduling  47m  default-scheduler  0/4 nodes are available:
         1 node(s) had untolerated taint {gpu=true: NoSchedule},
         2 node(s) had insufficient cpu, 1 node(s) had insufficient memory.

There it is. The scheduler tried every node and none worked. Let's decode that message:

| Count   | Reason                                | What it means |
| ------- | ------------------------------------- | ------------- |
| 1 node  | untolerated taint gpu=true:NoSchedule | This node is reserved for GPU workloads; your pod doesn't have a matching toleration |
| 2 nodes | insufficient cpu                      | Your pod's CPU request exceeds what's available on these nodes |
| 1 node  | insufficient memory                   | Your pod's memory request exceeds what's available |

Notice it says "insufficient cpu" — not "the node is out of CPU." These are requests, not actual usage. The nodes might be at 15% real CPU but 100% allocated.

# See the gap between requested and actual
kubectl top nodes
kubectl describe node worker-1 | grep -A 5 "Allocated resources"
Allocated resources:
  Resource           Requests     Limits
  --------           --------     ------
  cpu                3800m (95%)  8000m (200%)
  memory             6Gi (78%)    12Gi (156%)

95% of CPU requested but kubectl top shows the node at 22% actual CPU. This is the classic over-provisioning trap: teams set CPU requests too high, the scheduler thinks the cluster is full, but the actual workload is light.

Gotcha: Requests affect scheduling. Limits affect runtime enforcement. A pod can be Pending (requests don't fit) even when the cluster has plenty of actual capacity. The fix is either to right-size requests (use VPA recommendations) or add node capacity.

What about that taint?

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME        TAINTS
worker-1    <none>
worker-2    <none>
worker-3    [map[effect:NoSchedule key:gpu value:true]]
worker-4    <none>

The GPU node is tainted. Your pod doesn't have a toleration for it, so the scheduler skips it. If that node should run general workloads too, either remove the taint or add a toleration to the pod spec.
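If the taint should stay and this pod genuinely belongs on the GPU node, the pod spec needs a matching toleration. A sketch, using the taint shown above (adjust key/value to your cluster's actual taints):

```yaml
# Pod-spec fragment: tolerate worker-3's gpu=true:NoSchedule taint.
spec:
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
```

Note that a toleration only permits scheduling onto the tainted node; to actually steer the pod there you'd also add a nodeSelector or node affinity.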

Fix: reduce CPU request or add capacity

# Before: over-provisioned          # After: right-sized
resources:                           resources:
  requests:                            requests:
    cpu: 2000m  # 2 full cores           cpu: 500m   # matches actual usage
    memory: 4Gi                          memory: 1Gi
                                       limits:
                                         cpu: 2000m  # can burst
                                         memory: 4Gi

Under the Hood: The scheduler runs two phases: filtering (eliminate nodes that can't fit the pod) and scoring (rank the remaining ones). Taints, affinity, and resource requests are all filters. If every node gets filtered out, the pod stays Pending forever. The scheduler retries but never relaxes constraints.

Flashcard check

| Question | Answer |
| -------- | ------ |
| A pod is Pending. What do you check first? | kubectl describe pod — look for FailedScheduling in Events. |
| What causes "0/N nodes are available: N insufficient cpu"? | The pod's CPU request exceeds the unallocated CPU (allocatable minus already-requested) on every node. |
| A tainted node shows in the FailedScheduling message. What's missing? | The pod spec needs a matching toleration. |

Part 3: The CrashLoopBackOff — order-processor

This is the one everyone dreads. The container starts, dies, and Kubernetes retries — slower each time. Seven restarts in 22 minutes. Let's understand the timing.

The exponential backoff

When a container crashes, the kubelet waits before restarting it:

Crash 1 → wait 10s  → restart
Crash 2 → wait 20s  → restart
Crash 3 → wait 40s  → restart
Crash 4 → wait 80s  → restart
Crash 5 → wait 160s → restart
Crash 6 → wait 300s → restart (capped at 5 minutes)
Crash 7 → wait 300s → restart (stays at 5 minutes)
...

The backoff doubles each time: 10s, 20s, 40s, 80s, 160s, then caps at 300s (5 minutes). After 10 minutes of running successfully, the counter resets. This is why you see (3m ago) in the RESTARTS column — the pod is currently in the 5-minute waiting period before its next attempt.
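The schedule is easy to reproduce locally — it's just doubling from 10s with a 300s cap (a sketch of the arithmetic, not kubelet code):

```shell
# Print the restart-backoff schedule: double from 10s, cap at 300s.
d=10
for crash in 1 2 3 4 5 6 7; do
  echo "crash $crash: wait ${d}s"
  d=$((d * 2))
  [ "$d" -gt 300 ] && d=300
done
# crash 6 and every crash after it wait the full 300s
```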

Trivia: The exponential backoff with a 5-minute cap has been in Kubernetes since the very early versions. The choice of 5 minutes (not 10, not 2) is a pragmatic balance: long enough to avoid hammering a failing dependency, short enough that a transient issue resolves within a reasonable time once the fix is in place. The backoff timer lives in the kubelet's in-memory state — restarting the kubelet resets it.

Now let's find out why it's crashing. Step 3 on the ladder — logs:

kubectl logs order-processor-8a9b0c1d2-m3n4 -n production --previous
2026-03-23T08:14:22Z INFO  Starting order-processor v3.2.1
2026-03-23T08:14:22Z INFO  Connecting to RabbitMQ at amqp://rabbitmq.production.svc:5672
2026-03-23T08:14:23Z FATAL Cannot connect to RabbitMQ: connection refused
2026-03-23T08:14:23Z FATAL Exiting with code 1

The application can't reach RabbitMQ and exits. But is RabbitMQ actually down?

kubectl get endpoints rabbitmq -n production   # → 10.244.2.15:5672 — RabbitMQ is healthy
kubectl run debug-net --rm -it --image=busybox -n production -- nc -zv 10.244.2.15 5672
# → open — other pods can connect fine

Something specific to order-processor is blocked. Time to check NetworkPolicy:

kubectl get networkpolicy -n production
# → restrict-egress   app=order-processor   2h

kubectl describe networkpolicy restrict-egress -n production

The policy allows egress only on port 443/TCP. RabbitMQ is on port 5672. Someone created this policy two hours ago — right when the crashes started. The pod can't reach its broker.

Mental Model: NetworkPolicy works like a firewall — but with an important twist. If any NetworkPolicy selects a pod, all traffic not explicitly allowed is denied by default. This is the "default deny" behavior. A pod with zero NetworkPolicies is wide open. A pod with one NetworkPolicy that only allows port 443 is blocked on every other port.
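A fix consistent with that mental model is to extend the egress rules so the selected pods can reach the broker (and DNS) as well as HTTPS. A sketch — the policy name and label come from this lesson's output, and real policies often also scope destinations with `to` selectors:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: order-processor
  policyTypes: ["Egress"]
  egress:
    - ports:
        - { port: 443, protocol: TCP }    # existing HTTPS rule
        - { port: 5672, protocol: TCP }   # AMQP to RabbitMQ
        - { port: 53, protocol: UDP }     # DNS
        - { port: 53, protocol: TCP }     # DNS over TCP (large responses)
```

With no `to` clause, each rule allows those ports to any destination; tighten with namespace/pod selectors once the immediate outage is resolved.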

Exit codes: your first clue

Before you even look at logs, the exit code narrows the search:

| Exit code | Signal       | Meaning |
| --------- | ------------ | ------- |
| 0         | —            | Clean exit. If restartPolicy is Always, this still CrashLoops. |
| 1         | —            | Generic application error. Read the logs. |
| 2         | —            | Shell misuse (bad command in entrypoint). |
| 126       | —            | Command exists but is not executable (permission issue). |
| 127       | —            | Command not found (wrong entrypoint/CMD in Dockerfile). |
| 137       | SIGKILL (9)  | OOMKilled or external kill. Check describe for reason. |
| 139       | SIGSEGV (11) | Segfault. Native code bug. |
| 143       | SIGTERM (15) | Graceful shutdown. Usually normal during rollouts. |

Remember: Any exit code above 128 = killed by a signal. Subtract 128 to get the signal number. 137 = 128 + 9 (SIGKILL). 143 = 128 + 15 (SIGTERM). This convention comes from the original Unix shell, not from Kubernetes.
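You can verify the 128+signal convention locally, no cluster required:

```shell
# Kill a process with SIGKILL and observe the exit status.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"    # → exit status: 137 (128 + 9, SIGKILL)
```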

# Get the exit code without parsing describe output
kubectl get pod order-processor-8a9b0c1d2-m3n4 -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

Flashcard check

| Question | Answer |
| -------- | ------ |
| What are the CrashLoopBackOff backoff intervals? | 10s, 20s, 40s, 80s, 160s, 300s (capped at 5 minutes). Resets after 10 minutes of running. |
| Exit code 137 means what? | SIGKILL (128 + 9). Usually OOMKilled. |
| A pod CrashLoops with exit code 0. Why? | restartPolicy: Always (the default) restarts even on clean exit. Use OnFailure or Never for batch work. |

Part 4: The ImagePullBackOff — payment-gateway

kubectl describe pod payment-gateway-3e4f5a6b7-h8j9 -n production

Events show:

Warning  Failed  15m  kubelet  Failed to pull image "registry.internal.corp/payments/gateway:v4.1.0":
         rpc error: code = Unknown desc = failed to pull and unpack image: 401 Unauthorized

The key phrase: 401 Unauthorized. The image exists, but the node can't authenticate.

# What imagePullSecrets does the pod reference?
kubectl get pod payment-gateway-3e4f5a6b7-h8j9 -n production \
  -o jsonpath='{.spec.imagePullSecrets}'
# → [{"name":"registry-creds"}]

# Does that secret exist?
kubectl get secret registry-creds -n production
# → Error from server (NotFound): secrets "registry-creds" not found

The secret is referenced but doesn't exist in this namespace. Someone created it in default and forgot to copy it to production.

Gotcha: ImagePullSecrets are namespace-scoped. A secret in default is invisible to pods in production — this bites teams hard after namespace migrations. Worse, a secret that does exist but holds stale or wrong credentials produces the same 401, so a bad credential can masquerade as a missing secret. Check both that the secret exists in the pod's namespace and that its contents are current.

ImagePullBackOff also uses exponential backoff (10s to 5 minutes). The status alternates between ErrImagePull (active failure) and ImagePullBackOff (waiting). Once you fix the issue, recovery is automatic on the next retry.

Under the Hood: When no imagePullSecrets are specified, the kubelet uses the node's container runtime credentials (~/.docker/config.json). This silently breaks when the autoscaler provisions new nodes — fresh nodes have no cached credentials. Always use explicit imagePullSecrets or a mutating webhook to inject them.
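The fix here is to create the secret in the right namespace — imperatively with kubectl create secret docker-registry registry-creds -n production and the registry credentials, or declaratively. A sketch of the manifest, where the base64 payload is a placeholder for your real dockerconfigjson:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: registry-creds        # the name the pod spec already references
  namespace: production       # must match the pod's namespace
type: kubernetes.io/dockerconfigjson
data:
  # base64 of {"auths":{"registry.internal.corp":{"auth":"<base64 user:token>"}}}
  .dockerconfigjson: PLACEHOLDER_BASE64
```

Once the secret exists with valid credentials, the kubelet picks it up automatically on the next pull retry.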


Part 5: The Silent Killer — OOMKilled

Cart-service looks healthy — Running, zero restarts. But users are reporting intermittent 502 errors. Let's look closer:

kubectl describe pod cart-service-7f8b9c4d5-x2k9m -n production
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 23 Mar 2026 02:14:00 +0000
      Finished:     Mon, 23 Mar 2026 08:01:12 +0000
    Ready:          True
    Restart Count:  1

Restart Count 1 — a single quiet restart. The kernel OOM-killed the container at 08:01 and the kubelet restarted it in place; the pod stayed Running the whole time, so a glance at kubectl get pods reveals nothing. A lone restart is easy to overlook on a dashboard. Last State is where the evidence lives.

# What are the resource limits?
kubectl get pod cart-service-7f8b9c4d5-x2k9m -n production \
  -o jsonpath='{.spec.containers[0].resources}'
{"limits":{"memory":"256Mi"},"requests":{"memory":"128Mi"}}

256Mi memory limit. Let's check actual usage:

kubectl top pod cart-service-7f8b9c4d5-x2k9m -n production
NAME                              CPU(cores)   MEMORY(bytes)
cart-service-7f8b9c4d5-x2k9m     45m          231Mi

231Mi out of a 256Mi limit. That's 90% utilization. Any traffic spike or garbage collection pause will push it over.

Under the Hood: When a container exceeds its memory limit, it's not Kubernetes that kills it — it's the Linux kernel's OOM killer. The container's memory limit is enforced by a cgroup. When the cgroup's memory charge hits the ceiling and the kernel cannot reclaim enough page cache to get back under it, the kernel sends SIGKILL (signal 9). That's why the exit code is 137 (128 + 9). There's no warning, no SIGTERM, no graceful shutdown. The process is dead.

Gotcha: kubectl top pod shows MEMORY(bytes), which corresponds to container_memory_working_set_bytes in Prometheus — usage minus reclaimable page cache, the metric the kubelet uses for eviction decisions and the best proxy for OOM risk. Do not use container_memory_usage_bytes for capacity planning — it includes reclaimable page cache and overstates true pressure.

The JVM trap

A JVM with -Xmx512m in a container limited to 256Mi tries to allocate more heap than the cgroup allows and gets killed before it finishes starting. Fix: java -XX:MaxRAMPercentage=75.0 -jar app.jar — this sizes the heap relative to the container's cgroup limit, leaving 25% for non-heap memory (class metadata, thread stacks, NIO buffers).


Part 6: The Probe Problem — shipping-tracker

Shipping-tracker is Running, zero restarts. But the monitoring dashboard shows it dropped off for 45 seconds this morning and came back. What happened?

kubectl get events -n production --field-selector involvedObject.name=shipping-tracker-9c0d1e2f3-k5l6

Events show Liveness probe failed: HTTP probe failed with statuscode: 503, followed by Container shipping-tracker failed liveness probe, will be restarted. The probe config:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
# no startupProbe configured
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

The three probes and when each goes wrong

| Probe     | Question it answers | Failure consequence | Common misconfiguration |
| --------- | ------------------- | ------------------- | ----------------------- |
| Startup   | "Has the app finished initializing?" | Blocks liveness/readiness until it succeeds | Not using one for slow-starting apps |
| Liveness  | "Is the process alive and functional?" | Container is killed and restarted | Timeout too short; checking dependencies |
| Readiness | "Can this instance serve traffic right now?" | Removed from Service endpoints (no traffic) | Not distinguishing it from liveness |

War Story: A team deployed a Java service that loaded ML models at startup (90 seconds). Liveness probe: initialDelaySeconds: 10, failureThreshold: 3, periodSeconds: 10. First check at 10s, fails. Second at 20s, fails. Third at 30s — killed. New container starts, takes 90s, killed at 30s. Infinite loop. Fix: a startup probe with failureThreshold: 30 and periodSeconds: 10 (300 seconds to start). Three lines of YAML.

Back to shipping-tracker. The liveness probe checks /healthz with a 1-second timeout. A GC pause or heavy query during the check window times out the probe. Three timeouts (failureThreshold: 3 at periodSeconds: 10) = 30 seconds, and the kubelet kills it.
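That 30-second window is plain probe arithmetic you can sanity-check for any config:

```shell
# Worst-case window from "app stops answering" to "kubelet kills the container":
periodSeconds=10
failureThreshold=3
echo "killed after ~$((periodSeconds * failureThreshold))s of consecutive probe failures"
```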

# Better: startup probe for initialization, generous liveness timeout
startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30          # Up to 300 seconds to start
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 5             # 5 seconds, not 1
  failureThreshold: 6           # 60 seconds before killing
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Gotcha: Never make a liveness probe check external dependencies (database, message queue, downstream API). If the database is down, your liveness probe fails, Kubernetes restarts your pod, the new pod also can't reach the database, it gets restarted too — now you have a fleet of CrashLooping pods and the database recovers to find zero healthy backends. Liveness should check "is this process healthy?" — not "is the world healthy?" Put dependency checks in readiness probes instead.

Flashcard check

| Question | Answer |
| -------- | ------ |
| Liveness probe fails. What happens? | Container is killed and restarted. |
| Readiness probe fails. What happens? | Pod is removed from Service endpoints — no traffic routed to it, but it keeps running. |
| What is a startup probe for? | Slow-starting apps. It blocks liveness and readiness probes until the app is initialized. |
| Should a liveness probe check database connectivity? | No. Liveness should only check the process itself. Put dependency checks in readiness probes. |

Part 7: The Hidden Problems

We've fixed three obvious failures. Now let's find the two hidden ones.

Init container failure

There's a pod we missed:

notification-svc-4b5c6d7e8-f9a0  0/1     Init:Error    0   35m

The STATUS says Init:Error. The main container never started because an init container failed. Init containers run sequentially — if any fails, the pod never reaches Running.

kubectl describe pod notification-svc-4b5c6d7e8-f9a0 -n production
Init Containers:
  wait-for-db:    State: Terminated  Reason: Completed  Exit Code: 0
  run-migrations: State: Terminated  Reason: Error      Exit Code: 1

wait-for-db passed, but run-migrations failed.

# Get the migration logs
kubectl logs notification-svc-4b5c6d7e8-f9a0 -n production -c run-migrations
Running database migrations...
ERROR: permission denied for table notifications
DETAIL: User "app_user" does not have INSERT privilege on "schema_migrations"

An RBAC issue — not Kubernetes RBAC, but database-level permissions. The migration user needs INSERT privileges on the schema_migrations table.

Gotcha: Init container failures are invisible if you only watch the main containers. kubectl get pods shows Init:Error or Init:CrashLoopBackOff in the STATUS column, but it's easy to miss. Always check init containers when a pod won't start and the main container has no logs.

PVC mount failure

One more hiding in the events:

kubectl get events -n production --sort-by='.lastTimestamp' | grep -i "fail"
5m  Warning  FailedAttachVolume  pod/analytics-worker-1a2b3c4d5-e6f7  Multi-Attach error
    for volume "pvc-a1b2c3d4" Volume is already exclusively attached to one node

Multi-Attach error: the PV is ReadWriteOnce (RWO), mountable on only one node. During a rolling update, the new pod landed on a different node and both pods tried to claim the volume.

Fixes: use strategy: Recreate instead of RollingUpdate for RWO workloads, switch to ReadWriteMany (RWX) if storage supports it, or use a StatefulSet.
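The first of those fixes, sketched as a Deployment fragment (assumes the workload tolerates brief downtime during updates):

```yaml
# Deployment fragment: tear down the old pod before creating the new one,
# so the RWO volume detaches before the replacement tries to attach it.
spec:
  replicas: 1          # RWO plus multiple replicas would still conflict
  strategy:
    type: Recreate     # instead of the default RollingUpdate
```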


Part 8: DNS — The Problem That Doesn't Look Like DNS

Cart-service is Running and Ready, but logging DNS failures:

2026-03-23T09:22:01Z ERROR Failed to reach api.stripe.com: dial tcp: lookup api.stripe.com: i/o timeout

The debugging ladder for DNS problems:

# 1. Confirm DNS is broken from inside the pod
kubectl exec cart-service-7f8b9c4d5-x2k9m -n production -- nslookup api.stripe.com
# → "connection timed out; no servers could be reached"

# 2. Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# → Running, healthy

# 3. Does internal DNS work?
kubectl exec cart-service-7f8b9c4d5-x2k9m -n production -- nslookup rabbitmq.production.svc.cluster.local
# → Resolves to 10.96.45.12

Internal DNS works, external doesn't. This narrows it to CoreDNS upstream config or a NetworkPolicy blocking DNS egress. Check kubectl get configmap coredns -n kube-system -o yaml for the forward directive, and check NetworkPolicies — the egress policy from Part 3 that allows only port 443 would also block DNS (UDP 53) if applied broadly.

Remember: DNS in Kubernetes uses UDP port 53 (and TCP 53 for large responses). Any egress NetworkPolicy that doesn't explicitly allow port 53 to the kube-dns service will break all DNS resolution for affected pods. This is the most common NetworkPolicy footgun.

Trivia: The default ndots setting in a Kubernetes pod's resolv.conf is 5: any hostname with fewer than 5 dots gets the search domains appended first. A lookup for api.stripe.com (2 dots) actually tries api.stripe.com.production.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, and only then api.stripe.com. — four name lookups (up to eight queries, counting A and AAAA) for one hostname. For external-heavy workloads, setting dnsConfig.options: [{name: ndots, value: "2"}] in the pod spec dramatically reduces DNS query volume.
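That ndots override, as a pod-spec fragment:

```yaml
# Pod-spec fragment: names with 2+ dots (like api.stripe.com) skip
# search-domain expansion and are tried as absolute names first.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```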


Part 9: Ephemeral Debug Containers

When a container is built from a distroless or scratch image, there's no shell to exec into. No sh, no bash, no curl, no nslookup. This is where ephemeral debug containers shine.

# Attach a debug container to a running pod
kubectl debug -it cart-service-7f8b9c4d5-x2k9m -n production \
  --image=nicolaka/netshoot --target=cart-service

The --target flag shares the process namespace with the specified container. You can see its processes, its environment, and its filesystem at /proc/1/root/.

# Inside the debug container:
ps aux                           # See all processes in the target container
cat /proc/1/root/etc/resolv.conf # Read the target's DNS config
nslookup api.stripe.com          # Test DNS from the pod's network namespace
curl -v http://localhost:8080/healthz  # Hit the app's health endpoint

For crashing pods (too fast to exec into), copy the pod with the entrypoint replaced:

kubectl debug order-processor-8a9b0c1d2-m3n4 -n production \
  -it --copy-to=debug-order --container=order-processor \
  --image=order-processor:v3.2.1 -- /bin/sh

For node-level debugging without SSH:

kubectl debug node/worker-2 -it --image=ubuntu
chroot /host    # Host filesystem is at /host
journalctl -u kubelet --since "10 minutes ago"

Trivia: Ephemeral debug containers were a long time coming. The feature was introduced as alpha in Kubernetes v1.16 (2019), reached beta in v1.23 (2021), and finally went GA in v1.25 (2022). Before that, debugging distroless containers meant rebuilding the image with debugging tools baked in — which defeats the entire purpose of minimal images.


Part 10: RBAC Errors From Inside Pods

One more class of failure you'll hit: the pod runs fine, but fails when it tries to talk to the Kubernetes API.

kubectl logs notification-svc-4b5c6d7e8-f9a0 -n production -c main
ERROR: forbidden: User "system:serviceaccount:production:default" cannot list
resource "configmaps" in API group "" in the namespace "production"

The pod is using the default service account, which has no permissions. The application needs to read ConfigMaps but no one created a Role and RoleBinding.

# Check what service account the pod uses
kubectl get pod notification-svc-4b5c6d7e8-f9a0 -n production \
  -o jsonpath='{.spec.serviceAccountName}'
default

Fix: create a ServiceAccount, a Role granting only the needed permissions, and a RoleBinding connecting them. Then set serviceAccountName: notification-svc in the pod spec.

# ServiceAccount + Role + RoleBinding (all in namespace: production)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: notification-svc
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: notification-svc-role
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: notification-svc-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: notification-svc
    namespace: production
roleRef:
  kind: Role
  name: notification-svc-role
  apiGroup: rbac.authorization.k8s.io

Gotcha: If a pod doesn't need to talk to the Kubernetes API at all, set automountServiceAccountToken: false in the pod spec. This prevents the service account token from being mounted, which is both a security best practice and eliminates a whole class of RBAC debugging.


The Decision Tree

When you encounter a broken pod, start here:

kubectl get pods -o wide
        |
        v
What is the STATUS?
        |
        +-- Pending
        |     → kubectl describe pod → Events → FailedScheduling
        |     → Check: resource requests, taints, node affinity, PVC binding
        |
        +-- Init:Error / Init:CrashLoopBackOff
        |     → kubectl logs <pod> -c <init-container-name>
        |     → Init containers run sequentially; find which one failed
        |
        +-- ImagePullBackOff
        |     → kubectl describe pod → Events → look for 401, 404, or network error
        |     → Check: image name/tag, imagePullSecrets, registry access
        |
        +-- CrashLoopBackOff
        |     → kubectl logs <pod> --previous
        |     → Check exit code: 137=OOM, 1=app error, 127=cmd not found
        |     → If no logs: kubectl debug or override entrypoint
        |
        +-- Running but not Ready
        |     → Readiness probe failing
        |     → kubectl describe pod → look for "Readiness probe failed" in Events
        |
        +-- Running and Ready but not working
        |     → Test from inside: kubectl exec <pod> -- curl localhost:<port>
        |     → Check Service endpoints: kubectl get endpoints <svc>
        |     → Check NetworkPolicy, DNS, upstream dependencies
        |
        +-- Terminating (stuck)
              → Finalizer blocking deletion
              → kubectl get pod <name> -o jsonpath='{.metadata.finalizers}'
              → Or container ignoring SIGTERM (PID 1 problem)

Cheat Sheet

| Symptom | First command | What to look for |
| ------- | ------------- | ---------------- |
| Pod Pending | kubectl describe pod <name> | FailedScheduling in Events: resources, taints, PVC |
| CrashLoopBackOff | kubectl logs <pod> --previous | Exit code + error message |
| ImagePullBackOff | kubectl describe pod <name> | 401 (auth), 404 (not found), timeout (network) |
| OOMKilled | kubectl describe pod <name> | Reason: OOMKilled in Last State, then check limits |
| Probe failures | kubectl describe pod <name> | Unhealthy events, check probe config |
| No endpoints | kubectl get endpoints <svc> | Empty = label selector mismatch |
| DNS failure | kubectl exec <pod> -- nslookup kubernetes.default | If it fails: check CoreDNS pods and NetworkPolicy |
| PVC mount fail | kubectl describe pod <name> | FailedMount or Multi-Attach error |
| RBAC error | kubectl logs <pod> | forbidden: User "system:serviceaccount:..." |
| Init failure | kubectl logs <pod> -c <init-container> | Init containers run in order; find the failed one |

The flags you'll type most often: --previous (crash logs), -c <name> (multi-container), -o wide (node placement), -A (all namespaces), --sort-by=.lastTimestamp (events).


Exercises

Exercise 1: Read the room (2 minutes)

Given this kubectl get pods output, rank the pods by urgency and explain your reasoning:

NAME           READY   STATUS             RESTARTS      AGE
api-v2-abc     0/1     CrashLoopBackOff   12 (4m ago)   1h
worker-def     0/1     Pending            0             3h
cache-ghi      1/1     Running            0             5d
frontend-jkl   0/1     ImagePullBackOff   0             10m
Answer:
1. **api-v2**: CrashLoopBackOff, 12 restarts — actively hurting users. `kubectl logs --previous` now.
2. **frontend-jkl**: ImagePullBackOff for 10 minutes — likely a bad deploy. Quick fix (secret/typo).
3. **worker-def**: Pending 3 hours — nobody noticed, probably a background worker. Check scheduling.
4. **cache-ghi**: Running, 5 days, no restarts — fine, but "Running" != "working correctly."

Exercise 2: Debug the probe (5 minutes)

A pod keeps getting killed every 5 minutes. kubectl describe pod shows repeated Liveness probe failed events. The probe config is:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 30
  timeoutSeconds: 1
  failureThreshold: 3

The app takes 2-3 seconds to respond to /healthz under load. Write a better probe configuration.

Answer: The 1-second timeout is too aggressive. Increase `timeoutSeconds` to 5, add a `startupProbe` with `failureThreshold: 30` for slow starts, and bump `failureThreshold` on liveness to 6. Also: is `/healthz` doing too much work? Liveness endpoints should be lightweight — move dependency checks to a readiness probe.

Exercise 3: The resource puzzle (5 minutes)

A namespace has a ResourceQuota capping requests.cpu at 8 cores. Current usage: 7 CPUs requested. You deploy a pod requesting 2 CPUs. The Deployment looks fine, the ReplicaSet exists, but no pod appears. Where do you look?

Answer: `kubectl describe replicaset -n production` — the Events section shows `exceeded quota`. 7 + 2 = 9 > 8 CPU limit. The Deployment itself shows no error — you must drill down to the ReplicaSet. This multi-level indirection (Deployment -> ReplicaSet -> pod creation failure) is a classic trap. Fix: reduce the request or increase the quota.

Takeaways

  • The debugging ladder is a sequence, not a menu. Get, Describe, Logs, Exec, Events, Node. Start at the top.

  • Describe pod, not logs, is your first stop. The Events section tells the infrastructure story — scheduling, image pulls, probes, mounts. Logs tell the application story. Infrastructure first.

  • CrashLoopBackOff exponential backoff caps at 5 minutes. The sequence: 10s, 20s, 40s, 80s, 160s, 300s. Once you fix the issue, the pod recovers on the next cycle.

  • Exit codes above 128 mean "killed by signal." Subtract 128 to get the signal number. 137 = SIGKILL (OOMKilled). 143 = SIGTERM.

  • NetworkPolicy is default-deny once any policy selects a pod. If you add a policy that allows port 443, you've just blocked every other port — including DNS (53).

  • Requests affect scheduling. Limits affect runtime. A pod can be Pending (requests too high) even with an empty cluster, or OOMKilled (limits too low) even with plenty of node memory.