K8S Troubleshooting

49 cards — 🟢 13 easy | 🟡 25 medium | 🔴 11 hard

🟢 Easy (13)

1. A pod is in CrashLoopBackOff. What does this status mean?

The container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5m). Common causes: missing config/secrets, unhandled exception at startup, OOMKilled, bad entrypoint command.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

2. A pod is stuck in ImagePullBackOff. What are the common causes?

1) Image name or tag is wrong. 2) Image doesn't exist in the registry. 3) ImagePullSecret is missing or expired. 4) Private registry requires auth. 5) Network policy or firewall blocks registry access.

Remember: ImagePullBackOff = can't pull image. Check: typo, auth, network, tag existence.

Gotcha: `kubectl describe pod` Events section has the exact error.

3. A pod is stuck in Pending state. What do you check first?

kubectl describe pod &lt;pod&gt; — look at Events. Common reasons: insufficient CPU/memory (no node fits requests), no matching nodeSelector/affinity, PVC not bound, taints without tolerations.

Remember: Pending = can't schedule. Causes: no resources, taints, unbound PVC, no nodes.

Gotcha: `kubectl describe pod` shows FailedScheduling reason.

4. How do you use kubectl events for troubleshooting?

kubectl get events --sort-by=.lastTimestamp shows recent cluster events. kubectl get events --field-selector involvedObject.name=&lt;pod&gt; filters to one resource. Events expire after 1 hour by default — check quickly after an issue.

Remember: Flow: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

5. How do you find pods that match a particular label selector?

kubectl get pods -l app=myapp,env=prod (comma = AND). kubectl get pods -l 'app in (myapp,otherapp)' (set-based). kubectl get pods --show-labels to see all labels. Mismatched labels are the top cause of Service/Deployment issues.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

6. What exit code does a container terminated by OOMKilled have, and why?

Exit code 137 (128 + 9), because the Linux kernel sends SIGKILL (signal 9) when the process exceeds its memory cgroup limit.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.
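The 128 + signal convention is plain POSIX shell behavior and can be checked locally, no cluster needed:

```shell
# A child terminated by signal N is reported as exit status 128 + N.
# SIGKILL is signal 9, so an OOM-killed process surfaces as 137.
sh -c 'kill -KILL $$'        # the child SIGKILLs itself, as the OOM killer would
status=$?
echo "exit status: $status"  # → exit status: 137
```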

7. What kubectl command shows whether a pod was OOMKilled, and what fields do you look for?

kubectl describe pod &lt;pod&gt;. Look for Last State: Terminated, Reason: OOMKilled, Exit Code: 137 in the container status section.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

8. What is the difference between resources.requests.memory and resources.limits.memory in a Kubernetes pod spec?

requests.memory is what the scheduler uses to place the pod (counted against the node's allocatable memory). limits.memory is the hard ceiling enforced by the Linux cgroup — exceeding it triggers an OOMKill.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.
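A minimal container-spec sketch showing both fields side by side (the names and values are illustrative):

```yaml
containers:
- name: app                # hypothetical container name
  image: myapp:1.0         # hypothetical image
  resources:
    requests:
      memory: "256Mi"      # used by the scheduler to pick a node
      cpu: "250m"
    limits:
      memory: "512Mi"      # cgroup ceiling — exceeding it means OOMKill
```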

9. What are the three Kubernetes QoS classes and which is evicted first under memory pressure?

BestEffort (no requests/limits, evicted first), Burstable (requests &lt; limits, evicted second), Guaranteed (requests == limits, evicted last).

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

10. What does CrashLoopBackOff mean in Kubernetes?

CrashLoopBackOff is a status indicating the container started, crashed, and the kubelet is waiting with exponential backoff (10s, 20s, 40s... capped at 5 minutes) before restarting it again.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

11. What kubectl command shows logs from a crashed container's previous run?

kubectl logs &lt;pod&gt; --previous. The --previous flag retrieves logs from the last terminated container instance.

Example: `kubectl logs pod --previous --tail=100` — last 100 lines from crashed container.

Gotcha: Logs lost on pod deletion. Set up log aggregation (Fluentd/Loki) for persistence.

12. A pod is in CrashLoopBackOff with exit code 137. What killed it and what do you check first?

Exit code 137 means SIGKILL — most commonly the kernel OOM killer. Run kubectl describe pod and look for Reason: OOMKilled in Last State, then check the container memory limit versus actual usage.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

13. What three kubectl commands form the basic CrashLoopBackOff diagnostic workflow?

1) kubectl get pods (see restart count and status), 2) kubectl describe pod &lt;pod&gt; (events, exit codes, last state), 3) kubectl logs &lt;pod&gt; --previous (see what the container printed before dying).

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

🟡 Medium (25)

1. How do you get the crash output from a CrashLoopBackOff pod?

kubectl logs &lt;pod&gt; --previous shows stdout/stderr from the last crashed container. If the container exits too fast, the kubectl describe pod Events section often reveals the reason (OOMKilled, exec format error, etc.).

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

2. How do you verify that an ImagePullSecret is correct?

kubectl get secret &lt;name&gt; -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d to inspect credentials. Then test manually: docker login with those creds. Also check that the secret is in the same namespace as the pod.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

3. How do you determine if a pod is Pending due to resource pressure?

kubectl describe nodes | grep -A 5 'Allocated resources' shows used vs allocatable. If requests exceed available capacity, the scheduler can't place the pod. kubectl get events --field-selector reason=FailedScheduling confirms it.

Remember: Pending = can't schedule. Causes: no resources, taints, unbound PVC, no nodes.

Gotcha: `kubectl describe pod` shows FailedScheduling reason.

4. A pod keeps restarting but logs show no errors. What could be wrong?

Likely a misconfigured liveness probe. If the probe endpoint is wrong, too slow, or has a short timeout, the kubelet kills the container as 'unhealthy'. Check kubectl describe pod — look for 'Liveness probe failed' in Events.

Remember: High restarts = crashing. `kubectl describe pod` for reason, `kubectl logs --previous` for logs.

5. What is the difference between liveness, readiness, and startup probes?

Liveness: is the process alive? Failure = container restart. Readiness: can it serve traffic? Failure = removed from Service endpoints. Startup: is the app finished initializing? Blocks liveness/readiness until it passes. Use startup probes for slow-starting apps.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.
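A sketch of all three probes on one container; the /healthz and /ready paths and port 8080 are assumptions, not fixed conventions:

```yaml
startupProbe:                 # gates the other probes until the app boots
  httpGet: {path: /healthz, port: 8080}
  failureThreshold: 30        # 30 × 10s = up to 5 min to start
  periodSeconds: 10
livenessProbe:                # failure => container restart
  httpGet: {path: /healthz, port: 8080}
  periodSeconds: 10
readinessProbe:               # failure => removed from Service endpoints
  httpGet: {path: /ready, port: 8080}
  periodSeconds: 5
```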

6. When do you use kubectl logs vs kubectl describe?

Use logs to see application output (stdout/stderr). Use describe to see Kubernetes-level info: scheduling decisions, probe results, image pulls, resource limits, events. Start with describe for cluster issues, logs for app issues.

Example: `kubectl logs pod --previous --tail=100` — last 100 lines from crashed container.

Gotcha: Logs lost on pod deletion. Set up log aggregation (Fluentd/Loki) for persistence.

7. A Service exists but gets no traffic. How do you debug?

kubectl get endpoints — if empty, the selector doesn't match any pod labels. Check kubectl get pods --show-labels and compare with kubectl get svc &lt;name&gt; -o yaml | grep selector. Also verify the pods are Ready.

Remember: Flow: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

8. How do you test connectivity to a Service from inside the cluster?

kubectl run tmp --image=busybox --rm -it -- wget -qO- http://&lt;service&gt;.&lt;namespace&gt;.svc.cluster.local:&lt;port&gt;. Or use kubectl exec into an existing pod. Check DNS resolution: nslookup &lt;service&gt;.&lt;namespace&gt;.svc.cluster.local.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

9. Pods can reach external IPs but internal DNS fails. What do you check?

1) CoreDNS pods are running: kubectl get pods -n kube-system -l k8s-app=kube-dns. 2) CoreDNS service has endpoints. 3) Pod resolv.conf points to the CoreDNS ClusterIP. 4) Check CoreDNS logs for errors. 5) A NetworkPolicy may be blocking UDP/53.

Example: `kubectl exec debug-pod -- nslookup kubernetes.default` verifies cluster DNS.

10. Pods won't schedule on a specific node. How do you check taints?

kubectl describe node &lt;node&gt; | grep Taints. Common taints: node.kubernetes.io/not-ready, node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure. Pods need matching tolerations in their spec to schedule on tainted nodes.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

11. A node shows NotReady status. How do you investigate?

kubectl describe node &lt;node&gt; — check Conditions (MemoryPressure, DiskPressure, PIDPressure). SSH to the node: check kubelet logs (journalctl -u kubelet), disk space (df -h), memory (free -m). Common cause: the kubelet can't reach the API server.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

12. A PVC is stuck in Pending. What are the common causes?

1) No StorageClass matches the request. 2) The StorageClass provisioner can't create the volume (cloud API error, quota). 3) No PV available for static provisioning. 4) Access mode mismatch (ReadWriteMany not supported). Check the Events in kubectl describe pvc &lt;name&gt;.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

13. A Deployment rollout is stuck. How do you diagnose it?

kubectl rollout status deploy/&lt;name&gt; shows progress. kubectl describe deploy/&lt;name&gt; shows conditions. Common causes: new pods failing probes, insufficient quota, image pull errors. kubectl rollout undo deploy/&lt;name&gt; reverts to the last working revision.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

14. How do you check Deployment revision history and roll back to a specific version?

kubectl rollout history deploy/&lt;name&gt; lists revisions. kubectl rollout history deploy/&lt;name&gt; --revision=2 shows details. kubectl rollout undo deploy/&lt;name&gt; --to-revision=2 rolls back. Revisions are stored in ReplicaSets — don't delete old ReplicaSets.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

15. A developer says their app can't reach a service. They're in different namespaces. What's the fix?

Use the FQDN: &lt;service&gt;.&lt;namespace&gt;.svc.cluster.local. Short names only resolve within the same namespace. Also check that a NetworkPolicy isn't restricting cross-namespace traffic: kubectl get netpol -n &lt;namespace&gt; to verify.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

16. A pod can reach the internet but not other pods. What do you check?

1) The CNI plugin is healthy (check kube-system pods). 2) A NetworkPolicy is blocking inter-pod traffic. 3) iptables rules on the node (kube-proxy issues). 4) Pod CIDR overlaps with the node network. Run kubectl exec &lt;pod&gt; -- ping &lt;other-pod-ip&gt; to confirm.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

17. A pod starts but behaves incorrectly after a ConfigMap update. Why?

ConfigMaps mounted as volumes update eventually (kubelet sync period, ~60s). But env vars from ConfigMaps are set at pod creation and never update. Fix: restart the pods (kubectl rollout restart deploy/&lt;name&gt;) or use a sidecar that watches for changes.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.
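A sketch contrasting the two consumption styles; `app-config` is a placeholder name. Volume-mounted files refresh eventually, env vars never do (and subPath-mounted files also never refresh):

```yaml
containers:
- name: app
  envFrom:
  - configMapRef:
      name: app-config       # env vars: snapshotted at pod creation
  volumeMounts:
  - name: config
    mountPath: /etc/app      # files: refreshed by the kubelet (~60s)
volumes:
- name: config
  configMap:
    name: app-config
```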

18. Why does a Java application with -Xmx1g in a container limited to 512Mi get OOMKilled, and how do you fix it?

The JVM requests 1GB of heap from the OS, but the cgroup enforces a 512Mi ceiling and kills the process. Fix by using -XX:MaxRAMPercentage=75.0 so the JVM sizes its heap relative to the container's memory limit.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.
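One way to wire this up (a sketch — JAVA_TOOL_OPTIONS is read automatically by the JVM; the container name, image, and 512Mi limit are illustrative):

```yaml
containers:
- name: java-app
  image: myapp:1.0                       # hypothetical image
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"   # heap ≈ 75% of the cgroup limit
  resources:
    limits:
      memory: "512Mi"                    # heap caps near 384Mi, leaving headroom
```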

19. What is oom_score_adj and how does Kubernetes use it to influence which process the OOM killer targets?

oom_score_adj (-1000 to 1000) adjusts a process's OOM kill priority. Kubernetes sets it by QoS class: Guaranteed gets -997 (killed last), BestEffort gets 1000 (killed first), and Burstable gets a scaled value in between.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

20. Which Prometheus metric should you use to predict OOMKill, and why not container_memory_usage_bytes?

Use container_memory_working_set_bytes because it excludes inactive file cache and reflects what the OOM killer actually evaluates. container_memory_usage_bytes includes reclaimable cache and overstates true pressure.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

21. What is a LimitRange and how does it prevent OOMKilled caused by missing resource limits?

A LimitRange is a namespace-scoped object that sets default memory requests and limits for containers that do not specify their own. It ensures every pod has a cgroup ceiling, preventing unbounded memory consumption that causes node-level OOM.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.
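A minimal LimitRange sketch (the name, namespace, and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults
  namespace: team-a          # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: "256Mi"        # injected as requests.memory when unset
    default:
      memory: "512Mi"        # injected as limits.memory when unset
```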

22. What is the difference between exit codes 126 and 127 in a container?

Exit code 126 means the entrypoint binary exists but cannot be executed (permission denied). Exit code 127 means the entrypoint binary does not exist in the image (command not found).

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.
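Both codes come straight from the shell's command-lookup rules and can be reproduced locally:

```shell
# 127: command not found — like an entrypoint binary missing from the image
sh -c 'no_such_command_xyz' 2>/dev/null
code_not_found=$?

# 126: found but not executable — like a script missing its execute bit
printf '#!/bin/sh\necho hi\n' > /tmp/noexec.sh
chmod 644 /tmp/noexec.sh
sh -c '/tmp/noexec.sh' 2>/dev/null
code_no_perm=$?

echo "not found: $code_not_found, not executable: $code_no_perm"
# → not found: 127, not executable: 126
```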

23. How can a liveness probe cause CrashLoopBackOff, and how do you prevent it for slow-starting apps?

If initialDelaySeconds is too short, the liveness probe fails before the app finishes starting, causing Kubernetes to kill and restart the container repeatedly. Use a startupProbe with a high failureThreshold to give the app time to initialize before liveness checks begin.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

24. A container keeps restarting with exit code 137. Describe your troubleshooting steps.

Run kubectl describe pod &lt;pod&gt; to confirm OOMKilled as the Last State reason. Check the container's memory limit in the pod spec. Use kubectl top pod or Prometheus metrics to see actual memory usage. Increase the memory limit or fix the memory leak in the application.

Remember: Flow: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

25. How do init containers help prevent CrashLoopBackOff caused by missing dependencies?

Init containers run before the main container and block startup until they succeed. You can use an init container to wait for a dependency (e.g., polling a database port with nc -z) so the main container only starts when its dependencies are actually ready.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.
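A sketch of the polling pattern; the `db` hostname and port 5432 stand in for the real dependency:

```yaml
initContainers:
- name: wait-for-db
  image: busybox:1.36
  command: ['sh', '-c', 'until nc -z db 5432; do echo waiting; sleep 2; done']
containers:
- name: app
  image: myapp:1.0           # hypothetical — starts only after the init succeeds
```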

🔴 Hard (11)

1. DNS lookups are slow in pods. What is the ndots issue?

Default ndots:5 means any name with fewer than 5 dots gets the search domains appended first (e.g., api.example.com tries api.example.com.&lt;namespace&gt;.svc.cluster.local before the real lookup). Fix: set dnsConfig.options ndots:2 in the pod spec or use FQDNs with trailing dots.

Example: `kubectl exec debug-pod -- nslookup kubernetes.default` verifies cluster DNS.
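The pod-spec fix is a small dnsConfig block (sketch):

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"    # names with 2+ dots are tried as absolute first
```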

2. A pod with a PVC can't start and shows a multi-attach error. What's wrong?

The PV is ReadWriteOnce (RWO) and is already mounted on another node. This happens during rolling updates when old and new pods land on different nodes. Fix: use the Recreate strategy instead of RollingUpdate, or switch to ReadWriteMany if the storage supports it.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.

3. A pod is OOMKilled but the app's memory usage looks normal. What happened?

Check memory limits vs actual usage: kubectl top pod &lt;pod&gt;. The kernel OOM killer accounts for RSS (resident set size), which includes shared libraries and buffers. Java apps commonly exceed limits due to off-heap memory. Fix: increase limits or tune the runtime (e.g., -XX:MaxRAMPercentage for JVMs).

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

4. How do you distinguish a container-level OOM from a node-level OOM, and what commands reveal each?

Container-level: single pod affected, Exit Code 137, Reason OOMKilled in kubectl describe pod. Node-level: multiple pods affected, dmesg shows kernel OOM killer messages, kubelet logs show eviction activity, kubectl describe node shows MemoryPressure: True.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

5. Explain the kubelet eviction thresholds for memory and how they interact with the kernel OOM killer.

The kubelet has --eviction-hard (e.g., memory.available&lt;100Mi) and --eviction-soft thresholds. When available memory crosses the soft threshold for its grace period or hits the hard threshold, the kubelet evicts pods (BestEffort first). If eviction cannot free memory fast enough and available memory reaches zero, the kernel OOM killer fires as a last resort.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

6. A pod has a main container limited to 512Mi and an Istio sidecar limited to 256Mi. The main container is OOMKilled despite using only 400Mi. What is the likely cause?

Each container has its own cgroup and memory limit. If the main container is OOMKilled at 400Mi with a 512Mi limit, it may be counting shared memory (e.g., tmpfs mounts, emptyDir medium: Memory volumes) against the container's cgroup. Check for memory-backed volumes and sidecar memory consumption patterns with kubectl top pod &lt;pod&gt; --containers.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

7. How does the Vertical Pod Autoscaler help prevent OOMKilled, and what are its risks?

VPA analyzes historical memory usage and recommends or automatically sets requests and limits. It provides lowerBound, target, and upperBound recommendations. Risks: with updateMode: "Auto" it restarts pods to apply new limits (disruption), it can undersize limits if load patterns are spiky, and it conflicts with HPA on the same resource — never let VPA and HPA both scale on memory.

Remember: OOMKilled = exit 137 (128+SIGKILL). Exceeded memory limit. Increase or fix leak.

Gotcha: Check `kubectl describe pod` for Reason: OOMKilled in Last State.

8. What is the PID 1 problem in containers and how does it cause exit code 137 on pod termination?

In a container, the entrypoint runs as PID 1. If PID 1 does not handle SIGTERM, Kubernetes sends SIGTERM on shutdown, the process ignores it, Kubernetes waits the terminationGracePeriodSeconds (default 30s), then sends SIGKILL — resulting in exit code 137. Fix by using exec form in the Dockerfile or a lightweight init system like tini.

Remember: K8s troubleshooting: Get→Describe→Logs→Exec. Mnemonic: "GDLE."

Gotcha: Always check Events with `kubectl describe` — they tell WHY, not just WHAT failed.
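The kill sequence can be reproduced in plain shell: a process that ignores SIGTERM survives the polite request and dies with 137 when SIGKILL follows:

```shell
# A stand-in for a naive PID 1 that ignores SIGTERM
sh -c 'trap "" TERM; sleep 30' &
pid=$!
sleep 1
kill -TERM "$pid"            # the kubelet's graceful-shutdown signal — ignored
sleep 1
kill -KILL "$pid"            # what happens after terminationGracePeriodSeconds
wait "$pid"
status=$?
echo "exit status: $status"  # → exit status: 137
```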

9. How do you debug a CrashLoopBackOff when kubectl logs --previous shows no output?

Use kubectl debug -it &lt;pod&gt; --image=busybox --target=&lt;container&gt; to attach an ephemeral debug container sharing the pod's namespaces. Alternatively, run a new pod with the same image but override the entrypoint to sleep (kubectl run debug --image=&lt;image&gt; --overrides=...), then exec in and manually run the entrypoint to observe the error interactively.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

10. How do you distinguish CrashLoopBackOff from CreateContainerConfigError and ImagePullBackOff?

In CrashLoopBackOff the container started and ran before crashing — check application logs. In ImagePullBackOff the image could not be pulled (wrong tag, registry auth, network). In CreateContainerConfigError the container could not be configured (referenced ConfigMap or Secret does not exist). The key distinction is whether the container process ever executed.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.

11. Why might a container with exit code 0 enter CrashLoopBackOff, and how do you fix it?

With the default restartPolicy: Always, Kubernetes restarts containers even on successful exit (code 0). If the container's process completes and exits cleanly, it will be restarted indefinitely. Fix by changing restartPolicy to OnFailure or Never (for Jobs/CronJobs), or redesign the container to run as a long-lived process that does not exit.

Remember: CrashLoopBackOff = start-crash-retry with exponential backoff (10s→20s→40s→5min).

Gotcha: `kubectl logs pod --previous` shows crash reason. Common: missing config, OOM, wrong cmd.
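For run-to-completion workloads, the pod-spec change is one line (a sketch; the container name and image are hypothetical):

```yaml
spec:
  restartPolicy: OnFailure   # exit 0 means done; only non-zero exits restart
  containers:
  - name: batch-task
    image: myjob:1.0         # hypothetical one-shot image
```

Note that Deployment pod templates only allow restartPolicy: Always, so run-to-completion work belongs in a Job or CronJob instead.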