# Runbook: Pod CrashLoopBackOff
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | container_restart_rate > 5/min or pod status = CrashLoopBackOff |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
## Quick Assessment (30 seconds)

```bash
# Run this first — it tells you the scope of the problem
kubectl get pods -n <NAMESPACE> --field-selector=status.phase!=Running -o wide
```
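When several pods are unhealthy, it helps to see the breakdown by status before diving into one pod. A minimal sketch (the `summarize_pods` helper is hypothetical; it assumes the default `kubectl get pods` column layout, with STATUS in column 3):

```shell
#!/bin/sh
# Count pods per STATUS from `kubectl get pods`-style output.
# Skips the header row; STATUS is the third column in the default layout.
summarize_pods() {
  awk 'NR > 1 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Canned example; in practice pipe the quick-assessment command into it:
#   kubectl get pods -n <NAMESPACE> -o wide | summarize_pods
summarize_pods <<'EOF'
NAME    READY   STATUS             RESTARTS   AGE
app-1   0/1     CrashLoopBackOff   12         30m
app-2   0/1     CrashLoopBackOff   11         30m
db-0    1/1     Running            0          2d
EOF
# prints:
#   CrashLoopBackOff 2
#   Running 1
```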
## Step 1: Check Pod Status and Restart Count

Why: Confirms the CrashLoopBackOff and shows how long it has been failing, which sets urgency.

```bash
kubectl get pod <POD_NAME> -n <NAMESPACE> -o wide
kubectl get events -n <NAMESPACE> --field-selector=involvedObject.name=<POD_NAME> --sort-by='.lastTimestamp'
```

If this fails: Verify you are on the right cluster with `kubectl config get-contexts` and confirm the pod name with `kubectl get pods -n <NAMESPACE>`.
## Step 2: Get Previous Container Logs

Why: Current logs often show nothing because the container crashed immediately. The -p flag retrieves the logs from the last terminated instance, which contains the actual error.

```bash
# The actual error — stack trace, panic message, missing file, failed config parse, etc.
kubectl logs <POD_NAME> -n <NAMESPACE> -p
```

Expected output (example):

```
Error: cannot load config: open /etc/app/config.yaml: no such file or directory
```

If this fails: Run `kubectl describe pod <POD_NAME> -n <NAMESPACE>` and check for init container failures.
## Step 3: Describe Pod for Exit Codes and Events

Why: The exit code tells you the category of failure. Exit code 1 = app crash, exit code 137 = OOMKilled (see oom-kill.md), exit code 126/127 = bad entrypoint/command.

```bash
kubectl describe pod <POD_NAME> -n <NAMESPACE>
```

Expected output — look for this section:

```
Last State:   Terminated
  Reason:     Error
  Exit Code:  1
  Started:    Thu, 19 Mar 2026 10:00:00 +0000
  Finished:   Thu, 19 Mar 2026 10:00:02 +0000
```

If the exit code is 126 or 127: check `spec.containers[].command` and `spec.containers[].args` for typos.
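The exit-code categories in this step follow the usual 128+signal convention and can be wrapped in a small triage helper. A sketch (the `triage_exit_code` function is hypothetical):

```shell
#!/bin/sh
# Map a container exit code to the failure category used in this step.
triage_exit_code() {
  case "$1" in
    0)   echo "clean exit (suspect restartPolicy, see Tips)" ;;
    1)   echo "application crash (read previous logs, Step 2)" ;;
    126) echo "command not executable (bad entrypoint)" ;;
    127) echo "command not found (bad command/args)" ;;
    137) echo "SIGKILL / OOMKilled, 128+9 (see oom-kill.md)" ;;
    143) echo "SIGTERM, 128+15 (container was asked to stop)" ;;
    *)   echo "unknown exit code $1 (read previous logs)" ;;
  esac
}

# The code itself can be pulled straight from the pod status:
#   kubectl get pod <POD_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
triage_exit_code 137   # prints: SIGKILL / OOMKilled, 128+9 (see oom-kill.md)
```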
## Step 4: Check for OOMKill vs Application Crash

Why: OOMKill and app crashes require completely different fixes. Confusing them wastes time.

```bash
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```
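The branch this step drives can be scripted against the jsonpath output above. A sketch (the `route_crash` helper is hypothetical):

```shell
#!/bin/sh
# Branch on the kubelet-reported termination reason: the jsonpath query
# above prints OOMKilled for memory kills, Error for ordinary crashes.
route_crash() {
  case "$1" in
    OOMKilled) echo "memory problem: follow oom-kill.md" ;;
    Error)     echo "app crash: fix root cause (Step 5)" ;;
    "")        echo "no terminated state recorded yet" ;;
    *)         echo "other reason: $1 (describe the pod)" ;;
  esac
}

# REASON=$(kubectl get pod <POD_NAME> -n <NAMESPACE> \
#   -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')
route_crash OOMKilled   # prints: memory problem: follow oom-kill.md
```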
## Step 5: Fix the Root Cause

Why: The logs from Step 2 should now tell you what is wrong. Common causes and their fixes are below.

Case A — Missing ConfigMap or Secret:

```bash
# Check if the referenced ConfigMap exists
kubectl get configmap <CONFIGMAP_NAME> -n <NAMESPACE>
# Check if the referenced Secret exists
kubectl get secret <SECRET_NAME> -n <NAMESPACE>
# If missing, create it — example for ConfigMap from file
kubectl create configmap <CONFIGMAP_NAME> --from-file=<CONFIG_FILE_PATH> -n <NAMESPACE>
```

Case B — Wrong image tag or image does not exist:

```bash
# Check the current image
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[0].image}'
# Fix the image tag
kubectl set image deployment/<DEPLOYMENT_NAME> <CONTAINER_NAME>=<IMAGE>:<CORRECT_TAG> -n <NAMESPACE>
```

Case C — Insufficient resource limits causing startup failure:

```bash
# Check current limits
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Edit deployment to adjust limits
kubectl edit deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
# Under resources.limits, increase memory or cpu as needed
```

Expected output after fix: Pod transitions from CrashLoopBackOff to Running within 1-2 minutes.

If this fails: The root cause may be in application code or a dependency (database unreachable, upstream service down) — check app-level logs carefully and check dependent services.
## Step 6: Trigger Rollout and Confirm

Why: Some fixes (like ConfigMap changes) require a pod restart to take effect. A rollout ensures a clean restart and records the change in deployment history.

```bash
kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
```

Expected output:

```
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "my-app" successfully rolled out
```

If this fails: Roll back with `kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>` and see deploy-stuck.md.
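The restart, wait, undo sequence in this step can be wrapped in one helper so a stalled rollout is rolled back automatically. A sketch (the `restart_and_watch` function is hypothetical; it shells out to the same kubectl commands shown above):

```shell
#!/bin/sh
# Restart a deployment, wait for the rollout, and roll back if it stalls.
# Returns non-zero on any failure so callers can escalate.
restart_and_watch() {
  deploy=$1
  ns=$2
  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
  if ! kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=5m; then
    echo "rollout stalled, rolling back" >&2
    kubectl rollout undo "deployment/$deploy" -n "$ns"
    return 1
  fi
  echo "rollout complete"
}

# Usage: restart_and_watch <DEPLOYMENT_NAME> <NAMESPACE>
```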
## Verification

Success looks like: All pods show Running, the READY column shows 1/1 (or N/N for multi-container pods), and the restart count is not incrementing.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | SRE on-call | "Kubernetes CrashLoopBackOff in `<NAMESPACE>`, not resolved after 30 minutes" |
| Data loss suspected (stateful workload) | Platform Lead | "Data loss risk: stateful pod `<POD_NAME>` in CrashLoopBackOff" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: CrashLoop spreading, possible shared dependency failure" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add the root cause to the team's known issues log
- Verify that the fix is reflected in the deployment manifest in git (not just patched live)
## Common Mistakes

- Deleting the pod instead of fixing the deployment: Deleting a crashing pod causes Kubernetes to immediately recreate it from the same broken deployment spec. The crash loop continues. Always fix the Deployment (or StatefulSet/DaemonSet) spec, not just the pod.
- Reading only the current container logs: A CrashLoopBackOff pod restarts repeatedly. The current container logs are often empty or show only startup output because the container crashed within seconds. Always use `kubectl logs -p` to get the previous (terminated) container's logs — that is where the actual error lives.
- Assuming the fix is applied: After editing a Deployment, always run `kubectl rollout status` to confirm the new pods rolled out successfully. A broken admission webhook or image pull error can silently stall the rollout.
## Tips and Gotchas

- CrashLoopBackOff is not a distinct error — it means "container crashed and kubelet is backing off before restarting"
- The back-off timer resets after a pod runs successfully for 10 minutes
- `--previous` (`-p`) only works if there was a previous container instance; on first crash after pod creation, only current logs exist
- A container exiting with code 0 can still cause CrashLoopBackOff if `restartPolicy: Always` and the process isn't meant to exit
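The back-off growth behind CrashLoopBackOff can be tabulated: by default the kubelet waits 10s after the first crash and doubles the delay on each subsequent crash, capping at 300s (5 minutes). A sketch (the `crashloop_delay` helper is hypothetical):

```shell
#!/bin/sh
# Back-off before the Nth restart: 10s, doubling each crash, capped at 300s.
# (The kubelet resets this timer after 10 minutes of successful running.)
crashloop_delay() {
  n=$1; d=10; i=1
  while [ "$i" -lt "$n" ]; do
    d=$((d * 2))
    if [ "$d" -gt 300 ]; then d=300; fi
    i=$((i + 1))
  done
  echo "$d"
}

for n in 1 2 3 4 5 6 7; do
  printf 'restart %d: wait %ss\n' "$n" "$(crashloop_delay "$n")"
done
# restarts 6 and later wait the capped 300s (5 minutes)
```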
## Cross-References
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: oom-kill.md — if exit code is 137
- Related Runbook: deploy-stuck.md — if rollout stalls after fix
- Related Runbook: pvc-pending.md — if crash is caused by unbound volume
- Troubleshooting Guide: training/library/guides/troubleshooting.md (CrashLoopBackOff section)
- Lab: training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/
- Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
- Incident Scenario: training/interactive/incidents/scenarios/crashloop-bad-command.sh