- devops
- l1
- runbook
- docker
- cicd

Portal | Level: L1: Foundations | Topics: Docker / Containers, CI/CD | Domain: DevOps & Tooling

# Runbook: Container Registry Pull Failure
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | ImagePullBackOff pods, or image_pull_errors_total > 0 |
| Severity | P1 (if blocking deployment), P2 (if single pod) |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, container registry credentials, ability to update Kubernetes secrets |
## Quick Assessment (30 seconds)
```shell
# Run this first — it tells you the scope of the problem
kubectl get pods -A | grep -E "ImagePull|ErrImage"
```
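If anything shows up, a quick per-namespace count tells you whether this is one deployment or cluster-wide, which changes severity and escalation. A small sketch, assuming the same `kubectl` access as above (the helper function name is ours):

```shell
# Sketch: summarise image-pull failures per namespace.
count_by_namespace() {
  # Input: "kubectl get pods -A --no-headers" lines; output: "<count> <namespace>"
  grep -E "ImagePull|ErrImage" | awk '{print $1}' | sort | uniq -c
}

# Only query the cluster if kubectl is actually available here.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -A --no-headers | count_by_namespace
fi
```

One namespace affected usually means a bad tag or a missing secret in that namespace; every namespace affected points at the registry or cluster networking.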
## Step 1: Confirm the Error and Read the Message
Why: ImagePullBackOff can mean several different things — missing image tag, wrong registry URL, expired credentials, or network issue. Reading the actual error message tells you which one it is.
```shell
# Get the exact error message for the failing pod
kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A10 "Failed\|Error\|Pull\|Back-off"

# If you don't know the pod name, find it first:
kubectl get pods -n <NAMESPACE>
# Then describe the one in ImagePullBackOff state
```
One of these messages will appear:

- `Failed to pull image "<IMAGE>:<TAG>": rpc error: ... manifest unknown` → the image tag does not exist in the registry
- `Failed to pull image "<IMAGE>:<TAG>": ... unauthorized: authentication required` → credentials are wrong or missing
- `Failed to pull image "<IMAGE>:<TAG>": ... connection refused` → the registry is unreachable (network issue)
- `Back-off pulling image "<IMAGE>:<TAG>"` → Kubernetes is retrying after a previous failure (check earlier Events for the root cause)
Tip: `kubectl get pods -n <NAMESPACE> | grep -v Running` lists every non-running pod.
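Rather than describing pods one at a time, you can loop over every failing pod and pull out just the relevant events. A sketch, assuming the same access as above (the filter function is ours):

```shell
# Sketch: print the last pull-related events for every failing pod in one pass.
pull_failures() {
  # Filter a pod listing down to image-pull failures (kept separate for testability).
  grep -E "ImagePullBackOff|ErrImagePull"
}

if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -A --no-headers | pull_failures |
  while read -r ns pod _rest; do
    echo "=== ${ns}/${pod} ==="
    # The last few Failed/Back-off events usually contain the root cause.
    kubectl describe pod "$pod" -n "$ns" | grep -E "Failed|Back-off" | tail -n 3
  done
fi
```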
## Step 2: Verify the Image Tag Exists in the Registry
Why: A common cause of pull failures is deploying a tag that was never pushed, or using the wrong tag name (e.g., latest when the registry uses versioned tags).
```shell
# Check if the image and tag exist using docker manifest inspect (does not pull the image):
docker manifest inspect <REGISTRY_HOST>/<IMAGE_NAME>:<TAG>

# If docker CLI is not available, check the registry UI directly:
# - DockerHub: https://hub.docker.com/r/<IMAGE_NAME>/tags
# - AWS ECR: aws ecr describe-images --repository-name <REPO_NAME> --image-ids imageTag=<TAG>
# - GCR: gcloud container images list-tags gcr.io/<PROJECT>/<IMAGE>
# - GHCR: https://github.com/<ORG>/<REPO>/pkgs/container/<IMAGE>

# AWS ECR example:
aws ecr describe-images --repository-name <REPO_NAME> --image-ids imageTag=<TAG> --region <REGION>
```
- `docker manifest inspect` returns JSON with image details — the tag exists.
- `aws ecr describe-images` returns `imageDetails` with the matching tag — the tag exists.
- If the tag does NOT exist, you will see `no such manifest`, `manifest unknown`, or `Error: No such image`.
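The manifest check can be wrapped so its exit code becomes an explicit verdict, which is handy in scripts or CI gates. A small sketch (the `check_tag` function name is ours, not a standard tool):

```shell
# Sketch: turn the docker manifest inspect exit code into a clear verdict.
check_tag() {
  # usage: check_tag <REGISTRY_HOST>/<IMAGE_NAME>:<TAG>
  if docker manifest inspect "$1" >/dev/null 2>&1; then
    echo "EXISTS: $1"
  else
    # A non-zero exit covers both "tag never pushed" and "no access to registry".
    echo "MISSING-OR-NO-ACCESS: $1"
  fi
}
```

For example, `check_tag registry.example.com/myapp:v1.2.3` before kicking off a deploy.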
## Step 3: Check the Image Pull Secret
Why: Kubernetes uses a pull secret to authenticate with private registries. If the secret is missing, expired, or has wrong credentials, all pulls to that registry will fail.
```shell
# Check if the pull secret exists in the namespace:
kubectl get secret -n <NAMESPACE> | grep docker

# Inspect the secret contents (decode and pretty-print):
kubectl get secret <PULL_SECRET_NAME> -n <NAMESPACE> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | python3 -m json.tool

# Check the deployment spec to see which pull secret it references:
kubectl get deployment <DEPLOY_NAME> -n <NAMESPACE> \
  -o jsonpath='{.spec.template.spec.imagePullSecrets}'
```
The decoded secret should contain an "auths" block with your registry hostname:
```json
{
  "auths": {
    "<REGISTRY_HOST>": {
      "username": "<USERNAME>",
      "password": "<PASSWORD>",
      "email": "<EMAIL>",
      "auth": "<BASE64_ENCODED_USER:PASS>"
    }
  }
}
```
- If the secret is missing: `kubectl get secret` returns no rows.
- If the registry hostname is wrong: the "auths" block uses a different hostname.
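A frequent miss here is that the secret exists but is keyed to a different registry hostname than the one the image reference uses. A small sketch to compare the two (both helper names are ours):

```shell
# Sketch: does the decoded .dockerconfigjson mention the image's registry host?
registry_host_of() {
  # First path segment of an image ref, e.g. ghcr.io/org/app:v1 -> ghcr.io.
  # (Bare names like "nginx:latest" have no host and default to DockerHub.)
  echo "${1%%/*}"
}

secret_has_host() {
  # usage: kubectl get secret ... | base64 -d | secret_has_host <REGISTRY_HOST>
  if grep -q "\"$1\""; then
    echo "host present in auths"
  else
    echo "host MISSING from auths"
  fi
}
```

If the host is missing from "auths", fixing the credentials alone will not help; the secret must be recreated for the right registry hostname.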
## Step 4: Test Registry Authentication Manually
Why: Before updating secrets, verify your credentials are valid. Updating with wrong credentials just replaces one broken secret with another.
```shell
# Test login from your local machine (not the cluster):
docker login <REGISTRY_HOST> -u <USERNAME> -p <PASSWORD>

# AWS ECR — get and test a fresh auth token:
aws ecr get-login-password --region <REGION> | docker login --username AWS \
  --password-stdin <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com

# GCR — authenticate using a service account key:
docker login -u _json_key --password-stdin https://gcr.io < <SERVICE_ACCOUNT_KEY_FILE>

# GHCR — authenticate with a PAT:
echo <GITHUB_PAT> | docker login ghcr.io -u <GITHUB_USERNAME> --password-stdin
```
## Step 5: Update the Pull Secret in Kubernetes
Why: Once you have valid credentials, update the Kubernetes secret so the cluster can pull images again.
```shell
# Create or update the docker-registry secret (--dry-run + apply is safe to run even if it exists):
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_HOST> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  --docker-email=<EMAIL> \
  -n <NAMESPACE> \
  --dry-run=client -o yaml | kubectl apply -f -

# AWS ECR — the auth token expires every 12 hours, so you may need to automate this.
# For a one-time fix, get a fresh token and create the secret:
kubectl create secret docker-registry ecr-pull-secret \
  --docker-server=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com \
  --docker-username=AWS \
  --docker-password=$(aws ecr get-login-password --region <REGION>) \
  -n <NAMESPACE> \
  --dry-run=client -o yaml | kubectl apply -f -
```
- `secret/<SECRET_NAME> configured` — the secret already existed and was updated
- `secret/<SECRET_NAME> created` — the secret was new
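Because ECR tokens expire every 12 hours, the one-time fix above will recur unless the secret is refreshed on a schedule (for example from a CronJob). A hedged sketch of such a refresh script, where the secret name `ecr-pull-secret` and both helper names are our assumptions:

```shell
#!/usr/bin/env bash
# Sketch: refresh an ECR pull secret; intended to run on a schedule shorter than 12h.
set -euo pipefail

ecr_server() {
  # usage: ecr_server <ACCOUNT_ID> <REGION>  -> the ECR registry hostname
  echo "$1.dkr.ecr.$2.amazonaws.com"
}

refresh_ecr_secret() {
  # usage: refresh_ecr_secret <ACCOUNT_ID> <REGION> <NAMESPACE>
  local server
  server="$(ecr_server "$1" "$2")"
  kubectl create secret docker-registry ecr-pull-secret \
    --docker-server="$server" \
    --docker-username=AWS \
    --docker-password="$(aws ecr get-login-password --region "$2")" \
    -n "$3" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Run from a CronJob (or CI job) with IAM access to `ecr:GetAuthorizationToken` and RBAC rights to update secrets in the target namespace.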
## Step 6: Restart Pods to Trigger a Fresh Pull
Why: Pods with ImagePullBackOff will not automatically retry after you fix the secret — you need to restart the deployment to trigger a new pull.
```shell
# Restart the deployment (triggers a rolling restart — no downtime if replicas > 1):
kubectl rollout restart deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Watch the rollout:
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Check that pods are no longer in ImagePullBackOff:
kubectl get pods -n <NAMESPACE> -w
```
- `deployment.apps/<DEPLOY_NAME> restarted`
- `Waiting for deployment rollout to finish...`
- `deployment "<DEPLOY_NAME>" successfully rolled out`
- All pods show status `Running` with containers `READY`.
If pods are still in ImagePullBackOff after the restart, return to Step 1 — the error message may have changed after the secret update. If pods start but then crash, see `../kubernetes/crashloopbackoff.md`.
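Instead of eyeballing the `-w` output, you can poll until the namespace is clean or a timeout hits. A sketch (the counting helper is ours; substitute `<NAMESPACE>` before running):

```shell
# Sketch: poll until no pods in the namespace are stuck on image pulls.
count_pull_failures() {
  # Count image-pull failures in a pod listing (0 means clean).
  grep -cE "ImagePullBackOff|ErrImagePull" || true
}

if command -v kubectl >/dev/null 2>&1; then
  for _ in $(seq 1 30); do   # roughly 5 minutes at 10s per attempt
    n=$(kubectl get pods -n "<NAMESPACE>" --no-headers 2>/dev/null | count_pull_failures)
    if [ "${n:-0}" -eq 0 ]; then
      echo "clean"
      break
    fi
    sleep 10
  done
fi
```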
## Verification
Success looks like: the Quick Assessment command returns no output — zero pods in ImagePullBackOff or ErrImagePull state across all namespaces.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Infra on-call | "P1: Container registry pull failure blocking deployments in <NAMESPACE>" |
| Registry appears down | Platform/Infra on-call | "Registry <REGISTRY_HOST> unreachable from the cluster — possible registry outage" |
| Security incident | Security on-call | "Security incident: container registry credentials may have been compromised — unauthorized pulls detected" |
| Scope expanding (all namespaces) | Platform/Infra on-call | "Cluster-wide ImagePullBackOff across all namespaces — possible registry outage or cluster networking issue" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- If root cause was expired ECR tokens: set up automated token rotation (see platform team docs)
- If root cause was a missing secret in a new namespace: update namespace provisioning automation to include pull secrets
- Document which registries require pull secrets and their rotation schedule
## Common Mistakes
- Not checking if the image tag actually exists: The most common cause of `ImagePullBackOff` is deploying a tag that was never pushed. Always check the registry before suspecting credentials.
- Credential expiry (tokens rotate): AWS ECR tokens expire every 12 hours. If this happens repeatedly, you need automated token rotation — a one-time fix is not enough.
- Wrong registry URL: Private registries need the full hostname (e.g., `my-registry.example.com:5000/image:tag`). Using just `image:tag` tells Kubernetes to look on DockerHub.
- Updating the secret but forgetting to restart pods: Kubernetes does not automatically restart pods when you update a secret — you must trigger a rollout.
- Testing from your local machine but not the cluster: Your laptop may be able to reach the registry but the cluster may have a network policy or VPC routing issue blocking access. If local works but cluster doesn't, check network policies and security groups.
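The last mistake above can be checked directly by curling the registry's `/v2/` endpoint from inside the cluster. A sketch using a throwaway pod: `/v2/` is the standard Docker Registry HTTP API root, while the `curlimages/curl` image and the helper name are our choices:

```shell
# Sketch: test registry reachability from inside the cluster, not your laptop.
interpret_v2_status() {
  # usage: interpret_v2_status <HTTP_CODE>
  case "$1" in
    200|401) echo "reachable (401 just means auth is required)" ;;
    000)     echo "no connection: network policy / DNS / routing issue" ;;
    *)       echo "reachable but unexpected status: $1" ;;
  esac
}

if command -v kubectl >/dev/null 2>&1; then
  # Run curl in a short-lived pod so the request originates from cluster networking.
  code=$(kubectl run registry-check --rm -i --restart=Never \
    --image=curlimages/curl -- \
    curl -s -o /dev/null -w '%{http_code}' "https://<REGISTRY_HOST>/v2/")
  interpret_v2_status "$code"
fi
```

If this says "no connection" while `docker login` works from your laptop, the registry is fine and the problem is cluster-side networking.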
## Cross-References
- Topic Pack: `training/library/topic-packs/cicd-fundamentals/` (deep background on container registries and image management)
- Related Runbook: deploy-rollback.md — if you need to roll back to a previously-pulled image
- Related Runbook: build-failure-triage.md — if the image tag was never built
- Related Runbook: `../kubernetes/imagepullbackoff.md` — Kubernetes-specific diagnosis