- devops
- l1
- runbook
- docker
- cicd

Portal | Level: L1: Foundations | Topics: Docker / Containers, CI/CD | Domain: DevOps & Tooling

# Runbook: Container Registry Pull Failure
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | ImagePullBackOff pods, or image_pull_errors_total > 0 |
| Severity | P1 (if blocking deployment), P2 (if single pod) |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, container registry credentials, ability to update Kubernetes secrets |
## Quick Assessment (30 seconds)
```shell
# Run this first — it tells you the scope of the problem
kubectl get pods -A | grep -E "ImagePull|ErrImage"
```
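If anything shows up, a quick per-namespace count tells you whether this is one deployment or cluster-wide, which changes severity and escalation. A small sketch, assuming the same `kubectl` access as above (the helper function name is ours):

```shell
# Sketch: summarise image-pull failures per namespace.
count_by_namespace() {
  # Input: "kubectl get pods -A --no-headers" lines; output: "<count> <namespace>"
  grep -E "ImagePull|ErrImage" | awk '{print $1}' | sort | uniq -c
}

# Only query the cluster if kubectl is actually available here.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -A --no-headers | count_by_namespace
fi
```

One namespace affected usually means a bad tag or a missing secret in that namespace; every namespace affected points at the registry or cluster networking.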
## Step 1: Confirm the Error and Read the Message
Why: ImagePullBackOff can mean several different things — missing image tag, wrong registry URL, expired credentials, or network issue. Reading the actual error message tells you which one it is.
```shell
# Get the exact error message for the failing pod
kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A10 "Failed\|Error\|Pull\|Back-off"

# If you don't know the pod name, find it first:
kubectl get pods -n <NAMESPACE>
# Then describe the one in ImagePullBackOff state
```
One of these messages will appear:

- `Failed to pull image "<IMAGE>:<TAG>": rpc error: ... manifest unknown` → the image tag does not exist in the registry
- `Failed to pull image "<IMAGE>:<TAG>": ... unauthorized: authentication required` → credentials are wrong or missing
- `Failed to pull image "<IMAGE>:<TAG>": ... connection refused` → the registry is unreachable (network issue)
- `Back-off pulling image "<IMAGE>:<TAG>"` → Kubernetes is retrying after a previous failure (check earlier Events for the root cause)
Tip: `kubectl get pods -n <NAMESPACE> | grep -v Running` lists every non-running pod.
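Rather than describing pods one at a time, you can loop over every failing pod and pull out just the relevant events. A sketch, assuming the same access as above (the filter function is ours):

```shell
# Sketch: print the last pull-related events for every failing pod in one pass.
pull_failures() {
  # Filter a pod listing down to image-pull failures (kept separate for testability).
  grep -E "ImagePullBackOff|ErrImagePull"
}

if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -A --no-headers | pull_failures |
  while read -r ns pod _rest; do
    echo "=== ${ns}/${pod} ==="
    # The last few Failed/Back-off events usually contain the root cause.
    kubectl describe pod "$pod" -n "$ns" | grep -E "Failed|Back-off" | tail -n 3
  done
fi
```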
## Step 2: Verify the Image Tag Exists in the Registry
Why: A common cause of pull failures is deploying a tag that was never pushed, or using the wrong tag name (e.g., latest when the registry uses versioned tags).
```shell
# Check if the image and tag exist using docker manifest inspect (does not pull the image):
docker manifest inspect <REGISTRY_HOST>/<IMAGE_NAME>:<TAG>

# If docker CLI is not available, check the registry UI directly:
# - DockerHub: https://hub.docker.com/r/<IMAGE_NAME>/tags
# - AWS ECR: aws ecr describe-images --repository-name <REPO_NAME> --image-ids imageTag=<TAG>
# - GCR: gcloud container images list-tags gcr.io/<PROJECT>/<IMAGE>
# - GHCR: https://github.com/<ORG>/<REPO>/pkgs/container/<IMAGE>

# AWS ECR example:
aws ecr describe-images --repository-name <REPO_NAME> --image-ids imageTag=<TAG> --region <REGION>
```
- `docker manifest inspect` returns JSON with image details — the tag exists.
- `aws ecr describe-images` returns `imageDetails` with the matching tag — the tag exists.
- If the tag does NOT exist, you will see `no such manifest`, `manifest unknown`, or `Error: No such image`.
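The manifest check can be wrapped so its exit code becomes an explicit verdict, which is handy in scripts or CI gates. A small sketch (the `check_tag` function name is ours, not a standard tool):

```shell
# Sketch: turn the docker manifest inspect exit code into a clear verdict.
check_tag() {
  # usage: check_tag <REGISTRY_HOST>/<IMAGE_NAME>:<TAG>
  if docker manifest inspect "$1" >/dev/null 2>&1; then
    echo "EXISTS: $1"
  else
    # A non-zero exit covers both "tag never pushed" and "no access to registry".
    echo "MISSING-OR-NO-ACCESS: $1"
  fi
}
```

For example, `check_tag registry.example.com/myapp:v1.2.3` before kicking off a deploy.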
## Step 3: Check the Image Pull Secret
Why: Kubernetes uses a pull secret to authenticate with private registries. If the secret is missing, expired, or has wrong credentials, all pulls to that registry will fail.
```shell
# Check if the pull secret exists in the namespace:
kubectl get secret -n <NAMESPACE> | grep docker

# Inspect the secret contents (decode and pretty-print):
kubectl get secret <PULL_SECRET_NAME> -n <NAMESPACE> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | python3 -m json.tool

# Check the deployment spec to see which pull secret it references:
kubectl get deployment <DEPLOY_NAME> -n <NAMESPACE> \
  -o jsonpath='{.spec.template.spec.imagePullSecrets}'
```
The decoded secret should contain an "auths" block with your registry hostname:
```json
{
  "auths": {
    "<REGISTRY_HOST>": {
      "username": "<USERNAME>",
      "password": "<PASSWORD>",
      "email": "<EMAIL>",
      "auth": "<BASE64_ENCODED_USER:PASS>"
    }
  }
}
```
- If the secret is missing: `kubectl get secret` returns no rows.
- If the registry hostname is wrong: the "auths" block uses a different hostname.
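A frequent miss here is that the secret exists but is keyed to a different registry hostname than the one the image reference uses. A small sketch to compare the two (both helper names are ours):

```shell
# Sketch: does the decoded .dockerconfigjson mention the image's registry host?
registry_host_of() {
  # First path segment of an image ref, e.g. ghcr.io/org/app:v1 -> ghcr.io.
  # (Bare names like "nginx:latest" have no host and default to DockerHub.)
  echo "${1%%/*}"
}

secret_has_host() {
  # usage: kubectl get secret ... | base64 -d | secret_has_host <REGISTRY_HOST>
  if grep -q "\"$1\""; then
    echo "host present in auths"
  else
    echo "host MISSING from auths"
  fi
}
```

If the host is missing from "auths", fixing the credentials alone will not help; the secret must be recreated for the right registry hostname.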
## Step 4: Test Registry Authentication Manually
Why: Before updating secrets, verify your credentials are valid. Updating with wrong credentials just replaces one broken secret with another.
```shell
# Test login from your local machine (not the cluster):
docker login <REGISTRY_HOST> -u <USERNAME> -p <PASSWORD>

# AWS ECR — get and test a fresh auth token:
aws ecr get-login-password --region <REGION> | docker login --username AWS \
  --password-stdin <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com

# GCR — authenticate using a service account key:
docker login -u _json_key --password-stdin https://gcr.io < <SERVICE_ACCOUNT_KEY_FILE>

# GHCR — authenticate with a PAT:
echo <GITHUB_PAT> | docker login ghcr.io -u <GITHUB_USERNAME> --password-stdin
```
## Step 5: Update the Pull Secret in Kubernetes
Why: Once you have valid credentials, update the Kubernetes secret so the cluster can pull images again.
```shell
# Create or update the docker-registry secret (--dry-run + apply is safe to run even if it exists):
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_HOST> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  --docker-email=<EMAIL> \
  -n <NAMESPACE> \
  --dry-run=client -o yaml | kubectl apply -f -

# AWS ECR — the auth token expires every 12 hours, so you may need to automate this.
# For a one-time fix, get a fresh token and create the secret:
kubectl create secret docker-registry ecr-pull-secret \
  --docker-server=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com \
  --docker-username=AWS \
  --docker-password=$(aws ecr get-login-password --region <REGION>) \
  -n <NAMESPACE> \
  --dry-run=client -o yaml | kubectl apply -f -
```
- `secret/<SECRET_NAME> configured` — the secret already existed and was updated
- `secret/<SECRET_NAME> created` — the secret was new
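Because ECR tokens expire every 12 hours, the one-time fix above will recur unless the secret is refreshed on a schedule (for example from a CronJob). A hedged sketch of such a refresh script, where the secret name `ecr-pull-secret` and both helper names are our assumptions:

```shell
#!/usr/bin/env bash
# Sketch: refresh an ECR pull secret; intended to run on a schedule shorter than 12h.
set -euo pipefail

ecr_server() {
  # usage: ecr_server <ACCOUNT_ID> <REGION>  -> the ECR registry hostname
  echo "$1.dkr.ecr.$2.amazonaws.com"
}

refresh_ecr_secret() {
  # usage: refresh_ecr_secret <ACCOUNT_ID> <REGION> <NAMESPACE>
  local server
  server="$(ecr_server "$1" "$2")"
  kubectl create secret docker-registry ecr-pull-secret \
    --docker-server="$server" \
    --docker-username=AWS \
    --docker-password="$(aws ecr get-login-password --region "$2")" \
    -n "$3" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Run from a CronJob (or CI job) with IAM access to `ecr:GetAuthorizationToken` and RBAC rights to update secrets in the target namespace.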
## Step 6: Restart Pods to Trigger a Fresh Pull
Why: Pods with ImagePullBackOff will not automatically retry after you fix the secret — you need to restart the deployment to trigger a new pull.
```shell
# Restart the deployment (triggers a rolling restart — no downtime if replicas > 1):
kubectl rollout restart deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Watch the rollout:
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Check that pods are no longer in ImagePullBackOff:
kubectl get pods -n <NAMESPACE> -w
```
- `deployment.apps/<DEPLOY_NAME> restarted`
- `Waiting for deployment rollout to finish...`
- `deployment "<DEPLOY_NAME>" successfully rolled out`
- All pods show status `Running` with containers `READY`.
If pods are still in ImagePullBackOff after the restart, return to Step 1 — the error message may have changed after the secret update. If pods start but then crash, see `../kubernetes/crashloopbackoff.md`.
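Instead of eyeballing the `-w` output, you can poll until the namespace is clean or a timeout hits. A sketch (the counting helper is ours; substitute `<NAMESPACE>` before running):

```shell
# Sketch: poll until no pods in the namespace are stuck on image pulls.
count_pull_failures() {
  # Count image-pull failures in a pod listing (0 means clean).
  grep -cE "ImagePullBackOff|ErrImagePull" || true
}

if command -v kubectl >/dev/null 2>&1; then
  for _ in $(seq 1 30); do   # roughly 5 minutes at 10s per attempt
    n=$(kubectl get pods -n "<NAMESPACE>" --no-headers 2>/dev/null | count_pull_failures)
    if [ "${n:-0}" -eq 0 ]; then
      echo "clean"
      break
    fi
    sleep 10
  done
fi
```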
## Verification
Success looks like: the Quick Assessment command returns no output — zero pods in ImagePullBackOff or ErrImagePull state across all namespaces.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Infra on-call | "P1: Container registry pull failure blocking deployments in <NAMESPACE>" |
| Registry appears down | Platform/Infra on-call | "Registry <REGISTRY_HOST> unreachable from the cluster — possible registry outage" |
| Security incident | Security on-call | "Security incident: container registry credentials may have been compromised — unauthorized pulls detected" |
| Scope expanding (all namespaces) | Platform/Infra on-call | "Cluster-wide ImagePullBackOff across all namespaces — possible registry outage or cluster networking issue" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- If root cause was expired ECR tokens: set up automated token rotation (see platform team docs)
- If root cause was a missing secret in a new namespace: update namespace provisioning automation to include pull secrets
- Document which registries require pull secrets and their rotation schedule
## Common Mistakes
- Not checking if the image tag actually exists: The most common cause of `ImagePullBackOff` is deploying a tag that was never pushed. Always check the registry before suspecting credentials.
- Credential expiry (tokens rotate): AWS ECR tokens expire every 12 hours. If this happens repeatedly, you need automated token rotation — a one-time fix is not enough.
- Wrong registry URL: Private registries need the full hostname (e.g., `my-registry.example.com:5000/image:tag`). Using just `image:tag` tells Kubernetes to look on DockerHub.
- Updating the secret but forgetting to restart pods: Kubernetes does not automatically restart pods when you update a secret — you must trigger a rollout.
- Testing from your local machine but not the cluster: Your laptop may be able to reach the registry but the cluster may have a network policy or VPC routing issue blocking access. If local works but cluster doesn't, check network policies and security groups.
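The last mistake above can be checked directly by curling the registry's `/v2/` endpoint from inside the cluster. A sketch using a throwaway pod: `/v2/` is the standard Docker Registry HTTP API root, while the `curlimages/curl` image and the helper name are our choices:

```shell
# Sketch: test registry reachability from inside the cluster, not your laptop.
interpret_v2_status() {
  # usage: interpret_v2_status <HTTP_CODE>
  case "$1" in
    200|401) echo "reachable (401 just means auth is required)" ;;
    000)     echo "no connection: network policy / DNS / routing issue" ;;
    *)       echo "reachable but unexpected status: $1" ;;
  esac
}

if command -v kubectl >/dev/null 2>&1; then
  # Run curl in a short-lived pod so the request originates from cluster networking.
  code=$(kubectl run registry-check --rm -i --restart=Never \
    --image=curlimages/curl -- \
    curl -s -o /dev/null -w '%{http_code}' "https://<REGISTRY_HOST>/v2/")
  interpret_v2_status "$code"
fi
```

If this says "no connection" while `docker login` works from your laptop, the registry is fine and the problem is cluster-side networking.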
## Cross-References
- Topic Pack: `training/library/topic-packs/cicd-fundamentals/` (deep background on container registries and image management)
- Related Runbook: deploy-rollback.md — if you need to roll back to a previously-pulled image
- Related Runbook: build-failure-triage.md — if the image tag was never built
- Related Runbook: `../kubernetes/imagepullbackoff.md` — Kubernetes-specific diagnosis