
Runbook: Container Registry Pull Failure

Domain: CI/CD
Alert: ImagePullBackOff pods, or image_pull_errors_total > 0
Severity: P1 (if blocking deployment), P2 (if single pod)
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 20 minutes (page if not resolved)
Last Tested: 2026-03-19
Prerequisites: kubectl access, container registry credentials, ability to update Kubernetes secrets

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get pods -A | grep -E "ImagePull|ErrImage"
If output shows many pods across multiple namespaces → this is a cluster-wide registry outage or credential expiry. Skip to Step 3 to check credentials immediately.
If output shows a single pod or one namespace → this is likely a tag-not-found or namespace-specific secret issue. Proceed to Step 1.
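To judge scope at a glance, the grep output can be grouped by namespace. A small sketch (summarize_pull_failures is a hypothetical helper, assuming the default `kubectl get pods -A` column layout with the namespace first):

```shell
# Count failing pods per namespace from `kubectl get pods -A` output.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
summarize_pull_failures() {
  grep -E "ImagePull|ErrImage" \
    | awk '{count[$1]++} END {for (ns in count) printf "%s %d\n", ns, count[ns]}' \
    | sort -k2,2 -rn
}

# Usage against a live cluster:
# kubectl get pods -A --no-headers | summarize_pull_failures
```

Many namespaces in the output points to a cluster-wide cause; a single namespace points to a local one.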

Step 1: Confirm the Error and Read the Message

Why: ImagePullBackOff can mean several different things — missing image tag, wrong registry URL, expired credentials, or network issue. Reading the actual error message tells you which one it is.

# Get the exact error message for the failing pod
kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A10 "Failed\|Error\|Pull\|Back-off"

# If you don't know the pod name, find it first:
kubectl get pods -n <NAMESPACE>
# Then describe the one in ImagePullBackOff state
Expected output:
One of these messages will appear:
  "Failed to pull image "<IMAGE>:<TAG>": rpc error: ... manifest unknown"
    → The image tag does not exist in the registry
  "Failed to pull image "<IMAGE>:<TAG>": ... unauthorized: authentication required"
    → Credentials are wrong or missing
  "Failed to pull image "<IMAGE>:<TAG>": ... connection refused"
    → Registry is unreachable (network issue)
  "Back-off pulling image "<IMAGE>:<TAG>""
    → Kubernetes is retrying after a previous failure (check earlier Events for root cause)
If this fails: If the pod name is not known, use kubectl get pods -n <NAMESPACE> | grep -v Running to list non-running pods.
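The message-to-cause mapping above can be sketched as a small helper (the category names are illustrative labels, not standard Kubernetes terms):

```shell
# classify_pull_error: print a likely root cause for a pod event message.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*)                         echo "tag-not-found" ;;
    *"unauthorized"*|*"authentication required"*) echo "bad-credentials" ;;
    *"connection refused"*|*"i/o timeout"*)       echo "registry-unreachable" ;;
    *"Back-off pulling image"*)                   echo "retrying-check-earlier-events" ;;
    *)                                            echo "unknown" ;;
  esac
}

# Usage:
# classify_pull_error "$(kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep 'Failed to pull' | tail -n1)"
```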

Step 2: Verify the Image Tag Exists in the Registry

Why: A common cause of pull failures is deploying a tag that was never pushed, or using the wrong tag name (e.g., latest when the registry uses versioned tags).

# Check if the image and tag exist using docker manifest inspect (does not pull the image):
docker manifest inspect <REGISTRY_HOST>/<IMAGE_NAME>:<TAG>

# If the docker CLI is not available, check the registry directly (web UI or provider CLI):
# - DockerHub: https://hub.docker.com/r/<IMAGE_NAME>/tags
# - AWS ECR: aws ecr describe-images --repository-name <REPO_NAME> --image-ids imageTag=<TAG> --region <REGION>
# - GCR: gcloud container images list-tags gcr.io/<PROJECT>/<IMAGE>
# - GHCR: https://github.com/<ORG>/<REPO>/pkgs/container/<IMAGE>
Expected output:
docker manifest inspect returns JSON with image details → tag exists.
aws ecr describe-images returns imageDetails with the matching tag → tag exists.

If the tag does NOT exist, you will see:
  "no such manifest" or "manifest unknown" or "Error: No such image"
If this fails: If the tag does not exist, the issue is in your CI/CD pipeline — the image was never built and pushed. Fix the pipeline first (see build-failure-triage.md), then redeploy.
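For scripting (e.g., a pre-deploy gate in CI), the manifest check can be wrapped so its exit code drives the decision. A sketch, assuming the docker CLI is available (check_tag is a hypothetical helper):

```shell
# check_tag: exit 0 if the image tag exists in the registry, non-zero otherwise.
# `docker manifest inspect` queries the registry without pulling any layers.
check_tag() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# Usage:
# if check_tag <REGISTRY_HOST>/<IMAGE_NAME>:<TAG>; then
#   echo "tag exists"
# else
#   echo "tag missing: fix the CI pipeline before redeploying"
# fi
```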

Step 3: Check the Image Pull Secret

Why: Kubernetes uses a pull secret to authenticate with private registries. If the secret is missing, expired, or has wrong credentials, all pulls to that registry will fail.

# Check if the pull secret exists in the namespace:
kubectl get secret -n <NAMESPACE> | grep docker

# Inspect the secret contents (decode and pretty-print):
kubectl get secret <PULL_SECRET_NAME> -n <NAMESPACE> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | python3 -m json.tool

# Check the deployment spec to see which pull secret it references:
kubectl get deployment <DEPLOY_NAME> -n <NAMESPACE> \
  -o jsonpath='{.spec.template.spec.imagePullSecrets}'
Expected output:
The decoded secret should contain an "auths" block with your registry hostname:
{
  "auths": {
    "<REGISTRY_HOST>": {
      "username": "<USERNAME>",
      "password": "<PASSWORD>",
      "email": "<EMAIL>",
      "auth": "<BASE64_ENCODED_USER:PASS>"
    }
  }
}

If the secret is missing: kubectl get secret returns no rows.
If the registry hostname is wrong: the "auths" block uses a different hostname.
If this fails: If the secret does not exist at all, proceed to Step 5 to create it. If it exists but the hostname doesn't match your image's registry, update the secret with the correct registry.
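The hostname check can be done programmatically by inspecting the "auths" keys of the decoded secret. A sketch (secret_has_registry is a hypothetical helper, not a standard tool):

```shell
# secret_has_registry: read a decoded .dockerconfigjson on stdin and exit 0
# if the given registry hostname appears under "auths", non-zero otherwise.
secret_has_registry() {
  python3 -c '
import json, sys
cfg = json.load(sys.stdin)
sys.exit(0 if sys.argv[1] in cfg.get("auths", {}) else 1)
' "$1"
}

# Usage:
# kubectl get secret <PULL_SECRET_NAME> -n <NAMESPACE> \
#   -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d \
#   | secret_has_registry <REGISTRY_HOST> && echo "registry configured" || echo "missing or wrong hostname"
```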

Step 4: Test Registry Authentication Manually

Why: Before updating secrets, verify your credentials are valid. Updating with wrong credentials just replaces one broken secret with another.

# Test login from your local machine (not the cluster); --password-stdin keeps the
# password out of shell history and process listings:
echo <PASSWORD> | docker login <REGISTRY_HOST> -u <USERNAME> --password-stdin

# AWS ECR — get and test a fresh auth token:
aws ecr get-login-password --region <REGION> | docker login --username AWS \
  --password-stdin <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com

# GCR — authenticate using service account key:
docker login -u _json_key --password-stdin https://gcr.io < <SERVICE_ACCOUNT_KEY_FILE>

# GHCR — authenticate with a PAT:
echo <GITHUB_PAT> | docker login ghcr.io -u <GITHUB_USERNAME> --password-stdin
Expected output:
"Login Succeeded"
If this fails: If login fails, the credentials themselves are invalid or expired. Go to the registry's IAM/admin console and regenerate or rotate the token/password before proceeding to Step 5.

Step 5: Update the Pull Secret in Kubernetes

Why: Once you have valid credentials, update the Kubernetes secret so the cluster can pull images again.

# Create or update the docker-registry secret (--dry-run + apply is safe to run even if it exists):
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_HOST> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  --docker-email=<EMAIL> \
  -n <NAMESPACE> \
  --dry-run=client -o yaml | kubectl apply -f -

# AWS ECR — the auth token expires every 12 hours, so you may need to automate this.
# For a one-time fix, get a fresh token and create the secret:
kubectl create secret docker-registry ecr-pull-secret \
  --docker-server=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com \
  --docker-username=AWS \
  --docker-password=$(aws ecr get-login-password --region <REGION>) \
  -n <NAMESPACE> \
  --dry-run=client -o yaml | kubectl apply -f -
Expected output:
"secret/<SECRET_NAME> configured"   — if the secret already existed and was updated
"secret/<SECRET_NAME> created"      — if the secret was new
If this fails: If you get a permissions error running kubectl, you need cluster-level permissions to manage secrets. Ask a senior engineer or platform team member to run this step.
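If you want to sanity-check what `kubectl create secret docker-registry` will store before applying it, the .dockerconfigjson payload can be reproduced by hand. A minimal sketch (build_dockercfg is a hypothetical helper; kubectl additionally includes the email field when given):

```shell
# build_dockercfg: print the .dockerconfigjson payload a docker-registry secret
# stores for one registry. The "auth" field is base64("<USERNAME>:<PASSWORD>").
build_dockercfg() {
  host=$1; user=$2; pass=$3
  auth=$(printf '%s:%s' "$user" "$pass" | base64)
  printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}\n' \
    "$host" "$user" "$pass" "$auth"
}

# Usage:
# build_dockercfg <REGISTRY_HOST> <USERNAME> <PASSWORD> | python3 -m json.tool
```

Comparing this against the decoded cluster secret (Step 3) shows exactly which field differs.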

Step 6: Restart Pods to Trigger a Fresh Pull

Why: Pods with ImagePullBackOff will not automatically retry after you fix the secret — you need to restart the deployment to trigger a new pull.

# Restart the deployment (triggers a rolling restart — no downtime if replicas > 1):
kubectl rollout restart deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Watch the rollout:
kubectl rollout status deployment/<DEPLOY_NAME> -n <NAMESPACE>

# Check that pods are no longer in ImagePullBackOff:
kubectl get pods -n <NAMESPACE> -w
Expected output:
"deployment.apps/<DEPLOY_NAME> restarted"
"Waiting for deployment rollout to finish..."
"deployment "<DEPLOY_NAME>" successfully rolled out"
All pods show STATUS "Running" with all containers READY (e.g., 1/1).
If this fails: If pods are still in ImagePullBackOff after restart, return to Step 1 — the error message may have changed after the secret update. If pods start but then crash, see ../kubernetes/crashloopbackoff.md.

Verification

# Confirm the issue is resolved
kubectl get pods -A | grep -E "ImagePull|ErrImage"
Success looks like: no output, i.e., zero pods in ImagePullBackOff or ErrImagePull state across all namespaces.
If still broken: escalate (see Escalation below).

Escalation

Condition → Who to page: What to say
Not resolved in 20 min → Platform/Infra on-call: "P1: Container registry pull failure blocking deployments in <NAMESPACE>, credentials appear valid but pulls still failing"
Registry appears down → Platform/Infra on-call: "Registry is unreachable from the cluster — checking cloud provider status page"
Security incident → Security on-call: "Security incident: container registry credentials may have been compromised — unauthorized pulls detected"
Scope expanding (all namespaces) → Platform/Infra on-call: "Cluster-wide ImagePullBackOff across all namespaces — possible registry outage or cluster networking issue"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • If root cause was expired ECR tokens: set up automated token rotation (see platform team docs)
  • If root cause was a missing secret in a new namespace: update namespace provisioning automation to include pull secrets
  • Document which registries require pull secrets and their rotation schedule

Common Mistakes

  1. Not checking if the image tag actually exists: The most common cause of ImagePullBackOff is deploying a tag that was never pushed. Always check the registry before suspecting credentials.
  2. Credential expiry (tokens rotate): AWS ECR tokens expire every 12 hours. If this happens repeatedly, you need automated token rotation — a one-time fix is not enough.
  3. Wrong registry URL: Private registries need the full hostname (e.g., my-registry.example.com:5000/image:tag). Using just image:tag tells Kubernetes to look on DockerHub.
  4. Updating the secret but forgetting to restart pods: Kubernetes does not automatically restart pods when you update a secret — you must trigger a rollout.
  5. Testing from your local machine but not the cluster: Your laptop may be able to reach the registry but the cluster may have a network policy or VPC routing issue blocking access. If local works but cluster doesn't, check network policies and security groups.
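Mistake 3 is easy to check mechanically: container runtimes only treat the first path segment of an image reference as a registry host if it contains a dot or a colon, or is "localhost"; otherwise the pull goes to Docker Hub. A sketch of that resolution rule:

```shell
# image_registry: print the registry host a container runtime resolves for an
# image reference (defaults to docker.io when no hostname is present).
image_registry() {
  case "$1" in
    */*) ;;                          # has a path component; inspect its first segment below
    *)   echo "docker.io"; return 0 ;;
  esac
  first=${1%%/*}
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;
    *)                 echo "docker.io" ;;
  esac
}

# image_registry nginx:1.25                            # -> docker.io
# image_registry my-registry.example.com:5000/app:tag  # -> my-registry.example.com:5000
```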

Cross-References

  • Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on container registries and image management)
  • Related Runbook: deploy-rollback.md — if you need to roll back to a previously-pulled image
  • Related Runbook: build-failure-triage.md — if the image tag was never built
  • Related Runbook: ../kubernetes/imagepullbackoff.md — Kubernetes-specific diagnosis
