
Incident Replay: ImagePullBackOff — Registry Authentication Failure

Setup

  • System context: Kubernetes cluster pulling images from a private container registry (ECR). A new deployment fails to start — all pods stuck in ImagePullBackOff.
  • Time: Monday 08:30 UTC
  • Your role: On-call SRE / platform engineer

Round 1: Alert Fires

[Pressure cue: "New deployment of checkout-service stuck — 0 of 5 replicas available. Deploy pipeline is blocked. Product launch is today."]

What you see: kubectl get pods shows 5 pods in ImagePullBackOff. kubectl describe pod shows event: "Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: pull access denied, repository does not exist or may require authentication."

Choose your action:

  • A) Check if the image tag exists in the registry
  • B) Check the imagePullSecret attached to the pod
  • C) Manually pull the image on a node to test
  • D) Redeploy with the latest tag instead

[Result: kubectl get pod -o jsonpath='{.spec.imagePullSecrets}' shows the pod references the regcred secret. kubectl get secret regcred shows the secret was last updated 26 hours ago; ECR authorization tokens expire after 12 hours, so the stored credentials are stale. Proceed to Round 2.]

If you chose A:

[Result: Image exists in ECR — confirmed via aws ecr describe-images. The issue is authentication, not a missing image.]

If you chose C:

[Result: Manual pull on a node also fails with "no basic auth credentials." Confirms auth issue but does not identify the cause.]

If you chose D:

[Result: latest tag has the same auth problem — it is the credentials, not the tag.]
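The decode step on the correct path (B) can be sketched end to end. This is a minimal simulation assuming the standard .dockerconfigjson layout; the account ID, region, and auth value are placeholders, not values from the incident:

```shell
# Against a live cluster the first step would be:
#   kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# Here we simulate the stored payload to show the round trip.
DOCKERCONFIG='{"auths":{"123456789012.dkr.ecr.us-east-1.amazonaws.com":{"username":"AWS","auth":"QVdTOjx0b2tlbj4="}}}'
# Kubernetes stores .data values base64-encoded on a single line:
ENCODED=$(printf '%s' "$DOCKERCONFIG" | base64 | tr -d '\n')
# Decoding recovers the JSON the kubelet hands to the container runtime:
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
echo "$DECODED"
```

For ECR the username is always AWS and the auth value wraps a 12-hour token, which is why a stale secret fails with "pull access denied" rather than anything more descriptive.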

Round 2: First Triage Data

[Pressure cue: "ECR token expired. The automated token refresh should have handled this. Why did it fail?"]

What you see: The cluster has a CronJob (ecr-token-refresh) that runs every 10 hours to refresh the ECR auth token and update the regcred secret. kubectl get cronjob ecr-token-refresh shows the last successful run was 26 hours ago — 2 scheduled runs have failed.

Choose your action:

  • A) Manually refresh the token now and update the secret
  • B) Check the CronJob's failed job logs
  • C) Delete the CronJob and use IAM Roles for Service Accounts (IRSA) instead
  • D) Extend the ECR token lifetime

[Result: kubectl create secret docker-registry regcred --docker-server=... --docker-username=AWS --docker-password="$(aws ecr get-login-password)" --dry-run=client -o yaml | kubectl apply -f -. Token refreshed; pods begin pulling images. But you still need to fix the CronJob. Proceed to Round 3.]

If you chose B:

[Result: Failed job logs show "error: Unable to assume role — the security token included in the request is expired." The CronJob's own IAM credentials expired. Root cause found. Leads to Round 3.]

If you chose C:

[Result: IRSA is the right long-term architecture, but mid-incident is not the time to migrate. Fix the immediate issue first.]

If you chose D:

[Result: ECR tokens have a maximum lifetime of 12 hours. You cannot extend them.]
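The staleness check behind this round reduces to timestamp arithmetic. A sketch assuming GNU date; both timestamps are hypothetical stand-ins for what kubectl would report:

```shell
# Live you would fetch the real value:
#   LAST=$(kubectl get cronjob ecr-token-refresh -o jsonpath='{.status.lastSuccessfulTime}')
LAST="2024-01-01T06:30:00Z"   # hypothetical last successful refresh
NOW="2024-01-02T08:30:00Z"    # hypothetical current time (incident start)
# Hours elapsed since the last successful refresh:
AGE_H=$(( ($(date -u -d "$NOW" +%s) - $(date -u -d "$LAST" +%s)) / 3600 ))
echo "last successful refresh: ${AGE_H}h ago"
# ECR tokens live 12h, so anything older is guaranteed stale:
[ "$AGE_H" -ge 12 ] && echo "token expired"
```

With a 10-hour refresh schedule, 26 hours of silence means two consecutive runs failed, which is exactly what the CronJob status showed.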

Round 3: Root Cause Identification

[Pressure cue: "Images pulling. Fix the automation."]

What you see: Root cause: The CronJob's ServiceAccount assumes an IAM role by exchanging an OIDC token for temporary STS credentials. The cluster's OIDC issuer rotated its TLS certificate, so the certificate thumbprint no longer matched the one registered on the IAM OIDC provider. AWS rejected the token exchange, and the CronJob could not assume the IAM role to call ecr get-login-password.

Choose your action:

  • A) Update the OIDC provider thumbprint in AWS IAM
  • B) Recreate the IAM role trust policy
  • C) Switch to IRSA, which handles OIDC automatically
  • D) Fix the OIDC thumbprint and plan a migration to IRSA

[Result: OIDC thumbprint updated — CronJob can assume the IAM role again. Next scheduled run succeeds. Migration to IRSA planned for next sprint to eliminate the CronJob entirely. Proceed to Round 4.]

If you chose A:

[Result: Fixes the immediate issue but the CronJob architecture is fragile. IRSA is the better long-term solution.]

If you chose B:

[Result: Trust policy is correct — the issue is the OIDC certificate thumbprint, not the policy.]

If you chose C:

[Result: Right architecture but takes days to implement and test. Not an incident-time change.]
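The thumbprint fix can be sketched as follows. The certificate here is locally generated purely to demonstrate the fingerprint step; against the real issuer you would fetch the cert with openssl s_client, and the provider ARN is a placeholder:

```shell
# Live, against the real issuer host:
#   openssl s_client -servername "$ISSUER_HOST" -connect "$ISSUER_HOST:443" </dev/null \
#     | openssl x509 -fingerprint -sha1 -noout
# Demo: generate a throwaway cert and compute its SHA-1 thumbprint.
CERT=$(mktemp); KEY=$(mktemp)
openssl req -x509 -newkey rsa:2048 -nodes -keyout "$KEY" -out "$CERT" \
  -days 1 -subj "/CN=oidc.example" 2>/dev/null
# IAM expects 40 lowercase hex chars with no colons:
THUMBPRINT=$(openssl x509 -in "$CERT" -fingerprint -sha1 -noout \
  | cut -d= -f2 | tr -d ':' | tr 'A-F' 'a-f')
echo "$THUMBPRINT"
# Then register it on the provider (hypothetical ARN):
#   aws iam update-open-id-connect-provider-thumbprint \
#     --open-id-connect-provider-arn arn:aws:iam::123456789012:oidc-provider/oidc.example \
#     --thumbprint-list "$THUMBPRINT"
```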

Round 4: Remediation

[Pressure cue: "Deployment succeeded. Product launch proceeding."]

Actions:

  1. Verify all pods are Running: kubectl get pods
  2. Verify the CronJob runs successfully: kubectl create job --from=cronjob/ecr-token-refresh test-refresh
  3. Add alerting for CronJob failures
  4. Add alerting for ImagePullBackOff events
  5. Plan the IRSA migration to eliminate the token-refresh CronJob
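The pod verification step can be machine-checked rather than eyeballed. A sketch using hypothetical kubectl output in the standard column layout:

```shell
# Live:  STATUS=$(kubectl get pods -l app=checkout-service --no-headers)
# Here STATUS holds hypothetical output with the same columns
# (NAME, READY, STATUS, RESTARTS, AGE):
STATUS='checkout-service-7d9f-abcde   1/1   Running   0   3m
checkout-service-7d9f-fghij   1/1   Running   0   3m'
# Count pods whose third column (STATUS) is anything but Running:
NOT_RUNNING=$(printf '%s\n' "$STATUS" | awk '$3 != "Running" {n++} END {print n+0}')
echo "pods not Running: $NOT_RUNNING"
```

A zero count is a clean exit criterion for the incident; the same pattern, filtering for ImagePullBackOff, is a reasonable basis for the alerting in steps 3 and 4.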

Damage Report

  • Total downtime: 0 (new deployment blocked, existing services unaffected)
  • Blast radius: New deployment delayed 45 minutes; product launch timeline at risk
  • Optimal resolution time: 10 minutes (check secret -> manual refresh -> fix OIDC)
  • If every wrong choice was made: 3+ hours of wrong-path debugging and mid-incident architecture changes

Cross-References