
Incident Replay: ImagePullBackOff — Registry Authentication Failure

Setup

  • System context: Kubernetes cluster pulling images from a private container registry (ECR). A new deployment fails to start — all pods stuck in ImagePullBackOff.
  • Time: Monday 08:30 UTC
  • Your role: On-call SRE / platform engineer

Round 1: Alert Fires

[Pressure cue: "New deployment of checkout-service stuck — 0 of 5 replicas available. Deploy pipeline is blocked. Product launch is today."]

What you see: kubectl get pods shows 5 pods in ImagePullBackOff. kubectl describe pod shows event: "Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: pull access denied, repository does not exist or may require authentication."

Choose your action:

  • A) Check if the image tag exists in the registry
  • B) Check the imagePullSecret attached to the pod
  • C) Manually pull the image on a node to test
  • D) Redeploy with the latest tag instead

[Result: kubectl get pod -o jsonpath='{.spec.imagePullSecrets}' shows the pod references the regcred secret. kubectl get secret regcred shows the secret was last updated 26 hours ago; ECR authorization tokens expire after 12 hours, so the stored credentials are stale. Proceed to Round 2.]

If you chose A:

[Result: Image exists in ECR — confirmed via aws ecr describe-images. The issue is authentication, not a missing image.]

If you chose C:

[Result: Manual pull on a node also fails with "no basic auth credentials." Confirms auth issue but does not identify the cause.]

If you chose D:

[Result: latest tag has the same auth problem — it is the credentials, not the tag.]
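The decode step on the correct path (B) can be sketched end to end. This is a minimal simulation assuming the standard .dockerconfigjson layout; the account ID, region, and auth value are placeholders, not values from the incident:

```shell
# Against a live cluster the first step would be:
#   kubectl get secret regcred -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# Here we simulate the stored payload to show the round trip.
DOCKERCONFIG='{"auths":{"123456789012.dkr.ecr.us-east-1.amazonaws.com":{"username":"AWS","auth":"QVdTOjx0b2tlbj4="}}}'
# Kubernetes stores .data values base64-encoded on a single line:
ENCODED=$(printf '%s' "$DOCKERCONFIG" | base64 | tr -d '\n')
# Decoding recovers the JSON the kubelet hands to the container runtime:
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
echo "$DECODED"
```

For ECR the username is always AWS and the auth value wraps a 12-hour token, which is why a stale secret fails with "pull access denied" rather than anything more descriptive.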

Round 2: First Triage Data

[Pressure cue: "ECR token expired. The automated token refresh should have handled this. Why did it fail?"]

What you see: The cluster has a CronJob (ecr-token-refresh) that runs every 10 hours to refresh the ECR auth token and update the regcred secret. kubectl get cronjob ecr-token-refresh shows the last successful run was 26 hours ago — 2 scheduled runs have failed.

Choose your action:

  • A) Manually refresh the token now and update the secret
  • B) Check the CronJob's failed job logs
  • C) Delete the CronJob and use IAM Roles for Service Accounts (IRSA) instead
  • D) Extend the ECR token lifetime

[Result: kubectl create secret docker-registry regcred --docker-server=... --docker-username=AWS --docker-password="$(aws ecr get-login-password)" --dry-run=client -o yaml | kubectl apply -f -. Token refreshed; pods begin pulling images. But you still need to fix the CronJob. Proceed to Round 3.]

If you chose B:

[Result: Failed job logs show "error: Unable to assume role — the security token included in the request is expired." The CronJob's own IAM credentials expired. Root cause found. Leads to Round 3.]

If you chose C:

[Result: IRSA is the right long-term architecture, but mid-incident is not the time to migrate. Fix the immediate issue first.]

If you chose D:

[Result: ECR tokens have a maximum lifetime of 12 hours. You cannot extend them.]
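The staleness check behind this round reduces to timestamp arithmetic. A sketch assuming GNU date; both timestamps are hypothetical stand-ins for what kubectl would report:

```shell
# Live you would fetch the real value:
#   LAST=$(kubectl get cronjob ecr-token-refresh -o jsonpath='{.status.lastSuccessfulTime}')
LAST="2024-01-01T06:30:00Z"   # hypothetical last successful refresh
NOW="2024-01-02T08:30:00Z"    # hypothetical current time (incident start)
# Hours elapsed since the last successful refresh:
AGE_H=$(( ($(date -u -d "$NOW" +%s) - $(date -u -d "$LAST" +%s)) / 3600 ))
echo "last successful refresh: ${AGE_H}h ago"
# ECR tokens live 12h, so anything older is guaranteed stale:
[ "$AGE_H" -ge 12 ] && echo "token expired"
```

With a 10-hour refresh schedule, 26 hours of silence means two consecutive runs failed, which is exactly what the CronJob status showed.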

Round 3: Root Cause Identification

[Pressure cue: "Images pulling. Fix the automation."]

What you see: Root cause: The CronJob's ServiceAccount assumes an IAM role by exchanging an OIDC token for temporary STS credentials. The cluster's OIDC issuer rotated its TLS certificate, so the certificate thumbprint no longer matched the one registered on the IAM OIDC provider. AWS rejected the token exchange, and the CronJob could not assume the IAM role to call ecr get-login-password.

Choose your action:

  • A) Update the OIDC provider thumbprint in AWS IAM
  • B) Recreate the IAM role trust policy
  • C) Switch to IRSA, which handles OIDC automatically
  • D) Fix the OIDC thumbprint and plan a migration to IRSA

[Result: OIDC thumbprint updated — CronJob can assume the IAM role again. Next scheduled run succeeds. Migration to IRSA planned for next sprint to eliminate the CronJob entirely. Proceed to Round 4.]

If you chose A:

[Result: Fixes the immediate issue but the CronJob architecture is fragile. IRSA is the better long-term solution.]

If you chose B:

[Result: Trust policy is correct — the issue is the OIDC certificate thumbprint, not the policy.]

If you chose C:

[Result: Right architecture but takes days to implement and test. Not an incident-time change.]
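The thumbprint fix can be sketched as follows. The certificate here is locally generated purely to demonstrate the fingerprint step; against the real issuer you would fetch the cert with openssl s_client, and the provider ARN is a placeholder:

```shell
# Live, against the real issuer host:
#   openssl s_client -servername "$ISSUER_HOST" -connect "$ISSUER_HOST:443" </dev/null \
#     | openssl x509 -fingerprint -sha1 -noout
# Demo: generate a throwaway cert and compute its SHA-1 thumbprint.
CERT=$(mktemp); KEY=$(mktemp)
openssl req -x509 -newkey rsa:2048 -nodes -keyout "$KEY" -out "$CERT" \
  -days 1 -subj "/CN=oidc.example" 2>/dev/null
# IAM expects 40 lowercase hex chars with no colons:
THUMBPRINT=$(openssl x509 -in "$CERT" -fingerprint -sha1 -noout \
  | cut -d= -f2 | tr -d ':' | tr 'A-F' 'a-f')
echo "$THUMBPRINT"
# Then register it on the provider (hypothetical ARN):
#   aws iam update-open-id-connect-provider-thumbprint \
#     --open-id-connect-provider-arn arn:aws:iam::123456789012:oidc-provider/oidc.example \
#     --thumbprint-list "$THUMBPRINT"
```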

Round 4: Remediation

[Pressure cue: "Deployment succeeded. Product launch proceeding."]

Actions:

  1. Verify all pods are Running: kubectl get pods
  2. Verify the CronJob runs successfully: kubectl create job --from=cronjob/ecr-token-refresh test-refresh
  3. Add alerting for CronJob failures
  4. Add alerting for ImagePullBackOff events
  5. Plan the IRSA migration to eliminate the token-refresh CronJob
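The pod verification step can be machine-checked rather than eyeballed. A sketch using hypothetical kubectl output in the standard column layout:

```shell
# Live:  STATUS=$(kubectl get pods -l app=checkout-service --no-headers)
# Here STATUS holds hypothetical output with the same columns
# (NAME, READY, STATUS, RESTARTS, AGE):
STATUS='checkout-service-7d9f-abcde   1/1   Running   0   3m
checkout-service-7d9f-fghij   1/1   Running   0   3m'
# Count pods whose third column (STATUS) is anything but Running:
NOT_RUNNING=$(printf '%s\n' "$STATUS" | awk '$3 != "Running" {n++} END {print n+0}')
echo "pods not Running: $NOT_RUNNING"
```

A zero count is a clean exit criterion for the incident; the same pattern, filtering for ImagePullBackOff, is a reasonable basis for the alerting in steps 3 and 4.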

Damage Report

  • Total downtime: 0 (new deployment blocked, existing services unaffected)
  • Blast radius: New deployment delayed 45 minutes; product launch timeline at risk
  • Optimal resolution time: 10 minutes (check secret -> manual refresh -> fix OIDC)
  • If every wrong choice was made: 3+ hours of wrong-path debugging and mid-incident architecture changes

Cross-References