Investigation: Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation¶
Phase 1: Kubernetes Investigation (Dead End)¶
Check the pod events:
$ kubectl describe pod order-service-8c7d6e5f4-p2n8q -n prod | grep -A10 "Events"
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m default-scheduler Successfully assigned prod/order-service-8c7d6e5f4-p2n8q to worker-node-02
Normal Pulling 3m (x3 over 4m) kubelet Pulling image "registry.internal:5000/order-service:v2.14.0"
Warning Failed 3m (x3 over 4m) kubelet Failed to pull image: rpc error: code = Unknown desc = unauthorized: authentication required
Warning Failed 3m (x3 over 4m) kubelet Error: ErrImagePull
Normal BackOff 2m (x5 over 3m) kubelet Back-off pulling image
Warning Failed 2m (x5 over 3m) kubelet Error: ImagePullBackOff
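The exact error string in the Failed event narrows the cause: "unauthorized: authentication required" points at credentials, not a missing tag or an unreachable registry. A small triage sketch (the sample messages below are illustrative, not pulled from a live cluster) that classifies the failure mode from the event text:

```shell
#!/bin/sh
# Classify an image-pull failure from the kubelet event message.
# In a live cluster the message would come from, e.g.:
#   kubectl get events -n prod --field-selector reason=Failed
classify() {
  case "$1" in
    *"unauthorized: authentication required"*) echo "auth" ;;        # bad or expired registry creds
    *"manifest unknown"*|*"not found"*)        echo "missing-tag" ;; # image or tag does not exist
    *"connection refused"*|*"i/o timeout"*)    echo "network" ;;     # registry unreachable
    *)                                         echo "unknown" ;;
  esac
}

classify "rpc error: code = Unknown desc = unauthorized: authentication required"   # auth
classify "rpc error: code = NotFound desc = manifest unknown"                       # missing-tag
```

Here the events say "auth", so the next step is the imagePullSecret.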
Check the imagePullSecret:
$ kubectl get pod order-service-8c7d6e5f4-p2n8q -n prod \
-o jsonpath='{.spec.imagePullSecrets[*].name}'
regcred-order-service
$ kubectl get secret regcred-order-service -n prod
NAME TYPE DATA AGE
regcred-order-service kubernetes.io/dockerconfigjson 1 47d
The secret exists. Decode and test the credentials:
$ kubectl get secret regcred-order-service -n prod \
-o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
{
"auths": {
"registry.internal:5000": {
"username": "svc-order-service",
"password": "hvs.CAESIG...<truncated>",
"auth": "c3ZjLW9yZGVyLXNlcnZpY2U6aHZzLkNBRVNJRy4uLg=="
}
}
}
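The auth field in a dockerconfigjson is simply base64(username:password), so it can be cross-checked against the username and password fields to rule out a corrupted secret. A minimal sketch using the auth value from the output above:

```shell
#!/bin/sh
# The "auth" field of .dockerconfigjson is base64("username:password").
auth="c3ZjLW9yZGVyLXNlcnZpY2U6aHZzLkNBRVNJRy4uLg=="
decoded=$(printf '%s' "$auth" | base64 -d)

echo "$decoded"          # svc-order-service:hvs.CAESIG...
echo "${decoded%%:*}"    # username portion: svc-order-service
```

The decoded value matches the username and password fields, so the secret is internally consistent; the credential itself is what the registry rejects.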
# Test the credentials directly
$ curl -u "svc-order-service:hvs.CAESIG..." https://registry.internal:5000/v2/
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required"}]}
The registry rejects the credentials stored in the secret. Two details stand out: the secret is 47 days old, and the password carries an hvs. prefix. That prefix marks a HashiCorp Vault service token. This is not a static password; it is a Vault-issued dynamic credential, and dynamic credentials expire.
The Pivot¶
Check if other services use the same pattern:
$ kubectl get secrets -n prod -o name | grep regcred | while read s; do
age=$(kubectl get $s -n prod -o jsonpath='{.metadata.creationTimestamp}')
echo "$s created=$age"
done
secret/regcred-order-service created=2026-01-31T14:22:00Z
secret/regcred-payment-service created=2026-03-18T09:00:00Z
secret/regcred-inventory-service created=2026-03-18T09:00:00Z
secret/regcred-user-service created=2026-03-18T09:00:00Z
The other services' secrets were recreated yesterday, but the order-service secret is 47 days old: it was never refreshed.
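Secret age can be computed directly from the creationTimestamp values above. A sketch with "now" pinned to the day of the investigation (an assumed date, chosen so the arithmetic is reproducible; a live check would use date -u +%s instead):

```shell
#!/bin/sh
# Whole days between a creationTimestamp and a reference time.
# The reference time is pinned (assumed investigation date) for reproducibility.
age_days() {
  created=$(date -u -d "$1" +%s)
  now=$(date -u -d "$2" +%s)
  echo $(( (now - created) / 86400 ))
}

age_days "2026-01-31T14:22:00Z" "2026-03-19T15:00:00Z"   # order-service:   47
age_days "2026-03-18T09:00:00Z" "2026-03-19T15:00:00Z"   # payment-service:  1
```

Note that date -d is GNU-specific; BSD date would need -j -f instead.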
Phase 2: Security Investigation (Root Cause)¶
Check the Vault secret engine for registry credentials:
$ vault read sys/mounts/registry-creds/tune
Key                 Value
---                 -----
default_lease_ttl   720h    # 30 days
max_lease_ttl       1440h   # 60 days
$ vault list registry-creds/creds/
Keys
----
inventory-service
order-service
payment-service
user-service
$ vault read registry-creds/creds/order-service
Key              Value
---              -----
lease_id         registry-creds/creds/order-service/abc123
lease_duration   0s        # EXPIRED
renewable        false
username         svc-order-service
password         hvs.CAESIG...(expired)
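The lease arithmetic can be sanity-checked against the secret's age. Assuming the credential was issued when the Kubernetes Secret was created (2026-01-31, per the creationTimestamp above), a 720h lease would have lapsed in early March:

```shell
#!/bin/sh
# Lease expiry = issue time + lease TTL.
# Issue time assumed equal to the Secret's creationTimestamp.
issued="2026-01-31T14:22:00Z"
ttl_hours=720    # default_lease_ttl from sys/mounts/registry-creds/tune

expiry=$(( $(date -u -d "$issued" +%s) + ttl_hours * 3600 ))
date -u -d "@$expiry" +%Y-%m-%dT%H:%M:%SZ    # 2026-03-02T14:22:00Z
```

That expiry date is roughly 17 days before the investigation, which matches the lease_duration of 0s shown above.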
The Vault lease for the order-service registry credentials has expired: the credentials carry a 30-day TTL and were last rotated 47 days ago, so they lapsed roughly 17 days ago. The External Secrets Operator (ESO) is supposed to refresh them automatically:
$ kubectl get externalsecret order-service-regcred -n prod
NAME                    STORE   REFRESH   STATUS
order-service-regcred   vault   1h        SecretSyncedError
$ kubectl describe externalsecret order-service-regcred -n prod | grep -A5 "Status"
Status:
Conditions:
Message: could not get secret data from provider: vault: 403 permission denied
Reason: SecretSyncedError
Status: False
The ESO cannot access Vault because the Vault policy for the order-service was accidentally removed during a policy cleanup 17 days ago. The other services were not affected because their policies were in a different path.
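The 403 means the fix starts in Vault, not Kubernetes: the deleted ACL policy must be restored so ESO can read the credential path again. A minimal sketch of such a policy (the policy name is hypothetical, and the exact path and the auth role it attaches to depend on how ESO authenticates to Vault in this environment):

```hcl
# Hypothetical policy "eso-order-service-regcred": re-grants ESO
# read access to the order-service registry credentials.
path "registry-creds/creds/order-service" {
  capabilities = ["read"]
}
```

Something like vault policy write eso-order-service-regcred policy.hcl would install it, after which the policy must be re-attached to ESO's Vault auth role and the ExternalSecret allowed to resync on its next 1h refresh.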
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was a Kubernetes ImagePullBackOff (kubernetes_ops), the root cause was an expired Vault credential caused by a deleted Vault policy (security), and the fix requires updating the CI/CD pipeline's Vault configuration (devops_tooling). This pattern is common because dynamic secret management creates a dependency chain from the secrets engine (Vault), through synchronization (ESO), to Kubernetes Secrets: a break at any link in that chain surfaces as a Kubernetes deployment failure.
Root Cause¶
During a Vault policy cleanup 17 days ago, the ACL policy granting the External Secrets Operator access to the order-service's registry credentials was accidentally deleted. The ESO could no longer refresh the secret, and after the 30-day Vault lease expired, the Kubernetes Secret contained stale credentials that the registry rejected.