Investigation: Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation

Phase 1: Kubernetes Investigation (Dead End)

Check the pod events:

$ kubectl describe pod order-service-8c7d6e5f4-p2n8q -n prod | grep -A10 "Events"
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  4m                 default-scheduler  Successfully assigned prod/order-service-8c7d6e5f4-p2n8q to worker-node-02
  Normal   Pulling    3m (x3 over 4m)    kubelet            Pulling image "registry.internal:5000/order-service:v2.14.0"
  Warning  Failed     3m (x3 over 4m)    kubelet            Failed to pull image: rpc error: code = Unknown desc = unauthorized: authentication required
  Warning  Failed     3m (x3 over 4m)    kubelet            Error: ErrImagePull
  Normal   BackOff    2m (x5 over 3m)    kubelet            Back-off pulling image
  Warning  Failed     2m (x5 over 3m)    kubelet            Error: ImagePullBackOff

Check the imagePullSecret:

$ kubectl get pod order-service-8c7d6e5f4-p2n8q -n prod \
    -o jsonpath='{.spec.imagePullSecrets[*].name}'
regcred-order-service

$ kubectl get secret regcred-order-service -n prod
NAME                     TYPE                             DATA   AGE
regcred-order-service    kubernetes.io/dockerconfigjson   1      47d

The secret exists. Decode and test the credentials:

$ kubectl get secret regcred-order-service -n prod \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
{
  "auths": {
    "registry.internal:5000": {
      "username": "svc-order-service",
      "password": "hvs.CAESIG...<truncated>",
      "auth": "c3ZjLW9yZGVyLXNlcnZpY2U6aHZzLkNBRVNJRy4uLg=="
    }
  }
}

# Test the credentials directly
$ curl -u "svc-order-service:hvs.CAESIG..." https://registry.internal:5000/v2/
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required"}]}
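
Before concluding that the registry is rejecting a valid pair, it can help to confirm that the auth field encodes the same username and password shown next to it, since a hand-edited secret can leave the two out of sync. A minimal sketch, assuming a base64 that accepts -d (GNU coreutils); decode_auth is a hypothetical helper name, not part of kubectl or Docker:

```shell
# decode_auth: hypothetical helper, not part of any real CLI.
# Decodes the base64 "auth" field of a dockerconfigjson entry
# into its "username:password" form.
decode_auth() {
  printf '%s' "$1" | base64 -d
}

# The auth value from the secret above. In this write-up the password is
# redacted, so the decoded pair ends in a literal "...".
decode_auth "c3ZjLW9yZGVyLXNlcnZpY2U6aHZzLkNBRVNJRy4uLg=="; echo
```

If the decoded pair matches the username and password fields, the secret is internally consistent and the problem lies with the credential itself rather than with how the secret was assembled.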

The registry rejects the credentials stored in the secret. The secret is also 47 days old, which is worth a closer look. Check whether the password is a Vault token:

$ echo "hvs.CAESIG..." | head -c 10
hvs.CAESIG

The hvs. prefix identifies a HashiCorp Vault service token. This is not a static password; it is a Vault-issued dynamic credential, which means it carries a lease and can expire.
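
The prefix check can be made repeatable with a small classifier. This is a sketch only: vault_token_kind is a hypothetical helper, and matching on a prefix is a heuristic rather than proof that a string is a live token. It relies on the prefixes Vault has used since 1.10 (hvs. for service, hvb. for batch, hvr. for recovery) and on the older s., b., and r. forms.

```shell
# vault_token_kind: hypothetical helper that classifies a string by the
# prefix conventions Vault uses for tokens. A match is only a hint.
vault_token_kind() {
  case "$1" in
    hvs.*) echo "service" ;;
    hvb.*) echo "batch" ;;
    hvr.*) echo "recovery" ;;
    s.*|b.*|r.*) echo "legacy" ;;   # prefixes used before Vault 1.10
    *) echo "unknown" ;;
  esac
}

vault_token_kind "hvs.CAESIG..."   # prints: service
```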

The Pivot

Check if other services use the same pattern:

$ kubectl get secrets -n prod -o name | grep regcred | while read -r s; do
    created=$(kubectl get "$s" -n prod -o jsonpath='{.metadata.creationTimestamp}')
    echo "$s created=$created"
done
secret/regcred-order-service created=2026-01-31T14:22:00Z
secret/regcred-payment-service created=2026-03-18T09:00:00Z
secret/regcred-inventory-service created=2026-03-18T09:00:00Z
secret/regcred-user-service created=2026-03-18T09:00:00Z

The other services' credentials were refreshed yesterday. The order-service credential is 47 days old, so it missed that refresh.
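
To turn the creation timestamps into ages, a helper like the following can be used. It is a sketch that assumes GNU date (the -d flag); days_between is a hypothetical name, and the reference time of 2026-03-19T14:22:00Z is an assumption chosen to match "yesterday" in the listing above.

```shell
# days_between: hypothetical helper. Prints the whole days elapsed between
# two RFC 3339 timestamps. Assumes GNU date; BSD/macOS date needs other flags.
days_between() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 86400 ))
}

# order-service secret, measured against the assumed reference time
days_between "2026-01-31T14:22:00Z" "2026-03-19T14:22:00Z"   # prints: 47
```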

Phase 2: Security Investigation (Root Cause)

Check the Vault secret engine for registry credentials:

$ vault read sys/mounts/registry-creds/tune
Key                  Value
---                  -----
default_lease_ttl    720h    # 30 days
max_lease_ttl        1440h   # 60 days

$ vault list registry-creds/creds/
Keys
----
inventory-service
order-service
payment-service
user-service

$ vault read registry-creds/creds/order-service
Key                Value
---                -----
lease_id           registry-creds/creds/order-service/abc123
lease_duration     0s      # EXPIRED
renewable          false
username           svc-order-service
password           hvs.CAESIG...(expired)
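
The mount's default_lease_ttl of 720h is 30 days, while the Kubernetes Secret is 47 days old. A short sketch of that arithmetic, assuming the credential was issued when the Secret was created; days_past_lease is a hypothetical helper:

```shell
# days_past_lease: hypothetical helper. Given a credential age in days and a
# lease TTL in hours, prints how many whole days the credential has outlived
# its lease, or 0 if it is still within the lease.
days_past_lease() {
  local age_days=$1 ttl_hours=$2
  local ttl_days=$(( ttl_hours / 24 ))
  if [ "$age_days" -gt "$ttl_days" ]; then
    echo $(( age_days - ttl_days ))
  else
    echo 0
  fi
}

days_past_lease 47 720   # 720h = 30 days, so this prints: 17
```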

The Vault lease for the order-service registry credentials has expired. The credentials have a 30-day TTL and were last rotated 47 days ago. The External Secrets Operator (ESO) is supposed to refresh these credentials automatically:

$ kubectl get externalsecret order-service-regcred -n prod
NAME                      STORE    REFRESH   STATUS
order-service-regcred     vault    1h        SecretSyncedError

$ kubectl describe externalsecret order-service-regcred -n prod | grep -A5 "Status"
Status:
  Conditions:
    Message:  could not get secret data from provider: vault: 403 permission denied
    Reason:   SecretSyncedError
    Status:   False
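
To see whether any other ExternalSecret is failing in the same way, the table output can be filtered by its STATUS column. This is a sketch that assumes the four-column layout shown above (NAME, STORE, REFRESH, STATUS); other ESO versions may print additional columns, so the field number may need adjusting. unsynced_externalsecrets is a hypothetical helper.

```shell
# unsynced_externalsecrets: hypothetical helper. Reads the table printed by
# `kubectl get externalsecret` on stdin and prints the names of rows whose
# fourth column (STATUS in the layout above) is not SecretSynced.
unsynced_externalsecrets() {
  awk 'NR > 1 && $4 != "SecretSynced" { print $1 }'
}

# Intended use (requires cluster access):
#   kubectl get externalsecret -n prod | unsynced_externalsecrets
```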

ESO cannot read from Vault because the Vault policy for order-service was accidentally removed during a policy cleanup 17 days ago. The other services were not affected because their policies lived under a different path.
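
Restoring access means re-creating a policy that grants read on the order-service credentials path and making sure it is attached to whatever identity ESO authenticates as. The deleted policy's original contents are not shown in this investigation, so the following is a minimal sketch: render_registry_policy and the policy name eso-order-service-registry are assumptions, and only the path comes from the vault output above.

```shell
# render_registry_policy: hypothetical helper that prints a minimal Vault ACL
# policy granting read on one service's registry credentials path.
render_registry_policy() {
  cat <<EOF
path "registry-creds/creds/$1" {
  capabilities = ["read"]
}
EOF
}

render_registry_policy order-service

# Applying it requires a Vault token that may write policies. The policy
# name below is an assumption, not the name of the deleted policy:
#   render_registry_policy order-service | vault policy write eso-order-service-registry -
```

Generating the policy from the service name keeps each service's policy identical in shape, which makes an accidental omission easier to spot in review.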

Domain Bridge: Why This Crossed Domains

Key insight: The symptom was a Kubernetes ImagePullBackOff (kubernetes_ops), the root cause was an expired Vault credential caused by a deleted Vault policy (security), and the fix requires updating the CI/CD pipeline's Vault configuration (devops_tooling). This pattern is common because dynamic secret management creates a dependency chain that runs from the secrets engine (Vault), through the synchronizer (ESO), to the Kubernetes Secret. A break at any link surfaces as a Kubernetes deployment failure.

Root Cause

During a Vault policy cleanup 17 days ago, the ACL policy granting the External Secrets Operator access to the order-service's registry credentials was accidentally deleted. The ESO could no longer refresh the secret, and after the 30-day Vault lease expired, the Kubernetes Secret contained stale credentials that the registry rejected.