Remediation: Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation

Immediate Fix (DevOps Tooling — Domain C)

The fix restores the missing Vault policy, forces the External Secrets Operator (ESO) to resync the registry credentials, and then restarts the stalled rollout.

Step 1: Restore the Vault policy

$ vault policy write eso-order-service - <<'EOF'
path "registry-creds/creds/order-service" {
  capabilities = ["read"]
}
path "registry-creds/data/order-service" {
  capabilities = ["read"]
}
EOF
Success! Uploaded policy: eso-order-service

# Attach the policy to the ESO's Vault role
$ vault write auth/kubernetes/role/external-secrets-operator \
    bound_service_account_names=external-secrets \
    bound_service_account_namespaces=external-secrets \
    policies=eso-default,eso-order-service,eso-payment-service,eso-inventory-service,eso-user-service \
    ttl=1h
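
The `policies` value passed to `vault write` replaces the role's current list rather than appending to it, so any policy left out of the command is detached from the role. A minimal sketch of a helper that adds one policy to a comma-separated list without duplicating it; `append_policy` is a hypothetical name, not part of the Vault CLI.

```shell
# Hypothetical helper (not a Vault command): add a policy name to a
# comma-separated list unless it is already present.
append_policy() {
  current=$1
  new=$2
  case ",$current," in
    *",$new,"*) printf '%s\n' "$current" ;;       # already in the list
    ",,")       printf '%s\n' "$new" ;;           # list was empty
    *)          printf '%s\n' "$current,$new" ;;  # append at the end
  esac
}

# Example with two of the policies from the role definition above.
append_policy "eso-default,eso-payment-service" "eso-order-service"
# prints: eso-default,eso-payment-service,eso-order-service
```

The result can be passed as the `policies=` value in the `vault write` command above.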

Step 2: Force the ExternalSecret to resync

$ kubectl annotate externalsecret order-service-regcred -n prod \
    force-sync=$(date +%s) --overwrite

# Wait for sync
$ kubectl get externalsecret order-service-regcred -n prod -w
NAME                      STORE    REFRESH   STATUS
order-service-regcred     vault    1h        SecretSynced
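
The `-w` watch runs until it is interrupted, which is awkward inside a script. A sketch of a bounded polling loop instead; `wait_until` and `es_is_synced` are hypothetical helper names, and the jsonpath query assumes the ExternalSecret reports a `Ready` condition under `.status.conditions`.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds or the
# attempt budget is used up.
# usage: wait_until <attempts> <sleep_seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical check: succeeds when the ExternalSecret's Ready condition is True.
# usage: es_is_synced <name> <namespace>
es_is_synced() {
  status=$(kubectl get externalsecret "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}
```

For example, `wait_until 24 5 es_is_synced order-service-regcred prod` polls every 5 seconds for up to 24 attempts and returns non-zero if the secret never syncs.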

Step 3: Verify the new credentials work

$ kubectl get secret regcred-order-service -n prod -o jsonpath='{.data.\.dockerconfigjson}' \
    | base64 -d | jq -r '.auths["registry.internal:5000"].password' \
    | head -c 20
hvs.CAESINxw2...

# Test authentication
$ kubectl get secret regcred-order-service -n prod -o jsonpath='{.data.\.dockerconfigjson}' \
    | base64 -d | jq -r '.auths["registry.internal:5000"] | .username + ":" + .password' \
    | xargs -I{} curl -u {} https://registry.internal:5000/v2/
{}  # Empty JSON = success
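
Checking the HTTP status code is less ambiguous than reading the response body. The Registry HTTP API v2 answers `GET /v2/` with 200 when the credentials are accepted and 401 when they are not. A sketch under that assumption; `registry_auth_verdict` and `check_registry_auth` are hypothetical helper names.

```shell
# Hypothetical helper: map the HTTP status of GET /v2/ to a verdict.
registry_auth_verdict() {
  case "$1" in
    200) echo "authenticated" ;;
    401) echo "unauthorized" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Hypothetical wrapper: $1 is "user:password", $2 is the registry base URL.
check_registry_auth() {
  code=$(curl -s -o /dev/null -w '%{http_code}' -u "$1" "$2/v2/")
  registry_auth_verdict "$code"
}
```

With the username and password extracted as in the command above, `check_registry_auth "$user:$pass" https://registry.internal:5000` should print `authenticated` once the rotated credentials are in place.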

Step 4: Restart the rollout

$ kubectl rollout restart deployment/order-service -n prod
deployment.apps/order-service restarted

$ kubectl rollout status deployment/order-service -n prod
Waiting for deployment "order-service" rollout to finish: 1 of 3 updated replicas are available...
deployment "order-service" successfully rolled out

Verification

Domain A (Kubernetes) — Deployment healthy

$ kubectl get pods -n prod -l app=order-service
NAME                             READY   STATUS    RESTARTS   AGE
order-service-9d8e7f6a5-k3m2n   1/1     Running   0          2m
order-service-9d8e7f6a5-j8p4q   1/1     Running   0          2m
order-service-9d8e7f6a5-h7r1s   1/1     Running   0          2m

Domain B (Security) — Vault policy active, ESO syncing

$ vault policy read eso-order-service
path "registry-creds/creds/order-service" {
  capabilities = ["read"]
}
path "registry-creds/data/order-service" {
  capabilities = ["read"]
}

$ kubectl get externalsecret order-service-regcred -n prod
NAME                      STORE    REFRESH   STATUS
order-service-regcred     vault    1h        SecretSynced

Domain C (DevOps Tooling) — Vault policy in IaC

# Ensure the policy is also in the Terraform/Vault IaC so it survives future cleanups
$ grep -r "eso-order-service" devops/terraform/modules/vault/
devops/terraform/modules/vault/policies.tf:  name   = "eso-order-service"
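
The grep confirms that the policy is declared in code, not that the live Vault matches it. `terraform plan -detailed-exitcode` exits 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error, which is enough for a simple drift check. A sketch; `plan_verdict` and `check_vault_drift` are hypothetical helper names, and the working directory to pass in depends on which root configuration consumes the vault module.

```shell
# Hypothetical helper: translate the exit code of
# `terraform plan -detailed-exitcode` into a drift verdict.
plan_verdict() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper: $1 is the Terraform working directory (an assumption;
# it must be a root configuration, not the module directory itself).
check_vault_drift() {
  rc=0
  terraform -chdir="$1" plan -detailed-exitcode -input=false >/dev/null || rc=$?
  plan_verdict "$rc"
}
```

A scheduled job could call `check_vault_drift` and page when the verdict is `drift`, which would surface a manually deleted policy before the next credential refresh fails.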

Prevention

  • Monitoring: Add an ExternalSecret sync status alert. Fire WARNING when any ExternalSecret has Status != SecretSynced for more than 2 hours.
    - alert: ExternalSecretSyncFailed
      # SecretSynced is the reason shown in the STATUS column; the ESO metric is labeled by condition type (Ready).
      expr: externalsecret_status_condition{condition="Ready",status="False"} == 1
      for: 2h
      labels:
        severity: warning

  • Runbook: Vault policy changes must be tested against all ESO ExternalSecrets before committing. Add a CI check that verifies all ExternalSecrets can authenticate to Vault.

  • Architecture: Define all Vault policies in Terraform/IaC so that a manual deletion shows up as drift on the next plan and is reverted on the next apply. On Vault Enterprise, Sentinel policies can additionally be used to prevent deletion of policies that are still referenced by active Kubernetes auth roles.
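
The CI check from the Runbook item can start as a sync-status gate. The sketch below does not test Vault authentication directly; it treats a non-synced ExternalSecret as a proxy for it. `unsynced_externalsecrets` is a hypothetical helper, and it assumes the column order NAMESPACE NAME STORE REFRESH STATUS, which can differ between ESO versions.

```shell
# Hypothetical CI helper: read `kubectl get externalsecret -A --no-headers`
# output on stdin, print namespace/name for every entry whose STATUS column
# is not SecretSynced, and exit non-zero if any were found.
# Assumes columns: NAMESPACE NAME STORE REFRESH STATUS.
unsynced_externalsecrets() {
  awk '$5 != "SecretSynced" { print $1 "/" $2; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}
```

In a pipeline this would run as `kubectl get externalsecret -A --no-headers | unsynced_externalsecrets`, failing the job and listing the affected secrets when a policy change breaks a sync.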