ArgoCD & GitOps — Street-Level Ops¶
Quick Diagnosis Commands¶
# Overall state: what's unhealthy or out of sync?
argocd app list -o wide | grep -vE "Synced.*Healthy"
# Specific app detail
argocd app get my-app
# What would change if I sync?
argocd app diff my-app
# Force a fresh reconciliation from Git
argocd app get my-app --refresh && argocd app diff my-app
# Application controller logs (where reconcile errors appear)
kubectl -n argocd logs -l app.kubernetes.io/name=argocd-application-controller --tail=100 -f
# Repo server logs (manifest rendering errors)
kubectl -n argocd logs -l app.kubernetes.io/name=argocd-repo-server --tail=100
# Recent sync events for an app
kubectl -n argocd get events --field-selector involvedObject.name=my-app --sort-by='.lastTimestamp'
Gotcha: App is "OutOfSync" But diff Shows Nothing¶
ArgoCD's diff algorithm strips some fields before comparing (e.g., managedFields, resourceVersion). Sometimes it flags a resource as OutOfSync because of a field ArgoCD wrote itself but tracks differently, or because a mutating admission webhook modifies the object after apply.
Rule: Check the sync result detail, not just the top-level status.
# See which resource is OutOfSync and why
argocd app get my-app    # per-resource sync status appears in the resource table
# Ignore specific fields that drift legitimately (e.g., injected sidecar)
# Add to Application spec:
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/template/spec/containers/0/image  # mutated by image updater
  - group: ""
    kind: ConfigMap
    name: my-config
    jqPathExpressions:
    - .data."generated-at"  # injected by init container
Gotcha: Prune Deletes Resources You Want to Keep¶
If prune: true is set and a resource disappears from Git (renamed file, refactoring), ArgoCD deletes it from the cluster on next sync.
Rule: Test prune behavior with --dry-run before enabling it in production. Annotate resources you must never lose with argocd.argoproj.io/sync-options: Prune=false so a Git refactor cannot delete them.
# See what would be pruned
argocd app sync my-app --dry-run --prune
# Prevent a specific resource from being pruned
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
Gotcha: Self-Heal Reverts a Manual Hotfix¶
War story: During a P1 incident, an on-call engineer ran kubectl set image to roll back a bad deployment. ArgoCD reverted the rollback 3 minutes later. The engineer did it again. ArgoCD reverted again. The loop continued for 15 minutes before someone realized self-heal was fighting the human. The post-mortem added "disable ArgoCD self-heal" to the incident response runbook.
The general case: any hotfix pushed with kubectl during an incident is reverted within minutes when selfHeal: true is set.
Rule: During an incident, either disable self-heal for the app or commit the fix to Git immediately. Prefer committing — the hotfix then becomes auditable.
# Disable self-heal by patching the Application spec directly
# (caveat: if the Application is itself managed by an app-of-apps with
#  self-heal, the parent will revert this patch — pause the parent too)
kubectl -n argocd patch app my-app --type=merge \
-p '{"spec":{"syncPolicy":{"automated":{"selfHeal":false}}}}'
# Better: commit the hotfix image to Git, then re-enable self-heal
# git commit -m "hotfix: bump image to v1.2.4-patch" && git push
Gotcha: Sync Wave Hook Job Fails but App Goes Healthy¶
A failed PreSync hook fails the sync operation, but the Application's health status is computed independently — the app can report Healthy while the last operation Failed. If you only monitor health, a failing migration Job goes unnoticed. A stale hook Job left over from a previous run can also mask a fresh failure.
Rule: Alert on the last operation result (visible in argocd app get), not just top-level health. Set argocd.argoproj.io/hook-delete-policy: BeforeHookCreation on hook Jobs so every sync gets a fresh run, and verify hook outcomes explicitly in PostSync checks.
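A hook Job wired this way might look like the following sketch (the image name and migrate command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # fresh Job each sync
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: ghcr.io/myorg/db-migrate:latest  # hypothetical image
        command: ["./migrate", "up"]            # hypothetical entrypoint
```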
# Check hook resource status manually
kubectl -n my-app get jobs -l argocd.argoproj.io/hook=PreSync
kubectl -n my-app logs job/db-migrate
Pattern: Promote Across Environments via Git¶
Use environment overlays in Kustomize or separate values files in Helm. Promotion = updating the image tag in the target environment's config.
gitops-repo/
├── apps/my-service/
│ ├── base/ ← shared manifests
│ ├── overlays/
│ │ ├── dev/ ← kustomization.yaml with dev image tag
│ │ ├── staging/ ← staging image tag
│ │ └── prod/ ← prod image tag
CI pipeline promotion script:
#!/bin/bash
# Promote image tag from staging to prod
set -euo pipefail
NEW_TAG="${1:?usage: promote.sh <tag>}"
cd gitops-repo
# Update prod overlay image tag
cd apps/my-service/overlays/prod
kustomize edit set image "ghcr.io/myorg/my-service:${NEW_TAG}"
cd - >/dev/null
git add apps/my-service/overlays/prod/kustomization.yaml
git commit -m "chore: promote my-service ${NEW_TAG} to prod"
git push
# Optionally trigger ArgoCD sync immediately instead of waiting for the poll interval
argocd app sync my-service-prod --timeout 120
argocd app wait my-service-prod --health --timeout 300
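For reference, the prod overlay the script edits might look like this minimal sketch (base path and image name follow the tree above):

```yaml
# apps/my-service/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
images:
- name: ghcr.io/myorg/my-service
  newTag: v1.2.3   # this is the field `kustomize edit set image` rewrites
```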
Pattern: Bootstrap a New Cluster in 5 Minutes¶
# 1. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl -n argocd wait --for=condition=available deploy/argocd-server --timeout=120s
# 2. Apply the root Application (App of Apps)
kubectl apply -f gitops-repo/root-app.yaml
# 3. Watch everything come up
argocd app list -w
# ArgoCD syncs root-app → creates child Applications → each child syncs its workloads
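A minimal root-app.yaml sketch for step 2, assuming the child Application manifests live under apps/ in the GitOps repo (repo URL and paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/gitops
    path: apps            # directory containing child Application manifests
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd     # child Applications must land in the argocd namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```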
Pattern: Locked Sync for Production Releases¶
Omit syncPolicy.automated entirely for prod, requiring explicit human sync. CI/CD triggers the sync behind an approval gate.
# prod Application — manual sync only
spec:
  syncPolicy: {}   # no `automated` block = manual sync required
                   # (beware: `automated: {}` would ENABLE auto-sync with defaults)
# CI pipeline: notify on diff, require approval, then sync
argocd app diff my-service-prod --exit-code # exits 1 if diff exists
# (Slack notification to #releases channel)
# (Human approves in Slack / CI gate)
argocd app sync my-service-prod --timeout 180
argocd app wait my-service-prod --health --timeout 300
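The approval gate can live in any CI system; one hedged sketch uses a GitHub Actions environment with required reviewers (workflow name, app name, and environment are placeholders, and the runner is assumed to have an authenticated argocd CLI):

```yaml
# .github/workflows/promote-prod.yaml (sketch)
name: promote-prod
on: workflow_dispatch
jobs:
  sync-prod:
    runs-on: ubuntu-latest
    environment: production   # reviewers configured on this environment form the gate
    steps:
    - name: Diff, then sync after approval
      run: |
        argocd app diff my-service-prod --exit-code || true   # show the diff in the job log
        argocd app sync my-service-prod --timeout 180
        argocd app wait my-service-prod --health --timeout 300
```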
Scenario: App Stuck in "Progressing" for 20+ Minutes¶
Debug clue: ArgoCD computes health with per-resource checks — built-in ones, or custom Lua scripts in the argocd-cm ConfigMap. If a custom resource (e.g., Istio VirtualService, cert-manager Certificate) is stuck in "Progressing," a health check is likely returning Progressing forever, or a custom check is missing for a CRD ArgoCD does not know how to evaluate.
Symptoms: argocd app get my-app shows Health: Progressing, Deployment shows available replicas, but ArgoCD won't go Healthy.
Diagnosis:
# Check what resource ArgoCD thinks is unhealthy
argocd app get my-app | grep -v Healthy
# Check if a hook is stuck
kubectl -n my-app get jobs
kubectl -n my-app logs job/postinstall-job
# Check Application controller
kubectl -n argocd logs -l app.kubernetes.io/name=argocd-application-controller | grep my-app
# Check repo server (manifest rendering issue?)
kubectl -n argocd logs -l app.kubernetes.io/name=argocd-repo-server | grep -i error
Common causes:
1. PostSync hook Job is running/failing — check job logs
2. Custom health check returning Progressing indefinitely — check argocd-cm Lua script
3. PVC in Pending state (no storage class) — kubectl get pvc -n my-app
4. Ingress not getting an IP (LB not provisioned) — check ingress status
Resolution:
# Force health status refresh
argocd app get my-app --hard-refresh
# If a hook is stuck and you need to unblock: delete it, ArgoCD will re-create next sync
kubectl -n my-app delete job stuck-hook-job
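If the culprit is a health check stuck in Progressing (cause 2 above), register or fix the check in argocd-cm under the key resource.customizations.health.<group>_<kind>. A sketch for cert-manager Certificates — adapt the condition logic to your CRD's status schema:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.cert-manager.io_Certificate: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, c in ipairs(obj.status.conditions) do
        if c.type == "Ready" and c.status == "True" then
          hs.status = "Healthy"
          hs.message = c.message
        end
      end
    end
    return hs
```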
Scenario: Repository Credentials Changed, Apps Stop Syncing¶
Symptoms: All apps from one repo go Unknown/OutOfSync, repo server logs show 401 Unauthorized or authentication required.
# Check repo connection
argocd repo list
argocd repo get https://github.com/myorg/gitops
# Update credentials
argocd repo add https://github.com/myorg/gitops \
--username git \
--password ghp_newtoken123 \
--upsert
# Or update the underlying Secret directly
kubectl -n argocd edit secret repo-secret-name
# Force resync after credential fix
argocd app list -o name | xargs -I{} argocd app get {} --refresh
Emergency: Mass OutOfSync After Bad Commit¶
Gotcha: If self-heal is enabled and the bad commit produces valid but wrong manifests (e.g., replicas: 0), ArgoCD will actively push the broken state to the cluster. Disabling automated sync must be your FIRST action, before even looking at the bad commit.
Someone pushed malformed YAML to the GitOps repo. All Applications trying to render manifests are failing. Self-heal is reverting live resources to broken state.
# 1. Immediately disable automated sync to stop the damage
#    (this loop hits every app — add `--project <name>` to the list if only some are affected)
for app in $(argocd app list -o name); do
kubectl -n argocd patch app $app --type=merge \
-p '{"spec":{"syncPolicy":{"automated":null}}}'
done
# 2. Revert the bad commit in Git
cd gitops-repo
git revert HEAD --no-edit
git push
# 3. Re-enable automated sync
for app in $(argocd app list -o name); do
argocd app set $app --sync-policy automated --auto-prune --self-heal
done
# 4. Force sync all
argocd app list -o name | xargs -I{} argocd app sync {} --timeout 120
Useful One-Liners¶
# List all apps not in Synced+Healthy state
argocd app list -o wide | grep -vE "Synced\s+Healthy"
# Get sync history for all apps
argocd app list -o name | xargs -I{} sh -c 'echo "=== {} ===" && argocd app history {}'
# Sync all apps in a project
argocd app list -p team-payments -o name | xargs -I{} argocd app sync {}
# Watch all apps until healthy
watch -n5 "argocd app list -o wide | grep -vE 'Synced.*Healthy'"
# Get all resources managed by an app
argocd app resources my-app
# Trigger hard refresh (re-clone repo, re-render manifests)
argocd app get my-app --hard-refresh
# Check ArgoCD server resource usage
kubectl -n argocd top pods
# Describe a specific managed resource
kubectl -n my-app describe deploy/my-app
# Export all Applications as YAML (a single valid multi-doc backup)
kubectl -n argocd get applications -o yaml > all-apps-backup.yaml
# Count apps per sync status
argocd app list -o wide | awk 'NR>1 {print $5}' | sort | uniq -c
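The column-position one-liners above are easy to get wrong. This self-contained sketch exercises the same awk pipeline against canned output, so it can be tested offline (the sample data stands in for the live CLI, and STATUS being column 5 is an assumption about the default layout):

```shell
#!/bin/bash
# Count apps per sync status; canned `argocd app list` output replaces a live cluster.
sample='NAME   CLUSTER     NAMESPACE  PROJECT  STATUS     HEALTH
app-a  in-cluster  default    default  Synced     Healthy
app-b  in-cluster  default    default  OutOfSync  Degraded
app-c  in-cluster  default    default  Synced     Healthy'
# Skip the header row, take the STATUS column, tally occurrences
echo "$sample" | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn
```

Swap `echo "$sample"` for `argocd app list -o wide` to run it against a real cluster.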
Quick Reference¶
- Cheatsheet: Gitops-Argocd