Portal | Level: L2: Operations | Topics: GitOps | Domain: DevOps & Tooling
Scenario: Config Drift Detected in Production
The Prompt
"Our GitOps tool (ArgoCD) shows the production deployment as 'OutOfSync'. Someone apparently ran kubectl commands directly against production. The app is working fine but the declared state doesn't match. How do you handle this?"
Initial Report
ArgoCD alert: "Application grokdevops is OutOfSync. Live state differs from Git. Detected manual changes to Deployment spec. Last synced 2 hours ago."
Constraints
- Time pressure: You have 15 minutes before the next escalation. The drift may mask a larger issue or be an active hotfix.
- Limited access: You have read access to the cluster and Git repo. Force-syncing ArgoCD requires approval from the tech lead. You cannot identify who made the change without audit logs.
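Naming the person does require audit logs, but the Deployment's metadata.managedFields still records which client (field manager) wrote to the object and when. The sketch below lists those entries with a read-only call. The helper names list_field_managers and manual_writers are hypothetical, and the assumption that manual edits show up under a manager name starting with "kubectl" depends on the client and version in use.

```shell
#!/bin/sh
# Sketch: list which field managers have written to the Deployment.
# App and namespace names are taken from the scenario.
APP=grokdevops
NS=grokdevops

# Print "manager<TAB>operation<TAB>time" for each managedFields entry.
list_field_managers() {
  kubectl get deploy "$APP" -n "$NS" --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'
}

# Keep only entries whose manager name starts with "kubectl"
# (assumption: direct edits were made with a kubectl client).
manual_writers() {
  list_field_managers | grep '^kubectl' || true
}

# Example (read-only):
#   manual_writers
```

This narrows down the tool and the time window, which is usually enough to ask the right people before the escalation deadline.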
Observable Evidence
- ArgoCD UI: The Deployment resource shows a yellow "OutOfSync" badge. Diff view shows spec.replicas changed from 3 to 5 and an extra environment variable DEBUG=true added.
- Dashboard: Application health is green: the app is functioning normally despite the drift.
- Logs: No recent CI/CD pipeline runs; the change was made via direct kubectl access.
Expected Investigation Path
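# 1. Identify what drifted (read-only)
kubectl get deploy grokdevops -n grokdevops -o yaml > /tmp/live-state.yaml
helm get manifest grokdevops -n grokdevops > /tmp/declared-state.yaml
# Expect noise: the live object carries status and server-defaulted fields, and the
# Helm manifest contains every resource in the release, not only the Deployment.
diff /tmp/declared-state.yaml /tmp/live-state.yaml

# 2. Check for common drift patterns
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.replicas}'   # Manual scaling?
kubectl get deploy grokdevops -n grokdevops -o json | jq '.spec.template.spec.containers[0].env'   # Injected env vars?

# 3. Decide: keep the drift or revert?
# If intentional (hotfix): codify it in the values file, commit, and let GitOps sync.
# If accidental: revert via sync (force-syncing needs tech lead approval).

# 4. Reconcile (modifies the cluster; requires write access and approval)
# Confirm the values file matches the target environment before running this against production.
helm upgrade grokdevops devops/helm/grokdevops -n grokdevops -f devops/helm/values-dev.yaml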
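Because the full-file diff in step 1 tends to be noisy, a narrower check can compare only the fields that the ArgoCD diff flagged. The following is a minimal sketch, not part of the lab material: report_drift is a hypothetical helper, and the declared values passed to it are assumed to be read from the values file in Git.

```shell
#!/bin/sh
# Sketch: compare a declared value with the live value of one Deployment field.
# App and namespace names are taken from the scenario.
APP=grokdevops
NS=grokdevops

# Usage: report_drift <label> <jsonpath> <declared-value>
report_drift() {
  label="$1"
  path="$2"
  declared="$3"
  # Read-only call: fetch a single field from the live Deployment.
  live="$(kubectl get deploy "$APP" -n "$NS" -o jsonpath="$path")"
  if [ "$live" = "$declared" ]; then
    echo "$label: in sync (${live:-none})"
  else
    echo "$label: DRIFT declared=${declared:-none} live=${live:-none}"
  fi
}

# Example (read-only), using the values from the ArgoCD diff view:
#   report_drift replicas '{.spec.replicas}' 3
#   report_drift DEBUG '{.spec.template.spec.containers[0].env[?(@.name=="DEBUG")].value}' ''
```

Checking one field at a time keeps the output short enough to paste into the escalation thread and avoids false positives from status and defaulted fields.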
Strong Answer
"First: don't panic. The app works, so this isn't an outage. Config drift means the live cluster state diverged from what's declared in Git. I'd diff the live state against the Helm manifest to identify exactly what changed. Common drift: someone manually scaled replicas, added env vars for debugging, or patched a resource directly. Once identified, there are two paths: if the change was an intentional hotfix, I'd codify it in the values file, commit to Git, and let the GitOps pipeline sync it properly. If it was accidental, I'd just trigger a sync to revert to declared state. Either way, I'd then review team processes — we should have guardrails (RBAC, admission webhooks, or policy agents) to prevent direct kubectl changes in production. GitOps only works if Git is the single source of truth."
Common Traps
- Force-syncing without understanding the drift: it might be an active hotfix keeping prod running.
- Not investigating who made the change and why: drift signals a process failure.
- Ignoring the meta-problem: this incident should trigger an RBAC and process review.
- Not knowing the diff tools: compare helm get manifest output against the live state.
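As a starting point for the RBAC review mentioned above, one option is to check which identities can still modify Deployments in the production namespace. The sketch below uses kubectl auth can-i with impersonation. The helper name audit_write_access and the subjects in the example are hypothetical, and impersonation itself requires a permission that a read-only account may not have, so a negative result can also mean the check could not run.

```shell
#!/bin/sh
# Sketch: report whether given subjects can patch Deployments in a namespace.
# Namespace name is taken from the scenario.
NS=grokdevops

# Usage: audit_write_access <subject>...
audit_write_access() {
  for subject in "$@"; do
    # `kubectl auth can-i` exits 0 for "yes" and non-zero for "no" or on error.
    if kubectl auth can-i patch deployments -n "$NS" --as "$subject" >/dev/null 2>&1; then
      echo "$subject: CAN patch deployments in $NS"
    else
      echo "$subject: cannot patch deployments in $NS (or check failed)"
    fi
  done
}

# Example (subjects are placeholders):
#   audit_write_access alice bob
```

Any human subject that can patch Deployments in production is a candidate for tighter RBAC or an admission policy.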
Practice and Links
- Lab: training/interactive/runtime-labs/lab-runtime-07-gitops-sync-and-drift/
- Doc: training/library/guides/gitops-example.md