
Portal | Level: L2: Operations | Topics: GitOps | Domain: DevOps & Tooling

Scenario: Config Drift Detected in Production

The Prompt

"Our GitOps tool (ArgoCD) shows the production deployment as 'OutOfSync'. Someone apparently ran kubectl commands directly against production. The app is working fine but the declared state doesn't match. How do you handle this?"

Initial Report

ArgoCD alert: "Application grokdevops is OutOfSync. Live state differs from Git. Detected manual changes to Deployment spec. Last synced 2 hours ago."

Constraints

  • Time pressure: You have 15 minutes before the next escalation. The drift may mask a larger issue or be an active hotfix.
  • Limited access: You have read access to the cluster and Git repo. Force-syncing ArgoCD requires approval from the tech lead. You cannot identify who made the change without audit logs.
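Even without audit logs, read access gives one attribution hint: the Deployment's managedFields record which client ("field manager") last wrote each field, and a manual `kubectl edit` or `kubectl scale` shows up under a different manager than the GitOps controller. A minimal sketch, using the app and namespace names from the alert:

```shell
# Read-only attribution hint: list the field managers recorded on the
# Deployment. A manager like "kubectl-edit" or "kubectl-client-side-apply"
# points to a manual change rather than the GitOps controller.
kubectl get deploy grokdevops -n grokdevops \
  --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.time}{"\n"}{end}'
```

This narrows the change to a tool and a timestamp, not a person, but that is often enough to ask the right team.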

Observable Evidence

  • ArgoCD UI: The Deployment resource shows a yellow "OutOfSync" badge. Diff view shows spec.replicas changed from 3 to 5 and an extra environment variable DEBUG=true added.
  • Dashboard: Application health is green — the app is functioning normally despite the drift.
  • Logs: No recent CI/CD pipeline runs; the change was made via direct kubectl access.
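If the argocd CLI is available with the same read access, the diff the UI shows can be captured from a terminal as well (a sketch; the app name is taken from the alert):

```shell
# Print the declared-vs-live diff for the application. The command exits
# non-zero when the app is OutOfSync, so guard it when used in scripts.
argocd app diff grokdevops || true
```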

Expected Investigation Path

# 1. Identify what drifted
kubectl get deploy grokdevops -n grokdevops -o yaml > /tmp/live-state.yaml
helm get manifest grokdevops -n grokdevops > /tmp/declared-state.yaml
diff /tmp/declared-state.yaml /tmp/live-state.yaml

# 2. Check for common drift patterns
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.replicas}'  # Manual scaling?
kubectl get deploy grokdevops -n grokdevops -o json | jq '.spec.template.spec.containers[0].env'  # Injected env vars?

# 3. Decide: keep the drift or revert?
# If intentional (hotfix): codify it in values, commit, let GitOps sync
# If accidental: revert via sync

# 4. Reconcile through the GitOps tool (sync requires tech lead approval)
argocd app sync grokdevops
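The diff in step 1 can be illustrated without a cluster. The two files below are hypothetical stand-ins for the `helm get manifest` and `kubectl get` outputs, reduced to the replica field the ArgoCD diff flagged:

```shell
# Stand-in for the declared state (what Git/Helm says): 3 replicas.
cat > /tmp/declared-state.yaml <<'EOF'
spec:
  replicas: 3
EOF

# Stand-in for the live state after the manual kubectl change: 5 replicas.
cat > /tmp/live-state.yaml <<'EOF'
spec:
  replicas: 5
EOF

# diff exits 1 when the files differ, so append || true in scripts.
diff /tmp/declared-state.yaml /tmp/live-state.yaml || true
```

In the real investigation the same `<`/`>` lines pinpoint exactly which fields were touched, which is what drives the keep-or-revert decision in step 3.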

Strong Answer

"First: don't panic. The app works, so this isn't an outage. Config drift means the live cluster state diverged from what's declared in Git. I'd diff the live state against the Helm manifest to identify exactly what changed. Common drift: someone manually scaled replicas, added env vars for debugging, or patched a resource directly. Once identified, there are two paths: if the change was an intentional hotfix, I'd codify it in the values file, commit to Git, and let the GitOps pipeline sync it properly. If it was accidental, I'd just trigger a sync to revert to declared state. Either way, I'd then review team processes — we should have guardrails (RBAC, admission webhooks, or policy agents) to prevent direct kubectl changes in production. GitOps only works if Git is the single source of truth."
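The guardrail point in the answer can be spot-checked immediately: `kubectl auth can-i` reports whether a given subject could repeat the direct change. A sketch (the `dev-user` name is a hypothetical example, and `--as` needs impersonation permission, so this may have to run from an admin context):

```shell
# Check whether a hypothetical non-admin user could patch the Deployment
# or its scale subresource directly; "no" means RBAC already blocks this
# class of drift, "yes" means the process review has a concrete finding.
kubectl auth can-i patch deployments -n grokdevops --as=dev-user
kubectl auth can-i update deployments/scale -n grokdevops --as=dev-user
```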

Common Traps

  • Force-syncing without understanding the drift — the drift might be an active hotfix keeping prod running
  • Not investigating who/why — drift is a process failure signal
  • Ignoring the meta-problem — this should trigger an RBAC/process review
  • Not knowing the diff tools: comparing helm get manifest output against the live state

Resources

  • Lab: training/interactive/runtime-labs/lab-runtime-07-gitops-sync-and-drift/
  • Doc: training/library/guides/gitops-example.md
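Once the team agrees on process, ArgoCD itself can revert out-of-band changes automatically. A sketch of enabling automated sync with self-heal via the CLI (app name from the scenario; this changes production behavior, so it belongs behind the same approval as a force-sync):

```shell
# Enable automated sync plus self-heal: ArgoCD re-applies the Git state
# whenever live state drifts, turning "OutOfSync" into a self-correcting
# event rather than an alert that waits on a human.
argocd app set grokdevops --sync-policy automated --self-heal
```

The trade-off: self-heal would have silently reverted the hotfix in this scenario, which is exactly why the keep-or-revert decision has to become a Git commit first.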
