Thinking Out Loud: ArgoCD & GitOps
A senior SRE's internal monologue while working through a real GitOps issue. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation
ArgoCD shows 5 applications as "OutOfSync" even though no one has made changes to the Git repos or the clusters. The team is hesitant to hit "Sync" because last time they synced an unexpectedly OutOfSync app, it reverted a manual hotfix that was applied during an incident.
The Monologue
Five apps OutOfSync with no intentional changes. This is drift — either the cluster state changed (manual kubectl edits, admission webhooks, operator reconciliation) or the Git source changed (dependency updates, Helm chart version bumps). Let me figure out which.
argocd app list --output json | jq '.[] | select(.status.sync.status == "OutOfSync") | {name: .metadata.name, health: .status.health.status, sync: .status.sync.status}'
Five apps: payment-service, user-service, feature-flags, logging-agent, cert-manager. Mixed set — some are our apps, some are infrastructure. Let me check what the diff looks like for each.
The diff shows... the replicas field changed. Git says replicas: 3, the cluster has replicas: 5. Someone scaled up the payment-service manually (probably during the Black Friday traffic spike last week) and never updated the Git source. If we sync, ArgoCD will scale it back down to 3. That could cause an outage if we're still seeing elevated traffic.
Mental Model: Git Is the Desired State, Not Always the Current Truth
In GitOps, the Git repo is the desired state, and ArgoCD enforces it. But manual changes (hotfixes, emergency scaling, debugging patches) create a gap between desired and actual state. Before syncing, you must decide: is the Git state correct (sync to Git), or is the cluster state correct (update Git to match the cluster)? Blindly syncing reverts every manual change to a field that Git manages, which is often the WRONG thing to do.
Let me check each OutOfSync app's diff to understand whether Git or the cluster has the "correct" state.
for app in payment-service user-service feature-flags logging-agent cert-manager; do
  echo "=== $app ==="
  # argocd app diff exits non-zero when differences exist; that is expected here
  argocd app diff "$app" 2>&1 | head -10
  echo
done
Results:
- payment-service: replicas 3 (Git) vs 5 (cluster) — manual scale-up, probably intentional
- user-service: a new annotation kubectl.kubernetes.io/last-applied-configuration was added — this is just metadata noise
- feature-flags: ConfigMap has an extra key TEMP_FLAG_EMERGENCY_BYPASS — manual hotfix from last incident
- logging-agent: image tag differs, Git has v1.9.0, cluster has v1.9.1 — someone manually upgraded
- cert-manager: CRD field ordering changed — this is just YAML serialization noise, not a real change
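The triage above can be roughed out as a first-pass filter. This is a minimal sketch, assuming the diff text marks changed lines with `<`, `>`, `+` or `-`; `classify_drift` and its category names are hypothetical, and the patterns only cover the kinds of drift seen in these five apps.

```shell
# classify_drift: reads diff text on stdin, prints one coarse category.
# The patterns are illustrative assumptions drawn from the five cases above,
# not an exhaustive or official taxonomy.
classify_drift() {
  diff_text=$(cat)
  if [ -z "$diff_text" ]; then
    echo "no-diff"
  elif printf '%s\n' "$diff_text" | grep -q 'last-applied-configuration'; then
    echo "metadata-noise"
  elif printf '%s\n' "$diff_text" | grep -Eq '^[<>+-] *replicas:'; then
    echo "manual-scaling"
  elif printf '%s\n' "$diff_text" | grep -Eq '^[<>+-] *(- )?image:'; then
    echo "image-change"
  else
    echo "needs-review"
  fi
}

# Usage against a live app (requires the argocd CLI and a logged-in session):
#   argocd app diff payment-service | classify_drift
```

Anything that lands in `needs-review` (such as the extra ConfigMap key) still needs a human to read the diff; the filter only sorts the obvious cases.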
Let me handle these one at a time.
payment-service: I need to check if the scale-up is still needed.
CPU at 45% per pod. With 5 pods, that's healthy. If we drop to 3, each pod would be at 75%. That's too high for a payment service. The Git source needs to be updated, not the cluster.
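The 75% figure assumes total CPU demand stays constant and spreads evenly across pods. A minimal sketch of that arithmetic follows; `projected_cpu` is a hypothetical helper name, not part of kubectl or ArgoCD.

```shell
# projected_cpu CURRENT_PCT CURRENT_REPLICAS TARGET_REPLICAS
# Hypothetical helper: estimates per-pod CPU% after a replica change,
# assuming total load is constant and evenly balanced across pods.
projected_cpu() {
  current_pct=$1
  current_replicas=$2
  target_replicas=$3
  if [ "$target_replicas" -le 0 ]; then
    echo "target replicas must be positive" >&2
    return 1
  fi
  # Integer arithmetic; rounds down.
  echo $(( current_pct * current_replicas / target_replicas ))
}

projected_cpu 45 5 3   # prints 75
```

Real traffic is rarely perfectly balanced, so treat the result as a lower bound on the busiest pod.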
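# Update Git to match the cluster
cd /tmp && git clone git@github.com:our-org/k8s-manifests.git && cd k8s-manifests
# Anchor the match so values like "replicas: 30" are left alone
sed -i 's/replicas: 3$/replicas: 5/' apps/payment-service/deployment.yaml
git --no-pager diff   # confirm only the intended line changed
git add apps/payment-service/deployment.yaml && git commit -m "fix: update payment-service replicas to 5 (post-incident capacity)"
git push origin main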
user-service: The last-applied-configuration annotation is kubectl metadata noise. This happens when someone runs kubectl apply directly instead of through ArgoCD. I can ignore this diff or set ArgoCD to ignore it.
# Add an ignoreDifferences rule to the Application spec.
# Deployments live in the "apps" API group.
# Note: a merge patch replaces any existing ignoreDifferences list.
kubectl patch application user-service -n argocd --type merge -p '{
  "spec": {
    "ignoreDifferences": [{
      "group": "apps",
      "kind": "Deployment",
      "jsonPointers": ["/metadata/annotations/kubectl.kubernetes.io~1last-applied-configuration"]
    }]
  }
}'
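The `~1` in the jsonPointers entry above is JSON Pointer (RFC 6901) escaping for a `/` inside a key name. A small sketch of that escaping, with `json_pointer_escape` as a hypothetical helper name:

```shell
# json_pointer_escape KEY
# Escapes one reference token per RFC 6901: "~" becomes "~0", then "/" becomes "~1".
# The order matters; escaping "/" first would corrupt the "~1" it produces.
json_pointer_escape() {
  printf '%s\n' "$1" | sed -e 's/~/~0/g' -e 's|/|~1|g'
}

json_pointer_escape 'kubectl.kubernetes.io/last-applied-configuration'
# prints kubectl.kubernetes.io~1last-applied-configuration
```

Prefix the result with the parent path (here `/metadata/annotations/`) to get the full pointer.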
Mental Model: ArgoCD Ignore Differences for Known Drift
Some drift is expected and permanent: annotations added by controllers, fields defaulted by the API server, or CRD field ordering changes. Use ArgoCD's ignoreDifferences to stop flagging these as OutOfSync. But be careful: ArgoCD stops reporting drift on every field you ignore. Only ignore fields that are genuinely not managed by Git.
feature-flags: The emergency bypass flag was added during an incident. I need to check with the team whether it's still needed or if it can be removed.
kubectl get configmap feature-flags-config -n feature-flags -o jsonpath='{.data.TEMP_FLAG_EMERGENCY_BYPASS}'
Value: true. Let me check when the incident was resolved... it's been 5 days. The flag should have been either committed to Git or removed by now. Let me check with the on-call.
For now, I'll leave it and tag it for follow-up. But I'll add it to Git as a tracked config, not a manual override.
logging-agent: Someone manually upgraded from v1.9.0 to v1.9.1. This is the riskiest one — if I sync, it'll downgrade back to v1.9.0. Let me check why they upgraded.
No recent commits to the logging-agent manifests. The person who upgraded didn't update Git. Let me check if v1.9.1 was a fix for something.
I'll ask in the team channel, but in the meantime, I'll update Git to v1.9.1 since that's what's running and presumably was upgraded for a reason.
cd /tmp/k8s-manifests
sed -i 's/v1\.9\.0/v1.9.1/' apps/logging-agent/daemonset.yaml
git add . && git commit -m "fix: update logging-agent to v1.9.1 (match cluster)"
git push origin main
cert-manager: CRD field ordering changes. This is a known ArgoCD annoyance with CRDs. The API server reorders fields in CRD specs, and ArgoCD sees it as drift. The fix is to ignore CRD diffs or use server-side diff.
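# Option 1 (Argo CD 2.10 or later): enable server-side diff for this Application
kubectl annotate application cert-manager -n argocd --overwrite \
  argocd.argoproj.io/compare-options=ServerSideDiff=true
# Option 2 (older versions): ignore the CRD spec entirely. This is broad and also hides real CRD changes.
kubectl patch application cert-manager -n argocd --type merge -p '{
  "spec": {
    "ignoreDifferences": [{
      "group": "apiextensions.k8s.io",
      "kind": "CustomResourceDefinition",
      "jsonPointers": ["/spec"]
    }]
  }
}'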
Let me check the sync status now.
After the Git updates propagate (ArgoCD polls Git every 3 minutes by default), the apps should show as Synced. Let me wait and verify.
sleep 180
argocd app list --output json | jq '.[] | select(.status.sync.status == "OutOfSync") | .metadata.name'
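A fixed `sleep 180` either waits too long or not long enough. A polling loop is one alternative; this is a sketch in which `wait_until` and `no_apps_out_of_sync` are hypothetical helpers. For a single app, `argocd app wait <app> --sync --timeout 300` is a built-in way to wait.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or attempts run out.
# Returns 0 on first success, 1 if every attempt failed.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}

# Succeeds when no app reports OutOfSync (requires argocd and jq).
no_apps_out_of_sync() {
  argocd app list --output json \
    | jq -e '[.[] | select(.status.sync.status == "OutOfSync")] | length == 0' >/dev/null
}

# Usage: poll every 15 seconds for up to 5 minutes.
#   wait_until 20 15 no_apps_out_of_sync && echo "all synced"
```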
All synced. But I need to address the process gap: people are making manual changes and not updating Git. This undermines the entire GitOps model.
I'll propose adding a webhook notification when apps go OutOfSync, so the team knows immediately when someone makes a manual change. And we should document the post-incident process for committing hotfixes to Git.
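As a rough illustration of that notification idea, the sketch below builds a webhook payload from a list of drifted app names. `format_drift_alert`, the message wording, and `WEBHOOK_URL` are all hypothetical. Argo CD's own notifications controller is the more usual home for this kind of alert; the script only shows the shape of it.

```shell
# format_drift_alert: reads app names (one per line) on stdin and prints a
# JSON body of the form {"text": "..."} on stdout. Prints nothing and returns 1
# when there are no names, so callers can skip the POST.
# Assumes app names contain no quotes or backslashes (true for Kubernetes names).
format_drift_alert() {
  names=$(grep -v '^$' | paste -sd, -)
  [ -n "$names" ] || return 1
  printf '{"text": "ArgoCD apps OutOfSync: %s"}\n' "$names"
}

# Usage sketch (WEBHOOK_URL is a placeholder for your chat or paging endpoint):
#   payload=$(argocd app list --output json \
#     | jq -r '.[] | select(.status.sync.status == "OutOfSync") | .metadata.name' \
#     | format_drift_alert) \
#     && curl -fsS -X POST -H 'Content-Type: application/json' -d "$payload" "$WEBHOOK_URL"
```

Run from a scheduled job, this would flag drift within one polling interval instead of whenever someone next opens the UI.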
What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Hit "Sync All" and let ArgoCD enforce Git state | Inspect each diff to determine whether Git or the cluster has the correct state | Blind syncing reverts manual hotfixes, scales down emergency capacity, and downgrades intentional upgrades |
| See "OutOfSync" and assume someone broke something | Categorize the drift: intentional manual change, metadata noise, operator behavior, or YAML serialization difference | Not all drift is bad — some is expected, some is a process gap, and some is actual misconfiguration |
| Only fix the immediate OutOfSync state | Also address the process gap (manual changes without Git commits) and add notifications | The OutOfSync is a symptom of a team process issue, not just a technical state |
| Not know about ignoreDifferences | Use it for known, permanent drift (annotations, CRD ordering) to reduce alert noise | Ignoring known non-issues lets you focus on actual drift |
Key Heuristics Used
- Diff Before Sync: Always inspect the ArgoCD diff before syncing. Determine whether Git is correct (sync the cluster) or the cluster is correct (update Git).
- Categorize Drift: Not all OutOfSync is equal. Separate intentional manual changes, metadata noise, and real misconfiguration before acting.
- GitOps Requires Process Discipline: GitOps tools enforce Git state. If people make manual changes without updating Git, the tools become adversaries instead of allies.
Cross-References
- Primer — GitOps principles, ArgoCD architecture, and the desired-state reconciliation loop
- Street Ops — ArgoCD CLI commands, diff inspection, and sync operations
- Footguns — Blind syncing over manual hotfixes, not using ignoreDifferences for known drift, and manual changes without Git commits