GitOps: The Repo Is the Truth
- lesson
- gitops-principles
- argocd-architecture
- ci/cd-pipeline-evolution
- kubernetes-reconciliation
- kustomize/helm
- drift-detection
- rbac
- multi-cluster-management

Topics: GitOps principles, ArgoCD architecture, CI/CD pipeline evolution, Kubernetes reconciliation, Kustomize/Helm, drift detection, RBAC, multi-cluster management
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (Git and Kubernetes concepts explained as we go)
The Mission¶
Your team has been deploying to Kubernetes the old-fashioned way: a GitHub Actions workflow
runs kubectl apply at the end of CI. It works — until it doesn't. Last Tuesday, someone
ran kubectl scale deploy/payments --replicas=5 during a traffic spike. The CI pipeline ran
two hours later and silently scaled it back to 3. The payments service buckled. Nobody
noticed for 40 minutes because the deploy "succeeded."
Your job: migrate from push-based CI deploys to ArgoCD-driven GitOps. By the end of this lesson, you'll understand why the old model breaks, how the new model works at every layer, and how to build it so drift can never sneak past you again.
Part 1: The Archaeology — How We Got Here¶
Before we touch ArgoCD, let's trace the history. The migration you're doing isn't arbitrary — it's the result of a decade of deployment evolution, and each step solved the previous step's problem.
The Timeline¶
| Era | How deploys worked | What broke |
|---|---|---|
| ~2010 | SSH into server, git pull, restart | "It works on my machine." No rollback. |
| ~2013 | Capistrano/Fabric scripts | Scripts diverge across teams. Credentials everywhere. |
| ~2015 | Jenkins pipeline → kubectl apply | CI has cluster creds. Drift accumulates silently. |
| ~2017 | GitOps: controller inside the cluster pulls from Git | You're reading this lesson. |
Name Origin: The term "GitOps" was coined by Alexis Richardson, CEO of Weaveworks, in a 2017 blog post titled "GitOps — Operations by Pull Request." The name stuck because it captured the core idea in two syllables: Git is your operations source of truth. Weaveworks built Flux, the first GitOps controller, before the term even existed — the tool came first, then they named the pattern.
Trivia: Weaveworks, the company that invented GitOps and built Flux, shut down in February 2024 after failing to secure funding. The movement they started continued to thrive through ArgoCD and the CNCF-hosted Flux project. The company died; the idea didn't.
Push vs Pull: The Architectural Shift¶
Here's the old model — what you're migrating away from:
Push model (traditional CI/CD):
Developer → git push → CI pipeline → builds image → kubectl apply → Cluster
↑
CI needs cluster credentials
No drift detection
No self-healing
And the new model:
Pull model (GitOps):
Developer → git push → CI pipeline → builds image → updates manifest repo
↓
ArgoCD (inside cluster) polls repo
↓
Compares desired vs live state
↓
Applies diff to cluster
Three things changed:
- Credentials flipped. In push, CI needs cluster credentials — every CI runner is an attack surface. In pull, the controller lives inside the cluster. Credentials never leave.
- Drift detection appeared. Push can't see manual changes. Pull continuously compares desired state (Git) to actual state (cluster) and flags — or fixes — the gap.
- Git became the audit log. Every deploy is a commit. Every rollback is a revert. `git log` is your change history.
Mental Model: Think of push-based CI as a mail carrier who drops off a package and leaves. If someone moves the package, the carrier doesn't know and doesn't care. GitOps is a security guard who checks the room every 3 minutes and puts everything back where it belongs.
Flashcard Check #1¶
| Question | Answer (cover and test yourself) |
|---|---|
| Who coined the term "GitOps" and when? | Alexis Richardson, CEO of Weaveworks, in 2017. |
| In push-based CI/CD, who holds cluster credentials? | The CI runner (Jenkins, GitHub Actions, etc.). |
| In pull-based GitOps, who holds cluster credentials? | The controller running inside the cluster (ArgoCD). |
| What are the four GitOps principles? | Declarative, Versioned and immutable, Pulled automatically, Continuously reconciled. |
Remember: Mnemonic for the four principles: DVPC — Declarative, Versioned, Pulled, Continuously reconciled. If any one is missing, it's not GitOps. CI pushing via `kubectl apply` is declarative and versioned, but not pulled or continuously reconciled.
Part 2: ArgoCD Architecture — What's Actually Running¶
Time to look under the hood. When you install ArgoCD, five (or six) components land in the
argocd namespace. Each has a specific job:
┌──────────────────────────────────────────────────────────┐
│ argocd namespace │
│ │
│ ┌─────────────────┐ ┌──────────────────────────────┐ │
│ │ argocd-server │ │ argocd-application-controller│ │
│ │ (API + Web UI) │ │ (the brain — diffs & syncs) │ │
│ └────────┬────────┘ └──────────────┬───────────────┘ │
│ │ │ │
│ ┌────────┴────────┐ ┌──────────────┴───────────────┐ │
│ │ argocd-dex- │ │ argocd-repo-server │ │
│ │ server │ │ (clones repos, renders │ │
│ │ (SSO/OIDC) │ │ Helm/Kustomize/YAML) │ │
│ └─────────────────┘ └──────────────┬───────────────┘ │
│ │ │
│ ┌─────────────┴────────────┐ │
│ │ Redis (caching layer) │ │
│ └──────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
| Component | What it does | What breaks if it dies |
|---|---|---|
| argocd-server | Serves the web UI and API. Handles auth, RBAC. | No UI, no CLI access. Apps keep running. |
| argocd-application-controller | The reconciliation brain. Diffs desired (Git) vs live (cluster). | No syncs, no drift detection. Existing apps still run. |
| argocd-repo-server | Clones Git repos, renders Helm charts/Kustomize overlays into plain YAML. | Can't render new manifests. Syncs stall. |
| argocd-dex-server | SSO/OIDC authentication provider. | Can't log in via SSO. Admin password still works. |
| Redis | Caches repo state, app state, RBAC data. | Slower performance. Controller re-fetches everything. |
Under the Hood: The application controller uses the same watch-and-reconcile loop as every other Kubernetes controller. It's the same pattern the kube-controller-manager uses to manage Deployments, the same pattern kubelet uses to manage pods. ArgoCD didn't invent this — it leveraged a pattern that Kubernetes was already built on. If you've read the what-happens-when-you-kubectl-apply lesson, this is the same control loop operating one layer up.
Name Origin: "Argo" comes from the ship Argo in Greek mythology — the vessel that carried Jason and the Argonauts on their quest. The Argo project family (ArgoCD, Argo Workflows, Argo Rollouts, Argo Events) was created at Applatix, later acquired by Intuit (the TurboTax company). ArgoCD was open-sourced in 2018 and became a CNCF graduated project in 2022.
Trivia: One of the original motivations for creating ArgoCD at Intuit was that Flux (the existing GitOps tool) lacked a graphical interface. Intuit engineers wanted a visual resource tree showing sync status. The ArgoCD UI became one of its most distinguishing features and a major driver of adoption.
Part 3: The Application Resource — Your First ArgoCD Manifest¶
An ArgoCD Application is a Kubernetes custom resource (defined by a CRD, a Custom Resource Definition) that answers three questions:
- Where is the desired state? (Git repo, branch, path)
- Where should it be deployed? (cluster, namespace)
- How should syncing behave? (automatic vs manual, prune, self-heal)
Here's the Application manifest for your payments service migration:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: payments-service
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io # cascade-delete on app deletion
spec:
project: default
source:
repoURL: https://github.com/acme-corp/gitops-manifests.git
targetRevision: main
path: apps/payments/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: payments
syncPolicy:
automated:
prune: true # delete resources removed from Git
selfHeal: true # revert manual kubectl changes
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
Let's break this down field by field:
| Field | What it controls | What happens if you get it wrong |
|---|---|---|
| targetRevision: main | Which branch/tag to track | HEAD in prod means every merge deploys immediately |
| path | Directory in the repo containing manifests | Wrong path = ArgoCD syncs nothing or the wrong app |
| prune: true | Delete cluster resources removed from Git | Rename a directory = ArgoCD deletes everything in it |
| selfHeal: true | Revert manual kubectl changes | Someone's emergency hotfix gets silently reverted |
| finalizers | Delete managed resources when the Application is deleted | Without it, deleting the app orphans resources in the cluster |
Gotcha: `prune: true` is the most dangerous setting in ArgoCD. A team renamed a Helm chart directory in their GitOps repo. ArgoCD treated every resource from the old path as orphaned and pruned them all — including a PostgreSQL StatefulSet with 500GB of data. The PVCs were deleted. Recovery required restoring from a 6-hour-old backup. Protect critical resources with the `argocd.argoproj.io/sync-options: Prune=false` annotation.
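Applied to a resource, the protection looks like this — a minimal sketch, with an illustrative StatefulSet name:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payments-postgres   # illustrative name
  annotations:
    # ArgoCD will never prune this resource, even if it disappears from Git
    argocd.argoproj.io/sync-options: Prune=false
```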
Part 4: The Reconciliation Loop — How ArgoCD Actually Works¶
This is the core of GitOps and the part most people hand-wave past. Let's trace exactly what happens every 3 minutes (the default polling interval):
Every 3 minutes:
1. Application controller checks: "which apps need sync?"
2. Repo server clones the Git repo (or uses cached version)
3. Repo server renders manifests:
- Plain YAML? Pass through.
- Helm chart? helm template with values files
- Kustomize? kustomize build on the overlay path
4. Controller compares rendered manifests to live cluster state
5. Comparison result:
├─ Synced: desired == live. Do nothing.
├─ OutOfSync: desired != live.
│ ├─ If automated sync: apply the diff.
│ └─ If manual sync: flag it, wait for human.
└─ Unknown: can't reach cluster or repo. Alert.
The comparison isn't a naive YAML diff. ArgoCD normalizes both sides — stripping
server-generated fields like creationTimestamp, resourceVersion, and
kubectl.kubernetes.io/last-applied-configuration. Getting this normalization right was one
of ArgoCD's hardest engineering challenges, and edge cases in drift detection remain the most
common source of bug reports.
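The idea behind the normalization step can be sketched in Python. This is a toy model, not ArgoCD's actual implementation — it just shows why stripping server-generated fields prevents false "OutOfSync" results:

```python
# Toy sketch of normalize-then-diff. The real controller handles far more
# edge cases (defaulted fields, admission-webhook mutations, etc.).
import copy

SERVER_GENERATED = ("creationTimestamp", "resourceVersion", "uid", "generation")
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"

def normalize(manifest: dict) -> dict:
    """Return a copy with server-generated metadata removed."""
    m = copy.deepcopy(manifest)
    meta = m.get("metadata", {})
    for field in SERVER_GENERATED:
        meta.pop(field, None)
    meta.get("annotations", {}).pop(LAST_APPLIED, None)
    return m

def sync_status(desired: dict, live: dict) -> str:
    return "Synced" if normalize(desired) == normalize(live) else "OutOfSync"

desired = {"kind": "Deployment", "metadata": {"name": "payments"},
           "spec": {"replicas": 3}}
live = {"kind": "Deployment",
        "metadata": {"name": "payments", "resourceVersion": "12345",
                     "creationTimestamp": "2024-01-01T00:00:00Z"},
        "spec": {"replicas": 3}}

print(sync_status(desired, live))  # Synced: only server-generated fields differ
live["spec"]["replicas"] = 5       # someone ran kubectl scale
print(sync_status(desired, live))  # OutOfSync: a real spec difference
```

A naive `desired == live` comparison would report drift on every app, because the API server always adds metadata that Git never contains.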
Under the Hood: The 3-minute polling interval is a deliberate design choice, not a limitation. ArgoCD polls Git rather than relying on webhooks for simplicity and reliability — webhooks can fail silently, be blocked by firewalls, or get rate-limited. You can configure webhooks for near-instant sync, but the polling ensures ArgoCD catches changes even when webhooks fail. Configure the interval in the
`argocd-cm` ConfigMap: `timeout.reconciliation: 180s`.
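As a sketch, shortening the poll interval to 60 seconds would look like this (the controller typically needs a restart to pick up the change):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 60s   # default is 180s
```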
The Diff in Action¶
Let's see what happens when someone runs kubectl scale directly:
# Someone scales the payments service manually during a traffic spike
kubectl -n payments scale deploy/payments --replicas=5
# ArgoCD sees the drift within 3 minutes
argocd app diff payments-service
Output:
===== apps/Deployment payments/payments =====
--- desired (Git)
+++ live (cluster)
@@ -1,5 +1,5 @@
spec:
- replicas: 3
+ replicas: 5
With selfHeal: true, ArgoCD reverts this within one reconciliation cycle. The payments
service goes back to 3 replicas. If that's not what you want, you need to update Git — not
the cluster.
War Story: This is exactly the incident from your mission. A team had `selfHeal: true` enabled. During Black Friday, an SRE scaled a service from 3 to 8 replicas. ArgoCD scaled it back to 3 within minutes. The SRE scaled it up again. ArgoCD scaled it back. This loop repeated four times before someone realized the "random scaling failures" were ArgoCD doing its job. The fix: commit the scale change to Git, or use an HPA (Horizontal Pod Autoscaler) and tell ArgoCD to ignore the replicas field.
# Tell ArgoCD to ignore HPA-managed fields
spec:
ignoreDifferences:
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas
Flashcard Check #2¶
| Question | Answer |
|---|---|
| How often does ArgoCD poll Git by default? | Every 3 minutes (180 seconds). |
| What does selfHeal: true do? | Reverts any manual changes to cluster state that diverge from Git. |
| What is the ArgoCD sync status when desired state equals live state? | Synced. |
| Why does ArgoCD normalize manifests before comparing? | To strip server-generated fields (creationTimestamp, resourceVersion, etc.) that would cause false diffs. |
| How do you prevent ArgoCD from fighting with an HPA? | Use ignoreDifferences with a jsonPointer to /spec/replicas. |
Part 5: Imperative vs Declarative — The Real Comparison¶
Let's make the old-vs-new contrast concrete with your payments service migration.
The Old Way: Push-Based CI¶
# .github/workflows/deploy.yml — the pipeline you're replacing
name: Deploy
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build image
run: docker build -t ghcr.io/acme-corp/payments:${{ github.sha }} .
- name: Push image
run: docker push ghcr.io/acme-corp/payments:${{ github.sha }}
- name: Deploy to cluster
env:
KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}
run: |
echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
kubectl --kubeconfig=/tmp/kubeconfig \
set image deploy/payments \
payments=ghcr.io/acme-corp/payments:${{ github.sha }} \
-n payments
Problems with this approach:
| Problem | Why it hurts |
|---|---|
| CI runner has KUBECONFIG secret | Compromised runner = cluster access |
| No drift detection | kubectl edit changes persist until next CI run |
| Rollback means re-running an old pipeline | Slow, error-prone, assumes old image still exists |
| No audit trail beyond CI logs | "Who deployed what when?" requires digging through pipeline logs |
| No health verification | Pipeline exits 0 as soon as kubectl set image returns |
The New Way: GitOps with ArgoCD¶
# .github/workflows/ci.yml — CI still builds, but does NOT deploy
name: CI
on:
push:
branches: [main]
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build image
run: docker build -t ghcr.io/acme-corp/payments:${{ github.sha }} .
- name: Push image
run: docker push ghcr.io/acme-corp/payments:${{ github.sha }}
- name: Update GitOps repo
run: |
git clone https://x-access-token:${{ secrets.GITOPS_TOKEN }}@github.com/acme-corp/gitops-manifests.git
cd gitops-manifests
cd apps/payments/overlays/production
kustomize edit set image ghcr.io/acme-corp/payments=ghcr.io/acme-corp/payments:${{ github.sha }}
git add .
git commit -m "deploy: payments ${{ github.sha }}"
git push
CI pushes the intent to a Git repo. ArgoCD picks it up and applies it. The pipeline never touches the cluster.
Interview Bridge: "What is the difference between GitOps and CI/CD?" is an increasingly common interview question. The key: in CI/CD, the pipeline pushes changes to the cluster (push model). In GitOps, an agent in the cluster pulls desired state from Git and continuously reconciles (pull model). CI never needs cluster credentials, and `git log` becomes the audit trail.
Part 6: Sync Policies, Waves, and Health Checks¶
Sync Waves — Ordering Your Deploy¶
When ArgoCD syncs, it doesn't apply everything at once. Sync waves let you control ordering:
Wave -2: Namespace, RBAC, ServiceAccount
Wave -1: ConfigMaps, Secrets
Wave 0: Database migration Job (PreSync hook)
Wave 1: Application Deployment
Wave 2: Ingress, HPA, PodDisruptionBudget
Annotate resources to assign them to waves:
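A minimal sketch — the wave number here matches the application Deployment's slot in the ordering above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # quoted: annotation values must be strings
```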
ArgoCD waits for all resources in wave N to be healthy before starting wave N+1. This is where health checks become critical.
Health Checks — What "Healthy" Means¶
ArgoCD has built-in health checks for common resources:
| Resource | Healthy when |
|---|---|
| Deployment | availableReplicas == desiredReplicas |
| StatefulSet | readyReplicas == replicas |
| Job | succeeded >= 1 |
| PVC | phase == Bound |
| Ingress | status.loadBalancer.ingress has at least one entry |
Gotcha: If your Deployment has no readiness probe, ArgoCD considers it healthy as soon as the pod is Running — even if the app hasn't finished initializing. A wave-1 app can start connecting to a wave-0 database that isn't ready yet. Always define readiness probes on ArgoCD-managed Deployments.
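A minimal readiness probe sketch — the endpoint path and port are illustrative, not from the payments service's actual spec:

```yaml
containers:
  - name: payments
    image: ghcr.io/acme-corp/payments:v2.1.0
    readinessProbe:
      httpGet:
        path: /healthz    # illustrative health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
```

Until this probe passes, the pod isn't Ready, the Deployment isn't Healthy, and ArgoCD holds the next sync wave.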
Sync Hooks — Running Jobs at Deploy Time¶
Database migrations are the classic use case. Run them before the app deploys:
apiVersion: batch/v1
kind: Job
metadata:
name: payments-db-migrate
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "-1"
spec:
activeDeadlineSeconds: 300
backoffLimit: 2
template:
spec:
restartPolicy: Never
containers:
- name: migrate
image: ghcr.io/acme-corp/payments:v2.1.0
command: ["python", "manage.py", "migrate"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: payments-db-credentials
key: url
| Hook phase | When it runs | Use case |
|---|---|---|
| PreSync | Before main sync | DB migrations, schema validation |
| Sync | During main sync (with resources) | Rare — most things are just resources |
| PostSync | After all resources are healthy | Smoke tests, notifications |
| SyncFail | When sync fails | Cleanup, alerting |
Gotcha: Migrations must be idempotent. If a sync fails and retries, the PreSync hook runs again. A migration that tries to `CREATE TABLE` without `IF NOT EXISTS` will fail on retry and block all future deploys. Use a migration framework (Alembic, Flyway, Liquibase) that tracks which migrations have already run.
Part 7: App of Apps — Bootstrapping an Entire Cluster¶
One Application per service is manageable. Thirty Applications across three clusters is not. The App of Apps pattern solves this: one root Application points to a directory of Application manifests.
gitops-manifests/
├── root-app.yaml ← you apply this once, manually
├── apps/
│ ├── payments.yaml ← Application for payments service
│ ├── users.yaml ← Application for users service
│ ├── monitoring.yaml ← Application for Prometheus stack
│ ├── ingress-nginx.yaml ← Application for ingress controller
│ └── cert-manager.yaml ← Application for TLS certificates
└── apps/payments/
├── base/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── dev/
├── staging/
└── production/
The root app:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: platform-root
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/acme-corp/gitops-manifests.git
targetRevision: main
path: apps
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
prune: true
selfHeal: true
Bootstrap:
# One command to rule them all
kubectl apply -f root-app.yaml
# ArgoCD syncs root-app → creates child Applications → each child syncs its workloads
Trivia: The App of Apps pattern was discovered, not designed. Users noticed that an ArgoCD Application can manage any Kubernetes resource — including other Application CRDs. The community started using this to bootstrap entire clusters from a single commit. The Argo team later formalized it in the documentation.
ApplicationSet — When App of Apps Isn't Enough¶
For multi-cluster deployments, ApplicationSet generates Applications from templates:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: payments-all-clusters
namespace: argocd
spec:
generators:
- list:
elements:
- cluster: prod-us-east
url: https://prod-us-east.example.com
env: prod
- cluster: prod-eu-west
url: https://prod-eu-west.example.com
env: prod
- cluster: staging
url: https://staging.example.com
env: staging
template:
metadata:
name: "payments-{{cluster}}"
spec:
project: default
source:
repoURL: https://github.com/acme-corp/gitops-manifests.git
targetRevision: main
path: "apps/payments/overlays/{{env}}"
destination:
server: "{{url}}"
namespace: payments
syncPolicy:
automated:
prune: true
selfHeal: true
One manifest, three clusters. Add a new cluster by adding an element to the list. Remove one and its Application — and all its resources — get pruned.
Part 8: Kustomize and Helm Integration¶
ArgoCD doesn't care how you write manifests. It supports both major templating tools.
Kustomize — Overlays for Environments¶
Your repo structure for Kustomize:
apps/payments/
├── base/
│ ├── deployment.yaml # replicas: 2, image: ghcr.io/acme-corp/payments:latest
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── dev/
│ └── kustomization.yaml # replicas: 1, dev resources
├── staging/
│ └── kustomization.yaml # replicas: 2, staging DB
└── production/
├── kustomization.yaml # replicas: 5, prod DB, resource limits
└── patches/
└── replicas.yaml
ArgoCD source config for Kustomize:
source:
repoURL: https://github.com/acme-corp/gitops-manifests.git
targetRevision: main
path: apps/payments/overlays/production
ArgoCD automatically detects the kustomization.yaml and runs kustomize build.
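The production overlay's kustomization.yaml might look like this — a sketch consistent with the repo structure above:

```yaml
# apps/payments/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # inherit the base manifests
patches:
  - path: patches/replicas.yaml
images:
  - name: ghcr.io/acme-corp/payments
    newTag: v2.1.0             # the tag CI rewrites via `kustomize edit set image`
```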
Helm — Charts with Values Files¶
source:
repoURL: https://charts.bitnami.com/bitnami
chart: postgresql
targetRevision: 13.2.0
helm:
valueFiles:
- values-prod.yaml
parameters:
- name: auth.postgresPassword
value: "$POSTGRES_PASSWORD" # use External Secrets instead
Gotcha: When ArgoCD manages a Helm chart, `helm ls` won't show it. ArgoCD renders the chart via `helm template` and manages the raw manifests — it doesn't create a Helm release. Running `helm upgrade` manually alongside ArgoCD creates a fight where both try to manage the same resources. One owner per release, always.
Part 9: RBAC in ArgoCD — Who Can Deploy What¶
ArgoCD uses Casbin policies (stored in the argocd-rbac-cm ConfigMap) to control access.
This is separate from Kubernetes RBAC — ArgoCD has its own permission model.
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-rbac-cm
namespace: argocd
data:
policy.default: role:readonly
policy.csv: |
# Payments team: can view and sync their apps
p, role:payments-team, applications, get, payments/*, allow
p, role:payments-team, applications, sync, payments/*, allow
# Platform team: can sync anything, manage clusters
p, role:platform-admin, applications, *, */*, allow
p, role:platform-admin, clusters, *, *, allow
p, role:platform-admin, repositories, *, *, allow
# Bind SSO groups to roles
g, acme-corp:payments-engineers, role:payments-team
g, acme-corp:platform-team, role:platform-admin
scopes: "[groups]"
The format is Casbin policy syntax: p, SUBJECT, RESOURCE, ACTION, OBJECT, EFFECT.
AppProject adds another layer — it restricts which repos, clusters, and namespaces an Application can reference:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: payments
namespace: argocd
spec:
sourceRepos:
- https://github.com/acme-corp/gitops-manifests
- https://github.com/acme-corp/shared-charts
destinations:
- namespace: payments-*
server: https://kubernetes.default.svc
clusterResourceWhitelist:
- group: ""
kind: Namespace
Mental Model: Think of RBAC as who can do what, and AppProject as what apps are allowed to touch. RBAC says "the payments team can sync apps." AppProject says "apps in the payments project can only deploy to the
payments-*namespace." Both constraints must pass.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What is the App of Apps pattern? | A root Application that manages a directory of child Application manifests, bootstrapping an entire cluster from one commit. |
| What does argocd.argoproj.io/sync-wave: "-1" mean? | This resource syncs before wave 0 resources. |
| Why can't you use helm ls with ArgoCD-managed charts? | ArgoCD renders charts via helm template and manages raw manifests — no Helm release metadata is created. |
| What is an AppProject? | A CRD that restricts which repos, clusters, and namespaces a set of Applications can access — the multi-tenancy boundary. |
| What does hook-delete-policy: BeforeHookCreation do? | Deletes the old hook resource before creating the new one on the next sync, preventing "already exists" errors. |
Part 10: The GitOps Workflow vs Traditional CI/CD¶
Let's walk through the complete lifecycle side by side.
Scenario: Deploy a New Feature¶
Traditional CI/CD:
1. Developer merges PR to app repo
2. CI builds image → ghcr.io/acme-corp/payments:abc123
3. CI pushes image to registry
4. CI runs: kubectl set image deploy/payments payments=...abc123
5. kubectl returns 0 (image updated, not verified healthy)
6. Developer assumes it worked
GitOps:
1. Developer merges PR to app repo
2. CI builds image → ghcr.io/acme-corp/payments:abc123
3. CI pushes image to registry
4. CI commits new image tag to gitops-manifests repo
5. ArgoCD detects change within 3 minutes (or instantly via webhook)
6. ArgoCD renders manifests, diffs against live state
7. ArgoCD applies diff, monitors health checks
8. ArgoCD marks app as Synced + Healthy (or Degraded if probes fail)
9. Team sees status in ArgoCD UI and Slack notification
Scenario: Rollback¶
Traditional CI/CD:
1. Find the last good commit SHA
2. Re-run the CI pipeline for that SHA
3. Hope the old image still exists in the registry
4. Wait for CI to finish (build + test + deploy again)
GitOps:
# Option 1: Git revert
cd gitops-manifests
git revert HEAD
git push
# Option 2: ArgoCD CLI
argocd app rollback payments-service 3 # rollback to revision 3
# Option 3: ArgoCD UI → click "History" → click "Rollback"
Rollback in GitOps is a git revert — seconds, not minutes.
Part 11: Multi-Cluster Management¶
Your company has three clusters: dev, staging, prod. Here's how ArgoCD manages all three from a single installation.
Register Clusters¶
# ArgoCD runs in the management cluster
# Register external clusters
argocd cluster add prod-us-east --name prod-us-east
argocd cluster add staging --name staging
# Verify
argocd cluster list
ArgoCD stores cluster credentials as Secrets in the argocd namespace. The in-cluster destination (where ArgoCD runs) is always available as https://kubernetes.default.svc.
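Under the hood, `argocd cluster add` creates a labeled Secret along these lines (a sketch — check the ArgoCD declarative-setup docs for the exact config fields your auth method needs):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-us-east-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # how ArgoCD discovers cluster secrets
stringData:
  name: prod-us-east
  server: https://prod-us-east.example.com
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": { "caData": "<base64-encoded CA cert>" }
    }
```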
Hub-Spoke vs Instance-per-Cluster¶
| Model | How it works | Good for | Risk |
|---|---|---|---|
| Hub-spoke | One ArgoCD manages all clusters | Centralized visibility, single RBAC | ArgoCD is a SPOF |
| Instance-per-cluster | Each cluster runs its own ArgoCD | Isolation, blast radius containment | More operational overhead |
Most organizations start hub-spoke and split when they hit scale limits or compliance boundaries.
Part 12: Secrets — The Hard Part¶
GitOps says "everything in Git." Secrets say "not me."
Every GitOps team hits this tension. The solutions:
| Approach | How it works | Tradeoffs |
|---|---|---|
| Sealed Secrets | Encrypt secrets with a cluster-side key; commit ciphertext to Git | Simple. Key rotation is manual. Secrets are cluster-specific. |
| SOPS + age | Encrypt values in YAML files; decrypt at apply time | Works with any Git workflow. Requires key management. |
| External Secrets Operator | CRD references a secret in Vault/AWS SM; controller fetches it | Secrets never touch Git. Adds a dependency on the external store. |
| Vault Agent Injector | Vault sidecar injects secrets into pod at runtime | Pod-level injection. Tightest integration with Vault. |
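As a sketch of the External Secrets Operator approach — the store name and remote key path are assumptions, not values from this lesson's repo:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db-credentials
  namespace: payments
spec:
  secretStoreRef:
    name: aws-secrets-manager     # assumed (Cluster)SecretStore name
    kind: ClusterSecretStore
  target:
    name: payments-db-credentials # the Kubernetes Secret ESO creates
  data:
    - secretKey: url
      remoteRef:
        key: prod/payments/db-url # illustrative path in the external store
```

Git holds only this reference; the actual secret value never leaves the external store until ESO materializes it in the cluster.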
Gotcha: Even with External Secrets Operator, ArgoCD's diff view may show the resulting Secret as "OutOfSync" because the live Secret (populated by ESO) differs from the ExternalSecret CRD that ArgoCD manages. Use `ignoreDifferences` on Secret resources managed by ESO.
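One way to quiet that false drift, sketched on the Application spec (ignoring the data of all Secrets in the app — scope it tighter if only some are ESO-managed):

```yaml
spec:
  ignoreDifferences:
    - group: ""        # core API group
      kind: Secret
      jsonPointers:
        - /data        # values populated by ESO, not by Git
```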
Exercises¶
Exercise 1: Read the Diff (2 minutes)¶
ArgoCD shows this diff for your payments service:
--- desired (Git)
+++ live (cluster)
@@ -4,7 +4,7 @@
spec:
replicas: 3
template:
spec:
containers:
- name: payments
- image: ghcr.io/acme-corp/payments:v2.1.0
+ image: ghcr.io/acme-corp/payments:v2.0.9
Questions:
1. Is the cluster ahead of or behind Git?
2. What likely happened?
3. Should you sync to Git, or update Git to match the cluster?
Answer
1. The cluster is *behind* Git — running an older image (v2.0.9 vs v2.1.0).
2. A sync likely failed partway through, or someone manually rolled back the image.
3. Check whether v2.1.0 was intentionally deployed and whether it caused issues. If v2.1.0 is the desired version, sync. If v2.0.9 was a deliberate rollback, update Git to v2.0.9 and investigate why v2.1.0 failed.
Exercise 2: Write an Application Manifest (10 minutes)¶
Create an ArgoCD Application for a service called user-api with these requirements:
- Git repo: https://github.com/acme-corp/gitops-manifests.git
- Path: apps/user-api/overlays/staging
- Branch: main
- Namespace: user-api
- Automatic sync with self-heal but without prune (staging, not prod)
- Auto-create the namespace
Solution
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: user-api-staging
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/acme-corp/gitops-manifests.git
targetRevision: main
path: apps/user-api/overlays/staging
destination:
server: https://kubernetes.default.svc
namespace: user-api
syncPolicy:
automated:
selfHeal: true
prune: false
syncOptions:
- CreateNamespace=true
Exercise 3: Design a Sync Wave Strategy (15 minutes)¶
You're deploying a full-stack app with:
- A PostgreSQL StatefulSet
- A Redis Deployment
- A database migration Job
- The application Deployment
- An Ingress
- An HPA
Design the sync wave ordering. Which hook type does the migration need? What happens if you put everything in wave 0?
Solution
A workable ordering, following the wave scheme from Part 6: PostgreSQL and Redis in wave 0, the application Deployment in wave 1, and the Ingress and HPA in wave 2. The migration should be a `PreSync` hook with `hook-delete-policy: BeforeHookCreation` and `activeDeadlineSeconds: 300`. If everything is in wave 0, ArgoCD applies all resources simultaneously: the migration could run before PostgreSQL is ready, the app could start before the migration finishes, and the HPA could conflict with ArgoCD on the replicas field. Order matters.
Cheat Sheet¶
| Task | Command / Config |
|---|---|
| Install ArgoCD | kubectl create ns argocd && kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml |
| Get admin password | argocd admin initial-password -n argocd |
| Port-forward UI | kubectl port-forward svc/argocd-server -n argocd 8080:443 |
| List all apps | argocd app list |
| Check app status | argocd app get <app> |
| View drift diff | argocd app diff <app> |
| Force sync | argocd app sync <app> |
| Force refresh (skip poll) | argocd app get <app> --refresh |
| Rollback | argocd app rollback <app> <revision> |
| Add external cluster | argocd cluster add <context-name> |
| Check controller logs | kubectl -n argocd logs deploy/argocd-application-controller -f |
| Check repo-server logs | kubectl -n argocd logs deploy/argocd-repo-server -f |
| Concept | Key Point |
|---|---|
| Four GitOps principles | Declarative, Versioned, Pulled, Continuously reconciled (DVPC) |
| Default poll interval | 3 minutes (timeout.reconciliation in argocd-cm) |
| selfHeal | Reverts manual kubectl changes to match Git |
| prune | Deletes resources removed from Git (dangerous — protect StatefulSets) |
| App of Apps | One root Application manages a directory of child Applications |
| ApplicationSet | Template-based generation of Applications for multi-cluster |
| ignoreDifferences | Tells ArgoCD to stop fighting controllers (HPA, webhooks) |
| Sync waves | Lower numbers sync first; ArgoCD waits for health between waves |
Takeaways¶
- GitOps is an architecture, not a tool. The shift from push to pull changes who holds credentials, how you audit deploys, and how you recover from failures. ArgoCD and Flux are implementations — the principles are what matter.
- The reconciliation loop is borrowed from Kubernetes itself. ArgoCD uses the same watch-and-react pattern as every Kubernetes controller. Understanding it once unlocks understanding everywhere.
- `selfHeal: true` is a feature and a footgun. It prevents drift, but it also reverts your emergency hotfixes. Your team needs a documented process for how to make changes during incidents (answer: commit to Git, even in an emergency).
- Secrets are the unsolved problem. Every approach involves tradeoffs. Pick one (Sealed Secrets, SOPS, External Secrets Operator), enforce it consistently, and don't let "just this once" creep in.
- App of Apps turns cluster bootstrapping into a single commit. One `kubectl apply` of the root app and ArgoCD builds the entire platform. This is how you make clusters reproducible.
- `prune: true` is the most dangerous flag in ArgoCD. Understand the blast radius. Protect StatefulSets and PVCs with `Prune=false` annotations. Test directory renames in staging first.
Related Lessons¶
- What Happens When You `kubectl apply` — the Kubernetes control loop that ArgoCD builds on top of
- What Happens When You `git push` to CI — the CI pipeline that feeds the GitOps repo
- What Happens When You `helm install` — Helm internals, including why `helm ls` can't see ArgoCD-managed releases
- Terraform vs Ansible vs Helm — comparing declarative tools and when each fits