
GitOps: The Repo Is the Truth


Topics: GitOps principles, ArgoCD architecture, CI/CD pipeline evolution, Kubernetes reconciliation, Kustomize/Helm, drift detection, RBAC, multi-cluster management
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (Git and Kubernetes concepts explained as we go)


The Mission

Your team has been deploying to Kubernetes the old-fashioned way: a GitHub Actions workflow runs kubectl apply at the end of CI. It works — until it doesn't. Last Tuesday, someone ran kubectl scale deploy/payments --replicas=5 during a traffic spike. The CI pipeline ran two hours later and silently scaled it back to 3. The payments service buckled. Nobody noticed for 40 minutes because the deploy "succeeded."

Your job: migrate from push-based CI deploys to ArgoCD-driven GitOps. By the end of this lesson, you'll understand why the old model breaks, how the new model works at every layer, and how to build it so drift can never sneak past you again.


Part 1: The Archaeology — How We Got Here

Before we touch ArgoCD, let's trace the history. The migration you're doing isn't arbitrary — it's the result of a decade of deployment evolution, and each step solved the previous step's problem.

The Timeline

| Era | How deploys worked | What broke |
|---|---|---|
| ~2010 | SSH into server, git pull, restart | "It works on my machine." No rollback. |
| ~2013 | Capistrano/Fabric scripts | Scripts diverge across teams. Credentials everywhere. |
| ~2015 | Jenkins pipeline → kubectl apply | CI has cluster creds. Drift accumulates silently. |
| ~2017 | GitOps: controller inside the cluster pulls from Git | You're reading this lesson. |

Name Origin: The term "GitOps" was coined by Alexis Richardson, CEO of Weaveworks, in a 2017 blog post titled "GitOps — Operations by Pull Request." The name stuck because it captured the core idea in two syllables: Git is your operations source of truth. Weaveworks built Flux, the first GitOps controller, before the term even existed — the tool came first, then they named the pattern.

Trivia: Weaveworks, the company that invented GitOps and built Flux, shut down in February 2024 after failing to secure funding. The movement they started continued to thrive through ArgoCD and the CNCF-hosted Flux project. The company died; the idea didn't.

Push vs Pull: The Architectural Shift

Here's the old model — what you're migrating away from:

Push model (traditional CI/CD):
  Developer → git push → CI pipeline → builds image → kubectl apply → Cluster
                                              CI needs cluster credentials
                                              No drift detection
                                              No self-healing

And the new model:

Pull model (GitOps):
  Developer → git push → CI pipeline → builds image → updates manifest repo
                                              ArgoCD (inside cluster) polls repo
                                              Compares desired vs live state
                                              Applies diff to cluster

Three things changed:

  1. Credentials flipped. In push, CI needs cluster credentials — every CI runner is an attack surface. In pull, the controller lives inside the cluster. Credentials never leave.
  2. Drift detection appeared. Push can't see manual changes. Pull continuously compares desired state (Git) to actual state (cluster) and flags — or fixes — the gap.
  3. Git became the audit log. Every deploy is a commit. Every rollback is a revert. git log is your change history.

Mental Model: Think of push-based CI as a mail carrier who drops off a package and leaves. If someone moves the package, the carrier doesn't know and doesn't care. GitOps is a security guard who checks the room every 3 minutes and puts everything back where it belongs.


Flashcard Check #1

| Question | Answer (cover and test yourself) |
|---|---|
| Who coined the term "GitOps" and when? | Alexis Richardson, CEO of Weaveworks, in 2017. |
| In push-based CI/CD, who holds cluster credentials? | The CI runner (Jenkins, GitHub Actions, etc.). |
| In pull-based GitOps, who holds cluster credentials? | The controller running inside the cluster (ArgoCD). |
| What are the four GitOps principles? | Declarative, Versioned and immutable, Pulled automatically, Continuously reconciled. |

Remember: The mnemonic for the four principles is DVPC: Declarative, Versioned, Pulled, Continuously reconciled. If any one is missing, it's not GitOps. CI pushing kubectl apply is declarative and versioned, but not pulled or continuously reconciled.


Part 2: ArgoCD Architecture — What's Actually Running

Time to look under the hood. When you install ArgoCD, five core components land in the argocd namespace (newer releases add a couple more, such as the ApplicationSet controller). Each has a specific job:

┌───────────────────────────────────────────────────────────────┐
│                       argocd namespace                        │
│                                                               │
│  ┌──────────────────┐    ┌─────────────────────────────────┐  │
│  │ argocd-server    │    │ argocd-application-controller   │  │
│  │ (API + Web UI)   │    │ (the brain — diffs & syncs)     │  │
│  └────────┬─────────┘    └───────────────┬─────────────────┘  │
│           │                              │                    │
│  ┌────────┴─────────┐    ┌───────────────┴─────────────────┐  │
│  │ argocd-dex-server│    │ argocd-repo-server              │  │
│  │ (SSO/OIDC)       │    │ (clones repos, renders          │  │
│  │                  │    │  Helm/Kustomize/YAML)           │  │
│  └──────────────────┘    └───────────────┬─────────────────┘  │
│                                          │                    │
│                            ┌─────────────┴──────────────┐     │
│                            │ Redis (caching layer)      │     │
│                            └────────────────────────────┘     │
└───────────────────────────────────────────────────────────────┘

| Component | What it does | What breaks if it dies |
|---|---|---|
| argocd-server | Serves the web UI and API. Handles auth, RBAC. | No UI, no CLI access. Apps keep running. |
| argocd-application-controller | The reconciliation brain. Diffs desired (Git) vs live (cluster). | No syncs, no drift detection. Existing apps still run. |
| argocd-repo-server | Clones Git repos, renders Helm charts/Kustomize overlays into plain YAML. | Can't render new manifests. Syncs stall. |
| argocd-dex-server | SSO/OIDC authentication provider. | Can't log in via SSO. Admin password still works. |
| Redis | Caches repo state, app state, RBAC data. | Slower performance. Controller re-fetches everything. |

Under the Hood: The application controller uses the same watch-and-reconcile loop as every other Kubernetes controller. It's the same pattern the kube-controller-manager uses to manage Deployments, the same pattern kubelet uses to manage pods. ArgoCD didn't invent this — it leveraged a pattern that Kubernetes was already built on. If you've read the what-happens-when-you-kubectl-apply lesson, this is the same control loop operating one layer up.

Name Origin: "Argo" comes from the ship Argo in Greek mythology — the vessel that carried Jason and the Argonauts on their quest. The Argo project family (ArgoCD, Argo Workflows, Argo Rollouts, Argo Events) was created at Applatix, later acquired by Intuit (the TurboTax company). ArgoCD was open-sourced in 2018 and became a CNCF graduated project in 2022.

Trivia: One of the original motivations for creating ArgoCD at Intuit was that Flux (the existing GitOps tool) lacked a graphical interface. Intuit engineers wanted a visual resource tree showing sync status. The ArgoCD UI became one of its most distinguishing features and a major driver of adoption.


Part 3: The Application Resource — Your First ArgoCD Manifest

An ArgoCD Application is a CRD (Custom Resource Definition) that answers three questions:

  1. Where is the desired state? (Git repo, branch, path)
  2. Where should it be deployed? (cluster, namespace)
  3. How should syncing behave? (automatic vs manual, prune, self-heal)

Here's the Application manifest for your payments service migration:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io   # cascade-delete on app deletion
spec:
  project: default
  source:
    repoURL: https://github.com/acme-corp/gitops-manifests.git
    targetRevision: main
    path: apps/payments/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true          # delete resources removed from Git
      selfHeal: true       # revert manual kubectl changes
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Let's break this down field by field:

| Field | What it controls | What happens if you get it wrong |
|---|---|---|
| targetRevision: main | Which branch/tag to track | HEAD in prod means every merge deploys immediately |
| path | Directory in the repo containing manifests | Wrong path = ArgoCD syncs nothing or the wrong app |
| prune: true | Delete cluster resources removed from Git | Rename a directory = ArgoCD deletes everything in it |
| selfHeal: true | Revert manual kubectl changes | Someone's emergency hotfix gets silently reverted |
| finalizers | Delete managed resources when the Application is deleted | Without it, deleting the app orphans resources in the cluster |

Gotcha: prune: true is the most dangerous setting in ArgoCD. A team renamed a Helm chart directory in their GitOps repo. ArgoCD treated every resource from the old path as orphaned and pruned them all — including a PostgreSQL StatefulSet with 500GB of data. The PVCs were deleted. Recovery required restoring from a 6-hour-old backup. Protect critical resources with argocd.argoproj.io/sync-options: Prune=false.


Part 4: The Reconciliation Loop — How ArgoCD Actually Works

This is the core of GitOps and the part most people hand-wave past. Let's trace exactly what happens every 3 minutes (the default polling interval):

Every 3 minutes:
  1. Application controller checks: "which apps need sync?"
  2. Repo server clones the Git repo (or uses cached version)
  3. Repo server renders manifests:
     - Plain YAML? Pass through.
     - Helm chart? helm template with values files
     - Kustomize? kustomize build on the overlay path
  4. Controller compares rendered manifests to live cluster state
  5. Comparison result:
     ├─ Synced: desired == live. Do nothing.
     ├─ OutOfSync: desired != live.
     │   ├─ If automated sync: apply the diff.
     │   └─ If manual sync: flag it, wait for human.
     └─ Unknown: can't reach cluster or repo. Alert.

The comparison isn't a naive YAML diff. ArgoCD normalizes both sides — stripping server-generated fields like creationTimestamp, resourceVersion, and kubectl.kubernetes.io/last-applied-configuration. Getting this normalization right was one of ArgoCD's hardest engineering challenges, and edge cases in drift detection remain the most common source of bug reports.
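That normalize-then-diff step is easy to model. A toy Python sketch (illustrative only; the field names are a small subset, and the real comparison recurses through nested objects):

```python
# Toy model of ArgoCD's compare step: strip server-generated fields,
# then diff desired (Git) against live (cluster). Shallow for brevity.
SERVER_FIELDS = {"creationTimestamp", "resourceVersion", "uid"}

def normalize(manifest: dict) -> dict:
    """Drop fields the API server injects so they don't cause false diffs."""
    return {k: v for k, v in manifest.items() if k not in SERVER_FIELDS}

def compare(desired: dict, live: dict) -> str:
    return "Synced" if normalize(desired) == normalize(live) else "OutOfSync"

desired = {"kind": "Deployment", "replicas": 3}
live = {"kind": "Deployment", "replicas": 5,
        "resourceVersion": "812", "creationTimestamp": "2024-01-01"}

print(compare(desired, live))   # OutOfSync: replicas genuinely differ
live["replicas"] = 3
print(compare(desired, live))   # Synced: only server-generated fields differ
```

With automated sync enabled, an OutOfSync result is what triggers applying the Git side to the cluster.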

Under the Hood: The 3-minute polling interval is a deliberate design choice, not a limitation. ArgoCD polls Git rather than relying on webhooks for simplicity and reliability — webhooks can fail silently, be blocked by firewalls, or get rate-limited. You can configure webhooks for near-instant sync, but the polling ensures ArgoCD catches changes even when webhooks fail. Configure the interval in the argocd-cm ConfigMap: timeout.reconciliation: 180s.
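A minimal sketch of that setting in the argocd-cm ConfigMap (the value shown is the default):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 180s   # default; lower it for faster drift detection
```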

The Diff in Action

Let's see what happens when someone runs kubectl scale directly:

# Someone scales the payments service manually during a traffic spike
kubectl -n payments scale deploy/payments --replicas=5

# ArgoCD sees the drift within 3 minutes
argocd app diff payments-service

Output:

===== apps/Deployment payments/payments =====
--- desired (Git)
+++ live (cluster)
@@ -1,5 +1,5 @@
 spec:
-  replicas: 3
+  replicas: 5

With selfHeal: true, ArgoCD reverts this within one reconciliation cycle. The payments service goes back to 3 replicas. If that's not what you want, you need to update Git — not the cluster.

War Story: This is exactly the incident from your mission. A team had selfHeal: true enabled. During Black Friday, an SRE scaled a service from 3 to 8 replicas. ArgoCD scaled it back to 3 within minutes. The SRE scaled it up again. ArgoCD scaled it back. This loop repeated four times before someone realized the "random scaling failures" were ArgoCD doing its job. The fix: commit the scale change to Git, or use an HPA (Horizontal Pod Autoscaler) and tell ArgoCD to ignore the replicas field.

# Tell ArgoCD to ignore HPA-managed fields
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas

Flashcard Check #2

| Question | Answer |
|---|---|
| How often does ArgoCD poll Git by default? | Every 3 minutes (180 seconds). |
| What does selfHeal: true do? | Reverts any manual changes to cluster state that diverge from Git. |
| What is the ArgoCD sync status when desired state equals live state? | Synced. |
| Why does ArgoCD normalize manifests before comparing? | To strip server-generated fields (creationTimestamp, resourceVersion, etc.) that would cause false diffs. |
| How do you prevent ArgoCD from fighting with an HPA? | Use ignoreDifferences with a jsonPointer to /spec/replicas. |

Part 5: Imperative vs Declarative — The Real Comparison

Let's make the old-vs-new contrast concrete with your payments service migration.

The Old Way: Push-Based CI

# .github/workflows/deploy.yml — the pipeline you're replacing
name: Deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t ghcr.io/acme-corp/payments:${{ github.sha }} .
      - name: Push image
        run: docker push ghcr.io/acme-corp/payments:${{ github.sha }}
      - name: Deploy to cluster
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
          kubectl --kubeconfig=/tmp/kubeconfig \
            set image deploy/payments \
            payments=ghcr.io/acme-corp/payments:${{ github.sha }} \
            -n payments

Problems with this approach:

| Problem | Why it hurts |
|---|---|
| CI runner has KUBECONFIG secret | Compromised runner = cluster access |
| No drift detection | kubectl edit changes persist until next CI run |
| Rollback means re-running an old pipeline | Slow, error-prone, assumes old image still exists |
| No audit trail beyond CI logs | "Who deployed what when?" requires digging through pipeline logs |
| No health verification | Pipeline exits 0 as soon as kubectl set image returns |

The New Way: GitOps with ArgoCD

# .github/workflows/ci.yml — CI still builds, but does NOT deploy
name: CI
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t ghcr.io/acme-corp/payments:${{ github.sha }} .
      - name: Push image
        run: docker push ghcr.io/acme-corp/payments:${{ github.sha }}
      - name: Update GitOps repo
        run: |
          git clone https://x-access-token:${{ secrets.GITOPS_TOKEN }}@github.com/acme-corp/gitops-manifests.git
          cd gitops-manifests
          cd apps/payments/overlays/production
          kustomize edit set image ghcr.io/acme-corp/payments=ghcr.io/acme-corp/payments:${{ github.sha }}
          git add .
          git commit -m "deploy: payments ${{ github.sha }}"
          git push

CI pushes the intent to a Git repo. ArgoCD picks it up and applies it. The pipeline never touches the cluster.

Interview Bridge: "What is the difference between GitOps and CI/CD?" is an increasingly common interview question. The key: in CI/CD, the pipeline pushes changes to the cluster (push model). In GitOps, an agent in the cluster pulls desired state from Git and continuously reconciles (pull model). CI never needs cluster credentials, and git log becomes the audit trail.


Part 6: Sync Policies, Waves, and Health Checks

Sync Waves — Ordering Your Deploy

When ArgoCD syncs, it doesn't apply everything at once. Sync waves let you control ordering:

Wave -2: Namespace, RBAC, ServiceAccount
Wave -1: ConfigMaps, Secrets
Wave  0: Database migration Job (PreSync hook)
Wave  1: Application Deployment
Wave  2: Ingress, HPA, PodDisruptionBudget

Annotate resources to assign them to waves:

metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"    # applies before wave 0

ArgoCD waits for all resources in wave N to be healthy before starting wave N+1. This is where health checks become critical.
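The wave mechanics reduce to sort-then-batch. A toy Python sketch (the resource names are made up; real ArgoCD reads the sync-wave annotation and runs health checks between batches):

```python
from itertools import groupby

# Toy model of sync-wave ordering: batch resources by wave, ascending.
resources = [
    {"name": "payments-deploy", "wave": 1},
    {"name": "payments-ns", "wave": -2},
    {"name": "db-migrate-job", "wave": 0},
    {"name": "payments-ingress", "wave": 2},
    {"name": "app-config", "wave": -1},
]

def sync_order(resources):
    ordered = sorted(resources, key=lambda r: r["wave"])
    # Each wave applies together; the next wave waits for health.
    return [[r["name"] for r in batch]
            for _, batch in groupby(ordered, key=lambda r: r["wave"])]

for wave in sync_order(resources):
    print(wave)   # ['payments-ns'] first, ['payments-ingress'] last
```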

Health Checks — What "Healthy" Means

ArgoCD has built-in health checks for common resources:

| Resource | Healthy when |
|---|---|
| Deployment | availableReplicas == desiredReplicas |
| StatefulSet | readyReplicas == replicas |
| Job | succeeded >= 1 |
| PVC | phase == Bound |
| Ingress | status.loadBalancer.ingress has at least one entry |

Gotcha: If your Deployment has no readiness probe, ArgoCD considers it healthy as soon as the pod is Running — even if the app hasn't finished initializing. A wave-1 app can start connecting to a wave-0 database that isn't ready yet. Always define readiness probes on ArgoCD-managed Deployments.
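A minimal readiness probe sketch for an ArgoCD-managed Deployment (the /healthz path and port 8080 are placeholders; use your app's real health endpoint):

```yaml
containers:
  - name: payments
    image: ghcr.io/acme-corp/payments:v2.1.0
    readinessProbe:
      httpGet:
        path: /healthz    # placeholder endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
```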

Sync Hooks — Running Jobs at Deploy Time

Database migrations are the classic use case. Run them before the app deploys:

apiVersion: batch/v1
kind: Job
metadata:
  name: payments-db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "-1"
spec:
  activeDeadlineSeconds: 300
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/acme-corp/payments:v2.1.0
          command: ["python", "manage.py", "migrate"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: payments-db-credentials
                  key: url

| Hook phase | When it runs | Use case |
|---|---|---|
| PreSync | Before main sync | DB migrations, schema validation |
| Sync | During main sync (with resources) | Rare — most things are just resources |
| PostSync | After all resources are healthy | Smoke tests, notifications |
| SyncFail | When sync fails | Cleanup, alerting |

Gotcha: Migrations must be idempotent. If a sync fails and retries, the PreSync hook runs again. A migration that tries to CREATE TABLE without IF NOT EXISTS will fail on retry and block all future deploys. Use a migration framework (Alembic, Flyway, Liquibase) that tracks which migrations have already run.
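The tracking trick those frameworks use fits in a few lines: record each applied migration ID, skip anything already recorded. A minimal Python sketch against SQLite (real frameworks add locking, ordering guarantees, and checksums):

```python
import sqlite3

# Minimal migration tracker: re-running is a no-op, so a retried
# PreSync hook cannot fail on already-applied migrations.
MIGRATIONS = [
    ("001_create_payments", "CREATE TABLE payments (id INTEGER PRIMARY KEY)"),
    ("002_add_amount", "ALTER TABLE payments ADD COLUMN amount INTEGER"),
]

def migrate(db: sqlite3.Connection) -> list[str]:
    db.execute("CREATE TABLE IF NOT EXISTS schema_migrations (id TEXT PRIMARY KEY)")
    applied = {row[0] for row in db.execute("SELECT id FROM schema_migrations")}
    ran = []
    for mig_id, sql in MIGRATIONS:
        if mig_id in applied:
            continue              # idempotent: skip what already ran
        db.execute(sql)
        db.execute("INSERT INTO schema_migrations VALUES (?)", (mig_id,))
        ran.append(mig_id)
    db.commit()
    return ran

db = sqlite3.connect(":memory:")
print(migrate(db))   # ['001_create_payments', '002_add_amount']
print(migrate(db))   # [] — safe to retry
```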


Part 7: App of Apps — Bootstrapping an Entire Cluster

One Application per service is manageable. Thirty Applications across three clusters is not. The App of Apps pattern solves this: one root Application points to a directory of Application manifests.

gitops-manifests/
├── root-app.yaml              ← you apply this once, manually
├── apps/
│   ├── payments.yaml          ← Application for payments service
│   ├── users.yaml             ← Application for users service
│   ├── monitoring.yaml        ← Application for Prometheus stack
│   ├── ingress-nginx.yaml     ← Application for ingress controller
│   └── cert-manager.yaml      ← Application for TLS certificates
└── apps/payments/
    ├── base/
    │   ├── deployment.yaml
    │   ├── service.yaml
    │   └── kustomization.yaml
    └── overlays/
        ├── dev/
        ├── staging/
        └── production/

The root app:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme-corp/gitops-manifests.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Bootstrap:

# One command to rule them all
kubectl apply -f root-app.yaml
# ArgoCD syncs root-app → creates child Applications → each child syncs its workloads

Trivia: The App of Apps pattern was discovered, not designed. Users noticed that an ArgoCD Application can manage any Kubernetes resource — including other Application CRDs. The community started using this to bootstrap entire clusters from a single commit. The Argo team later formalized it in the documentation.

ApplicationSet — When App of Apps Isn't Enough

For multi-cluster deployments, ApplicationSet generates Applications from templates:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-all-clusters
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: prod-us-east
            url: https://prod-us-east.example.com
            env: prod
          - cluster: prod-eu-west
            url: https://prod-eu-west.example.com
            env: prod
          - cluster: staging
            url: https://staging.example.com
            env: staging
  template:
    metadata:
      name: "payments-{{cluster}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/acme-corp/gitops-manifests.git
        targetRevision: main
        path: "apps/payments/overlays/{{env}}"
      destination:
        server: "{{url}}"
        namespace: payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

One manifest, three clusters. Add a new cluster by adding an element to the list. Remove one and its Application — and all its resources — get pruned.
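Under the hood, the generator/template split is plain string substitution: each generator element fills the {{...}} placeholders to yield one Application. A toy Python sketch of that expansion (simplified to a few fields):

```python
# Toy model of ApplicationSet expansion: one Application per generator element.
elements = [
    {"cluster": "prod-us-east", "url": "https://prod-us-east.example.com", "env": "prod"},
    {"cluster": "staging", "url": "https://staging.example.com", "env": "staging"},
]

template = {
    "name": "payments-{{cluster}}",
    "path": "apps/payments/overlays/{{env}}",
    "server": "{{url}}",
}

def render(template: dict, element: dict) -> dict:
    out = {}
    for key, value in template.items():
        for param, sub in element.items():
            value = value.replace("{{" + param + "}}", sub)
        out[key] = value
    return out

apps = [render(template, e) for e in elements]
print(apps[0]["name"])   # payments-prod-us-east
```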


Part 8: Kustomize and Helm Integration

ArgoCD doesn't care how you write manifests. It supports both major templating tools.

Kustomize — Overlays for Environments

Your repo structure for Kustomize:

apps/payments/
├── base/
│   ├── deployment.yaml       # replicas: 2, image: ghcr.io/acme-corp/payments:latest
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── dev/
    │   └── kustomization.yaml    # replicas: 1, dev resources
    ├── staging/
    │   └── kustomization.yaml    # replicas: 2, staging DB
    └── production/
        ├── kustomization.yaml    # replicas: 5, prod DB, resource limits
        └── patches/
            └── replicas.yaml

ArgoCD source config for Kustomize:

source:
  repoURL: https://github.com/acme-corp/gitops-manifests.git
  targetRevision: main
  path: apps/payments/overlays/production

ArgoCD automatically detects the kustomization.yaml and runs kustomize build.

Helm — Charts with Values Files

source:
  repoURL: https://charts.bitnami.com/bitnami
  chart: postgresql
  targetRevision: 13.2.0
  helm:
    valueFiles:
      - values-prod.yaml
    parameters:
      - name: auth.postgresPassword
        value: "$POSTGRES_PASSWORD"     # use External Secrets instead

Gotcha: When ArgoCD manages a Helm chart, helm ls won't show it. ArgoCD renders the chart via helm template and manages the raw manifests — it doesn't create a Helm release. Running helm upgrade manually alongside ArgoCD creates a fight where both try to manage the same resources. One owner per release, always.


Part 9: RBAC in ArgoCD — Who Can Deploy What

ArgoCD uses Casbin policies (stored in the argocd-rbac-cm ConfigMap) to control access. This is separate from Kubernetes RBAC — ArgoCD has its own permission model.

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # Payments team: can view and sync their apps
    p, role:payments-team, applications, get, payments/*, allow
    p, role:payments-team, applications, sync, payments/*, allow

    # Platform team: can sync anything, manage clusters
    p, role:platform-admin, applications, *, */*, allow
    p, role:platform-admin, clusters, *, *, allow
    p, role:platform-admin, repositories, *, *, allow

    # Bind SSO groups to roles
    g, acme-corp:payments-engineers, role:payments-team
    g, acme-corp:platform-team, role:platform-admin
  scopes: "[groups]"

The format is Casbin policy syntax: p, SUBJECT, RESOURCE, ACTION, OBJECT, EFFECT.
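The OBJECT column uses glob patterns such as payments/*. A toy Python sketch of how a policy row is matched (real Casbin supports richer matchers; this handles only * globs):

```python
from fnmatch import fnmatch

# Toy policy check: (role, resource, action-pattern, object-pattern) rows.
POLICY = [
    ("role:payments-team", "applications", "get", "payments/*"),
    ("role:payments-team", "applications", "sync", "payments/*"),
    ("role:platform-admin", "applications", "*", "*/*"),
]

def allowed(role: str, resource: str, action: str, obj: str) -> bool:
    return any(
        r == role and res == resource
        and fnmatch(action, act) and fnmatch(obj, pattern)
        for r, res, act, pattern in POLICY
    )

print(allowed("role:payments-team", "applications", "sync", "payments/payments-service"))   # True
print(allowed("role:payments-team", "applications", "delete", "payments/payments-service")) # False
```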

AppProject adds another layer — it restricts which repos, clusters, and namespaces an Application can reference:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/acme-corp/gitops-manifests
    - https://github.com/acme-corp/shared-charts
  destinations:
    - namespace: payments-*
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace

Mental Model: Think of RBAC as who can do what, and AppProject as what apps are allowed to touch. RBAC says "the payments team can sync apps." AppProject says "apps in the payments project can only deploy to the payments-* namespace." Both constraints must pass.


Flashcard Check #3

Question Answer
What is the App of Apps pattern? A root Application that manages a directory of child Application manifests, bootstrapping an entire cluster from one commit.
What does argocd.argoproj.io/sync-wave: "-1" mean? This resource syncs before wave 0 resources.
Why can't you use helm ls with ArgoCD-managed charts? ArgoCD renders charts via helm template and manages raw manifests — no Helm release metadata is created.
What is an AppProject? A CRD that restricts which repos, clusters, and namespaces a set of Applications can access — the multi-tenancy boundary.
What does hook-delete-policy: BeforeHookCreation do? Deletes the old hook resource before creating the new one on the next sync, preventing "already exists" errors.

Part 10: The GitOps Workflow vs Traditional CI/CD

Let's walk through the complete lifecycle side by side.

Scenario: Deploy a New Feature

Traditional CI/CD:

1. Developer merges PR to app repo
2. CI builds image → ghcr.io/acme-corp/payments:abc123
3. CI pushes image to registry
4. CI runs: kubectl set image deploy/payments payments=...abc123
5. kubectl returns 0 (image updated, not verified healthy)
6. Developer assumes it worked

GitOps:

1. Developer merges PR to app repo
2. CI builds image → ghcr.io/acme-corp/payments:abc123
3. CI pushes image to registry
4. CI commits new image tag to gitops-manifests repo
5. ArgoCD detects change within 3 minutes (or instantly via webhook)
6. ArgoCD renders manifests, diffs against live state
7. ArgoCD applies diff, monitors health checks
8. ArgoCD marks app as Synced + Healthy (or Degraded if probes fail)
9. Team sees status in ArgoCD UI and Slack notification

Scenario: Rollback

Traditional CI/CD:

1. Find the last good commit SHA
2. Re-run the CI pipeline for that SHA
3. Hope the old image still exists in the registry
4. Wait for CI to finish (build + test + deploy again)

GitOps:

# Option 1: Git revert
cd gitops-manifests
git revert HEAD
git push

# Option 2: ArgoCD CLI
argocd app rollback payments-service 3    # rollback to revision 3

# Option 3: ArgoCD UI → click "History" → click "Rollback"

Rollback in GitOps is a git revert — seconds, not minutes.


Part 11: Multi-Cluster Management

Your company has three clusters: dev, staging, prod. Here's how ArgoCD manages all three from a single installation.

Register Clusters

# ArgoCD runs in the management cluster
# Register external clusters
argocd cluster add prod-us-east --name prod-us-east
argocd cluster add staging --name staging

# Verify
argocd cluster list

ArgoCD stores cluster credentials as Secrets in the argocd namespace. The in-cluster destination (the cluster ArgoCD itself runs in) is always available as https://kubernetes.default.svc.

Hub-Spoke vs Instance-per-Cluster

| Model | How it works | Good for | Risk |
|---|---|---|---|
| Hub-spoke | One ArgoCD manages all clusters | Centralized visibility, single RBAC | ArgoCD is a single point of failure |
| Instance-per-cluster | Each cluster runs its own ArgoCD | Isolation, blast radius containment | More operational overhead |

Most organizations start hub-spoke and split when they hit scale limits or compliance boundaries.


Part 12: Secrets — The Hard Part

GitOps says "everything in Git." Secrets say "not me."

Every GitOps team hits this tension. The solutions:

| Approach | How it works | Tradeoffs |
|---|---|---|
| Sealed Secrets | Encrypt secrets with a cluster-side key; commit ciphertext to Git | Simple. Key rotation is manual. Secrets are cluster-specific. |
| SOPS + age | Encrypt values in YAML files; decrypt at apply time | Works with any Git workflow. Requires key management. |
| External Secrets Operator | CRD references a secret in Vault/AWS SM; controller fetches it | Secrets never touch Git. Adds a dependency on the external store. |
| Vault Agent Injector | Vault sidecar injects secrets into pod at runtime | Pod-level injection. Tightest integration with Vault. |

Gotcha: Even with External Secrets Operator, ArgoCD's diff view may show the resulting Secret as "OutOfSync" because the live Secret (populated by ESO) differs from the ExternalSecret CRD that ArgoCD manages. Use ignoreDifferences on Secret resources managed by ESO.
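A sketch of that exception on the Application spec (assumes ESO writes the fetched values into the Secret's data field):

```yaml
spec:
  ignoreDifferences:
    - kind: Secret
      jsonPointers:
        - /data    # populated by External Secrets Operator at runtime, not from Git
```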


Exercises

Exercise 1: Read the Diff (2 minutes)

ArgoCD shows this diff for your payments service:

--- desired (Git)
+++ live (cluster)
@@ -4,7 +4,7 @@
 spec:
   replicas: 3
   template:
     spec:
       containers:
       - name: payments
-        image: ghcr.io/acme-corp/payments:v2.1.0
+        image: ghcr.io/acme-corp/payments:v2.0.9

Questions:
  1. Is the cluster ahead of or behind Git?
  2. What likely happened?
  3. Should you sync to Git, or update Git to match the cluster?

Answer:
  1. The cluster is *behind* Git — running an older image (v2.0.9 vs v2.1.0).
  2. A sync likely failed partway through, or someone manually rolled back the image.
  3. Check whether v2.1.0 was intentionally deployed and whether it caused issues. If v2.1.0 is the desired version, sync. If v2.0.9 was a deliberate rollback, update Git to v2.0.9 and investigate why v2.1.0 failed.

Exercise 2: Write an Application Manifest (10 minutes)

Create an ArgoCD Application for a service called user-api with these requirements:
  - Git repo: https://github.com/acme-corp/gitops-manifests.git
  - Path: apps/user-api/overlays/staging
  - Branch: main
  - Namespace: user-api
  - Automatic sync with self-heal but without prune (staging, not prod)
  - Auto-create the namespace

Solution
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: user-api-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme-corp/gitops-manifests.git
    targetRevision: main
    path: apps/user-api/overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: user-api
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - CreateNamespace=true

Exercise 3: Design a Sync Wave Strategy (15 minutes)

You're deploying a full-stack app with:
  - A PostgreSQL StatefulSet
  - A Redis Deployment
  - A database migration Job
  - The application Deployment
  - An Ingress
  - An HPA

Design the sync wave ordering. Which hook type does the migration need? What happens if you put everything in wave 0?

Solution
Wave -3: Namespace, ServiceAccount, RBAC
Wave -2: ConfigMaps, Secrets (ExternalSecret CRDs)
Wave -1: PostgreSQL StatefulSet, Redis Deployment
Wave  0: Database migration (PreSync hook, not a wave — it runs before any sync)
Wave  1: Application Deployment
Wave  2: Ingress, HPA, PodDisruptionBudget
The migration should be a `PreSync` hook with `hook-delete-policy: BeforeHookCreation` and `activeDeadlineSeconds: 300`. If everything is wave 0, ArgoCD applies all resources simultaneously. The migration could run before PostgreSQL is ready. The app could start before the migration finishes. The HPA could conflict with ArgoCD on the replicas field. Order matters.

Cheat Sheet

| Task | Command / Config |
|---|---|
| Install ArgoCD | kubectl create ns argocd && kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml |
| Get admin password | argocd admin initial-password -n argocd |
| Port-forward UI | kubectl port-forward svc/argocd-server -n argocd 8080:443 |
| List all apps | argocd app list |
| Check app status | argocd app get <app> |
| View drift diff | argocd app diff <app> |
| Force sync | argocd app sync <app> |
| Force refresh (skip poll) | argocd app get <app> --refresh |
| Rollback | argocd app rollback <app> <revision> |
| Add external cluster | argocd cluster add <context-name> |
| Check controller logs | kubectl -n argocd logs deploy/argocd-application-controller -f |
| Check repo-server logs | kubectl -n argocd logs deploy/argocd-repo-server -f |

| Concept | Key Point |
|---|---|
| Four GitOps principles | Declarative, Versioned, Pulled, Continuously reconciled (DVPC) |
| Default poll interval | 3 minutes (timeout.reconciliation in argocd-cm) |
| selfHeal | Reverts manual kubectl changes to match Git |
| prune | Deletes resources removed from Git (dangerous — protect StatefulSets) |
| App of Apps | One root Application manages a directory of child Applications |
| ApplicationSet | Template-based generation of Applications for multi-cluster |
| ignoreDifferences | Tells ArgoCD to stop fighting controllers (HPA, webhooks) |
| Sync waves | Lower numbers sync first; ArgoCD waits for health between waves |

Takeaways

  • GitOps is an architecture, not a tool. The shift from push to pull changes who holds credentials, how you audit deploys, and how you recover from failures. ArgoCD and Flux are implementations — the principles are what matter.

  • The reconciliation loop is borrowed from Kubernetes itself. ArgoCD uses the same watch-and-react pattern as every Kubernetes controller. Understanding it once unlocks understanding everywhere.

  • selfHeal: true is a feature and a footgun. It prevents drift, but it also reverts your emergency hotfixes. Your team needs a documented process for how to make changes during incidents (answer: commit to Git, even in an emergency).

  • Secrets are the unsolved problem. Every approach involves tradeoffs. Pick one (Sealed Secrets, SOPS, External Secrets Operator), enforce it consistently, and don't let "just this once" creep in.

  • App of Apps turns cluster bootstrapping into a single commit. One kubectl apply of the root app and ArgoCD builds the entire platform. This is how you make clusters reproducible.

  • prune: true is the most dangerous flag in ArgoCD. Understand the blast radius. Protect StatefulSets and PVCs with Prune=false annotations. Test directory renames in staging first.