Argo Workflows — Primer¶
Why This Matters¶
Kubernetes CronJobs are fine for simple periodic tasks. But when you need to run a sequence of steps, fan out to parallel jobs, pass data between steps, retry specific failures, or build a machine learning pipeline, CronJobs fall apart. You're left stitching together Jobs with shell scripts and hoping the intermediate state doesn't get lost.
Who made it: Argo Workflows was created at Applatix (later acquired by Intuit) in 2017. It became a CNCF incubating project in 2020 and graduated in 2022. The name "Argo" references the ship from Greek mythology that carried Jason and the Argonauts — fitting for a tool that orchestrates journeys through complex pipelines.
Argo Workflows is a Kubernetes-native workflow engine that turns DAGs and pipelines into first-class Kubernetes resources. Every step runs as a Pod. Progress is visible in a UI. Artifacts flow between steps via S3 or GCS. Failures are retried with backoff. The entire workflow is auditable in etcd.
For platform engineers, Argo Workflows replaces Jenkins pipelines, Airflow DAGs, and ad-hoc shell scripts for batch compute, ML training pipelines, data processing, and CI/CD tasks that don't fit neatly into a linear pipeline. Understanding it means you can design workflows that are parallelizable, resumable, and observable.
Core Concepts¶
1. Installation¶
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/latest/download/install.yaml
# Wait for rollout
kubectl -n argo rollout status deploy/workflow-controller
kubectl -n argo rollout status deploy/argo-server
# Install CLI
curl -sLO https://github.com/argoproj/argo-workflows/releases/latest/download/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz && chmod +x argo-linux-amd64
mv argo-linux-amd64 /usr/local/bin/argo
# Port-forward UI
kubectl -n argo port-forward svc/argo-server 2746:2746
# Open: https://localhost:2746
2. The Workflow Resource¶
A Workflow is a single run of a pipeline. It contains an entrypoint (the first template to run) and a set of templates (reusable step definitions).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: hello-world-
namespace: argo
spec:
entrypoint: greet
templates:
- name: greet
container:
image: alpine:3.18
command: [echo]
args: ["Hello, Argo Workflows!"]
resources:
requests:
memory: 64Mi
cpu: 100m
argo submit workflow.yaml -n argo --watch
argo list -n argo
argo get @latest -n argo
argo logs @latest -n argo
3. Template Types¶
Container Template¶
Runs a single container (most common):
- name: build-image
container:
image: docker:24-dind
command: [docker, build]
args: ["-t", "myapp:{{workflow.parameters.tag}}", "."]
volumeMounts:
- name: docker-sock
mountPath: /var/run/docker.sock
Script Template¶
Inline script (avoids building a custom image for simple logic):
- name: generate-report
script:
image: python:3.11-slim
command: [python]
source: |
import json, datetime
report = {
"timestamp": datetime.datetime.now(datetime.UTC).isoformat(),
"status": "ok",
"records_processed": 42
}
print(json.dumps(report))
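The stdout of a script (or container) template is also captured as `outputs.result`, so small values can be passed downstream without declaring output files. A minimal sketch (the step and consumer template names here are illustrative):

```yaml
- name: report-pipeline
  steps:
  - - name: report
      template: generate-report          # the script template above
  - - name: archive
      template: archive-report           # hypothetical consumer template
      arguments:
        parameters:
        - name: report-json
          value: "{{steps.report.outputs.result}}"   # the script's printed JSON
```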
Resource Template¶
Create, patch, or delete Kubernetes resources:
- name: create-configmap
resource:
action: create
manifest: |
apiVersion: v1
kind: ConfigMap
metadata:
name: pipeline-output
namespace: argo
data:
result: "{{inputs.parameters.result}}"
- name: wait-for-job
resource:
action: get
successCondition: status.succeeded > 0
failureCondition: status.failed > 3
manifest: |
apiVersion: batch/v1
kind: Job
metadata:
name: external-job
namespace: argo
Suspend Template¶
Pause the workflow until manually resumed:
- name: wait-for-approval
suspend:
duration: "1h" # auto-resume after 1h if not manually resumed earlier
# Resume a suspended workflow
argo resume my-workflow-xxxxx -n argo
# Or resume a specific node
argo resume my-workflow-xxxxx -n argo --node-field-selector=displayName=wait-for-approval
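Omitting duration makes the suspend wait indefinitely until someone resumes it, which is the usual shape for a true manual approval gate:

```yaml
- name: manual-approval
  suspend: {}   # no duration: blocks until `argo resume` is run
```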
4. Steps vs DAG¶
Steps define a linear sequence with optional parallelism at each step level:
- name: pipeline
steps:
- - name: fetch-data # Step 1: sequential
template: fetch
- - name: transform-a # Step 2: parallel (same dash level)
template: transform
arguments:
parameters:
- name: shard
value: "a"
- name: transform-b
template: transform
arguments:
parameters:
- name: shard
value: "b"
- - name: load # Step 3: after both transforms complete
template: load
DAG defines tasks with explicit dependencies — more flexible for complex graphs:
- name: ml-pipeline
dag:
tasks:
- name: preprocess
template: preprocess-data
- name: train-model
dependencies: [preprocess]
template: train
- name: evaluate
dependencies: [train-model]
template: evaluate
- name: deploy-if-good
dependencies: [evaluate]
template: deploy
when: "{{tasks.evaluate.outputs.parameters.accuracy}} >= 0.95"
- name: notify-failure
dependencies: [evaluate]
template: notify
when: "{{tasks.evaluate.outputs.parameters.accuracy}} < 0.95"
Use steps when the pipeline is inherently sequential with parallel bursts. Use DAG when dependencies are complex or when you need conditional branching based on upstream outputs.
Gotcha: In the steps syntax, parallel tasks are defined by putting multiple items inside the same outer list item (the `- -` level). This is easy to get wrong: an extra `- -` turns intended-parallel steps into sequential ones, and a missing one silently makes sequential steps run in parallel. Use DAG when readability matters more than brevity.
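The nesting alone decides what runs in parallel, which is easiest to see side by side in a small sketch:

```yaml
steps:
- - name: step-a          # outer list item = one sequential stage
    template: work
  - name: step-b          # same outer item: runs in PARALLEL with step-a
    template: work
- - name: step-c          # new outer item: starts only after a and b finish
    template: work
```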
5. Artifacts — Passing Data Between Steps¶
Artifacts let steps exchange files without requiring shared volumes. Argo Workflows supports S3, GCS, HDFS, and HTTP as artifact backends.
Configure the artifact repository:
# In the workflow-controller-configmap
apiVersion: v1
kind: ConfigMap
metadata:
name: workflow-controller-configmap
namespace: argo
data:
artifactRepository: |
s3:
bucket: my-argo-artifacts
endpoint: s3.amazonaws.com
insecure: false
accessKeySecret:
name: s3-credentials
key: accessKey
secretKeySecret:
name: s3-credentials
key: secretKey
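The accessKeySecret/secretKeySecret references above point at a plain Kubernetes Secret, which might look like the sketch below — the Secret name and key names must match the ConfigMap; the credential values are placeholders (prefer IAM roles / workload identity where available):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials        # referenced by the configmap above
  namespace: argo
type: Opaque
stringData:
  accessKey: AKIAXXXXXXXXXXXX   # placeholder
  secretKey: xxxxxxxxxxxxxxxx   # placeholder
```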
Using artifacts in a workflow:
- name: generate-dataset
container:
image: python:3.11-slim
command: [python, /scripts/generate.py]
outputs:
artifacts:
- name: dataset
path: /tmp/dataset.parquet
# The calling step wires the artifact in via arguments, e.g.:
#   arguments:
#     artifacts:
#     - name: dataset
#       from: "{{steps.generate-dataset.outputs.artifacts.dataset}}"
- name: train-model
  inputs:
    artifacts:
    - name: dataset
      path: /data/dataset.parquet
container:
image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
command: [python, /scripts/train.py]
outputs:
artifacts:
- name: model
path: /tmp/model.pt
parameters:
- name: accuracy
valueFrom:
path: /tmp/accuracy.txt
6. Parameters — Inputs, Outputs, and Expressions¶
Workflow-level parameters (passed at submit time):
spec:
arguments:
parameters:
- name: image-tag
value: latest # default, overridden at submit
- name: environment
value: staging
entrypoint: deploy-pipeline
Passing outputs as inputs between steps:
steps:
- - name: get-version
template: detect-version
- - name: build
template: build-image
arguments:
parameters:
- name: tag
value: "{{steps.get-version.outputs.parameters.version}}"
Output parameters from a file:
- name: detect-version
script:
image: alpine:3.18
command: [sh]
source: |
  apk add --no-cache git   # alpine:3.18 does not ship git
  # assumes the repository is already checked out in the working directory
  git describe --tags --abbrev=0 > /tmp/version.txt
  cat /tmp/version.txt
outputs:
parameters:
- name: version
valueFrom:
path: /tmp/version.txt
7. Retry Strategies¶
- name: flaky-api-call
retryStrategy:
limit: "5"
retryPolicy: "Always" # Always | OnFailure | OnError | OnTransientError
backoff:
duration: "10s"
factor: "2" # exponential backoff: 10s, 20s, 40s, 80s, 160s
maxDuration: "3m"
container:
image: curlimages/curl
command: [curl, -sf, "https://api.example.com/data"]
Retry policies:
- Always: retry on any failure or error
- OnFailure: retry when the main container exits non-zero (includes OOMKilled)
- OnError: retry on Argo-level errors (pod deleted, eviction, init/wait container failure), not application exit codes
- OnTransientError: retry only on errors classified as transient (API server timeouts, resource-version conflicts)
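Recent Argo versions (v3.3+) also accept a retryStrategy `expression` that must evaluate true for a retry to happen — for example, to treat a particular exit code as permanent. A sketch; the exit code is illustrative:

```yaml
- name: retry-unless-fatal
  retryStrategy:
    limit: "5"
    expression: "asInt(lastRetry.exitCode) != 64"   # assume exit 64 means permanent failure: don't retry
  container:
    image: curlimages/curl
    command: [curl, -sf, "https://api.example.com/data"]
```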
8. Parallelism and Fan-out with withItems / withParam¶
withItems — static list fan-out:
- name: process-shards
steps:
- - name: process
template: process-shard
arguments:
parameters:
- name: shard
value: "{{item}}"
withItems:
- shard-001
- shard-002
- shard-003
- shard-004
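withItems also accepts maps, which is handy when each fan-out instance needs several values; fields are referenced as `{{item.<key>}}`:

```yaml
- - name: process
    template: process-shard
    arguments:
      parameters:
      - name: shard
        value: "{{item.name}}"
      - name: region
        value: "{{item.region}}"
    withItems:
    - { name: shard-001, region: us-east-1 }
    - { name: shard-002, region: eu-west-1 }
```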
withParam — dynamic fan-out from a prior step's JSON output:
steps:
- - name: discover-shards
    template: list-shards
    # outputs.result: '["shard-001","shard-002","shard-003"]'
- - name: process-all
    template: process-shard
    arguments:
      parameters:
      - name: shard
        value: "{{item}}"
    withParam: "{{steps.discover-shards.outputs.result}}"
Control parallelism to avoid overwhelming downstream systems:
spec:
parallelism: 5 # global: at most 5 pods running at once in this workflow
- name: process-shard
parallelism: 3 # template-level: at most 3 concurrent instances of this step
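Beyond per-workflow parallelism, a semaphore can cap concurrency across workflows — e.g. many CronWorkflow runs sharing one database. A sketch, assuming a ConfigMap named `semaphore-config` whose `workflow` key holds the limit:

```yaml
spec:
  synchronization:
    semaphore:                    # becomes the list form `semaphores` in newer Argo releases
      configMapKeyRef:
        name: semaphore-config    # hypothetical ConfigMap
        key: workflow             # its value, e.g. "2", is the max concurrent holders
```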
9. WorkflowTemplate — Reusable Definitions¶
WorkflowTemplate is a namespaced library of reusable templates (ClusterWorkflowTemplate is the cluster-scoped equivalent). You reference its templates from ad-hoc Workflows or CronWorkflows.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: build-templates
namespace: argo
spec:
templates:
- name: docker-build
inputs:
parameters:
- name: image
- name: tag
- name: dockerfile
value: Dockerfile
container:
image: gcr.io/kaniko-project/executor:latest
command: [/kaniko/executor]
args:
- --dockerfile={{inputs.parameters.dockerfile}}
- --destination={{inputs.parameters.image}}:{{inputs.parameters.tag}}
- --cache=true
- --cache-repo={{inputs.parameters.image}}-cache
Reference from a Workflow:
spec:
entrypoint: ci-pipeline
templates:
- name: ci-pipeline
steps:
- - name: build
templateRef:
name: build-templates
template: docker-build
arguments:
parameters:
- name: image
value: ghcr.io/myorg/myapp
- name: tag
value: "{{workflow.parameters.tag}}"
10. CronWorkflow¶
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
name: nightly-etl
namespace: argo
spec:
schedule: "0 2 * * *" # 2am UTC daily
timezone: "UTC"
concurrencyPolicy: Forbid # Allow | Forbid | Replace
startingDeadlineSeconds: 300
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5
workflowSpec:
entrypoint: etl-pipeline
arguments:
parameters:
- name: date
value: "{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}"
templates:
- name: etl-pipeline
dag:
tasks:
- name: extract
template: run-extract
- name: transform
dependencies: [extract]
template: run-transform
- name: load
dependencies: [transform]
template: run-load
11. RBAC¶
Service account for workflows (least-privilege):
apiVersion: v1
kind: ServiceAccount
metadata:
name: argo-workflow-sa
namespace: argo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: argo-workflow-role
namespace: argo
rules:
- apiGroups: [""]
resources: [pods]
verbs: [get, watch, patch]
- apiGroups: [""]
resources: [pods/log]
verbs: [get, watch]
- apiGroups: [argoproj.io]
  resources: [workflowtaskresults]
  verbs: [create, patch]   # the executor reports step results through these
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: argo-workflow-rb
namespace: argo
subjects:
- kind: ServiceAccount
  name: argo-workflow-sa
  namespace: argo
roleRef:
kind: Role
name: argo-workflow-role
apiGroup: rbac.authorization.k8s.io
In the WorkflowSpec, reference the service account:
spec:
  serviceAccountName: argo-workflow-sa
12. Argo Events Integration¶
Argo Events provides event-driven triggers for Workflows. An EventSource captures events (webhooks, S3 notifications, Kafka messages), and a Sensor maps them to WorkflowTemplate submissions.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
name: github-webhook
namespace: argo-events
spec:
webhook:
push:
port: "12000"
endpoint: /push
method: POST
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
name: github-push-sensor
namespace: argo-events
spec:
template:
serviceAccountName: argo-events-sa
dependencies:
- name: push-event
eventSourceName: github-webhook
eventName: push
triggers:
- template:
name: trigger-ci
k8s:
operation: create
source:
resource:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: ci-triggered-
namespace: argo
spec:
workflowTemplateRef:
name: ci-pipeline
arguments:
  parameters:
  - name: branch
    value: ""   # overwritten from the event payload by the trigger parameter below
parameters:   # trigger-level (k8s.parameters): maps event data into the resource
- src:
    dependencyName: push-event
    dataKey: body.ref
  dest: spec.arguments.parameters.0.value
13. When to Use Argo Workflows vs Alternatives¶
| Tool | Best for | Avoid when |
|---|---|---|
| Argo Workflows | Complex DAGs, ML pipelines, multi-step batch, fan-out with artifact passing | Simple periodic tasks |
| Kubernetes CronJob | Simple, single-step periodic tasks | Multi-step, artifact passing, fan-out |
| Tekton | CI/CD pipelines, strong Kubernetes CRD model, Tekton Hub integration | Non-CI workflows, small teams |
| Airflow | Python-native DAGs, large data engineering teams, existing Airflow investment | Kubernetes-native environments without Python expertise |
| GitHub Actions / GitLab CI | Code-centric CI triggered by Git events | Cluster-internal workflows, non-Git triggers |
Quick Reference¶
# Submit
argo submit workflow.yaml -n argo
argo submit workflow.yaml -n argo --wait
argo submit workflow.yaml -n argo -p image-tag=v1.2.3
# Monitor
argo list -n argo
argo get my-workflow-xxxxx -n argo
argo get @latest -n argo --watch
argo logs my-workflow-xxxxx -n argo
argo logs my-workflow-xxxxx -n argo -f # follow
# Control
argo suspend my-workflow-xxxxx -n argo
argo resume my-workflow-xxxxx -n argo
argo retry my-workflow-xxxxx -n argo # retry failed workflow
argo retry my-workflow-xxxxx -n argo --restart-successful # retry all nodes
argo terminate my-workflow-xxxxx -n argo # stop immediately
argo delete my-workflow-xxxxx -n argo
# CronWorkflow
argo cron list -n argo
argo cron suspend nightly-etl -n argo
argo cron resume nightly-etl -n argo
# WorkflowTemplate
argo template list -n argo
argo template get build-templates -n argo
# Garbage collect old workflows
argo delete --completed -n argo
argo delete --older 7d -n argo
Wiki Navigation¶
Prerequisites¶
- Kubernetes Ops (Production) (Topic Pack, L2)