Argo Workflows — Street-Level Ops¶
Quick Diagnosis Commands¶
# What workflows are running / failed?
argo list -n argo --status Running
argo list -n argo --status Failed
# Detailed workflow state
argo get my-workflow-xxxxx -n argo
# Logs for a specific failed step (pass the step's pod name as the second argument)
argo logs my-workflow-xxxxx my-workflow-xxxxx-step-name -n argo
# Workflow controller logs (scheduling errors, resource issues)
kubectl -n argo logs -l app=workflow-controller --tail=100 -f
# Argo server logs (API / UI errors)
kubectl -n argo logs -l app=argo-server --tail=100
# Check resource usage of running workflow pods
kubectl -n argo top pods -l workflows.argoproj.io/workflow=my-workflow-xxxxx
# Inspect a failed pod directly
kubectl -n argo describe pod my-workflow-xxxxx-step-abc123
# Check artifact repository connectivity
kubectl -n argo get configmap workflow-controller-configmap -o yaml | grep -A20 artifactRepository
Gotcha: Workflow Hangs in "Pending" — Pod Never Scheduled¶
A workflow step sits in Pending state. The Pod was created but never scheduled.
# Find the pending pod
kubectl -n argo get pods | grep Pending
kubectl -n argo describe pod my-workflow-xxxxx-step-abc | grep -A10 Events
# Common causes in Events:
#   "0/3 nodes are available: 3 Insufficient cpu"
#   "0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims"
#   "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector"
Rule: Always set resource requests on every template. Templates that omit requests inherit the namespace LimitRange default — often nothing at all — so the scheduler treats their pods as needing zero resources and places them unpredictably. For GPU workloads, make sure the template's nodeSelector (or affinity) matches the labels on your GPU node pool.
# Always specify resources
container:
  resources:
    requests:
      memory: 256Mi
      cpu: 200m
    limits:
      memory: 1Gi
      cpu: 1000m
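For GPU steps, the request goes through the device plugin resource name, and scheduling additionally depends on node labels. A minimal sketch — the `node-pool: gpu-workers` label is an assumption, check your own with `kubectl get nodes --show-labels`:

```yaml
# GPU sketch — the node-pool label below is hypothetical; match your cluster's labels
container:
  resources:
    limits:
      nvidia.com/gpu: 1          # GPUs are requested via limits, not requests
nodeSelector:
  node-pool: gpu-workers         # assumed label for the GPU node pool
```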
Gotcha: Artifact Download Fails With "Access Denied"¶
Step 2 tries to download an artifact produced by Step 1. It fails with Access Denied or NoSuchKey.
# Check the artifact key path
argo get my-workflow-xxxxx -n argo
# Look at the outputs section — what path did step 1 write to?
# Check the S3 credentials secret
kubectl -n argo get secret s3-credentials -o yaml
# Check IAM/RBAC — does the workflow service account have S3 read access?
# For AWS: check IRSA annotation on the service account
kubectl -n argo get sa argo-workflow-sa -o yaml | grep eks.amazonaws.com
Rule: Test artifact configuration with a simple two-step workflow before building complex pipelines. Verify the S3 bucket exists, the credentials have read/write access, and the endpoint is correct. Use kubectl exec into the argo-server pod to test connectivity.
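A smoke test in that spirit can be as small as the sketch below (image and names are placeholders). If it completes, both writing to and reading from the configured artifact repository work:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-smoke-test-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: write
        template: producer
    - - name: read
        template: consumer
        arguments:
          artifacts:
          - name: msg
            from: "{{steps.write.outputs.artifacts.msg}}"
  - name: producer
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["echo hello > /tmp/msg.txt"]
    outputs:
      artifacts:
      - name: msg
        path: /tmp/msg.txt
  - name: consumer
    inputs:
      artifacts:
      - name: msg
        path: /tmp/msg.txt
    container:
      image: alpine:3.19
      command: [cat, /tmp/msg.txt]
```

If the read step fails here, the problem is in the repository config or credentials, not in your pipeline logic.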
Gotcha: withParam Fan-Out Creates Hundreds of Pods Simultaneously¶
You wrote a dynamic fan-out that processes 500 shards. All 500 pods start simultaneously. The Kubernetes API server is overwhelmed. Pods are OOM-killed due to resource exhaustion. Other workloads on the cluster are starved.
Rule: Always set parallelism at the workflow or template level when using withItems or withParam. Start with a conservative value (10-20), measure impact, then tune up.
spec:
  parallelism: 20                # workflow-level cap on concurrently running pods
  templates:
  - name: fan-out
    parallelism: 10              # template-level cap on this template's children; the lower cap wins here
    steps:
    - - name: list-shards
        template: list-shards
    - - name: process-all-shards
        template: process-shard
        withParam: "{{steps.list-shards.outputs.result}}"
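withParam expects the referenced output to be a JSON array string; each element becomes one `{{item}}`. A quick local sketch of the shape the fan-out consumes (shard names are placeholders):

```shell
# Build the JSON array a list-shards step would emit, and eyeball it
# before wiring it into withParam.
SHARDS="["
for i in 0 1 2 3 4; do
  SHARDS="${SHARDS}\"shard-${i}\","
done
SHARDS="${SHARDS%,}]"       # drop trailing comma, close the array
echo "$SHARDS"
```

In a real workflow the producing step just prints this string to stdout, and `{{steps.list-shards.outputs.result}}` picks it up.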
Pattern: CI/CD Pipeline as a Workflow¶
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ci-
spec:
  arguments:
    parameters:
    - name: repo
    - name: branch
    - name: commit-sha
  serviceAccountName: argo-ci-sa
  entrypoint: ci-pipeline
  templates:
  - name: ci-pipeline
    dag:
      tasks:
      - name: test
        template: run-tests
        arguments:
          parameters:
          - name: commit-sha
            value: "{{workflow.parameters.commit-sha}}"
      - name: build
        dependencies: [test]
        template: build-image
        arguments:
          parameters:
          - name: tag
            value: "{{workflow.parameters.commit-sha}}"
      - name: scan
        dependencies: [build]
        template: security-scan
        arguments:
          parameters:
          - name: image
            value: "ghcr.io/myorg/myapp:{{workflow.parameters.commit-sha}}"
      - name: push
        dependencies: [scan]
        template: push-image
        when: "{{workflow.parameters.branch}} == main"
  - name: run-tests
    inputs:
      parameters:
      - name: commit-sha
    container:
      image: python:3.11-slim
      command: [sh, -c]
      args:
      - |
        apt-get update -qq && apt-get install -y -qq git
        git clone {{workflow.parameters.repo}} /app
        cd /app && git checkout {{inputs.parameters.commit-sha}}
        pip install -r requirements.txt
        pytest --tb=short -q
  - name: build-image
    inputs:
      parameters:
      - name: tag
    container:
      image: gcr.io/kaniko-project/executor:latest
      command: [/kaniko/executor]
      args:
      - --dockerfile=Dockerfile
      - --destination=ghcr.io/myorg/myapp:{{inputs.parameters.tag}}
      - --cache=true
      volumeMounts:
      - name: docker-config
        mountPath: /kaniko/.docker
    volumes:
    - name: docker-config
      secret:
        secretName: docker-registry-secret
        items:
        - key: .dockerconfigjson
          path: config.json
  # (security-scan and push-image templates omitted)
Pattern: ML Training Pipeline With Conditional Deployment¶
- name: ml-pipeline
  dag:
    tasks:
    - name: prepare-data
      template: data-prep
    - name: train
      dependencies: [prepare-data]
      template: model-train
      arguments:
        artifacts:
        - name: training-data
          from: "{{tasks.prepare-data.outputs.artifacts.dataset}}"
    - name: evaluate
      dependencies: [train]
      template: model-eval
      arguments:
        artifacts:
        - name: model
          from: "{{tasks.train.outputs.artifacts.model}}"
    - name: deploy-to-staging
      dependencies: [evaluate]
      template: model-deploy
      when: "{{tasks.evaluate.outputs.parameters.f1-score}} >= 0.90"
      arguments:
        parameters:
        - name: environment
          value: staging
    - name: notify-low-performance
      dependencies: [evaluate]
      template: slack-notify
      when: "{{tasks.evaluate.outputs.parameters.f1-score}} < 0.90"
      arguments:
        parameters:
        - name: message
          value: "Model F1={{tasks.evaluate.outputs.parameters.f1-score}} below threshold 0.90"
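The f1-score comparisons above only work if the model-eval template exports the score as an output parameter. A sketch of that wiring — the image, script name, and file path are assumptions for illustration:

```yaml
- name: model-eval
  inputs:
    artifacts:
    - name: model
      path: /tmp/model
  container:
    image: myorg/eval:latest                # hypothetical image
    command: [sh, -c]
    args: ["python eval.py --model /tmp/model > /tmp/f1.txt"]
  outputs:
    parameters:
    - name: f1-score
      valueFrom:
        path: /tmp/f1.txt                   # file must contain only the numeric score
```

The controller reads the file's contents into the parameter, so any trailing log lines in it will break the numeric `when` comparison.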
Scenario: Failed Workflow — Retry From a Specific Step¶
A long-running ETL pipeline failed at the "load" step after 2 hours of successful extraction and transformation. You don't want to re-run extraction.
# Get the workflow name and failed node
argo get my-etl-workflow-xxxxx -n argo
# Look for failed nodes in the output
# Retry from the failed node only
argo retry my-etl-workflow-xxxxx -n argo \
--node-field-selector=displayName=load-to-warehouse
# If you want to retry from a checkpoint (restart only failed + downstream)
argo retry my-etl-workflow-xxxxx -n argo
# Note: by default, argo retry restarts only failed nodes
# Use --restart-successful to restart everything
argo retry my-etl-workflow-xxxxx -n argo --restart-successful
Scenario: CronWorkflow Missed Its Schedule¶
CronWorkflow was supposed to run at 2am but didn't.
# Check CronWorkflow status
argo cron list -n argo
kubectl -n argo get cronworkflow nightly-etl -o yaml | grep -E "lastScheduled|status"
# Check if it's suspended
argo cron get nightly-etl -n argo
# Common cause: startingDeadlineSeconds too low
# If the workflow controller was down at 2am and came back at 2:06am,
# and startingDeadlineSeconds=300 (5 min), the job is skipped
# Manual trigger
argo submit --from=cronwf/nightly-etl -n argo
# Resume if suspended
argo cron resume nightly-etl -n argo
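Raising startingDeadlineSeconds gives the controller a recovery window after downtime. A sketch using this scenario's name and schedule (the entrypoint and template body are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  startingDeadlineSeconds: 3600   # still start a missed 2am run if the controller is back within an hour
  concurrencyPolicy: Forbid       # don't stack a late run on top of the next scheduled one
  workflowSpec:
    entrypoint: etl               # entrypoint/template assumed for illustration
    templates:
    - name: etl
      container:
        image: alpine:3.19
        command: [sh, -c, "echo run etl"]
```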
Scenario: Workflow Pods Pile Up — etcd/API Server Under Pressure¶
Workflows complete but their pods are not deleted. After a week, there are 5000 completed pods in the argo namespace.
# Count stale pods
kubectl -n argo get pods --field-selector=status.phase=Succeeded | wc -l
kubectl -n argo get pods --field-selector=status.phase=Failed | wc -l
# Delete completed workflow resources
argo delete --completed -n argo
argo delete --older 7d -n argo
# Configure pod GC in workflow-controller-configmap
# to auto-delete pods on completion
# workflow-controller-configmap — podGC and ttlStrategy are workflow spec
# fields, so cluster-wide defaults go under workflowDefaults
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnWorkflowCompletion   # OnPodCompletion | OnPodSuccess | OnWorkflowCompletion | OnWorkflowSuccess
      ttlStrategy:
        secondsAfterCompletion: 86400    # 1 day
        secondsAfterSuccess: 3600        # 1 hour
        secondsAfterFailure: 604800      # 7 days (keep for debugging)
Emergency: Workflow Controller Crashed, Running Workflows Stalled¶
The workflow controller pod restarted. Existing workflows that were Running are now stuck — no new pods are being created, but the workflow resources show Running status.
# Check controller status
kubectl -n argo get pods -l app=workflow-controller
kubectl -n argo logs -l app=workflow-controller --previous # logs before crash
# Verify controller has recovered
kubectl -n argo rollout status deploy/workflow-controller
# Force controller to re-sync all running workflows
# The controller will pick up running workflows on restart automatically
# If not: restart the controller
kubectl -n argo rollout restart deploy/workflow-controller
# Check for stuck workflows after restart
argo list -n argo --status Running
# Running workflows should start progressing again within 30s
Useful One-Liners¶
# Count workflows by status
argo list -n argo -o json | jq 'group_by(.status.phase) | map({phase: .[0].status.phase, count: length})'
# Find all workflows older than 7 days
argo list -n argo --older 7d
# Get logs from all failed steps across a workflow
argo logs my-workflow-xxxxx -n argo 2>&1 | grep -E "Error|FAIL|Exception"
# Delete all failed workflows (careful!)
argo delete -n argo $(argo list -n argo --status Failed -o name)
# Submit with parameter file
argo submit workflow.yaml -n argo --parameter-file params.yaml
# Lint a workflow before submitting
argo lint workflow.yaml
# Show all artifact paths for a workflow
argo get my-workflow-xxxxx -n argo -o json | jq '.. | .artifacts? // empty | .[].s3.key'
# Watch the latest workflow in real time
argo watch @latest -n argo
# Check global parameters at runtime
argo get my-workflow-xxxxx -n argo -o json | jq '.spec.arguments.parameters'
# List WorkflowTemplates
argo template list -n argo
# Trigger CronWorkflow immediately
argo submit --from=cronwf/nightly-etl -n argo --watch