Argo Workflows — Street-Level Ops¶
Quick Diagnosis Commands¶
# What workflows are running / failed?
argo list -n argo --status Running
argo list -n argo --status Failed
# Detailed workflow state
argo get my-workflow-xxxxx -n argo
# Logs for a specific failed step (pass the step's pod name as the second argument)
argo logs my-workflow-xxxxx my-workflow-xxxxx-step-name -n argo
# Workflow controller logs (scheduling errors, resource issues)
kubectl -n argo logs -l app=workflow-controller --tail=100 -f
# Argo server logs (API / UI errors)
kubectl -n argo logs -l app=argo-server --tail=100
# Check resource usage of running workflow pods
kubectl -n argo top pods -l workflows.argoproj.io/workflow=my-workflow-xxxxx
# Inspect a failed pod directly
kubectl -n argo describe pod my-workflow-xxxxx-step-abc123
# Check artifact repository connectivity
kubectl -n argo get configmap workflow-controller-configmap -o yaml | grep -A20 artifactRepository
Gotcha: Workflow Hangs in "Pending" — Pod Never Scheduled¶
A workflow step sits in Pending state. The Pod was created but never scheduled.
# Find the pending pod
kubectl -n argo get pods | grep Pending
kubectl -n argo describe pod my-workflow-xxxxx-step-abc | grep -A10 Events
# Common causes in Events:
#   "0/3 nodes are available: 3 Insufficient cpu"
#   "0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims"
#   "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector"
Rule: Always set resource requests on every template. Templates that omit requests inherit the namespace LimitRange default — often nothing at all — so the scheduler treats their pods as needing zero resources and places them unpredictably. For GPU workloads, make sure the template's nodeSelector (or affinity) matches the labels on your GPU node pool.
# Always specify resources
container:
  resources:
    requests:
      memory: 256Mi
      cpu: 200m
    limits:
      memory: 1Gi
      cpu: 1000m
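For GPU steps, the request goes through the device plugin resource name, and scheduling additionally depends on node labels. A minimal sketch — the `node-pool: gpu-workers` label is an assumption, check your own with `kubectl get nodes --show-labels`:

```yaml
# GPU sketch — the node-pool label below is hypothetical; match your cluster's labels
container:
  resources:
    limits:
      nvidia.com/gpu: 1          # GPUs are requested via limits, not requests
nodeSelector:
  node-pool: gpu-workers         # assumed label for the GPU node pool
```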
Gotcha: Artifact Download Fails With "Access Denied"¶
Step 2 tries to download an artifact produced by Step 1. It fails with Access Denied or NoSuchKey.
# Check the artifact key path
argo get my-workflow-xxxxx -n argo
# Look at the outputs section — what path did step 1 write to?
# Check the S3 credentials secret
kubectl -n argo get secret s3-credentials -o yaml
# Check IAM/RBAC — does the workflow service account have S3 read access?
# For AWS: check IRSA annotation on the service account
kubectl -n argo get sa argo-workflow-sa -o yaml | grep eks.amazonaws.com
Rule: Test artifact configuration with a simple two-step workflow before building complex pipelines. Verify the S3 bucket exists, the credentials have read/write access, and the endpoint is correct. Use kubectl exec into the argo-server pod to test connectivity.
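A smoke test in that spirit can be as small as the sketch below (image and names are placeholders). If it completes, both writing to and reading from the configured artifact repository work:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-smoke-test-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: write
        template: producer
    - - name: read
        template: consumer
        arguments:
          artifacts:
          - name: msg
            from: "{{steps.write.outputs.artifacts.msg}}"
  - name: producer
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["echo hello > /tmp/msg.txt"]
    outputs:
      artifacts:
      - name: msg
        path: /tmp/msg.txt
  - name: consumer
    inputs:
      artifacts:
      - name: msg
        path: /tmp/msg.txt
    container:
      image: alpine:3.19
      command: [cat, /tmp/msg.txt]
```

If the read step fails here, the problem is in the repository config or credentials, not in your pipeline logic.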
Gotcha: withParam Fan-Out Creates Hundreds of Pods Simultaneously¶
You wrote a dynamic fan-out that processes 500 shards. All 500 pods start simultaneously. The Kubernetes API server is overwhelmed. Pods are OOM-killed due to resource exhaustion. Other workloads on the cluster are starved.
Rule: Always set parallelism at the workflow or template level when using withItems or withParam. Start with a conservative value (10-20), measure impact, then tune up.
spec:
  parallelism: 20                # workflow-level cap on concurrently running pods
  templates:
  - name: fan-out
    parallelism: 10              # template-level cap on this template's children; the lower cap wins here
    steps:
    - - name: list-shards
        template: list-shards
    - - name: process-all-shards
        template: process-shard
        withParam: "{{steps.list-shards.outputs.result}}"
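withParam expects the referenced output to be a JSON array string; each element becomes one `{{item}}`. A quick local sketch of the shape the fan-out consumes (shard names are placeholders):

```shell
# Build the JSON array a list-shards step would emit, and eyeball it
# before wiring it into withParam.
SHARDS="["
for i in 0 1 2 3 4; do
  SHARDS="${SHARDS}\"shard-${i}\","
done
SHARDS="${SHARDS%,}]"       # drop trailing comma, close the array
echo "$SHARDS"
```

In a real workflow the producing step just prints this string to stdout, and `{{steps.list-shards.outputs.result}}` picks it up.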
Pattern: CI/CD Pipeline as a Workflow¶
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ci-
spec:
  arguments:
    parameters:
    - name: repo
    - name: branch
    - name: commit-sha
  serviceAccountName: argo-ci-sa
  entrypoint: ci-pipeline
  templates:
  - name: ci-pipeline
    dag:
      tasks:
      - name: test
        template: run-tests
        arguments:
          parameters:
          - name: commit-sha
            value: "{{workflow.parameters.commit-sha}}"
      - name: build
        dependencies: [test]
        template: build-image
        arguments:
          parameters:
          - name: tag
            value: "{{workflow.parameters.commit-sha}}"
      - name: scan
        dependencies: [build]
        template: security-scan
        arguments:
          parameters:
          - name: image
            value: "ghcr.io/myorg/myapp:{{workflow.parameters.commit-sha}}"
      - name: push
        dependencies: [scan]
        template: push-image
        when: "{{workflow.parameters.branch}} == main"
  - name: run-tests
    inputs:
      parameters:
      - name: commit-sha
    container:
      image: python:3.11-slim
      command: [sh, -c]
      args:
      - |
        apt-get update -qq && apt-get install -y -qq git
        git clone {{workflow.parameters.repo}} /app
        cd /app && git checkout {{inputs.parameters.commit-sha}}
        pip install -r requirements.txt
        pytest --tb=short -q
  - name: build-image
    inputs:
      parameters:
      - name: tag
    container:
      image: gcr.io/kaniko-project/executor:latest
      command: [/kaniko/executor]
      args:
      - --dockerfile=Dockerfile
      - --destination=ghcr.io/myorg/myapp:{{inputs.parameters.tag}}
      - --cache=true
      volumeMounts:
      - name: docker-config
        mountPath: /kaniko/.docker
    volumes:
    - name: docker-config
      secret:
        secretName: docker-registry-secret
        items:
        - key: .dockerconfigjson
          path: config.json
  # (security-scan and push-image templates omitted)
Pattern: ML Training Pipeline With Conditional Deployment¶
- name: ml-pipeline
  dag:
    tasks:
    - name: prepare-data
      template: data-prep
    - name: train
      dependencies: [prepare-data]
      template: model-train
      arguments:
        artifacts:
        - name: training-data
          from: "{{tasks.prepare-data.outputs.artifacts.dataset}}"
    - name: evaluate
      dependencies: [train]
      template: model-eval
      arguments:
        artifacts:
        - name: model
          from: "{{tasks.train.outputs.artifacts.model}}"
    - name: deploy-to-staging
      dependencies: [evaluate]
      template: model-deploy
      when: "{{tasks.evaluate.outputs.parameters.f1-score}} >= 0.90"
      arguments:
        parameters:
        - name: environment
          value: staging
    - name: notify-low-performance
      dependencies: [evaluate]
      template: slack-notify
      when: "{{tasks.evaluate.outputs.parameters.f1-score}} < 0.90"
      arguments:
        parameters:
        - name: message
          value: "Model F1={{tasks.evaluate.outputs.parameters.f1-score}} below threshold 0.90"
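The f1-score comparisons above only work if the model-eval template exports the score as an output parameter. A sketch of that wiring — the image, script name, and file path are assumptions for illustration:

```yaml
- name: model-eval
  inputs:
    artifacts:
    - name: model
      path: /tmp/model
  container:
    image: myorg/eval:latest                # hypothetical image
    command: [sh, -c]
    args: ["python eval.py --model /tmp/model > /tmp/f1.txt"]
  outputs:
    parameters:
    - name: f1-score
      valueFrom:
        path: /tmp/f1.txt                   # file must contain only the numeric score
```

The controller reads the file's contents into the parameter, so any trailing log lines in it will break the numeric `when` comparison.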
Scenario: Failed Workflow — Retry From a Specific Step¶
A long-running ETL pipeline failed at the "load" step after 2 hours of successful extraction and transformation. You don't want to re-run extraction.
# Get the workflow name and failed node
argo get my-etl-workflow-xxxxx -n argo
# Look for failed nodes in the output
# Retry from the failed node only
argo retry my-etl-workflow-xxxxx -n argo \
--node-field-selector=displayName=load-to-warehouse
# If you want to retry from a checkpoint (restart only failed + downstream)
argo retry my-etl-workflow-xxxxx -n argo
# Note: by default, argo retry restarts only failed nodes
# Use --restart-successful to restart everything
argo retry my-etl-workflow-xxxxx -n argo --restart-successful
Scenario: CronWorkflow Missed Its Schedule¶
CronWorkflow was supposed to run at 2am but didn't.
# Check CronWorkflow status
argo cron list -n argo
kubectl -n argo get cronworkflow nightly-etl -o yaml | grep -E "lastScheduled|status"
# Check if it's suspended
argo cron get nightly-etl -n argo
# Common cause: startingDeadlineSeconds too low
# If the workflow controller was down at 2am and came back at 2:06am,
# and startingDeadlineSeconds=300 (5 min), the job is skipped
# Manual trigger
argo submit --from=cronwf/nightly-etl -n argo
# Resume if suspended
argo cron resume nightly-etl -n argo
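Raising startingDeadlineSeconds gives the controller a recovery window after downtime. A sketch using this scenario's name and schedule (the entrypoint and template body are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  startingDeadlineSeconds: 3600   # still start a missed 2am run if the controller is back within an hour
  concurrencyPolicy: Forbid       # don't stack a late run on top of the next scheduled one
  workflowSpec:
    entrypoint: etl               # entrypoint/template assumed for illustration
    templates:
    - name: etl
      container:
        image: alpine:3.19
        command: [sh, -c, "echo run etl"]
```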
Scenario: Workflow Pods Pile Up — etcd/API Server Under Pressure¶
Workflows complete but their pods are not deleted. After a week, there are 5000 completed pods in the argo namespace.
# Count stale pods
kubectl -n argo get pods --field-selector=status.phase=Succeeded | wc -l
kubectl -n argo get pods --field-selector=status.phase=Failed | wc -l
# Delete completed workflow resources
argo delete --completed -n argo
argo delete --older 7d -n argo
# Configure pod GC in workflow-controller-configmap
# to auto-delete pods on completion
# workflow-controller-configmap — podGC and ttlStrategy are workflow spec
# fields, so cluster-wide defaults go under workflowDefaults
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnWorkflowCompletion   # OnPodCompletion | OnPodSuccess | OnWorkflowCompletion | OnWorkflowSuccess
      ttlStrategy:
        secondsAfterCompletion: 86400    # 1 day
        secondsAfterSuccess: 3600        # 1 hour
        secondsAfterFailure: 604800      # 7 days (keep for debugging)
Emergency: Workflow Controller Crashed, Running Workflows Stalled¶
The workflow controller pod restarted. Existing workflows that were Running are now stuck — no new pods are being created, but the workflow resources show Running status.
# Check controller status
kubectl -n argo get pods -l app=workflow-controller
kubectl -n argo logs -l app=workflow-controller --previous # logs before crash
# Verify controller has recovered
kubectl -n argo rollout status deploy/workflow-controller
# Force controller to re-sync all running workflows
# The controller will pick up running workflows on restart automatically
# If not: restart the controller
kubectl -n argo rollout restart deploy/workflow-controller
# Check for stuck workflows after restart
argo list -n argo --status Running
# Running workflows should start progressing again within 30s
Useful One-Liners¶
# Count workflows by status
argo list -n argo -o json | jq 'group_by(.status.phase) | map({phase: .[0].status.phase, count: length})'
# Find all workflows older than 7 days
argo list -n argo --older 7d
# Get logs from all failed steps across a workflow
argo logs my-workflow-xxxxx -n argo 2>&1 | grep -E "Error|FAIL|Exception"
# Delete all failed workflows (careful!)
argo delete -n argo $(argo list -n argo --status Failed -o name)
# Submit with parameter file
argo submit workflow.yaml -n argo --parameter-file params.yaml
# Lint a workflow before submitting
argo lint workflow.yaml
# Show all artifact paths for a workflow
argo get my-workflow-xxxxx -n argo -o json | jq '.. | .artifacts? // empty | .[].s3.key'
# Watch the latest workflow in real time
argo watch @latest -n argo
# Check global parameters at runtime
argo get my-workflow-xxxxx -n argo -o json | jq '.spec.arguments.parameters'
# List WorkflowTemplates
argo template list -n argo
# Trigger CronWorkflow immediately
argo submit --from=cronwf/nightly-etl -n argo --watch