Argo Workflows Footguns¶
Mistakes that cause workflow failures, data loss, or cluster instability with Argo Workflows.
1. No Resource Requests — Pods Evicted Under Pressure¶
Templates with no resources.requests get scheduled with BestEffort QoS. Under node memory pressure, they're the first to be evicted. A 4-hour data processing workflow gets evicted at hour 3. There's no retry, so the workflow fails. You lose the intermediate artifacts.
Fix: Always set resources.requests and resources.limits on every container template. Set at least memory: 256Mi, cpu: 100m for small steps, higher for compute-intensive ones. Use retryStrategy with retryPolicy: OnError to catch eviction-related failures.
Under the hood: BestEffort QoS pods are always evicted first, but even Burstable pods (requests set but lower than limits) are evicted before Guaranteed pods (requests == limits). For long-running workflow steps, set `requests == limits` to get Guaranteed QoS and survive node pressure events.
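A minimal template sketch combining both fixes (image, script path, and names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etl-
spec:
  entrypoint: process
  templates:
    - name: process
      retryStrategy:
        retryPolicy: OnError   # retry on eviction and other pod-level errors
        limit: "3"
      container:
        image: python:3.11-slim
        command: [python, /app/process.py]
        resources:
          requests:            # requests == limits gives Guaranteed QoS
            memory: 1Gi
            cpu: "1"
          limits:
            memory: 1Gi
            cpu: "1"
```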
2. withParam Fan-Out Without parallelism Limit¶
Your workflow discovers 2000 items and fans out with withParam. All 2000 pods start simultaneously. The cluster's API server rate-limits pod creation. Nodes can't pull 2000 images at once. The cluster's coredns pods get starved of resources. Other production workloads become unreachable. You've caused a cluster-wide incident with a workflow.
Fix: Always set parallelism at both the workflow level (spec.parallelism) and the template level (inside the template that uses withParam). Start at 10-20 and tune. Never submit a dynamic fan-out workflow to production without first testing with a small parameter list.
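A sketch of both limits, assuming a `process` template and an `items` workflow parameter defined elsewhere:

```yaml
spec:
  entrypoint: fan-out
  parallelism: 20          # hard cap across the whole workflow
  templates:
    - name: fan-out
      parallelism: 10      # cap for this fan-out step specifically
      steps:
        - - name: process-item
            template: process
            withParam: "{{workflow.parameters.items}}"
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
```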
3. Secrets in Workflow Spec Parameters¶
You pass a database password as a workflow parameter:
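Something like this hypothetical fragment:

```yaml
spec:
  arguments:
    parameters:
      - name: db-password      # DON'T: this value is stored in plain text
        value: "s3cr3t"
```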
The password is now stored in the Workflow resource spec in etcd, visible to anyone with `kubectl get workflow -n argo -o yaml`, logged in audit logs, and visible in the Argo UI.
Fix: Never pass secrets as workflow parameters. Reference secrets via env.valueFrom.secretKeyRef in the container spec, or mount secrets as volumes. Workflow parameters are for non-sensitive configuration only.
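A sketch of the `secretKeyRef` approach, assuming a Secret named `db-credentials` exists in the namespace:

```yaml
container:
  image: python:3.11-slim
  env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials   # assumed Secret name
          key: password
```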
4. Long-Running Single-Step Workflows Instead of DAGs¶
You write a workflow with one step that runs a bash script for 6 hours. The script does extract → transform → load sequentially. The pod is evicted at hour 5. You restart from zero.
Fix: Break long workflows into checkpointed DAG steps with artifacts between them. When a step fails, only that step is retried. Artifacts from completed steps persist. A 6-hour ETL becomes three 2-hour steps — eviction at hour 5 means re-running only the last step (2 hours), not all 6.
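A sketch of the checkpointed shape, assuming `extract`, `transform`, and `load` container templates that declare the named output artifacts:

```yaml
templates:
  - name: etl
    dag:
      tasks:
        - name: extract
          template: extract
        - name: transform
          template: transform
          dependencies: [extract]
          arguments:
            artifacts:
              - name: raw
                from: "{{tasks.extract.outputs.artifacts.raw}}"
        - name: load
          template: load
          dependencies: [transform]
          arguments:
            artifacts:
              - name: clean
                from: "{{tasks.transform.outputs.artifacts.clean}}"
```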
5. Not Setting podGC Strategy — etcd Bloat¶
You run hundreds of workflows per day. Each workflow creates dozens of pods. Completed pods accumulate. After a month, there are 100,000 pod records in etcd. API server list operations slow to a crawl. kubectl get pods -n argo times out. The etcd compaction cycle struggles to keep up.
Fix: Configure pod garbage collection globally:
```yaml
# workflow-controller-configmap
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnWorkflowCompletion
      ttlStrategy:
        secondsAfterCompletion: 3600
        secondsAfterSuccess: 1800
        secondsAfterFailure: 604800
```
Additionally, configure `retentionPolicy` in the controller configmap to limit stored workflow history.
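For example (values illustrative; `retentionPolicy` caps how many completed Workflow objects the controller keeps per status):

```yaml
# workflow-controller-configmap
data:
  retentionPolicy: |
    completed: 100
    failed: 50
    errored: 50
```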
6. Artifact Path Collisions in Concurrent Workflows¶
Two concurrent runs of the same CronWorkflow write to the same S3 key path. The artifact from run A is overwritten by run B mid-pipeline. Step 3 of run A reads the artifact and gets run B's data. Results are corrupted.
Fix: Include the workflow name or timestamp in artifact paths:
```yaml
outputs:
  artifacts:
    - name: dataset
      path: /tmp/dataset.parquet
      s3:
        key: "etl/{{workflow.name}}/{{workflow.creationTimestamp}}/dataset.parquet"
```
Use `generateName` instead of a fixed `name` to ensure unique workflow names, which then produce unique artifact paths by default.
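A minimal example:

```yaml
metadata:
  generateName: nightly-etl-   # each submission gets a unique suffix, e.g. nightly-etl-x7k2p
```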
7. Missing serviceAccountName — Default SA Has No Permissions¶
Argo workflow pods need to report step results back to the controller (by creating and patching workflowtaskresults) and read secrets for artifact credentials. The default service account in most namespaces has none of these permissions. Your workflow starts, the main container completes, but the workflow controller can't retrieve the exit code. The step hangs in Running state until it times out.
Fix: Always specify a serviceAccountName in your WorkflowSpec. Create a dedicated service account with the minimum permissions the Argo controller needs (patch pods, get pods/log, patch workflowtaskresults). Apply it as the default in workflowDefaults in the controller configmap.
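A sketch of the executor RBAC (the verbs follow the list above; verify the exact requirements against your Argo version and executor):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workflow-runner
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-runner
rules:
  - apiGroups: [""]
    resources: [pods]
    verbs: [get, patch]
  - apiGroups: [""]
    resources: [pods/log]
    verbs: [get, watch]
  - apiGroups: [argoproj.io]
    resources: [workflowtaskresults]
    verbs: [create, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: workflow-runner
subjects:
  - kind: ServiceAccount
    name: workflow-runner
```

Reference it with `serviceAccountName: workflow-runner` in the WorkflowSpec.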
8. Using retryPolicy: Always on Non-Idempotent Steps¶
Your workflow step sends an email, charges a customer, or writes a record to a database without deduplication. You set retryPolicy: Always because you want resilience. The step fails transiently. It retries 5 times. You send 5 emails, charge the customer 5 times, or insert 5 duplicate records.
Fix: Distinguish between idempotent and non-idempotent steps. Non-idempotent steps should use retryPolicy: OnError (infrastructure failures only) or have no retry. Idempotent steps can safely use retryPolicy: Always. Add idempotency keys or deduplication logic to critical side-effecting operations.
Gotcha: `retryPolicy: OnError` retries only steps whose status is Error (pod-level problems such as eviction, deletion, or controller errors), not steps whose main container exits non-zero; those are marked Failed. `retryPolicy: Always` retries on both Error and Failed status, which includes application-level failures. If your step returns exit code 1 for "invalid input" (a condition that will never succeed on retry), `Always` will retry it uselessly. Use `retryPolicy: OnError` and handle transient vs. permanent failures in your application.
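A sketch of a conservative retry policy for a non-idempotent step:

```yaml
retryStrategy:
  retryPolicy: OnError   # pod-level errors only (eviction, node failure)
  limit: "3"
  backoff:
    duration: "30s"
    factor: "2"
```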
9. Not Using WorkflowTemplate — Copy-Paste Template Hell¶
Every workflow copies the same build-image, run-tests, push-image template blocks. 15 workflows × 5 templates = 75 YAML blocks to update when the base image changes. Someone updates 13 out of 15 and misses 2. Those 2 workflows now build with the old base image that has a CVE.
Fix: Use WorkflowTemplate for shared step definitions. Reference them via templateRef in all consuming workflows. A single change to the WorkflowTemplate propagates to all consumers on next run.
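A sketch, with illustrative names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: ci-steps
spec:
  templates:
    - name: run-tests
      container:
        image: python:3.11-slim
        command: [pytest, /app/tests]
---
# Consumer workflow references the shared template:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ci-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: test
            templateRef:
              name: ci-steps
              template: run-tests
```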
10. CronWorkflow concurrencyPolicy: Allow With Slow Workflows¶
Your CronWorkflow runs every 15 minutes. Sometimes the workflow takes 20 minutes. With concurrencyPolicy: Allow, a second instance starts at 15min while the first is still running. Now two instances compete for the same database rows, write conflicting outputs, and both fail at 20 minutes. Then a third starts...
Fix: Use concurrencyPolicy: Forbid for workflows that process shared state, databases, or exclusive resources. Use concurrencyPolicy: Replace only when the old run's output is irrelevant (pure recomputation). Set startingDeadlineSeconds to avoid queued backlog when the controller restarts.
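A sketch of a safe configuration for a workflow that touches shared state (schedule and names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: db-sync
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid       # skip this tick if the previous run is still going
  startingDeadlineSeconds: 300    # don't backfill ticks missed more than 5 min ago
  workflowSpec:
    entrypoint: sync
    templates:
      - name: sync
        container:
          image: python:3.11-slim
          command: [python, /app/sync.py]
```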
11. Hardcoding Image Tags as latest¶
latest is a mutable tag. Today's workflow uses Python 3.11. Next week, latest resolves to Python 3.13 which has a breaking change. Your ETL workflow silently starts producing wrong results. Debugging is hard because the workflow YAML hasn't changed — only the image changed.
Fix: Always pin image tags to an immutable digest or specific version. Use python:3.11-slim at minimum, python:3.11-slim@sha256:abc123... for strict reproducibility. Pin base images in workflows just like you would in production Deployments.
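Pinning might look like this (the digest is a placeholder; resolve the real one from your registry):

```yaml
container:
  # placeholder digest -- look up the real value with your registry or build tooling
  image: python:3.11-slim@sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
```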
12. Overlooking Workflow RBAC — Any User Can Run Any Workflow¶
By default, Argo Workflows UI may allow all authenticated users to submit workflows in any namespace. A developer submits a resource-intensive ML training workflow in the production namespace. It runs as the production service account, which has broad cluster permissions.
Fix: Enable SSO and RBAC in the Argo Server so UI users map to Kubernetes identities with scoped permissions. Use namespace-level RBAC to control which namespaces and service accounts users can target, and to prevent developers from submitting workflows to production namespaces. Treat the Argo UI as a control plane that requires the same access controls as kubectl.