Decision Tree: Should I Automate This?¶
Category: Operational Decisions
Starting Question: "I keep doing this manual task — should I automate it?"
Estimated traversal: 3-5 minutes
Domains: automation, toil-reduction, SRE, platform-engineering
The Tree¶
I keep doing this manual task — should I automate it?
│
├── [Check 1] Has someone already automated this or does a tool exist?
│   ├── YES (existing tool, script, or SaaS feature covers this)
│   │   └── → ✅ USE EXISTING TOOL — do not rebuild; configure and document it
│   │
│   └── NO (confirmed nothing exists after research)
│       │
│       └── [Check 2] How often does this task occur?
│           │
│           ├── Daily or multiple times per day
│           │   ├── [Check 3] How long does it take each time?
│           │   │   ├── > 30 minutes/occurrence → HIGH VALUE — automate now
│           │   │   ├── 5–30 minutes/occurrence → likely worth automating
│           │   │   └── < 5 minutes/occurrence → still worth it if error-prone
│           │   │
│           │   └── [Check 4] Is the process well-defined and stable?
│           │       ├── YES (same steps every time, hasn't changed in > 3 months)
│           │       │   └── [Check 5] What is the blast radius if automation has a bug?
│           │       │       ├── LOW (affects only one service, fully reversible)
│           │       │       │   └── → ✅ AUTOMATE NOW
│           │       │       ├── MEDIUM (affects a team, reversible with effort)
│           │       │       │   └── → ✅ AUTOMATE with dry-run mode + human approval gate
│           │       │       └── HIGH (affects all regions, irreversible writes, PII)
│           │       │           └── → ✅ PARTIAL AUTOMATION (automate prep, human executes)
│           │       └── NO (changes every sprint, depends on context each time)
│           │           └── → ⚠️ DO NOT AUTOMATE YET — stabilize the process first
│           │
│           ├── Weekly
│           │   └── [Check 6] Is human error risk high? (requires exact syntax, ordering matters)
│           │       ├── YES (error has caused incidents before, or runbook has > 10 steps)
│           │       │   └── → ✅ AUTOMATE — error reduction justifies cost even at weekly cadence
│           │       └── NO (straightforward, rarely mis-executed)
│           │           └── [Check 7] Do you have time to build + maintain it?
│           │               ├── YES (< 1 day to build, < 1 hour/month to maintain)
│           │               │   └── → ✅ AUTOMATE (backlog it for next sprint)
│           │               └── NO → ✅ AUTOMATE LATER (log in backlog, revisit quarterly)
│           │
│           ├── Monthly
│           │   └── [Check 8] What is the cost of the manual process vs automation build?
│           │       │   (time to automate vs time saved over 12 months)
│           │       ├── Time to automate < 6× monthly task time → worth it
│           │       │   └── [Check 5] → apply blast radius check before proceeding
│           │       └── Time to automate > 6× monthly task time → probably not worth it
│           │           └── → ✅ AUTOMATE LATER or script partially (reduce steps, not eliminate)
│           │
│           └── Quarterly or rarer
│               ├── [Check 9] Is error risk catastrophic if done wrong?
│               │   ├── YES (DR restore, data migration, certificate rotation on all prod certs)
│               │   │   └── → ✅ PARTIAL AUTOMATION + mandatory checklist + second approver
│               │   └── NO (low stakes, rare, reversible)
│               │       └── → ✅ DO NOT AUTOMATE — write a good checklist instead
│               │
│               └── [Check 10] Is the task an annual compliance requirement?
│                   ├── YES (SOC 2 evidence collection, access review, etc.)
│                   │   └── → ✅ AUTOMATE (compliance automation pays for itself in audit time)
│                   └── NO → document well, do not automate
Node Details¶
Check 1: Research before building¶
Command/method:
# Search your internal tool registry / wiki
curl -s "https://wiki.internal/search?q=automate+task-name" | jq '.results[].title'
# Search GitHub org for existing scripts
gh search repos --owner your-org "task-name automation" --limit 10
# Check community tools
# - HashiCorp Terraform / Ansible for infra tasks
# - Rundeck / Temporal for workflow automation
# - GitHub Actions / ArgoCD workflows for deploy tasks
# - Datadog / Prometheus rules for alert-driven automation
# Note: "pip search" is disabled on PyPI; search https://pypi.org in a browser instead
brew search task-keyword
# Check if your cloud provider has a native feature
aws ssm list-documents --filters "Key=DocumentType,Values=Automation"
Check 2: Task frequency measurement¶
Command/method:
# Search Slack history for how often you've done this
# Search ops log
grep "task-keyword" /var/log/ops/ops-log.txt | wc -l
grep "task-keyword" /var/log/ops/ops-log.txt | \
awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# Check ticket/issue history
gh issue list --label "manual-task" --state closed --limit 100 | \
grep "task-name" | wc -l
# Calendar / PagerDuty incident history
pd incident list --statuses resolved --since "30 days ago" | grep "task-keyword"
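Check 3: Time per occurrence¶
Command/method: the tree's Check 3 asks how long each run of the task takes. A minimal measurement sketch; the log path and line format here are assumptions, not an existing standard:

```shell
# Hypothetical duration log: append one line per manual run, then average.
LOG="${LOG:-./ops-task-log.txt}"

start=$(date +%s)
# ... perform the manual task here ...
end=$(date +%s)
echo "$(date -u +%F) task-name $((end - start))" >> "$LOG"

# Average duration in seconds across all logged runs
awk '{sum += $3; n++} END { if (n) printf "%.0fs avg over %d runs\n", sum / n, n }' "$LOG"
```

A few weeks of entries is usually enough to answer Check 3 honestly instead of guessing.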
Check 4: Process stability assessment¶
Command/method:
# Check git history of the runbook for how often it changed
git log --oneline --follow runbooks/task-name.md | head -20
# Count runbook revision frequency
git log --since="3 months ago" --oneline runbooks/task-name.md | wc -l
Check 5: Blast radius assessment¶
Command/method:
# Map what the task touches
# - Which services / resources does it modify?
kubectl get all -n production -l "managed-by=task-name"
# - Is the output reversible?
# Example: adding a Kubernetes label is reversible; deleting a namespace is not
# - What is the scope? (one namespace, one cluster, all clusters?)
kubectl config get-contexts | wc -l # How many clusters would be affected?
# - Does it touch PII or financial data?
grep -r "pii\|personal\|payment\|credit" runbooks/task-name.md -i
# Test automation in staging first
kubectl config use-context staging
./automate-task.sh --dry-run
Check 6: Error risk in manual process¶
Command/method:
# Count how many steps are in the runbook
grep -c "^[0-9]\+\." runbooks/task-name.md
# Check if there are copy-paste-sensitive commands (long IDs, exact ordering)
grep -E "(arn:|account-id|cluster-id|secret)" runbooks/task-name.md | wc -l
# Search incident post-mortems for this task as a contributing factor
grep -r "human error\|manual.*error\|mis-executed" postmortems/ | grep "task-name"
Check 7: Maintenance cost estimation¶
Command/method: Think through the full lifecycle cost, not just build time.
What you're looking for:
- Build time: estimate hours to write, test, review, and deploy the automation
- Ongoing maintenance: does it have dependencies that will break (API changes, auth rotation, schema changes)?
- On-call burden: what happens when the automation fails at 2am?
- Documentation: time to write a runbook for the automation itself
Common pitfall: Estimating build time only. A 4-hour script that breaks every month and requires 2 hours to debug each time costs more than the manual task over a year.
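The lifecycle comparison is easiest to get right by writing the arithmetic down. A rough break-even sketch; all numbers are placeholders to substitute with your own measurements:

```shell
# Placeholder inputs: substitute your own measurements.
TASK_MIN=20          # minutes per occurrence of the manual task
PER_MONTH=4          # occurrences per month (here: weekly)
BUILD_HOURS=4        # estimated hours to write, test, review, deploy
MAINT_MIN_MONTH=60   # estimated minutes/month spent maintaining the automation

saved_min_month=$((TASK_MIN * PER_MONTH))
net_min_month=$((saved_min_month - MAINT_MIN_MONTH))

if [ "$net_min_month" -gt 0 ]; then
  # Ceiling division: months until build time is paid back by net savings
  echo "Break-even after $(( (BUILD_HOURS * 60 + net_min_month - 1) / net_min_month )) months"
else
  echo "Never breaks even: maintenance eats the savings"
fi
```

With these placeholder numbers the automation breaks even only after 12 months, which is exactly the trap the pitfall above describes: the maintenance term dominates the savings.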
Terminal Actions¶
✅ Action: Automate Now¶
Do:
# 1. Write the automation with a mandatory --dry-run mode
cat > automate-task.sh << 'EOF'
#!/bin/bash
set -euo pipefail
DRY_RUN=${DRY_RUN:-false}
COMMAND=${COMMAND:?set COMMAND to the action this script should run}
if [[ "$DRY_RUN" == "true" ]]; then
  echo "[DRY RUN] Would execute: $COMMAND"
else
  eval "$COMMAND"
fi
EOF
chmod +x automate-task.sh
# 2. Test in non-prod environment first
DRY_RUN=true ./automate-task.sh
kubectl config use-context staging && ./automate-task.sh
# 3. Add to CI/CD or cron with alerting on failure
kubectl apply -f - <<CRONEOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: automate-task
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: task
            image: task-runner:latest
          restartPolicy: OnFailure
CRONEOF
# 4. Write a runbook for the automation itself (how to debug when it fails)
# 5. Add a monitoring alert for automation failure
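For step 5, one option is a Prometheus alert on failed Jobs spawned by the CronJob. A sketch, assuming kube-state-metrics is scraped by Prometheus and the CronJob is named automate-task as above:

```yaml
# Hypothetical alerting rule; metric comes from kube-state-metrics.
groups:
  - name: automation-failures
    rules:
      - alert: AutomateTaskFailed
        expr: kube_job_status_failed{job_name=~"automate-task-.*"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "automate-task CronJob has a failed job"
```

Load this as a Prometheus rule file, or wrap it in a PrometheusRule resource if you run the Prometheus Operator.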
✅ Action: Automate with Dry-Run + Human Approval Gate¶
Do:
# 1. Automation generates a plan / diff, does not execute
./automate-task.sh --plan > /tmp/task-plan-$(date +%Y%m%d).txt
# 2. Human reviews the plan
cat /tmp/task-plan-$(date +%Y%m%d).txt
# 3. Human approves execution
read -p "Approve execution? (yes/no): " approval
if [[ "$approval" == "yes" ]]; then
  ./automate-task.sh --execute
fi
# 4. Or use a GitHub Actions manual approval gate
# (environment protection rules with required reviewers)
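The approval gate in step 4 can be sketched as a workflow; the environment name and script path are assumptions, and the gate only blocks if required reviewers are configured on the production environment in the repository settings:

```yaml
# Hypothetical manually-triggered workflow with an approval-gated job
name: automate-task
on: workflow_dispatch          # run manually from the Actions tab
jobs:
  execute:
    runs-on: ubuntu-latest
    environment: production    # pauses for approval when reviewers are required
    steps:
      - uses: actions/checkout@v4
      - run: ./automate-task.sh --execute
```

This keeps the plan/approve/execute loop in one auditable place instead of a terminal session.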
✅ Action: Automate Later — Add to Backlog¶
Do:
# Create a tracked issue with time/error cost documented
gh issue create --repo org/platform \
--title "Automate: [task-name] (weekly, 20 min, error-prone)" \
--label "toil-reduction,automation" \
--body "Frequency: weekly. Time cost: 20 min/occurrence. Error incidents: 2 in last 6 months. Estimated build time: 4 hours. ROI positive after 3 months."
# Add to sprint backlog or quarterly OKR
✅ Action: Use Existing Tool¶
Do:
# 1. Install / configure the existing tool
# 2. Validate it covers your use case
# 3. Document the configuration in your team's runbook
# 4. Do NOT build a wrapper around it just to "make it fit" — adapt your process
# Example: using AWS SSM Run Command instead of SSH + manual script
aws ssm send-command \
--document-name "AWS-RunShellScript" \
--targets "Key=tag:Name,Values=prod-workers" \
--parameters 'commands=["systemctl restart myservice"]'
✅ Action: Partial Automation (High Blast Radius)¶
Do:
# Automate the data gathering and verification steps (low risk)
./automate-task.sh --gather-info > /tmp/task-context.txt
./automate-task.sh --validate --dry-run >> /tmp/task-context.txt
# Keep the execution step manual with the context pre-populated
cat /tmp/task-context.txt
echo "Review the above. Execute manually: $FINAL_COMMAND"
⚠️ Warning: Do Not Automate — Unstable Process¶
When: The runbook changes every sprint, the tool/API it calls is changing, or the task itself is being re-evaluated.
Risk: Automation built on an unstable process becomes a maintenance burden that slows down process changes. You end up maintaining the automation instead of improving the process.
Mitigation: Write a clean, up-to-date runbook instead. Revisit automation after the process has been stable for 3+ months.
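The 3-month stability bar can be checked mechanically before revisiting, reusing the runbook-churn command from Check 4; the runbook path is a placeholder:

```shell
# Count runbook changes in the stability window; any churn means "not yet".
CHANGES=$(git log --since="3 months ago" --oneline runbooks/task-name.md 2>/dev/null | wc -l)

if [ "$CHANGES" -eq 0 ]; then
  echo "Stable for 3+ months: revisit automation"
else
  echo "Still churning ($CHANGES changes): keep it manual for now"
fi
```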
Edge Cases¶
- The task is "automate the on-call workflow": On-call workflows often require contextual judgment that defeats simple automation. Partial automation (auto-gather diagnostics, auto-remediate known patterns) is appropriate. Full automation of incident response is high-risk.
- Automation requires elevated permissions: If the automation needs prod credentials or elevated IAM roles, the security posture of the automation host becomes critical. Scope permissions narrowly and use short-lived credentials.
- You're the only person who can maintain it: Automation that only one person understands is a single point of failure. Include a bus-factor check: could a new team member debug this automation at 2am?
- The task is infrequent but has a hard deadline: Annual access reviews, quarterly DR tests, and compliance tasks have calendar-enforced deadlines. Even if the ROI math doesn't pencil out, automation is justified to ensure the deadline isn't missed.
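The elevated-permissions edge case usually means minting short-lived credentials per run instead of storing long-lived keys on the automation host. A sketch using AWS STS; the role ARN is a placeholder and the command is only composed here, not executed:

```shell
# Hypothetical narrowly-scoped role the automation is allowed to assume.
ROLE_ARN="arn:aws:iam::123456789012:role/automate-task"
SESSION="automate-task-$(date +%s)"

# 900 seconds is the minimum STS session duration; in real use, pipe the JSON
# output through jq to export AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY /
# AWS_SESSION_TOKEN for the automation's environment.
CMD="aws sts assume-role --role-arn $ROLE_ARN --role-session-name $SESSION --duration-seconds 900"
echo "$CMD"
```

Credentials expire on their own, so a compromised automation host leaks at most a 15-minute window rather than a permanent key.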
Cross-References¶
- Topic Packs: Automation, Toil Reduction
- Runbooks: automation-standards.md
- Related trees: config-change.md, scale-or-optimize.md