Decision Tree: Should I Automate This?

Category: Operational Decisions
Starting Question: "I keep doing this manual task — should I automate it?"
Estimated traversal: 3–5 minutes
Domains: automation, toil-reduction, SRE, platform-engineering


The Tree

I keep doing this manual task — should I automate it?
├── [Check 1] Has someone already automated this or does a tool exist?
│   ├── YES (existing tool, script, or SaaS feature covers this)
│   │   └── → ✅ USE EXISTING TOOL — do not rebuild, configure and document it
│   │
│   └── NO (confirmed nothing exists after research)
│       │
│       ├── [Check 2] How often does this task occur?
│       │   │
│       │   ├── Daily or multiple times per day
│       │   │   ├── [Check 3] How long does it take each time?
│       │   │   │   ├── > 30 minutes/occurrence → HIGH VALUE — automate now
│       │   │   │   ├── 5–30 minutes/occurrence → likely worth automating
│       │   │   │   └── < 5 minutes/occurrence → still worth it if error-prone
│       │   │   │
│       │   │   ├── [Check 4] Is the process well-defined and stable?
│       │   │   │   ├── YES (same steps every time, hasn't changed in > 3 months)
│       │   │   │   │   ├── [Check 5] What is the blast radius if automation has a bug?
│       │   │   │   │   │   ├── LOW (affects only one service, fully reversible)
│       │   │   │   │   │   │   └── → ✅ AUTOMATE NOW
│       │   │   │   │   │   ├── MEDIUM (affects a team, reversible with effort)
│       │   │   │   │   │   │   └── → ✅ AUTOMATE with dry-run mode + human approval gate
│       │   │   │   │   │   └── HIGH (affects all regions, irreversible writes, PII)
│       │   │   │   │   │       └── → ✅ PARTIAL AUTOMATION (automate prep, human executes)
│       │   │   │   └── NO (changes every sprint, depends on context each time)
│       │   │   │       └── → ⚠️ DO NOT AUTOMATE YET — stabilize process first
│       │   │
│       │   ├── Weekly
│       │   │   ├── [Check 6] Is human error risk high? (requires exact syntax, ordering matters)
│       │   │   │   ├── YES (error has caused incidents before, or runbook has > 10 steps)
│       │   │   │   │   └── → ✅ AUTOMATE — error reduction justifies cost even at weekly cadence
│       │   │   │   └── NO (straightforward, rarely mis-executed)
│       │   │   │       ├── [Check 7] Do you have time to build + maintain it?
│       │   │   │       │   ├── YES (< 1 day to build, < 1 hour/month to maintain)
│       │   │   │       │   │   └── → ✅ AUTOMATE (backlog it for next sprint)
│       │   │   │       │   └── NO → ✅ AUTOMATE LATER (log in backlog, revisit quarterly)
│       │   │
│       │   ├── Monthly
│       │   │   ├── [Check 8] What is the cost of the manual process vs automation build?
│       │   │   │   │         (time to automate vs time saved over 12 months)
│       │   │   │   ├── Time to automate < 6× monthly task time → worth it
│       │   │   │   └── [Check 5] → apply blast radius check before proceeding
│       │   │   │   └── Time to automate > 6× monthly task time → probably not worth it
│       │   │   │       └── → ✅ AUTOMATE LATER or script partially (reduce steps, not eliminate)
│       │   │
│       │   └── Quarterly or rarer
│       │       ├── [Check 9] Is error risk catastrophic if done wrong?
│       │       │   ├── YES (DR restore, data migration, certificate rotation on all prod certs)
│       │       │   │   └── → ✅ PARTIAL AUTOMATION + mandatory checklist + second approver
│       │       │   └── NO (low stakes, rare, reversible)
│       │       │       └── → ✅ DO NOT AUTOMATE — write a good checklist instead
│       │       │
│       │       └── [Check 10] Is the task an annual compliance requirement?
│       │           ├── YES (SOC 2 evidence collection, access review, etc.)
│       │           │   └── → ✅ AUTOMATE (compliance automation pays for itself in audit time)
│       │           └── NO → document well, do not automate
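The Monthly branch's 6× rule (Check 8) can be sanity-checked with a couple of lines of shell arithmetic; the task time below is an assumption to replace with your own:

```shell
# Back-of-envelope for the Monthly branch (Check 8); TASK_MIN is an assumption
TASK_MIN=30                         # minutes per monthly occurrence
build_budget=$(( 6 * TASK_MIN ))    # automate only if build time fits in 6 runs
yearly_saving=$(( 12 * TASK_MIN ))  # minutes saved over 12 months
echo "build budget: ${build_budget} min, yearly saving: ${yearly_saving} min"
# → build budget: 180 min, yearly saving: 360 min
```

A budget of six occurrences means the automation breaks even by month six; everything saved after that is profit.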

Node Details

Check 1: Research before building

Command/method:

# Search your internal tool registry / wiki
curl -s "https://wiki.internal/search?q=automate+task-name" | jq '.results[].title'

# Search GitHub org for existing scripts
gh search repos --owner your-org "task-name automation" --limit 10

# Check community tools
# - HashiCorp Terraform / Ansible for infra tasks
# - Rundeck / Temporal for workflow automation
# - GitHub Actions / ArgoCD workflows for deploy tasks
# - Datadog / Prometheus rules for alert-driven automation
# Note: "pip search" no longer works (PyPI disabled the search API);
# search https://pypi.org directly, or use brew for CLI tools
brew search task-keyword

# Check if your cloud provider has a native feature
aws ssm list-documents --filters "Key=DocumentType,Values=Automation"

What you're looking for: Any internal script, approved SaaS tool, or cloud-native feature that covers ≥ 80% of the use case. Perfect coverage is not required.

Common pitfall: Building a bespoke automation because the existing tool "doesn't do it exactly right" — when a 20-minute configuration change to the existing tool would have been sufficient.

Check 2: Task frequency measurement

Command/method:

# Search Slack history for how often you've done this
# Search ops log
grep "task-keyword" /var/log/ops/ops-log.txt | wc -l
grep "task-keyword" /var/log/ops/ops-log.txt | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -10

# Check ticket/issue history
gh issue list --label "manual-task" --state closed --limit 100 | \
  grep "task-name" | wc -l

# Calendar / PagerDuty incident history
pd incident list --statuses resolved --since "30 days ago" | grep "task-keyword"

What you're looking for: Accurate frequency over the last 3 months. Gut feelings about frequency are often wrong by 2–3x in either direction.

Common pitfall: Counting occurrences over a busy period (post-incident cleanup, migration sprint) and treating it as steady-state frequency. Use a representative 90-day window.

Check 4: Process stability assessment

Command/method:

# Check git history of the runbook for how often it changed
git log --oneline --follow runbooks/task-name.md | head -20

# Count runbook revision frequency
git log --since="3 months ago" --oneline runbooks/task-name.md | wc -l

What you're looking for: 0–2 changes to the runbook in the last 3 months = stable enough to automate. > 5 changes = the process is still evolving, and automation will be constantly broken and re-fixed.

Common pitfall: Automating a process that is mid-migration to a new system. You'll automate the old system's steps, finish the migration 2 weeks later, and throw the automation away.

Check 5: Blast radius assessment

Command/method:

# Map what the task touches
# - Which services / resources does it modify?
kubectl get all -n production -l "managed-by=task-name"

# - Is the output reversible?
# Example: adding a Kubernetes label is reversible; deleting a namespace is not

# - What is the scope? (one namespace, one cluster, all clusters?)
kubectl config get-contexts | wc -l  # How many clusters would be affected?

# - Does it touch PII or financial data?
grep -r "pii\|personal\|payment\|credit" runbooks/task-name.md -i

# Test automation in staging first
kubectl config use-context staging
./automate-task.sh --dry-run

What you're looking for: Single-scope, reversible, non-PII = low blast radius. Multi-region, irreversible writes, PII or financial data = high blast radius.

Common pitfall: Declaring blast radius "low" without checking downstream dependencies. A "simple" database cleanup script may truncate a table that 3 other services read.

Check 6: Error risk in manual process

Command/method:

# Count how many steps are in the runbook
grep -cE "^[0-9]+\." runbooks/task-name.md

# Check if there are copy-paste-sensitive commands (long IDs, exact ordering)
grep -E "(arn:|account-id|cluster-id|secret)" runbooks/task-name.md | wc -l

# Search incident post-mortems for this task as a contributing factor
grep -r "human error\|manual.*error\|mis-executed" postmortems/ | grep "task-name"

What you're looking for: > 10 steps with order-sensitive commands and IDs = high error risk. Prior incidents caused by manual execution error = a strong signal to automate.

Common pitfall: Underestimating error risk because "I've never made a mistake doing it." Absence of recorded errors doesn't mean the error risk is low — it may mean errors happened but weren't tracked.

Check 7: Maintenance cost estimation

Command/method: Think through the full lifecycle cost, not just build time.

What you're looking for:

  • Build time: estimate hours to write, test, review, and deploy the automation
  • Ongoing maintenance: does it have dependencies that will break (API changes, auth rotation, schema changes)?
  • On-call burden: what happens when the automation fails at 2am?
  • Documentation: time to write a runbook for the automation itself

Common pitfall: Estimating build time only. A 4-hour script that breaks every month and requires 2 hours to debug each time costs more than the manual task over a year.
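The full-lifecycle comparison can be sketched in the same back-of-envelope style; every number below is an assumption to be replaced with your own estimates:

```shell
# Lifecycle cost vs. manual cost over one year; all numbers are assumptions
BUILD_HOURS=4             # write, test, review, deploy
MAINT_HOURS_MONTH=2       # expected debugging and upkeep per month
MANUAL_MIN_PER_RUN=20     # duration of the manual task
RUNS_PER_MONTH=4          # weekly cadence

manual_hours=$(( MANUAL_MIN_PER_RUN * RUNS_PER_MONTH * 12 / 60 ))
automation_hours=$(( BUILD_HOURS + MAINT_HOURS_MONTH * 12 ))
echo "manual: ${manual_hours}h/year, automation: ${automation_hours}h/year"
# → manual: 16h/year, automation: 28h/year
```

With these assumed numbers the automation loses: maintenance hours, not build hours, dominate the lifecycle cost.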


Terminal Actions

✅ Action: Automate Now

Do:

# 1. Write the automation with a mandatory --dry-run mode
cat > automate-task.sh << 'EOF'
#!/bin/bash
set -euo pipefail
DRY_RUN=${DRY_RUN:-false}
# Fail fast with a clear message if COMMAND is not provided (with set -u,
# an unset COMMAND would otherwise abort with an opaque "unbound variable")
COMMAND=${COMMAND:?"set COMMAND to the action this automation should run"}

if [[ "$DRY_RUN" == "true" ]]; then
  echo "[DRY RUN] Would execute: $COMMAND"
else
  eval "$COMMAND"
fi
EOF

# 2. Test in non-prod environment first
DRY_RUN=true ./automate-task.sh
kubectl config use-context staging && ./automate-task.sh

# 3. Add to CI/CD or cron with alerting on failure
kubectl apply -f - <<CRONEOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: automate-task
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: task
            image: task-runner:latest
          restartPolicy: OnFailure
CRONEOF

# 4. Write a runbook for the automation itself (how to debug when it fails)
# 5. Add a monitoring alert for automation failure

Verify: Run in dry-run mode, then in staging, then with one production target before enabling full scope.

✅ Action: Automate with Dry-Run + Human Approval Gate

Do:

# 1. Automation generates a plan / diff, does not execute
./automate-task.sh --plan > /tmp/task-plan-$(date +%Y%m%d).txt

# 2. Human reviews the plan
cat /tmp/task-plan-$(date +%Y%m%d).txt

# 3. Human approves execution
read -p "Approve execution? (yes/no): " approval
if [[ "$approval" == "yes" ]]; then
  ./automate-task.sh --execute
fi

# 4. Or use a GitHub Actions manual approval gate
# (environment protection rules with required reviewers)
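A minimal sketch of the GitHub Actions variant (workflow name, environment name, and script path are assumptions; the `production` environment must have required reviewers configured under the repository's Settings → Environments):

```yaml
# Hypothetical workflow: the job pauses until a required reviewer approves
name: run-automation
on: workflow_dispatch
jobs:
  execute:
    runs-on: ubuntu-latest
    environment: production  # approval gate is enforced here
    steps:
      - uses: actions/checkout@v4
      - run: ./automate-task.sh --execute
```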

Verify: Approval gate is documented. Automation without approval is not possible (gate is enforced, not voluntary).

✅ Action: Automate Later — Add to Backlog

Do:

# Create a tracked issue with time/error cost documented
gh issue create --repo org/platform \
  --title "Automate: [task-name] (weekly, 20 min, error-prone)" \
  --label "toil-reduction,automation" \
  --body "Frequency: weekly. Time cost: 20 min/occurrence. Error incidents: 2 in last 6 months. Estimated build time: 4 hours. ROI positive after 3 months."

# Add to sprint backlog or quarterly OKR

Verify: Issue is created with enough context that someone who didn't write it can pick it up 3 months later.

✅ Action: Use Existing Tool

Do:

# 1. Install / configure the existing tool
# 2. Validate it covers your use case
# 3. Document the configuration in your team's runbook
# 4. Do NOT build a wrapper around it just to "make it fit" — adapt your process

# Example: using AWS SSM Run Command instead of SSH + manual script
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Name,Values=prod-workers" \
  --parameters 'commands=["systemctl restart myservice"]'

Verify: Tool handles the task end-to-end. Document the invocation in the team runbook so others know it exists.

✅ Action: Partial Automation (High Blast Radius)

Do:

# Automate the data gathering and verification steps (low risk)
./automate-task.sh --gather-info > /tmp/task-context.txt
./automate-task.sh --validate --dry-run >> /tmp/task-context.txt

# Keep the execution step manual with the context pre-populated
cat /tmp/task-context.txt
echo "Review the above. Execute manually: $FINAL_COMMAND"

Verify: The error-prone, tedious parts are automated. The high-blast-radius execution step remains a conscious human action.

⚠️ Warning: Do Not Automate — Unstable Process

When: The runbook changes every sprint, the tool/API it calls is changing, or the task itself is being re-evaluated.

Risk: Automation built on an unstable process becomes a maintenance burden that slows down process changes. You end up maintaining the automation instead of improving the process.

Mitigation: Write a clean, up-to-date runbook instead. Revisit automation after the process has been stable for 3+ months.


Edge Cases

  • The task is "automate the on-call workflow": On-call workflows often require contextual judgment that defeats simple automation. Partial automation (auto-gather diagnostics, auto-remediate known patterns) is appropriate. Full automation of incident response is high-risk.
  • Automation requires elevated permissions: If the automation needs prod credentials or elevated IAM roles, the security posture of the automation host becomes critical. Scope permissions narrowly and use short-lived credentials.
  • You're the only person who can maintain it: Automation that only one person understands is a single point of failure. Include a bus-factor check: could a new team member debug this automation at 2am?
  • The task is infrequent but has a hard deadline: Annual access reviews, quarterly DR tests, and compliance tasks have calendar-enforced deadlines. Even if the ROI math doesn't pencil out, automation is justified to ensure the deadline isn't missed.
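For the elevated-permissions case, narrow scoping might look like this hypothetical IAM policy (account ID, region, and document name are placeholders), granting the automation role only the one SSM document it needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyThisAutomation",
      "Effect": "Allow",
      "Action": "ssm:SendCommand",
      "Resource": [
        "arn:aws:ssm:us-east-1:123456789012:document/restart-myservice",
        "arn:aws:ec2:us-east-1:123456789012:instance/*"
      ]
    }
  ]
}
```

Pair a policy like this with short-lived credentials (for example, `aws sts assume-role` with a small `--duration-seconds`, or an instance profile) so a compromised automation host has a narrow, expiring blast radius.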

Cross-References