Anti-Primer: Ansible

Everything that can go wrong, will — and in this story, it does.

The Setup

An ops engineer is configuring 200 servers for a new data pipeline deployment. The playbooks were written months ago and tested against 5 dev servers. The deadline is tomorrow morning, and the engineer decides to skip the dry-run to save time.

The Timeline

Hour 0: Running Against All Hosts

The engineer runs ansible-playbook site.yml without --limit, targeting all 200 servers, including the production databases. The deadline was looming, and this seemed like the fastest path forward. As a result, half-tested configuration changes hit every server in the fleet simultaneously.

Footgun #1 (Running Against All Hosts): running ansible-playbook site.yml without --limit targets all 200 servers, including the production databases, so half-tested configuration changes hit every server in the fleet simultaneously.
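
One way to make an accidental fleet-wide run harder is to have the play take its hosts from a variable, so the run fails when no target is given. The sketch below is an illustration, not the original site.yml: the variable name target and the group name dev are assumptions.

```yaml
# site.yml (sketch). `target` is a hypothetical variable name.
# If it is not supplied, templating `hosts` fails and nothing runs.
#
# Dry run against a small group first (the `dev` group is an assumption):
#   ansible-playbook site.yml -e target=dev --check --diff
# Then widen the run deliberately, still bounded by --limit:
#   ansible-playbook site.yml -e target=all --limit 'dev'
- name: Configure data pipeline hosts
  hosts: "{{ target }}"
  tasks:
    - name: Show which host is being configured
      ansible.builtin.debug:
        msg: "Configuring {{ inventory_hostname }}"
```

With this shape, a bare ansible-playbook site.yml stops with an undefined-variable error instead of touching every server.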

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Shell Module Everywhere

The playbook uses shell: apt-get install instead of the apt module for every package install. Under time pressure, the team chose speed over caution. As a result, the playbook is not idempotent, and re-runs take 40 minutes reinstalling everything.

Footgun #2 (Shell Module Everywhere): using shell: apt-get install instead of the apt module for every package install makes the playbook non-idempotent, so re-runs take 40 minutes reinstalling everything.
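
For comparison, a minimal sketch of the same install done with the native module. The package names are placeholders, since the story does not say which packages the pipeline needs.

```yaml
# Task sketch: package names below are hypothetical examples.
- name: Install pipeline packages
  ansible.builtin.apt:
    name:
      - rsync
      - curl
    state: present          # reports "ok" when the package is already installed
    update_cache: true
    cache_valid_time: 3600  # skip the cache refresh if it ran within the last hour
  become: true
```

Because the module checks the current package state before acting, a second run should report no changes for hosts that are already configured.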

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Global become: true

The engineer sets become: true at the playbook level because one task needs root. Nobody pushed back because the shortcut looked harmless in the moment. As a result, all files are created as root, and the application cannot read its own config.

Footgun #3 (Global become: true): setting become: true at the playbook level because one task needs root means all files are created as root, so the application cannot read its own config.
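
A sketch of per-task escalation, with explicit ownership on the file the application has to read. The host group, the paths, and the user name pipeline are all assumptions made for illustration.

```yaml
# Play sketch: `pipeline_workers`, the file paths, and the `pipeline`
# user/group are hypothetical names.
- name: Deploy the pipeline application
  hosts: pipeline_workers
  become: false              # no blanket root; escalate only where needed
  tasks:
    - name: Install a system package (needs root)
      ansible.builtin.apt:
        name: rsync
        state: present
      become: true

    - name: Write the application config, readable by the app user
      ansible.builtin.template:
        src: app.conf.j2
        dest: /etc/pipeline/app.conf
        owner: pipeline
        group: pipeline
        mode: "0640"
      become: true           # root writes under /etc, but ownership is set explicitly
```

Setting owner, group, and mode on the task keeps the result independent of which user happened to create the file.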

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Variable Precedence Trap

The engineer sets app_port in defaults, group_vars, and extra vars without understanding precedence. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, the engineer spends 3 hours debugging why the port change in defaults has no effect.

Footgun #4 (Variable Precedence Trap): setting app_port in defaults, group_vars, and extra vars without understanding precedence leads to 3 hours spent debugging why the port change in defaults has no effect.
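
The precedence at play can be checked directly with a debug task. In the sketch below, the file locations, the host group, and the port numbers are illustrative assumptions; only the variable name app_port comes from the story.

```yaml
# Where app_port might be defined, lowest precedence first
# (paths and values are hypothetical):
#   roles/app/defaults/main.yml   app_port: 8080   <- role defaults, lowest
#   group_vars/all.yml            app_port: 9090   <- overrides role defaults
#   -e app_port=7070                               <- extra vars, always win
#
# Editing the value in defaults changes nothing while either of the
# other two definitions exists.
- name: Show the effective value of app_port
  hosts: pipeline_workers
  gather_facts: false
  tasks:
    - name: Print the value Ansible actually resolved for this host
      ansible.builtin.debug:
        var: app_port
```

Running this play before changing anything shows which definition is winning, which is usually quicker than reading the variable files by hand.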

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

# | Mistake | Consequence | Could Have Been Prevented By
--|---------|-------------|-----------------------------
1 | Running Against All Hosts | Half-tested configuration changes hit every server in the fleet simultaneously | Primer: Always use --limit or require a target variable
2 | Shell Module Everywhere | Playbook is not idempotent; re-runs take 40 minutes reinstalling everything | Primer: Native modules for idempotency
3 | Global become: true | All files are created as root; the application cannot read its own config | Primer: Per-task privilege escalation
4 | Variable Precedence Trap | Engineer spends 3 hours debugging why the port change in defaults has no effect | Primer: Single source per scope and Ansible debug checks

Damage Report

  • Downtime: 2-6 hours of infrastructure instability or drift
  • Data loss: Risk of data loss if stateful resources are replaced
  • Customer impact: Dependent services may experience outages or degraded performance
  • Engineering time to remediate: 12-24 engineer-hours for state recovery and drift remediation
  • Reputation cost: Infrastructure team credibility damaged; manual intervention required for future changes

What the Primer Teaches

  • Footgun #1: The primer's section on running against all hosts teaches: always use --limit or require a target variable.
  • Footgun #2: The primer's section on using the shell module everywhere teaches: use native modules for idempotency.
  • Footgun #3: The primer's section on global become: true teaches: escalate privileges per task.
  • Footgun #4: The primer's section on the variable precedence trap teaches: keep a single source per scope and verify values with Ansible debug checks.

Cross-References