One Character from Disaster¶
Category: The Close Call Domains: ansible, infrastructure-as-code Read time: ~5 min
Setting the Scene¶
We had 1,400 servers across three data centers. Our Ansible inventory was organized into groups: staging, production-web, production-db, production-cache. Everything was managed through a single monorepo with strict CI gating. Or so we thought.
It was a Tuesday, and I was writing a playbook to reformat and re-partition disks on staging machines that were being repurposed from an old project. The playbook called wipefs --all and parted mklabel gpt on every disk except the boot volume.
What Happened¶
I wrote the playbook in about 30 minutes. The hosts line at the top was supposed to read hosts: staging. I typed hosts: all.
Two characters. The difference between "wipe 23 staging boxes that nobody cares about" and "wipe every disk on all 1,400 servers including the production databases holding 8TB of customer financial data."
I committed it, pushed to the feature branch, and opened a merge request. Our CI ran ansible-lint and yamllint — both passed. ansible-playbook --check passed too, because --check mode for command and shell modules just skips execution entirely. The syntax was perfect. The logic was catastrophic.
The playbook was scheduled to merge at 2 PM and run via our GitLab CI pipeline at 4 PM. At 11:47 AM, Priya from the platform team was doing her code review queue. She opened my MR, read the first line, and typed a single comment: "This says hosts: all. Did you mean staging?"
I stared at my screen for about ten seconds. Then I felt my stomach drop.
The Moment of Truth¶
Priya caught it 2 hours and 13 minutes before the pipeline would have executed. If she'd been in a meeting, if she'd skipped that review, if she'd only glanced at the task list without reading the header — we would have reformatted 1,400 servers. Recovery would have taken days. The business impact would have been in the millions. She caught it because she reads every line, even the boring ones.
The Aftermath¶
We added three controls. First, a custom ansible-lint rule that flags any playbook with hosts: all — it now requires an explicit override comment # ALL_HOSTS_INTENDED: <justification>. Second, every destructive playbook must use --limit in the CI job definition. Third, we added a pre-execution step that counts target hosts and pauses if the count exceeds a configurable threshold (default: 50). We also bought Priya lunch for a month.
The Lessons¶
- Code review for infrastructure is non-negotiable: Linters catch syntax. Humans catch intent.
hosts: allis valid YAML and valid Ansible — only a human reviewer would question whether it was the right choice. - Limit blast radius with host groups: Never use
hosts: allin production playbooks. Create explicit groups. Use--limitflags. Make the default scope narrow. - Dry-run mode must actually test destructive operations:
ansible-playbook --checkskipscommandandshellmodules entirely. It gave us false confidence. We now use--diffcombined with a custom callback plugin that logs what would execute.
What I'd Do Differently¶
I'd make the inventory group all an alias for nothing. Seriously. I'd restructure the inventory so that all is an empty group and every playbook must target a named group. It sounds extreme, but it removes the footgun entirely.
The Quote¶
"The scariest bugs aren't the ones that crash. They're the ones that work perfectly — on the wrong targets."
Cross-References¶
- Topic Packs: Ansible, Infrastructure as Code
- Case Studies: Blast Radius Control (if relevant)