The Config Management Lie
Category: The Hard Lesson
Domains: ansible, configuration-management
Read time: ~5 min
Setting the Scene
We had 140 servers managed by Ansible. Our playbooks were clean, well-structured, roles separated by concern. The ansible-playbook site.yml run completed in 22 minutes and reported changed=0 on every host. Green across the board. Configuration as code, fully converged. At least, that's what we told ourselves.
What we didn't know was that over the previous 6 months, engineers had SSH'd into production servers 312 times to make "quick fixes" that never made it back into the playbooks.
What Happened
It started with a kernel upgrade. Security had flagged CVE-2023-xxxxx, and we needed to patch 140 servers. Standard procedure: update the kernel_version variable in group_vars/all.yml, run the playbook, reboot in rolling fashion. We'd done it a dozen times.
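The procedure itself was the boring part. A minimal sketch of the pattern, with illustrative version strings and role paths (not our actual playbook):

```yaml
# group_vars/all.yml
kernel_version: "5.14.0-362.el9"   # illustrative version string

# roles/kernel/tasks/main.yml
- name: Pin the kernel to the approved version
  ansible.builtin.dnf:
    name: "kernel-{{ kernel_version }}"
    state: present

- name: Reboot into the new kernel
  # serial: 1 at the play level is what makes this a rolling reboot
  ansible.builtin.reboot:
    reboot_timeout: 600
```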
The playbook ran. On 118 servers, it worked perfectly. On 22 servers, it broke things in 22 different ways.
Server web-17 had a custom sysctl.conf that an engineer had tweaked for a traffic spike in March. Ansible overwrote it with the default. Connection tracking table filled up in minutes. Server started dropping packets.
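The fix we eventually settled on for cases like web-17 was to make the exception declarative: the tuned value moves into host_vars, so the playbook preserves it instead of clobbering it. A sketch, with an illustrative parameter and value:

```yaml
# host_vars/web-17.yml
sysctl_overrides:
  net.netfilter.nf_conntrack_max: 1048576   # illustrative traffic-spike tuning

# roles/sysctl/tasks/main.yml
- name: Apply host-specific sysctl overrides
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_set: true
    state: present
  loop: "{{ sysctl_overrides | default({}) | dict2items }}"
```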
Servers app-04 through app-08 had a manually installed version of libssl that was newer than what the playbook specified. Ansible downgraded it. The application's TLS handshake broke, and five app servers started refusing HTTPS connections.
Server db-replica-03 had a custom pg_hba.conf entry that allowed a specific reporting tool to connect. Ansible replaced it with the template. The reporting team's nightly ETL failed, and they didn't notice until the CEO asked why the Monday dashboard was empty.
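That pg_hba.conf entry is exactly the kind of exception that should have lived in host_vars. A sketch of how it could be expressed so the playbook carries it instead of erasing it — the database, user, and network here are hypothetical:

```yaml
# host_vars/db-replica-03.yml — the reporting tool's access becomes data, not a hand edit
pg_hba_extra_rules:
  - contype: host
    databases: reporting     # hypothetical database name
    users: etl_user          # hypothetical reporting-tool account
    source: "10.0.42.0/24"   # hypothetical reporting subnet
    method: scram-sha-256

# roles/postgres/tasks/main.yml
- name: Ensure host-specific pg_hba rules survive every run
  community.postgresql.postgresql_pg_hba:
    dest: /var/lib/pgsql/data/pg_hba.conf
    contype: "{{ item.contype }}"
    databases: "{{ item.databases }}"
    users: "{{ item.users }}"
    source: "{{ item.source }}"
    method: "{{ item.method }}"
  loop: "{{ pg_hba_extra_rules | default([]) }}"
```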
The worst was queue-01, our RabbitMQ server. Someone had manually set vm_memory_high_watermark to 0.8 to handle a message backlog. The playbook reset it to 0.4. RabbitMQ hit the watermark, blocked all publishers, and 3 million messages backed up in the producer buffers.
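The queue-01 incident is the clearest case of "the override was right, the playbook was wrong." The variable-precedence pattern that would have captured it, sketched with our values (the template path is illustrative):

```yaml
# group_vars/all.yml
rabbitmq_vm_memory_high_watermark: 0.4   # fleet default

# host_vars/queue-01.yml
rabbitmq_vm_memory_high_watermark: 0.8   # documented backlog exception

# templates/rabbitmq.conf.j2 would then render the line:
#   vm_memory_high_watermark.relative = {{ rabbitmq_vm_memory_high_watermark }}
```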
We spent 14 hours putting out fires. Each server had its own unique snowflake configuration that some engineer had applied months ago for a good reason, never documented, and never committed to the playbooks.
The Moment of Truth
I ran ansible-playbook site.yml --check --diff after the dust settled and stared at the output. Hundreds of lines of diff showing things Ansible "wanted" to change. All the manual changes, laid bare. The changed=0 we'd been seeing in regular runs was a lie — we'd stopped running the full playbook months ago because "nothing changes." We'd only been running targeted plays for specific deployments.
The Aftermath
We introduced three changes. First, a weekly full-playbook run in --check --diff mode, with the output posted to Slack; any drift gets a ticket. Second, we removed engineer SSH access and required that all changes go through Ansible, enforced by SELinux policies preventing modification of managed files. Third, we added a custom ansible-lint rule that flags any template task whose template lacks a # Managed by Ansible header comment.
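The weekly drift check is just a cron entry on the control node. A sketch, assuming the playbook lives in /opt/ansible and a hypothetical post-to-slack.sh wrapper that pipes stdin to a Slack webhook:

```yaml
# post-to-slack.sh is a hypothetical helper, not a real tool
- name: Schedule the weekly full-playbook drift check (control node)
  ansible.builtin.cron:
    name: ansible-drift-check
    weekday: "1"
    hour: "6"
    minute: "0"
    job: "cd /opt/ansible && ansible-playbook site.yml --check --diff 2>&1 | ./post-to-slack.sh"
```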
The cultural shift was harder. Engineers pushed back hard on losing SSH access. It took two months before people stopped trying to work around it.
The Lessons
- Config drift is a silent killer: It doesn't announce itself. It accumulates one "quick fix" at a time until the gap between your declared state and actual state is a chasm.
- Enforce immutability: If humans can SSH in and change things, they will. Remove the ability, not just the permission. Make the managed path easier than the manual path.
- Regular drift detection: Run your configuration management in check mode on a schedule. Treat any drift as a bug, not a curiosity.
What I'd Do Differently
I'd set up a daily --check --diff cron job from week one, with the output feeding a dashboard that shows drift count per host. I'd implement file integrity monitoring (AIDE or Tripwire) on all Ansible-managed files, alerting on any out-of-band modification. And I'd make "update the playbook" part of the incident resolution checklist — if you changed it manually during an incident, the ticket isn't closed until the playbook matches.
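The file-integrity piece can itself be driven by Ansible. A sketch using AIDE on a RHEL-family host — the schedule and paths are illustrative, and a real setup would also ship the alerting hook:

```yaml
- name: Install AIDE
  ansible.builtin.dnf:
    name: aide
    state: present

- name: Initialize the AIDE database on first run
  ansible.builtin.command: aide --init
  args:
    creates: /var/lib/aide/aide.db.new.gz

- name: Activate the database AIDE checks against
  ansible.builtin.copy:
    src: /var/lib/aide/aide.db.new.gz
    dest: /var/lib/aide/aide.db.gz
    remote_src: true
    force: false

- name: Run the integrity check daily
  # non-empty output means an out-of-band modification
  ansible.builtin.cron:
    name: aide-check
    hour: "4"
    minute: "30"
    job: "/usr/sbin/aide --check"
```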
The Quote
"Ansible told us the world was fine. Twenty-two servers disagreed."
Cross-References
- Topic Packs: Ansible, YAML/JSON Config
- Case Studies: Ansible SSH Agent Firewall