Ansible for Infrastructure Automation - Street Ops
What experienced Ansible operators know that the documentation doesn't emphasize enough.
One-liner: Ansible is SSH in a trench coat. If you can SSH to a host and run commands, Ansible can manage it. If SSH is broken, Ansible is broken.
Incident Runbooks
Playbook Fails Mid-Run
1. Read the error output carefully:
- TASK name tells you exactly which task failed
- "msg" field gives the specific error
- "stdout" / "stderr" fields show command output
2. Common failure patterns:
"Unreachable" - connection failed:
- SSH connectivity: can you ssh to the host manually?
- Wrong user: check ansible_user in inventory
- Wrong key: check ansible_ssh_private_key_file
- SSH host key changed: ssh-keygen -R <host>
- Timeout: check ansible_ssh_timeout, increase for slow networks
"Permission denied" - sudo/privilege issue:
- become: yes missing from the play or task
- sudo password required: use --ask-become-pass or ansible_become_password
- User not in sudoers on the target host
"Module failure" - task-specific error:
- Package not found: wrong package name for the OS
- Service not found: service name differs between distros
- File not found: template or source file path is wrong
- Python dependency missing on target
3. Retry from the failed task:
ansible-playbook site.yml --start-at-task="Install nginx"
# Or use the retry file (only written if retry_files_enabled = True in ansible.cfg; disabled by default since Ansible 2.8):
ansible-playbook site.yml --limit @site.retry
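Before retrying, it can help to confirm the basics from the failure patterns above in one pass. The play below is a minimal sketch rather than a standard recipe: the file name preflight.yml and the task names are hypothetical, and it assumes privilege escalation on the targets is expected to reach root.

```yaml
# preflight.yml (hypothetical file name)
- name: Pre-flight connectivity and privilege check
  hosts: all
  gather_facts: false
  tasks:
    - name: Confirm SSH access and a usable Python interpreter
      ansible.builtin.ping:

    - name: Confirm privilege escalation works
      ansible.builtin.command: id -u
      become: true
      register: become_check
      changed_when: false

    - name: Fail early if escalation did not reach root
      ansible.builtin.assert:
        that:
          - become_check.stdout == "0"
        fail_msg: "become did not escalate to root on {{ inventory_hostname }}"
```

Hosts that fail the first task point at the "Unreachable" causes; hosts that fail the second or third point at the "Permission denied" causes.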
Connection Issues at Scale
1. SSH connection limits:
- Running against 200 hosts, connections start failing
- Default forks = 5 (only 5 hosts in parallel)
- Increase: ansible-playbook site.yml -f 50
- But don't go too high: SSH connections have overhead
2. SSH multiplexing (ControlPersist):
# ansible.cfg
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
# Pipelining reduces SSH operations per task
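3. Mitogen (performance plugin):
- Still connects over SSH, but replaces Ansible's per-task module transfer with persistent Python interpreters on the target
- Often cited as 2-7x faster for many workloads; measure on your own playbooks
- Close to a drop-in replacement: point strategy_plugins at the Mitogen install and set strategy = mitogen_linear in ansible.cfg, after checking that your Ansible version is supported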
> **Scale note:** ControlPersist + pipelining is the single biggest performance win for Ansible over SSH. Without pipelining, each task requires multiple SSH round trips (copy module, execute, fetch result). With pipelining, everything happens in one SSH session. Enable both in ansible.cfg for any fleet over 20 hosts.
4. SSH timeout cascade:
- One host is unreachable, holding up the whole play
- Set timeout: ansible_ssh_timeout=10 in inventory
- Use async for long-running tasks:
- name: Run migration
  command: /opt/migrate.sh
  async: 3600 # Max runtime 1 hour
  poll: 30 # Check every 30 seconds
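The connection settings from this section can also be applied per group through inventory variables instead of ansible.cfg. The following is a sketch under stated assumptions: the path group_vars/all.yml is illustrative, the hosts use the default ssh connection plugin, and the values are starting points rather than recommendations.

```yaml
# group_vars/all.yml (illustrative path)
# Variables read by the built-in ssh connection plugin.

# Reuse one SSH connection per host for 60 seconds.
ansible_ssh_args: "-o ControlMaster=auto -o ControlPersist=60s"

# Pipe modules to the remote interpreter instead of copying files first.
ansible_ssh_pipelining: true

# Give up on an unresponsive host after 10 seconds.
ansible_ssh_timeout: 10
```

Pipelining combined with become requires that requiretty is not enforced in sudoers on the targets, so test on a small group first.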
Fact Caching
1. Problem: gathering facts on 500 hosts takes 5+ minutes every run
2. Enable fact caching:
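# ansible.cfg
# Note: ansible.cfg does not support inline "#" comments after a value,
# so each comment sits on its own line.
[defaults]
# Only gather facts when they are not already cached
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# 24 hours
fact_caching_timeout = 86400
# Or use Redis for shared caching across CI runners. Replace the two
# jsonfile lines above with these (connection format is host:port:db):
# fact_caching = redis
# fact_caching_connection = localhost:6379:0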
3. Disable fact gathering for tasks that don't need it:
- hosts: webservers
  gather_facts: no
  tasks:
    - name: Restart nginx
      service:
        name: nginx
        state: restarted
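When a play needs a few facts but not the full set, a middle ground is to skip automatic gathering and call the setup module with a restricted subset. This is a sketch, not a tuned configuration: the choice of the network subset and the 10-second timeout are assumptions for illustration.

```yaml
- hosts: webservers
  gather_facts: false
  tasks:
    - name: Gather only the minimal facts plus network facts
      ansible.builtin.setup:
        gather_subset:
          - "!all"
          - network
        gather_timeout: 10

    - name: Show the default IPv4 details gathered above
      ansible.builtin.debug:
        var: ansible_facts['default_ipv4']
```

Excluding all subsets with "!all" still leaves the minimal set, so basics such as the distribution name remain available.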
Role Dependency Hell
1. Problem: roles have conflicting dependencies or circular imports
2. Check dependencies:
cat roles/webserver/meta/main.yml
# dependencies:
#   - common
#   - { role: firewall, firewall_ports: [80, 443] }
3. Dependency runs multiple times:
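- By default, a role runs only once per play even if several roles list it as a dependency, as long as it is called with the same parameters each time
- Override: allow_duplicates: true in the meta/main.yml of the role that should be allowed to repeat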
- Usually you want the default behavior
4. Dependency version conflicts:
- Role A needs package-x >= 2.0
- Role B needs package-x < 2.0
- No built-in resolution: you must align versions manually
- Use requirements.yml with pinned versions:
roles:
  - name: geerlingguy.nginx
    version: "3.1.0"
5. Install roles:
ansible-galaxy install -r requirements.yml
ansible-galaxy install -r requirements.yml --force # Update
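The dependency and allow_duplicates settings described above live in two different meta files. A sketch using the role names from this section (webserver, common, firewall); the layout assumes roles are kept under roles/ in the project:

```yaml
---
# roles/webserver/meta/main.yml
dependencies:
  - role: common
  - role: firewall
    vars:
      firewall_ports: [80, 443]
---
# roles/firewall/meta/main.yml (a separate file)
# Opt in to repeated runs of this role within one play.
allow_duplicates: true
dependencies: []
```

Leave allow_duplicates unset unless the role is written to be applied several times, for example once per set of ports.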
Idempotency Gotchas
1. Shell/command modules are NOT idempotent:
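# BAD: runs every time (and fails once the database exists)
- name: Create database
  command: createdb myapp
# GOOD: only runs if the marker file doesn't exist.
# creates checks a file path; PostgreSQL does not create a file named after
# the database, so the task writes its own marker (path is illustrative).
- name: Create database
  shell: createdb myapp && touch /var/lib/postgresql/.myapp_created
  args:
    creates: /var/lib/postgresql/.myapp_created
# BETTER: use the proper module (from the community.postgresql collection)
- name: Create database
  community.postgresql.postgresql_db:
    name: myapp
    state: present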
2. lineinfile with changing values:
# BAD: adds a new line every run if timestamp changes
- lineinfile:
    path: /etc/config
    line: "updated={{ ansible_date_time.iso8601 }}"
# GOOD: use regexp to find and replace
- lineinfile:
    path: /etc/config
    regexp: "^updated="
    line: "updated={{ ansible_date_time.iso8601 }}"
3. Testing idempotency:
# Run twice, second run should show 0 changed:
ansible-playbook site.yml
ansible-playbook site.yml
# If anything shows "changed" on the second run, you have an idempotency bug
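When no dedicated module is available, a common workaround is to check state with a read-only command and run the changing command only when needed. The sketch below applies that pattern to the database example from this section. It assumes psql and createdb are on the target's PATH and that a postgres OS user with database access exists; the variable name db_check is hypothetical.

```yaml
- name: Check whether the database already exists
  ansible.builtin.command: psql -tAc "SELECT 1 FROM pg_database WHERE datname='myapp'"
  register: db_check
  changed_when: false   # a read-only query never changes the host
  become: true
  become_user: postgres

- name: Create the database only when the check found nothing
  ansible.builtin.command: createdb myapp
  when: db_check.stdout | trim != "1"
  become: true
  become_user: postgres
```

On a second run the first task reports ok and the second is skipped, which is what the two-run test above looks for.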
Gotchas & War Stories
The template that destroyed prod configs
Gotcha: The validate parameter on the template and copy modules runs validation against a temporary copy before the file is written to the final destination. If validation fails, the original file is untouched. Always use validate for any config file whose syntax checker accepts a file path (the command must contain %s): nginx (nginx -t -c %s), Apache (httpd -t -f %s; the binary name varies by distro), sshd (sshd -t -f %s), sudoers (visudo -cf %s).
Someone used template with the wrong dest path and overwrote /etc/nginx/nginx.conf across all servers with an incomplete template. The notify: restart nginx handler then restarted nginx with a broken config, taking down the entire web tier.
Prevention:
# Always validate before restarting
- name: Copy nginx config
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s # Validates BEFORE writing
  notify: Restart nginx
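For context, here is how the validated template task might sit in a complete play together with its handler. This is a sketch under assumptions: the handler name, the choice of reload over restart, and the use of backup are illustrative and not taken from the incident described above.

```yaml
- hosts: webservers
  become: true
  tasks:
    - name: Render nginx config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        validate: nginx -t -c %s   # checked against a temp file first
        backup: true               # keep a timestamped copy of the old file
      notify: Reload nginx

  handlers:
    - name: Reload nginx           # must match the notify string exactly
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

If validation fails, the task fails before the destination is touched, so the handler is never notified.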
Variable scope confusion
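Variables defined in a play's vars: section are scoped to that play and do not carry over to the next play in the same playbook. Values set with set_fact or register are scoped to the host, so they persist across tasks and into later plays for that host within the same run. Use host_vars/group_vars for values every play should see. Variables passed with -e (extra vars) have the highest precedence and override everything else.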
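A two-play playbook makes the scoping visible. This is an illustrative sketch: the variable names play_scoped and host_scoped are hypothetical, and the webservers group is borrowed from the other examples in this document.

```yaml
- name: First play
  hosts: webservers
  gather_facts: false
  vars:
    play_scoped: "defined in the first play's vars"
  tasks:
    - name: Store a host-scoped value
      ansible.builtin.set_fact:
        host_scoped: "set with set_fact in the first play"

- name: Second play
  hosts: webservers
  gather_facts: false
  tasks:
    - name: Check whether the play-level variable is still defined
      ansible.builtin.debug:
        msg: "play_scoped defined: {{ play_scoped is defined }}"

    - name: Check whether the set_fact value is still defined
      ansible.builtin.debug:
        msg: "host_scoped defined: {{ host_scoped is defined }}"
```

Running it shows which of the two values survives into the second play on each host.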
The gather_facts timeout
On a large fleet, one slow host causes the entire play to wait during fact gathering. Use gather_timeout in ansible.cfg and consider gather_facts: no with explicit fact gathering for specific hosts.
become vs become_user
Remember: The privilege escalation chain is: SSH user -> become -> become_user. Mnemonic: you connect as one user, become another.
become: yes without become_user defaults to root. become_user: deploy switches to the deploy user instead, but it only takes effect when become: yes is also set, and it requires the SSH user to have sudo access to that specific user.
The initial SSH connection still uses your SSH user (ansible_user); escalation happens afterwards on the target. If the target user doesn't exist or sudo isn't configured for it, you get cryptic errors.
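The chain can be seen in a short play that mixes root and non-root escalation. A sketch with assumed names: the appservers group, the ansible connection user, the deploy user, and the git package are placeholders for illustration.

```yaml
- hosts: appservers
  remote_user: ansible          # the SSH user (hypothetical)
  tasks:
    - name: Install a package as root
      ansible.builtin.package:
        name: git
        state: present
      become: true              # no become_user, so this escalates to root

    - name: Run a command as the deploy user
      ansible.builtin.command: whoami
      become: true              # still required; become_user alone does nothing
      become_user: deploy
      register: who
      changed_when: false

    - name: Show which user the command ran as
      ansible.builtin.debug:
        var: who.stdout
```

Setting become at task level, as here, keeps escalation limited to the tasks that need it.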
Serial deployment for zero downtime
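By default, Ansible runs each task on all targeted hosts in parallel (up to the forks limit) before moving on to the next task, so a bad deploy reaches every host at once. For a deployment behind a load balancer, use serial to work through the hosts in batches: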
- hosts: webservers
  serial: "25%" # Deploy to 25% of hosts at a time
  # or serial: 1 # One at a time
  tasks:
    - name: Deploy application
      # ...
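A fuller rolling-deploy skeleton might look like the sketch below. The drain, deploy, and re-enable steps are placeholders (debug tasks) because the real commands depend on your load balancer and application; max_fail_percentage: 0 is an assumed policy that stops the rollout as soon as any host in a batch fails.

```yaml
- hosts: webservers
  serial: "25%"
  max_fail_percentage: 0        # abort the rollout on the first failed host
  pre_tasks:
    - name: Take host out of the load balancer (placeholder)
      ansible.builtin.debug:
        msg: "drain {{ inventory_hostname }} here"
  tasks:
    - name: Deploy application (placeholder)
      ansible.builtin.debug:
        msg: "deploy to {{ inventory_hostname }} here"
  post_tasks:
    - name: Put host back into the load balancer (placeholder)
      ansible.builtin.debug:
        msg: "re-enable {{ inventory_hostname }} here"
```

Because serial runs the whole play per batch, each batch is drained, deployed, and restored before the next one starts.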
Essential Debugging Commands
# Verbose output (add more v's for more detail)
ansible-playbook site.yml -v # Task results
ansible-playbook site.yml -vv # Task input parameters
ansible-playbook site.yml -vvv # SSH connection details
ansible-playbook site.yml -vvvv # Everything including SSH protocol
# Check mode (dry run)
ansible-playbook site.yml --check --diff
# List what would be affected
ansible-playbook site.yml --list-hosts
ansible-playbook site.yml --list-tasks
ansible-playbook site.yml --list-tags
# Syntax check
ansible-playbook site.yml --syntax-check
# Step through task by task
ansible-playbook site.yml --step
# Debug a variable
ansible web1.example.com -m debug -a "var=hostvars[inventory_hostname]"
# Test connectivity
ansible all -m ping
# Run ad-hoc commands
ansible webservers -m shell -a "df -h" -b
ansible webservers -m service -a "name=nginx state=restarted" -b
Quick Reference
- Cheatsheet: Ansible