Ansible Deep Dive - Footguns¶
Advanced mistakes that experienced operators still make. These go beyond the basics into the territory where real infrastructure gets misconfigured.
1. Variable Precedence Confusion (extra-vars Wins All)¶
You carefully set app_port: 8080 in group_vars/production.yml. A colleague's CI pipeline passes -e app_port=9090 and doesn't tell you. Every deploy now uses port 9090. You debug for hours looking at group_vars, role defaults, and host_vars before discovering the extra var.
Extra vars (-e) have the highest precedence in Ansible. They override everything, silently.
Fix: Avoid using -e for values that should be persistent. Use it only for one-time overrides. Document which variables are expected to be set via extra vars. In CI, log the extra vars being passed.
# Debug: see what value a host actually gets
ansible -i inventory/ web1.example.com -m debug -a "var=app_port"
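One defensive pattern (a sketch reusing the app_port example above; the task name is illustrative) is to assert critical values at the top of the play, so an unexpected -e override fails fast instead of deploying:

```yaml
- name: Guard against silent overrides
  ansible.builtin.assert:
    that:
      - app_port | int == 8080
    fail_msg: "app_port is {{ app_port }} -- was it overridden with -e?"
```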
2. Not Using become When Needed (Silent Permission Denied)¶
Your task installs a package but you forgot become: true. With apt the failure is at least loud -- a permission error on the package lock. Other modules are sneakier: copy, file, or template can succeed while writing to a path the unprivileged SSH user happens to own, leaving the host misconfigured under a green run.
# WRONG: runs as the SSH user, not root
- name: Install nginx
ansible.builtin.apt:
name: nginx
state: present
# RIGHT
- name: Install nginx
ansible.builtin.apt:
name: nginx
state: present
become: true
Fix: Set become: true on individual tasks that need it. Don't set it at the play level (that causes a different footgun -- everything runs as root when it shouldn't).
3. Handler Not Notified (Task Didn't Change)¶
You update a config template but the rendered output is identical to what's already on disk. Ansible reports "ok" (not "changed"). The handler never fires. You expected a service restart but it didn't happen.
- name: Deploy config
ansible.builtin.template:
src: app.conf.j2
dest: /etc/app/config.yml
notify: Restart application # Only fires if the task status is "changed"
Fix: meta: flush_handlers runs handlers that have already been notified -- earlier than end-of-play, but it does not force an un-notified handler to fire. If you need a restart regardless of whether the config changed, run the restart as an ordinary task. And if you changed the template but it renders identical content, the handler correctly stays quiet -- the config didn't actually change.
# Run any already-notified handlers now, instead of at the end of the play
- name: Flush pending handlers
  ansible.builtin.meta: flush_handlers
# Or trigger it manually
- name: Restart application unconditionally
ansible.builtin.systemd:
name: myapp
state: restarted
changed_when: true
4. include vs import (Conditional Behavior Differs)¶
# import_tasks: condition applies to EACH task inside the file
- ansible.builtin.import_tasks: setup.yml
when: needs_setup
# If setup.yml has 5 tasks, each one gets "when: needs_setup",
# re-evaluated for every task at runtime -- so a task inside that
# sets needs_setup=false makes the remaining tasks skip
# include_tasks: condition applies to the INCLUDE decision only
- ansible.builtin.include_tasks: setup.yml
when: needs_setup
# If needs_setup is true, ALL 5 tasks run regardless
# If needs_setup is false, NO tasks run
Fix: Use import_tasks when the file always exists and you want the condition applied per task. Use include_tasks when the filename is dynamic or you want all-or-nothing inclusion. When in doubt, ansible-playbook --list-tasks shows imported tasks (resolved at parse time) but not included ones, and running with -vvv shows what gets evaluated.
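As a concrete illustration of the dynamic case (the per-OS file names here are hypothetical):

```yaml
# Works only with include_tasks: the filename is resolved at runtime
- ansible.builtin.include_tasks: "setup-{{ ansible_os_family | lower }}.yml"
```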
5. Jinja2 Whitespace in Templates¶
A {% for %} loop written with each tag on its own line can leave stray blank lines or indentation in the rendered output. NGINX might not care, but YAML or Python configs will break. Note that ansible.builtin.template enables trim_blocks by default (the newline after a block tag is dropped), while indentation before a tag is kept unless you set lstrip_blocks: true on the task.
Fix: Use whitespace control. The - strips whitespace before ({%-) or after (-%}) the tag; alternatively set lstrip_blocks: true. Check the rendered result with ansible-playbook --check --diff before deploying.
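For illustration, a sketch of the indentation gotcha (the upstream block and servers list are made up for this example):

```jinja
{# ansible.builtin.template trims the newline after a block tag
   (trim_blocks defaults to yes), but the indentation BEFORE an
   indented tag leaks into the output unless lstrip_blocks: true
   is set on the task, or you strip it yourself with {%- #}
upstream backend {
    {% for server in servers %}
    server {{ server }};
    {% endfor %}
}
```

Rendering this with lstrip_blocks: true drops the stray spaces before the tag lines; without it, they end up in the output.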
6. Vault Password in Shell History¶
# This is in your bash history forever
ansible-vault encrypt_string 'actual_database_password' --name 'db_password'
Anyone with access to ~/.bash_history can read the password.
Fix: Keep the secret itself off the command line entirely -- read it from stdin or a file (note that herestrings and echo pipes still land in history):
# Interactive: type the secret, then Ctrl-D; nothing lands in history
ansible-vault encrypt_string --vault-password-file ~/.vault_pass --stdin-name 'db_password'
# Or read the secret from a file (mind the file's permissions)
ansible-vault encrypt_string --vault-password-file ~/.vault_pass --stdin-name 'db_password' < db_password.txt
# Or prompt for the vault password as well
ansible-vault encrypt_string --vault-id prod@prompt --stdin-name 'db_password'
7. Not Quoting Jinja2 in YAML ({{ Must Be Quoted)¶
# BROKEN: YAML parser interprets {{ as a mapping
vars:
message: {{ greeting }} world
# CORRECT
vars:
message: "{{ greeting }} world"
YAML treats { as the start of a flow mapping, so an unquoted Jinja2 expression triggers a cryptic parse error long before Ansible ever sees the template.
Fix: Always quote values that start with {{. This is the single most common Ansible YAML error.
# All of these are correct
simple: "{{ my_var }}"
combined: "prefix-{{ my_var }}-suffix"
bool_cast: "{{ my_var | bool }}"
8. serial: 1 Performance¶
Each host goes through the full play sequentially. With 100 hosts and a 3-minute play, that's 5 hours. If you're doing rolling updates, serial: 1 is unnecessarily cautious.
Fix: Use graduated serial:
serial:
  - 1       # First batch: a single canary host (catch obvious failures)
  - "10%"   # Then 10% of the play's total host count
  - "50%"   # Then 50% of the total
  - "100%"  # The last entry repeats until every host is done
max_fail_percentage: 10  # Abort the play if more than 10% of a batch fails
9. No Retry on Transient SSH Failures¶
A network blip during a large-scale playbook run fails one task on one host. The entire play fails for that host. With serial, this means the rolling update stops.
Fix: Configure connection retries at the SSH level -- the retries option in ansible.cfg's [ssh_connection] section (or the ANSIBLE_SSH_RETRIES environment variable).
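A minimal ansible.cfg sketch (the retry count is illustrative):

```ini
[ssh_connection]
# Retry the SSH connection attempt before marking the host unreachable
retries = 3
```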
And at the task level for flaky operations:
- name: Download artifact
ansible.builtin.get_url:
url: "https://artifacts.example.com/app-{{ version }}.tar.gz"
dest: /tmp/app.tar.gz
retries: 3
delay: 5
register: download
until: download is succeeded
10. ansible_facts Caching Stale¶
You enabled fact caching for performance:
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
Someone added a disk to a server yesterday. Your cached facts still show the old disk layout. Your playbook that partitions disks uses stale data and fails or does the wrong thing.
Fix: Set an appropriate cache timeout. Use gather_facts: true (or ansible.builtin.setup task) explicitly when you need fresh facts for critical decisions. Clear the cache before important runs:
# Clear fact cache
rm -rf /tmp/ansible_facts/*
# Or force fresh facts in playbook
- hosts: all
gather_facts: false
tasks:
- name: Gather fresh facts
ansible.builtin.setup:
11. delegate_to Not Setting Host Context¶
- hosts: webservers
tasks:
- name: Add DNS record
amazon.aws.route53:
zone: example.com
record: "{{ inventory_hostname }}"
type: A
value: "{{ ansible_default_ipv4.address }}"
delegate_to: localhost
This runs on localhost but keeps the target host's variable context: hostvars[inventory_hostname] and facts like ansible_hostname still refer to the web server, not localhost. That is usually what you want for per-host DNS records, but it surprises people who expect the delegated task to see localhost's facts.
Fix: Be explicit about which host's variables you're using. When in doubt, print both:
- name: Debug delegation context
ansible.builtin.debug:
msg: |
inventory_hostname: {{ inventory_hostname }}
ansible_hostname: {{ ansible_hostname }}
delegated to: {{ ansible_delegated_vars['localhost']['ansible_host'] | default('localhost') }}
delegate_to: localhost
12. raw Module Needed Before Python Is Installed¶
Ansible modules require Python on the target host. On a fresh minimal OS install, Python may not be present, and standard modules fail with a "module failure" complaining that no interpreter was found.
Fix: Use the raw module for bootstrap, then install Python:
- hosts: new_servers
gather_facts: false # Can't gather facts without Python
tasks:
- name: Install Python (bootstrap)
ansible.builtin.raw: |
if command -v apt-get >/dev/null 2>&1; then
apt-get update && apt-get install -y python3
elif command -v dnf >/dev/null 2>&1; then
dnf install -y python3
fi
changed_when: true
- name: Now gather facts
ansible.builtin.setup:
- name: Continue with normal modules
ansible.builtin.apt:
name: nginx
state: present
when: ansible_os_family == "Debian"
13. Gathering Facts on 1000 Hosts (Slow Start)¶
Default behavior: Ansible connects to every host in the play, runs the setup module, and downloads all facts. With 1000 hosts and the default forks = 5, fact gathering alone can take 10+ minutes before a single task runs.
# SLOW: default behavior
- hosts: all
tasks:
- name: Install monitoring agent
ansible.builtin.package:
name: node-exporter
state: present
Fix: Disable fact gathering when you don't need facts:
- hosts: all
gather_facts: false # Skip the 10-minute wait
tasks:
- name: Install monitoring agent
ansible.builtin.package:
name: node-exporter
state: present
become: true
If you need some facts, gather only what you need:
- hosts: all
gather_facts: false
tasks:
- name: Gather minimal facts
ansible.builtin.setup:
gather_subset:
- '!all'
- '!min'
- distribution
Combine with increased forks and fact caching for large environments:
# ansible.cfg
[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600