
Ansible Deep Dive - Footguns

Advanced mistakes that experienced operators still make. These go beyond the basics into the territory where real infrastructure gets misconfigured.


1. Variable Precedence Confusion (extra-vars Wins All)

You carefully set app_port: 8080 in group_vars/production.yml. A colleague's CI pipeline passes -e app_port=9090 and doesn't tell you. Every deploy now uses port 9090. You debug for hours looking at group_vars, role defaults, and host_vars before discovering the extra var.

Extra vars (-e) have the highest precedence in Ansible. They override everything, silently.

Fix: Avoid using -e for values that should be persistent. Use it only for one-time overrides. Document which variables are expected to be set via extra vars. In CI, log the extra vars being passed.

# Debug: see what value a host actually gets
ansible -i inventory/ web1.example.com -m debug -a "var=app_port"
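In CI, making the override visible can be as simple as printing the extra vars before the playbook runs. A sketch of such a wrapper; the variable contents and playbook name are hypothetical stand-ins:

```shell
#!/bin/sh
# Sketch of a CI wrapper: surface extra vars before they silently win.
# EXTRA_VARS and deploy.yml are hypothetical stand-ins for your pipeline's values.
EXTRA_VARS="app_port=9090"
if [ -n "$EXTRA_VARS" ]; then
  echo "NOTICE: passing extra vars (highest precedence): $EXTRA_VARS"
fi
# ansible-playbook deploy.yml -e "$EXTRA_VARS"
```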

2. Not Using become When Needed (Silent Permission Denied)

Your task installs a package but you forgot become: true. With apt you usually get a loud permission or lock-file error, but other modules can quietly write files into the SSH user's home instead of the intended location, or a multi-step module can partially succeed. Worst case, the task reports "ok" while the change you wanted never happened.

# WRONG: runs as the SSH user, not root
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

# RIGHT
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present
  become: true

Fix: Set become: true on individual tasks that need it. Don't set it at the play level (that causes a different footgun -- everything runs as root when it shouldn't).
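For contrast, a sketch of the play-level variant the fix warns about (the repo URL and paths are made up):

```yaml
# Play-level become: EVERY task in the play now runs as root
- hosts: webservers
  become: true
  tasks:
    - name: Clone app repo
      ansible.builtin.git:
        repo: https://git.example.com/app.git   # hypothetical
        dest: /opt/app
      # Runs as root, so /opt/app ends up root-owned and later tasks
      # running as the deploy user can no longer write to it.
```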


3. Handler Not Notified (Task Didn't Change)

You update a config template but the rendered output is identical to what's already on disk. Ansible reports "ok" (not "changed"). The handler never fires. You expected a service restart but it didn't happen.

- name: Deploy config
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/config.yml
  notify: Restart application  # Only fires if the task status is "changed"

Fix: If you need the restart to happen regardless, run it as a regular task. Note that meta: flush_handlers does not force anything -- it only runs handlers that have already been notified, earlier than the default end-of-play point. And if you changed the template but it renders the same content, the handler correctly doesn't fire -- the config didn't actually change.

# Run already-notified handlers now, instead of at the end of the play
- name: Flush pending handlers
  ansible.builtin.meta: flush_handlers

# Or restart unconditionally as a normal task (state: restarted always
# reports "changed")
- name: Restart application unconditionally
  ansible.builtin.systemd:
    name: myapp
    state: restarted

4. include vs import (Conditional Behavior Differs)

# import_tasks: static -- the file is inlined at parse time and the
# condition is copied onto EACH task inside the file
- ansible.builtin.import_tasks: setup.yml
  when: needs_setup
  # If setup.yml has 5 tasks, each one gets "when: needs_setup",
  # re-evaluated before each task runs. A task inside that flips
  # needs_setup to false WILL skip the remaining imported tasks.

# include_tasks: condition applies to the INCLUDE decision only
- ansible.builtin.include_tasks: setup.yml
  when: needs_setup
  # If needs_setup is true, ALL 5 tasks run regardless
  # If needs_setup is false, NO tasks run

Fix: Use import_tasks when the file always exists and you want per-task conditions. Use include_tasks when the filename is dynamic or you want all-or-nothing inclusion. When in doubt, test with --check -vvv to see what gets evaluated.
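A minimal sketch of the per-task re-evaluation with import_tasks (file and variable names are hypothetical):

```yaml
# setup.yml
- name: Decide setup is already done
  ansible.builtin.set_fact:
    needs_setup: false

- name: Expensive setup step
  ansible.builtin.debug:
    msg: "skipped once needs_setup flips to false"

# main playbook
# - ansible.builtin.import_tasks: setup.yml
#   when: needs_setup
#
# With import_tasks, the second task is skipped as soon as the first one
# flips needs_setup; with include_tasks, both tasks would run.
```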


5. Jinja2 Whitespace in Templates

{% for server in backend_servers %}
  server {{ server }}:{{ backend_port }};
{% endfor %}

With stock Jinja2 settings, this produces:

  server 10.0.1.10:8080;

  server 10.0.1.11:8080;

Extra blank lines appear between entries because, in stock Jinja2, the newline after each {% %} tag is kept as part of the loop body. (Ansible's template module enables trim_blocks by default, which removes those newlines, so this bites most often when trim_blocks is disabled or when the same template is rendered outside Ansible.) NGINX might not care, but YAML or Python configs will break.

Fix: Use whitespace control:

{%- for server in backend_servers %}
  server {{ server }}:{{ backend_port }};
{%- endfor %}

The - strips whitespace before ({%-) or after (-%}) the tag. Test template output by rendering to a scratch file (the template module's atomic copy-then-move doesn't play well with dest=/dev/stdout):

ansible localhost -m ansible.builtin.template -a "src=template.j2 dest=/tmp/rendered.conf" && cat /tmp/rendered.conf

6. Vault Password in Shell History

# This is in your bash history forever
ansible-vault encrypt_string 'actual_database_password' --name 'db_password'

Anyone with access to ~/.bash_history can read the password.

Fix: Keep the secret itself off the command line entirely. Note that a herestring (<<< 'secret') or echo 'secret' | ... still records the secret in shell history, so read it from stdin or from a file instead:

# Type the secret interactively, then press Ctrl-D (nothing sensitive in history)
ansible-vault encrypt_string --vault-password-file ~/.vault_pass --stdin-name 'db_password'

# Or read it from a file written by your secret store
ansible-vault encrypt_string --vault-password-file ~/.vault_pass --stdin-name 'db_password' < db_password.txt

# Or prompt for the vault password as well
ansible-vault encrypt_string --vault-id prod@prompt --stdin-name 'db_password'

7. Not Quoting Jinja2 in YAML ({{ Must Be Quoted)

# BROKEN: YAML parser interprets {{ as a mapping
vars:
  message: {{ greeting }} world

# CORRECT
vars:
  message: "{{ greeting }} world"

YAML treats { as the start of a mapping. Unquoted Jinja2 expressions cause parse errors that are cryptic:

ERROR! Syntax Error while loading YAML.
  mapping values are not allowed in this context

Fix: Always quote values that start with {{. This is the single most common Ansible YAML error.

# All of these are correct
simple: "{{ my_var }}"
combined: "prefix-{{ my_var }}-suffix"
bool_cast: "{{ my_var | bool }}"
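The inverse footgun is worth noting: when: (like other conditionals) is already a raw Jinja2 expression, so it takes no braces at all. A sketch:

```yaml
- name: Conditional task
  ansible.builtin.debug:
    msg: "feature enabled"
  when: enable_feature | bool       # correct: bare expression, no braces
  # when: "{{ enable_feature }}"    # works, but Ansible warns that
  #                                 # conditionals should not contain
  #                                 # templating delimiters
```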

8. serial: 1 Performance

- hosts: webservers   # 100 hosts
  serial: 1           # One at a time

Each host goes through the full play sequentially. With 100 hosts and a 3-minute play, that's 5 hours. If you're doing rolling updates, serial: 1 is unnecessarily cautious.

Fix: Use graduated serial. Note that percentages are computed from the play's total host count, not from the hosts remaining:

serial:
  - 1        # First: one canary host (catch obvious failures)
  - "10%"    # Second: 10% of all 100 hosts = 10
  - "50%"    # Third: 50 hosts
  - "100%"   # Fourth: everything left (the final value repeats until done)
max_fail_percentage: 10
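Since serial percentages are taken from the play's total host count, the batch sizes for 100 hosts can be checked with a little shell arithmetic (a standalone sketch, not Ansible itself):

```shell
# Emulate how serial: [1, "10%", "50%", "100%"] batches 100 hosts.
# Percentages come out of the total (100); a batch is capped at
# whatever hosts remain.
total=100
remaining=$total
for batch in 1 $((total * 10 / 100)) $((total * 50 / 100)) $total; do
  if [ "$batch" -gt "$remaining" ]; then batch=$remaining; fi
  echo "batch of $batch hosts"
  remaining=$((remaining - batch))
done
```

The batches come out as 1, 10, 50, and the 39 hosts that are left.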

9. No Retry on Transient SSH Failures

A network blip during a large-scale playbook run fails one task on one host. The entire play fails for that host. With serial, this means the rolling update stops.

Fix: Configure retries at the SSH level:

# ansible.cfg
[ssh_connection]
retries = 3

And at the task level for flaky operations:

- name: Download artifact
  ansible.builtin.get_url:
    url: "https://artifacts.example.com/app-{{ version }}.tar.gz"
    dest: /tmp/app.tar.gz
  retries: 3
  delay: 5
  register: download
  until: download is succeeded

10. ansible_facts Caching Stale

You enabled fact caching for performance:

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

Someone added a disk to a server yesterday. Your cached facts still show the old disk layout. Your playbook that partitions disks uses stale data and fails or does the wrong thing.

Fix: Set an appropriate cache timeout. Use gather_facts: true (or ansible.builtin.setup task) explicitly when you need fresh facts for critical decisions. Clear the cache before important runs:

# Clear fact cache
rm -rf /tmp/ansible_facts/*

# Or force fresh facts in playbook
- hosts: all
  gather_facts: false
  tasks:
    - name: Gather fresh facts
      ansible.builtin.setup:
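A quick way to spot entries older than the timeout is to check mtimes in the cache directory. A sketch assuming GNU touch/find; /tmp/facts_demo stands in for your fact_caching_connection path, and the host files are fabricated:

```shell
# Stand-in cache dir with one stale and one fresh entry
mkdir -p /tmp/facts_demo
touch -d '2 days ago' /tmp/facts_demo/old-host
touch /tmp/facts_demo/fresh-host

# Entries older than the 86400s (1440 min) timeout are stale
find /tmp/facts_demo -type f -mmin +1440
```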

11. delegate_to Not Setting Host Context

- hosts: webservers
  tasks:
    - name: Add DNS record
      amazon.aws.route53:
        zone: example.com
        record: "{{ inventory_hostname }}"
        type: A
        value: "{{ ansible_default_ipv4.address }}"
      delegate_to: localhost

This runs on localhost but uses facts from the target host. That's usually what you want. But if you access hostvars[inventory_hostname] inside the delegated task, you get the target host's vars. If you access ansible_hostname, you get the target's hostname, not localhost's.

Fix: Be explicit about which host's variables you're using. When in doubt, print both:

- name: Debug delegation context
  ansible.builtin.debug:
    msg: |
      inventory_hostname: {{ inventory_hostname }}
      ansible_hostname: {{ ansible_hostname }}
      delegated to: {{ ansible_delegated_vars['localhost']['ansible_host'] | default('localhost') }}
  delegate_to: localhost

12. raw Module Needed Before Python Is Installed

Ansible modules require Python on the target host. On a fresh minimal OS install, Python may not be present. Standard modules fail:

FAILED! => {"msg": "ansible requires a python interpreter on the target host"}

Fix: Use the raw module for bootstrap, then install Python:

- hosts: new_servers
  gather_facts: false   # Can't gather facts without Python

  tasks:
    - name: Install Python (bootstrap)
      ansible.builtin.raw: |
        if command -v apt-get >/dev/null 2>&1; then
          apt-get update && apt-get install -y python3
        elif command -v dnf >/dev/null 2>&1; then
          dnf install -y python3
        fi
      changed_when: true

    - name: Now gather facts
      ansible.builtin.setup:

    - name: Continue with normal modules
      ansible.builtin.apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"
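The package-manager detection inside the raw task is plain POSIX shell, so it can be exercised locally before you point it at a fresh server (a sketch; output depends on which package manager the machine has):

```shell
# Same detection logic as the raw task, as a standalone function
detect_pm() {
  if command -v apt-get >/dev/null 2>&1; then
    echo apt
  elif command -v dnf >/dev/null 2>&1; then
    echo dnf
  else
    echo none
  fi
}
detect_pm
```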

13. Gathering Facts on 1000 Hosts (Slow Start)

Default behavior: Ansible connects to every host in the play, runs the setup module, and downloads all facts. With 1000 hosts and forks = 5, just gathering facts takes 10+ minutes before a single task runs.

# SLOW: default behavior
- hosts: all
  tasks:
    - name: Install monitoring agent
      ansible.builtin.package:
        name: node-exporter
        state: present

Fix: Disable fact gathering when you don't need facts:

- hosts: all
  gather_facts: false   # Skip the 10-minute wait
  tasks:
    - name: Install monitoring agent
      ansible.builtin.package:
        name: node-exporter
        state: present
      become: true

If you need some facts, gather only what you need:

- hosts: all
  gather_facts: false
  tasks:
    - name: Gather minimal facts
      ansible.builtin.setup:
        gather_subset:
          - '!all'
          - '!min'
          - distribution

Combine with increased forks and fact caching for large environments:

# ansible.cfg
[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600
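The payoff of raising forks is easy to estimate: the number of connection rounds is roughly the host count divided by forks. A back-of-envelope sketch:

```shell
# Rounds of parallel connections needed for 1000 hosts: ceil(hosts / forks)
hosts=1000
for forks in 5 50; do
  rounds=$(( (hosts + forks - 1) / forks ))   # ceiling division
  echo "forks=$forks -> $rounds rounds"
done
```

Going from 5 to 50 forks cuts 200 rounds of connections down to 20.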