
Ansible Deep Dive - Footguns

Advanced mistakes that experienced operators still make. These go beyond the basics into the territory where real infrastructure gets misconfigured.


1. Variable Precedence Confusion (extra-vars Wins All)

You carefully set app_port: 8080 in group_vars/production.yml. A colleague's CI pipeline passes -e app_port=9090 and doesn't tell you. Every deploy now uses port 9090. You debug for hours looking at group_vars, role defaults, and host_vars before discovering the extra var.

Extra vars (-e) have the highest precedence in Ansible. They override everything, silently.

Fix: Avoid using -e for values that should be persistent. Use it only for one-time overrides. Document which variables are expected to be set via extra vars. In CI, log the extra vars being passed.

# Debug: see what value a host actually gets
ansible -i inventory/ web1.example.com -m debug -a "var=app_port"
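In CI, making the override visible can be as simple as printing the extra vars before the playbook runs. A sketch of such a wrapper; the variable contents and playbook name are hypothetical stand-ins:

```shell
#!/bin/sh
# Sketch of a CI wrapper: surface extra vars before they silently win.
# EXTRA_VARS and deploy.yml are hypothetical stand-ins for your pipeline's values.
EXTRA_VARS="app_port=9090"
if [ -n "$EXTRA_VARS" ]; then
  echo "NOTICE: passing extra vars (highest precedence): $EXTRA_VARS"
fi
# ansible-playbook deploy.yml -e "$EXTRA_VARS"
```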

2. Not Using become When Needed (Silent Permission Denied)

Your task installs a package but you forgot become: true. With apt you usually get a loud permission or lock-file error, but other modules can quietly write files into the SSH user's home instead of the intended location, or a multi-step module can partially succeed. Worst case, the task reports "ok" while the change you wanted never happened.

# WRONG: runs as the SSH user, not root
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

# RIGHT
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present
  become: true

Fix: Set become: true on individual tasks that need it. Don't set it at the play level (that causes a different footgun -- everything runs as root when it shouldn't).
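For contrast, a sketch of the play-level variant the fix warns about (the repo URL and paths are made up):

```yaml
# Play-level become: EVERY task in the play now runs as root
- hosts: webservers
  become: true
  tasks:
    - name: Clone app repo
      ansible.builtin.git:
        repo: https://git.example.com/app.git   # hypothetical
        dest: /opt/app
      # Runs as root, so /opt/app ends up root-owned and later tasks
      # running as the deploy user can no longer write to it.
```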


3. Handler Not Notified (Task Didn't Change)

You update a config template but the rendered output is identical to what's already on disk. Ansible reports "ok" (not "changed"). The handler never fires. You expected a service restart but it didn't happen.

- name: Deploy config
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/config.yml
  notify: Restart application  # Only fires if the task status is "changed"

Fix: If you need the restart to happen regardless, run it as a regular task. Note that meta: flush_handlers does not force anything -- it only runs handlers that have already been notified, earlier than the default end-of-play point. And if you changed the template but it renders the same content, the handler correctly doesn't fire -- the config didn't actually change.

# Run already-notified handlers now, instead of at the end of the play
- name: Flush pending handlers
  ansible.builtin.meta: flush_handlers

# Or restart unconditionally as a normal task (state: restarted always
# reports "changed")
- name: Restart application unconditionally
  ansible.builtin.systemd:
    name: myapp
    state: restarted

4. include vs import (Conditional Behavior Differs)

# import_tasks: static -- the file is inlined at parse time and the
# condition is copied onto EACH task inside the file
- ansible.builtin.import_tasks: setup.yml
  when: needs_setup
  # If setup.yml has 5 tasks, each one gets "when: needs_setup",
  # re-evaluated before each task runs. A task inside that flips
  # needs_setup to false WILL skip the remaining imported tasks.

# include_tasks: condition applies to the INCLUDE decision only
- ansible.builtin.include_tasks: setup.yml
  when: needs_setup
  # If needs_setup is true, ALL 5 tasks run regardless
  # If needs_setup is false, NO tasks run

Fix: Use import_tasks when the file always exists and you want per-task conditions. Use include_tasks when the filename is dynamic or you want all-or-nothing inclusion. When in doubt, test with --check -vvv to see what gets evaluated.
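A minimal sketch of the per-task re-evaluation with import_tasks (file and variable names are hypothetical):

```yaml
# setup.yml
- name: Decide setup is already done
  ansible.builtin.set_fact:
    needs_setup: false

- name: Expensive setup step
  ansible.builtin.debug:
    msg: "skipped once needs_setup flips to false"

# main playbook
# - ansible.builtin.import_tasks: setup.yml
#   when: needs_setup
#
# With import_tasks, the second task is skipped as soon as the first one
# flips needs_setup; with include_tasks, both tasks would run.
```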


5. Jinja2 Whitespace in Templates

{% for server in backend_servers %}
  server {{ server }}:{{ backend_port }};
{% endfor %}

With stock Jinja2 settings, this produces:

  server 10.0.1.10:8080;

  server 10.0.1.11:8080;

Extra blank lines appear between entries because, in stock Jinja2, the newline after each {% %} tag is kept as part of the loop body. (Ansible's template module enables trim_blocks by default, which removes those newlines, so this bites most often when trim_blocks is disabled or when the same template is rendered outside Ansible.) NGINX might not care, but YAML or Python configs will break.

Fix: Use whitespace control:

{%- for server in backend_servers %}
  server {{ server }}:{{ backend_port }};
{%- endfor %}

The - strips whitespace before ({%-) or after (-%}) the tag. Test template output by rendering to a scratch file (the template module's atomic copy-then-move doesn't play well with dest=/dev/stdout):

ansible localhost -m ansible.builtin.template -a "src=template.j2 dest=/tmp/rendered.conf" && cat /tmp/rendered.conf

6. Vault Password in Shell History

# This is in your bash history forever
ansible-vault encrypt_string 'actual_database_password' --name 'db_password'

Anyone with access to ~/.bash_history can read the password.

Fix: Keep the secret itself off the command line entirely. Note that a herestring (<<< 'secret') or echo 'secret' | ... still records the secret in shell history, so read it from stdin or from a file instead:

# Type the secret interactively, then press Ctrl-D (nothing sensitive in history)
ansible-vault encrypt_string --vault-password-file ~/.vault_pass --stdin-name 'db_password'

# Or read it from a file written by your secret store
ansible-vault encrypt_string --vault-password-file ~/.vault_pass --stdin-name 'db_password' < db_password.txt

# Or prompt for the vault password as well
ansible-vault encrypt_string --vault-id prod@prompt --stdin-name 'db_password'

7. Not Quoting Jinja2 in YAML ({{ Must Be Quoted)

# BROKEN: YAML parser interprets {{ as a mapping
vars:
  message: {{ greeting }} world

# CORRECT
vars:
  message: "{{ greeting }} world"

YAML treats { as the start of a mapping. Unquoted Jinja2 expressions cause parse errors that are cryptic:

ERROR! Syntax Error while loading YAML.
  mapping values are not allowed in this context

Fix: Always quote values that start with {{. This is the single most common Ansible YAML error.

# All of these are correct
simple: "{{ my_var }}"
combined: "prefix-{{ my_var }}-suffix"
bool_cast: "{{ my_var | bool }}"
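The inverse footgun is worth noting: when: (like other conditionals) is already a raw Jinja2 expression, so it takes no braces at all. A sketch:

```yaml
- name: Conditional task
  ansible.builtin.debug:
    msg: "feature enabled"
  when: enable_feature | bool       # correct: bare expression, no braces
  # when: "{{ enable_feature }}"    # works, but Ansible warns that
  #                                 # conditionals should not contain
  #                                 # templating delimiters
```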

8. serial: 1 Performance

- hosts: webservers   # 100 hosts
  serial: 1           # One at a time

Each host goes through the full play sequentially. With 100 hosts and a 3-minute play, that's 5 hours. If you're doing rolling updates, serial: 1 is unnecessarily cautious.

Fix: Use graduated serial. Note that percentages are computed from the play's total host count, not from the hosts remaining:

serial:
  - 1        # First: one canary host (catch obvious failures)
  - "10%"    # Second: 10% of all 100 hosts = 10
  - "50%"    # Third: 50 hosts
  - "100%"   # Fourth: everything left (the final value repeats until done)
max_fail_percentage: 10
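Since serial percentages are taken from the play's total host count, the batch sizes for 100 hosts can be checked with a little shell arithmetic (a standalone sketch, not Ansible itself):

```shell
# Emulate how serial: [1, "10%", "50%", "100%"] batches 100 hosts.
# Percentages come out of the total (100); a batch is capped at
# whatever hosts remain.
total=100
remaining=$total
for batch in 1 $((total * 10 / 100)) $((total * 50 / 100)) $total; do
  if [ "$batch" -gt "$remaining" ]; then batch=$remaining; fi
  echo "batch of $batch hosts"
  remaining=$((remaining - batch))
done
```

The batches come out as 1, 10, 50, and the 39 hosts that are left.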

9. No Retry on Transient SSH Failures

A network blip during a large-scale playbook run fails one task on one host. The entire play fails for that host. With serial, this means the rolling update stops.

Fix: Configure retries at the SSH level:

# ansible.cfg
[ssh_connection]
retries = 3

And at the task level for flaky operations:

- name: Download artifact
  ansible.builtin.get_url:
    url: "https://artifacts.example.com/app-{{ version }}.tar.gz"
    dest: /tmp/app.tar.gz
  retries: 3
  delay: 5
  register: download
  until: download is succeeded

10. ansible_facts Caching Stale

You enabled fact caching for performance:

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

Someone added a disk to a server yesterday. Your cached facts still show the old disk layout. Your playbook that partitions disks uses stale data and fails or does the wrong thing.

Fix: Set an appropriate cache timeout. Use gather_facts: true (or ansible.builtin.setup task) explicitly when you need fresh facts for critical decisions. Clear the cache before important runs:

# Clear fact cache
rm -rf /tmp/ansible_facts/*

# Or force fresh facts in playbook
- hosts: all
  gather_facts: false
  tasks:
    - name: Gather fresh facts
      ansible.builtin.setup:
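A quick way to spot entries older than the timeout is to check mtimes in the cache directory. A sketch assuming GNU touch/find; /tmp/facts_demo stands in for your fact_caching_connection path, and the host files are fabricated:

```shell
# Stand-in cache dir with one stale and one fresh entry
mkdir -p /tmp/facts_demo
touch -d '2 days ago' /tmp/facts_demo/old-host
touch /tmp/facts_demo/fresh-host

# Entries older than the 86400s (1440 min) timeout are stale
find /tmp/facts_demo -type f -mmin +1440
```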

11. delegate_to Not Setting Host Context

- hosts: webservers
  tasks:
    - name: Add DNS record
      amazon.aws.route53:
        zone: example.com
        record: "{{ inventory_hostname }}"
        type: A
        value: "{{ ansible_default_ipv4.address }}"
      delegate_to: localhost

This runs on localhost but uses facts from the target host. That's usually what you want. But if you access hostvars[inventory_hostname] inside the delegated task, you get the target host's vars. If you access ansible_hostname, you get the target's hostname, not localhost's.

Fix: Be explicit about which host's variables you're using. When in doubt, print both:

- name: Debug delegation context
  ansible.builtin.debug:
    msg: |
      inventory_hostname: {{ inventory_hostname }}
      ansible_hostname: {{ ansible_hostname }}
      delegated to: {{ ansible_delegated_vars['localhost']['ansible_host'] | default('localhost') }}
  delegate_to: localhost

12. raw Module Needed Before Python Is Installed

Ansible modules require Python on the target host. On a fresh minimal OS install, Python may not be present. Standard modules fail:

FAILED! => {"msg": "ansible requires a python interpreter on the target host"}

Fix: Use the raw module for bootstrap, then install Python:

- hosts: new_servers
  gather_facts: false   # Can't gather facts without Python

  tasks:
    - name: Install Python (bootstrap)
      ansible.builtin.raw: |
        if command -v apt-get >/dev/null 2>&1; then
          apt-get update && apt-get install -y python3
        elif command -v dnf >/dev/null 2>&1; then
          dnf install -y python3
        fi
      changed_when: true

    - name: Now gather facts
      ansible.builtin.setup:

    - name: Continue with normal modules
      ansible.builtin.apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"
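The package-manager detection inside the raw task is plain POSIX shell, so it can be exercised locally before you point it at a fresh server (a sketch; output depends on which package manager the machine has):

```shell
# Same detection logic as the raw task, as a standalone function
detect_pm() {
  if command -v apt-get >/dev/null 2>&1; then
    echo apt
  elif command -v dnf >/dev/null 2>&1; then
    echo dnf
  else
    echo none
  fi
}
detect_pm
```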

13. Gathering Facts on 1000 Hosts (Slow Start)

Default behavior: Ansible connects to every host in the play, runs the setup module, and downloads all facts. With 1000 hosts and forks = 5, just gathering facts takes 10+ minutes before a single task runs.

# SLOW: default behavior
- hosts: all
  tasks:
    - name: Install monitoring agent
      ansible.builtin.package:
        name: node-exporter
        state: present

Fix: Disable fact gathering when you don't need facts:

- hosts: all
  gather_facts: false   # Skip the 10-minute wait
  tasks:
    - name: Install monitoring agent
      ansible.builtin.package:
        name: node-exporter
        state: present
      become: true

If you need some facts, gather only what you need:

- hosts: all
  gather_facts: false
  tasks:
    - name: Gather minimal facts
      ansible.builtin.setup:
        gather_subset:
          - '!all'
          - '!min'
          - distribution

Combine with increased forks and fact caching for large environments:

# ansible.cfg
[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600
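The payoff of raising forks is easy to estimate: the number of connection rounds is roughly the host count divided by forks. A back-of-envelope sketch:

```shell
# Rounds of parallel connections needed for 1000 hosts: ceil(hosts / forks)
hosts=1000
for forks in 5 50; do
  rounds=$(( (hosts + forks - 1) / forks ))   # ceiling division
  echo "forks=$forks -> $rounds rounds"
done
```

Going from 5 to 50 forks cuts 200 rounds of connections down to 20.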