
Ansible Deep Dive - Street Ops

What experienced Ansible operators deal with daily. Debugging, testing, performance, and production patterns.

Debugging Failed Tasks

Increasing Verbosity

# Standard verbosity levels
ansible-playbook site.yml -v      # Show task results
ansible-playbook site.yml -vv     # Show task input parameters
ansible-playbook site.yml -vvv    # Show SSH connection details
ansible-playbook site.yml -vvvv   # Show connection plugin internals

# Start at a specific task (skip everything before it)
ansible-playbook site.yml --start-at-task="Deploy application"

# Step through tasks one at a time (interactive confirmation)
ansible-playbook site.yml --step

# List all tasks without executing
ansible-playbook site.yml --list-tasks

# List all hosts that would be targeted
ansible-playbook site.yml --list-hosts

Register and Debug

tasks:
  - name: Check disk space
    ansible.builtin.command: df -h /
    register: disk_result
    changed_when: false

  - name: Show full result object
    ansible.builtin.debug:
      var: disk_result

  - name: Show the last line of stdout
    ansible.builtin.debug:
      msg: "Disk usage: {{ disk_result.stdout_lines[-1] }}"

  - name: Show return code
    ansible.builtin.debug:
      msg: "Return code: {{ disk_result.rc }}"

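For command and shell tasks, the registered variable contains stdout, stdout_lines, stderr, stderr_lines, and rc (return code). Registered results also carry status keys such as changed, failed, and skipped, and most modules add their own module-specific keys.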

Common Debug Patterns

tasks:
  # Print all facts for a host
  - name: Dump all facts
    ansible.builtin.debug:
      var: ansible_facts

  # Print a specific fact
  - name: Show OS family
    ansible.builtin.debug:
      var: ansible_os_family

  # Print variable with its type
  - name: Inspect variable
    ansible.builtin.debug:
      msg: "Value: {{ my_var }} (type: {{ my_var | type_debug }})"

  # Conditional debug
  - name: Alert on low memory
    ansible.builtin.debug:
      msg: "WARNING: Only {{ ansible_memfree_mb }}MB free on {{ inventory_hostname }}"
    when: ansible_memfree_mb < 512

Testing Playbooks

ansible-lint

# Install
pip install ansible-lint

# Run on a playbook
ansible-lint site.yml

# Run on all YAML in a directory
ansible-lint roles/

# Skip specific rules
ansible-lint -x command-instead-of-module,no-changed-when site.yml

# Show all available rules
ansible-lint -L

Common lint rules that catch real problems:
- command-instead-of-module: running a tool that has a dedicated module, e.g. shell: apt-get install nginx instead of the apt module
- no-changed-when: command/shell tasks without changed_when
- risky-file-permissions: files created without an explicit mode
- yaml[truthy]: using yes/no instead of true/false
- name[missing]: tasks without names
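Instead of passing -x on every run, the same choices can live in a .ansible-lint file in the project root, which ansible-lint reads automatically. A minimal sketch; the excluded path and the rules listed are only examples, not recommendations:

```yaml
# .ansible-lint
---
# Paths ansible-lint should not scan
exclude_paths:
  - .cache/
# Rules to ignore entirely
skip_list:
  - command-instead-of-module
# Rules reported as warnings that do not fail the run
warn_list:
  - no-changed-when
```

For a single deliberate exception, a `# noqa: no-changed-when` comment on the offending task line silences that rule for that line only.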

Molecule (Role Testing Framework)

# Install molecule plus the Docker driver (drivers ship in molecule-plugins)
pip install molecule 'molecule-plugins[docker]'

# Create a new role (Molecule 6 removed "molecule init role")
ansible-galaxy role init my_role

# Add a molecule scenario to a new or existing role
cd roles/nginx
molecule init scenario
# molecule/default/molecule.yml
---
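driver:
  name: docker
# pre_build_image: true uses each image as-is, so it must already ship
# systemd and python3. Stock ubuntu:noble has neither; substitute a
# systemd-enabled test image or let molecule build one from a Dockerfile.
platforms:
  - name: ubuntu-noble
    image: ubuntu:noble
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
  - name: rocky-9
    image: rockylinux:9
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw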
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
verifier:
  name: ansible
# molecule/default/converge.yml
---
- name: Converge
  hosts: all
  roles:
    - role: nginx
# molecule/default/verify.yml
---
- name: Verify
  hosts: all
  tasks:
    - name: Check nginx is installed
      ansible.builtin.package_facts:
        manager: auto

    - name: Assert nginx is installed
      ansible.builtin.assert:
        that: "'nginx' in ansible_facts.packages"

    - name: Check nginx is running
      ansible.builtin.service_facts:

    - name: Assert nginx is running
      ansible.builtin.assert:
        that: "ansible_facts.services['nginx.service'].state == 'running'"

    - name: Check nginx responds
      ansible.builtin.uri:
        url: http://localhost:80
      register: result

    - name: Assert nginx responds with 200
      ansible.builtin.assert:
        that: result.status == 200
# Run the full test sequence
molecule test

# Default sequence (Molecule 6): dependency → cleanup → destroy → syntax →
#   create → prepare → converge → idempotence → side_effect → verify →
#   cleanup → destroy

# Just converge (create + run playbook, leave containers up)
molecule converge

# Run verification
molecule verify

# Open a shell in a test container
molecule login -h ubuntu-noble

# Destroy test containers
molecule destroy

Check Mode Testing

# Dry run -- shows what would change without changing anything
ansible-playbook site.yml --check --diff

# Useful in CI right after a real run: check mode should report nothing left
# to change. Match only nonzero counts -- every PLAY RECAP line contains
# "changed=", so grepping for that alone would fail every build.
# (Modules without check-mode support are skipped, so this is a partial check.)
set -o pipefail
ansible-playbook site.yml --check --diff 2>&1 | tee check_output.txt || exit 1
if grep -qE 'changed=[1-9]' check_output.txt; then
  echo "NOT IDEMPOTENT: check mode reports pending changes"
  exit 1
fi
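A grep over the whole log can also trip on task output or diffs that happen to contain "changed=". A more targeted sketch reads only the PLAY RECAP lines and names the hosts that would change. It assumes the default stdout callback and uncolored output (the usual case when output is piped, as in CI); the function name hosts_with_changes is hypothetical:

```shell
#!/usr/bin/env bash
# hosts_with_changes (hypothetical helper): read ansible-playbook output on
# stdin and print every host whose PLAY RECAP line reports changed > 0.
# Assumes the default callback recap format, e.g.
#   web1.example.com : ok=5  changed=1  unreachable=0  failed=0 ...
hosts_with_changes() {
  awk '/ ok=[0-9]+ +changed=[0-9]+/ {
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^changed=/) {
        split($i, kv, "=")
        # The first field of a recap line is the host name
        if (kv[2] + 0 > 0) print $1
      }
    }
  }'
}
```

Used as `hosts_with_changes < check_output.txt`, an empty result means every host reported changed=0, and a non-empty one tells you exactly which hosts drifted.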

Rolling Updates

Serial Execution

- name: Rolling update web servers
  hosts: webservers
  serial: 2                    # Update 2 hosts at a time
  max_fail_percentage: 25      # Abort if >25% of a batch fails

  pre_tasks:
    - name: Remove from load balancer
      ansible.builtin.command: /usr/local/bin/lb-remove {{ inventory_hostname }}
      delegate_to: loadbalancer.example.com

  tasks:
    - name: Deploy new version
      ansible.builtin.copy:
        src: "app-{{ app_version }}.jar"
        dest: /opt/app/app.jar
      notify: Restart application

    # Fire the restart handler now. Left alone, it would run after the
    # health check below, restarting the app just before post_tasks puts
    # the host back into the load balancer.
    - name: Restart application if the jar changed
      ansible.builtin.meta: flush_handlers

    - name: Ensure application is running
      ansible.builtin.systemd:
        name: myapp
        state: started

    - name: Wait for application to be ready
      ansible.builtin.uri:
        url: http://localhost:8080/health
      register: health
      retries: 10
      delay: 5
      until: health.status == 200

  post_tasks:
    - name: Add back to load balancer
      ansible.builtin.command: /usr/local/bin/lb-add {{ inventory_hostname }}
      delegate_to: loadbalancer.example.com

  handlers:
    - name: Restart application
      ansible.builtin.systemd:
        name: myapp
        state: restarted

Serial can be a number, percentage, or list:

serial:
  - 1         # First batch: 1 host (canary)
  - 5         # Second batch: 5 hosts
  - "25%"     # Then batches of 25% of the play's hosts until done
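To see how a list like this splits a play, here is a rough sketch of the batching arithmetic, assuming ansible-core behaves as follows: a percentage is taken of the play's total host count and rounded down (minimum 1), a value of 0 or less means all remaining hosts, and the last list entry repeats until every host is scheduled. It is an illustration for planning canary sizes, not Ansible's own code, and serial_batches is a hypothetical name:

```shell
#!/usr/bin/env bash
# serial_batches (hypothetical helper): print the batch sizes for a play
# with TOTAL hosts and a serial list such as: 1 5 25%
serial_batches() {
  local total=$1
  shift
  local -a specs=("$@")
  local -a sizes=()
  local remaining=$total idx=0 spec size

  while (( remaining > 0 )); do
    spec=${specs[idx]}
    if [[ $spec == *% ]]; then
      # Integer math rounds down, e.g. 25% of 20 hosts -> 5
      size=$(( ${spec%\%} * total / 100 ))
      (( size < 1 )) && size=1
    else
      size=$spec
    fi
    # A non-positive batch size means "everything that is left"
    (( size <= 0 )) && size=$remaining
    # The final batch may be smaller than requested
    (( size > remaining )) && size=$remaining
    sizes+=("$size")
    remaining=$(( remaining - size ))
    # Advance through the list, then keep reusing its last entry
    (( idx < ${#specs[@]} - 1 )) && idx=$(( idx + 1 ))
  done
  echo "${sizes[*]}"
}
```

For 20 web servers, `serial_batches 20 1 5 25%` prints `1 5 5 5 4`: a one-host canary, a batch of five, then batches of five (25% of 20) until the pool runs out.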

Secret Management with Vault

Operational Patterns

# Use a vault password file (no interactive prompt)
echo "mypassword" > ~/.vault_pass.txt
chmod 600 ~/.vault_pass.txt
export ANSIBLE_VAULT_PASSWORD_FILE=~/.vault_pass.txt

# Or set in ansible.cfg
# [defaults]
# vault_password_file = ~/.vault_pass.txt

# Encrypt a single value (rather than the whole file) and append it to a vars file
ansible-vault encrypt_string 'db_p@ssw0rd' --name 'db_password' >> vars/secrets.yml

# View encrypted variable values
ansible localhost -m debug -a "var=db_password" -e @vars/secrets.yml --vault-password-file ~/.vault_pass.txt

Vault in CI/CD

# GitHub Actions example
- name: Run Ansible playbook
  env:
    ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD }}
  run: |
    trap 'rm -f /tmp/.vault_pass' EXIT   # remove the file even if the run fails
    (umask 077 && printf '%s\n' "$ANSIBLE_VAULT_PASSWORD" > /tmp/.vault_pass)
    ansible-playbook -i inventory/production site.yml \
      --vault-password-file /tmp/.vault_pass

Dynamic Inventory for Cloud

AWS EC2 Pattern

# inventory/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
filters:
  tag:ManagedBy: ansible
  instance-state-name: running
hostnames:
  - private-ip-address
compose:
  ansible_host: private_ip_address
  ansible_user: "'ubuntu'"
# Test the inventory
ansible-inventory -i inventory/aws_ec2.yml --graph

# Output:
# @all:
#   |--@env_production:
#   |  |--10.0.1.10
#   |  |--10.0.1.11
#   |--@role_webserver:
#   |  |--10.0.1.10
#   |--@role_database:
#   |  |--10.0.1.11

# Use in playbook
ansible-playbook -i inventory/aws_ec2.yml site.yml --limit role_webserver

GCP Compute Pattern

# inventory/gcp_compute.yml
plugin: google.cloud.gcp_compute
projects:
  - my-project-123456
auth_kind: serviceaccount
service_account_file: /path/to/sa.json
keyed_groups:
  - key: labels.role
    prefix: role
  - key: zone
    prefix: zone
filters:
  - labels.managed_by = ansible
  - status = RUNNING
compose:
  ansible_host: networkInterfaces[0].accessConfigs[0].natIP | default(networkInterfaces[0].networkIP)

Handling Different OS Families

tasks:
  - name: Include OS-specific variables
    ansible.builtin.include_vars: "{{ ansible_os_family | lower }}.yml"

  - name: Install packages (Debian/Ubuntu)
    ansible.builtin.apt:
      name: "{{ packages }}"
      state: present
      update_cache: true
      cache_valid_time: 3600
    when: ansible_os_family == "Debian"

  - name: Install packages (RHEL/Rocky)
    ansible.builtin.dnf:
      name: "{{ packages }}"
      state: present
    when: ansible_os_family == "RedHat"

  # Or use the generic package module (less control but cross-platform)
  - name: Install packages (any OS)
    ansible.builtin.package:
      name: "{{ common_packages }}"
      state: present
# vars/debian.yml
packages:
  - nginx
  - python3-pip
  - libpq-dev
service_name: nginx

# vars/redhat.yml
packages:
  - nginx
  - python3-pip
  - libpq-devel
service_name: nginx
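When one file per OS family is too coarse (for example, when two major releases of the same distro need different package names), the first_found lookup can try more specific files first and fall back to the family file. A sketch, assuming the vars/ layout above; the distribution-versioned file names are hypothetical:

```yaml
tasks:
  - name: Include the most specific OS variables file available
    ansible.builtin.include_vars: "{{ lookup('ansible.builtin.first_found', params) }}"
    vars:
      params:
        files:
          # e.g. rocky-9.yml, then rocky.yml, then redhat.yml
          - "{{ ansible_distribution | lower }}-{{ ansible_distribution_major_version }}.yml"
          - "{{ ansible_distribution | lower }}.yml"
          - "{{ ansible_os_family | lower }}.yml"
        paths:
          - vars
```

On Rocky Linux 9 this would try vars/rocky-9.yml, then vars/rocky.yml, then vars/redhat.yml, and the task fails if none of them exist.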

Idempotency Checking

tasks:
  # BAD: always shows "changed"
  - name: Set hostname
    ansible.builtin.command: hostnamectl set-hostname {{ inventory_hostname }}

  # GOOD: only "changed" when hostname actually changes
  - name: Set hostname
    ansible.builtin.hostname:
      name: "{{ inventory_hostname }}"

  # When you must use command, add changed_when
  - name: Check current hostname
    ansible.builtin.command: hostname
    register: current_hostname
    changed_when: false

  - name: Set hostname if different
    ansible.builtin.command: hostnamectl set-hostname {{ inventory_hostname }}
    when: current_hostname.stdout != inventory_hostname
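When a command leaves a file behind, the command module's creates argument gives the same skip-if-already-done behavior without a separate check task: the task is skipped when the path already exists. A sketch; the script and marker file are hypothetical:

```yaml
tasks:
  - name: Initialize the application database once
    ansible.builtin.command:
      # Hypothetical one-time setup script that writes the marker file below
      cmd: /opt/app/bin/init-db
      creates: /var/lib/app/.db-initialized
```

The inverse argument, removes, runs the command only while the given path still exists.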

Performance Optimization

SSH Pipelining

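Pipelining pipes module code to the remote Python interpreter over the existing SSH session instead of first copying it to a temporary file, which cuts the number of SSH operations per task: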

# ansible.cfg
[ssh_connection]
pipelining = True

If you escalate privileges with sudo (become), pipelining requires requiretty to be disabled in /etc/sudoers on the target hosts. Most modern distros don't enable requiretty by default.

Forks

Control parallelism:

# ansible.cfg
[defaults]
# Default is 5; raise it for large inventories
forks = 30
# Override at runtime
ansible-playbook site.yml --forks 50

Fact Caching

Gathering facts on 500 hosts is slow. Cache them:

# ansible.cfg
[defaults]
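# Only gather facts for hosts that have no cached facts
gathering = smart
# jsonfile is built in; redis and memcached plugins come from community.general
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# Cache lifetime in seconds (24 hours)
fact_caching_timeout = 86400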

Selective Fact Gathering

# Disable facts entirely if you don't need them
- hosts: webservers
  gather_facts: false
  tasks:
    - name: Deploy config
      ansible.builtin.copy:
        src: app.conf
        dest: /etc/app/config

# Or gather only specific subsets (the minimal "min" subset is always added
# unless you exclude it with "!min")
- hosts: webservers
  gather_facts: false
  tasks:
    - name: Gather only network and hardware facts
      ansible.builtin.setup:
        gather_subset:
          - network
          - hardware

Mitogen (Third-Party Accelerator)

Mitogen replaces Ansible's default SSH-based module execution with a more efficient RPC mechanism. The project advertises speedups of roughly 1.25x to 7x, depending on the workload.

# ansible.cfg
[defaults]
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

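Caveat: Mitogen is a third-party project that supports only a specific range of Ansible versions, so confirm it supports your ansible-core release and test thoroughly before using it in production. It may not support all connection plugins or become methods.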

Async for Long Tasks

tasks:
  # Start the upgrade in the background on every host so the play can move on
  - name: Update all packages
    ansible.builtin.apt:
      upgrade: dist
    async: 1800     # Allow up to 30 minutes
    poll: 0         # Don't wait
    register: update_job

  # Continue with other tasks...

  - name: Wait for updates to complete
    ansible.builtin.async_status:
      jid: "{{ update_job.ansible_job_id }}"
    register: update_result
    until: update_result.finished
    retries: 60
    delay: 30

Managing Large Inventories

# Test connectivity to all hosts
ansible all -i inventory/ -m ping --forks 50

# Run ad-hoc commands
ansible webservers -i inventory/ -m command -a "uptime" --forks 30

# Gather facts (stored in the fact cache if fact_caching is configured)
ansible all -i inventory/ -m setup --forks 50

# Check which hosts match a pattern
ansible-inventory -i inventory/ --list --limit 'webservers:&production'

# Graph the inventory structure
ansible-inventory -i inventory/ --graph

Inventory Patterns

# All hosts in webservers AND production
ansible 'webservers:&production' -m ping

# All hosts in webservers OR dbservers
ansible 'webservers:dbservers' -m ping

# All hosts in webservers but NOT in maintenance
ansible 'webservers:!maintenance' -m ping

# Regex match
ansible '~web[0-9]+\.example\.com' -m ping
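A small hypothetical inventory makes the set logic concrete; here web2 is in maintenance:

```yaml
# inventory/hosts.yml (hypothetical)
all:
  children:
    webservers:
      hosts:
        web1.example.com:
        web2.example.com:
    dbservers:
      hosts:
        db1.example.com:
    production:
      hosts:
        web1.example.com:
        db1.example.com:
    maintenance:
      hosts:
        web2.example.com:
```

Against this inventory, 'webservers:&production' and 'webservers:!maintenance' both resolve to web1.example.com, while 'webservers:dbservers' matches web1, web2, and db1. Adding --list-hosts to any of the commands above prints the matched hosts without running a module.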

ansible-navigator vs ansible-playbook

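ansible-navigator is a newer command-line tool (with an optional text UI) from the Ansible project that runs playbooks inside execution environments and adds inventory browsing, documentation lookup, and run replay on top of what ansible-playbook does: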

# Install
pip install ansible-navigator

# Run a playbook (same as ansible-playbook but with a TUI)
ansible-navigator run site.yml -i inventory/

# Run in stdout mode (like ansible-playbook)
ansible-navigator run site.yml -i inventory/ --mode stdout

# Explore inventory
ansible-navigator inventory -i inventory/ --mode interactive

# View documentation
ansible-navigator doc ansible.builtin.copy

# Replay a previous run
ansible-navigator replay /path/to/artifact.json

ansible-navigator uses execution environments (container images with Ansible and dependencies). This ensures consistent execution regardless of the control node's installed packages.

# ansible-navigator.yml
---
ansible-navigator:
  execution-environment:
    image: quay.io/ansible/creator-ee:latest
    pull:
      policy: missing
  mode: stdout
  playbook-artifact:
    enable: true
    save-as: artifacts/{playbook_name}-{time_stamp}.json

Ad-Hoc Commands

Quick tasks without writing a playbook:

# Check connectivity
ansible all -m ping

# Run a command
ansible webservers -m command -a "df -h /"

# Copy a file
ansible webservers -m copy -a "src=hotfix.conf dest=/etc/app/config.conf backup=yes" --become

# Install a package
ansible webservers -m apt -a "name=htop state=present" --become

# Restart a service
ansible webservers -m systemd -a "name=nginx state=restarted" --become

# Gather facts
ansible web1.example.com -m setup -a "filter=ansible_distribution*"