Ansible for Infrastructure Automation - Street Ops

What experienced Ansible operators know that the documentation doesn't emphasize enough.

One-liner: Ansible is SSH in a trench coat. If you can SSH to a host and run commands, Ansible can manage it. If SSH is broken, Ansible is broken.

Incident Runbooks

Playbook Fails Mid-Run

1. Read the error output carefully:
   - TASK name tells you exactly which task failed
   - "msg" field gives the specific error
   - "stdout" / "stderr" fields show command output

2. Common failure patterns:

   "Unreachable" - connection failed:
   - SSH connectivity: can you ssh to the host manually?
   - Wrong user: check ansible_user in inventory
   - Wrong key: check ansible_ssh_private_key_file
   - SSH host key changed: ssh-keygen -R <host>
   - Timeout: check ansible_ssh_timeout, increase for slow networks

   "Permission denied" - sudo/privilege issue:
   - become: yes missing from the play or task
   - sudo password required: use --ask-become-pass or ansible_become_password
   - User not in sudoers on the target host

   "Module failure" - task-specific error:
   - Package not found: wrong package name for the OS
   - Service not found: service name differs between distros
   - File not found: template or source file path is wrong
   - Python dependency missing on target

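3. Retry from the failed task:
   ansible-playbook site.yml --start-at-task="Install nginx"
   # Or limit the run to the hosts that failed, using the retry file.
   # Retry files are off by default since Ansible 2.8; set retry_files_enabled = True in ansible.cfg first:
   ansible-playbook site.yml --limit @site.retry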
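
The connection and privilege settings named in the checklist above usually live in inventory. A minimal YAML inventory sketch follows; the host name, login user, and key path are hypothetical and will differ in your environment:

```yaml
# inventory.yml (hypothetical host, user, and key path)
all:
  children:
    webservers:
      hosts:
        web1.example.com:
      vars:
        ansible_user: deploy                                  # SSH login user
        ansible_ssh_private_key_file: ~/.ssh/deploy_ed25519   # key used for that login
        ansible_ssh_timeout: 30                               # seconds; raise for slow links
        ansible_become: true                                  # escalate with sudo by default
```

Running ansible-inventory -i inventory.yml --host web1.example.com prints the variables Ansible resolves for that host, which is a quick way to confirm the user and key it will actually try.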

Connection Issues at Scale

1. SSH connection limits:
   - Running against 200 hosts, connections start failing
   - Default forks = 5 (only 5 hosts in parallel)
   - Increase: ansible-playbook site.yml -f 50
   - But don't go too high: every fork is a separate process on the control node, so its CPU and memory become the limit

2. SSH multiplexing (ControlPersist) and pipelining:
   # ansible.cfg
   [ssh_connection]
   ssh_args = -o ControlMaster=auto -o ControlPersist=60s
   pipelining = True
   # Pipelining reduces SSH operations per task.
   # It only works if sudoers on the targets does not set requiretty.

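3. Mitogen (performance plugin):
   - Still connects over SSH, but keeps a persistent Python interpreter on each target and reuses it across tasks
   - Often cited as 2-7x faster for many workloads
   - Close to a drop-in replacement: point strategy_plugins at the Mitogen install and set strategy = mitogen_linear
   - Check that it supports your Ansible version before adopting it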

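> **Scale note:** ControlPersist plus pipelining is the single biggest performance win for Ansible over SSH. Without pipelining, each task requires multiple SSH round trips (copy module, execute, fetch result). With pipelining, the module is piped straight to the remote interpreter over the existing connection, so most of those round trips disappear. ControlPersist is already part of Ansible's default ssh_args, but pipelining is off by default, so turn it on in ansible.cfg for any fleet over 20 hosts.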

4. SSH timeout cascade:
   - One host is unreachable, holding up the whole play
   - Set timeout: ansible_ssh_timeout=10 in inventory
   - Use async for long-running tasks, so they do not depend on one SSH session staying open:
     - name: Run migration
       command: /opt/migrate.sh
       async: 3600    # Max runtime 1 hour
       poll: 30       # Check every 30 seconds
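
When the play should not sit blocked on the task at all, the same async mechanism can be used in a start-then-check form. This is a sketch under the assumption that /opt/migrate.sh from the example above is safe to run in the background; the registered variable names are illustrative:

```yaml
- name: Start migration in the background
  ansible.builtin.command: /opt/migrate.sh
  async: 3600        # allow up to 1 hour of runtime
  poll: 0            # do not wait here; return a job id immediately
  register: migration_job

- name: Wait for the migration to finish
  ansible.builtin.async_status:
    jid: "{{ migration_job.ansible_job_id }}"
  register: migration_result
  until: migration_result.finished
  retries: 120       # 120 checks x 30 s = 1 hour
  delay: 30
```

Other tasks can be placed between the two, which is the main reason to prefer this form over a fixed poll interval.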

Fact Caching

1. Problem: gathering facts on 500 hosts takes 5+ minutes every run

2. Enable fact caching:
   # ansible.cfg
   [defaults]
   gathering = smart           # Only gather if cache is stale
   fact_caching = jsonfile
   fact_caching_connection = /tmp/ansible_facts_cache
   fact_caching_timeout = 86400  # 24 hours

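   # Or use Redis for shared caching across CI runners.
   # The connection string is host:port:db (not a URL); newer releases ship the plugin as community.general.redis:
   fact_caching = redis
   fact_caching_connection = localhost:6379:0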

3. Disable fact gathering for tasks that don't need it:
   - hosts: webservers
     gather_facts: no
     tasks:
       - name: Restart nginx
         service:
           name: nginx
           state: restarted

Role Dependency Hell

1. Problem: roles have conflicting dependencies or circular imports

2. Check dependencies:
   cat roles/webserver/meta/main.yml
   # dependencies:
   #   - common
   #   - { role: firewall, firewall_ports: [80, 443] }

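3. Dependency runs multiple times:
   - By default, a role runs only once per play even if several roles list it as a dependency, as long as its parameters are identical
   - The same role listed with different parameters runs once per distinct parameter set
   - Override: allow_duplicates: true in the dependency's own meta/main.yml
   - Usually you want the default behavior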

4. Dependency version conflicts:
   - Role A needs package-x >= 2.0
   - Role B needs package-x < 2.0
   - No built-in resolution: you must align versions manually
   - Use requirements.yml with pinned versions:
     roles:
       - name: geerlingguy.nginx
         version: "3.1.0"

5. Install roles:
   ansible-galaxy install -r requirements.yml
   ansible-galaxy install -r requirements.yml --force  # Update
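
The dependency rules above can be sketched as two meta files. The role names follow the webserver/common/firewall example from step 2; the file layout is the standard one, but treat the contents as illustrative:

```yaml
# roles/webserver/meta/main.yml
dependencies:
  - role: common
  - role: firewall
    vars:
      firewall_ports: [80, 443]
---
# roles/firewall/meta/main.yml
# Only needed if the firewall role must run again even when it is
# requested a second time with identical parameters.
allow_duplicates: true
dependencies: []
```

If a second role depends on firewall with different firewall_ports, the role runs once for each parameter set even without allow_duplicates.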

Idempotency Gotchas

1. Shell/command modules are NOT idempotent:
   # BAD: runs every time
   - name: Create database
     command: createdb myapp

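   # GOOD: tolerate "already exists" and report changed only on real creation
   # (creates: does not help here, because createdb leaves no predictable file to test for)
   - name: Create database
     command: createdb myapp
     register: createdb_result
     changed_when: createdb_result.rc == 0
     failed_when: createdb_result.rc != 0 and 'already exists' not in createdb_result.stderr

   # BETTER: use the proper module (from the community.postgresql collection)
   - name: Create database
     community.postgresql.postgresql_db:
       name: myapp
       state: present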

2. lineinfile with changing values:
   # BAD: adds a new line every run if timestamp changes
   - lineinfile:
       path: /etc/config
       line: "updated={{ ansible_date_time.iso8601 }}"

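   # BETTER: regexp replaces the existing line instead of appending a new one,
   # but the task still reports "changed" on every run because the value itself changes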
   - lineinfile:
       path: /etc/config
       regexp: "^updated="
       line: "updated={{ ansible_date_time.iso8601 }}"

3. Testing idempotency:
   # Run twice, second run should show 0 changed:
   ansible-playbook site.yml
   ansible-playbook site.yml
   # If anything shows "changed" on the second run, you have an idempotency bug
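
A common source of false "changed" results in that second run is a command that only reads state. A minimal sketch of the usual fix, assuming a hypothetical version file, upgrade script, and target_version variable:

```yaml
- name: Read the deployed version (read-only)
  ansible.builtin.command: cat /opt/myapp/VERSION        # hypothetical path
  register: deployed_version
  changed_when: false      # reading never changes the host
  failed_when: false       # a missing file just means nothing is deployed yet

- name: Run the upgrade only when the version differs
  ansible.builtin.command: /opt/myapp/bin/upgrade        # hypothetical script
  when: deployed_version.stdout | trim != target_version # target_version is an assumed variable
```

On a host that is already at target_version, both tasks report ok or skipped, so the second playbook run stays at 0 changed.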

Gotchas & War Stories

The template that destroyed prod configs

Gotcha: The validate parameter on the template and copy modules runs validation against a temporary copy before the file is written to its final destination. If validation fails, the original file is untouched. The command must contain %s, which Ansible replaces with the path of the temporary file. Always use validate for any config file that has a syntax checker: nginx (nginx -t -c %s, for the main config file only), Apache (apachectl -t -f %s), sshd (sshd -t -f %s), sudoers (visudo -cf %s).

Someone used template with the wrong dest path and overwrote /etc/nginx/nginx.conf across all servers with an incomplete template. The notify: Restart nginx handler then restarted nginx with a broken config, taking down the entire web tier.

Prevention:

# Always validate before restarting
- name: Copy nginx config
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s    # Validates BEFORE writing
  notify: Restart nginx
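
The same pattern applies to the other checkers listed above. A sketch for sshd, including the handler that notify refers to; the template name is hypothetical and the sshd path may differ by distro:

```yaml
- hosts: webservers
  become: true
  tasks:
    - name: Deploy sshd config
      ansible.builtin.template:
        src: sshd_config.j2                  # hypothetical template name
        dest: /etc/ssh/sshd_config
        owner: root
        group: root
        mode: "0600"
        validate: /usr/sbin/sshd -t -f %s    # %s is the temporary file checked before it is moved into place
      notify: Restart sshd

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd                           # the service is named "ssh" on Debian/Ubuntu
        state: restarted
```

Because the handler only fires when the task reports changed, a failed validation means no file is written and no restart happens.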

Variable scope confusion

Variables defined in a play's vars: section do not carry over to another play in the same playbook. A value set with set_fact is attached to the host and stays available for the rest of the playbook run, including later plays. Use host_vars/group_vars for values that every play and every run should see. Variables set with -e have the highest precedence and override everything.
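
The scoping difference is easiest to see in a two-play playbook. This is a minimal sketch that runs against localhost; the variable names and value are illustrative:

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    release_tag: v1.4.2                  # play-scoped: gone when this play ends
  tasks:
    - name: Copy the value into a host-scoped fact
      ansible.builtin.set_fact:
        deployed_tag: "{{ release_tag }}"

- hosts: localhost
  gather_facts: false
  tasks:
    - name: Play vars from the first play are not visible here
      ansible.builtin.debug:
        msg: "release_tag is {{ release_tag | default('undefined in this play') }}"

    - name: The fact set with set_fact is still attached to the host
      ansible.builtin.debug:
        msg: "deployed_tag is {{ deployed_tag }}"
```

The default filter is there so the second play does not abort on the undefined variable.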

The gather_facts timeout

On a large fleet, one slow host makes the entire play wait during fact gathering, because with the default linear strategy every host must finish a task before the next one starts. Set gather_timeout in ansible.cfg, and consider gather_facts: no with explicit fact gathering for the hosts that need it.
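
Explicit fact gathering is just a call to the setup module. A sketch, assuming the play only needs network facts and that the hosts needing them are in a hypothetical canary group:

```yaml
- hosts: webservers
  gather_facts: false                    # skip the implicit setup task
  tasks:
    - name: Gather only the facts this play needs
      ansible.builtin.setup:
        gather_subset:
          - "!all"                       # drop the full fact set (a minimal subset is still collected)
          - network
        gather_timeout: 10               # seconds allowed per fact collector
      when: "'canary' in group_names"    # hypothetical group; remove to gather on every host
```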

become vs become_user

Remember: The privilege escalation chain is: SSH user -> become -> become_user. Mnemonic: you connect as one user, become another. become: yes without become_user defaults to root. become_user: deploy requires the SSH user to have sudo access to that specific user, and it has no effect unless become is also enabled.

The initial SSH connection always uses ansible_user; escalation happens afterwards on the target. If the target user doesn't exist or sudo isn't configured for it, you get cryptic errors.
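
The chain can be checked directly with whoami. A minimal sketch, assuming a hypothetical SSH login user named ansible and an existing deploy account on the targets:

```yaml
- hosts: webservers
  remote_user: ansible                   # hypothetical SSH login user
  tasks:
    - name: Runs as root (become_user defaults to root)
      ansible.builtin.command: whoami
      become: true
      changed_when: false

    - name: Runs as deploy (needs sudo rights from ansible to deploy)
      ansible.builtin.command: whoami
      become: true
      become_user: deploy
      changed_when: false
```

Run it with -v to see the stdout of each task and confirm which account the command actually ran under.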

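Serial deployment for zero downtime

By default, Ansible runs each task on all hosts in the play (in parallel batches of up to forks hosts) before moving on to the next task. For a deployment behind a load balancer, use serial to take hosts through the whole play one batch at a time: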

- hosts: webservers
  serial: "25%"          # Deploy to 25% of hosts at a time
  # or serial: 1         # One at a time
  tasks:
    - name: Deploy application
      # ...
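
serial also accepts a list of batch sizes, which allows a canary host before wider batches. A sketch with a placeholder task standing in for the real deployment steps:

```yaml
- hosts: webservers
  serial:
    - 1                      # canary: a single host first
    - "25%"                  # then 25% of the hosts in the play per batch
  max_fail_percentage: 0     # abort the rollout if any host in a batch fails
  tasks:
    - name: Deploy application
      ansible.builtin.debug:
        msg: "deploy steps go here"      # placeholder for the real tasks
```

The last entry in the list is reused for every remaining batch until all hosts are done.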

Essential Debugging Commands

# Verbose output (add more v's for more detail)
ansible-playbook site.yml -v        # Task results
ansible-playbook site.yml -vv       # Task input parameters
ansible-playbook site.yml -vvv      # SSH connection details
ansible-playbook site.yml -vvvv     # Everything including SSH protocol

# Check mode (dry run)
ansible-playbook site.yml --check --diff

# List what would be affected
ansible-playbook site.yml --list-hosts
ansible-playbook site.yml --list-tasks
ansible-playbook site.yml --list-tags

# Syntax check
ansible-playbook site.yml --syntax-check

# Step through task by task
ansible-playbook site.yml --step

# Debug a variable
ansible web1.example.com -m debug -a "var=hostvars[inventory_hostname]"

# Test connectivity
ansible all -m ping

# Run ad-hoc commands
ansible webservers -m shell -a "df -h" -b
ansible webservers -m service -a "name=nginx state=restarted" -b

Quick Reference

Cheatsheet: Ansible