# Ansible — The Complete Guide: From Zero to Production

Topics: Ansible architecture, inventory, playbooks, modules, roles, Jinja2 templating, variables & precedence, handlers, Vault, Molecule testing, rolling updates, debugging, performance tuning, Tower/AWX, Ansible Galaxy, collections, error handling, dynamic inventory, delegation, callback plugins, custom modules

Strategy: Build-up (ground zero → production mastery) with war stories, trivia, and drills woven throughout

Level: L1–L2 (Foundations → Operations)

Time: 3–4 hours (designed for deep study in one or multiple sittings)

Prerequisites: SSH access to at least one Linux host (or a local VM). Familiarity with YAML helps but is not required.


The Mission

You're a new platform engineer. Your team manages 200 servers across dev, staging, and production. The previous engineer did everything by hand — SSHing into each server, running commands, copy-pasting configs. It took four hours to roll out a single update and they still missed two servers. The database password is in a shell history file. There's no record of what changed or when.

You're going to replace all of that with Ansible. By the end of this guide you'll understand every major concept from the ground up, have production-grade patterns for real work, and know the traps that catch experienced engineers. This is the one document you need to go from "what is Ansible?" to "I can run a zero-downtime rolling upgrade across 50 servers with encrypted secrets and automated testing."


Table of Contents

  1. What Is Ansible?
  2. Installation and First Commands
  3. Inventory — Who Are We Talking To?
  4. Playbooks — What Do We Want to Happen?
  5. Modules — The Building Blocks
  6. Variables, Facts, and the Precedence Nightmare
  7. Jinja2 Templating — Dynamic Config Files
  8. Handlers — Actions That Only Fire When Needed
  9. Roles — Reusable, Testable Building Blocks
  10. Ansible Vault — Secrets That Are Safe to Commit
  11. Error Handling — block/rescue, ignore_errors, failed_when
  12. Conditionals and Loops
  13. Rolling Updates and Zero-Downtime Deploys
  14. Debugging — Why Did It Do That?
  15. Performance at Scale
  16. Molecule — Testing Before Production
  17. Ansible Galaxy and Collections
  18. Tower/AWX — Centralized Automation
  19. Ansible vs Terraform vs Helm — When to Use Which
  20. Common Production Patterns
  21. Footguns — Mistakes That Brick Servers
  22. Real-World Case Studies
  23. Glossary
  24. Trivia and History
  25. Flashcard Review
  26. Drills
  27. Cheat Sheet
  28. Self-Assessment

Part 1: What Is Ansible?

Ansible is an agentless automation tool that manages server configuration, application deployment, and orchestration over SSH. You describe the desired state of your infrastructure in YAML files (playbooks), and Ansible makes it so.

One-liner definition: Agentless automation over SSH using YAML playbooks.

The Mental Model

Control Node (your laptop / CI server)
     |
     | SSH (or WinRM for Windows)
     v
Managed Nodes (target servers)
     |
     v
Module executes → reports changed/ok/failed → returns result

Ansible connects to your servers over SSH, pushes small programs called modules, executes them, and returns structured results. No agent to install. No daemon to manage. No custom protocol to debug. If you can SSH to a host and it has Python, Ansible can manage it.

One-liner from the street: Ansible is SSH in a trench coat. If SSH is broken, Ansible is broken.

Why Ansible Exists

Before Ansible, configuration management meant either:

  • Manual work: SSH into each server, run commands, hope you didn't miss one
  • Chef/Puppet: install and manage agents on every server, maintain a PKI infrastructure, learn Ruby DSLs

Michael DeHaan created Ansible in February 2012 as a deliberate reaction to this complexity. His philosophy: if a machine has SSH and Python, it is already ready for configuration management. No bootstrap. No agent. No custom certificates. This zero-bootstrap approach is why Ansible became the default choice for network device automation — switches and routers have SSH but cannot run Ruby or install agents.

Etymology: The name "Ansible" comes from Ursula K. Le Guin's 1966 novel Rocannon's World, where an "ansible" is a device for instantaneous communication across any distance. Later popularized in Orson Scott Card's Ender's Game. DeHaan chose it because the tool was designed for instant, agentless communication with remote servers.

Core Principles

| Principle | What It Means |
|---|---|
| Agentless | Nothing to install on managed hosts (just Python + SSH) |
| Idempotent | Running the same playbook twice produces the same result — no unintended side effects |
| Declarative | You describe the desired state ("nginx should be installed"), not the steps ("apt-get install nginx") |
| Push-based | You run Ansible from a control node; it pushes changes to targets (vs. pull-based like Puppet) |
| YAML | Human-readable configuration language — no programming required |

Mnemonic — AIDPY: Agentless, Idempotent, Declarative, Push-based, YAML. These five properties define Ansible's design.


Part 2: Installation and First Commands

Installing Ansible

# On Ubuntu/Debian
sudo apt update && sudo apt install -y ansible

# On macOS
brew install ansible

# Via pip (any platform, most up-to-date)
pip install ansible

# Via pipx (isolated install, preferred for CLI tools)
pipx install ansible-core

# Verify
ansible --version

Ansible only needs to be installed on the control node (your laptop or CI server). Managed nodes need only Python 3 and an SSH server — both come pre-installed on virtually every Linux distribution.

ansible-core vs ansible

There are two PyPI packages, and the distinction matters:

| Package | What You Get | When to Use |
|---|---|---|
| ansible-core | Engine + CLI tools + ansible.builtin content + plugin framework | When you want minimal, controlled dependencies and install collections explicitly |
| ansible | ansible-core + curated community collections (batteries-included) | Quick start, learning, small teams where convenience beats precision |

For team reproducibility, prefer ansible-core with a pinned requirements.yml listing exactly which collections you need. This avoids "works on my machine" drift where different ansible package versions ship different collection versions.
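A pinned requirements.yml is short; here is a sketch (the collection choices and version numbers are illustrative, not recommendations):

```yaml
# requirements.yml — install with: ansible-galaxy collection install -r requirements.yml
collections:
  - name: amazon.aws
    version: "7.4.0"        # exact pin for reproducible runs
  - name: community.general
    version: "8.5.0"
```

Commit this file next to ansible.cfg so every engineer and CI job resolves the same collection versions.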

The next step beyond pinned collections is Execution Environments — containerized Ansible runtimes that freeze the entire controller-side dependency tree (see the Execution Environments section later in this guide).

Your First Ad-Hoc Command

Ad-hoc commands let you run a single task on one or many servers without writing a playbook. They're the Ansible equivalent of one-liner shell commands.

# Ping all hosts (tests connection, not ICMP)
ansible all -m ping -i inventory.yml

# Check disk usage on web servers
ansible webservers -m command -a "df -h" -i inventory.yml

# Restart nginx (needs sudo)
ansible webservers -m service -a "name=nginx state=restarted" -i inventory.yml -b

# Install a package
ansible all -m apt -a "name=curl state=present" -i inventory.yml -b

# Copy a file to all servers
ansible all -m copy -a "src=motd.txt dest=/etc/motd" -i inventory.yml -b

| Flag | Meaning | Mnemonic |
|---|---|---|
| -i | Inventory file | inventory |
| -m | Module name | module |
| -a | Module arguments | arguments |
| -b | Become (sudo) | become root |
| -u | Remote user | user |
| -k | Ask for SSH password | key/password prompt |
| --check | Dry-run (don't change anything) | |
| --diff | Show file changes | |

Remember the ad-hoc module categories: command (raw shell, no pipes), shell (supports pipes and redirects), copy (files to remote), service (start/stop/restart), package (install/remove). Mnemonic: CSCSP — Command, Shell, Copy, Service, Package.

Gotcha: The command module does NOT support pipes, redirects, or shell builtins. ansible all -m command -a "cat /etc/passwd | grep root" fails. Use -m shell for anything that needs shell features. This is a common interview trip-up.
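The same pipeline written as a playbook task, using shell because of the pipe (a minimal sketch):

```yaml
# command would reject the pipe; shell hands the whole string to /bin/sh
- name: Find root entries in /etc/passwd
  ansible.builtin.shell: cat /etc/passwd | grep root
  changed_when: false    # read-only check — never report "changed"
```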


Part 3: Inventory — Who Are We Talking To?

Before Ansible does anything, it needs to know who to talk to. That's the inventory — the list of hosts and their groupings.

Static Inventory

The simplest inventory is a file that lists your hosts:

# inventory/hosts.ini (INI format)
[webservers]
web1.example.com
web2.example.com ansible_host=10.0.1.10

[dbservers]
db1.example.com ansible_port=2222

[production:children]
webservers
dbservers

[webservers:vars]
http_port=8080
app_env=production
# inventory/hosts.yml (YAML format — same thing, different syntax)
all:
  children:
    webservers:
      hosts:
        web1.example.com:
        web2.example.com:
          ansible_host: 10.0.1.10
      vars:
        http_port: 8080
    dbservers:
      hosts:
        db1.example.com:
          ansible_port: 2222

| Syntax | What It Does |
|---|---|
| [webservers] | Defines a group named "webservers" |
| ansible_host=10.0.1.10 | Override the connection IP (when DNS doesn't resolve) |
| [production:children] | Create a parent group containing other groups |
| [webservers:vars] | Variables applied to every host in the group |

Dynamic Inventory — Let the Cloud Tell You

Static inventory works for 5 servers. For 50 EC2 instances that scale up and down? You need dynamic inventory — scripts or plugins that query cloud APIs at runtime.

# inventory/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
  - key: placement.availability_zone
    prefix: az
filters:
  tag:ManagedBy: ansible
  instance-state-name: running
compose:
  ansible_host: private_ip_address
  ansible_user: "'ubuntu'"
# See what Ansible discovers
ansible-inventory -i inventory/aws_ec2.yml --graph

# Output:
# @all:
#   |--@env_production:
#   |  |--10.0.1.10
#   |  |--10.0.1.11
#   |--@role_webserver:
#   |  |--10.0.1.10
#   |--@az_us_east_1a:
#   |  |--10.0.1.10

Under the Hood: The aws_ec2 plugin calls the EC2 DescribeInstances API. keyed_groups creates Ansible groups from instance metadata. The compose block builds per-host variables using Jinja2 expressions. The single quotes around 'ubuntu' in compose are deliberate — without them, Ansible would try to resolve ubuntu as a variable name.

Gotcha: The aws_ec2 plugin requires the amazon.aws collection and the boto3 Python package. Install both: ansible-galaxy collection install amazon.aws and pip install boto3. If you forget boto3, the error is unhelpfully vague: "Failed to import the required Python library".

group_vars and host_vars

Variables that apply to groups or specific hosts live in directories alongside the inventory:

inventory/
  aws_ec2.yml
  group_vars/
    all.yml                    # Every host gets these
    role_webserver.yml         # Only hosts tagged Role=webserver
    env_production.yml         # Only hosts tagged Environment=production
  host_vars/
    10.0.1.10.yml              # Overrides for one specific host
# inventory/group_vars/all.yml
ntp_servers:
  - 169.254.169.123   # AWS time sync service
timezone: UTC
monitoring_agent: prometheus-node-exporter

# inventory/group_vars/role_webserver.yml
nginx_worker_processes: auto
nginx_worker_connections: 2048
app_port: 8080
health_check_path: /health

Flashcard Check: Inventory

| Question | Answer |
|---|---|
| What's the difference between static and dynamic inventory? | Static = manually maintained file. Dynamic = plugin/script queries an API at runtime. |
| What does keyed_groups do in a dynamic inventory plugin? | Creates Ansible groups from instance metadata (tags, zones, types). |
| Where do you put variables that apply to all hosts in a group? | group_vars/<group_name>.yml alongside the inventory file. |
| What always wins in variable precedence? | Extra vars (-e on the command line) — they override everything. |

Part 4: Playbooks — What Do We Want to Happen?

Playbooks are YAML files that define the desired state of your infrastructure. A playbook contains plays; each play targets hosts and runs tasks.

---
- name: Configure web servers
  hosts: webservers
  become: yes                    # Run as root (sudo)

  vars:
    app_port: 8080
    packages:
      - nginx
      - python3
      - certbot

  tasks:
    - name: Install required packages
      ansible.builtin.apt:
        name: "{{ packages }}"
        state: present
        update_cache: yes

    - name: Copy nginx config
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
        owner: root
        group: root
        mode: '0644'
      notify: Restart nginx

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

Anatomy of a Playbook

| Component | Purpose |
|---|---|
| name | Human-readable description (shows in output) |
| hosts | Which inventory hosts/groups to target |
| become | Escalate privileges (sudo) |
| vars | Variables for this play |
| tasks | Ordered list of actions to perform |
| handlers | Actions triggered by notify (run once at end of play) |
| pre_tasks | Tasks that run before roles |
| post_tasks | Tasks that run after roles |
| roles | Reusable role includes |

Running a Playbook

# Basic run
ansible-playbook site.yml -i inventory.yml

# Dry run — see what would change
ansible-playbook site.yml --check --diff

# Limit to one host
ansible-playbook site.yml --limit web1.example.com

# Run only tagged tasks
ansible-playbook site.yml --tags "nginx"

# Pass extra variables
ansible-playbook site.yml -e "app_version=2.1.0"

# Step through task by task (interactive)
ansible-playbook site.yml --step

Part 5: Modules — The Building Blocks

Modules are the units of work in Ansible. There are 7,000+ modules covering everything from package management to cloud provisioning. Each module is idempotent by design — it checks current state and only changes what's needed.

Key Module Categories

| Category | Modules | Purpose | Example |
|---|---|---|---|
| Package | apt, dnf, yum, pip | Install/remove packages | apt: name=nginx state=present |
| File | file, copy, template, lineinfile | Manage files and content | template: src=app.conf.j2 dest=/etc/app.conf |
| Service | service, systemd | Manage services | service: name=nginx state=started enabled=true |
| User | user, group, authorized_key | Manage users and access | user: name=deploy groups=sudo |
| Command | command, shell, script, raw | Run arbitrary commands | command: /opt/deploy.sh |
| Cloud | amazon.aws.ec2_instance, google.cloud.gcp_compute_instance | Cloud resource management | ec2_instance: instance_type=t3.medium |
| Debug | debug, assert, fail | Debugging and assertions | debug: var=ansible_hostname |

The Golden Rule: Prefer Modules Over Shell

# BAD — not idempotent, runs every time, slow
- name: Install nginx
  ansible.builtin.shell: apt-get install -y nginx

# GOOD — idempotent, only changes if needed, reports status correctly
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

Modules know how to check current state. The apt module checks whether nginx is already installed before doing anything. The shell module has no idea — it runs the command every time and reports "changed" every time, so your play output is noise, --check mode tells you nothing, and every run wastes time redoing work that was already done.

Rule of thumb: Only use shell/command when there's no module for what you need. And when you do, add creates: or when: to make it idempotent.

# shell with idempotency guard
- name: Create database
  ansible.builtin.command: createdb myapp
  args:
    creates: /var/lib/postgresql/myapp   # Skip if this path exists

# Even better — use the proper module
- name: Create database
  postgresql_db:
    name: myapp
    state: present

lineinfile vs template vs blockinfile

These three modules manage file content differently:

| Module | Use When | Gotcha |
|---|---|---|
| lineinfile | Changing a single line in a file you don't fully manage | Two tasks matching the same regex fight each other |
| blockinfile | Adding a multi-line block to a file | Adds marker comments (BEGIN/END ANSIBLE MANAGED BLOCK) |
| template | You fully manage the file | Overwrites the entire file; previous manual edits are lost |

# lineinfile — surgical one-line edit
- ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^PermitRootLogin'
    line: 'PermitRootLogin no'

# template — full file management
- ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s    # Validates BEFORE writing

Part 6: Variables, Facts, and the Precedence Nightmare

Variable Sources

Variables can be defined in many places:

# In the playbook
vars:
  app_port: 8080

# In group_vars
# group_vars/webservers.yml
http_port: 80

# In host_vars
# host_vars/web1.example.com.yml
http_port: 9090    # Override for this specific host

# On the command line
# ansible-playbook site.yml -e "app_port=9090"

The 22 Levels of Variable Precedence

Ansible has 22 levels of variable precedence. Yes, twenty-two. When the same variable is defined in multiple places, the highest-precedence one wins.

(lowest precedence)
 1.  command line values (not variables)
 2.  role defaults (roles/x/defaults/main.yml)
 3.  inventory file or script group vars
 4.  inventory group_vars/all
 5.  playbook group_vars/all
 6.  inventory group_vars/*
 7.  playbook group_vars/*
 8.  inventory file or script host vars
 9.  inventory host_vars/*
 10. playbook host_vars/*
 11. host facts / cached set_facts
 12. play vars
 13. play vars_prompt
 14. play vars_files
 15. role vars (roles/x/vars/main.yml)      ← higher than play vars!
 16. block vars
 17. task vars
 18. include_vars
 19. set_facts / register
 20. role (and include_role) params
 21. include params
 22. extra vars (-e on command line)         ← ALWAYS wins
(highest precedence)

The Most Common Traps

Trap 1: Role vars/ beats play vars:

# roles/nginx/vars/main.yml
nginx_worker_connections: 768     # ← This wins (precedence 15)

# playbook.yml
- hosts: webservers
  vars:
    nginx_worker_connections: 1024  # ← This loses (precedence 12)
  roles:
    - nginx
# Result: 768. Not 1024. Surprise.

Rule: If users should be able to override a variable, put it in defaults/main.yml (precedence 2). If they shouldn't, put it in vars/main.yml (precedence 15). Never put the same variable in both.

Trap 2: Extra vars override everything

# This overrides ALL other definitions of app_env
ansible-playbook site.yml -e "app_env=staging"

Passing -e app_env=staging in a CI pipeline that deploys to production will override the inventory's app_env=production. Use extra vars sparingly.
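One guardrail is to assert the resolved environment at the top of the play. A sketch using the assert module (the variable name and allowed values are illustrative):

```yaml
- name: Fail fast on an unexpected app_env
  ansible.builtin.assert:
    that: app_env in ['dev', 'staging', 'production']
    fail_msg: "Refusing to run: app_env={{ app_env }} is not a known environment"
```

This turns a silent extra-vars override into a loud failure before any host is touched.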

Debugging Variables

# See what value a host actually gets
ansible -m debug -a "var=app_port" webserver

# See ALL variables for a host
ansible -m debug -a "var=hostvars[inventory_hostname]" webserver

# In a playbook
- ansible.builtin.debug:
    var: nginx_worker_connections

Facts — Auto-Discovered Host Information

Facts are system information Ansible gathers automatically at the start of each play:

- ansible.builtin.debug:
    msg: "OS: {{ ansible_distribution }} {{ ansible_distribution_version }}"
    # Output: "OS: Ubuntu 22.04"

- ansible.builtin.debug:
    msg: "IP: {{ ansible_default_ipv4.address }}"

- ansible.builtin.debug:
    msg: "CPUs: {{ ansible_processor_vcpus }}"

Common facts: ansible_hostname, ansible_os_family, ansible_distribution, ansible_default_ipv4.address, ansible_processor_vcpus, ansible_memtotal_mb, ansible_fqdn.

Modern practice: Ansible now prefers accessing facts via the ansible_facts dictionary — e.g., ansible_facts['distribution'] instead of the bare ansible_distribution injection. Both work, but the dictionary form is explicit, avoids variable-name collisions, and is what ansible-lint recommends. You can enforce it by setting inject_facts_as_vars = false under [defaults] in ansible.cfg.
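The OS example from earlier, rewritten in the dictionary form:

```yaml
- ansible.builtin.debug:
    msg: "OS: {{ ansible_facts['distribution'] }} {{ ansible_facts['distribution_version'] }}"
```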

# View all facts for a host
ansible webserver -m setup

# Filter facts
ansible webserver -m setup -a "filter=ansible_distribution*"

Part 7: Jinja2 Templating — Dynamic Config Files

Templates generate config files with dynamic content. Ansible uses Jinja2, the same engine behind Flask and Django.

Basic Syntax

{# This is a comment #}
{{ variable }}                         {# Output a value #}
{{ var | default("fallback") }}        {# Filter with default #}
{% for item in list %}                 {# Loop #}
{% if condition %}                     {# Conditional #}
{% endif %}
{% endfor %}

A Real Template

{# templates/nginx.conf.j2 #}
server {
    listen {{ http_port }};
    server_name {{ ansible_fqdn }};

    location / {
        proxy_pass http://127.0.0.1:{{ app_port }};
    }

{% if enable_ssl %}
    listen 443 ssl;
    ssl_certificate /etc/ssl/{{ ansible_fqdn }}.crt;
{% endif %}
}

Dynamic Upstream from Inventory Groups

{# templates/upstream.conf.j2 #}
upstream app_servers {
{%- for host in groups['app'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port | default(8080) }};
{%- endfor %}
}

The {%- syntax (note the dash) strips whitespace before the tag. Without it, {% for %} loops add blank lines between entries.
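You can experiment with whitespace control outside Ansible using plain Jinja2, the same library Ansible embeds. A minimal Python sketch:

```python
from jinja2 import Template

# {%- strips the whitespace (including the newline) immediately before
# each tag, so the rendered upstream block has no blank lines.
tpl = Template(
    "upstream app_servers {\n"
    "{%- for h in hosts %}\n"
    "    server {{ h }}:8080;\n"
    "{%- endfor %}\n"
    "}"
)
print(tpl.render(hosts=["10.0.1.10", "10.0.1.11"]))
# upstream app_servers {
#     server 10.0.1.10:8080;
#     server 10.0.1.11:8080;
# }
```

Drop the dashes and rerun to see the stray blank lines reappear.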

Essential Jinja2 Filters

| Filter | What It Does | Example |
|---|---|---|
| default('val') | Fallback if undefined | {{ timeout \| default(30) }} |
| mandatory | Fail if undefined | {{ db_host \| mandatory }} |
| to_nice_json | Pretty-print as JSON | {{ config_dict \| to_nice_json }} |
| regex_replace | Regex substitution | {{ hostname \| regex_replace('\.example\.com$', '') }} |
| join(', ') | Join list into string | {{ dns_servers \| join(', ') }} |
| selectattr | Filter a list of dicts | {{ users \| selectattr('active') \| list }} |
| b64encode | Base64 encode | {{ secret \| b64encode }} |
| password_hash | Hash a password | {{ pass \| password_hash('sha512') }} |
| basename | Extract filename from path | {{ '/etc/nginx/conf.d/app.conf' \| basename }} |

Gotcha: YAML treats { as the start of a mapping. Any value that begins with {{ must be quoted: message: "{{ greeting }} world". Without quotes you get the cryptic error mapping values are not allowed in this context. This is the single most common Ansible YAML error.
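Side by side (the first form is invalid YAML, shown only for contrast):

```yaml
# BAD — YAML sees "{" and expects an inline mapping:
#   message: {{ greeting }} world
# GOOD — quoting makes it a plain string, which Jinja2 then renders:
message: "{{ greeting }} world"
```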


Part 8: Handlers — Actions That Only Fire When Needed

Handlers are special tasks that run only when notified, and only once at the end of the play (even if notified multiple times).

tasks:
  - name: Update nginx config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

  - name: Update SSL cert
    ansible.builtin.copy:
      src: cert.pem
      dest: /etc/ssl/cert.pem
    notify: Restart nginx    # Won't restart twice

handlers:
  - name: Restart nginx
    ansible.builtin.service:
      name: nginx
      state: restarted

Key Behavior

  1. Handlers only fire if the notifying task reports "changed" (not "ok")
  2. Handlers run at the end of the play, not after the notifying task
  3. If a task notifies the same handler twice, it still only runs once
  4. If the play fails before reaching the handler phase, handlers don't run

Common Handler Mistakes

Mistake 1: Handler doesn't fire because task reports "ok"

If the config file already matches the template, the task reports "ok" (not "changed"), and the handler never fires. This is correct behavior — but if you renamed the handler and forgot to update the notify, it silently does nothing.

Mistake 2: Handler doesn't run because a later task fails

Your config changed, but a later task failed, so handlers never ran. Now the service runs with an old config that doesn't match the files on disk.

Fix: Use meta: flush_handlers if you need handlers to fire immediately:

tasks:
  - name: Update config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

  - ansible.builtin.meta: flush_handlers    # Force handlers to run NOW

  - name: Potentially risky task
    ansible.builtin.command: /opt/risky-thing.sh
# Force all handlers to run regardless of notification
ansible-playbook playbook.yml --force-handlers

Part 9: Roles — Reusable, Testable Building Blocks

A role is a directory structure that packages tasks, templates, variables, handlers, and metadata into a reusable unit. Think of it like a function in code — it takes inputs (variables), does work (tasks), and has side effects (handlers).

The Complete Role Directory Layout

roles/
  webapp/
    defaults/main.yml      # Default variables — LOW precedence, meant to be overridden
    vars/main.yml           # Role variables — HIGH precedence, hard to override
    tasks/main.yml          # The actual work
    handlers/main.yml       # Actions triggered by notify
    templates/              # Jinja2 templates (.j2 files)
      nginx-vhost.conf.j2
      app-config.yml.j2
    files/                  # Static files copied as-is (no templating)
      logrotate-webapp
    meta/main.yml           # Role metadata: dependencies, platforms, author
    molecule/               # Test suite
      default/
        molecule.yml
        converge.yml
        verify.yml

defaults/ vs vars/ — The Most Consequential Choice

| Directory | Precedence | Purpose | When to Use |
|---|---|---|---|
| defaults/main.yml | 2 (lowest) | Knobs users should turn | Ports, versions, feature flags |
| vars/main.yml | 15 (high) | Constants that shouldn't change | Internal paths, required packages |

War Story: A team put app_port: 8080 in both defaults/ and vars/ during a refactor. Both said 8080, so testing caught nothing. Three months later, staging set app_port: 9090 in their group_vars/. It didn't work — vars/main.yml silently stomped the override (higher precedence). They spent two hours debugging before someone ran ansible -m debug -a "var=app_port" staging-web01. The fix: move app_port out of vars/. If users should override it, it goes in defaults. Period.

Creating a Role

# Generate role skeleton
ansible-galaxy init roles/postgresql
# roles/postgresql/defaults/main.yml
postgresql_version: "15"
postgresql_max_connections: 100

# roles/postgresql/tasks/main.yml
---
- name: Install PostgreSQL
  ansible.builtin.apt:
    name: "postgresql-{{ postgresql_version }}"
    state: present

- name: Configure PostgreSQL
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: "/etc/postgresql/{{ postgresql_version }}/main/postgresql.conf"
  notify: Restart PostgreSQL

# roles/postgresql/handlers/main.yml
---
- name: Restart PostgreSQL
  ansible.builtin.service:
    name: postgresql
    state: restarted

Using Roles in a Playbook

- hosts: webservers
  roles:
    - common
    - webserver
    - { role: monitoring, tags: ['monitoring'] }

# Or dynamically
- hosts: webservers
  tasks:
    - ansible.builtin.include_role:
        name: webserver
      when: deploy_web | default(true)

include_role vs import_role

| Feature | import_role | include_role |
|---|---|---|
| Processing | Static — resolved at parse time | Dynamic — resolved at runtime |
| Tags | Inherited by all tasks | Only apply to the include itself |
| Conditionals | Applied to every task in the role | Applied once to the include |
| Use when | You want tags/conditions to propagate | You need conditional or looped inclusion |

Role Dependencies

# roles/webapp/meta/main.yml
dependencies:
  - role: common
  - role: monitoring
    vars:
      monitoring_port: "{{ app_port }}"

Dependencies run before the role's tasks. By default, a role runs only once even if listed as a dependency by multiple roles (use allow_duplicates: true to change this).
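The opt-in lives in the metadata of the role being duplicated (a sketch):

```yaml
# roles/common/meta/main.yml
allow_duplicates: true   # let this role run once per dependent role
```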


Part 10: Ansible Vault — Secrets That Are Safe to Commit

Your playbooks need database passwords, API keys, and TLS certificates. These need to live in the repo so automation can use them, but they must be encrypted. Ansible Vault encrypts data with AES-256-CTR and HMAC-SHA256 for integrity, using PBKDF2 key stretching.

Essential Vault Commands

# Create a new encrypted file
ansible-vault create secrets.yml

# Encrypt an existing file
ansible-vault encrypt vars/passwords.yml

# Edit an encrypted file (decrypts to tmpfs, re-encrypts on save)
ansible-vault edit secrets.yml

# View without decrypting to disk
ansible-vault view secrets.yml

# Change the encryption password
ansible-vault rekey secrets.yml

# Encrypt a single string
echo -n 'my_secret_password' | ansible-vault encrypt_string --stdin-name 'db_password'
# Outputs:
# db_password: !vault |
#   $ANSIBLE_VAULT;1.1;AES256
#   ...

# Run playbook with vault password
ansible-playbook site.yml --ask-vault-pass
ansible-playbook site.yml --vault-password-file ~/.vault_pass

The Vault/Vars Split Pattern

Don't encrypt your entire vars file. Use two files:

inventory/
  group_vars/
    env_production/
      vars.yml          # Plaintext — references vault variables
      vault.yml          # Encrypted — contains the actual secrets
# vault.yml (encrypted)
vault_db_password: "s3cr3t_pr0d_p4ss"
vault_api_key: "ak_prod_xK9mP2qR7vN4"

# vars.yml (plaintext)
db_password: "{{ vault_db_password }}"
api_key: "{{ vault_api_key }}"

Why the split? You can grep for where db_password is used without decrypting anything. When reviewing a PR, you can see that db_password was changed without needing the vault password.

Multiple Vault IDs

Different secrets for different teams:

# Encrypt with specific vault IDs
ansible-vault encrypt --vault-id dev@prompt secrets-dev.yml
ansible-vault encrypt --vault-id prod@/path/to/prod-password secrets-prod.yml

# Run with multiple vault IDs
ansible-playbook site.yml \
  --vault-id dev@prompt \
  --vault-id prod@/path/to/prod-password

Vault Security Rules

  1. Never pass secrets on the command line: ansible-vault encrypt_string 'actual_password' puts the password in shell history. Pipe it instead.
  2. Use no_log: true on tasks that handle secrets. Without it, your database password shows up in CI logs.
  3. Store the vault password file with restrictive permissions: chmod 600 ~/.vault_pass
  4. For enterprise: Use an external secret store (HashiCorp Vault, AWS Secrets Manager) with dynamic credentials when possible.
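As an illustrative sketch of option 4 (this assumes the community.hashi_vault collection is installed and VAULT_ADDR/VAULT_TOKEN are set in the environment; the secret path is made up):

```yaml
# group_vars/env_production/vars.yml — fetched at runtime, nothing stored in the repo
db_password: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=secret/data/myapp:db_password') }}"
```

Pair this with no_log: true on any task that consumes db_password.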

Part 11: Error Handling

ignore_errors — The Blunt Instrument

- name: Check if legacy service exists
  ansible.builtin.command: systemctl status legacy-app
  register: result
  ignore_errors: true

- name: Stop legacy app if it exists
  ansible.builtin.service:
    name: legacy-app
    state: stopped
  when: result.rc == 0

Warning: ignore_errors: true silently swallows ALL errors. Six months later, your cleanup task has been failing silently and the disk is full. Prefer failed_when with specific conditions.

failed_when — Targeted Failure Conditions

- name: Check disk space
  ansible.builtin.command: df -h /
  register: disk_check
  failed_when: "'100%' in disk_check.stdout"

block/rescue/always — Structured Error Handling

- block:
    - name: Deploy application
      ansible.builtin.command: /opt/deploy.sh

    - name: Verify deployment
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200

  rescue:
    - name: Rollback on failure
      ansible.builtin.command: /opt/rollback.sh

  always:
    - name: Send notification
      ansible.builtin.debug:
        msg: "Deploy attempt complete"

This is Ansible's try/catch/finally. If any task in block fails, rescue runs. always runs regardless.

changed_when — Control What Counts as "Changed"

- name: Check current version
  ansible.builtin.command: cat /opt/app/VERSION
  register: version
  changed_when: false    # This task never "changes" anything

Part 12: Conditionals and Loops

Conditionals (when)

# Based on OS
- name: Install packages (Debian)
  ansible.builtin.apt:
    name: "{{ item }}"
    state: present
  loop: [nginx, curl, htop]
  when: ansible_os_family == "Debian"

- name: Install packages (RedHat)
  ansible.builtin.yum:
    name: "{{ item }}"
    state: present
  loop: [nginx, curl, htop]
  when: ansible_os_family == "RedHat"

# Based on a registered variable
- name: Check whether Docker is installed
  ansible.builtin.command: which docker
  register: docker_check
  ignore_errors: true

- name: Install Docker
  ansible.builtin.apt:
    name: docker.io
    state: present
  when: docker_check.rc != 0

Loops

# Simple list
- name: Install packages
  ansible.builtin.apt:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - curl
    - htop

# Loop with dictionaries
- name: Create users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    shell: /bin/bash
  loop:
    - { name: alice, groups: "sudo,docker" }
    - { name: bob, groups: "docker" }
    - { name: carol, groups: "sudo" }
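When the user data already lives in a dict variable (say, in group_vars), the dict2items filter turns it into a loopable list of key/value pairs. A sketch, assuming a hypothetical app_users dict:

```yaml
vars:
  app_users:
    alice: { groups: "sudo,docker" }
    bob:   { groups: "docker" }

tasks:
  - name: Create users from a dict
    ansible.builtin.user:
      name: "{{ item.key }}"
      groups: "{{ item.value.groups }}"
      shell: /bin/bash
    loop: "{{ app_users | dict2items }}"
```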

Part 13: Rolling Updates and Zero-Downtime Deploys

This is where everything comes together. Deploying to 50 servers behind a load balancer without dropping a single request.

The Strategy

# playbooks/rolling-upgrade.yml
---
- name: Rolling upgrade — {{ app_name }} {{ app_version }}
  hosts: role_webserver
  serial:
    - 1            # First: single canary
    - "10%"        # Then: 10% at a time
    - "25%"        # Then: 25% at a time
  max_fail_percentage: 10

  pre_tasks:
    - name: Verify current version
      ansible.builtin.command: "cat {{ app_home }}/VERSION"
      register: pre_version
      changed_when: false

    - name: Remove from ALB target group
      community.aws.elb_target:
        target_group_arn: "{{ alb_target_group_arn }}"
        target_id: "{{ ansible_host }}"
        state: absent
      delegate_to: localhost

    - name: Wait for connections to drain
      ansible.builtin.pause:
        seconds: 30

  roles:
    - role: webapp
      vars:
        app_version: "{{ target_version }}"

  post_tasks:
    - name: Wait for application health
      ansible.builtin.uri:
        url: "{{ health_check_url }}"
        return_content: true
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health.status == 200

    - name: Re-add to ALB target group
      community.aws.elb_target:
        target_group_arn: "{{ alb_target_group_arn }}"
        target_id: "{{ ansible_host }}"
        target_port: "{{ app_port }}"
        state: present
      delegate_to: localhost

    - name: Read new version
      ansible.builtin.command: "cat {{ app_home }}/VERSION"
      register: post_version
      changed_when: false

    - name: Report upgrade status
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: {{ pre_version.stdout }} → {{ post_version.stdout }}"

Key Concepts

serial: [1, "10%", "25%"] — Graduated rollout. First batch is a single canary. If it survives, widen to 10%, then 25%. This catches "completely broken" (canary fails) and "fails under load" (10% batch reveals issues).

max_fail_percentage: 10 — The circuit breaker. If more than 10% of hosts in any batch fail, Ansible stops the entire play. Without this, a bad deploy rolls across all servers.

Mental Model: Think of serial + max_fail_percentage as a circuit breaker pattern — the same concept used in microservice architectures. serial controls the batch size (how much current flows). max_fail_percentage is the trip threshold. Together, they limit blast radius.

delegate_to: localhost — The ALB API calls run on your control node, not on the web servers. delegate_to changes where the task runs, not whose variables it uses.

Rollback with block/rescue

tasks:
  - block:
      - name: Deploy new version
        ansible.builtin.include_role:
          name: webapp
        vars:
          app_version: "{{ target_version }}"

      - name: Verify health
        ansible.builtin.uri:
          url: "{{ health_check_url }}"
          status_code: 200
        register: health
        retries: 5
        delay: 10
        until: health.status == 200

    rescue:
      - name: ROLLBACK — deploy previous version
        ansible.builtin.include_role:
          name: webapp
        vars:
          app_version: "{{ pre_version.stdout | trim }}"

      - name: Notify on rollback
        community.general.slack:
          token: "{{ vault_slack_token }}"
          channel: "#deploys"
          msg: "ROLLBACK on {{ inventory_hostname }}: {{ target_version }} failed"
        delegate_to: localhost
        ignore_errors: true

Execution Strategies

Strategy Behavior Use When
linear (default) All hosts execute each task before moving to the next Normal operations
free Each host runs as fast as it can, independently Tasks are independent per host
debug Interactive step-by-step debugging Troubleshooting
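Strategy is set per play. A sketch showing free for independent per-host work such as cache warming, where hosts need not stay in lockstep (the warm-cache script is hypothetical):

```yaml
- hosts: webservers
  strategy: free        # each host races ahead independently
  tasks:
    - name: Warm application cache
      ansible.builtin.command: /opt/app/warm-cache.sh   # hypothetical script
      changed_when: false
```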

Part 14: Debugging — Why Did It Do That?

The Verbosity Ladder

# Normal: show task names and status
ansible-playbook playbook.yml

# -v: show task results (return values)
ansible-playbook playbook.yml -v

# -vv: show task input parameters
ansible-playbook playbook.yml -vv

# -vvv: show SSH connection details
ansible-playbook playbook.yml -vvv

# -vvvv: show full SSH protocol debugging
ansible-playbook playbook.yml -vvvv

For most debugging: -vv (see what values were used). For connection issues: -vvv (see SSH commands). For SSH key/auth problems: -vvvv.

Check Mode and Diff Mode

# Preview what WOULD change without changing anything
ansible-playbook playbook.yml --check --diff

# Example diff output:
# TASK [Copy nginx config]
# --- before: /etc/nginx/nginx.conf
# +++ after: /tmp/ansible-generated
# @@ -10,3 +10,3 @@
# -    worker_connections 768;
# +    worker_connections 1024;
# changed: [webserver]

Gotcha: Check mode doesn't work with command/shell modules (they can't predict what a shell command would do). The module docs tell you if check mode is supported.

Nuance: Check mode has partial support for command/shell modules when using creates/removes parameters — Ansible can check whether the file exists without running the command. But don't assume check mode is perfect. It's excellent for declarative modules (apt, template, service) but limited with imperative tasks. Always review the diff output critically.

Non-negotiable rule: Always run --check --diff before every production run.

Essential Debugging Commands

# List what would be affected
ansible-playbook site.yml --list-hosts
ansible-playbook site.yml --list-tasks
ansible-playbook site.yml --list-tags

# Syntax check (no execution)
ansible-playbook site.yml --syntax-check

# Start at a specific task (skip earlier ones)
ansible-playbook site.yml --start-at-task="Install nginx"

# Use the retry file after a failure
ansible-playbook site.yml --limit @site.retry

# Test connectivity
ansible all -m ping

# Debug a specific variable on a specific host
ansible -m debug -a "var=hostvars[inventory_hostname]" web1.example.com

Debugging Command Reference

Task Command
Preview changes ansible-playbook play.yml --check --diff
Debug variables ansible -m debug -a "var=VAR" host
Override a variable ansible-playbook play.yml -e "var=value"
Force handlers ansible-playbook play.yml --force-handlers
Show facts ansible host -m setup
Verbose output -v (results) -vv (inputs) -vvv (SSH)
Step-by-step ansible-playbook play.yml --step
Start at task ansible-playbook play.yml --start-at-task "Name"
List tasks ansible-playbook play.yml --list-tasks
Syntax check ansible-playbook play.yml --syntax-check

Part 15: Performance at Scale

Gathering facts on 50 hosts takes time. On 500 hosts, it takes minutes before your first task runs. Here's how to fix that.

ansible.cfg Tuning

# ansible.cfg
[defaults]
forks = 30                              # Default is 5 — too low for any real fleet
gathering = smart                       # Only gather if cache is stale
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400            # 24 hours
interpreter_python = auto_silent        # Auto-detect Python; suppress warning noise
retry_files_enabled = False             # Don't litter .retry files everywhere

# Show task timing (which tasks are slow?)
callbacks_enabled = timer, profile_tasks
callback_result_format = yaml           # Output results as YAML instead of JSON

[ssh_connection]
pipelining = True                       # Reduces SSH round trips per task
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
Setting What It Does Impact
forks = 30 Run 30 hosts in parallel (default: 5) Linear speedup for independent tasks
gathering = smart Skip fact gathering if cache is fresh Saves 2–10 seconds per host
pipelining = True Run modules in-process instead of copying 2–3x faster per task
ControlPersist=60s Reuse SSH connections for 60 seconds Fewer SSH handshakes
interpreter_python = auto_silent Auto-detect Python path on targets without warning Cleaner output on mixed fleets
retry_files_enabled = False Suppress .retry file creation on failures Less clutter; use --limit @file explicitly

Scale note: ControlPersist + pipelining is the single biggest performance win for Ansible over SSH. Without pipelining, each task requires multiple SSH round trips (copy module, execute, fetch result). With pipelining, everything happens in one SSH session. Enable both for any fleet over 20 hosts.

Fact Caching with Redis (Shared Across CI Runners)

# For shared caching across CI runners
fact_caching = redis
fact_caching_connection = redis://localhost:6379/0

Disable Fact Gathering When Not Needed

- hosts: webservers
  gather_facts: no      # Skip fact gathering entirely
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

Mitogen Plugin

Mitogen replaces Ansible's SSH-based module execution with a more efficient method. It provides 2–7x speedup for many workloads and is a drop-in replacement — just change the strategy plugin.

Gotcha: Stale fact caches cause subtle bugs. If someone adds a disk or changes an IP, cached facts still show the old state. For critical operations, force fresh facts with an explicit ansible.builtin.setup task, e.g. with gather_subset: [network, hardware].

Async for Long-Running Tasks

- name: Run migration
  ansible.builtin.command: /opt/migrate.sh
  async: 3600    # Max runtime 1 hour
  poll: 30       # Check every 30 seconds
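With poll: 0 the same task becomes fire-and-forget: Ansible reclaims the connection immediately and you check back later with the async_status module. A sketch:

```yaml
- name: Start migration in the background
  ansible.builtin.command: /opt/migrate.sh
  async: 3600           # max runtime 1 hour
  poll: 0               # don't wait — return immediately
  register: migration_job

# ...other tasks can run here while the migration proceeds...

- name: Wait for migration to finish
  ansible.builtin.async_status:
    jid: "{{ migration_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 120
  delay: 30
```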

Part 16: Molecule — Testing Before Production

Molecule is Ansible's testing framework. It spins up containers, runs your role, verifies the result, and tears everything down.

Setup

# roles/webapp/molecule/default/molecule.yml
driver:
  name: docker
platforms:
  - name: ubuntu-noble
    image: ubuntu:noble
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
  - name: rocky-9
    image: rockylinux:9
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
provisioner:
  name: ansible
verifier:
  name: ansible
# roles/webapp/molecule/default/converge.yml
---
- name: Converge
  hosts: all
  vars:
    app_name: testapp
    app_version: "1.0.0"
  roles:
    - role: webapp
# roles/webapp/molecule/default/verify.yml
---
- name: Verify
  hosts: all
  tasks:
    - name: Check application directory exists
      ansible.builtin.stat:
        path: /opt/testapp
      register: app_dir

    - name: Assert application directory is correct
      ansible.builtin.assert:
        that:
          - app_dir.stat.exists
          - app_dir.stat.isdir

    - name: Check nginx config is valid
      ansible.builtin.command: nginx -t
      changed_when: false
      become: true

Running Molecule

# Full test cycle (create → converge → idempotence → verify → destroy)
molecule test

# Just apply the role (leave containers running for debugging)
molecule converge

# Run verification
molecule verify

# SSH into a test container to poke around
molecule login -h ubuntu-noble

# Destroy test containers
molecule destroy

The Idempotence Check

The molecule test sequence includes an idempotence check — it runs the playbook twice and fails if anything reports "changed" on the second run. This catches non-idempotent tasks.

# The idempotence check catches this:
# TASK [Create database] ********************************
# changed: [ubuntu-noble]     # First run: OK
# changed: [ubuntu-noble]     # Second run: NOT OK — should be "ok"

Molecule's full test sequence: dependency → lint → cleanup → destroy → syntax → create → prepare → converge → idempotence → verify → cleanup → destroy. The idempotence step is what separates "it runs" from "it's production-ready."

ansible-lint — Static Analysis for Playbooks

ansible-lint catches bad practices, deprecated syntax, and style issues before you run anything:

pip install ansible-lint
ansible-lint playbooks/

It flags common issues: missing FQCNs, bare variables in when: clauses, missing name: on tasks, and deprecated module usage. Treat it like shellcheck for Ansible — run it in CI and fix what it flags.
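Rules can be tuned per repository in a .ansible-lint file at the repo root. A minimal sketch (the rule IDs shown are illustrative; run ansible-lint -L to list the real ones for your version):

```yaml
# .ansible-lint
exclude_paths:
  - .cache/
  - molecule/
skip_list:
  - yaml[line-length]    # long lines are acceptable in this repo
warn_list:
  - experimental         # warn, don't fail CI, on experimental rules
```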

Execution Environments, Builder, and Navigator

As your Ansible footprint grows, "works on my machine" becomes a real problem — different engineers have different Python versions, different collection versions, and different system libraries. Execution Environments (EEs) solve this by packaging the entire controller-side runtime (Python, ansible-core, collections, Python dependencies) into a container image.

Tool Purpose
ansible-builder Creates EE container images from a definition file
ansible-navigator Modern CLI/TUI for running playbooks inside EEs (replaces ansible-playbook for EE workflows)

pip install ansible-navigator ansible-builder

# Build an EE from a definition
ansible-builder build --tag my-ee:latest

# Run a playbook inside an EE
ansible-navigator run site.yml -i inventory.yml --execution-environment-image my-ee:latest

Think of EEs as "Docker for your Ansible controller." The playbooks and inventory stay on your filesystem; the runtime (Python, modules, collections, libraries) lives in the container. This guarantees every engineer and CI runner uses the exact same toolchain.


Part 17: Ansible Galaxy and Collections

Ansible Galaxy

Galaxy is the community hub for sharing roles and collections.

# Install a role from Galaxy
ansible-galaxy install geerlingguy.nginx

# Install from a requirements file (pinned versions)
ansible-galaxy install -r requirements.yml

# requirements.yml
roles:
  - name: geerlingguy.nginx
    version: "3.1.0"
  - name: geerlingguy.docker
    version: "6.0.0"

Collections

Collections are the modern packaging format — they bundle roles, modules, and plugins together.

# Install a collection
ansible-galaxy collection install amazon.aws
ansible-galaxy collection install community.general

# Use fully qualified collection names in playbooks
- name: Install package
  ansible.builtin.apt:
    name: nginx
    state: present
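Collections can be pinned in requirements.yml too, alongside roles (the versions here are illustrative):

```yaml
# requirements.yml
collections:
  - name: amazon.aws
    version: ">=7.0.0"
  - name: community.general
    version: "8.3.0"
```

Install them with ansible-galaxy collection install -r requirements.yml, the same pinned-requirements workflow as roles.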

Trivia: Ansible Galaxy launched in 2013 with about 200 roles. By 2024, it hosted over 40,000 roles and collections, making it one of the largest repositories of reusable infrastructure code.


Part 18: Tower/AWX — Centralized Automation

For a team of one, ansible-playbook on your laptop works fine. For a team of ten, you need centralized execution, RBAC, audit trails, and scheduling. That's Tower (commercial, now Ansible Automation Platform) or AWX (free upstream).

Feature CLI (ansible-playbook) Tower/AWX
Execution Your laptop/CI runner Centralized server
RBAC None (SSH key = full access) Role-based access per project, inventory, credential
Audit trail Shell history, maybe CI logs Full job log with who, when, what, and diff
Scheduling Cron job Built-in scheduler with dependencies
Credentials Files on disk / env vars Encrypted credential store with access control
API None (wrap in scripts) Full REST API for integration
Cost Free AWX = free, Tower (AAP) = Red Hat subscription

Trivia: Red Hat open-sourced AWX in 2017 as the upstream for Ansible Tower. This was unusual — they essentially gave away the code for a commercial product. The strategy mirrors Red Hat's Fedora/RHEL model: free upstream builds community, paid product adds support and certification.


Part 19: Ansible vs Terraform vs Helm

Tool What It Manages How It Works State
Terraform Cloud infrastructure (VPCs, instances, databases, DNS) Declarative: "I want 3 servers" → Terraform figures out how Explicit state file
Ansible Server configuration (packages, files, services, users) Procedural + declarative: tasks in order on hosts Stateless (checks each run)
Helm Kubernetes workloads (Deployments, Services, ConfigMaps) Declarative: templates K8s YAML Release history in K8s secrets

The simple rule: Terraform builds the house. Ansible furnishes it. Helm runs the apps.

Terraform: Creates the VPC, subnets, EC2 instances, RDS database, S3 buckets
Ansible:   Configures the EC2 instances (packages, users, sshd, monitoring)
Helm:      Deploys applications to the Kubernetes cluster

Common pattern:

# Terraform creates infra and generates inventory
terraform apply
terraform output -json | ./generate_inventory.py > inventory.yml

# Ansible configures it
ansible-playbook -i inventory.yml site.yml
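What a generate_inventory.py script might look like is a few lines of Python. This is a hypothetical sketch, not the actual script from the pipeline above: it assumes a Terraform output named web_ips of type list(string) (adjust the key to match your outputs) and emits a minimal YAML inventory.

```python
import json  # used when parsing `terraform output -json` from stdin

def build_inventory(tf_outputs: dict) -> str:
    """Convert parsed `terraform output -json` into YAML inventory text.

    Assumes an output named "web_ips" (hypothetical) — terraform wraps each
    output as {"value": ..., "type": ...}.
    """
    hosts = tf_outputs["web_ips"]["value"]
    lines = ["all:", "  children:", "    webservers:", "      hosts:"]
    lines += [f"        {ip}:" for ip in hosts]
    return "\n".join(lines) + "\n"

# Example CLI usage, matching the shell pipeline above:
#   print(build_inventory(json.load(sys.stdin)), end="")
```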

When NOT to use Ansible: In a pure immutable-infrastructure/Kubernetes world, Ansible is less common (Helm and operators handle config). Ansible remains essential for node bootstrapping, bare-metal, network devices, and legacy systems.


Part 20: Common Production Patterns

Bootstrap Pattern — Provisioning Bare Servers

- name: Bootstrap new servers
  hosts: "{{ target }}"
  gather_facts: false    # Python might not be installed yet

  tasks:
    - name: Install Python (raw — no Python required)
      ansible.builtin.raw: |
        if command -v apt-get >/dev/null 2>&1; then
          apt-get update && apt-get install -y python3
        elif command -v dnf >/dev/null 2>&1; then
          dnf install -y python3
        fi
      changed_when: true

    - name: Now gather facts
      ansible.builtin.setup:

    - name: Run common role
      ansible.builtin.include_role:
        name: common

Under the Hood: The raw module doesn't require Python on the target — it sends commands over SSH directly. This is why gather_facts: false is mandatory here: the setup module (which gathers facts) is a Python module and would fail before Python is installed.

Upgrade Pattern

# Dry run first — always
ansible-playbook playbooks/rolling-upgrade.yml \
  -i inventory/aws_ec2.yml \
  -e target_version=2.1.0 \
  --check --diff

# Then for real
ansible-playbook playbooks/rolling-upgrade.yml \
  -i inventory/aws_ec2.yml \
  -e target_version=2.1.0

Rollback Pattern

# Rolling back is just deploying the old version
ansible-playbook playbooks/rolling-upgrade.yml \
  -i inventory/aws_ec2.yml \
  -e target_version=2.0.0

Because the playbook is idempotent, rolling back is just deploying the old version. No special rollback logic needed beyond block/rescue for per-host failures.

K3s Cluster Management Pattern

Real-world example — bootstrapping and upgrading a Kubernetes cluster:

devops/ansible/
  ansible.cfg
  inventory/
    hosts.local.yml                    # Single-node local
    hosts.example.yml                  # Multi-node template
    group_vars/all.yml                 # k3s version, etc.
  roles/
    k3s_server/                        # Install and configure k3s
    k3s_agent/                         # Join agent nodes
    helm/                              # Install Helm binary
    addons/                            # Observability stack
  playbooks/
    bootstrap-k3s.yml                  # Full cluster bootstrap
    upgrade-k3s.yml                    # Rolling k3s upgrade
    install-addons.yml                 # Cluster add-ons
# Bootstrap a cluster
ansible-playbook playbooks/bootstrap-k3s.yml

# Rolling upgrade with version override
ansible-playbook playbooks/upgrade-k3s.yml -e k3s_version=v1.31.0+k3s1

The validate Parameter — Free Insurance

- name: Copy nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s    # Validates BEFORE writing
  notify: Restart nginx

- name: Update sudoers
  ansible.builtin.template:
    src: sudoers.j2
    dest: /etc/sudoers
    validate: visudo -cf %s     # Validates BEFORE writing

- name: Update sshd config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    validate: sshd -t -f %s     # Validates BEFORE writing

If validation fails, the original file is untouched. This prevented a production outage where a broken nginx template would have taken down the entire web tier.

Custom Modules

When no built-in module exists:

#!/usr/bin/python
# roles/webapp/library/app_health.py
# Note: modules execute on the TARGET host, so `requests` must be installed
# there. For a dependency-free variant, use ansible.module_utils.urls.fetch_url.
from ansible.module_utils.basic import AnsibleModule
import requests

def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(required=True, type='str'),
            timeout=dict(default=10, type='int'),
        ),
        supports_check_mode=True,  # read-only check — safe in check mode
    )
    try:
        resp = requests.get(module.params['url'], timeout=module.params['timeout'])
        if resp.status_code == 200:
            module.exit_json(changed=False, status=resp.status_code)
        else:
            module.fail_json(msg=f"Health check returned {resp.status_code}")
    except Exception as e:
        module.fail_json(msg=str(e))

if __name__ == '__main__':
    main()
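Because the file lives in the role's library/ directory, tasks in that role can call it like any other module:

```yaml
- name: Verify application health (custom module)
  app_health:
    url: "http://localhost:8080/health"
    timeout: 5
```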

Part 21: Footguns — Mistakes That Brick Servers

These are the mistakes that experienced Ansible users have made (often more than once). Learn from their pain.

1. Running Against all When You Meant One Host

You type ansible-playbook site.yml without --limit. It runs against every host in inventory — including production. Your half-tested change is now on 200 servers.

Fix: Always use --limit or --check first. Better: add hosts: "{{ target }}" in playbooks and require the variable: ansible-playbook site.yml -e target=staging.
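One hedged way to enforce that variable: give hosts a default that matches no real group, so the play resolves zero hosts unless -e target=... is passed explicitly.

```yaml
- name: Safe targeting — nothing runs without -e target=...
  hosts: "{{ target | default('none_selected') }}"   # matches no inventory group by default
  tasks:
    - name: Prove which hosts were selected
      ansible.builtin.debug:
        msg: "Running on {{ inventory_hostname }}"
```

Running ansible-playbook site.yml with no -e target prints a "no hosts matched" warning and exits without touching anything.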

2. Using shell or command for Everything

You write shell: apt-get install nginx instead of ansible.builtin.apt with state: present. The task reports "changed" on every run, breaks check mode, and turns every fleet-wide run into 20 minutes of redundant package work.

Fix: Use native modules. Only use shell/command when there's no module, and add creates: or when: guards.

3. become: true at the Playbook Level

You set become: true globally because one task needs root. Now every task runs as root — file copies create root-owned files your app can't read.

Fix: Set become: true on individual tasks that need it, not the whole playbook.

4. Variable Precedence Surprises

You define app_port: 8080 in defaults/main.yml, group_vars/all.yml, and -e app_port=9090. Which wins? Extra vars. But the developer who set it in defaults doesn't know someone also set it in group_vars.

Fix: Know the precedence order. Keep variables in one place per scope. Use ansible -m debug -a "var=app_port" hostname to check resolved values.

5. Handlers Not Running After a Failure

You change a config file and expect the handler to restart the service. But a later task fails. Handlers don't run if the play fails. Your config changed but the service still runs the old config.

Fix: Use meta: flush_handlers if you need them to run immediately. Don't rely solely on handlers for critical state changes.
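flush_handlers in context — a sketch that forces notified handlers to run before later, riskier tasks (the migration script is hypothetical):

```yaml
- name: Update nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Restart nginx

- name: Run notified handlers NOW, not at end of play
  ansible.builtin.meta: flush_handlers

- name: Continue with tasks that might fail
  ansible.builtin.command: /opt/risky-migration.sh   # hypothetical
```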

6. lineinfile Fighting With Itself

Two tasks add different lines matching the same regex. They fight each other. Every run shows "changed."

Fix: Use blockinfile for multi-line content. Use template for files you fully manage. lineinfile is for surgical one-line changes only.
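blockinfile wraps its content in marker comments, so repeated runs update the same block instead of fighting each other. A sketch:

```yaml
- name: Manage the Ansible-owned section of sshd_config
  ansible.builtin.blockinfile:
    path: /etc/ssh/sshd_config
    marker: "# {mark} ANSIBLE MANAGED: ssh hardening"
    block: |
      PermitRootLogin no
      PasswordAuthentication no
```

The {mark} placeholder expands to BEGIN and END lines; everything between them is replaced wholesale on each run.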

7. No --check Before Production

You run a playbook on production without --check first. A template typo renders a broken config. The service restarts with broken config and goes down.

Fix: Always run --check --diff first. Review the diffs. Then run for real.

8. Vault Password in Shell History

You type ansible-vault encrypt_string 'mypassword' and the password is in your bash history.

Fix: Use ansible-vault encrypt_string --stdin-name 'db_password' so the secret is read from stdin and never appears in argv or shell history. Keep vault password files at chmod 600.

9. Forgetting no_log: true on Sensitive Tasks

Your playbook prints the database password in stdout. CI logs capture it.

Fix: Add no_log: true to tasks that handle secrets. Review CI output for leaked credentials.

10. Default Inventory Points to Production

New team members run ansible-playbook site.yml and hit prod because the default inventory is production.

Fix: Don't set a default inventory that points to production. Require explicit -i inventory/staging.yml.

11. Ignoring Errors Globally

ignore_errors: true because a task fails intermittently. Every error is silently swallowed forever.

Fix: Use failed_when with specific conditions instead of blanket ignore_errors.


Part 22: Real-World Case Studies

Case Study 1: Wrong Inventory Hits Production (PM-016)

Date: 2025-04-08 | Severity: SEV-3

An engineer ran an NTP configuration playbook against production instead of staging — a copy-paste error from a Slack snippet. The playbook reconfigured 47 production hosts to point NTP at a staging server.

Detection (10 minutes): NTP drift monitoring fired when 12 hosts exceeded 50ms offset. SRE on-call checked chronyc sources -v and immediately saw the staging NTP source.

Resolution (5 minutes): Re-ran the playbook with the correct inventory.

Lessons: 1. Confirmation gates are cheap insurance. A single pause task requiring a human "yes" before touching production hosts costs 10 seconds and prevents the entire class of wrong-inventory mistakes. 2. Alert on configuration identity, not just outcomes. Alerting on "production hosts syncing from a non-production NTP source" would have detected the cause instantly. 3. Shared command snippets are living hazards. A Slack message with a working command becomes an authoritative-looking template. Maintain canonical runbooks, not Slack snippets.
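The confirmation gate from lesson 1 is only a few lines: pause plus assert. A sketch (the group name 'prod' is an assumption about your inventory layout):

```yaml
- name: Confirm before touching production
  ansible.builtin.pause:
    prompt: "This run targets PRODUCTION ({{ ansible_play_hosts | length }} hosts). Type 'yes' to continue"
  register: confirm
  when: "'prod' in group_names"
  run_once: true

- name: Abort unless explicitly confirmed
  ansible.builtin.assert:
    that: confirm.user_input | default('yes') == 'yes'
    fail_msg: "Aborted: production run not confirmed"
  when: "'prod' in group_names"
  run_once: true
```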

Case Study 2: Playbook Hangs — SSH Agent Forwarding + Firewall

Symptom: Ansible playbook hangs on app-server-03 at a Git clone task. Previous servers worked fine.

Investigation trail: 1. DevOps layer: Git clone hangs → SSH agent forwarding? Checked — agent forwarding configured but SSH_AUTH_SOCK empty on new server 2. Linux layer: Sudoers missing env_keep += "SSH_AUTH_SOCK" on new server (provisioning gap) 3. Network layer (actual root cause): New server in restrictive security group — outbound SSH to GitLab subnet blocked. TCP connection to GitLab timed out silently.

Key insight: The symptom was an Ansible playbook hanging (DevOps), the initial investigation pointed to SSH agent forwarding (Linux ops), but the actual root cause was a firewall rule (networking). Ansible playbooks chain multiple SSH hops — agent forwarding, sudo environment, and firewall rules can all cause the same symptom.

Case Study 3: The Anti-Primer — Everything Goes Wrong

An ops engineer configuring 200 servers for a new deployment. Deadline pressure. Skips dry-run.

Hour Mistake Consequence
0 Runs without --limit Half-tested changes hit entire fleet
1 shell: apt-get install everywhere Non-idempotent; re-runs take 40 minutes
2 Global become: true All files owned by root; app can't read own config
3 Variable precedence confusion 3 hours debugging why defaults override doesn't work

Damage: 2–6 hours of infrastructure instability, 12–24 engineer-hours for remediation, infrastructure team credibility damaged.

Case Study 4: Thinking Out Loud — OpenSSL Patch Rollout

A senior SRE rolling out a security patch to 150 servers across 4 environments in 24 hours. Their mental process:

  1. Assess scope: ansible all -m shell -a "dpkg -l openssl" — 128 of 150 need the patch
  2. Plan rollout order: dev → staging → prod-us → prod-eu (progressive, environment by environment)
  3. Write playbook with guardrails: serial: 25%, max_fail_percentage: 10, pre-check to skip already-patched hosts, health checks with retries
  4. Execute progressively: Test on dev (12 servers), validate manually, then staging, then production with smaller serial count
  5. Handle the unexpected: One host's health check was slow (app warm-up time > retry window) — adjusted retry timing for remaining servers
  6. Verify fleet-wide: ansible all -m shell -a "openssl version" --become -o | sort | uniq -c — all 150 on correct version

Key heuristics: Progressive rollout, serial + circuit breaker, post-change health verification.


Glossary

Term Definition Mnemonic/Context
Ansible Agentless automation tool using SSH and YAML Named after FTL communication device from Le Guin's sci-fi
Control node Machine where Ansible is installed and playbooks run from Your laptop or CI server
Managed node Target server being configured by Ansible Needs only Python + SSH
Inventory List of hosts and groups Ansible targets "Who to manage"
Playbook YAML file defining desired state (contains plays) "What to do"
Play Maps hosts to tasks within a playbook One section targeting one group
Task A single action using a module - name: Install nginx
Module Unit of work (package, service, file, etc.) 7,000+ available
Role Packaged reusable tasks/templates/vars/handlers Like a function in code
Handler Delayed action triggered only when notified by a changed task Runs at end of play, not inline
Facts Auto-discovered host data (OS, IP, CPU, memory) Gathered by setup module
Idempotent Re-running yields same end state without repeated changes "Safe to run twice"
Become Privilege escalation (sudo) -b flag or become: true
Vault Encrypts secrets with AES-256 for safe git storage ansible-vault encrypt
Galaxy Community hub for sharing roles and collections 40,000+ roles
Collection Modern package format bundling roles + modules + plugins ansible-galaxy collection install
Tower/AWX Centralized web UI for Ansible with RBAC and scheduling AWX = free, Tower = paid
Molecule Testing framework for Ansible roles Idempotence check is the key feature
Jinja2 Templating engine for dynamic config files {{ variable }}, {% for %}
Serial Batch size for rolling updates serial: "25%" or serial: [1, "10%", "25%"]
Forks Number of hosts processed in parallel Default: 5 (too low for real fleets)
Pipelining SSH optimization — run modules in-process 2–3x speedup
delegate_to Run task on a different host but keep target's variables Used for API calls from control node
block/rescue/always Structured error handling (try/catch/finally) Better than ignore_errors
check mode Dry run — show what would change without changing --check --diff

Trivia and History

  1. Created in one weekend. Michael DeHaan wrote the first Ansible prototype (about 1,200 lines of Python) over a single weekend in February 2012. He was frustrated with Puppet and Chef's complexity.

  2. The name comes from science fiction. "Ansible" is from Ursula K. Le Guin's 1966 novel Rocannon's World — a device for instantaneous communication across any distance. DeHaan chose it because the tool communicates instantly with remote servers.

  3. Red Hat paid $150 million. Red Hat acquired Ansible Inc. in October 2015, just three years after the project's creation. At the time, Ansible had 1,200 contributors and was the most-starred infrastructure automation project on GitHub.

  4. SSH by design, not by accident. Unlike Puppet (custom TLS protocol) and Chef (HTTPS), Ansible uses standard SSH. DeHaan argued: if SSH is good enough for sysadmins to manage servers manually, it's good enough for automation.

  5. The cowsay Easter egg. If you have cowsay installed, Ansible randomly renders output through it, producing ASCII cow art. This was intentional — DeHaan believed long automation runs should have levity. Disable with ANSIBLE_NOCOWS=1.

  6. 40,000+ Galaxy roles. Ansible Galaxy launched in 2013 with ~200 roles. By 2024: 40,000+ roles and collections.

  7. The YAML controversy. DeHaan chose YAML so non-programmers could write automation. Critics argue YAML's whitespace sensitivity causes subtle bugs. Supporters maintain it kept Ansible accessible to sysadmins who would never learn Ruby (Puppet/Chef's DSL).

  8. Windows support via WinRM. Ansible added Windows support in version 1.7 (2014) using WinRM instead of SSH. Today it has 200+ Windows-specific modules.

  9. Idempotency isn't guaranteed. shell and command modules are explicitly not idempotent. A 2019 study found ~18% of community roles contained non-idempotent tasks.

  10. DeHaan left after the acquisition. Michael DeHaan stepped back from the project shortly after the Red Hat acquisition in 2015. He later expressed mixed feelings about the increasing complexity of Tower compared to his original vision of radical simplicity.

  11. AWX: open-sourcing your own paid product. In 2017, Red Hat open-sourced AWX (upstream of Ansible Tower). This followed their Fedora/RHEL model: free upstream grows the ecosystem, paid product adds support.


Flashcard Review

Foundations

| Q | A |
|---|---|
| What is Ansible (one line)? | Agentless automation over SSH using YAML playbooks |
| What does Ansible require on managed nodes? | Python and SSH — no agent needed |
| What is idempotency? | Re-running produces the same end state without repeated changes |
| What is an inventory? | The list/grouping of hosts Ansible targets (static or dynamic) |
| What is a playbook? | YAML file defining desired state; contains plays and tasks |
| What is a module? | Unit of work (package, service, file, etc.) — 7,000+ available |
| What is a handler? | Delayed action that runs at end of play, only when notified by a changed task |
| What is a role? | Packaged reusable tasks/templates/vars/handlers |
| What are facts? | Auto-gathered host info (OS, IP, CPU) for conditional logic |
| Play vs task vs role? | Play targets hosts; tasks are steps; roles package reusable content |

Variables and Precedence

| Q | A |
|---|---|
| How many levels of variable precedence exist? | 22 |
| What always wins in variable precedence? | Extra vars (`-e` on the command line) |
| Role `vars/main.yml` vs play vars — which wins? | Role vars (precedence 15) beats play vars (precedence 12) |
| When do you use `defaults/` vs `vars/` in a role? | `defaults/` = overridable knobs (low precedence). `vars/` = constants (high precedence) |
| How do you debug which value a variable has? | `ansible hostname -m debug -a "var=my_variable"` |
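A minimal sketch of precedence in action (the variable name `nginx_port` is just an example):

```yaml
# roles/web/defaults/main.yml — role default (near the bottom of the order)
nginx_port: 80

# group_vars/webservers.yml — inventory group var (beats role defaults)
nginx_port: 8080

# Extra vars beat everything:
#   ansible-playbook site.yml -e "nginx_port=9090"   → tasks see 9090
```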

Operations

| Q | A |
|---|---|
| What does `--check --diff` do? | Preview what WOULD change without changing anything; shows file diffs |
| What does `serial: [1, "10%", "25%"]` do? | Graduated rollout: 1 canary, then 10% batches, then 25% batches |
| What does `max_fail_percentage` do? | Stops the entire play if too many hosts fail (circuit breaker) |
| When you use `delegate_to: localhost`, whose variables does the task see? | The target host's variables — `delegate_to` changes execution location, not variable context |
| What's the difference between `block`/`rescue` and `ignore_errors`? | `block`/`rescue` is structured try/catch. `ignore_errors` silently swallows ALL errors |
| When do handlers run? | At end of play, only if the notifying task reported "changed" |
| What does `meta: flush_handlers` do? | Forces handlers to run immediately instead of waiting for end of play |
| Handler not firing — most common cause? | Task reports "ok" (config already matches), or handler name was changed but `notify:` wasn't updated |
| What does `validate: nginx -t -c %s` do on a template task? | Validates the config BEFORE writing; if validation fails, the original file is untouched |
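The `validate:` and `flush_handlers` answers above, sketched as tasks (paths and the handler name are illustrative):

```yaml
- name: Write nginx config only if it passes nginx -t
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s    # %s expands to the temp file being checked
  notify: Restart nginx

- name: Run pending handlers now instead of at end of play
  ansible.builtin.meta: flush_handlers
```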

Security

| Q | A |
|---|---|
| What encryption does Ansible Vault use? | AES-256-CTR with HMAC-SHA256 and PBKDF2 key stretching |
| What is the vault/vars split pattern? | Encrypted `vault.yml` holds values; plaintext `vars.yml` provides names that reference vault vars |
| How do you prevent secret leaks in playbook output? | `no_log: true` on tasks that handle secrets |
| How do you keep secret values out of shell history? | Feed them to `ansible-vault encrypt_string --stdin-name` on stdin instead of passing them as CLI arguments |
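The vault/vars split pattern as concrete files (paths and variable names are illustrative):

```yaml
# group_vars/production/vault.yml — encrypted with ansible-vault
vault_db_password: "s3cret-value"

# group_vars/production/vars.yml — plaintext, greppable indirection
db_password: "{{ vault_db_password }}"
```

Pair this with `no_log: true` on any task that renders or passes `db_password`, so a failure message can't echo the value.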

Debugging

| Q | A |
|---|---|
| `-v` vs `-vv` vs `-vvv` vs `-vvvv`? | Results → input params → SSH commands → full SSH protocol debug |
| How do you resume after a failure? | `--start-at-task="Task Name"` or `--limit @site.retry` |
| Task reports "changed" every run — why? | Not idempotent. Use modules instead of shell/command, or add `creates:`/`changed_when:` |
| How do you test idempotency? | Run the playbook twice; the second run should show 0 changed |
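The `creates:`/`changed_when:` fixes mentioned above, as concrete tasks (script and file paths are illustrative):

```yaml
- name: Initialize the database (runs only once)
  ansible.builtin.command: /opt/app/initdb.sh
  args:
    creates: /var/lib/app/.initialized   # skipped if this file already exists

- name: Read-only status check (never reports "changed")
  ansible.builtin.command: pg_isready
  register: pg_status
  changed_when: false
```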

Performance

| Q | A |
|---|---|
| Default forks value? | 5 (too low for any real fleet) |
| What does `pipelining = True` do? | Sends module code over the open SSH connection instead of copying temp files; roughly 2–3x faster per task |
| What does `ControlPersist=60s` do? | Reuses SSH connections for 60 seconds; fewer handshakes |
| How do you speed up fact gathering? | `gathering = smart` with fact caching (jsonfile or Redis) |
| What is Mitogen? | Third-party drop-in strategy plugin for a 2–7x speedup |
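These knobs usually land together in one `ansible.cfg`; the values below are illustrative starting points, not universal defaults:

```ini
[defaults]
# default is 5; raise for real fleets
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```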

Architecture

| Q | A |
|---|---|
| Terraform vs Ansible vs Helm? | Terraform builds infrastructure. Ansible configures servers. Helm deploys K8s apps |
| When NOT to use Ansible? | Pure immutable infrastructure / K8s-only environments (use Helm/operators instead) |
| What is `include_role` vs `import_role`? | Include is dynamic (runtime). Import is static (parse time). Affects tag/condition propagation |
| What is the `raw` module for? | Running commands when Python isn't installed on the target (bootstrap scenario) |
| What is Ansible Galaxy? | Community hub for sharing roles and collections (40,000+ roles) |
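The `raw` bootstrap scenario as a playbook sketch (the install command assumes a Debian-family target; group name is illustrative):

```yaml
- name: Bootstrap a host that has no Python yet
  hosts: new_hosts
  gather_facts: false          # fact gathering needs Python, so defer it
  become: true
  tasks:
    - name: Install Python via raw (works without Python on the target)
      ansible.builtin.raw: apt-get update && apt-get install -y python3

    - name: Gather facts now that Python exists
      ansible.builtin.setup:
```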

Drills

Drill 1: Ad-Hoc Commands (Easy)

Q: Check disk usage on all web servers and restart nginx using ad-hoc commands.

Answer
# Check disk usage
ansible webservers -m command -a "df -h" -i inventory.yml

# Restart nginx (needs sudo)
ansible webservers -m service -a "name=nginx state=restarted" -i inventory.yml -b

# Ping all hosts
ansible all -m ping -i inventory.yml
`-b` = become (sudo). `-m` = module. `-a` = arguments.

Drill 2: Write a Basic Playbook (Easy)

Q: Write a playbook that installs nginx, templates a config file, and ensures the service is running.

Answer
---
- name: Configure web servers
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Copy nginx config
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

Drill 3: Variable Precedence (Easy)

Q: Where can variables be defined? What's the simplified precedence order?

Answer

Sources (low → high):

1. Role defaults (`roles/x/defaults/main.yml`)
2. Inventory vars (`group_vars/`, `host_vars/`)
3. Playbook vars (`vars:` section)
4. Role vars (`roles/x/vars/main.yml`) — higher than playbook vars!
5. Task vars
6. Extra vars (`-e`) — **always wins**
# Extra vars override everything
ansible-playbook site.yml -e "nginx_port=8080"

Drill 4: Create a Role (Medium)

Q: Create a role structure for PostgreSQL. What goes in each directory?

Answer
ansible-galaxy init roles/postgresql
roles/postgresql/
├── defaults/main.yml    # Default variables (lowest precedence)
├── tasks/main.yml       # Main task list
├── handlers/main.yml    # Handlers (restart, reload)
├── templates/           # Jinja2 templates (.j2)
│   └── postgresql.conf.j2
├── files/               # Static files to copy
├── vars/main.yml        # Internal constants (higher precedence)
├── meta/main.yml        # Dependencies, metadata
└── README.md
# defaults/main.yml
postgresql_version: "15"
postgresql_max_connections: 100

# tasks/main.yml
- name: Install PostgreSQL
  ansible.builtin.apt:
    name: "postgresql-{{ postgresql_version }}"
    state: present

- name: Configure PostgreSQL
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: "/etc/postgresql/{{ postgresql_version }}/main/postgresql.conf"
  notify: Restart PostgreSQL

Drill 5: Jinja2 Template (Medium)

Q: Write a Jinja2 template for an nginx upstream config that dynamically lists all hosts in the app group.

Answer
{# templates/upstream.conf.j2 #}
upstream app_servers {
{% for host in groups['app'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port | default(8080) }};
{% endfor %}
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
    }
}
Key patterns: `{{ variable }}` (output), `{% for %}` (loop), `{{ var | default() }}` (filter).

Drill 6: Fix the Idempotency Bug (Medium)

Q: This task runs on every play and reports "changed" every time. Fix it.

- name: Add line to config
  ansible.builtin.shell: echo "max_connections = 200" >> /etc/postgresql/postgresql.conf
Answer
# Use lineinfile — only changes if the line doesn't match
- name: Set max connections
  ansible.builtin.lineinfile:
    path: /etc/postgresql/postgresql.conf
    regexp: '^max_connections'
    line: 'max_connections = 200'
The `shell` module runs every time and appends duplicate lines. `lineinfile` checks if the line already matches before changing anything.

Drill 7: Vault Operations (Medium)

Q: Encrypt a variable file, then use it in a playbook run.

Answer
# Encrypt a file
ansible-vault encrypt group_vars/production/secrets.yml

# Edit encrypted file
ansible-vault edit group_vars/production/secrets.yml

# Encrypt a single string without it landing in shell history:
# run this, type the secret, then press Ctrl-D
ansible-vault encrypt_string --stdin-name 'db_password'

# Run playbook with vault
ansible-playbook site.yml --ask-vault-pass
ansible-playbook site.yml --vault-password-file=~/.vault_pass
Best practice: separate secrets into `vault.yml` (encrypted) and `vars.yml` (plaintext references).

Drill 8: Conditionals and Loops (Medium)

Q: Install different packages based on OS family. Create users from a list.

Answer
# Conditional
- name: Install packages (Debian)
  ansible.builtin.apt:
    name: [nginx, curl, htop]
    state: present
  when: ansible_os_family == "Debian"

- name: Install packages (RedHat)
  ansible.builtin.yum:
    name: [nginx, curl, htop]
    state: present
  when: ansible_os_family == "RedHat"

# Loop with dict
- name: Create users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    shell: /bin/bash
  loop:
    - { name: alice, groups: "sudo,docker" }
    - { name: bob, groups: "docker" }
    - { name: carol, groups: "sudo" }
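Where package names happen to match across distros, the generic `package` module can replace the per-OS pair above — it resolves to apt, yum, or dnf on the target:

```yaml
- name: Install common packages (OS-agnostic module, shared names)
  ansible.builtin.package:
    name: [curl, htop]
    state: present
```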

Drill 9: Error Handling (Medium)

Q: Write a task that gracefully handles a missing legacy service, and a block/rescue pattern for deployment with rollback.

Answer
# Graceful handling
- name: Check if service exists
  ansible.builtin.command: systemctl status legacy-app
  register: result
  ignore_errors: true

- name: Stop legacy app if it exists
  ansible.builtin.service:
    name: legacy-app
    state: stopped
  when: result.rc == 0

# Block/rescue (try/catch)
- block:
    - name: Deploy application
      ansible.builtin.command: /opt/deploy.sh
  rescue:
    - name: Rollback on failure
      ansible.builtin.command: /opt/rollback.sh
  always:
    - name: Send notification
      ansible.builtin.debug:
        msg: "Deploy attempt complete"

Drill 10: Ansible vs Terraform (Easy)

Q: When do you use Ansible vs Terraform? Can they work together?

Answer

| Aspect | Terraform | Ansible |
|--------|-----------|---------|
| Purpose | Provision infrastructure | Configure servers |
| State | Stateful (state file) | Stateless |
| Best for | Cloud resources, networking | Packages, config, services |

**Together:** Terraform provisions VMs → outputs IPs → Ansible configures them.
terraform apply
terraform output -json | ./generate_inventory.py > inventory.yml
ansible-playbook -i inventory.yml site.yml

Cheat Sheet

Commands

# Ad-hoc
ansible all -m ping
ansible webservers -m command -a "uptime"
ansible dbservers -m service -a "name=postgresql state=restarted" -b

# Playbook execution
ansible-playbook site.yml -i inventory.yml
ansible-playbook site.yml --check --diff          # Dry run
ansible-playbook site.yml --limit web1             # One host
ansible-playbook site.yml --tags "nginx"           # Specific tags
ansible-playbook site.yml -e "var=value"           # Override var
ansible-playbook site.yml --step                   # Interactive
ansible-playbook site.yml --start-at-task "Name"   # Resume

# Debugging
ansible-playbook site.yml -v/-vv/-vvv/-vvvv
ansible -m debug -a "var=hostvars[inventory_hostname]" host
ansible-playbook site.yml --syntax-check
ansible-playbook site.yml --list-hosts/--list-tasks/--list-tags

# Vault
ansible-vault create/encrypt/edit/view/decrypt/rekey file.yml
ansible-vault encrypt_string 'secret' --name 'var_name'
ansible-playbook site.yml --ask-vault-pass
ansible-playbook site.yml --vault-password-file=~/.vault_pass

# Galaxy
ansible-galaxy install -r requirements.yml
ansible-galaxy collection install amazon.aws
ansible-galaxy init roles/myrole

# Inventory
ansible-inventory -i inventory/ --graph
ansible-inventory -i inventory/ --list

# Molecule
molecule test                    # Full cycle
molecule converge                # Apply only
molecule verify                  # Verify only
molecule login -h ubuntu-noble   # SSH into container

Key Concepts Quick Reference

| Concept | Remember |
|---------|----------|
| Variable precedence | `defaults/` = bottom, `-e` = top, `vars/` = high |
| Handlers | Run at end of play, not after the notifying task |
| `serial` | Batch size for rolling updates |
| `max_fail_percentage` | Circuit breaker — stop rollout if too many fail |
| `delegate_to` | Changes where the task runs, not whose variables |
| `raw` module | Only module that doesn't require Python on the target |
| Vault encryption | AES-256-CTR with PBKDF2 key stretching |
| Idempotency test | Run twice — second run should show 0 changed |
| Check mode + diff | Non-negotiable before any production run |
| Forks default | 5 (increase to 20–50 for real fleets) |
| Pipelining | 2–3x speedup; enable in `ansible.cfg` |
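Several of these knobs combine naturally in a rolling-update play; a sketch (the load-balancer helper scripts and package name are hypothetical):

```yaml
- name: Zero-downtime web rollout
  hosts: webservers
  become: true
  serial: [1, "10%", "25%"]     # canary, then growing batches
  max_fail_percentage: 20       # abort the play if a batch fails too often
  tasks:
    - name: Drain host from the load balancer (hypothetical helper)
      ansible.builtin.command: /opt/lb/drain.sh {{ inventory_hostname }}
      delegate_to: localhost

    - name: Upgrade the app package
      ansible.builtin.apt:
        name: myapp
        state: latest

    - name: Re-enable host in the load balancer (hypothetical helper)
      ansible.builtin.command: /opt/lb/enable.sh {{ inventory_hostname }}
      delegate_to: localhost
```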

Self-Assessment

Rate yourself on each area. If you can't explain it to someone else, revisit that section.

Core Concepts

  • I can explain what Ansible is and how it differs from Chef/Puppet in one sentence
  • I understand the control node → SSH → module → execute → report mental model
  • I can define: inventory, playbook, module, idempotent, handler, facts, role
  • I know when to use shell/command vs native modules (and why it matters)

Inventory and Targeting

  • I can write static inventory in both INI and YAML format
  • I understand group_vars, host_vars, and the [group:children] syntax
  • I know what dynamic inventory is and when to use it
  • I understand --limit and tags for controlling blast radius

Variables and Templating

  • I can explain the simplified variable precedence (defaults → inventory → play → role vars → extra vars)
  • I know why defaults/ vs vars/ matters in roles
  • I can write Jinja2 templates with loops, conditionals, and filters
  • I know the {{ }} quoting rule in YAML

Secrets

  • I can encrypt and decrypt files/strings with ansible-vault
  • I understand the vault/vars split pattern
  • I know how to prevent secret leaks (no_log: true, avoid CLI args)

Operations

  • I can write a rolling update playbook with serial and max_fail_percentage
  • I understand handlers, when they fire, and when they don't
  • I can use --check --diff to preview changes before production runs
  • I know how to debug with -v through -vvvv
  • I can use block/rescue/always for error handling

Performance and Testing

  • I know how to tune ansible.cfg for large fleets (forks, pipelining, fact caching)
  • I can set up Molecule for role testing
  • I understand the idempotence check and why it matters