
Ansible: From Playbook to Production


Topics: Ansible inventory, roles, Jinja2 templating, Vault, Molecule testing, rolling updates, delegation, AWS dynamic inventory, callback plugins, Tower/AWX
Level: L1–L2 (Foundations → Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained inline)


The Mission

You've just been handed responsibility for 50 web servers behind an AWS ALB. The application team needs a zero-downtime rolling upgrade — new application version, updated nginx config, rotated database credentials. The last person who did this manually took four hours, missed two servers, and left the credentials in a shell history file.

You're going to automate the entire thing with Ansible. By the end of this lesson you'll have:

  • A dynamic inventory that discovers your EC2 instances automatically
  • A production-grade role with every directory explained
  • Vault-encrypted secrets that are safe to commit
  • A rolling update playbook with health checks and automatic rollback
  • A Molecule test suite that catches bugs before they reach production
  • The mental model to debug the inevitable "why did it do that?" moments

We'll build from the ground up: inventory first, then roles, then secrets, then the rolling deploy, then testing. Each section adds a layer. By the end, you'll have a complete, production-ready pipeline.


Part 1: Inventory — Who Are We Talking To?

Before Ansible does anything, it needs to know who to talk to. That's the inventory.

Static inventory: the starting point

# inventory/hosts.ini
[webservers]
web01.prod.example.com
web02.prod.example.com ansible_host=10.0.1.12
web03.prod.example.com

[dbservers]
db01.prod.example.com ansible_port=2222

[production:children]
webservers
dbservers

[webservers:vars]
http_port=8080
app_env=production

| Syntax | What it does |
|---|---|
| `[webservers]` | Defines a group named "webservers" |
| `ansible_host=10.0.1.12` | Overrides the connection IP (when DNS doesn't resolve) |
| `[production:children]` | Creates a parent group containing other groups |
| `[webservers:vars]` | Variables applied to every host in the group |

Static inventory works for five servers. For 50 EC2 instances that scale up and down? You need dynamic inventory.

Dynamic inventory: let AWS tell you who exists

# inventory/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
  - key: placement.availability_zone
    prefix: az
filters:
  tag:ManagedBy: ansible
  instance-state-name: running
compose:
  ansible_host: private_ip_address
  ansible_user: "'ubuntu'"

This file is the entire inventory. No hostnames to maintain. Ansible queries the AWS API at runtime, discovers every running instance tagged ManagedBy: ansible, and groups them by their Role tag, Environment tag, and availability zone.

# See what Ansible discovers
ansible-inventory -i inventory/aws_ec2.yml --graph

Output looks like:

@all:
  |--@env_production:
  |  |--10.0.1.10
  |  |--10.0.1.11
  |  |--10.0.1.12
  |--@role_webserver:
  |  |--10.0.1.10
  |  |--10.0.1.11
  |--@az_us_east_1a:
  |  |--10.0.1.10

Under the Hood: The aws_ec2 plugin calls the EC2 DescribeInstances API. The keyed_groups directive creates Ansible groups from instance metadata. The compose block builds per-host variables using Jinja2 expressions evaluated against the API response. The single quotes around 'ubuntu' in compose are deliberate — without them, Ansible would try to resolve ubuntu as a variable name.
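To make that quoting rule concrete, here is the same `compose` block written both ways (the wrong variant is commented out):

```yaml
compose:
  # Wrong — evaluated as a Jinja2 expression, so Ansible looks for a
  # variable named `ubuntu` and the host var ends up undefined:
  # ansible_user: ubuntu

  # Right — the inner quotes make the expression a string literal:
  ansible_user: "'ubuntu'"

  # Expressions that reference real instance fields need no inner quotes:
  ansible_host: private_ip_address
```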

Gotcha: The aws_ec2 plugin requires the amazon.aws collection and the boto3 Python package. Install both: ansible-galaxy collection install amazon.aws and pip install boto3. If you forget boto3, the error message is unhelpfully vague: "Failed to import the required Python library".

group_vars and host_vars: variables that follow the inventory

inventory/
  aws_ec2.yml
  group_vars/
    all.yml                    # Every host gets these
    role_webserver.yml         # Only hosts tagged Role=webserver
    env_production.yml         # Only hosts tagged Environment=production
  host_vars/
    10.0.1.10.yml              # Overrides for one specific host

# inventory/group_vars/all.yml
ntp_servers:
  - 169.254.169.123   # AWS time sync service
timezone: UTC
monitoring_agent: prometheus-node-exporter

# inventory/group_vars/role_webserver.yml
nginx_worker_processes: auto
nginx_worker_connections: 2048
app_port: 8080
health_check_path: /health

Variables in host_vars/ override group_vars/. This is part of Ansible's 22-level precedence hierarchy, but the practical rule is simple: role defaults are the bottom, extra vars (-e) are the top, everything else falls in between.
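A quick way to see precedence in action is a throwaway debug play. This is a sketch — it assumes `app_port` is already defined in `group_vars` as shown above:

```yaml
# precedence-demo.yml (hypothetical)
- hosts: role_webserver
  gather_facts: false
  tasks:
    - name: Show the effective value after all precedence rules apply
      ansible.builtin.debug:
        var: app_port

# ansible-playbook precedence-demo.yml                    → 8080 (group_vars)
# ansible-playbook precedence-demo.yml -e app_port=9090   → 9090 (extra vars win)
```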


Flashcard Check: Inventory

| Question | Answer |
|---|---|
| What's the difference between static and dynamic inventory? | Static = manually maintained file. Dynamic = a plugin or script queries an API at runtime. |
| What does `keyed_groups` do in a dynamic inventory plugin? | Creates Ansible groups from instance metadata (tags, zones, types). |
| Where do you put variables that apply to all hosts in a group? | `group_vars/<group_name>.yml` alongside the inventory file. |
| What always wins in Ansible's variable precedence? | Extra vars (`-e` on the command line) — they override everything. |

Part 2: Roles — Reusable, Testable Building Blocks

A role is a directory structure that packages tasks, templates, variables, handlers, and metadata into a reusable unit. Think of it like a function in code — it takes inputs (variables), does work (tasks), and has side effects (handlers).

The complete role directory layout

roles/
  webapp/
    defaults/main.yml      # Default variables — LOW precedence, meant to be overridden
    vars/main.yml           # Role variables — HIGH precedence, hard to override
    tasks/main.yml          # The actual work
    handlers/main.yml       # Actions triggered by notify (e.g., restart services)
    templates/              # Jinja2 templates (.j2 files)
      nginx-vhost.conf.j2
      app-config.yml.j2
    files/                  # Static files copied as-is (no templating)
      logrotate-webapp
    meta/main.yml           # Role metadata: dependencies, platforms, author
    molecule/               # Test suite (we'll build this in Part 5)
      default/
        molecule.yml
        converge.yml
        verify.yml

Let's walk through each file.

defaults/main.yml — the "safe to override" knobs

# roles/webapp/defaults/main.yml
---
app_name: mywebapp
app_version: "1.0.0"
app_port: 8080
app_user: deploy
app_group: deploy
app_home: "/opt/{{ app_name }}"

nginx_listen_port: 80
nginx_server_name: "{{ inventory_hostname }}"
nginx_ssl_enabled: false

health_check_url: "http://127.0.0.1:{{ app_port }}/health"
health_check_retries: 5
health_check_delay: 10

These are the knobs consumers of your role can turn. Put anything here that a user might reasonably want to change per environment.
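In practice the override lives in the consumer's inventory, not in the role. A hypothetical staging override might look like:

```yaml
# inventory/group_vars/env_staging.yml — illustrative values
app_version: "1.1.0-rc2"     # test the release candidate in staging
nginx_listen_port: 8081
health_check_retries: 10     # staging boxes are slower to warm up
```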

vars/main.yml — the "don't touch these" constants

# roles/webapp/vars/main.yml
---
# Internal paths — changing these would break the role
app_bin: "{{ app_home }}/bin"
app_config_dir: "{{ app_home }}/config"
app_log_dir: "/var/log/{{ app_name }}"
nginx_config_path: "/etc/nginx/sites-available/{{ app_name }}.conf"
nginx_enabled_path: "/etc/nginx/sites-enabled/{{ app_name }}.conf"

# System packages required regardless of configuration
required_packages:
  - nginx
  - python3
  - python3-pip
  - acl

War Story: A team put app_port: 8080 in both defaults/main.yml and vars/main.yml during a refactor. The defaults file said port 8080, the vars file also said 8080 — so testing caught nothing. Three months later, the staging team set app_port: 9090 in their group_vars/staging.yml. It didn't work. Their port override was being silently stomped by vars/main.yml, which has higher precedence than group_vars. They spent two hours checking every inventory file and host_var before someone ran ansible -m debug -a "var=app_port" staging-web01 and saw the value was still 8080. The fix: move app_port out of vars/ — it belonged in defaults/ all along. Rule: if users should be able to override it, it goes in defaults. If they shouldn't, it goes in vars. Never put the same variable in both.

tasks/main.yml — the work

# roles/webapp/tasks/main.yml
---
- name: Include OS-specific variables
  ansible.builtin.include_vars: "{{ ansible_os_family | lower }}.yml"

- name: Install required packages
  ansible.builtin.package:
    name: "{{ required_packages }}"
    state: present
  become: true

- name: Create application user
  ansible.builtin.user:
    name: "{{ app_user }}"
    group: "{{ app_group }}"
    home: "{{ app_home }}"
    shell: /usr/sbin/nologin
    system: true
  become: true

- name: Create application directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_group }}"
    mode: "0755"
  loop:
    - "{{ app_home }}"
    - "{{ app_bin }}"
    - "{{ app_config_dir }}"
    - "{{ app_log_dir }}"
  become: true

- name: Deploy application binary
  ansible.builtin.copy:
    src: "{{ app_name }}-{{ app_version }}.jar"
    dest: "{{ app_bin }}/{{ app_name }}.jar"
    owner: "{{ app_user }}"
    group: "{{ app_group }}"
    mode: "0755"
  become: true
  notify: Restart application

- name: Template application config
  ansible.builtin.template:
    src: app-config.yml.j2
    dest: "{{ app_config_dir }}/config.yml"
    owner: "{{ app_user }}"
    group: "{{ app_group }}"
    mode: "0640"
  become: true
  notify: Restart application

- name: Template nginx vhost
  ansible.builtin.template:
    src: nginx-vhost.conf.j2
    dest: "{{ nginx_config_path }}"
    owner: root
    group: root
    mode: "0644"
    validate: nginx -t -c %s
  become: true
  notify: Reload nginx

- name: Enable nginx vhost
  ansible.builtin.file:
    src: "{{ nginx_config_path }}"
    dest: "{{ nginx_enabled_path }}"
    state: link
  become: true
  notify: Reload nginx

- name: Deploy systemd unit
  ansible.builtin.template:
    src: webapp.service.j2
    dest: "/etc/systemd/system/{{ app_name }}.service"
    owner: root
    group: root
    mode: "0644"
  become: true
  notify:
    - Reload systemd
    - Restart application

- name: Ensure application is running
  ansible.builtin.systemd:
    name: "{{ app_name }}"
    state: started
    enabled: true
  become: true

Notice: become: true is on individual tasks, not at the play level. The validate parameter on the nginx template runs nginx -t before writing the file — if the config is broken, the original file stays untouched.

handlers/main.yml — the "only if something changed" actions

# roles/webapp/handlers/main.yml
---
- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true
  become: true

- name: Restart application
  ansible.builtin.systemd:
    name: "{{ app_name }}"
    state: restarted
  become: true

- name: Reload nginx
  ansible.builtin.systemd:
    name: nginx
    state: reloaded
  become: true

Gotcha: Handlers run at the end of the play, not after the task that notified them. If a task notifies "Restart application" but a later task fails, the handler never runs — your config changed but the service still runs the old version. Use meta: flush_handlers if you need handlers to fire immediately.
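A minimal sketch of that escape hatch — forcing notified handlers to run before later tasks execute:

```yaml
- name: Template application config
  ansible.builtin.template:
    src: app-config.yml.j2
    dest: "{{ app_config_dir }}/config.yml"
  notify: Restart application

- name: Run pending handlers now, not at the end of the play
  ansible.builtin.meta: flush_handlers

# Every task below this point sees the restarted service.
```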

meta/main.yml — dependencies and metadata

# roles/webapp/meta/main.yml
---
dependencies:
  - role: common
  - role: monitoring
    vars:
      monitoring_port: "{{ app_port }}"

galaxy_info:
  author: ops-team
  description: Deploy and configure a Java web application with nginx reverse proxy
  min_ansible_version: "2.14"
  platforms:
    - name: Ubuntu
      versions: [jammy, noble]
    - name: EL
      versions: [8, 9]

Dependencies run before the role's tasks. The common role might install base packages and configure NTP; the monitoring role might install the Prometheus node exporter.

Trivia: The name "Ansible Galaxy" follows the science fiction theme. Ansible itself is named after the instantaneous communication device from Ursula K. Le Guin's 1966 novel Rocannon's World. Galaxy launched in 2013 with about 200 roles — by 2024, it hosted over 40,000 roles and collections.


Part 3: Jinja2 Templating — Dynamic Config Files

Templates are where variables become real config files. Ansible uses Jinja2, the same engine behind Flask and Django templates.

{# templates/nginx-vhost.conf.j2 #}
upstream {{ app_name }}_backend {
{%- for host in groups['role_webserver'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port }};
{%- endfor %}
}

server {
    listen {{ nginx_listen_port }};
    server_name {{ nginx_server_name }};

    location / {
        proxy_pass http://{{ app_name }}_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /health {
        proxy_pass http://127.0.0.1:{{ app_port }}/health;
        access_log off;
    }

{% if nginx_ssl_enabled %}
    listen 443 ssl;
    ssl_certificate /etc/ssl/certs/{{ nginx_server_name }}.pem;
    ssl_certificate_key /etc/ssl/private/{{ nginx_server_name }}.key;
{% endif %}
}

Essential Jinja2 filters you'll actually use

| Filter | What it does | Example |
|---|---|---|
| `default('val')` | Fallback if a variable is undefined | `{{ timeout \| default(30) }}` |
| `mandatory` | Fail if a variable is undefined | `{{ db_host \| mandatory }}` |
| `to_nice_json` | Pretty-print as JSON | `{{ config_dict \| to_nice_json }}` |
| `regex_replace` | Regex substitution | `{{ hostname \| regex_replace('\.example\.com$', '') }}` |
| `join(', ')` | Join a list into a string | `{{ dns_servers \| join(', ') }}` |
| `selectattr` | Filter a list of dicts by attribute | `{{ users \| selectattr('active') \| list }}` |
| `b64encode` | Base64-encode a value | `{{ secret \| b64encode }}` |
| `password_hash` | Hash a password | `{{ pass \| password_hash('sha512') }}` |
| `basename` | Extract the filename from a path | `{{ '/etc/nginx/conf.d/app.conf' \| basename }}` |

Gotcha: YAML treats { as the start of a mapping. Any value that begins with {{ must be quoted: message: "{{ greeting }} world". Without quotes you get the cryptic error mapping values are not allowed in this context. This is the single most common Ansible YAML error.
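You can reproduce the failure outside Ansible with PyYAML, the same parser Ansible uses for playbooks — a sketch assuming the `yaml` package is installed:

```python
import yaml

# Unquoted: the leading "{{" opens a YAML flow mapping, and the parser
# rejects the line before Jinja2 ever sees the value.
try:
    yaml.safe_load("message: {{ greeting }} world")
    failed = False
except yaml.YAMLError:
    failed = True

# Quoted: parsed as a plain string, templated by Jinja2 later.
doc = yaml.safe_load('message: "{{ greeting }} world"')
print(failed)  # True
print(doc)     # {'message': '{{ greeting }} world'}
```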

The {%- syntax (note the dash) strips whitespace before the tag. Without it, {% for %} loops add blank lines between entries. For nginx this is cosmetic; for YAML or Python configs it breaks parsing.
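The effect is easy to see by rendering the same loop with and without the dash — a standalone sketch using the `jinja2` library (the same engine Ansible bundles):

```python
from jinja2 import Template

hosts = ["10.0.1.10", "10.0.1.11"]

# Without the dash: the newline before each {% %} tag survives,
# leaving a blank line per loop iteration.
plain = Template(
    "upstream app {\n"
    "{% for h in hosts %}\n"
    "    server {{ h }}:8080;\n"
    "{% endfor %}\n"
    "}"
).render(hosts=hosts)

# With {%- : whitespace (including the newline) before each tag is stripped.
trimmed = Template(
    "upstream app {\n"
    "{%- for h in hosts %}\n"
    "    server {{ h }}:8080;\n"
    "{%- endfor %}\n"
    "}"
).render(hosts=hosts)

print("\n\n" in plain)    # True  — blank lines between entries
print("\n\n" in trimmed)  # False — clean, contiguous output
```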


Flashcard Check: Roles and Templates

| Question | Answer |
|---|---|
| What's the difference between `defaults/` and `vars/` in a role? | `defaults/` = low precedence, meant to be overridden. `vars/` = high precedence, hard to override. |
| When do handlers run? | At the end of the play, not after the notifying task. Use `meta: flush_handlers` for immediate execution. |
| What does `validate: nginx -t -c %s` do on a template task? | Runs nginx config validation before writing the file. If validation fails, the original is untouched. |
| Why must `{{ var }}` be quoted in YAML? | YAML interprets `{` as a mapping start. Unquoted Jinja2 causes a parse error. |
| What does `{%-` do in a Jinja2 template? | Strips whitespace before the tag, preventing blank lines in loop output. |

Part 4: Vault — Secrets That Are Safe to Commit

Your rolling upgrade needs new database credentials. Those credentials need to live somewhere in your repo so the playbook can use them. Ansible Vault encrypts them with AES-256 so the ciphertext is safe to commit to git.

The essential vault commands

# Create a new encrypted file
ansible-vault create group_vars/env_production/vault.yml

# Encrypt an existing file
ansible-vault encrypt group_vars/env_production/secrets.yml

# Edit an encrypted file (decrypts to temp file, opens $EDITOR, re-encrypts on save)
ansible-vault edit group_vars/env_production/vault.yml

# View without decrypting to disk
ansible-vault view group_vars/env_production/vault.yml

# Change the encryption password
ansible-vault rekey group_vars/env_production/vault.yml

# Encrypt a single string (for inline use in YAML)
ansible-vault encrypt_string 'db_p@ssw0rd_2026' --name 'vault_db_password'

Under the Hood: Vault uses AES-256 in CTR mode with HMAC-SHA256 for integrity. The vault password is stretched using PBKDF2 with a random salt. Each encrypted file or string is self-contained — the salt, HMAC, and ciphertext are all embedded under the $ANSIBLE_VAULT;1.1;AES256 header. When you run ansible-vault edit, the file is decrypted to a short-lived temporary file, opened in your $EDITOR, re-encrypted on save, and the temp file is removed — so the plaintext exists on disk only for the duration of the edit.

The vault/vars split pattern

Don't encrypt your entire vars file. Use two files:

inventory/
  group_vars/
    env_production/
      vars.yml          # Plaintext — references vault variables
      vault.yml          # Encrypted — contains the actual secrets

# group_vars/env_production/vault.yml (encrypted)
vault_db_password: "s3cr3t_pr0d_p4ss"
vault_api_key: "ak_prod_xK9mP2qR7vN4"
vault_tls_key: |
  -----BEGIN PRIVATE KEY-----
  MIIEvQIBADANBgkqhkiG9w0BAQEFA...
  -----END PRIVATE KEY-----

# group_vars/env_production/vars.yml (plaintext)
db_password: "{{ vault_db_password }}"
api_key: "{{ vault_api_key }}"
tls_key: "{{ vault_tls_key }}"

Why the split? You can grep for where db_password is used without decrypting anything. The vault file holds values; the vars file provides names. When reviewing a PR, you can see that db_password was changed without needing the vault password.

Running playbooks with vault

# Interactive prompt
ansible-playbook site.yml --ask-vault-pass

# Password file (for CI/CD)
echo "your-vault-password" > ~/.vault_pass
chmod 600 ~/.vault_pass
ansible-playbook site.yml --vault-password-file ~/.vault_pass

# Environment variable (also common in CI)
export ANSIBLE_VAULT_PASSWORD_FILE=~/.vault_pass
ansible-playbook site.yml

Gotcha: Never pass secrets directly on the command line: ansible-vault encrypt_string 'actual_password' puts the password in your shell history. Use --stdin-name with a pipe instead: echo -n 'actual_password' | ansible-vault encrypt_string --stdin-name 'db_password'

Multiple vault IDs: different secrets for different teams

# Encrypt with specific vault IDs
ansible-vault encrypt --vault-id dev@prompt secrets-dev.yml
ansible-vault encrypt --vault-id prod@/path/to/prod-password secrets-prod.yml

# Run with multiple vault IDs
ansible-playbook site.yml \
  --vault-id dev@prompt \
  --vault-id prod@/path/to/prod-password

The dev team has one vault password, the prod team has another. Neither can decrypt the other's secrets.


Part 5: The Rolling Update — Zero Downtime on 50 Servers

This is where everything comes together. We need to upgrade 50 web servers behind an AWS ALB without dropping a single request.

The strategy

  1. Process servers in small batches (serial)
  2. For each batch: remove from load balancer, upgrade, health check, re-add
  3. If too many fail, stop the entire rollout (max_fail_percentage)
  4. If a single server fails, roll it back (block/rescue)

# playbooks/rolling-upgrade.yml
---
- name: Rolling upgrade to {{ target_version }}
  hosts: role_webserver
  serial:
    - 1            # First: single canary
    - "10%"        # Then: 10% at a time
    - "25%"        # Then: 25% at a time
  max_fail_percentage: 10

  pre_tasks:
    - name: Verify current version
      ansible.builtin.command: "cat {{ app_home }}/VERSION"
      register: pre_version
      changed_when: false

    - name: Remove from ALB target group
      community.aws.elb_target:
        target_group_arn: "{{ alb_target_group_arn }}"
        target_id: "{{ ansible_host }}"
        state: absent
      delegate_to: localhost

    - name: Wait for connections to drain
      ansible.builtin.pause:
        seconds: 30

  roles:
    - role: webapp
      vars:
        app_version: "{{ target_version }}"

  post_tasks:
    - name: Wait for application health
      ansible.builtin.uri:
        url: "{{ health_check_url }}"
        return_content: true
        status_code: 200
      register: health
      retries: "{{ health_check_retries }}"
      delay: "{{ health_check_delay }}"
      until: health.status == 200

    - name: Re-add to ALB target group
      community.aws.elb_target:
        target_group_arn: "{{ alb_target_group_arn }}"
        target_id: "{{ ansible_host }}"
        target_port: "{{ app_port }}"
        state: present
      delegate_to: localhost

    - name: Wait for ALB health check to pass
      ansible.builtin.pause:
        seconds: 15

    - name: Verify new version
      ansible.builtin.command: "cat {{ app_home }}/VERSION"
      register: post_version
      changed_when: false

    - name: Report upgrade status
      ansible.builtin.debug:
        msg: >-
          {{ inventory_hostname }}: {{ pre_version.stdout }}
          → {{ post_version.stdout }}

Let's break down the key decisions:

serial: [1, "10%", "25%"] — Graduated rollout. The first batch is a single canary. If that one server survives, we widen to 10%, then 25%. This catches both "the new version is completely broken" (canary fails) and "the new version fails under load" (10% batch reveals issues).

max_fail_percentage: 10 — The circuit breaker. If more than 10% of hosts in any batch fail, Ansible stops the entire play. Without this, a bad deploy rolls across all 50 servers.

Mental Model: Think of serial + max_fail_percentage as a circuit breaker pattern — the same concept used in microservice architectures. serial controls the batch size (how much current flows). max_fail_percentage is the trip threshold (how much failure before the breaker opens). Together, they limit blast radius.

delegate_to: localhost — The ALB API calls run on your control node, not on the web servers. The inventory_hostname and ansible_host variables still refer to the target server — delegate_to changes where the task runs, not whose variables it uses.
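A minimal sketch that makes the variable-context rule visible:

```yaml
- name: Show delegation semantics
  ansible.builtin.debug:
    msg: >-
      Running on the control node, but inventory_hostname is still
      {{ inventory_hostname }} and ansible_host is {{ ansible_host }}
  delegate_to: localhost   # changes WHERE the task runs, not WHOSE vars it sees
```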

Rollback with block/rescue

For critical deployments, wrap the upgrade in a block/rescue:

  tasks:
    - name: Deploy with rollback
      block:
        - name: Deploy new version
          ansible.builtin.include_role:
            name: webapp
          vars:
            app_version: "{{ target_version }}"

        - name: Verify health after deploy
          ansible.builtin.uri:
            url: "{{ health_check_url }}"
            status_code: 200
          register: health
          retries: 5
          delay: 10
          until: health.status == 200

      rescue:
        - name: ROLLBACK — deploy previous version
          ansible.builtin.include_role:
            name: webapp
          vars:
            app_version: "{{ pre_version.stdout | trim }}"

        - name: Notify on rollback
          community.general.slack:
            token: "{{ vault_slack_token }}"
            channel: "#deploys"
            msg: >-
              ROLLBACK on {{ inventory_hostname }}:
              {{ target_version }} failed,
              reverted to {{ pre_version.stdout | trim }}
          delegate_to: localhost
          ignore_errors: true

If the health check fails after deploying the new version, the rescue block automatically deploys the previous version and alerts the team.


Flashcard Check: Rolling Updates

| Question | Answer |
|---|---|
| What does `serial: [1, "10%", "25%"]` do? | First batch: 1 host (canary). Second: 10%. Remaining: 25% at a time. |
| What happens when `max_fail_percentage` is exceeded? | Ansible stops the entire play — no more batches run. |
| When you use `delegate_to: localhost`, whose variables does the task see? | The target host's variables. `delegate_to` changes execution location, not variable context. |
| What's the difference between `block/rescue` and `ignore_errors`? | `block/rescue` is structured error handling (like try/catch). `ignore_errors` silently swallows all errors. |

Part 6: Molecule — Testing Before It Reaches Production

Molecule is Ansible's testing framework. It spins up containers, runs your role, verifies the result, and tears everything down.

Setting up Molecule for the webapp role

# roles/webapp/molecule/default/molecule.yml
---
driver:
  name: docker
platforms:
  - name: ubuntu-noble
    image: ubuntu:noble
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
  - name: rocky-9
    image: rockylinux:9
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
verifier:
  name: ansible

# roles/webapp/molecule/default/converge.yml
---
- name: Converge
  hosts: all
  vars:
    app_name: testapp
    app_version: "1.0.0"
    vault_db_password: "test-password"
    db_password: "{{ vault_db_password }}"
  roles:
    - role: webapp

# roles/webapp/molecule/default/verify.yml
---
- name: Verify
  hosts: all
  tasks:
    - name: Check application user exists
      ansible.builtin.getent:
        database: passwd
        key: deploy

    - name: Check application directory exists
      ansible.builtin.stat:
        path: /opt/testapp
      register: app_dir

    - name: Assert application directory is correct
      ansible.builtin.assert:
        that:
          - app_dir.stat.exists
          - app_dir.stat.isdir
          - app_dir.stat.pw_name == 'deploy'

    - name: Check nginx config is valid
      ansible.builtin.command: nginx -t
      changed_when: false
      become: true

    - name: Check systemd unit exists
      ansible.builtin.stat:
        path: /etc/systemd/system/testapp.service
      register: unit_file

    - name: Assert systemd unit exists
      ansible.builtin.assert:
        that: unit_file.stat.exists

Running Molecule

# Full test cycle (create → converge → idempotence → verify → destroy)
molecule test

# Just apply the role (leave containers running for debugging)
molecule converge

# Run verification
molecule verify

# SSH into a test container to poke around
molecule login -h ubuntu-noble

# Destroy test containers
molecule destroy

The molecule test sequence includes an idempotence check — it runs the playbook twice and fails if anything reports "changed" on the second run. This catches non-idempotent tasks (like bare command: calls without creates: guards).

# The idempotence check catches this:
# TASK [Create database] ********************************
# changed: [ubuntu-noble]     # First run: OK
# changed: [ubuntu-noble]     # Second run: NOT OK — should be "ok"
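The usual fix is a `creates:` guard (paths here are illustrative):

```yaml
- name: Create database (runs once, then reports "ok")
  ansible.builtin.command: /opt/testapp/bin/initdb --data-dir /var/lib/testapp
  args:
    creates: /var/lib/testapp/initialised   # skip the task if this path exists
```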

Remember: Molecule's test sequence is: dependency → lint → cleanup → destroy → syntax → create → prepare → converge → idempotence → verify → cleanup → destroy. The idempotence step is what separates "it runs" from "it's production-ready."


Part 7: Fact Caching and Performance at Scale

Gathering facts on 50 hosts takes time. On 500 hosts it takes minutes before your first task runs.

Enable fact caching

# ansible.cfg
[defaults]
forks = 30                              # Default is 5 — too low for any real fleet
gathering = smart                       # Only gather if cache is stale
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400            # 24 hours

[ssh_connection]
pipelining = True                       # Reduces SSH round trips per task
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

| Setting | What it does | Impact |
|---|---|---|
| `forks = 30` | Runs 30 hosts in parallel (default: 5) | Near-linear speedup for independent tasks |
| `gathering = smart` | Skips fact gathering if the cache is fresh | Saves 2–10 seconds per host |
| `pipelining = True` | Pipes modules to the remote interpreter instead of copying them over first | 2–3x faster per task |
| `ControlPersist=60s` | Reuses SSH connections for 60 seconds | Fewer SSH handshakes |

Gotcha: Stale fact caches cause subtle bugs. If someone adds a disk or changes an IP, your cached facts still show the old state. For critical operations (disk partitioning, network reconfiguration), force fresh facts: ansible.builtin.setup: gather_subset: [network, hardware]
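Sketched as a task:

```yaml
- name: Refresh network and hardware facts, bypassing the cache
  ansible.builtin.setup:
    gather_subset:
      - network
      - hardware
```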


Part 8: Ansible Tower/AWX vs. Command Line

For a team of one, ansible-playbook on your laptop works fine. For a team of ten, you need centralized execution, RBAC, audit trails, and scheduling. That's Tower/AWX.

| Feature | CLI (`ansible-playbook`) | Tower/AWX |
|---|---|---|
| Execution | Your laptop / CI runner | Centralized server |
| RBAC | None (SSH key = full access) | Role-based access per project, inventory, credential |
| Audit trail | Shell history, maybe CI logs | Full job log with who, when, what, and diff |
| Scheduling | Cron job | Built-in scheduler with dependencies |
| Credentials | Files on disk / env vars | Encrypted credential store with access control |
| API | None (wrap in scripts) | Full REST API for integration |
| Cost | Free | AWX = free; Tower (now Ansible Automation Platform) = Red Hat subscription |

Trivia: Red Hat open-sourced AWX in 2017 as the upstream for Ansible Tower. This was unusual — they essentially gave away the code for a commercial product. The strategy mirrors Red Hat's Fedora/RHEL model: free upstream builds community, paid product adds support and certification.

For this lesson's mission, the CLI approach works. But if you're running rolling updates weekly across multiple teams, Tower/AWX gives you the guardrails to sleep at night.


Part 9: Custom Modules and Callback Plugins (Brief Tour)

When built-in modules aren't enough

Occasionally, no module exists for what you need. You can write a custom module in Python and drop it in the role's library/ directory:

#!/usr/bin/python
# roles/webapp/library/app_health.py
from ansible.module_utils.basic import AnsibleModule
import requests

def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(required=True, type='str'),
            timeout=dict(default=10, type='int'),
        )
    )
    try:
        resp = requests.get(module.params['url'], timeout=module.params['timeout'])
        if resp.status_code == 200:
            module.exit_json(changed=False, status=resp.status_code, body=resp.text)
        else:
            module.fail_json(msg=f"Health check returned {resp.status_code}")
    except Exception as e:
        module.fail_json(msg=str(e))

if __name__ == '__main__':
    main()

Callback plugins: custom output formatting

Callback plugins change how Ansible displays output. The most useful built-in ones:

# ansible.cfg
[defaults]
# Show task timing (which tasks are slow?)
callbacks_enabled = timer, profile_tasks

# Output as YAML instead of JSON (more readable)
stdout_callback = yaml

profile_tasks shows how long each task took — essential for finding performance bottlenecks in large playbooks.


Part 10: Common Production Patterns

Bootstrap pattern — provisioning bare servers

# playbooks/bootstrap.yml
---
- name: Bootstrap new servers
  hosts: "{{ target }}"
  gather_facts: false    # Can't gather facts — Python might not be installed

  tasks:
    - name: Install Python (raw — no Python required)
      ansible.builtin.raw: |
        if command -v apt-get >/dev/null 2>&1; then
          apt-get update && apt-get install -y python3
        elif command -v dnf >/dev/null 2>&1; then
          dnf install -y python3
        fi
      changed_when: true

    - name: Now gather facts
      ansible.builtin.setup:

    - name: Run common role
      ansible.builtin.include_role:
        name: common

Under the Hood: The raw module doesn't require Python on the target — it sends commands over SSH directly. This is why gather_facts: false is mandatory here: the setup module (which gathers facts) is a Python module and would fail before Python is installed.

Upgrade pattern

# Upgrade with dry run first
ansible-playbook playbooks/rolling-upgrade.yml \
  -i inventory/aws_ec2.yml \
  -e target_version=2.1.0 \
  --check --diff

# Then for real, starting with one canary
ansible-playbook playbooks/rolling-upgrade.yml \
  -i inventory/aws_ec2.yml \
  -e target_version=2.1.0

Rollback pattern

```bash
# Roll back to the previous version
ansible-playbook playbooks/rolling-upgrade.yml \
  -i inventory/aws_ec2.yml \
  -e target_version=2.0.0
```

Because the playbook is idempotent, rolling back is just deploying the old version. No special rollback logic is needed beyond the `block`/`rescue` for per-host failures.
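The reason this works can be shown with a toy convergence function — the in-memory dict here is a stand-in for files and services on a real host, and the version numbers mirror the commands above:

```python
# Toy illustration of why an idempotent deploy doubles as rollback:
# the play only converges current state toward target_version, so
# "rollback" is the same code path with an older version string.
state = {"app_version": "2.0.0"}

def deploy(target_version):
    """Converge to target_version and report whether anything changed."""
    changed = state["app_version"] != target_version
    state["app_version"] = target_version
    return {"changed": changed, "version": state["app_version"]}

print(deploy("2.1.0"))  # upgrade: changed=True
print(deploy("2.1.0"))  # second run: idempotent, changed=False
print(deploy("2.0.0"))  # rollback: just another deploy, changed=True
```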


Exercises

Exercise 1 (Quick win, 5 minutes): Create a static inventory file with two groups (`webservers` and `dbservers`), three hosts total, and one group variable. Test it with `ansible-inventory --graph`.

Hint: Use INI format. Define groups with `[groupname]`, variables with `[groupname:vars]`.

Exercise 2 (15 minutes): Write a role that installs nginx and templates a config file with a `server_name` variable from defaults. Include a handler that reloads nginx when the config changes. Test idempotency by running the role twice.

Hint: The role needs `defaults/main.yml`, `tasks/main.yml`, `handlers/main.yml`, and `templates/nginx.conf.j2`. Use `validate: nginx -t -c %s` on the template task.

Exercise 3 (20 minutes): Encrypt a variable file with Ansible Vault, reference the vault variables from a plaintext vars file, and run a playbook that uses both. Verify with `cat` that the file on disk is encrypted, and with `ansible-vault view` that you can still read it.

Hint: Create `vault.yml` with `vault_secret: "my-secret"`, encrypt it, create `vars.yml` with `secret: "{{ vault_secret }}"`, and use `--ask-vault-pass` when running the playbook.

Exercise 4 (30 minutes): Write a rolling update playbook using `serial: [1, "25%"]` and `max_fail_percentage: 15`. Include a health check in `post_tasks` that retries 5 times. Test it with `--check --diff` against your inventory.

Hint: Use `ansible.builtin.uri` for the health check with `retries`, `delay`, and `until: result.status == 200`. The `serial` list means first batch = 1 host, then 25% batches.
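The health check from the hint might look roughly like this — the port and `/health` path are hypothetical, so substitute whatever endpoint your application exposes:

```yaml
post_tasks:
  - name: Wait until the app reports healthy
    ansible.builtin.uri:
      url: "http://{{ inventory_hostname }}:8080/health"   # hypothetical endpoint
      status_code: 200
    register: result
    until: result.status == 200
    retries: 5
    delay: 10
```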

Cheat Sheet

| Task | Command |
| --- | --- |
| Test inventory | `ansible-inventory -i inventory/ --graph` |
| Dry run | `ansible-playbook site.yml --check --diff` |
| Limit to one host | `ansible-playbook site.yml --limit web01` |
| Run specific tags | `ansible-playbook site.yml --tags deploy` |
| Debug a variable | `ansible -m debug -a "var=app_port" web01` |
| Create vault file | `ansible-vault create secrets.yml` |
| Edit vault file | `ansible-vault edit secrets.yml` |
| Rekey vault | `ansible-vault rekey secrets.yml` |
| Install collection | `ansible-galaxy collection install amazon.aws` |
| Install roles | `ansible-galaxy install -r requirements.yml` |
| Syntax check | `ansible-playbook site.yml --syntax-check` |
| List tasks | `ansible-playbook site.yml --list-tasks` |
| Molecule full test | `molecule test` |
| Molecule converge only | `molecule converge` |
| SSH into molecule container | `molecule login -h ubuntu-noble` |
| Profile task timing | Add `callbacks_enabled = profile_tasks` to `ansible.cfg` |
| Concept | Remember |
| --- | --- |
| Variable precedence | `defaults/` = bottom, `-e` = top, `vars/` = high |
| Handlers | Run at end of play, not after the notifying task |
| `serial` | Batch size for rolling updates |
| `max_fail_percentage` | Circuit breaker — stop rollout if too many hosts fail |
| `delegate_to` | Changes where a task runs, not whose variables it uses |
| `raw` module | Only module that doesn't require Python on the target |
| Vault encryption | AES-256-CTR with PBKDF2 key stretching |
| Idempotency test | Run the playbook twice — second run should show 0 changed |

Takeaways

- Dynamic inventory eliminates drift. If your inventory is a static file, it's wrong the moment someone launches or terminates an instance. Let the cloud API be the source of truth.

- `defaults/` vs `vars/` is the most consequential directory choice in a role. If users should override it, `defaults/`. If they shouldn't, `vars/`. Put it in both and you'll debug precedence issues at 2am.

- `serial` + `max_fail_percentage` is your circuit breaker. Every production playbook that touches multiple hosts needs both. Without them, a bad change rolls across the entire fleet.

- Vault-encrypt values, not entire files. The vault/vars split pattern lets you grep for variable usage and review PRs without the vault password.

- Molecule catches idempotency bugs before production does. The idempotence step (running the playbook twice and failing on "changed") is worth more than any other test.

- The `validate` parameter on template tasks is free insurance. If the config file has a syntax checker (`nginx -t`, `sshd -t`, `visudo -cf`), use it.
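A minimal sketch of such a task — file names and the handler name are illustrative:

```yaml
- name: Template nginx config, refusing to install a broken file
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s    # %s is the staged temp file checked before install
  notify: Reload nginx
```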