# Ansible: From Playbook to Production

Tags: lesson, ansible-inventory, roles, jinja2-templating, vault, molecule-testing, rolling-updates, delegation, aws-dynamic-inventory, callback-plugins, tower/awx

Topics: Ansible inventory, roles, Jinja2 templating, Vault, Molecule testing, rolling updates, delegation, AWS dynamic inventory, callback plugins, Tower/AWX
Level: L1–L2 (Foundations → Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained inline)
The Mission¶
You've just been handed responsibility for 50 web servers behind an AWS ALB. The application team needs a zero-downtime rolling upgrade — new application version, updated nginx config, rotated database credentials. The last person who did this manually took four hours, missed two servers, and left the credentials in a shell history file.
You're going to automate the entire thing with Ansible. By the end of this lesson you'll have:
- A dynamic inventory that discovers your EC2 instances automatically
- A production-grade role with every directory explained
- Vault-encrypted secrets that are safe to commit
- A rolling update playbook with health checks and automatic rollback
- A Molecule test suite that catches bugs before they reach production
- The mental model to debug the inevitable "why did it do that?" moments
We'll build from the ground up: inventory first, then roles, then secrets, then the rolling deploy, then testing. Each section adds a layer. By the end, you'll have a complete, production-ready pipeline.
Part 1: Inventory — Who Are We Talking To?¶
Before Ansible does anything, it needs to know who to talk to. That's the inventory.
Static inventory: the starting point¶
# inventory/hosts.ini
[webservers]
web01.prod.example.com
web02.prod.example.com ansible_host=10.0.1.12
web03.prod.example.com
[dbservers]
db01.prod.example.com ansible_port=2222
[production:children]
webservers
dbservers
[webservers:vars]
http_port=8080
app_env=production
| Syntax | What it does |
|---|---|
| `[webservers]` | Defines a group named "webservers" |
| `ansible_host=10.0.1.12` | Override the connection IP (when DNS doesn't resolve) |
| `[production:children]` | Create a parent group containing other groups |
| `[webservers:vars]` | Variables applied to every host in the group |
Static inventory works for five servers. For 50 EC2 instances that scale up and down? You need dynamic inventory.
Dynamic inventory: let AWS tell you who exists¶
# inventory/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
- us-east-1
- us-west-2
keyed_groups:
- key: tags.Role
prefix: role
- key: tags.Environment
prefix: env
- key: placement.availability_zone
prefix: az
filters:
tag:ManagedBy: ansible
instance-state-name: running
compose:
ansible_host: private_ip_address
ansible_user: "'ubuntu'"
This file is the entire inventory. No hostnames to maintain. Ansible queries the AWS API
at runtime, discovers every running instance tagged ManagedBy: ansible, and groups them
by their Role tag, Environment tag, and availability zone.
Output looks like:
@all:
|--@env_production:
| |--10.0.1.10
| |--10.0.1.11
| |--10.0.1.12
|--@role_webserver:
| |--10.0.1.10
| |--10.0.1.11
|--@az_us_east_1a:
| |--10.0.1.10
Under the Hood: The `aws_ec2` plugin calls the EC2 `DescribeInstances` API. The `keyed_groups` directive creates Ansible groups from instance metadata. The `compose` block builds per-host variables using Jinja2 expressions evaluated against the API response. The single quotes around `'ubuntu'` in compose are deliberate — without them, Ansible would try to resolve `ubuntu` as a variable name.
Gotcha: The `aws_ec2` plugin requires the `amazon.aws` collection and the `boto3` Python package. Install both: `ansible-galaxy collection install amazon.aws` and `pip install boto3`. If you forget `boto3`, the error message is unhelpfully vague: `"Failed to import the required Python library"`.
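To make `keyed_groups` less magical, here is a small pure-Python sketch of the grouping logic (my own simplification, not the plugin's actual code): walk each instance's metadata by dotted key, prefix the value, and sanitize it into a group name.

```python
def build_keyed_groups(instances, keyed):
    """Rough sketch of keyed_groups: one group per (prefix, metadata value)
    pair, with non-alphanumeric characters replaced by underscores."""
    groups = {}
    for inst in instances:
        for key, prefix in keyed:
            value = inst
            for part in key.split("."):   # "tags.Role" -> inst["tags"]["Role"]
                value = value.get(part) if isinstance(value, dict) else None
            if value is None:
                continue                  # instance lacks this tag: no group
            name = prefix + "_" + "".join(
                c if c.isalnum() else "_" for c in str(value)
            )
            groups.setdefault(name, []).append(inst["private_ip_address"])
    return groups

instances = [
    {"private_ip_address": "10.0.1.10",
     "tags": {"Role": "webserver", "Environment": "production"},
     "placement": {"availability_zone": "us-east-1a"}},
    {"private_ip_address": "10.0.1.11",
     "tags": {"Role": "webserver", "Environment": "production"},
     "placement": {"availability_zone": "us-east-1b"}},
]
keyed = [("tags.Role", "role"), ("tags.Environment", "env"),
         ("placement.availability_zone", "az")]

groups = build_keyed_groups(instances, keyed)
print(groups["role_webserver"])  # ['10.0.1.10', '10.0.1.11']
print(groups["az_us_east_1a"])   # ['10.0.1.10']
```

This is also why the group names in the output graph look like `az_us_east_1a`: the hyphens in `us-east-1a` are sanitized into underscores.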
group_vars and host_vars: variables that follow the inventory¶
inventory/
aws_ec2.yml
group_vars/
all.yml # Every host gets these
role_webserver.yml # Only hosts tagged Role=webserver
env_production.yml # Only hosts tagged Environment=production
host_vars/
10.0.1.10.yml # Overrides for one specific host
# inventory/group_vars/all.yml
ntp_servers:
- 169.254.169.123 # AWS time sync service
timezone: UTC
monitoring_agent: prometheus-node-exporter
# inventory/group_vars/role_webserver.yml
nginx_worker_processes: auto
nginx_worker_connections: 2048
app_port: 8080
health_check_path: /health
Variables in host_vars/ override group_vars/. This is part of Ansible's 22-level
precedence hierarchy, but the practical rule is simple: role defaults are the bottom,
extra vars (-e) are the top, everything else falls in between.
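That practical rule can be modeled as a simple layered dict merge, lowest precedence applied first (a toy model of the 22-level system, not Ansible's actual resolver):

```python
# Toy precedence model: later layers win. Real Ansible has 22 levels;
# these four are the ones you touch daily.
role_defaults = {"app_port": 8080, "app_user": "deploy"}   # roles/webapp/defaults/main.yml
group_vars    = {"app_port": 9090}                         # group_vars/role_webserver.yml
host_vars     = {"app_user": "webdeploy"}                  # host_vars/10.0.1.10.yml
extra_vars    = {"app_port": 9999}                         # -e app_port=9999

resolved = {}
for layer in (role_defaults, group_vars, host_vars, extra_vars):
    resolved.update(layer)   # each layer overrides the one below it

print(resolved)  # {'app_port': 9999, 'app_user': 'webdeploy'}
```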
Flashcard Check: Inventory¶
| Question | Answer |
|---|---|
| What's the difference between static and dynamic inventory? | Static = manually maintained file. Dynamic = plugin/script queries an API at runtime. |
| What does `keyed_groups` do in a dynamic inventory plugin? | Creates Ansible groups from instance metadata (tags, zones, types). |
| Where do you put variables that apply to all hosts in a group? | `group_vars/<group_name>.yml` alongside the inventory file. |
| What always wins in Ansible's variable precedence? | Extra vars (`-e` on the command line) — they override everything. |
Part 2: Roles — Reusable, Testable Building Blocks¶
A role is a directory structure that packages tasks, templates, variables, handlers, and metadata into a reusable unit. Think of it like a function in code — it takes inputs (variables), does work (tasks), and has side effects (handlers).
The complete role directory layout¶
roles/
webapp/
defaults/main.yml # Default variables — LOW precedence, meant to be overridden
vars/main.yml # Role variables — HIGH precedence, hard to override
tasks/main.yml # The actual work
handlers/main.yml # Actions triggered by notify (e.g., restart services)
templates/ # Jinja2 templates (.j2 files)
nginx-vhost.conf.j2
app-config.yml.j2
files/ # Static files copied as-is (no templating)
logrotate-webapp
meta/main.yml # Role metadata: dependencies, platforms, author
molecule/ # Test suite (we'll build this in Part 5)
default/
molecule.yml
converge.yml
verify.yml
Let's walk through each file.
defaults/main.yml — the "safe to override" knobs¶
# roles/webapp/defaults/main.yml
---
app_name: mywebapp
app_version: "1.0.0"
app_port: 8080
app_user: deploy
app_group: deploy
app_home: "/opt/{{ app_name }}"
nginx_listen_port: 80
nginx_server_name: "{{ inventory_hostname }}"
nginx_ssl_enabled: false
health_check_url: "http://127.0.0.1:{{ app_port }}/health"
health_check_retries: 5
health_check_delay: 10
These are the knobs consumers of your role can turn. Put anything here that a user might reasonably want to change per environment.
vars/main.yml — the "don't touch these" constants¶
# roles/webapp/vars/main.yml
---
# Internal paths — changing these would break the role
app_bin: "{{ app_home }}/bin"
app_config_dir: "{{ app_home }}/config"
app_log_dir: "/var/log/{{ app_name }}"
nginx_config_path: "/etc/nginx/sites-available/{{ app_name }}.conf"
nginx_enabled_path: "/etc/nginx/sites-enabled/{{ app_name }}.conf"
# System packages required regardless of configuration
required_packages:
- nginx
- python3
- python3-pip
- acl
War Story: A team put `app_port: 8080` in both `defaults/main.yml` and `vars/main.yml` during a refactor. The defaults file said port 8080, the vars file also said 8080 — so testing caught nothing. Three months later, the staging team set `app_port: 9090` in their `group_vars/staging.yml`. It didn't work. Their port override was being silently stomped by `vars/main.yml`, which has higher precedence than group_vars. They spent two hours checking every inventory file and host_var before someone ran `ansible -m debug -a "var=app_port" staging-web01` and saw the value was still 8080. The fix: move `app_port` out of `vars/` — it belonged in `defaults/` all along. Rule: if users should be able to override it, it goes in defaults. If they shouldn't, it goes in vars. Never put the same variable in both.
tasks/main.yml — the work¶
# roles/webapp/tasks/main.yml
---
- name: Include OS-specific variables
ansible.builtin.include_vars: "{{ ansible_os_family | lower }}.yml"
- name: Install required packages
ansible.builtin.package:
name: "{{ required_packages }}"
state: present
become: true
- name: Create application user
ansible.builtin.user:
name: "{{ app_user }}"
group: "{{ app_group }}"
home: "{{ app_home }}"
shell: /usr/sbin/nologin
system: true
become: true
- name: Create application directories
ansible.builtin.file:
path: "{{ item }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: "0755"
loop:
- "{{ app_home }}"
- "{{ app_bin }}"
- "{{ app_config_dir }}"
- "{{ app_log_dir }}"
become: true
- name: Deploy application binary
ansible.builtin.copy:
src: "{{ app_name }}-{{ app_version }}.jar"
dest: "{{ app_bin }}/{{ app_name }}.jar"
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: "0755"
become: true
notify: Restart application
- name: Template application config
ansible.builtin.template:
src: app-config.yml.j2
dest: "{{ app_config_dir }}/config.yml"
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: "0640"
become: true
notify: Restart application
- name: Template nginx vhost
ansible.builtin.template:
src: nginx-vhost.conf.j2
dest: "{{ nginx_config_path }}"
owner: root
group: root
mode: "0644"
validate: nginx -t -c %s
become: true
notify: Reload nginx
- name: Enable nginx vhost
ansible.builtin.file:
src: "{{ nginx_config_path }}"
dest: "{{ nginx_enabled_path }}"
state: link
become: true
notify: Reload nginx
- name: Deploy systemd unit
ansible.builtin.template:
src: webapp.service.j2
dest: "/etc/systemd/system/{{ app_name }}.service"
owner: root
group: root
mode: "0644"
become: true
notify:
- Reload systemd
- Restart application
- name: Ensure application is running
ansible.builtin.systemd:
name: "{{ app_name }}"
state: started
enabled: true
become: true
Notice: become: true is on individual tasks, not at the play level. The validate
parameter on the nginx template runs nginx -t before writing the file — if the
config is broken, the original file stays untouched.
handlers/main.yml — the "only if something changed" actions¶
# roles/webapp/handlers/main.yml
---
- name: Reload systemd
ansible.builtin.systemd:
daemon_reload: true
become: true
- name: Restart application
ansible.builtin.systemd:
name: "{{ app_name }}"
state: restarted
become: true
- name: Reload nginx
ansible.builtin.systemd:
name: nginx
state: reloaded
become: true
Gotcha: Handlers run at the end of the play, not after the task that notified them. If a task notifies "Restart application" but a later task fails, the handler never runs — your config changed but the service still runs the old version. Use `meta: flush_handlers` if you need handlers to fire immediately.
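The handler lifecycle can be sketched in a few lines of Python (a toy model, not Ansible internals): notifications deduplicate, and nothing fires until the play ends, unless a task fails first.

```python
def run_play(tasks):
    """Toy handler model. Each task is (changed, failed, handler_name).
    Returns the list of handlers that actually ran."""
    notified = []
    for changed, failed, handler in tasks:
        if changed and handler and handler not in notified:
            notified.append(handler)   # duplicate notifications collapse to one
        if failed:
            return []                  # play aborts: pending handlers never fire
    return notified                    # handlers flush at the end of the play

# Two tasks notify the same handler; it fires once, at the end.
print(run_play([(True, False, "Restart application"),
                (True, False, "Restart application")]))
# ['Restart application']

# A later failure means the restart never happens at all.
print(run_play([(True, False, "Restart application"),
                (False, True, None)]))
# []
```

The second case is exactly the gotcha above: the config changed, but the service keeps running the old version.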
meta/main.yml — dependencies and metadata¶
# roles/webapp/meta/main.yml
---
dependencies:
- role: common
- role: monitoring
vars:
monitoring_port: "{{ app_port }}"
galaxy_info:
author: ops-team
description: Deploy and configure a Java web application with nginx reverse proxy
min_ansible_version: "2.14"
platforms:
- name: Ubuntu
versions: [jammy, noble]
- name: EL
versions: [8, 9]
Dependencies run before the role's tasks. The common role might install base packages
and configure NTP; the monitoring role might install the Prometheus node exporter.
Trivia: The name "Ansible Galaxy" follows the science fiction theme. Ansible itself is named after the instantaneous communication device from Ursula K. Le Guin's 1966 novel Rocannon's World. Galaxy launched in 2013 with about 200 roles — by 2024, it hosted over 40,000 roles and collections.
Part 3: Jinja2 Templating — Dynamic Config Files¶
Templates are where variables become real config files. Ansible uses Jinja2, the same engine behind Flask and Django templates.
{# templates/nginx-vhost.conf.j2 #}
upstream {{ app_name }}_backend {
{%- for host in groups['role_webserver'] %}
server {{ hostvars[host]['ansible_host'] }}:{{ app_port }};
{%- endfor %}
}
server {
listen {{ nginx_listen_port }};
server_name {{ nginx_server_name }};
location / {
proxy_pass http://{{ app_name }}_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
location /health {
proxy_pass http://127.0.0.1:{{ app_port }}/health;
access_log off;
}
{% if nginx_ssl_enabled %}
listen 443 ssl;
ssl_certificate /etc/ssl/certs/{{ nginx_server_name }}.pem;
ssl_certificate_key /etc/ssl/private/{{ nginx_server_name }}.key;
{% endif %}
}
Essential Jinja2 filters you'll actually use¶
| Filter | What it does | Example |
|---|---|---|
| `default('val')` | Fallback if variable is undefined | `{{ timeout \| default(30) }}` |
| `mandatory` | Fail if variable is undefined | `{{ db_host \| mandatory }}` |
| `to_nice_json` | Pretty-print as JSON | `{{ config_dict \| to_nice_json }}` |
| `regex_replace` | Regex substitution | `{{ hostname \| regex_replace('\.example\.com$', '') }}` |
| `join(', ')` | Join list into string | `{{ dns_servers \| join(', ') }}` |
| `selectattr` | Filter list of dicts | `{{ users \| selectattr('active') \| list }}` |
| `b64encode` | Base64 encode | `{{ secret \| b64encode }}` |
| `password_hash` | Hash a password | `{{ pass \| password_hash('sha512') }}` |
| `basename` | Extract filename from path | `{{ '/etc/nginx/conf.d/app.conf' \| basename }}` |
Gotcha: YAML treats `{` as the start of a mapping. Any value that begins with `{{` must be quoted: `message: "{{ greeting }} world"`. Without quotes you get the cryptic error `mapping values are not allowed in this context`. This is the single most common Ansible YAML error.
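You can reproduce the failure mode with PyYAML directly (assuming the `yaml` package is installed; the exact error wording differs slightly from Ansible's):

```python
import yaml

# Unquoted Jinja2: YAML sees '{' and tries to parse a flow mapping, then chokes.
try:
    yaml.safe_load("message: {{ greeting }} world")
    raised = False
except yaml.YAMLError:
    raised = True
print(raised)  # True

# Quoted: the whole value is just a string; Jinja2 gets to see it later.
doc = yaml.safe_load('message: "{{ greeting }} world"')
print(doc)  # {'message': '{{ greeting }} world'}
```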
The {%- syntax (note the dash) strips whitespace before the tag. Without it, {% for %}
loops add blank lines between entries. For nginx this is cosmetic; for YAML or Python
configs it breaks parsing.
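You can see the difference with the `jinja2` package, the same engine Ansible uses (assuming it is installed):

```python
from jinja2 import Template

hosts = ["10.0.1.10", "10.0.1.11"]

# Without the dash: the newlines around each block tag survive,
# leaving blank lines between upstream entries.
plain = Template(
    "upstream backend {\n{% for h in hosts %}\n"
    "  server {{ h }}:8080;\n{% endfor %}\n}"
).render(hosts=hosts)

# With {%- ... %}: the whitespace before each tag is stripped.
trimmed = Template(
    "upstream backend {\n{%- for h in hosts %}\n"
    "  server {{ h }}:8080;\n{%- endfor %}\n}"
).render(hosts=hosts)

print("\n\n" in plain)    # True — blank lines between entries
print(trimmed)
# upstream backend {
#   server 10.0.1.10:8080;
#   server 10.0.1.11:8080;
# }
```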
Flashcard Check: Roles and Templates¶
| Question | Answer |
|---|---|
| What's the difference between `defaults/` and `vars/` in a role? | `defaults/` = low precedence, meant to be overridden. `vars/` = high precedence, hard to override. |
| When do handlers run? | At the end of the play, not after the notifying task. Use `meta: flush_handlers` for immediate execution. |
| What does `validate: nginx -t -c %s` do on a template task? | Runs nginx config validation before writing the file. If validation fails, the original is untouched. |
| Why must `{{ var }}` be quoted in YAML? | YAML interprets `{` as a mapping start. Unquoted Jinja2 causes a parse error. |
| What does `{%-` do in a Jinja2 template? | Strips whitespace before the tag, preventing blank lines in loop output. |
Part 4: Vault — Secrets That Are Safe to Commit¶
Your rolling upgrade needs new database credentials. Those credentials need to live somewhere in your repo so the playbook can use them. Ansible Vault encrypts them with AES-256 so the ciphertext is safe to commit to git.
The essential vault commands¶
# Create a new encrypted file
ansible-vault create group_vars/env_production/vault.yml
# Encrypt an existing file
ansible-vault encrypt group_vars/env_production/secrets.yml
# Edit an encrypted file (decrypts to temp file, opens $EDITOR, re-encrypts on save)
ansible-vault edit group_vars/env_production/vault.yml
# View without decrypting to disk
ansible-vault view group_vars/env_production/vault.yml
# Change the encryption password
ansible-vault rekey group_vars/env_production/vault.yml
# Encrypt a single string (for inline use in YAML)
ansible-vault encrypt_string 'db_p@ssw0rd_2026' --name 'vault_db_password'
Under the Hood: Vault uses AES-256-CTR encryption with HMAC-SHA256 for integrity. The vault password is stretched using PBKDF2 with a random salt. Each encrypted file or string is self-contained — the salt, HMAC, and ciphertext are all embedded under the `$ANSIBLE_VAULT;1.1;AES256` header. When you run `ansible-vault edit`, the plaintext is decrypted to a temporary file for the duration of the edit and re-encrypted on save. On many systems the temp directory is memory-backed, but don't rely on the plaintext never touching disk.
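The key-stretching step is plain PBKDF2, reproducible with the standard library. This sketch mirrors Vault's 32+32+16-byte split of the derived key material; treat the iteration count as illustrative rather than authoritative:

```python
import hashlib
import os

password = b"your-vault-password"
salt = os.urandom(32)   # Vault stores the salt in the encrypted file's header

# Stretch the password into 80 bytes of key material.
key = hashlib.pbkdf2_hmac("sha256", password, salt, 10000, dklen=80)

# Vault derives three things from the stretched output:
aes_key  = key[:32]    # AES-256 cipher key
hmac_key = key[32:64]  # HMAC-SHA256 integrity key
iv       = key[64:]    # 16-byte counter-mode IV

print(len(aes_key), len(hmac_key), len(iv))  # 32 32 16
```

Because the salt is random per file, encrypting the same secret twice produces different ciphertext, which is why re-encrypting an unchanged vault file still shows a diff in git.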
The vault/vars split pattern¶
Don't encrypt your entire vars file. Use two files:
inventory/
group_vars/
env_production/
vars.yml # Plaintext — references vault variables
vault.yml # Encrypted — contains the actual secrets
# group_vars/env_production/vault.yml (encrypted)
vault_db_password: "s3cr3t_pr0d_p4ss"
vault_api_key: "ak_prod_xK9mP2qR7vN4"
vault_tls_key: |
-----BEGIN PRIVATE KEY-----
MIIEvQIBADANBgkqhkiG9w0BAQEFA...
-----END PRIVATE KEY-----
# group_vars/env_production/vars.yml (plaintext)
db_password: "{{ vault_db_password }}"
api_key: "{{ vault_api_key }}"
tls_key: "{{ vault_tls_key }}"
Why the split? You can grep for where db_password is used without decrypting
anything. The vault file holds values; the vars file provides names. When reviewing a PR,
you can see that db_password was changed without needing the vault password.
Running playbooks with vault¶
# Interactive prompt
ansible-playbook site.yml --ask-vault-pass
# Password file (for CI/CD)
echo "your-vault-password" > ~/.vault_pass
chmod 600 ~/.vault_pass
ansible-playbook site.yml --vault-password-file ~/.vault_pass
# Environment variable (also common in CI)
export ANSIBLE_VAULT_PASSWORD_FILE=~/.vault_pass
ansible-playbook site.yml
Gotcha: Never pass secrets directly on the command line: `ansible-vault encrypt_string 'actual_password'` puts the password in your shell history. Use `--stdin-name` with a pipe instead: `echo -n 'actual_password' | ansible-vault encrypt_string --stdin-name 'db_password'`
Multiple vault IDs: different secrets for different teams¶
# Encrypt with specific vault IDs
ansible-vault encrypt --vault-id dev@prompt secrets-dev.yml
ansible-vault encrypt --vault-id prod@/path/to/prod-password secrets-prod.yml
# Run with multiple vault IDs
ansible-playbook site.yml \
--vault-id dev@prompt \
--vault-id prod@/path/to/prod-password
The dev team has one vault password, the prod team has another. Neither can decrypt the other's secrets.
Part 5: The Rolling Update — Zero Downtime on 50 Servers¶
This is where everything comes together. We need to upgrade 50 web servers behind an AWS ALB without dropping a single request.
The strategy¶
- Process servers in small batches (`serial`)
- For each batch: remove from load balancer, upgrade, health check, re-add
- If too many fail, stop the entire rollout (`max_fail_percentage`)
- If a single server fails, roll it back (`block`/`rescue`)
# playbooks/rolling-upgrade.yml
---
- name: Rolling upgrade — {{ app_name }} {{ app_version }}
hosts: role_webserver
serial:
- 1 # First: single canary
- "10%" # Then: 10% at a time
- "25%" # Then: 25% at a time
max_fail_percentage: 10
pre_tasks:
- name: Verify current version
ansible.builtin.command: "cat {{ app_home }}/VERSION"
register: pre_version
changed_when: false
- name: Remove from ALB target group
community.aws.elb_target:
target_group_arn: "{{ alb_target_group_arn }}"
target_id: "{{ ansible_host }}"
state: absent
delegate_to: localhost
- name: Wait for connections to drain
ansible.builtin.pause:
seconds: 30
roles:
- role: webapp
vars:
app_version: "{{ target_version }}"
post_tasks:
- name: Wait for application health
ansible.builtin.uri:
url: "{{ health_check_url }}"
return_content: true
status_code: 200
register: health
retries: "{{ health_check_retries }}"
delay: "{{ health_check_delay }}"
until: health.status == 200
- name: Re-add to ALB target group
community.aws.elb_target:
target_group_arn: "{{ alb_target_group_arn }}"
target_id: "{{ ansible_host }}"
target_port: "{{ app_port }}"
state: present
delegate_to: localhost
- name: Wait for ALB health check to pass
ansible.builtin.pause:
seconds: 15
- name: Verify new version
ansible.builtin.command: "cat {{ app_home }}/VERSION"
register: post_version
changed_when: false
- name: Report upgrade status
ansible.builtin.debug:
msg: >-
{{ inventory_hostname }}: {{ pre_version.stdout }}
→ {{ post_version.stdout }}
Let's break down the key decisions:
serial: [1, "10%", "25%"] — Graduated rollout. The first batch is a single canary.
If that one server survives, we widen to 10%, then 25%. This catches both "the new version
is completely broken" (canary fails) and "the new version fails under load" (10% batch
reveals issues).
max_fail_percentage: 10 — The circuit breaker. If more than 10% of hosts in any
batch fail, Ansible stops the entire play. Without this, a bad deploy rolls across all 50
servers.
Mental Model: Think of `serial` + `max_fail_percentage` as a circuit breaker pattern — the same concept used in microservice architectures. `serial` controls the batch size (how much current flows). `max_fail_percentage` is the trip threshold (how much failure before the breaker opens). Together, they limit blast radius.
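Here is a quick sketch of the batch math (my own helper, mirroring the documented behavior: percentages are taken against the total host count with a minimum of one host, and the last `serial` entry repeats until the fleet is done):

```python
def batch_sizes(total_hosts, serial):
    """Compute rolling-update batch sizes for a serial spec like [1, "10%", "25%"]."""
    sizes, remaining, i = [], total_hosts, 0
    while remaining > 0:
        spec = serial[min(i, len(serial) - 1)]   # last entry repeats
        if isinstance(spec, str) and spec.endswith("%"):
            n = max(1, total_hosts * int(spec[:-1]) // 100)
        else:
            n = int(spec)
        sizes.append(min(n, remaining))
        remaining -= sizes[-1]
        i += 1
    return sizes

print(batch_sizes(50, [1, "10%", "25%"]))  # [1, 5, 12, 12, 12, 8]
```

So for the 50-server fleet: one canary, then a batch of 5, then batches of 12 until only 8 remain. Six batches total, and a failure in any early batch stops the rollout before most of the fleet is touched.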
delegate_to: localhost — The ALB API calls run on your control node, not on the
web servers. The inventory_hostname and ansible_host variables still refer to the
target server — delegate_to changes where the task runs, not whose variables it uses.
Rollback with block/rescue¶
For critical deployments, wrap the upgrade in a block/rescue:
tasks:
- name: Deploy with rollback
block:
- name: Deploy new version
ansible.builtin.include_role:
name: webapp
vars:
app_version: "{{ target_version }}"
- name: Verify health after deploy
ansible.builtin.uri:
url: "{{ health_check_url }}"
status_code: 200
register: health
retries: 5
delay: 10
until: health.status == 200
rescue:
- name: ROLLBACK — deploy previous version
ansible.builtin.include_role:
name: webapp
vars:
app_version: "{{ pre_version.stdout | trim }}"
- name: Notify on rollback
community.general.slack:
token: "{{ vault_slack_token }}"
channel: "#deploys"
msg: >-
ROLLBACK on {{ inventory_hostname }}:
{{ target_version }} failed,
reverted to {{ pre_version.stdout | trim }}
delegate_to: localhost
ignore_errors: true
If the health check fails after deploying the new version, the rescue block
automatically deploys the previous version and alerts the team.
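The block/rescue control flow reduces to a try/except, which a few lines of Python make concrete (hypothetical stand-in functions, not Ansible code):

```python
def deploy_with_rollback(deploy, health_ok, rollback, retries=5):
    """block: deploy + health check; rescue: redeploy the previous version."""
    try:
        deploy()
        for _ in range(retries):        # the retries/until loop
            if health_ok():
                return "deployed"
        raise RuntimeError("health check never passed")
    except Exception:
        rollback()                      # the rescue branch
        return "rolled back"

# Healthy deploy: the rescue branch never runs.
print(deploy_with_rollback(lambda: None, lambda: True, lambda: None))
# deployed

# Failing health check: rollback fires exactly once.
calls = []
print(deploy_with_rollback(lambda: None, lambda: False,
                           lambda: calls.append("rollback")))
# rolled back
```

Note that, like `rescue`, the except clause catches the health-check failure itself, so a broken rollout on one host is converted into a handled event rather than a play abort.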
Flashcard Check: Rolling Updates¶
| Question | Answer |
|---|---|
| What does `serial: [1, "10%", "25%"]` do? | First batch: 1 host (canary). Second: 10%. Remaining: 25% at a time. |
| What happens when `max_fail_percentage` is exceeded? | Ansible stops the entire play — no more batches run. |
| When you use `delegate_to: localhost`, whose variables does the task see? | The target host's variables. `delegate_to` changes execution location, not variable context. |
| What's the difference between `block`/`rescue` and `ignore_errors`? | `block`/`rescue` is structured error handling (like try/catch). `ignore_errors` silently swallows all errors. |
Part 6: Molecule — Testing Before It Reaches Production¶
Molecule is Ansible's testing framework. It spins up containers, runs your role, verifies the result, and tears everything down.
Setting up Molecule for the webapp role¶
# roles/webapp/molecule/default/molecule.yml
---
driver:
name: docker
platforms:
- name: ubuntu-noble
image: ubuntu:noble
pre_build_image: true
command: /lib/systemd/systemd
privileged: true
volumes:
- /sys/fs/cgroup:/sys/fs/cgroup:rw
- name: rocky-9
image: rockylinux:9
pre_build_image: true
command: /lib/systemd/systemd
privileged: true
provisioner:
name: ansible
playbooks:
converge: converge.yml
verify: verify.yml
verifier:
name: ansible
# roles/webapp/molecule/default/converge.yml
---
- name: Converge
hosts: all
vars:
app_name: testapp
app_version: "1.0.0"
vault_db_password: "test-password"
db_password: "{{ vault_db_password }}"
roles:
- role: webapp
# roles/webapp/molecule/default/verify.yml
---
- name: Verify
hosts: all
tasks:
- name: Check application user exists
ansible.builtin.getent:
database: passwd
key: deploy
- name: Check application directory exists
ansible.builtin.stat:
path: /opt/testapp
register: app_dir
- name: Assert application directory is correct
ansible.builtin.assert:
that:
- app_dir.stat.exists
- app_dir.stat.isdir
- app_dir.stat.pw_name == 'deploy'
- name: Check nginx config is valid
ansible.builtin.command: nginx -t
changed_when: false
become: true
- name: Check systemd unit exists
ansible.builtin.stat:
path: /etc/systemd/system/testapp.service
register: unit_file
- name: Assert systemd unit exists
ansible.builtin.assert:
that: unit_file.stat.exists
Running Molecule¶
# Full test cycle (create → converge → idempotence → verify → destroy)
molecule test
# Just apply the role (leave containers running for debugging)
molecule converge
# Run verification
molecule verify
# SSH into a test container to poke around
molecule login -h ubuntu-noble
# Destroy test containers
molecule destroy
The molecule test sequence includes an idempotence check — it runs the playbook
twice and fails if anything reports "changed" on the second run. This catches
non-idempotent tasks (like bare command: calls without creates: guards).
# The idempotence check catches this:
# TASK [Create database] ********************************
# changed: [ubuntu-noble] # First run: OK
# changed: [ubuntu-noble] # Second run: NOT OK — should be "ok"
Remember: Molecule's test sequence is: dependency → lint → cleanup → destroy → syntax → create → prepare → converge → idempotence → verify → cleanup → destroy. The idempotence step is what separates "it runs" from "it's production-ready."
Part 7: Fact Caching and Performance at Scale¶
Gathering facts on 50 hosts takes time. On 500 hosts it takes minutes before your first task runs.
Enable fact caching¶
# ansible.cfg
[defaults]
forks = 30 # Default is 5 — too low for any real fleet
gathering = smart # Only gather if cache is stale
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400 # 24 hours
[ssh_connection]
pipelining = True # Reduces SSH round trips per task
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
| Setting | What it does | Impact |
|---|---|---|
| `forks = 30` | Run 30 hosts in parallel (default: 5) | Linear speedup for independent tasks |
| `gathering = smart` | Skip fact gathering if cache is fresh | Saves 2–10 seconds per host |
| `pipelining = True` | Run modules in-process instead of copying | 2–3x faster per task |
| `ControlPersist=60s` | Reuse SSH connections for 60 seconds | Fewer SSH handshakes |
Gotcha: Stale fact caches cause subtle bugs. If someone adds a disk or changes an IP, your cached facts still show the old state. For critical operations (disk partitioning, network reconfiguration), force fresh facts with an explicit `ansible.builtin.setup` task, e.g. `gather_subset: [network, hardware]`.
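The jsonfile cache's staleness rule is essentially an mtime comparison against `fact_caching_timeout`; here is a sketch of that behavior (my own simplification, not the plugin's code):

```python
import json
import os
import tempfile
import time

def cached_facts(path, timeout):
    """Return cached facts if the file is younger than `timeout` seconds, else None."""
    if not os.path.exists(path):
        return None
    if time.time() - os.path.getmtime(path) > timeout:
        return None                # stale: caller must re-gather from the host
    with open(path) as f:
        return json.load(f)

# Demo: one cache file per host, keyed by inventory hostname.
path = os.path.join(tempfile.mkdtemp(), "web01")
with open(path, "w") as f:
    json.dump({"ansible_default_ipv4": {"address": "10.0.1.10"}}, f)

print(cached_facts(path, 86400))   # fresh within 24h -> facts dict
print(cached_facts(path, -1))      # everything counts as stale -> None
```

The bug class in the gotcha falls out directly: the cache only knows the file's age, not whether the host's real state changed underneath it.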
Part 8: Ansible Tower/AWX vs. Command Line¶
For a team of one, ansible-playbook on your laptop works fine. For a team of ten, you
need centralized execution, RBAC, audit trails, and scheduling. That's Tower/AWX.
| Feature | CLI (`ansible-playbook`) | Tower/AWX |
|---|---|---|
| Execution | Your laptop/CI runner | Centralized server |
| RBAC | None (SSH key = full access) | Role-based access per project, inventory, credential |
| Audit trail | Shell history, maybe CI logs | Full job log with who, when, what, and diff |
| Scheduling | Cron job | Built-in scheduler with dependencies |
| Credentials | Files on disk / env vars | Encrypted credential store with access control |
| API | None (wrap in scripts) | Full REST API for integration |
| Cost | Free | AWX = free, Tower (now AAP) = Red Hat subscription |
Trivia: Red Hat open-sourced AWX in 2017 as the upstream for Ansible Tower. This was unusual — they essentially gave away the code for a commercial product. The strategy mirrors Red Hat's Fedora/RHEL model: free upstream builds community, paid product adds support and certification.
For this lesson's mission, the CLI approach works. But if you're running rolling updates weekly across multiple teams, Tower/AWX gives you the guardrails to sleep at night.
Part 9: Custom Modules and Callback Plugins (Brief Tour)¶
When built-in modules aren't enough¶
Occasionally, no module exists for what you need. You can write a custom module in Python
and drop it in the role's library/ directory:
#!/usr/bin/python
# roles/webapp/library/app_health.py
from ansible.module_utils.basic import AnsibleModule
import requests
def main():
module = AnsibleModule(
argument_spec=dict(
url=dict(required=True, type='str'),
timeout=dict(default=10, type='int'),
)
)
try:
resp = requests.get(module.params['url'], timeout=module.params['timeout'])
if resp.status_code == 200:
module.exit_json(changed=False, status=resp.status_code, body=resp.text)
else:
module.fail_json(msg=f"Health check returned {resp.status_code}")
except Exception as e:
module.fail_json(msg=str(e))
if __name__ == '__main__':
main()
Callback plugins: custom output formatting¶
Callback plugins change how Ansible displays output. The most useful built-in ones:
# ansible.cfg
[defaults]
# Show task timing (which tasks are slow?)
callbacks_enabled = timer, profile_tasks
# Output as YAML instead of JSON (more readable)
stdout_callback = yaml
profile_tasks shows how long each task took — essential for finding performance
bottlenecks in large playbooks.
Part 10: Common Production Patterns¶
Bootstrap pattern — provisioning bare servers¶
# playbooks/bootstrap.yml
---
- name: Bootstrap new servers
hosts: "{{ target }}"
gather_facts: false # Can't gather facts — Python might not be installed
tasks:
- name: Install Python (raw — no Python required)
ansible.builtin.raw: |
if command -v apt-get >/dev/null 2>&1; then
apt-get update && apt-get install -y python3
elif command -v dnf >/dev/null 2>&1; then
dnf install -y python3
fi
changed_when: true
- name: Now gather facts
ansible.builtin.setup:
- name: Run common role
ansible.builtin.include_role:
name: common
Under the Hood: The `raw` module doesn't require Python on the target — it sends commands over SSH directly. This is why `gather_facts: false` is mandatory here: the `setup` module (which gathers facts) is a Python module and would fail before Python is installed.
Upgrade pattern¶
# Upgrade with dry run first
ansible-playbook playbooks/rolling-upgrade.yml \
-i inventory/aws_ec2.yml \
-e target_version=2.1.0 \
--check --diff
# Then for real, starting with one canary
ansible-playbook playbooks/rolling-upgrade.yml \
-i inventory/aws_ec2.yml \
-e target_version=2.1.0
Rollback pattern¶
# Roll back to the previous version
ansible-playbook playbooks/rolling-upgrade.yml \
-i inventory/aws_ec2.yml \
-e target_version=2.0.0
Because the playbook is idempotent, rolling back is just deploying the old version. No special rollback logic needed beyond the block/rescue for per-host failures.
Exercises¶
Exercise 1 (Quick win, 5 minutes): Create a static inventory file with two groups
(webservers and dbservers), three hosts total, and one group variable. Test it with
ansible-inventory --graph.
Hint: Use INI format. Define groups with `[groupname]`, variables with `[groupname:vars]`.

Exercise 2 (15 minutes): Write a role that installs nginx and templates a config file
with a server_name variable from defaults. Include a handler that reloads nginx when the
config changes. Test idempotency by running the role twice.
Hint: The role needs: `defaults/main.yml`, `tasks/main.yml`, `handlers/main.yml`, and `templates/nginx.conf.j2`. Use `validate: nginx -t -c %s` on the template task.

Exercise 3 (20 minutes): Encrypt a variable file with Ansible Vault, reference the
vault variables from a plaintext vars file, and run a playbook that uses both. Verify
with ansible-vault view that the secrets are stored encrypted.
Hint: Create `vault.yml` with `vault_secret: "my-secret"`, encrypt it, create `vars.yml` with `secret: "{{ vault_secret }}"`, and use `--ask-vault-pass` when running the playbook.

Exercise 4 (30 minutes): Write a rolling update playbook using `serial: [1, "25%"]`
and max_fail_percentage: 15. Include a health check in post_tasks that retries 5 times.
Test it with --check --diff against your inventory.
Hint: Use `ansible.builtin.uri` for the health check with `retries`, `delay`, and `until: result.status == 200`. The `serial` list means first batch = 1, then 25% batches.

Cheat Sheet¶
| Task | Command |
|---|---|
| Test inventory | ansible-inventory -i inventory/ --graph |
| Dry run | ansible-playbook site.yml --check --diff |
| Limit to one host | ansible-playbook site.yml --limit web01 |
| Run specific tags | ansible-playbook site.yml --tags deploy |
| Debug a variable | ansible -m debug -a "var=app_port" web01 |
| Create vault file | ansible-vault create secrets.yml |
| Edit vault file | ansible-vault edit secrets.yml |
| Rekey vault | ansible-vault rekey secrets.yml |
| Install collection | ansible-galaxy collection install amazon.aws |
| Install roles | ansible-galaxy install -r requirements.yml |
| Syntax check | ansible-playbook site.yml --syntax-check |
| List tasks | ansible-playbook site.yml --list-tasks |
| Molecule full test | molecule test |
| Molecule converge only | molecule converge |
| SSH into molecule container | molecule login -h ubuntu-noble |
| Profile task timing | Add callbacks_enabled = profile_tasks to ansible.cfg |

| Concept | Remember |
|---|---|
| Variable precedence | `defaults/` = bottom, `-e` = top, `vars/` = high |
| Handlers | Run at end of play, not after notifying task |
| `serial` | Batch size for rolling updates |
| `max_fail_percentage` | Circuit breaker — stop rollout if too many fail |
| `delegate_to` | Changes where task runs, not whose variables it uses |
| `raw` module | Runs over bare SSH — no Python needed on the target |
| Vault encryption | AES-256-CTR with PBKDF2 key stretching |
| Idempotency test | Run playbook twice — second run should show 0 changed |
Takeaways¶
- Dynamic inventory eliminates drift. If your inventory is a static file, it's wrong the moment someone launches or terminates an instance. Let the cloud API be the source of truth.
- `defaults/` vs `vars/` is the most consequential directory choice in a role. If users should override it, `defaults/`. If they shouldn't, `vars/`. Put it in both and you'll debug precedence issues at 2am.
- `serial` + `max_fail_percentage` is your circuit breaker. Every production playbook that touches multiple hosts needs both. Without them, a bad change rolls across the entire fleet.
- Vault-encrypt values, not entire files. The vault/vars split pattern lets you `grep` for variable usage and review PRs without the vault password.
- Molecule catches idempotency bugs before production does. The idempotence step (running the playbook twice and failing on "changed") is worth more than any other test.
- The `validate` parameter on template tasks is free insurance. If the config file has a syntax checker (`nginx -t`, `sshd -t`, `visudo -cf`), use it.
Related Lessons¶
- Ansible Playbook Debugging — deep dive into `--check`, `--diff`, verbosity levels, and variable inspection
- Terraform vs Ansible vs Helm — when to use which tool (Terraform creates infrastructure, Ansible configures it, Helm deploys to Kubernetes)
- Secrets Management Without Tears — Vault, HashiCorp Vault, AWS Secrets Manager, and the tradeoffs between them
- The Hanging Deploy — what happens when processes don't exit cleanly (relevant when Ansible tasks time out)
- Deploy a Web App From Nothing — end-to-end build-up from process to networking to persistence to monitoring