Ansible The Complete Guide
- lesson
- ansible-architecture
- inventory
- playbooks
- modules
- roles
- jinja2-templating
- variables-&-precedence
- handlers
- vault
- molecule-testing
- rolling-updates
- debugging
- performance-tuning
- tower/awx
- ansible-galaxy
- collections
- error-handling
- dynamic-inventory
- delegation
- callback-plugins
- custom-modules

---

Ansible — The Complete Guide: From Zero to Production
Topics: Ansible architecture, inventory, playbooks, modules, roles, Jinja2 templating, variables & precedence, handlers, Vault, Molecule testing, rolling updates, debugging, performance tuning, Tower/AWX, Ansible Galaxy, collections, error handling, dynamic inventory, delegation, callback plugins, custom modules
Strategy: Build-up (ground zero → production mastery) with war stories, trivia, and drills woven throughout
Level: L1–L2 (Foundations → Operations)
Time: 3–4 hours (designed for deep study in one or multiple sittings)
Prerequisites: SSH access to at least one Linux host (or a local VM). Familiarity with YAML helps but is not required.
The Mission¶
You're a new platform engineer. Your team manages 200 servers across dev, staging, and production. The previous engineer did everything by hand — SSHing into each server, running commands, copy-pasting configs. It took four hours to roll out a single update and they still missed two servers. The database password is in a shell history file. There's no record of what changed or when.
You're going to replace all of that with Ansible. By the end of this guide you'll understand every major concept from the ground up, have production-grade patterns for real work, and know the traps that catch experienced engineers. This is the one document you need to go from "what is Ansible?" to "I can run a zero-downtime rolling upgrade across 50 servers with encrypted secrets and automated testing."
Table of Contents¶
- What Is Ansible?
- Installation and First Commands
- Inventory — Who Are We Talking To?
- Playbooks — What Do We Want to Happen?
- Modules — The Building Blocks
- Variables, Facts, and the Precedence Nightmare
- Jinja2 Templating — Dynamic Config Files
- Handlers — Actions That Only Fire When Needed
- Roles — Reusable, Testable Building Blocks
- Ansible Vault — Secrets That Are Safe to Commit
- Error Handling — block/rescue, ignore_errors, failed_when
- Conditionals and Loops
- Rolling Updates and Zero-Downtime Deploys
- Debugging — Why Did It Do That?
- Performance at Scale
- Molecule — Testing Before Production
- Ansible Galaxy and Collections
- Tower/AWX — Centralized Automation
- Ansible vs Terraform vs Helm — When to Use Which
- Common Production Patterns
- Footguns — Mistakes That Brick Servers
- Real-World Case Studies
- Glossary
- Trivia and History
- Flashcard Review
- Drills
- Cheat Sheet
- Self-Assessment
Part 1: What Is Ansible?¶
Ansible is an agentless automation tool that manages server configuration, application deployment, and orchestration over SSH. You describe the desired state of your infrastructure in YAML files (playbooks), and Ansible makes it so.
One-liner definition: Agentless automation over SSH using YAML playbooks.
The Mental Model¶
Control Node (your laptop / CI server)
        |
        | SSH (or WinRM for Windows)
        v
Managed Nodes (target servers)
        |
        v
Module executes → reports changed/ok/failed → returns result
Ansible connects to your servers over SSH, pushes small programs called modules, executes them, and returns structured results. No agent to install. No daemon to manage. No custom protocol to debug. If you can SSH to a host and it has Python, Ansible can manage it.
One-liner from the street: Ansible is SSH in a trench coat. If SSH is broken, Ansible is broken.
Why Ansible Exists¶
Before Ansible, configuration management meant either: - Manual work: SSH into each server, run commands, hope you didn't miss one - Chef/Puppet: Install and manage agents on every server, maintain a PKI infrastructure, learn Ruby DSLs
Michael DeHaan created Ansible in February 2012 as a deliberate reaction to this complexity. His philosophy: if a machine has SSH and Python, it is already ready for configuration management. No bootstrap. No agent. No custom certificates. This zero-bootstrap approach is why Ansible became the default choice for network device automation — switches and routers have SSH but cannot run Ruby or install agents.
Etymology: The name "Ansible" comes from Ursula K. Le Guin's 1966 novel Rocannon's World, where an "ansible" is a device for instantaneous communication across any distance. Later popularized in Orson Scott Card's Ender's Game. DeHaan chose it because the tool was designed for instant, agentless communication with remote servers.
Core Principles¶
| Principle | What It Means |
|---|---|
| Agentless | Nothing to install on managed hosts (just Python + SSH) |
| Idempotent | Running the same playbook twice produces the same result — no unintended side effects |
| Declarative | You describe the desired state ("nginx should be installed"), not the steps ("apt-get install nginx") |
| Push-based | You run Ansible from a control node; it pushes changes to targets (vs. pull-based like Puppet) |
| YAML | Human-readable configuration language — no programming required |
Mnemonic — AIDPY: Agentless, Idempotent, Declarative, Push-based, YAML. These five properties define Ansible's design.
Part 2: Installation and First Commands¶
Installing Ansible¶
# On Ubuntu/Debian
sudo apt update && sudo apt install -y ansible
# On macOS
brew install ansible
# Via pip (any platform, most up-to-date)
pip install ansible
# Via pipx (isolated install, preferred for CLI tools)
pipx install ansible-core
# Verify
ansible --version
Ansible only needs to be installed on the control node (your laptop or CI server). Managed nodes need only Python 3 and an SSH server — both come pre-installed on virtually every Linux distribution.
ansible-core vs ansible¶
There are two PyPI packages, and the distinction matters:
| Package | What You Get | When to Use |
|---|---|---|
| `ansible-core` | Engine + CLI tools + `ansible.builtin` content + plugin framework | When you want minimal, controlled dependencies and install collections explicitly |
| `ansible` | `ansible-core` + curated community collections (batteries-included) | Quick start, learning, small teams where convenience beats precision |
For team reproducibility, prefer ansible-core with a pinned requirements.yml listing exactly which collections you need. This avoids "works on my machine" drift where different ansible package versions ship different collection versions.
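A pinned `requirements.yml` might look like the sketch below — the version numbers are purely illustrative; pin whatever versions your team has actually tested:

```yaml
# requirements.yml — collection pins (versions here are examples, not recommendations)
collections:
  - name: amazon.aws
    version: "9.0.0"
  - name: community.general
    version: "10.0.0"
```

Install with `ansible-galaxy collection install -r requirements.yml` and commit the file next to your playbooks so every engineer and CI job resolves the same collections.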
The next step beyond pinned collections is Execution Environments — containerized Ansible runtimes that freeze the entire controller-side dependency tree (see the Execution Environments section later in this guide).
Your First Ad-Hoc Command¶
Ad-hoc commands let you run a single task on one or many servers without writing a playbook. They're the Ansible equivalent of one-liner shell commands.
# Ping all hosts (tests connection, not ICMP)
ansible all -m ping -i inventory.yml
# Check disk usage on web servers
ansible webservers -m command -a "df -h" -i inventory.yml
# Restart nginx (needs sudo)
ansible webservers -m service -a "name=nginx state=restarted" -i inventory.yml -b
# Install a package
ansible all -m apt -a "name=curl state=present" -i inventory.yml -b
# Copy a file to all servers
ansible all -m copy -a "src=motd.txt dest=/etc/motd" -i inventory.yml -b
| Flag | Meaning | Mnemonic |
|---|---|---|
| `-i` | Inventory file | inventory |
| `-m` | Module name | module |
| `-a` | Module arguments | arguments |
| `-b` | Become (sudo) | become root |
| `-u` | Remote user | user |
| `-k` | Ask for SSH password | key/password prompt |
| `--check` | Dry run (don't change anything) | |
| `--diff` | Show file changes | |
Remember: Ad-hoc module categories: `command` (runs without a shell — no pipes), `shell` (supports pipes and redirects), `copy` (files to remote), `service` (start/stop/restart), `package` (install/remove). Mnemonic: CSCSP — Command, Shell, Copy, Service, Package.

Gotcha: The `command` module does NOT support pipes, redirects, or shell builtins. `ansible all -m command -a "cat /etc/passwd | grep root"` fails. Use `-m shell` for anything that needs shell features. This is a common interview trip-up.
Part 3: Inventory — Who Are We Talking To?¶
Before Ansible does anything, it needs to know who to talk to. That's the inventory — the list of hosts and their groupings.
Static Inventory¶
The simplest inventory is a file that lists your hosts:
# inventory/hosts.ini (INI format)
[webservers]
web1.example.com
web2.example.com ansible_host=10.0.1.10
[dbservers]
db1.example.com ansible_port=2222
[production:children]
webservers
dbservers
[webservers:vars]
http_port=8080
app_env=production
# inventory/hosts.yml (YAML format — same thing, different syntax)
all:
  children:
    webservers:
      hosts:
        web1.example.com:
        web2.example.com:
          ansible_host: 10.0.1.10
      vars:
        http_port: 8080
    dbservers:
      hosts:
        db1.example.com:
          ansible_port: 2222
| Syntax | What It Does |
|---|---|
| `[webservers]` | Defines a group named "webservers" |
| `ansible_host=10.0.1.10` | Override the connection IP (when DNS doesn't resolve) |
| `[production:children]` | Create a parent group containing other groups |
| `[webservers:vars]` | Variables applied to every host in the group |
Dynamic Inventory — Let the Cloud Tell You¶
Static inventory works for 5 servers. For 50 EC2 instances that scale up and down? You need dynamic inventory — scripts or plugins that query cloud APIs at runtime.
# inventory/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
  - key: placement.availability_zone
    prefix: az
filters:
  tag:ManagedBy: ansible
  instance-state-name: running
compose:
  ansible_host: private_ip_address
  ansible_user: "'ubuntu'"
# See what Ansible discovers
ansible-inventory -i inventory/aws_ec2.yml --graph
# Output:
# @all:
# |--@env_production:
# | |--10.0.1.10
# | |--10.0.1.11
# |--@role_webserver:
# | |--10.0.1.10
# |--@az_us_east_1a:
# | |--10.0.1.10
Under the Hood: The `aws_ec2` plugin calls the EC2 `DescribeInstances` API. `keyed_groups` creates Ansible groups from instance metadata. The `compose` block builds per-host variables using Jinja2 expressions. The single quotes around `'ubuntu'` in compose are deliberate — without them, Ansible would try to resolve `ubuntu` as a variable name.

Gotcha: The `aws_ec2` plugin requires the `amazon.aws` collection and the `boto3` Python package. Install both: `ansible-galaxy collection install amazon.aws` and `pip install boto3`. If you forget `boto3`, the error is unhelpfully vague: `"Failed to import the required Python library"`.
group_vars and host_vars¶
Variables that apply to groups or specific hosts live in directories alongside the inventory:
inventory/
  aws_ec2.yml
  group_vars/
    all.yml               # Every host gets these
    role_webserver.yml    # Only hosts tagged Role=webserver
    env_production.yml    # Only hosts tagged Environment=production
  host_vars/
    10.0.1.10.yml         # Overrides for one specific host
# inventory/group_vars/all.yml
ntp_servers:
  - 169.254.169.123  # AWS time sync service
timezone: UTC
monitoring_agent: prometheus-node-exporter
# inventory/group_vars/role_webserver.yml
nginx_worker_processes: auto
nginx_worker_connections: 2048
app_port: 8080
health_check_path: /health
Flashcard Check: Inventory¶
| Question | Answer |
|---|---|
| What's the difference between static and dynamic inventory? | Static = manually maintained file. Dynamic = plugin/script queries an API at runtime. |
| What does `keyed_groups` do in a dynamic inventory plugin? | Creates Ansible groups from instance metadata (tags, zones, types). |
| Where do you put variables that apply to all hosts in a group? | `group_vars/<group_name>.yml` alongside the inventory file. |
| What always wins in variable precedence? | Extra vars (`-e` on the command line) — they override everything. |
Part 4: Playbooks — What Do We Want to Happen?¶
Playbooks are YAML files that define the desired state of your infrastructure. A playbook contains plays; each play targets hosts and runs tasks.
---
- name: Configure web servers
  hosts: webservers
  become: yes  # Run as root (sudo)
  vars:
    app_port: 8080
    packages:
      - nginx
      - python3
      - certbot
  tasks:
    - name: Install required packages
      ansible.builtin.apt:
        name: "{{ packages }}"
        state: present
        update_cache: yes

    - name: Copy nginx config
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
        owner: root
        group: root
        mode: '0644'
      notify: Restart nginx

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
Anatomy of a Playbook¶
| Component | Purpose |
|---|---|
| `name` | Human-readable description (shows in output) |
| `hosts` | Which inventory hosts/groups to target |
| `become` | Escalate privileges (sudo) |
| `vars` | Variables for this play |
| `tasks` | Ordered list of actions to perform |
| `handlers` | Actions triggered by `notify` (run once at end of play) |
| `pre_tasks` | Tasks that run before roles |
| `post_tasks` | Tasks that run after roles |
| `roles` | Reusable role includes |
Running a Playbook¶
# Basic run
ansible-playbook site.yml -i inventory.yml
# Dry run — see what would change
ansible-playbook site.yml --check --diff
# Limit to one host
ansible-playbook site.yml --limit web1.example.com
# Run only tagged tasks
ansible-playbook site.yml --tags "nginx"
# Pass extra variables
ansible-playbook site.yml -e "app_version=2.1.0"
# Step through task by task (interactive)
ansible-playbook site.yml --step
Part 5: Modules — The Building Blocks¶
Modules are the units of work in Ansible. There are 7,000+ modules covering everything from package management to cloud provisioning. Each module is idempotent by design — it checks current state and only changes what's needed.
Key Module Categories¶
| Category | Modules | Purpose | Example |
|---|---|---|---|
| Package | `apt`, `dnf`, `yum`, `pip` | Install/remove packages | `apt: name=nginx state=present` |
| File | `file`, `copy`, `template`, `lineinfile` | Manage files and content | `template: src=app.conf.j2 dest=/etc/app.conf` |
| Service | `service`, `systemd` | Manage services | `service: name=nginx state=started enabled=true` |
| User | `user`, `group`, `authorized_key` | Manage users and access | `user: name=deploy groups=sudo` |
| Command | `command`, `shell`, `script`, `raw` | Run arbitrary commands | `command: /opt/deploy.sh` |
| Cloud | `ec2`, `gcp_compute_instance` | Cloud resource management | `ec2: instance_type=t3.medium` |
| Debug | `debug`, `assert`, `fail` | Debugging and assertions | `debug: var=ansible_hostname` |
The Golden Rule: Prefer Modules Over Shell¶
# BAD — not idempotent, runs every time, slow
- name: Install nginx
  ansible.builtin.shell: apt-get install -y nginx

# GOOD — idempotent, only changes if needed, reports status correctly
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present
Modules know how to check current state. The apt module checks if nginx is already installed before doing anything. The shell module has no idea — it runs the command every time, reports "changed" every time, and takes 20 minutes because it reinstalls packages every run.
Rule of thumb: Only use `shell`/`command` when there's no module for what you need. And when you do, add `creates:` or `when:` to make it idempotent.
# shell with idempotency guard
- name: Create database
  ansible.builtin.command: createdb myapp
  args:
    creates: /var/lib/postgresql/myapp  # Skip if this path exists

# Even better — use the proper module
- name: Create database
  community.postgresql.postgresql_db:
    name: myapp
    state: present
lineinfile vs template vs blockinfile¶
These three modules manage file content differently:
| Module | Use When | Gotcha |
|---|---|---|
| `lineinfile` | Changing a single line in a file you don't fully manage | Two tasks matching the same regex fight each other |
| `blockinfile` | Adding a multi-line block to a file | Adds marker comments (BEGIN/END ANSIBLE MANAGED BLOCK) |
| `template` | You fully manage the file | Overwrites entire file; previous manual edits are lost |
# lineinfile — surgical one-line edit
- ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^PermitRootLogin'
    line: 'PermitRootLogin no'

# template — full file management
- ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s  # Validates BEFORE writing
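The table mentions `blockinfile`, which the examples above don't show. A minimal sketch — the path and addresses are invented for illustration:

```yaml
# blockinfile — multi-line block wrapped in marker comments
- ansible.builtin.blockinfile:
    path: /etc/hosts
    marker: "# {mark} ANSIBLE MANAGED BLOCK — app hosts"
    block: |
      10.0.1.10 web1.internal
      10.0.1.11 web2.internal
```

On subsequent runs Ansible finds the markers and replaces only the text between them, which is what keeps the task idempotent on a file you otherwise don't manage.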
Part 6: Variables, Facts, and the Precedence Nightmare¶
Variable Sources¶
Variables can be defined in many places:
# In the playbook
vars:
  app_port: 8080

# In group_vars
# group_vars/webservers.yml
http_port: 80

# In host_vars
# host_vars/web1.example.com.yml
http_port: 9090  # Override for this specific host

# On the command line
# ansible-playbook site.yml -e "app_port=9090"
The 22 Levels of Variable Precedence¶
Ansible has 22 levels of variable precedence. Yes, twenty-two. When the same variable is defined in multiple places, the highest-precedence one wins.
(lowest precedence)
1. command line values (not variables)
2. role defaults (roles/x/defaults/main.yml)
3. inventory file or script group vars
4. inventory group_vars/all
5. playbook group_vars/all
6. inventory group_vars/*
7. playbook group_vars/*
8. inventory file or script host vars
9. inventory host_vars/*
10. playbook host_vars/*
11. host facts / cached set_facts
12. play vars
13. play vars_prompt
14. play vars_files
15. role vars (roles/x/vars/main.yml) ← higher than play vars!
16. block vars
17. task vars
18. include_vars
19. set_facts / register
20. role (and include_role) params
21. include params
22. extra vars (-e on command line) ← ALWAYS wins
(highest precedence)
The Most Common Traps¶
Trap 1: Role vars/ beats play vars:
# roles/nginx/vars/main.yml
nginx_worker_connections: 768  # ← This wins (precedence 15)

# playbook.yml
- hosts: webservers
  vars:
    nginx_worker_connections: 1024  # ← This loses (precedence 12)
  roles:
    - nginx

# Result: 768. Not 1024. Surprise.
Rule: If users should be able to override a variable, put it in `defaults/main.yml` (precedence 2). If they shouldn't, put it in `vars/main.yml` (precedence 15). Never put the same variable in both.
Trap 2: Extra vars override everything
Passing -e app_env=staging in a CI pipeline that deploys to production will override the inventory's app_env=production. Use extra vars sparingly.
Debugging Variables¶
# See what value a host actually gets
ansible -m debug -a "var=app_port" webserver
# See ALL variables for a host
ansible -m debug -a "var=hostvars[inventory_hostname]" webserver
# In a playbook
- ansible.builtin.debug:
    var: nginx_worker_connections
Facts — Auto-Discovered Host Information¶
Facts are system information Ansible gathers automatically at the start of each play:
- ansible.builtin.debug:
    msg: "OS: {{ ansible_distribution }} {{ ansible_distribution_version }}"
# Output: "OS: Ubuntu 22.04"

- ansible.builtin.debug:
    msg: "IP: {{ ansible_default_ipv4.address }}"

- ansible.builtin.debug:
    msg: "CPUs: {{ ansible_processor_vcpus }}"
Common facts: ansible_hostname, ansible_os_family, ansible_distribution, ansible_default_ipv4.address, ansible_processor_vcpus, ansible_memtotal_mb, ansible_fqdn.
Modern practice: Ansible now prefers accessing facts via the `ansible_facts` dictionary — e.g., `ansible_facts['distribution']` instead of the bare `ansible_distribution` injection. Both work, but the dictionary form is explicit, avoids variable-name collisions, and is what `ansible-lint` recommends. You can enforce this by setting `inject_facts_as_vars = false` in `ansible.cfg`.
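Side by side, the two forms of fact access look like this:

```yaml
# Legacy injected variable vs. explicit ansible_facts lookup — both resolve to the same value
- ansible.builtin.debug:
    msg: "{{ ansible_distribution }}"           # injected top-level variable
- ansible.builtin.debug:
    msg: "{{ ansible_facts['distribution'] }}"  # explicit dictionary form (preferred)
```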
# View all facts for a host
ansible webserver -m setup
# Filter facts
ansible webserver -m setup -a "filter=ansible_distribution*"
Part 7: Jinja2 Templating — Dynamic Config Files¶
Templates generate config files with dynamic content. Ansible uses Jinja2, the same engine behind Flask and Django.
Basic Syntax¶
{# This is a comment #}
{{ variable }} {# Output a value #}
{{ var | default("fallback") }} {# Filter with default #}
{% for item in list %} {# Loop #}
{% if condition %} {# Conditional #}
{% endif %}
{% endfor %}
A Real Template¶
{# templates/nginx.conf.j2 #}
server {
    listen {{ http_port }};
    server_name {{ ansible_fqdn }};

    location / {
        proxy_pass http://127.0.0.1:{{ app_port }};
    }

{% if enable_ssl %}
    listen 443 ssl;
    ssl_certificate /etc/ssl/{{ ansible_fqdn }}.crt;
{% endif %}
}
Dynamic Upstream from Inventory Groups¶
{# templates/upstream.conf.j2 #}
upstream app_servers {
{%- for host in groups['app'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port | default(8080) }};
{%- endfor %}
}
The {%- syntax (note the dash) strips whitespace before the tag. Without it, {% for %} loops add blank lines between entries.
Essential Jinja2 Filters¶
| Filter | What It Does | Example |
|---|---|---|
| `default('val')` | Fallback if undefined | `{{ timeout \| default(30) }}` |
| `mandatory` | Fail if undefined | `{{ db_host \| mandatory }}` |
| `to_nice_json` | Pretty-print as JSON | `{{ config_dict \| to_nice_json }}` |
| `regex_replace` | Regex substitution | `{{ hostname \| regex_replace('\.example\.com$', '') }}` |
| `join(', ')` | Join list into string | `{{ dns_servers \| join(', ') }}` |
| `selectattr` | Filter list of dicts | `{{ users \| selectattr('active') \| list }}` |
| `b64encode` | Base64 encode | `{{ secret \| b64encode }}` |
| `password_hash` | Hash a password | `{{ pass \| password_hash('sha512') }}` |
| `basename` | Extract filename from path | `{{ '/etc/nginx/conf.d/app.conf' \| basename }}` |
Gotcha: YAML treats `{` as the start of a mapping. Any value that begins with `{{` must be quoted: `message: "{{ greeting }} world"`. Without quotes you get the cryptic error `mapping values are not allowed in this context`. This is the single most common Ansible YAML error.
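A minimal illustration of the trap (the variable name is arbitrary):

```yaml
# BROKEN — YAML tries to parse {{ as an inline mapping and the file fails to load
message: {{ greeting }} world

# OK — quoting makes the Jinja2 expression an ordinary string
message: "{{ greeting }} world"
```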
Part 8: Handlers — Actions That Only Fire When Needed¶
Handlers are special tasks that run only when notified, and only once at the end of the play (even if notified multiple times).
tasks:
  - name: Update nginx config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

  - name: Update SSL cert
    ansible.builtin.copy:
      src: cert.pem
      dest: /etc/ssl/cert.pem
    notify: Restart nginx  # Won't restart twice

handlers:
  - name: Restart nginx
    ansible.builtin.service:
      name: nginx
      state: restarted
Key Behavior¶
- Handlers only fire if the notifying task reports "changed" (not "ok")
- Handlers run at the end of the play, not after the notifying task
- If a task notifies the same handler twice, it still only runs once
- If the play fails before reaching the handler phase, handlers don't run
Common Handler Mistakes¶
Mistake 1: Handler doesn't fire because task reports "ok"
If the config file already matches the template, the task reports "ok" (not "changed"), and the handler never fires. This is correct behavior — but if you renamed the handler and forgot to update the notify, it silently does nothing.
Mistake 2: Handler doesn't run because a later task fails
Your config changed, but a later task failed, so handlers never ran. Now the service runs with an old config that doesn't match the files on disk.
Fix: Use meta: flush_handlers if you need handlers to fire immediately:
tasks:
  - name: Update config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

  - ansible.builtin.meta: flush_handlers  # Force handlers to run NOW

  - name: Potentially risky task
    ansible.builtin.command: /opt/risky-thing.sh
# Force all handlers to run regardless of notification
ansible-playbook playbook.yml --force-handlers
Part 9: Roles — Reusable, Testable Building Blocks¶
A role is a directory structure that packages tasks, templates, variables, handlers, and metadata into a reusable unit. Think of it like a function in code — it takes inputs (variables), does work (tasks), and has side effects (handlers).
The Complete Role Directory Layout¶
roles/
  webapp/
    defaults/main.yml    # Default variables — LOW precedence, meant to be overridden
    vars/main.yml        # Role variables — HIGH precedence, hard to override
    tasks/main.yml       # The actual work
    handlers/main.yml    # Actions triggered by notify
    templates/           # Jinja2 templates (.j2 files)
      nginx-vhost.conf.j2
      app-config.yml.j2
    files/               # Static files copied as-is (no templating)
      logrotate-webapp
    meta/main.yml        # Role metadata: dependencies, platforms, author
    molecule/            # Test suite
      default/
        molecule.yml
        converge.yml
        verify.yml
defaults/ vs vars/ — The Most Consequential Choice¶
| Directory | Precedence | Purpose | When to Use |
|---|---|---|---|
| `defaults/main.yml` | 2 (lowest) | Knobs users should turn | Ports, versions, feature flags |
| `vars/main.yml` | 15 (high) | Constants that shouldn't change | Internal paths, required packages |
War Story: A team put `app_port: 8080` in both `defaults/` and `vars/` during a refactor. Both said 8080, so testing caught nothing. Three months later, staging set `app_port: 9090` in their `group_vars/`. It didn't work — `vars/main.yml` silently stomped the override (higher precedence). They spent two hours debugging before someone ran `ansible -m debug -a "var=app_port" staging-web01`. The fix: move `app_port` out of `vars/`. If users should override it, it goes in defaults. Period.
Creating a Role¶
# roles/postgresql/defaults/main.yml
postgresql_version: "15"
postgresql_max_connections: 100

# roles/postgresql/tasks/main.yml
---
- name: Install PostgreSQL
  ansible.builtin.apt:
    name: "postgresql-{{ postgresql_version }}"
    state: present

- name: Configure PostgreSQL
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: "/etc/postgresql/{{ postgresql_version }}/main/postgresql.conf"
  notify: Restart PostgreSQL

# roles/postgresql/handlers/main.yml
---
- name: Restart PostgreSQL
  ansible.builtin.service:
    name: postgresql
    state: restarted
Using Roles in a Playbook¶
- hosts: webservers
  roles:
    - common
    - webserver
    - { role: monitoring, tags: ['monitoring'] }

# Or dynamically
- hosts: webservers
  tasks:
    - ansible.builtin.include_role:
        name: webserver
      when: deploy_web | default(true)
include_role vs import_role¶
| Feature | `import_role` | `include_role` |
|---|---|---|
| Processing | Static — resolved at parse time | Dynamic — resolved at runtime |
| Tags | Inherited by all tasks | Only apply to the include itself |
| Conditionals | Applied to every task in the role | Applied once to the include |
| Use when | You want tags/conditions to propagate | You need conditional or looped inclusion |
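A short sketch of when each form fits (the role names are illustrative):

```yaml
# import_role — static: the tag propagates to every task inside the role
- ansible.builtin.import_role:
    name: webserver
  tags: [web]

# include_role — dynamic: can be looped or gated at runtime
- ansible.builtin.include_role:
    name: "{{ item }}"
  loop: [common, webserver, monitoring]
```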
Role Dependencies¶
# roles/webapp/meta/main.yml
dependencies:
  - role: common
  - role: monitoring
    vars:
      monitoring_port: "{{ app_port }}"
Dependencies run before the role's tasks. By default, a role runs only once even if listed as a dependency by multiple roles (use allow_duplicates: true to change this).
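If you genuinely need a role to run once per listing, set `allow_duplicates` in the role's own metadata — a minimal sketch:

```yaml
# roles/common/meta/main.yml — let this role run every time it is listed
allow_duplicates: true
dependencies: []
```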
Part 10: Ansible Vault — Secrets That Are Safe to Commit¶
Your playbooks need database passwords, API keys, and TLS certificates. These need to live in the repo so automation can use them, but they must be encrypted. Ansible Vault encrypts data with AES-256-CTR and HMAC-SHA256 for integrity, using PBKDF2 key stretching.
Essential Vault Commands¶
# Create a new encrypted file
ansible-vault create secrets.yml
# Encrypt an existing file
ansible-vault encrypt vars/passwords.yml
# Edit an encrypted file (decrypts to tmpfs, re-encrypts on save)
ansible-vault edit secrets.yml
# View without decrypting to disk
ansible-vault view secrets.yml
# Change the encryption password
ansible-vault rekey secrets.yml
# Encrypt a single string
echo -n 'my_secret_password' | ansible-vault encrypt_string --stdin-name 'db_password'
# Outputs:
# db_password: !vault |
# $ANSIBLE_VAULT;1.1;AES256
# ...
# Run playbook with vault password
ansible-playbook site.yml --ask-vault-pass
ansible-playbook site.yml --vault-password-file ~/.vault_pass
The Vault/Vars Split Pattern¶
Don't encrypt your entire vars file. Use two files:
inventory/
  group_vars/
    env_production/
      vars.yml   # Plaintext — references vault variables
      vault.yml  # Encrypted — contains the actual secrets

# vault.yml (encrypted)
vault_db_password: "s3cr3t_pr0d_p4ss"
vault_api_key: "ak_prod_xK9mP2qR7vN4"

# vars.yml (plaintext)
db_password: "{{ vault_db_password }}"
api_key: "{{ vault_api_key }}"
Why the split? You can grep for where db_password is used without decrypting anything. When reviewing a PR, you can see that db_password was changed without needing the vault password.
Multiple Vault IDs¶
Different secrets for different teams:
# Encrypt with specific vault IDs
ansible-vault encrypt --vault-id dev@prompt secrets-dev.yml
ansible-vault encrypt --vault-id prod@/path/to/prod-password secrets-prod.yml

# Run with multiple vault IDs
ansible-playbook site.yml \
  --vault-id dev@prompt \
  --vault-id prod@/path/to/prod-password
Vault Security Rules¶
- Never pass secrets on the command line: `ansible-vault encrypt_string 'actual_password'` puts the password in shell history. Pipe it instead.
- Use `no_log: true` on tasks that handle secrets. Without it, your database password shows up in CI logs.
- Store the vault password file with restrictive permissions: `chmod 600 ~/.vault_pass`
- For enterprise: Use an external secret store (HashiCorp Vault, AWS Secrets Manager) with dynamic credentials when possible.
Part 11: Error Handling¶
ignore_errors — The Blunt Instrument¶
- name: Check if legacy service exists
  ansible.builtin.command: systemctl status legacy-app
  register: result
  ignore_errors: true

- name: Stop legacy app if it exists
  ansible.builtin.service:
    name: legacy-app
    state: stopped
  when: result.rc == 0
Warning: `ignore_errors: true` silently swallows ALL errors. Six months later, your cleanup task has been failing silently and the disk is full. Prefer `failed_when` with specific conditions.
failed_when — Targeted Failure Conditions¶
- name: Check disk space
  ansible.builtin.command: df -h /
  register: disk_check
  failed_when: "'100%' in disk_check.stdout"
block/rescue/always — Structured Error Handling¶
- block:
    - name: Deploy application
      ansible.builtin.command: /opt/deploy.sh

    - name: Verify deployment
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200
  rescue:
    - name: Rollback on failure
      ansible.builtin.command: /opt/rollback.sh
  always:
    - name: Send notification
      ansible.builtin.debug:
        msg: "Deploy attempt complete"
This is Ansible's try/catch/finally. If any task in `block` fails, `rescue` runs. `always` runs regardless.
changed_when — Control What Counts as "Changed"¶
- name: Check current version
  ansible.builtin.command: cat /opt/app/VERSION
  register: version
  changed_when: false  # This task never "changes" anything
Part 12: Conditionals and Loops¶
Conditionals (when)¶
# Based on OS
- name: Install packages (Debian)
  ansible.builtin.apt:
    name: "{{ item }}"
    state: present
  loop: [nginx, curl, htop]
  when: ansible_os_family == "Debian"

- name: Install packages (RedHat)
  ansible.builtin.yum:
    name: "{{ item }}"
    state: present
  loop: [nginx, curl, htop]
  when: ansible_os_family == "RedHat"

# Based on a registered variable
- ansible.builtin.command: which docker
  register: docker_check
  ignore_errors: true

- name: Install Docker
  ansible.builtin.apt:
    name: docker.io
    state: present
  when: docker_check.rc != 0
Loops¶
# Simple list
- name: Install packages
  ansible.builtin.apt:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - curl
    - htop

# Loop with dictionaries
- name: Create users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    shell: /bin/bash
  loop:
    - { name: alice, groups: "sudo,docker" }
    - { name: bob, groups: "docker" }
    - { name: carol, groups: "sudo" }
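Dictionaries can be looped too, via the `dict2items` filter — the vhost names and ports below are invented for illustration:

```yaml
# dict2items turns {k: v, ...} into a list of {key: ..., value: ...} items
- name: Render one vhost per entry
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ item.key }}"
  loop: "{{ vhosts | dict2items }}"
  vars:
    vhosts:
      app.example.com:
        port: 8080
      api.example.com:
        port: 9090
```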
Part 13: Rolling Updates and Zero-Downtime Deploys¶
This is where everything comes together. Deploying to 50 servers behind a load balancer without dropping a single request.
The Strategy¶
# playbooks/rolling-upgrade.yml
---
- name: Rolling upgrade — {{ app_name }} {{ app_version }}
  hosts: role_webserver
  serial:
    - 1      # First: single canary
    - "10%"  # Then: 10% at a time
    - "25%"  # Then: 25% at a time
  max_fail_percentage: 10

  pre_tasks:
    - name: Verify current version
      ansible.builtin.command: "cat {{ app_home }}/VERSION"
      register: pre_version
      changed_when: false

    - name: Remove from ALB target group
      community.aws.elb_target:
        target_group_arn: "{{ alb_target_group_arn }}"
        target_id: "{{ ansible_host }}"
        state: absent
      delegate_to: localhost

    - name: Wait for connections to drain
      ansible.builtin.pause:
        seconds: 30

  roles:
    - role: webapp
      vars:
        app_version: "{{ target_version }}"

  post_tasks:
    - name: Wait for application health
      ansible.builtin.uri:
        url: "{{ health_check_url }}"
        return_content: true
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health.status == 200

    - name: Re-add to ALB target group
      community.aws.elb_target:
        target_group_arn: "{{ alb_target_group_arn }}"
        target_id: "{{ ansible_host }}"
        target_port: "{{ app_port }}"
        state: present
      delegate_to: localhost

    - name: Record new version
      ansible.builtin.command: "cat {{ app_home }}/VERSION"
      register: post_version
      changed_when: false

    - name: Report upgrade status
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: {{ pre_version.stdout }} → {{ post_version.stdout }}"
Key Concepts¶
serial: [1, "10%", "25%"] — Graduated rollout. First batch is a single canary. If it survives, widen to 10%, then 25%. This catches "completely broken" (canary fails) and "fails under load" (10% batch reveals issues).
max_fail_percentage: 10 — The circuit breaker. If more than 10% of hosts in any batch fail, Ansible stops the entire play. Without this, a bad deploy rolls across all servers.
Mental Model: Think of `serial` + `max_fail_percentage` as a circuit breaker pattern — the same concept used in microservice architectures. `serial` controls the batch size (how much current flows). `max_fail_percentage` is the trip threshold. Together, they limit blast radius.
delegate_to: localhost — The ALB API calls run on your control node, not on the web servers. delegate_to changes where the task runs, not whose variables it uses.
Rollback with block/rescue¶
tasks:
  - block:
      - name: Deploy new version
        ansible.builtin.include_role:
          name: webapp
        vars:
          app_version: "{{ target_version }}"

      - name: Verify health
        ansible.builtin.uri:
          url: "{{ health_check_url }}"
          status_code: 200
        register: health
        retries: 5
        delay: 10
        until: health.status == 200

    rescue:
      - name: ROLLBACK — deploy previous version
        ansible.builtin.include_role:
          name: webapp
        vars:
          app_version: "{{ pre_version.stdout | trim }}"

      - name: Notify on rollback
        community.general.slack:
          token: "{{ vault_slack_token }}"
          channel: "#deploys"
          msg: "ROLLBACK on {{ inventory_hostname }}: {{ target_version }} failed"
        delegate_to: localhost
        ignore_errors: true
Execution Strategies¶
| Strategy | Behavior | Use When |
|---|---|---|
| `linear` (default) | All hosts execute each task before moving to the next | Normal operations |
| `free` | Each host runs as fast as it can, independently | Tasks are independent per host |
| `debug` | Interactive step-by-step debugging | Troubleshooting |
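Switching strategies is a single play keyword. A minimal sketch:

```yaml
- name: Fleet-wide package refresh
  hosts: all
  strategy: free   # each host proceeds through tasks independently
  tasks:
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
```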
Part 14: Debugging — Why Did It Do That?¶
The Verbosity Ladder¶
# Normal: show task names and status
ansible-playbook playbook.yml
# -v: show task results (return values)
ansible-playbook playbook.yml -v
# -vv: show task input parameters
ansible-playbook playbook.yml -vv
# -vvv: show SSH connection details
ansible-playbook playbook.yml -vvv
# -vvvv: show full SSH protocol debugging
ansible-playbook playbook.yml -vvvv
For most debugging: -vv (see what values were used). For connection issues: -vvv (see SSH commands). For SSH key/auth problems: -vvvv.
Check Mode and Diff Mode¶
# Preview what WOULD change without changing anything
ansible-playbook playbook.yml --check --diff
# Example diff output:
# TASK [Copy nginx config]
# --- before: /etc/nginx/nginx.conf
# +++ after: /tmp/ansible-generated
# @@ -10,3 +10,3 @@
# - worker_connections 768;
# + worker_connections 1024;
# changed: [webserver]
Gotcha: Check mode doesn't work with `command`/`shell` modules (they can't predict what a shell command would do). The module docs tell you if check mode is supported.

Nuance: Check mode has partial support for `command`/`shell` modules when using `creates`/`removes` parameters — Ansible can check whether the file exists without running the command. But don't assume check mode is perfect. It's excellent for declarative modules (`apt`, `template`, `service`) but limited with imperative tasks. Always review the diff output critically.

Non-negotiable rule: Always run `--check --diff` before every production run.
Essential Debugging Commands¶
# List what would be affected
ansible-playbook site.yml --list-hosts
ansible-playbook site.yml --list-tasks
ansible-playbook site.yml --list-tags
# Syntax check (no execution)
ansible-playbook site.yml --syntax-check
# Start at a specific task (skip earlier ones)
ansible-playbook site.yml --start-at-task="Install nginx"
# Use the retry file after a failure
ansible-playbook site.yml --limit @site.retry
# Test connectivity
ansible all -m ping
# Debug a specific variable on a specific host
ansible -m debug -a "var=hostvars[inventory_hostname]" web1.example.com
Debugging Command Reference¶
| Task | Command |
|---|---|
| Preview changes | ansible-playbook play.yml --check --diff |
| Debug variables | ansible -m debug -a "var=VAR" host |
| Override a variable | ansible-playbook play.yml -e "var=value" |
| Force handlers | ansible-playbook play.yml --force-handlers |
| Show facts | ansible host -m setup |
| Verbose output | -v (results) -vv (inputs) -vvv (SSH) |
| Step-by-step | ansible-playbook play.yml --step |
| Start at task | ansible-playbook play.yml --start-at-task "Name" |
| List tasks | ansible-playbook play.yml --list-tasks |
| Syntax check | ansible-playbook play.yml --syntax-check |
Part 15: Performance at Scale¶
Gathering facts on 50 hosts takes time. On 500 hosts, it takes minutes before your first task runs. Here's how to fix that.
ansible.cfg Tuning¶
# ansible.cfg
[defaults]
forks = 30 # Default is 5 — too low for any real fleet
gathering = smart # Only gather if cache is stale
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400 # 24 hours
interpreter_python = auto_silent # Auto-detect Python; suppress warning noise
retry_files_enabled = False # Don't litter .retry files everywhere
# Show task timing (which tasks are slow?)
callbacks_enabled = timer, profile_tasks
callback_result_format = yaml # Output results as YAML instead of JSON
[ssh_connection]
pipelining = True # Reduces SSH round trips per task
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
| Setting | What It Does | Impact |
|---|---|---|
| `forks = 30` | Run 30 hosts in parallel (default: 5) | Linear speedup for independent tasks |
| `gathering = smart` | Skip fact gathering if cache is fresh | Saves 2–10 seconds per host |
| `pipelining = True` | Run modules in-process instead of copying | 2–3x faster per task |
| `ControlPersist=60s` | Reuse SSH connections for 60 seconds | Fewer SSH handshakes |
| `interpreter_python = auto_silent` | Auto-detect Python path on targets without warning | Cleaner output on mixed fleets |
| `retry_files_enabled = False` | Suppress `.retry` file creation on failures | Less clutter; use `--limit @file` explicitly |
Scale note: ControlPersist + pipelining is the single biggest performance win for Ansible over SSH. Without pipelining, each task requires multiple SSH round trips (copy module, execute, fetch result). With pipelining, everything happens in one SSH session. Enable both for any fleet over 20 hosts.
Fact Caching with Redis (Shared Across CI Runners)¶
# For shared caching across CI runners
fact_caching = redis
fact_caching_connection = redis://localhost:6379/0
Disable Fact Gathering When Not Needed¶
- hosts: webservers
  gather_facts: no   # Skip fact gathering entirely
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
Mitogen Plugin¶
Mitogen replaces Ansible's SSH-based module execution with a more efficient method. It provides 2–7x speedup for many workloads and is a drop-in replacement — just change the strategy plugin.
Gotcha: Stale fact caches cause subtle bugs. If someone adds a disk or changes an IP, cached facts show old state. For critical operations, force fresh facts:

ansible.builtin.setup:
  gather_subset: [network, hardware]
Async for Long-Running Tasks¶
- name: Run migration
  ansible.builtin.command: /opt/migrate.sh
  async: 3600   # Max runtime 1 hour
  poll: 30      # Check every 30 seconds
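For true fire-and-forget, set `poll: 0` and reap the job later with `async_status`:

```yaml
- name: Start migration in the background
  ansible.builtin.command: /opt/migrate.sh
  async: 3600
  poll: 0                      # return immediately; don't wait
  register: migration_job

- name: Poll until the migration finishes
  ansible.builtin.async_status:
    jid: "{{ migration_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 120
  delay: 30
```

This pattern frees the connection while long work runs, which also pairs well with `strategy: free`.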
Part 16: Molecule — Testing Before Production¶
Molecule is Ansible's testing framework. It spins up containers, runs your role, verifies the result, and tears everything down.
Setup¶
# roles/webapp/molecule/default/molecule.yml
driver:
  name: docker
platforms:
  - name: ubuntu-noble
    image: ubuntu:noble
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
  - name: rocky-9
    image: rockylinux:9
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
provisioner:
  name: ansible
verifier:
  name: ansible
# roles/webapp/molecule/default/converge.yml
---
- name: Converge
  hosts: all
  vars:
    app_name: testapp
    app_version: "1.0.0"
  roles:
    - role: webapp
# roles/webapp/molecule/default/verify.yml
---
- name: Verify
  hosts: all
  tasks:
    - name: Check application directory exists
      ansible.builtin.stat:
        path: /opt/testapp
      register: app_dir

    - name: Assert application directory is correct
      ansible.builtin.assert:
        that:
          - app_dir.stat.exists
          - app_dir.stat.isdir

    - name: Check nginx config is valid
      ansible.builtin.command: nginx -t
      changed_when: false
      become: true
Running Molecule¶
# Full test cycle (create → converge → idempotence → verify → destroy)
molecule test
# Just apply the role (leave containers running for debugging)
molecule converge
# Run verification
molecule verify
# SSH into a test container to poke around
molecule login -h ubuntu-noble
# Destroy test containers
molecule destroy
The Idempotence Check¶
The molecule test sequence includes an idempotence check — it runs the playbook twice and fails if anything reports "changed" on the second run. This catches non-idempotent tasks.
# The idempotence check catches this:
# TASK [Create database] ********************************
# changed: [ubuntu-noble] # First run: OK
# changed: [ubuntu-noble] # Second run: NOT OK — should be "ok"
Molecule's full test sequence: dependency → lint → cleanup → destroy → syntax → create → prepare → converge → idempotence → verify → cleanup → destroy. The idempotence step is what separates "it runs" from "it's production-ready."
ansible-lint — Static Analysis for Playbooks¶
ansible-lint catches bad practices, deprecated syntax, and style issues before you run anything:
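Typical usage, run from the repository root:

```
pip install ansible-lint

# Lint the whole project, or a single playbook
ansible-lint
ansible-lint playbooks/rolling-upgrade.yml
```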
It flags common issues: missing FQCNs, bare variables in when: clauses, missing name: on tasks, and deprecated module usage. Treat it like shellcheck for Ansible — run it in CI and fix what it flags.
Execution Environments, Builder, and Navigator¶
As your Ansible footprint grows, "works on my machine" becomes a real problem — different engineers have different Python versions, different collection versions, and different system libraries. Execution Environments (EEs) solve this by packaging the entire controller-side runtime (Python, ansible-core, collections, Python dependencies) into a container image.
| Tool | Purpose |
|---|---|
| `ansible-builder` | Creates EE container images from a definition file |
| `ansible-navigator` | Modern CLI/TUI for running playbooks inside EEs (replaces `ansible-playbook` for EE workflows) |
pip install ansible-navigator ansible-builder
# Build an EE from a definition
ansible-builder build --tag my-ee:latest
# Run a playbook inside an EE
ansible-navigator run site.yml -i inventory.yml --execution-environment-image my-ee:latest
Think of EEs as "Docker for your Ansible controller." The playbooks and inventory stay on your filesystem; the runtime (Python, modules, collections, libraries) lives in the container. This guarantees every engineer and CI runner uses the exact same toolchain.
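A minimal `ansible-builder` definition file might look like this (version 3 schema; the base image and dependency names are illustrative):

```yaml
# execution-environment.yml
version: 3
images:
  base_image:
    name: quay.io/ansible/ansible-runner:latest   # illustrative base image
dependencies:
  galaxy:
    collections:
      - amazon.aws
      - community.general
  python:
    - boto3          # Python libraries the collections need
  system:
    - git            # OS packages installed into the image
```

`ansible-builder build --tag my-ee:latest` reads this file and produces the container image.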
Part 17: Ansible Galaxy and Collections¶
Ansible Galaxy¶
Galaxy is the community hub for sharing roles and collections.
# Install a role from Galaxy
ansible-galaxy install geerlingguy.nginx
# Install from a requirements file (pinned versions)
ansible-galaxy install -r requirements.yml
# requirements.yml
roles:
  - name: geerlingguy.nginx
    version: "3.1.0"
  - name: geerlingguy.docker
    version: "6.0.0"
Collections¶
Collections are the modern packaging format — they bundle roles, modules, and plugins together.
# Install a collection
ansible-galaxy collection install amazon.aws
ansible-galaxy collection install community.general
# Use fully qualified collection names in playbooks
- name: Install package
  ansible.builtin.apt:
    name: nginx
    state: present
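Collections can be pinned in the same `requirements.yml`, under a `collections:` key (the version specifiers here are illustrative):

```yaml
# requirements.yml
collections:
  - name: amazon.aws
    version: ">=9.0.0"     # illustrative version spec
  - name: community.general
roles:
  - name: geerlingguy.nginx
    version: "3.1.0"
```

A single `ansible-galaxy install -r requirements.yml` then resolves both sections.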
Trivia: Ansible Galaxy launched in 2013 with about 200 roles. By 2024, it hosted over 40,000 roles and collections, making it one of the largest repositories of reusable infrastructure code.
Part 18: Tower/AWX — Centralized Automation¶
For a team of one, ansible-playbook on your laptop works fine. For a team of ten, you need centralized execution, RBAC, audit trails, and scheduling. That's Tower (commercial, now Ansible Automation Platform) or AWX (free upstream).
| Feature | CLI (`ansible-playbook`) | Tower/AWX |
|---|---|---|
| Execution | Your laptop/CI runner | Centralized server |
| RBAC | None (SSH key = full access) | Role-based access per project, inventory, credential |
| Audit trail | Shell history, maybe CI logs | Full job log with who, when, what, and diff |
| Scheduling | Cron job | Built-in scheduler with dependencies |
| Credentials | Files on disk / env vars | Encrypted credential store with access control |
| API | None (wrap in scripts) | Full REST API for integration |
| Cost | Free | AWX = free, Tower (AAP) = Red Hat subscription |
Trivia: Red Hat open-sourced AWX in 2017 as the upstream for Ansible Tower. This was unusual — they essentially gave away the code for a commercial product. The strategy mirrors Red Hat's Fedora/RHEL model: free upstream builds community, paid product adds support and certification.
Part 19: Ansible vs Terraform vs Helm¶
| Tool | What It Manages | How It Works | State |
|---|---|---|---|
| Terraform | Cloud infrastructure (VPCs, instances, databases, DNS) | Declarative: "I want 3 servers" → Terraform figures out how | Explicit state file |
| Ansible | Server configuration (packages, files, services, users) | Procedural + declarative: tasks in order on hosts | Stateless (checks each run) |
| Helm | Kubernetes workloads (Deployments, Services, ConfigMaps) | Declarative: templates K8s YAML | Release history in K8s secrets |
The simple rule: Terraform builds the house. Ansible furnishes it. Helm runs the apps.
Terraform: Creates the VPC, subnets, EC2 instances, RDS database, S3 buckets
↓
Ansible: Configures the EC2 instances (packages, users, sshd, monitoring)
↓
Helm: Deploys applications to the Kubernetes cluster
Common pattern:
# Terraform creates infra and generates inventory
terraform apply
terraform output -json | ./generate_inventory.py > inventory.yml
# Ansible configures it
ansible-playbook -i inventory.yml site.yml
When NOT to use Ansible: In a pure immutable-infrastructure/Kubernetes world, Ansible is less common (Helm and operators handle config). Ansible remains essential for node bootstrapping, bare-metal, network devices, and legacy systems.
Part 20: Common Production Patterns¶
Bootstrap Pattern — Provisioning Bare Servers¶
- name: Bootstrap new servers
  hosts: "{{ target }}"
  gather_facts: false   # Python might not be installed yet
  tasks:
    - name: Install Python (raw — no Python required)
      ansible.builtin.raw: |
        if command -v apt-get >/dev/null 2>&1; then
          apt-get update && apt-get install -y python3
        elif command -v dnf >/dev/null 2>&1; then
          dnf install -y python3
        fi
      changed_when: true

    - name: Now gather facts
      ansible.builtin.setup:

    - name: Run common role
      ansible.builtin.include_role:
        name: common
Under the Hood: The `raw` module doesn't require Python on the target — it sends commands over SSH directly. This is why `gather_facts: false` is mandatory here: the `setup` module (which gathers facts) is a Python module and would fail before Python is installed.
Upgrade Pattern¶
# Dry run first — always
ansible-playbook playbooks/rolling-upgrade.yml \
-i inventory/aws_ec2.yml \
-e target_version=2.1.0 \
--check --diff
# Then for real
ansible-playbook playbooks/rolling-upgrade.yml \
-i inventory/aws_ec2.yml \
-e target_version=2.1.0
Rollback Pattern¶
# Rolling back is just deploying the old version
ansible-playbook playbooks/rolling-upgrade.yml \
-i inventory/aws_ec2.yml \
-e target_version=2.0.0
Because the playbook is idempotent, rolling back is just deploying the old version. No special rollback logic needed beyond block/rescue for per-host failures.
K3s Cluster Management Pattern¶
Real-world example — bootstrapping and upgrading a Kubernetes cluster:
devops/ansible/
  ansible.cfg
  inventory/
    hosts.local.yml        # Single-node local
    hosts.example.yml      # Multi-node template
    group_vars/all.yml     # k3s version, etc.
  roles/
    k3s_server/            # Install and configure k3s
    k3s_agent/             # Join agent nodes
    helm/                  # Install Helm binary
    addons/                # Observability stack
  playbooks/
    bootstrap-k3s.yml      # Full cluster bootstrap
    upgrade-k3s.yml        # Rolling k3s upgrade
    install-addons.yml     # Cluster add-ons
# Bootstrap a cluster
ansible-playbook playbooks/bootstrap-k3s.yml
# Rolling upgrade with version override
ansible-playbook playbooks/upgrade-k3s.yml -e k3s_version=v1.31.0+k3s1
The validate Parameter — Free Insurance¶
- name: Copy nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: nginx -t -c %s   # Validates BEFORE writing
  notify: Restart nginx

- name: Update sudoers
  ansible.builtin.template:
    src: sudoers.j2
    dest: /etc/sudoers
    validate: visudo -cf %s    # Validates BEFORE writing

- name: Update sshd config
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    validate: sshd -t -f %s    # Validates BEFORE writing
If validation fails, the original file is untouched. This prevented a production outage where a broken nginx template would have taken down the entire web tier.
Custom Modules¶
When no built-in module exists:
#!/usr/bin/python
# roles/webapp/library/app_health.py
from ansible.module_utils.basic import AnsibleModule
import requests


def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(required=True, type='str'),
            timeout=dict(default=10, type='int'),
        )
    )
    try:
        resp = requests.get(module.params['url'], timeout=module.params['timeout'])
        if resp.status_code == 200:
            module.exit_json(changed=False, status=resp.status_code)
        else:
            module.fail_json(msg=f"Health check returned {resp.status_code}")
    except Exception as e:
        module.fail_json(msg=str(e))


if __name__ == '__main__':
    main()
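Because the module lives in the role's `library/` directory, tasks in that role can call it by filename. A sketch (the endpoint URL is illustrative):

```yaml
- name: Check application health
  app_health:
    url: "http://localhost:8080/healthz"   # illustrative endpoint
    timeout: 5
  register: health_result

- name: Show health status
  ansible.builtin.debug:
    msg: "Health endpoint returned {{ health_result.status }}"
```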
Part 21: Footguns — Mistakes That Brick Servers¶
These are the mistakes that experienced Ansible users have made (often more than once). Learn from their pain.
1. Running Against all When You Meant One Host¶
You type ansible-playbook site.yml without --limit. It runs against every host in inventory — including production. Your half-tested change is now on 200 servers.
Fix: Always use --limit or --check first. Better: add hosts: "{{ target }}" in playbooks and require the variable: ansible-playbook site.yml -e target=staging.
2. Using shell or command for Everything¶
You write shell: apt-get install nginx instead of apt: name=nginx state=present. It runs every time, isn't idempotent, and takes 20 minutes re-installing packages every run.
Fix: Use native modules. Only use shell/command when there's no module, and add creates: or when: guards.
3. become: true at the Playbook Level¶
You set become: true globally because one task needs root. Now every task runs as root — file copies create root-owned files your app can't read.
Fix: Set become: true on individual tasks that need it, not the whole playbook.
4. Variable Precedence Surprises¶
You define app_port: 8080 in defaults/main.yml, group_vars/all.yml, and -e app_port=9090. Which wins? Extra vars. But the developer who set it in defaults doesn't know someone also set it in group_vars.
Fix: Know the precedence order. Keep variables in one place per scope. Use ansible -m debug -a "var=app_port" hostname to check resolved values.
5. Handlers Not Running After a Failure¶
You change a config file and expect the handler to restart the service. But a later task fails. Handlers don't run if the play fails. Your config changed but the service still runs the old config.
Fix: Use meta: flush_handlers if you need them to run immediately. Don't rely solely on handlers for critical state changes.
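A sketch of forcing handlers to run mid-play, so the restart happens before any later task can fail (assumes a `Restart nginx` handler is defined elsewhere in the play):

```yaml
tasks:
  - name: Deploy nginx config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

  - name: Run pending handlers NOW, not at end of play
    ansible.builtin.meta: flush_handlers

  - name: Later task that might fail
    ansible.builtin.command: /opt/maybe-fails.sh   # illustrative
```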
6. lineinfile Fighting With Itself¶
Two tasks add different lines matching the same regex. They fight each other. Every run shows "changed."
Fix: Use blockinfile for multi-line content. Use template for files you fully manage. lineinfile is for surgical one-line changes only.
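For multi-line content, `blockinfile` manages the whole section as one unit between marker comments, so repeated runs converge instead of fighting:

```yaml
- name: Manage SSH hardening settings as one block
  ansible.builtin.blockinfile:
    path: /etc/ssh/sshd_config
    marker: "# {mark} ANSIBLE MANAGED: hardening"
    block: |
      PermitRootLogin no
      PasswordAuthentication no
```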
7. No --check Before Production¶
You run a playbook on production without --check first. A template typo renders a broken config. The service restarts with broken config and goes down.
Fix: Always run --check --diff first. Review the diffs. Then run for real.
8. Vault Password in Shell History¶
You type ansible-vault encrypt_string 'mypassword' and the password is in your bash history.
Fix: Use --ask-vault-pass or pipe from stdin. Use chmod 600 on vault password files.
9. Forgetting no_log: true on Sensitive Tasks¶
Your playbook prints the database password in stdout. CI logs capture it.
Fix: Add no_log: true to tasks that handle secrets. Review CI output for leaked credentials.
10. Default Inventory Points to Production¶
New team members run ansible-playbook site.yml and hit prod because the default inventory is production.
Fix: Don't set a default inventory that points to production. Require explicit -i inventory/staging.yml.
11. Ignoring Errors Globally¶
ignore_errors: true because a task fails intermittently. Every error is silently swallowed forever.
Fix: Use failed_when with specific conditions instead of blanket ignore_errors.
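A sketch of replacing blanket `ignore_errors` with a precise failure condition (the command and its exit codes are hypothetical):

```yaml
- name: Check replication status
  ansible.builtin.command: /usr/local/bin/repl_status   # hypothetical tool
  register: repl
  # rc 0 = healthy, rc 2 = known transient lag; anything else is a real failure
  failed_when: repl.rc not in [0, 2]
  changed_when: false
```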
Part 22: Real-World Case Studies¶
Case Study 1: Wrong Inventory Hits Production (PM-016)¶
Date: 2025-04-08 | Severity: SEV-3
An engineer ran an NTP configuration playbook against production instead of staging — a copy-paste error from a Slack snippet. The playbook reconfigured 47 production hosts to point NTP at a staging server.
Detection (10 minutes): NTP drift monitoring fired when 12 hosts exceeded 50ms offset. SRE on-call checked chronyc sources -v and immediately saw the staging NTP source.
Resolution (5 minutes): Re-ran the playbook with the correct inventory.
Lessons:
1. Confirmation gates are cheap insurance. A single pause task requiring a human "yes" before touching production hosts costs 10 seconds and prevents the entire class of wrong-inventory mistakes.
2. Alert on configuration identity, not just outcomes. Alerting on "production hosts syncing from a non-production NTP source" would have detected the cause instantly.
3. Shared command snippets are living hazards. A Slack message with a working command becomes an authoritative-looking template. Maintain canonical runbooks, not Slack snippets.
Case Study 2: Playbook Hangs — SSH Agent Forwarding + Firewall¶
Symptom: Ansible playbook hangs on app-server-03 at a Git clone task. Previous servers worked fine.
Investigation trail:
1. DevOps layer: Git clone hangs → SSH agent forwarding? Checked — agent forwarding configured but SSH_AUTH_SOCK empty on new server
2. Linux layer: Sudoers missing env_keep += "SSH_AUTH_SOCK" on new server (provisioning gap)
3. Network layer (actual root cause): New server in restrictive security group — outbound SSH to GitLab subnet blocked. TCP connection to GitLab timed out silently.
Key insight: The symptom was an Ansible playbook hanging (DevOps), the initial investigation pointed to SSH agent forwarding (Linux ops), but the actual root cause was a firewall rule (networking). Ansible playbooks chain multiple SSH hops — agent forwarding, sudo environment, and firewall rules can all cause the same symptom.
Case Study 3: The Anti-Primer — Everything Goes Wrong¶
An ops engineer configuring 200 servers for a new deployment. Deadline pressure. Skips dry-run.
| Hour | Mistake | Consequence |
|---|---|---|
| 0 | Runs without --limit |
Half-tested changes hit entire fleet |
| 1 | shell: apt-get install everywhere |
Non-idempotent; re-runs take 40 minutes |
| 2 | Global become: true |
All files owned by root; app can't read own config |
| 3 | Variable precedence confusion | 3 hours debugging why defaults override doesn't work |
Damage: 2–6 hours of infrastructure instability, 12–24 engineer-hours for remediation, infrastructure team credibility damaged.
Case Study 4: Thinking Out Loud — OpenSSL Patch Rollout¶
A senior SRE rolling out a security patch to 150 servers across 4 environments in 24 hours. Their mental process:
- Assess scope: `ansible all -m shell -a "dpkg -l openssl"` — 128 of 150 need the patch
- Plan rollout order: dev → staging → prod-us → prod-eu (progressive, environment by environment)
- Write playbook with guardrails: `serial: 25%`, `max_fail_percentage: 10`, pre-check to skip already-patched hosts, health checks with retries
- Execute progressively: Test on dev (12 servers), validate manually, then staging, then production with smaller serial count
- Handle the unexpected: One host's health check was slow (app warm-up time > retry window) — adjusted retry timing for remaining servers
- Verify fleet-wide: `ansible all -m shell -a "openssl version" --become -o | sort | uniq -c` — all 150 on correct version
Key heuristics: Progressive rollout, serial + circuit breaker, post-change health verification.
Glossary¶
| Term | Definition | Mnemonic/Context |
|---|---|---|
| Ansible | Agentless automation tool using SSH and YAML | Named after FTL communication device from Le Guin's sci-fi |
| Control node | Machine where Ansible is installed and playbooks run from | Your laptop or CI server |
| Managed node | Target server being configured by Ansible | Needs only Python + SSH |
| Inventory | List of hosts and groups Ansible targets | "Who to manage" |
| Playbook | YAML file defining desired state (contains plays) | "What to do" |
| Play | Maps hosts to tasks within a playbook | One section targeting one group |
| Task | A single action using a module | - name: Install nginx |
| Module | Unit of work (package, service, file, etc.) | 7,000+ available |
| Role | Packaged reusable tasks/templates/vars/handlers | Like a function in code |
| Handler | Delayed action triggered only when notified by a changed task | Runs at end of play, not inline |
| Facts | Auto-discovered host data (OS, IP, CPU, memory) | Gathered by setup module |
| Idempotent | Re-running yields same end state without repeated changes | "Safe to run twice" |
| Become | Privilege escalation (sudo) | -b flag or become: true |
| Vault | Encrypts secrets with AES-256 for safe git storage | ansible-vault encrypt |
| Galaxy | Community hub for sharing roles and collections | 40,000+ roles |
| Collection | Modern package format bundling roles + modules + plugins | ansible-galaxy collection install |
| Tower/AWX | Centralized web UI for Ansible with RBAC and scheduling | AWX = free, Tower = paid |
| Molecule | Testing framework for Ansible roles | Idempotence check is the key feature |
| Jinja2 | Templating engine for dynamic config files | {{ variable }}, {% for %} |
| Serial | Batch size for rolling updates | serial: "25%" or serial: [1, "10%", "25%"] |
| Forks | Number of hosts processed in parallel | Default: 5 (too low for real fleets) |
| Pipelining | SSH optimization — run modules in-process | 2–3x speedup |
| delegate_to | Run task on a different host but keep target's variables | Used for API calls from control node |
| block/rescue/always | Structured error handling (try/catch/finally) | Better than ignore_errors |
| check mode | Dry run — show what would change without changing | --check --diff |
Trivia and History¶
- Created in one weekend. Michael DeHaan wrote the first Ansible prototype (about 1,200 lines of Python) over a single weekend in February 2012. He was frustrated with Puppet and Chef's complexity.
- The name comes from science fiction. "Ansible" is from Ursula K. Le Guin's 1966 novel Rocannon's World — a device for instantaneous communication across any distance. DeHaan chose it because the tool communicates instantly with remote servers.
- Red Hat paid $150 million. Red Hat acquired Ansible Inc. in October 2015, just three years after the project's creation. At the time, Ansible had 1,200 contributors and was the most-starred infrastructure automation project on GitHub.
- SSH by design, not by accident. Unlike Puppet (custom TLS protocol) and Chef (HTTPS), Ansible uses standard SSH. DeHaan argued: if SSH is good enough for sysadmins to manage servers manually, it's good enough for automation.
- The cowsay Easter egg. If you have `cowsay` installed, Ansible randomly renders output through it, producing ASCII cow art. This was intentional — DeHaan believed long automation runs should have levity. Disable with `ANSIBLE_NOCOWS=1`.
- 40,000+ Galaxy roles. Ansible Galaxy launched in 2013 with ~200 roles. By 2024: 40,000+ roles and collections.
- The YAML controversy. DeHaan chose YAML so non-programmers could write automation. Critics argue YAML's whitespace sensitivity causes subtle bugs. Supporters maintain it kept Ansible accessible to sysadmins who would never learn Ruby (Puppet/Chef's DSL).
- Windows support via WinRM. Ansible added Windows support in version 1.7 (2014) using WinRM instead of SSH. Today it has 200+ Windows-specific modules.
- Idempotency isn't guaranteed. `shell` and `command` modules are explicitly not idempotent. A 2019 study found ~18% of community roles contained non-idempotent tasks.
- DeHaan left after the acquisition. Michael DeHaan stepped back from the project shortly after the Red Hat acquisition in 2015. He later expressed mixed feelings about the increasing complexity of Tower compared to his original vision of radical simplicity.
- AWX: open-sourcing your own paid product. In 2017, Red Hat open-sourced AWX (upstream of Ansible Tower). This followed their Fedora/RHEL model: free upstream grows the ecosystem, paid product adds support.
Flashcard Review¶
Foundations¶
| Q | A |
|---|---|
| What is Ansible (one line)? | Agentless automation over SSH using YAML playbooks |
| What does Ansible require on managed nodes? | Python and SSH — no agent needed |
| What is idempotency? | Re-running produces the same end state without repeated changes |
| What is an inventory? | The list/grouping of hosts Ansible targets (static or dynamic) |
| What is a playbook? | YAML file defining desired state; contains plays and tasks |
| What is a module? | Unit of work (package, service, file, etc.) — 7,000+ available |
| What is a handler? | Delayed action that runs at end of play, only when notified by a changed task |
| What is a role? | Packaged reusable tasks/templates/vars/handlers |
| What are facts? | Auto-gathered host info (OS, IP, CPU) for conditional logic |
| Play vs task vs role? | Play targets hosts; tasks are steps; roles package reusable content |
Variables and Precedence¶
| Q | A |
|---|---|
| How many levels of variable precedence exist? | 22 |
| What always wins in variable precedence? | Extra vars (-e on command line) |
| Role `vars/main.yml` vs play `vars:` — which wins? | Role vars (precedence 15) beats play vars (precedence 12) |
| When do you use `defaults/` vs `vars/` in a role? | `defaults/` = overridable knobs (low precedence). `vars/` = constants (high precedence) |
| How do you debug which value a variable has? | ansible -m debug -a "var=my_variable" hostname |
Operations¶
| Q | A |
|---|---|
| What does --check --diff do? | Preview what WOULD change without changing anything; shows file diffs |
| What does serial: [1, "10%", "25%"] do? | Graduated rollout: 1 canary, then 10% batches, then 25% batches |
| What does max_fail_percentage do? | Stops the entire play if too many hosts fail (circuit breaker) |
| When you use delegate_to: localhost, whose variables does the task see? | The target host's variables — delegate_to changes execution location, not variable context |
| What's the difference between block/rescue and ignore_errors? | block/rescue is structured try/catch; ignore_errors silently swallows ALL errors |
| When do handlers run? | At end of play, only if the notifying task reported "changed" |
| What does meta: flush_handlers do? | Forces handlers to run immediately instead of waiting for end of play |
| Handler not firing — most common cause? | Task reports "ok" (config already matches), or handler name was changed but notify: wasn't updated |
| What does validate: nginx -t -c %s do on a template task? | Validates the config BEFORE writing; if validation fails, the original file is untouched |
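A minimal play sketch combining several of these controls (the host group, paths, and thresholds are illustrative):

```yaml
- name: Rolling nginx config update
  hosts: webservers
  serial: [1, "10%", "25%"]    # 1 canary, then 10% batches, then 25% batches
  max_fail_percentage: 20      # abort the play if more than 20% of a batch fails
  become: true
  tasks:
    - name: Deploy validated config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        validate: nginx -t -c %s   # original file untouched if validation fails
      notify: Reload nginx
  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```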
Security¶
| Q | A |
|---|---|
| What encryption does Ansible Vault use? | AES-256-CTR with HMAC-SHA256 and PBKDF2 key stretching |
| What is the vault/vars split pattern? | Encrypted vault.yml holds values; plaintext vars.yml provides names that reference vault vars |
| How do you prevent secret leaks in playbook output? | no_log: true on tasks that handle secrets |
| How do you avoid vault passwords in shell history? | Pipe secrets to ansible-vault encrypt_string --stdin-name instead of passing on command line |
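The vault/vars split pattern might look like this (file layout and variable names are illustrative):

```yaml
# group_vars/production/vars.yml — plaintext, so the name stays greppable
db_password: "{{ vault_db_password }}"

# group_vars/production/vault.yml — encrypted with ansible-vault
vault_db_password: "s3cr3t"

# Task using the secret without leaking it to logs (script path is hypothetical)
- name: Set database password
  ansible.builtin.command: /usr/local/bin/set-db-pass "{{ db_password }}"
  no_log: true
```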
Debugging¶
| Q | A |
|---|---|
| -v vs -vv vs -vvv vs -vvvv? | Results → input params → SSH commands → full SSH protocol debug |
| How do you resume after a failure? | --start-at-task="Task Name" or --limit @site.retry |
| Task reports "changed" every run — why? | Not idempotent. Use modules instead of shell/command, or add creates:/changed_when: |
| How do you test idempotency? | Run playbook twice; second run should show 0 changed |
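Two common fixes for a task that reports "changed" on every run (the paths and commands are illustrative):

```yaml
# Option 1: tell Ansible when the command is a no-op
- name: Generate TLS key once
  ansible.builtin.command: openssl genrsa -out /etc/ssl/app.key 2048
  args:
    creates: /etc/ssl/app.key   # task is skipped entirely if the file exists

# Option 2: derive changed status from the command's output
- name: Sync config
  ansible.builtin.command: /usr/local/bin/sync-config
  register: sync_result
  changed_when: "'updated' in sync_result.stdout"
```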
Performance¶
| Q | A |
|---|---|
| Default forks value? | 5 (too low for any real fleet) |
| What does pipelining = True do? | Sends module code over the open SSH connection instead of copying temp files to the target; 2–3x faster per task |
| What does ControlPersist=60s do? | Reuses SSH connections for 60 seconds; fewer handshakes |
| How do you speed up fact gathering? | gathering = smart with fact caching (jsonfile or Redis) |
| What is Mitogen? | Drop-in Ansible plugin for 2–7x speedup |
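Putting these settings together in ansible.cfg (the values are reasonable starting points, not universal tuning advice):

```ini
[defaults]
forks = 30                      # up from the default of 5
gathering = smart               # skip re-gathering facts when cached
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400    # seconds

[ssh_connection]
pipelining = True               # send module code over the open connection
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```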
Architecture¶
| Q | A |
|---|---|
| Terraform vs Ansible vs Helm? | Terraform builds infrastructure. Ansible configures servers. Helm deploys K8s apps |
| When NOT to use Ansible? | Pure immutable infrastructure / K8s-only environments (use Helm/operators instead) |
| What is include_role vs import_role? | include is dynamic (runtime); import is static (parse time). Affects tag/condition propagation |
| What is the raw module for? | Running commands when Python isn't installed on the target (bootstrap scenario) |
| What is Ansible Galaxy? | Community hub for sharing roles and collections (40,000+ roles) |
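The bootstrap scenario for `raw` looks roughly like this (a sketch assuming a Debian/Ubuntu target; the package command varies by distro):

```yaml
- name: Bootstrap Python, then proceed normally
  hosts: new_servers
  become: true
  gather_facts: false          # fact gathering needs Python, so defer it
  tasks:
    - name: Install Python with raw (no Python required on target)
      ansible.builtin.raw: apt-get update && apt-get install -y python3

    - name: Now gather facts the normal way
      ansible.builtin.setup:
```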
Drills¶
Drill 1: Ad-Hoc Commands (Easy)¶
Q: Check disk usage on all web servers and restart nginx using ad-hoc commands.
Answer
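A sketch of the ad-hoc commands (assumes a `webservers` inventory group):

```bash
# Check disk usage on all web servers
ansible webservers -m command -a "df -h"

# Restart nginx (needs sudo, hence -b)
ansible webservers -m service -a "name=nginx state=restarted" -b
```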
`-b` = become (sudo). `-m` = module. `-a` = arguments.
Drill 2: Write a Basic Playbook (Easy)¶
Q: Write a playbook that installs nginx, templates a config file, and ensures the service is running.
Answer
```yaml
---
- name: Configure web servers
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Copy nginx config
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```
Drill 3: Variable Precedence (Easy)¶
Q: Where can variables be defined? What's the simplified precedence order?
Answer
Sources (low → high):
1. Role defaults (`roles/x/defaults/main.yml`)
2. Inventory vars (`group_vars/`, `host_vars/`)
3. Playbook vars (`vars:` section)
4. Role vars (`roles/x/vars/main.yml`) — higher than playbook vars!
5. Task vars
6. Extra vars (`-e`) — **always wins**

Drill 4: Create a Role (Medium)¶
Q: Create a role structure for PostgreSQL. What goes in each directory?
Answer
```text
roles/postgresql/
├── defaults/main.yml    # Default variables (lowest precedence)
├── tasks/main.yml       # Main task list
├── handlers/main.yml    # Handlers (restart, reload)
├── templates/           # Jinja2 templates (.j2)
│   └── postgresql.conf.j2
├── files/               # Static files to copy
├── vars/main.yml        # Internal constants (higher precedence)
├── meta/main.yml        # Dependencies, metadata
└── README.md
```

```yaml
# defaults/main.yml
postgresql_version: "15"
postgresql_max_connections: 100
```

```yaml
# tasks/main.yml
- name: Install PostgreSQL
  ansible.builtin.apt:
    name: "postgresql-{{ postgresql_version }}"
    state: present

- name: Configure PostgreSQL
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: "/etc/postgresql/{{ postgresql_version }}/main/postgresql.conf"
  notify: Restart PostgreSQL
```
Drill 5: Jinja2 Template (Medium)¶
Q: Write a Jinja2 template for an nginx upstream config that dynamically lists all hosts in the app group.
Answer
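One way the template could look (the `app` group name comes from the question; the `app_port` variable and its default are assumptions):

```jinja
upstream app_backend {
{% for host in groups['app'] %}
    server {{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ app_port | default(8080) }};
{% endfor %}
}
```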
Key patterns: `{{ variable }}` (output), `{% for %}` (loop), `{{ var | default() }}` (filter).
Drill 6: Fix the Idempotency Bug (Medium)¶
Q: This task runs on every play and reports "changed" every time. Fix it.
```yaml
- name: Add line to config
  ansible.builtin.shell: echo "max_connections = 200" >> /etc/postgresql/postgresql.conf
```
Answer
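A fixed version using `lineinfile` (a sketch; the regexp anchors on the setting name so an existing commented or stale value is replaced rather than duplicated):

```yaml
- name: Set max_connections
  ansible.builtin.lineinfile:
    path: /etc/postgresql/postgresql.conf
    regexp: '^#?max_connections\s*='
    line: "max_connections = 200"
```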
The `shell` module runs every time and appends duplicate lines. `lineinfile` checks whether the line already matches before changing anything.
Drill 7: Vault Operations (Medium)¶
Q: Encrypt a variable file, then use it in a playbook run.
Answer
```bash
# Encrypt a file
ansible-vault encrypt group_vars/production/secrets.yml

# Edit encrypted file
ansible-vault edit group_vars/production/secrets.yml

# Encrypt a single string (safely, no shell history)
echo -n 'hunter2' | ansible-vault encrypt_string --stdin-name 'db_password'

# Run playbook with vault
ansible-playbook site.yml --ask-vault-pass
ansible-playbook site.yml --vault-password-file=~/.vault_pass
```
Drill 8: Conditionals and Loops (Medium)¶
Q: Install different packages based on OS family. Create users from a list.
Answer
```yaml
# Conditional
- name: Install packages (Debian)
  ansible.builtin.apt:
    name: [nginx, curl, htop]
    state: present
  when: ansible_os_family == "Debian"

- name: Install packages (RedHat)
  ansible.builtin.yum:
    name: [nginx, curl, htop]
    state: present
  when: ansible_os_family == "RedHat"

# Loop with dict
- name: Create users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    shell: /bin/bash
  loop:
    - { name: alice, groups: "sudo,docker" }
    - { name: bob, groups: "docker" }
    - { name: carol, groups: "sudo" }
```
Drill 9: Error Handling (Medium)¶
Q: Write a task that gracefully handles a missing legacy service, and a block/rescue pattern for deployment with rollback.
Answer
```yaml
# Graceful handling
- name: Check if service exists
  ansible.builtin.command: systemctl status legacy-app
  register: result
  ignore_errors: true

- name: Stop legacy app if it exists
  ansible.builtin.service:
    name: legacy-app
    state: stopped
  when: result.rc == 0

# Block/rescue (try/catch)
- block:
    - name: Deploy application
      ansible.builtin.command: /opt/deploy.sh
  rescue:
    - name: Rollback on failure
      ansible.builtin.command: /opt/rollback.sh
  always:
    - name: Send notification
      ansible.builtin.debug:
        msg: "Deploy attempt complete"
```
Drill 10: Ansible vs Terraform (Easy)¶
Q: When do you use Ansible vs Terraform? Can they work together?
Answer
| Aspect | Terraform | Ansible |
|--------|-----------|---------|
| Purpose | Provision infrastructure | Configure servers |
| State | Stateful (state file) | Stateless |
| Best for | Cloud resources, networking | Packages, config, services |

**Together:** Terraform provisions VMs → outputs IPs → Ansible configures them.

Cheat Sheet¶
Commands¶
```bash
# Ad-hoc
ansible all -m ping
ansible webservers -m command -a "uptime"
ansible dbservers -m service -a "name=postgresql state=restarted" -b

# Playbook execution
ansible-playbook site.yml -i inventory.yml
ansible-playbook site.yml --check --diff              # Dry run
ansible-playbook site.yml --limit web1                # One host
ansible-playbook site.yml --tags "nginx"              # Specific tags
ansible-playbook site.yml -e "var=value"              # Override var
ansible-playbook site.yml --step                      # Interactive
ansible-playbook site.yml --start-at-task "Name"      # Resume

# Debugging
ansible-playbook site.yml -v/-vv/-vvv/-vvvv
ansible -m debug -a "var=hostvars[inventory_hostname]" host
ansible-playbook site.yml --syntax-check
ansible-playbook site.yml --list-hosts/--list-tasks/--list-tags

# Vault
ansible-vault create/encrypt/edit/view/decrypt/rekey file.yml
ansible-vault encrypt_string 'secret' --name 'var_name'
ansible-playbook site.yml --ask-vault-pass
ansible-playbook site.yml --vault-password-file=~/.vault_pass

# Galaxy
ansible-galaxy install -r requirements.yml
ansible-galaxy collection install amazon.aws
ansible-galaxy init roles/myrole

# Inventory
ansible-inventory -i inventory/ --graph
ansible-inventory -i inventory/ --list

# Molecule
molecule test                  # Full cycle
molecule converge              # Apply only
molecule verify                # Verify only
molecule login -h ubuntu-noble # SSH into container
```
Key Concepts Quick Reference¶
| Concept | Remember |
|---|---|
| Variable precedence | defaults/ = bottom, -e = top, vars/ = high |
| Handlers | Run at end of play, not after notifying task |
| serial | Batch size for rolling updates |
| max_fail_percentage | Circuit breaker — stop rollout if too many fail |
| delegate_to | Changes where the task runs, not whose variables it sees |
| raw module | Runs without Python on the target (bootstrap scenario) |
| Vault encryption | AES-256-CTR with PBKDF2 key stretching |
| Idempotency test | Run twice — second run should show 0 changed |
| Check mode + diff | Non-negotiable before any production run |
| Forks default | 5 (increase to 20–50 for real fleets) |
| Pipelining | 2–3x speedup, enable in ansible.cfg |
Self-Assessment¶
Rate yourself on each area. If you can't explain it to someone else, revisit that section.
Core Concepts¶
- I can explain what Ansible is and how it differs from Chef/Puppet in one sentence
- I understand the control node → SSH → module → execute → report mental model
- I can define: inventory, playbook, module, idempotent, handler, facts, role
- I know when to use `shell`/`command` vs native modules (and why it matters)
Inventory and Targeting¶
- I can write static inventory in both INI and YAML format
- I understand group_vars, host_vars, and the `[group:children]` syntax
- I know what dynamic inventory is and when to use it
- I understand `--limit` and tags for controlling blast radius
Variables and Templating¶
- I can explain the simplified variable precedence (defaults → inventory → play → role vars → extra vars)
- I know why `defaults/` vs `vars/` matters in roles
- I can write Jinja2 templates with loops, conditionals, and filters
- I know the `{{ }}` quoting rule in YAML
Secrets¶
- I can encrypt and decrypt files/strings with ansible-vault
- I understand the vault/vars split pattern
- I know how to prevent secret leaks (`no_log: true`, avoid CLI args)
Operations¶
- I can write a rolling update playbook with `serial` and `max_fail_percentage`
- I understand handlers, when they fire, and when they don't
- I can use `--check --diff` to preview changes before production runs
- I know how to debug with `-v` through `-vvvv`
- I can use `block`/`rescue`/`always` for error handling
Performance and Testing¶
- I know how to tune ansible.cfg for large fleets (forks, pipelining, fact caching)
- I can set up Molecule for role testing
- I understand the idempotence check and why it matters
Related Lessons¶
- Ansible: From Playbook to Production — Mission-driven deep dive with AWS dynamic inventory and Molecule
- Ansible Playbook Debugging — Focused on `--check`, `--diff`, verbosity, and variable inspection
- Terraform vs Ansible vs Helm — Detailed tool selection guide with boundary lines
- Secrets Management Without Tears — Vault, HashiCorp Vault, AWS Secrets Manager tradeoffs