Ansible - The Complete Guide (Revised, Current, Production-Focused)¶
Audience: Linux/sysadmin, platform, cloud, and operations engineers
Scope: Ansible fundamentals through production operating patterns
Style: opinionated, practical, current-era ansible-core ecosystem
Goal: get from "I know the buzzwords" to "I can ship safe automation without bricking fleets"
What changed in this revision¶
This rewrite keeps the strong operator-grade material from the original and removes or fixes the parts that would age badly or mislead people.
Fixed¶
- outdated ansible-galaxy command examples
- outdated YAML callback guidance
- incorrect retry-file assumption
- overly absolute statements about ignore_errors and check mode
- dated Tower-only framing
- unsafe/contradictory Vault examples
- weak trivia, flashcards, and unsourced case-study filler
Added¶
- ansible vs ansible-core
- Execution Environments (EE)
- ansible-builder
- ansible-navigator
- ansible-lint
- FQCN guidance
- modern CI/CD validation flow
- safer examples using ansible_facts instead of injected fact variables
Table of Contents¶
- What Ansible Is
- Pick the Right Package and Install It
- Project Layout That Does Not Rot
- Inventory
- Playbooks
- Modules: Declarative First, Imperative Last
- Variables, Facts, and Precedence
- Templates and Handlers
- Roles and Collections
- Secrets and Vault
- Conditionals, Loops, and Error Handling
- Rolling Changes and Zero-Downtime Thinking
- Debugging
- Performance at Scale
- Testing and CI/CD
- Execution Environments, Builder, and Navigator
- Automation Controller and AWX
- Ansible vs Terraform vs Helm
- Common Production Patterns
- Footguns
- Dense Cheat Sheet
- Reference Links
1. What Ansible Is¶
Ansible is agentless automation. You run automation from a control node, Ansible connects to targets, executes modules, and converges them toward the desired state.
The mental model¶
Control node
|
| SSH / WinRM / PSRP / API / network transport
v
Managed node or device
|
v
Module runs -> returns changed/ok/failed + structured result
The five things that matter¶
| Principle | Meaning |
|---|---|
| Agentless | Usually nothing to install on Linux targets beyond what the module needs |
| Idempotent | Re-running should not keep changing things |
| Declarative | Ask for a state, not a sequence of shell commands |
| Push-based | You initiate changes from a control node |
| Extensible | Core + collections + plugins + inventories + callbacks |
Where Ansible shines¶
- fleet configuration
- package/service/file management
- OS and middleware standardization
- rolling deployments
- orchestration around APIs, network devices, cloud, and VMs
- “glue” automation across systems
Where Ansible is not magic¶
- It is not a general replacement for software engineering.
- It is not the best tool for full-blown resource graph provisioning.
- It is not fast if you write everything as shell and re-gather facts every five seconds.
- It does not make dangerous ideas safe just because they are YAML.
2. Pick the Right Package and Install It¶
Modern Ansible is not one monolith anymore.
ansible vs ansible-core¶
| Package | What you get | Use when |
|---|---|---|
| ansible-core | engine, CLI, builtin content, plugin framework | minimal, controlled environments |
| ansible | ansible-core plus a curated set of community collections | easier all-in-one workstation install |
Practical recommendation¶
- Use ansible-core when you want explicit dependencies and clean reproducibility.
- Use ansible when you want a batteries-included learning/workstation setup.
- For team-scale reproducibility, move toward Execution Environments.
Install examples¶
# Preferred for isolated workstation installs
pipx install ansible-core
# Or the broader package
pipx install ansible
# Verify
ansible --version
ansible-config dump --only-changed
Control node and target reality¶
- Control node support changes over time. Match your documentation to your installed ansible-core version.
- POSIX targets usually need Python for most modules.
- Some targets are exceptions. Network devices often do not need remote Python because modules use other transports.
- Windows is a different world: WinRM and PSRP are the common transports, and current Ansible also supports SSH for newer Windows/OpenSSH combinations.
First ad-hoc commands¶
ansible all -i inventory/hosts.yml -m ansible.builtin.ping
ansible web -i inventory/hosts.yml -m ansible.builtin.command -a 'uptime'
ansible web -i inventory/hosts.yml -b -m ansible.builtin.package -a 'name=nginx state=present'
ansible web -i inventory/hosts.yml -b -m ansible.builtin.service -a 'name=nginx state=started enabled=true'
Ad-hoc rule of thumb¶
Ad-hoc commands are for:
- quick inspection
- one-off safe changes
- triage
If you might ever run it twice, it probably wants to be a playbook.
3. Project Layout That Does Not Rot¶
A sane layout buys you more than clever YAML ever will.
ansible/
├── ansible.cfg
├── inventory/
│ ├── hosts.yml
│ ├── group_vars/
│ └── host_vars/
├── playbooks/
│ ├── site.yml
│ ├── web.yml
│ └── db.yml
├── roles/
│ ├── common/
│ └── nginx/
├── collections/
│ └── requirements.yml
├── roles/requirements.yml
├── files/
├── templates/
├── molecule/
├── .ansible-lint
└── README.md
Minimal ansible.cfg¶
[defaults]
inventory = inventory/hosts.yml
stdout_callback = default
callback_result_format = yaml
bin_ansible_callbacks = True
interpreter_python = auto_silent
host_key_checking = True
retry_files_enabled = False
forks = 20
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .cache/facts
roles_path = roles
[ssh_connection]
pipelining = True
Why this config¶
- callback_result_format = yaml replaces old callback hacks.
- retry_files_enabled = False matches modern defaults and avoids stale .retry habits.
- interpreter_python = auto_silent uses interpreter discovery without noisy warnings.
- pipelining and fact caching improve speed.
4. Inventory¶
Inventory answers two questions:
- Which hosts exist?
- What do we know about them?
Static inventory example¶
all:
  children:
    web:
      hosts:
        web1:
          ansible_host: 10.0.10.11
        web2:
          ansible_host: 10.0.10.12
      vars:
        app_env: production
        http_port: 8080
    db:
      hosts:
        db1:
          ansible_host: 10.0.20.11
Useful inventory variables¶
| Variable | Purpose |
|---|---|
| ansible_host | actual address to connect to |
| ansible_user | remote username |
| ansible_port | non-default SSH port |
| ansible_connection | ssh, winrm, psrp, local, network_cli, etc. |
| ansible_python_interpreter | force a target Python path |
| ansible_become | privilege escalation default |
Dynamic inventory¶
Use inventory plugins when the source of truth is elsewhere.
Common examples:
- AWS EC2
- VMware
- OpenStack
- Kubernetes
- constructed inventories from metadata/tags
Example:
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
filters:
  instance-state-name: running
Good inventory habits¶
- Put environment-specific data in group_vars/ and host_vars/.
- Keep inventory names stable even if IPs change.
- Group by function and environment.
- Do not bury secrets in inventory.
- Use ansible-inventory --graph and --list constantly.
5. Playbooks¶
A playbook is one or more plays. A play maps hosts to tasks.
Minimal playbook¶
---
- name: Configure web servers
  hosts: web
  become: true
  gather_facts: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Ensure nginx is enabled and started
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
The important play-level knobs¶
| Key | Why it matters |
|---|---|
| hosts | target scope |
| become | privilege escalation |
| gather_facts | speed vs convenience |
| serial | rolling change size |
| vars | play-scoped variables |
| pre_tasks / post_tasks | guardrails and cleanup |
| handlers | delayed actions triggered by changes |
| strategy | lockstep or freer execution |
Tagging¶
tasks:
  - name: Install packages
    ansible.builtin.package:
      name: nginx
      state: present
    tags: [packages]
  - name: Push config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: restart nginx
    tags: [config]
Run only what you need:
ansible-playbook playbooks/web.yml --tags config
ansible-playbook playbooks/web.yml --skip-tags packages
ansible-playbook playbooks/web.yml --list-tags
ansible-playbook playbooks/web.yml --list-tasks
6. Modules: Declarative First, Imperative Last¶
The single biggest quality divider in Ansible code is this:
Use a purpose-built module when one exists. Reach for command or shell only when you must.
Good¶
- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present
- name: Drop config file
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: '0644'
  notify: restart nginx
- name: Ensure service is up
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
Bad unless you truly need it¶
- name: Do everything with shell because reasons
  ansible.builtin.shell: |
    apt-get update
    apt-get install -y nginx
    systemctl enable --now nginx
That second example is how YAML cosplay turns into pager duty.
command vs shell¶
| Module | Use when | Avoid when |
|---|---|---|
| ansible.builtin.command | run a simple command safely | you need pipes, redirects, shell expansion |
| ansible.builtin.shell | you truly need shell features | a normal module exists |
Make imperative tasks less stupid¶
- name: Initialize app database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized
That gives command a guardrail and partial check-mode usefulness.
Modules worth knowing cold¶
| Category | Modules |
|---|---|
| packages | package, apt, dnf, yum, pip |
| services | service, systemd_service |
| files | copy, template, file, lineinfile, replace, assemble |
| users | user, group, authorized_key |
| commands | command, shell, script, raw |
| control flow | include_tasks, import_tasks, include_role, import_role, meta |
| validation | assert, wait_for, uri, stat, slurp |
| orchestration | delegate_to, set_fact, add_host, group_by |
Use FQCNs¶
Prefer fully qualified collection names such as ansible.builtin.package, not short names such as package.
Why:
- clearer provenance
- fewer name collisions
- better linting
- easier documentation lookup
7. Variables, Facts, and Precedence¶
This is where many playbooks become haunted.
The practical precedence rule¶
There is a long official precedence chain. In real life, remember this order:
| Usually lower -> higher | Typical use |
|---|---|
| role defaults | safe knobs meant to be overridden |
| inventory vars / group_vars / host_vars | environment-specific data |
| play vars | local overrides for one play |
| task vars / include vars | narrow-scope overrides |
| registered vars / set_fact | runtime data |
| extra vars (-e) | explicit operator override; wins hard |
defaults/ vs vars/ in roles¶
| Path | Use for | Avoid putting here |
|---|---|---|
| defaults/main.yml | values users should tune | hard constants |
| vars/main.yml | truly internal constants you do not expect callers to override | ports, versions, feature flags, env-specific values |
If operators might need to change it, it belongs in defaults/, not vars/.
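As a rough illustration of that split, the sketch below shows what the two files might hold for the nginx role used elsewhere in this guide. The values and the names nginx_conf_dir and nginx_service_name are hypothetical, chosen only to show the pattern.

```yaml
---
# roles/nginx/defaults/main.yml
# Tunable knobs: lowest precedence, so inventory or play vars can override them.
nginx_listen_port: 8080
nginx_worker_connections: 1024   # hypothetical default value
---
# roles/nginx/vars/main.yml
# Internal constants: these outrank inventory variables, so callers cannot
# easily override them. Both names are hypothetical.
nginx_conf_dir: /etc/nginx
nginx_service_name: nginx
```

If nginx_listen_port lived in vars/main.yml instead, a group_vars override would silently lose, which is the "supernatural" precedence bug described later under Footguns.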
Facts¶
Facts are discovered target data.
Use ansible_facts[...] explicitly. It is clearer and more future-proof than relying on injected top-level fact variables.
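A minimal sketch of the explicit form, assuming facts have been gathered for the host. The task is illustrative and not part of any role in this guide.

```yaml
- name: Show distribution using the ansible_facts namespace
  ansible.builtin.debug:
    msg: "{{ ansible_facts['distribution'] }} {{ ansible_facts['distribution_major_version'] }}"

# The injected top-level form of the same fact is ansible_distribution.
# It only exists while fact injection (inject_facts_as_vars) is enabled.
```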
Registered variables¶
- name: Check app health
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
  register: healthcheck
- name: Show response body
  ansible.builtin.debug:
    var: healthcheck.json
Good variable hygiene¶
- keep names specific: nginx_worker_connections, not workers
- keep environment data in inventory, not inside roles
- avoid global variable soup
- use assert early for required inputs
- name: Assert required variables exist
  ansible.builtin.assert:
    that:
      - app_name is defined
      - app_port is defined
      - app_port | int > 0
    fail_msg: "Required app variables are missing or invalid"
8. Templates and Handlers¶
Templates make dynamic config files. Handlers keep restarts from becoming denial-of-service attacks against yourself.
Template example¶
- name: Render nginx vhost
  ansible.builtin.template:
    src: nginx-vhost.conf.j2
    dest: /etc/nginx/conf.d/{{ app_name }}.conf
    mode: '0644'
    validate: 'nginx -t -c %s'
  notify: reload nginx
Why validate matters¶
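Without validate, you can push a broken config and kill a service.
With validate, Ansible tests the candidate file before replacing the real one. Note that nginx -t -c %s treats the candidate as a complete main configuration, so it fits nginx.conf itself; a vhost fragment like the one above will usually fail that test and needs a wrapper config or a separate nginx -t task before the reload.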
Handler example¶
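A minimal sketch of a handler wired to the vhost template task, assuming the handler name reload nginx used by the notify lines in this guide. The play wrapper is illustrative.

```yaml
- name: Configure nginx
  hosts: web
  become: true
  tasks:
    - name: Render nginx vhost
      ansible.builtin.template:
        src: nginx-vhost.conf.j2
        dest: /etc/nginx/conf.d/{{ app_name }}.conf
        mode: '0644'
      notify: reload nginx

  handlers:
    # The handler name must match the notify string exactly.
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```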
Handler behavior that matters¶
- A handler runs only if something notified it.
- A handler runs once per host per play, even if multiple tasks notify it.
- Use meta: flush_handlers when you need the restart/reload now, not at the end.
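To illustrate forcing handlers mid-play, the sketch below flushes them before a follow-up check. It assumes a reload nginx handler is defined in the play and reuses the health URL from other examples in this guide; whether that endpoint is served through nginx is an assumption.

```yaml
- name: Push nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: '0644'
  notify: reload nginx

- name: Run any notified handlers now
  ansible.builtin.meta: flush_handlers

- name: Confirm the service answers after the reload
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
```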
Template habits¶
- keep logic light
- put complicated logic in vars/filter plugins, not giant Jinja spaghetti
- validate configs whenever the software supports it
- prefer reload over restart when safe
9. Roles and Collections¶
Roles package related tasks. Collections package content at a larger namespace level.
Role structure¶
roles/
  nginx/
    defaults/main.yml
    vars/main.yml
    tasks/main.yml
    handlers/main.yml
    templates/
    files/
    meta/main.yml
Create a role skeleton¶
ansible-galaxy role init myrole --init-path roles
Use a role¶
- name: Configure web servers
  hosts: web
  become: true
  roles:
    - role: nginx
      vars:
        nginx_listen_port: 8080
import_role vs include_role¶
| Feature | import_role | include_role |
|---|---|---|
| resolution | parse time | runtime |
| best for | static composition | conditional or looped inclusion |
| behavior | more predictable tags/parse-time structure | more flexible |
Collections¶
Collections are the distribution unit for modern Ansible content.
Examples:
- amazon.aws
- kubernetes.core
- ansible.posix
- community.general
Install collections¶
ansible-galaxy collection install -r collections/requirements.yml
Example collections/requirements.yml:
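A minimal sketch of such a file, listing the collections named above. The commented version range is a hypothetical placeholder; choose pins that match what you have tested.

```yaml
---
collections:
  - name: amazon.aws
  - name: kubernetes.core
  - name: ansible.posix
  - name: community.general
  # Optionally pin with a version range, for example:
  # - name: community.general
  #   version: ">=8.0.0,<9.0.0"   # hypothetical range
```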
Install roles¶
ansible-galaxy role install -r roles/requirements.yml
Dependency rule¶
Do not assume random collections happen to be installed on someone else's workstation. Declare them.
10. Secrets and Vault¶
Vault is for encrypting data you must keep with your automation. It is not a substitute for a full secret-management strategy, but it is far better than plaintext regret.
Basic Vault commands¶
ansible-vault create inventory/group_vars/prod/vault.yml
ansible-vault edit inventory/group_vars/prod/vault.yml
ansible-vault view inventory/group_vars/prod/vault.yml
ansible-vault rekey inventory/group_vars/prod/vault.yml
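Encrypt a string safely¶
ansible-vault encrypt_string --stdin-name 'db_password'
Type or paste the secret on stdin, then end input with Ctrl-D.
Do not do this:
ansible-vault encrypt_string 'supersecret' --name 'db_password'
That leaks the secret into shell history.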
Split vars from secrets¶
# vars.yml
app_db_user: appuser
app_db_password: "{{ vault_app_db_password }}"
# vault.yml (encrypted)
vault_app_db_password: supersecret
Multiple vault identities¶
ansible-playbook playbooks/site.yml \
--vault-id dev@prompt \
--vault-id prod@~/.ansible/prod.vault.pass
no_log¶
- name: Create DB user
  community.postgresql.postgresql_user:
    name: "{{ app_db_user }}"
    password: "{{ app_db_password }}"
  no_log: true
When Vault is not enough¶
For larger environments, prefer pulling secrets from a real secret store when practical:
- HashiCorp Vault
- AWS Secrets Manager
- cloud KMS-backed patterns
- controller credential integrations
Hard rules¶
- never put plaintext secrets in repo history
- never pass secrets directly on command lines
- never print secrets in debug or CI logs
- restrict vault password file permissions
- keep secret scope tight
11. Conditionals, Loops, and Error Handling¶
Conditionals¶
- name: Install SELinux helpers on RedHat
  ansible.builtin.package:
    name: policycoreutils-python-utils
    state: present
  when: ansible_facts['os_family'] == 'RedHat'
Loops¶
- name: Install baseline packages
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - curl
    - vim
    - git
Prefer loop unless a module has a better native bulk option.
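For package installs, the native bulk option usually means passing a list to name. The sketch below assumes the underlying package manager module (for example apt or dnf) accepts a list, which the common ones do.

```yaml
- name: Install baseline packages in one transaction
  ansible.builtin.package:
    name:
      - curl
      - vim
      - git
    state: present
```

This runs one module invocation per host instead of one per item, which is typically faster than the loop form.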
failed_when¶
- name: Run a health probe script
  ansible.builtin.command: /opt/app/bin/health-probe
  register: probe
  changed_when: false
  failed_when:
    - probe.rc != 0
    - "'warming up' not in probe.stdout"
changed_when¶
- name: Read current app version
  ansible.builtin.command: cat /opt/app/VERSION
  register: version_out
  changed_when: false
ignore_errors¶
Use sparingly. It does not ignore everything. It only ignores a task that ran and returned a failed result. It does not rescue you from undefined variables, syntax errors, connection failures, or broad stupidity.
Better pattern:
- name: Try old service name
  ansible.builtin.command: systemctl is-active legacy-app
  register: legacy_service
  changed_when: false
  failed_when: false
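The registered result can then drive an explicit decision instead of a swallowed error. A small sketch follows; the follow-up action of stopping the service is hypothetical.

```yaml
- name: Stop legacy service only if the probe found it active
  ansible.builtin.service:
    name: legacy-app
    state: stopped
  # systemctl is-active exits 0 only when the unit is active.
  when: legacy_service.rc == 0
```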
block / rescue / always¶
- block:
    - name: Deploy release
      ansible.builtin.command: /opt/app/bin/deploy {{ release_id }}
    - name: Check health
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200
  rescue:
    - name: Roll back release
      ansible.builtin.command: /opt/app/bin/rollback
  always:
    - name: Emit deploy summary
      ansible.builtin.debug:
        msg: "Deployment attempt finished"
Includes: static vs dynamic¶
| Feature | Static | Dynamic |
|---|---|---|
| tasks | import_tasks | include_tasks |
| role | import_role | include_role |
Use dynamic includes when the include itself depends on runtime conditions.
12. Rolling Changes and Zero-Downtime Thinking¶
Ansible does not give you zero downtime. Your design does. Ansible just enforces the choreography.
The basic rolling pattern¶
- name: Rolling deploy to web tier
  hosts: web
  become: true
  serial: 2
  max_fail_percentage: 0
  pre_tasks:
    - name: Drain node from load balancer
      ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
      delegate_to: localhost
      changed_when: true
  roles:
    - role: app_release
  post_tasks:
    - name: Wait for local health endpoint
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200
      register: local_health
      retries: 20
      delay: 3
      until: local_health.status == 200
    - name: Re-add node to load balancer
      ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
      delegate_to: localhost
      changed_when: true
The key knobs¶
| Setting | Why it matters |
|---|---|
| serial | blast radius per batch |
| max_fail_percentage | when to stop the rollout |
| any_errors_fatal | fail the whole play if one host fails |
| delegate_to | run control-plane/API tasks elsewhere |
| run_once | do something one time instead of per host |
| throttle | limit concurrency for one task |
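As a rough sketch of how a few of these knobs combine, the play below restarts a service in small batches, aborts everything on the first failure, and limits one task to a single host at a time. The service name myapp is hypothetical.

```yaml
- name: Restart app tier with tight failure handling
  hosts: web
  become: true
  serial: 2
  any_errors_fatal: true
  tasks:
    - name: Restart app service, at most one host at a time
      ansible.builtin.service:
        name: myapp        # hypothetical service name
        state: restarted
      throttle: 1
```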
run_once example¶
- name: Run DB migration once
  ansible.builtin.command: /opt/app/bin/migrate
  run_once: true
  delegate_to: db1
Deployment rule¶
Never mark a node healthy because the playbook finished. Mark it healthy because the service is healthy.
13. Debugging¶
When Ansible surprises you, the bug is usually in one of four places:
- inventory scope
- variable precedence
- module behavior
- your own assumptions, which were apparently written by a goblin
Commands that pay rent¶
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --host web1
ansible-playbook playbooks/site.yml --syntax-check
ansible-playbook playbooks/site.yml --check --diff
ansible-playbook playbooks/site.yml --list-hosts
ansible-playbook playbooks/site.yml --list-tags
ansible-playbook playbooks/site.yml --start-at-task 'Render nginx vhost'
ansible-playbook playbooks/site.yml -vvv
Debug variable state¶
- name: Debug important vars
  ansible.builtin.debug:
    msg:
      app_env: "{{ app_env | default('UNSET') }}"
      app_port: "{{ app_port | default('UNSET') }}"
      distribution: "{{ ansible_facts['distribution'] | default('UNKNOWN') }}"
Fast diagnosis patterns¶
| Symptom | Usual cause |
|---|---|
| task skipped unexpectedly | when evaluated false |
| variable has weird value | precedence collision |
| module says changed every run | non-idempotent task or bad changed_when |
| check mode lies | module lacks or only partially supports check mode |
| target fails with Python issue | interpreter discovery mismatch |
| collection/module not found | dependency not declared or installed |
Retry-file myth¶
Old playbooks often say:
ansible-playbook playbooks/site.yml --limit @playbooks/site.retry
That only works if retry files are enabled. Modern defaults usually have them disabled. Do not build your operating habits around stale .retry files.
Better recovery tools:
- --start-at-task
- --limit
- clear role/play separation
- resumable deployment logic
14. Performance at Scale¶
Performance fixes should preserve correctness. A fast wrong playbook is just a more efficient outage.
Big wins¶
1. Disable fact gathering when you do not need it¶
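A minimal sketch of a play that skips fact gathering because none of its tasks reference ansible_facts. The task itself is illustrative.

```yaml
- name: Push a static file without gathering facts
  hosts: web
  become: true
  gather_facts: false
  tasks:
    - name: Install message of the day
      ansible.builtin.copy:
        content: "Managed by Ansible\n"
        dest: /etc/motd
        mode: '0644'
```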
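2. Enable pipelining¶
Set pipelining = True under [ssh_connection] in ansible.cfg, as in the minimal config in section 3.
3. Use fact caching¶
Set gathering = smart together with fact_caching and fact_caching_connection in ansible.cfg, as in the same config.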
4. Avoid shell-heavy loops¶
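A hypothetical sketch of the kind of task this heading warns about: every item costs a separate shell invocation on every host.

```yaml
# Anti-pattern: one shell invocation per item, per host.
- name: Install packages one at a time through the shell
  ansible.builtin.shell: "apt-get install -y {{ item }}"
  loop:
    - curl
    - vim
    - git
```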
Do not do this across 500 hosts unless you enjoy self-harm by latency.
5. Raise forks carefully¶
More forks can help, until the controller, bastion, network, target services, or API rate limits slap you.
Strategy plugins¶
- linear: default, predictable, batch-safe
- free: hosts proceed independently; good for some workloads, dangerous for others
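Selecting a strategy is a play-level setting. A minimal sketch, assuming the tasks are independent per host so ordering across hosts does not matter:

```yaml
- name: Apply baseline independently on each host
  hosts: all
  become: true
  strategy: free
  tasks:
    - name: Ensure curl is present
      ansible.builtin.package:
        name: curl
        state: present
```

With free, a fast host can finish the whole play while a slow one is still on the first task, so avoid it for plays that rely on cross-host ordering.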
Async tasks¶
- name: Kick off long-running task
  ansible.builtin.command: /opt/app/bin/reindex
  async: 3600
  poll: 0
  register: job_result
- name: Check status later
  ansible.builtin.async_status:
    jid: "{{ job_result.ansible_job_id }}"
  register: job_status
  until: job_status.finished
  retries: 60
  delay: 10
Performance rule¶
Optimize in this order:
1. reduce unnecessary work
2. use proper modules
3. reduce fact gathering
4. enable pipelining/caching
5. increase concurrency
Not the reverse.
15. Testing and CI/CD¶
This is one of the biggest missing pieces in many Ansible guides. Production-safe Ansible is not just writing playbooks. It is proving they are sane before they touch systems.
Minimum validation stack¶
- syntax check
- lint
- dependency install
- Molecule or other test execution
- check mode where meaningful
- idempotence check
Syntax check¶
ansible-playbook playbooks/site.yml --syntax-check
Lint¶
ansible-lint
Ansible-lint catches a lot of bad habits:
- missing FQCNs
- unsafe patterns
- bad task names
- dependency issues
- risky latest behavior
- broad ignore_errors
Molecule¶
Molecule is the test framework for Ansible roles, playbooks, and collections.
Good Molecule suites usually verify:
- converge works
- second converge is idempotent
- resulting system behavior is correct
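The "resulting system behavior" check is often a small verify playbook. The sketch below assumes Molecule's default scenario layout, an Ansible-based verifier, and a systemd-based test instance running the nginx role; adjust the assertions to whatever your role actually promises.

```yaml
---
# molecule/default/verify.yml (path assumes the default scenario layout)
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Collect service state
      ansible.builtin.service_facts:

    - name: Assert nginx is running
      ansible.builtin.assert:
        that:
          - "'nginx.service' in ansible_facts.services"
          - ansible_facts.services['nginx.service'].state == 'running'
        fail_msg: "nginx is not running after converge"
```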
Dependency declaration matters¶
If lint or syntax-check cannot resolve your modules, your repository is underspecified.
Declare required roles and collections in requirements files and install them in CI.
Example CI flow¶
ansible-galaxy collection install -r collections/requirements.yml
ansible-galaxy role install -r roles/requirements.yml
ansible-playbook playbooks/site.yml --syntax-check
ansible-lint
molecule test
ansible-playbook playbooks/site.yml --check --diff
Idempotence test mindset¶
A good configuration playbook should usually look like this:
- first run: changed
- second run: mostly ok
If the second run still reports changes, investigate why.
16. Execution Environments, Builder, and Navigator¶
This is the biggest modernization in this revision.
What an EE is¶
An Execution Environment is a container image that acts as your Ansible control node.
It packages:
- ansible-core
- collections
- Python dependencies
- system packages needed by those collections
- config and runtime support
Why you should care¶
Without EEs:
- “works on my laptop” nonsense
- version drift across engineers and CI
- mystery Python packages
- AWX/controller mismatch
With EEs:
- repeatable controller runtime
- same dependencies in laptop/CI/controller
- cleaner onboarding
- fewer dependency ghost stories
Minimal EE definition¶
---
version: 3
images:
  base_image:
    name: docker.io/redhat/ubi9:latest
dependencies:
  ansible_core:
    package_pip: ansible-core
  ansible_runner:
    package_pip: ansible-runner
  galaxy: collections/requirements.yml
Build it¶
ansible-builder build --tag my-ee:latest
Run with navigator¶
ansible-navigator run playbooks/site.yml \
--execution-environment-image my-ee:latest \
--mode stdout
What ansible-navigator is¶
ansible-navigator is a CLI/TUI for running, reviewing, and troubleshooting Ansible content, especially with EEs.
Useful subcommands include:
- run
- doc
- config
- collections
- images
- exec
When to adopt EEs¶
| Scenario | Recommendation |
|---|---|
| solo learning on one laptop | optional |
| team with CI | strongly recommended |
| AWX / Automation Controller | effectively standard practice |
| content with nontrivial deps | use EEs |
17. Automation Controller and AWX¶
Current terminology:
- AWX = upstream open-source project
- Automation Controller = enterprise controller component inside Red Hat Ansible Automation Platform
Historical note: older material often says “Tower.” That is legacy naming.
Why use a controller¶
CLI is enough until you need:
- RBAC
- central credentials
- audit trails
- scheduling
- web/API launches
- inventory syncs
- standardized execution environments
What a controller actually gives you¶
| Capability | CLI | AWX / Controller |
|---|---|---|
| manual run | yes | yes |
| scheduling | crude/external | built in |
| RBAC | no | yes |
| credential management | ad hoc | structured |
| job history | scattered | centralized |
| inventory sync | manual | built in |
| standardized runtime | manual discipline | first-class with EEs |
Rule¶
Do not adopt a controller because YAML feels important and enterprise-shaped. Adopt it because you need shared execution, governance, or scale.
18. Ansible vs Terraform vs Helm¶
These tools overlap a little and fight a lot in bad designs.
Simple split¶
| Tool | Best at |
|---|---|
| Ansible | configuring and orchestrating existing systems |
| Terraform | provisioning infrastructure resources and dependency graphs |
| Helm | packaging/releasing Kubernetes app manifests |
Good boundary examples¶
- Terraform creates VMs, subnets, security groups, load balancers.
- Ansible configures the OS, deploys packages, templates configs, coordinates cutovers.
- Helm installs app stacks into Kubernetes.
- Ansible can orchestrate Helm, but Helm should still own the release content.
Anti-pattern¶
Do not make one tool impersonate all the others because you are trying to reduce the number of logos in your architecture diagram.
19. Common Production Patterns¶
Pattern: validate config before reload¶
- name: Push nginx config safely
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: 'nginx -t -c %s'
  notify: reload nginx
Pattern: one-time DB migration during rolling app deploy¶
- name: Run DB migration once from a safe node
  ansible.builtin.command: /opt/app/bin/migrate
  run_once: true
  delegate_to: app1
Pattern: drain, change, health-check, re-enable¶
- name: Drain from LB
  ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
  delegate_to: localhost
- name: Deploy config
  ansible.builtin.include_role:
    name: app_release
- name: Wait for health
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
  register: app_health
  retries: 20
  delay: 3
  until: app_health.status == 200
- name: Re-enable in LB
  ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
  delegate_to: localhost
Pattern: assert preconditions before touching anything¶
- name: Assert supported platform
  ansible.builtin.assert:
    that:
      - ansible_facts['os_family'] in ['Debian', 'RedHat']
      - app_port | int >= 1024
    fail_msg: "Unsupported target or invalid app_port"
Pattern: dynamic include by OS family¶
- name: Load OS-specific tasks
  ansible.builtin.include_tasks: "{{ ansible_facts['os_family'] }}.yml"
Pattern: explicit read-only probe task¶
- name: Read current config version
  ansible.builtin.command: cat /etc/myapp/version
  register: cfg_ver
  changed_when: false
  failed_when: cfg_ver.rc != 0
20. Footguns¶
1. Using shell for everything¶
You lose idempotence, portability, readability, and check-mode usefulness.
2. Hiding operator-tunable vars in vars/¶
You create precedence bugs that look supernatural.
3. Trusting check mode too much¶
Some modules support it fully, some partially, some barely at all.
4. Forgetting validate¶
Broken config + automatic restart = self-inflicted outage.
5. Blind ignore_errors: true¶
That is not resiliency. That is burying evidence.
6. Running DB migrations per host¶
Congratulations, you invented distributed regret.
7. Depending on random collections installed globally¶
Your repo is now a snowflake.
8. Not pinning or declaring dependencies¶
Sooner or later CI and laptops diverge.
9. Assuming retry files exist¶
Modern defaults usually say no.
10. Printing secrets in debug output¶
Logs are forever. Or long enough to ruin your week.
11. Treating controller runtime as informal¶
If dependencies matter, use EEs.
12. Not testing second-run idempotence¶
Drift loops hide here.
13. Using stale terminology and stale docs blindly¶
“Tower,” old callback plugins, old Galaxy commands, old Python floors - all classic fossil layers.
21. Dense Cheat Sheet¶
Core commands¶
ansible all -i inventory/hosts.yml -m ansible.builtin.ping
ansible-playbook playbooks/site.yml
ansible-playbook playbooks/site.yml --syntax-check
ansible-playbook playbooks/site.yml --check --diff
ansible-playbook playbooks/site.yml --list-hosts
ansible-playbook playbooks/site.yml --list-tags
ansible-playbook playbooks/site.yml --start-at-task 'TASK NAME'
ansible-playbook playbooks/site.yml -vvv
ansible-inventory -i inventory/hosts.yml --graph
ansible-config dump --only-changed
ansible-doc ansible.builtin.template
ansible-lint
molecule test
Galaxy / dependency commands¶
ansible-galaxy role init myrole --init-path roles
ansible-galaxy role install -r roles/requirements.yml
ansible-galaxy collection install -r collections/requirements.yml
ansible-galaxy collection list
Vault commands¶
ansible-vault create inventory/group_vars/prod/vault.yml
ansible-vault edit inventory/group_vars/prod/vault.yml
ansible-vault view inventory/group_vars/prod/vault.yml
ansible-vault encrypt_string --stdin-name db_password   # reads the secret from stdin; end with Ctrl-D
ansible-playbook playbooks/site.yml --vault-id prod@~/.ansible/prod.vault.pass
High-signal playbook patterns¶
# Safe config push
ansible.builtin.template + validate + notify
# Read-only probe
changed_when: false
# Controlled imperative task
ansible.builtin.command + creates/removes
# Rolling deploy
serial + health checks + delegate_to + run_once
# Safer facts usage
ansible_facts['distribution']
# Better modules
ansible.builtin.package / service / copy / template / uri / assert
# Better error handling
failed_when / changed_when / block-rescue
Good defaults to remember¶
- FQCNs everywhere
- defaults are for knobs, vars are for constants
- validate before reload/restart
- do not use shell unless you must
- do not assume check mode is perfect
- declare dependencies
- test second-run idempotence
- use EEs when the runtime matters
22. Reference Links¶
Official docs used to modernize this guide:
- Ansible installation guide: https://docs.ansible.com/projects/ansible-core/devel/installation_guide/intro_installation.html
- Ansible getting started: https://docs.ansible.com/ansible/devel/getting_started/index.html
- Interpreter discovery: https://docs.ansible.com/ansible/latest/reference_appendices/interpreter_discovery.html
- Error handling: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
- Check mode and diff mode: https://docs.ansible.com/projects/ansible/latest/playbook_guide/playbooks_checkmode.html
- ansible.builtin.command: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html
- ansible-galaxy CLI: https://docs.ansible.com/projects/ansible/latest/cli/ansible-galaxy.html
- Collections install guide: https://docs.ansible.com/projects/ansible/latest/collections_guide/collections_installing.html
- Default callback plugin and YAML result formatting: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/default_callback.html
- Config settings reference: https://docs.ansible.com/projects/ansible/latest/reference_appendices/config.html
- Windows host management and SSH: https://docs.ansible.com/ansible/latest/os_guide/intro_windows.html
- Windows SSH: https://docs.ansible.com/projects/ansible-core/devel/os_guide/windows_ssh.html
- Ansible Builder: https://docs.ansible.com/projects/builder/
- Execution Environments getting started: https://docs.ansible.com/en/latest/getting_started_ee/index.html
- Running an EE: https://docs.ansible.com/ansible/latest/getting_started_ee/run_execution_environment.html
- Ansible Navigator: https://docs.ansible.com/projects/navigator/
- Ansible Lint: https://docs.ansible.com/projects/lint/
- Molecule: https://docs.ansible.com/projects/molecule/
- Red Hat Ansible Automation Platform docs: https://docs.redhat.com/en/documentation/red_hat_ansible_automation_platform/latest
Final take¶
The strongest mental model for Ansible is this:
Ansible is a convergence engine plus orchestration glue.
Use modules for state, inventory for truth, roles for reuse, handlers for delayed side effects, validations for safety, and execution environments for reproducible control-node runtime.
When people get hurt with Ansible, it is usually not because Ansible is mysterious. It is because they wrote shell scripts in YAML, ignored precedence, skipped validation, hid dependencies, or treated controller runtime as folklore.
Do the opposite and Ansible stays boring - which, in operations, is the highest compliment.