Ansible - The Complete Guide (Revised, Current, Production-Focused)¶

Audience: Linux/sysadmin, platform, cloud, and operations engineers
Scope: Ansible fundamentals through production operating patterns
Style: opinionated, practical, current-era ansible-core ecosystem
Goal: get from "I know the buzzwords" to "I can ship safe automation without bricking fleets"

What changed in this revision¶

This rewrite keeps the strong operator-grade material from the original and removes or fixes the parts that would age badly or mislead people.

Fixed¶

outdated ansible-galaxy command examples
outdated YAML callback guidance
incorrect retry-file assumption
overly absolute statements about ignore_errors and check mode
dated Tower-only framing
unsafe/contradictory Vault examples
weak trivia, flashcards, and unsourced case-study filler

Added¶

ansible vs ansible-core
Execution Environments (EE)
ansible-builder
ansible-navigator
ansible-lint
FQCN guidance
modern CI/CD validation flow
safer examples using ansible_facts instead of injected fact variables

Table of Contents¶

What Ansible Is
Pick the Right Package and Install It
Project Layout That Does Not Rot
Inventory
Playbooks
Modules: Declarative First, Imperative Last
Variables, Facts, and Precedence
Templates and Handlers
Roles and Collections
Secrets and Vault
Conditionals, Loops, and Error Handling
Rolling Changes and Zero-Downtime Thinking
Debugging
Performance at Scale
Testing and CI/CD
Execution Environments, Builder, and Navigator
Automation Controller and AWX
Ansible vs Terraform vs Helm
Common Production Patterns
Footguns
Dense Cheat Sheet
Reference Links

1. What Ansible Is¶

Ansible is agentless automation. You run automation from a control node, Ansible connects to targets, executes modules, and converges them toward the desired state.

The mental model¶

Control node
  |
  | SSH / WinRM / PSRP / API / network transport
  v
Managed node or device
  |
  v
Module runs -> returns changed/ok/failed + structured result

The five things that matter¶

Principle	Meaning
Agentless	Usually nothing to install on Linux targets beyond Python and whatever the module itself needs
Idempotent	Re-running should not keep changing things
Declarative	Ask for a state, not a sequence of shell commands
Push-based	You initiate changes from a control node
Extensible	Core + collections + plugins + inventories + callbacks

Where Ansible shines¶

fleet configuration
package/service/file management
OS and middleware standardization
rolling deployments
orchestration around APIs, network devices, cloud, and VMs
“glue” automation across systems

Where Ansible is not magic¶

It is not a general replacement for software engineering.
It is not the best tool for full-blown resource graph provisioning.
It is not fast if you write everything as shell and re-gather facts every five seconds.
It does not make dangerous ideas safe just because they are YAML.

2. Pick the Right Package and Install It¶

Modern Ansible is not one monolith anymore.

`ansible` vs `ansible-core`¶

Package	What you get	Use when
`ansible-core`	engine, CLI, builtin content, plugin framework	minimal, controlled environments
`ansible`	`ansible-core` plus a curated set of community collections	easier all-in-one workstation install

Practical recommendation¶

Use ansible-core when you want explicit dependencies and clean reproducibility.
Use ansible when you want a batteries-included learning/workstation setup.
For team-scale reproducibility, move toward Execution Environments.

Install examples¶

# Preferred for isolated workstation installs
pipx install ansible-core

# Or the broader package
pipx install ansible

# Verify
ansible --version
ansible-config dump --only-changed

Control node and target reality¶

Control node support changes over time. Match your documentation to your installed ansible-core version.
POSIX targets usually need Python for most modules.
Some targets are exceptions. Network devices often do not need remote Python because modules use other transports.
Windows is a different world: WinRM and PSRP are the common transports, and current Ansible also supports SSH on newer Windows/OpenSSH combinations.

First ad-hoc commands¶

ansible all -i inventory/hosts.yml -m ansible.builtin.ping
ansible web -i inventory/hosts.yml -m ansible.builtin.command -a 'uptime'
ansible web -i inventory/hosts.yml -b -m ansible.builtin.package -a 'name=nginx state=present'
ansible web -i inventory/hosts.yml -b -m ansible.builtin.service -a 'name=nginx state=started enabled=true'

Ad-hoc rule of thumb¶

Ad-hoc commands are for:

quick inspection
one-off safe changes
triage

If you might ever run it twice, it probably wants to be a playbook.

3. Project Layout That Does Not Rot¶

A sane layout buys you more than clever YAML ever will.

ansible/
├── ansible.cfg
├── inventory/
│   ├── hosts.yml
│   ├── group_vars/
│   └── host_vars/
├── playbooks/
│   ├── site.yml
│   ├── web.yml
│   └── db.yml
├── roles/
│   ├── requirements.yml
│   ├── common/
│   └── nginx/
├── collections/
│   └── requirements.yml
├── files/
├── templates/
├── molecule/
├── .ansible-lint
└── README.md

Minimal `ansible.cfg`¶

[defaults]
inventory = inventory/hosts.yml
stdout_callback = default
callback_result_format = yaml
bin_ansible_callbacks = True
interpreter_python = auto_silent
host_key_checking = True
retry_files_enabled = False
forks = 20
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .cache/facts
roles_path = roles

[ssh_connection]
pipelining = True

Why this config¶

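callback_result_format = yaml (ansible-core 2.13 and later) makes the default callback print results as YAML, replacing the old stdout_callback = yaml workaround.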
retry_files_enabled = False matches modern defaults and avoids stale .retry habits.
interpreter_python = auto_silent uses interpreter discovery without noisy warnings.
pipelining and fact caching improve speed.

4. Inventory¶

Inventory answers two questions:

Which hosts exist?
What do we know about them?

Static inventory example¶

all:
  children:
    web:
      hosts:
        web1:
          ansible_host: 10.0.10.11
        web2:
          ansible_host: 10.0.10.12
      vars:
        app_env: production
        http_port: 8080
    db:
      hosts:
        db1:
          ansible_host: 10.0.20.11

Useful inventory variables¶

Variable	Purpose
`ansible_host`	actual address to connect to
`ansible_user`	remote username
`ansible_port`	non-default SSH port
`ansible_connection`	ssh, winrm, psrp, local, network_cli, etc.
`ansible_python_interpreter`	force a target Python path
`ansible_become`	privilege escalation default

Dynamic inventory¶

Use inventory plugins when the source of truth is elsewhere.

Common examples:

AWS EC2
VMware
OpenStack
Kubernetes
constructed inventories from metadata/tags

Example:

# the file name must end in aws_ec2.yml or aws_ec2.yaml, e.g. inventory/prod.aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
filters:
  instance-state-name: running

Good inventory habits¶

Put environment-specific data in group_vars/ and host_vars/.
Keep inventory names stable even if IPs change.
Group by function and environment.
Do not bury secrets in inventory.
Use ansible-inventory --graph and --list constantly.

5. Playbooks¶

A playbook is one or more plays. A play maps hosts to tasks.

Minimal playbook¶

---
- name: Configure web servers
  hosts: web
  become: true
  gather_facts: true

  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is enabled and started
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

The important play-level knobs¶

Key	Why it matters
`hosts`	target scope
`become`	privilege escalation
`gather_facts`	speed vs convenience
`serial`	rolling change size
`vars`	play-scoped variables
`pre_tasks` / `post_tasks`	guardrails and cleanup
`handlers`	delayed actions triggered by changes
`strategy`	lockstep or freer execution
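
A sketch of one play that uses most of these knobs together. The motd_text variable and the handler are hypothetical and exist only to show where each key sits; this is not a recommended production play.

```yaml
---
- name: Patch web tier in small batches
  hosts: web
  become: true
  gather_facts: false        # nothing below reads facts, so skip the round trip
  serial: 1                  # one host per batch
  strategy: linear           # the default, stated here for clarity
  vars:
    motd_text: "Managed by Ansible"   # hypothetical play-scoped variable

  pre_tasks:
    - name: Confirm the host is reachable before changing anything
      ansible.builtin.ping:

  tasks:
    - name: Write the message of the day
      ansible.builtin.copy:
        content: "{{ motd_text }}\n"
        dest: /etc/motd
        mode: '0644'
      notify: Log motd change

  post_tasks:
    - name: Confirm the file exists
      ansible.builtin.stat:
        path: /etc/motd
      register: motd_stat
      failed_when: not motd_stat.stat.exists

  handlers:
    - name: Log motd change
      ansible.builtin.debug:
        msg: "motd updated on {{ inventory_hostname }}"
```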

Tagging¶

tasks:
  - name: Install packages
    ansible.builtin.package:
      name: nginx
      state: present
    tags: [packages]

  - name: Push config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: restart nginx
    tags: [config]

Run only what you need:

ansible-playbook playbooks/web.yml --tags config
ansible-playbook playbooks/web.yml --skip-tags packages
ansible-playbook playbooks/web.yml --list-tags
ansible-playbook playbooks/web.yml --list-tasks

6. Modules: Declarative First, Imperative Last¶

The single biggest quality divider in Ansible code is this:

Use a purpose-built module when one exists. Reach for command or shell only when you must.

Good¶

- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present

- name: Drop config file
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: '0644'
  notify: restart nginx

- name: Ensure service is up
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

Bad unless you truly need it¶

- name: Do everything with shell because reasons
  ansible.builtin.shell: |
    apt-get update
    apt-get install -y nginx
    systemctl enable --now nginx

That second example is how YAML cosplay turns into pager duty.

`command` vs `shell`¶

Module	Use when	Avoid when
`ansible.builtin.command`	run a simple command safely	you need pipes, redirects, shell expansion
`ansible.builtin.shell`	you truly need shell features	a normal module exists

Make imperative tasks less stupid¶

- name: Initialize app database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized

That gives command a guardrail and partial check-mode usefulness.
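
Two more guardrails in the same spirit, as a hedged sketch: removes is the mirror image of creates, and changed_when lets a command report change only when it actually did something. The script paths, the --apply flag, and the "applied" output string are all assumptions about a hypothetical tool.

```yaml
- name: Run the legacy uninstaller only while it is still present
  ansible.builtin.command:
    cmd: /opt/legacy/bin/uninstall      # hypothetical script
    removes: /opt/legacy/bin/uninstall  # task is skipped once this path is gone

- name: Apply schema changes and report change only when something was applied
  ansible.builtin.command:
    cmd: /opt/app/bin/migrate --apply   # hypothetical CLI and flag
  register: migrate_out
  changed_when: "'applied' in migrate_out.stdout"   # assumes the tool prints "applied"
```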

Modules worth knowing cold¶

Category	Modules
packages	`package`, `apt`, `dnf`, `pip`, `yum` (legacy; prefer `dnf`)
services	`service`, `systemd_service`
files	`copy`, `template`, `file`, `lineinfile`, `replace`, `assemble`
users	`user`, `group`, `authorized_key` (from `ansible.posix`)
commands	`command`, `shell`, `script`, `raw`
control flow	`include_tasks`, `import_tasks`, `include_role`, `import_role`, `meta`
validation	`assert`, `wait_for`, `uri`, `stat`, `slurp`
orchestration	`set_fact`, `add_host`, `group_by` (`delegate_to` is a task keyword, not a module)

Use FQCNs¶

Prefer:

ansible.builtin.package:

not:

package:

Why:

clearer provenance
fewer name collisions
better linting
easier documentation lookup

7. Variables, Facts, and Precedence¶

This is where many playbooks become haunted.

The practical precedence rule¶

There is a long official precedence chain. In real life, remember this order:

Usually lower -> higher	Typical use
role defaults	safe knobs meant to be overridden
inventory vars / group_vars / host_vars	environment-specific data
play vars	local overrides for one play
role vars (`vars/main.yml`)	internal role constants; these outrank inventory and play vars
task vars / include vars	narrow-scope overrides
registered vars / set_fact	runtime data
extra vars (`-e`)	explicit operator override; wins hard
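
Precedence is easier to trust once you have watched it happen. A throwaway playbook like the following sketch (file name and values are arbitrary) shows a task var beating a play var.

```yaml
---
# precedence-demo.yml (hypothetical file name)
- name: Show which value wins
  hosts: localhost
  gather_facts: false
  vars:
    app_port: 8080            # play var

  tasks:
    - name: Print the play-level value
      ansible.builtin.debug:
        var: app_port

    - name: Print a task-level override
      ansible.builtin.debug:
        var: app_port
      vars:
        app_port: 9090        # task var beats the play var for this task only
```

Run it once as is, then again with -e app_port=7070. On the second run both tasks report the extra-var value, because extra vars outrank everything else in the chain.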

`defaults/` vs `vars/` in roles¶

Path	Use for	Avoid putting here
`defaults/main.yml`	values users should tune	hard constants
`vars/main.yml`	truly internal constants you do not expect callers to override	ports, versions, feature flags, env-specific values

If operators might need to change it, it belongs in defaults/, not vars/.
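
A minimal sketch of the split for an nginx role. The two YAML documents below stand for two separate files, and the variable names and values are illustrative, not taken from any published role.

```yaml
# roles/nginx/defaults/main.yml: knobs callers are expected to override
nginx_listen_port: 80
nginx_worker_connections: 1024
---
# roles/nginx/vars/main.yml: internal constants the role depends on
nginx_service_name: nginx
nginx_conf_dir: /etc/nginx
```

The test is simple: if setting the value from group_vars should work, it lives in defaults. Values in vars/main.yml silently beat inventory, which is exactly the surprise you want to avoid for anything tunable.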

Facts¶

Facts are discovered target data.

- name: Show OS family
  ansible.builtin.debug:
    msg: "{{ ansible_facts['os_family'] }}"

Use ansible_facts[...] explicitly. It is clearer and more future-proof than relying on injected top-level fact variables.

Registered variables¶

- name: Check app health
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
  register: healthcheck

- name: Show response body
  ansible.builtin.debug:
    var: healthcheck.json

Good variable hygiene¶

keep names specific: nginx_worker_connections, not workers
keep environment data in inventory, not inside roles
avoid global variable soup
use assert early for required inputs

- name: Assert required variables exist
  ansible.builtin.assert:
    that:
      - app_name is defined
      - app_port is defined
      - app_port | int > 0
    fail_msg: "Required app variables are missing or invalid"

8. Templates and Handlers¶

Templates make dynamic config files. Handlers keep restarts from becoming denial-of-service attacks against yourself.

Template example¶

- name: Render nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: '0644'
    # -c expects a complete main config, so validate the main file, not a conf.d fragment
    validate: 'nginx -t -c %s'
  notify: reload nginx

Why `validate` matters¶

Without validate, you can push a broken config and kill a service.

With validate, Ansible tests the candidate file before replacing the real one.
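
The same guardrail works for any software that ships a config checker. A sketch for a sudoers drop-in, assuming visudo lives at /usr/sbin/visudo; the group name, file name, and rule are hypothetical.

```yaml
- name: Install a sudoers drop-in only if visudo accepts it
  ansible.builtin.copy:
    content: "%ops ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nginx\n"
    dest: /etc/sudoers.d/ops-nginx      # hypothetical file and group
    owner: root
    group: root
    mode: '0440'
    validate: /usr/sbin/visudo -cf %s   # %s is replaced with the temporary file path
```

A broken sudoers file can lock you out of privilege escalation entirely, so this is one of the places where validate pays for itself on the first mistake.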

Handler example¶

handlers:
  - name: reload nginx
    ansible.builtin.service:
      name: nginx
      state: reloaded

Handler behavior that matters¶

A handler runs only if something notified it.
A handler runs at most once per host each time handlers are flushed (by default at the end of each task section of the play), even if multiple tasks notify it.
Use meta: flush_handlers when you need the restart/reload now, not at the end.

- name: Force handler now
  ansible.builtin.meta: flush_handlers
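
A sketch of where flush_handlers earns its keep: the health check below must see the reloaded service, so the pending handler has to run mid-play rather than at the end. The template name and health URL are borrowed from other examples in this guide and are assumptions about your application.

```yaml
- name: Apply config and verify it before moving on
  hosts: web
  become: true

  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded

  tasks:
    - name: Render nginx config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: reload nginx

    - name: Run pending handlers now so the check below sees the new config
      ansible.builtin.meta: flush_handlers

    - name: Confirm the service answers after the reload
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health   # assumed health endpoint
        status_code: 200
```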

Template habits¶

keep logic light
put complicated logic in vars/filter plugins, not giant Jinja spaghetti
validate configs whenever the software supports it
prefer reload over restart when safe

9. Roles and Collections¶

Roles package related tasks. Collections package content at a larger namespace level.

Role structure¶

roles/
  nginx/
    defaults/main.yml
    vars/main.yml
    tasks/main.yml
    handlers/main.yml
    templates/
    files/
    meta/main.yml

Create a role skeleton¶

ansible-galaxy role init nginx --init-path roles

Use a role¶

- name: Configure web servers
  hosts: web
  become: true
  roles:
    - role: nginx
      vars:
        nginx_listen_port: 8080

`import_role` vs `include_role`¶

Feature	`import_role`	`include_role`
resolution	parse time	runtime
best for	static composition	conditional or looped inclusion
behavior	more predictable tags/parse-time structure	more flexible
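
The difference is easiest to see side by side. In this sketch the install_nginx toggle and the extra_roles list are hypothetical variables; common and nginx are the roles from the layout shown earlier.

```yaml
- name: Compose roles statically and dynamically
  hosts: web
  become: true
  tasks:
    - name: Always apply the baseline (resolved at parse time)
      ansible.builtin.import_role:
        name: common

    - name: Apply nginx only where it is requested (resolved at runtime)
      ansible.builtin.include_role:
        name: nginx
      when: install_nginx | default(false) | bool   # hypothetical toggle

    - name: Apply one role per listed component
      ansible.builtin.include_role:
        name: "{{ component }}"
      loop: "{{ extra_roles | default([]) }}"        # hypothetical list of role names
      loop_control:
        loop_var: component
```

Looping is only possible with include_role. import_role is expanded before the play runs, so there is nothing left to loop over at runtime.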

Collections¶

Collections are the distribution unit for modern Ansible content.

Examples:

amazon.aws
kubernetes.core
ansible.posix
community.general

Install collections¶

ansible-galaxy collection install -r collections/requirements.yml

Example collections/requirements.yml:

collections:
  - name: ansible.posix
  - name: community.general
  - name: amazon.aws
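
Declaring a collection without a version still lets laptops and CI drift apart over time. A hedged variant with version ranges is sketched below; the specific ranges are illustrative, so pick the ones your content was actually tested against.

```yaml
# collections/requirements.yml with version ranges (the ranges are illustrative)
collections:
  - name: ansible.posix
    version: ">=1.5.0,<2.0.0"
  - name: community.general
    version: ">=9.0.0,<10.0.0"
  - name: amazon.aws
    version: ">=8.0.0,<9.0.0"
```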

Install roles¶

ansible-galaxy role install -r roles/requirements.yml

Dependency rule¶

Do not assume random collections happen to be installed on someone else's workstation. Declare them.

10. Secrets and Vault¶

Vault is for encrypting data you must keep with your automation. It is not a substitute for a full secret-management strategy, but it is far better than plaintext regret.

Basic Vault commands¶

ansible-vault create inventory/group_vars/prod/vault.yml
ansible-vault edit inventory/group_vars/prod/vault.yml
ansible-vault view inventory/group_vars/prod/vault.yml
ansible-vault rekey inventory/group_vars/prod/vault.yml

Encrypt a string safely¶

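# Reads the secret from stdin: type it, then press Ctrl-D twice without pressing Enter
# (Enter would add a newline to the encrypted value)
ansible-vault encrypt_string --stdin-name db_password

Do not do this:

ansible-vault encrypt_string 'supersecret' --name db_password
echo -n 'supersecret' | ansible-vault encrypt_string --stdin-name db_password

Both put the secret in a typed command, which leaks it into shell history.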

Split vars from secrets¶

inventory/
  group_vars/
    prod/
      vars.yml
      vault.yml

# vars.yml
app_db_user: appuser
app_db_password: "{{ vault_app_db_password }}"

# vault.yml (encrypted)
vault_app_db_password: supersecret

Multiple vault identities¶

ansible-playbook playbooks/site.yml \
  --vault-id dev@prompt \
  --vault-id prod@~/.ansible/prod.vault.pass

`no_log`¶

- name: Create DB user
  community.postgresql.postgresql_user:
    name: "{{ app_db_user }}"
    password: "{{ app_db_password }}"
  no_log: true

When Vault is not enough¶

For larger environments, prefer pulling secrets from a real secret store when practical:

HashiCorp Vault
AWS Secrets Manager
cloud KMS-backed patterns
controller credential integrations
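
The store-specific lookups vary by backend, so here is the lowest common denominator as a sketch: the runner (CI job or controller credential) injects the secret as an environment variable on the control node, and the play reads it with the builtin env lookup. APP_DB_PASSWORD is a hypothetical variable name.

```yaml
- name: Read a secret injected by the runner
  hosts: db
  gather_facts: false
  vars:
    # APP_DB_PASSWORD is a hypothetical variable set by CI or a controller credential
    app_db_password: "{{ lookup('ansible.builtin.env', 'APP_DB_PASSWORD') }}"

  tasks:
    - name: Refuse to continue without the secret
      ansible.builtin.assert:
        that:
          - app_db_password | length > 0
        fail_msg: "APP_DB_PASSWORD is not set on the control node"
        quiet: true
```

The assertion checks only that the value is non-empty and never prints it. Any later task that passes the value to a module should still carry no_log: true.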

Hard rules¶

never put plaintext secrets in repo history
never pass secrets directly on command lines
never print secrets in debug or CI logs
restrict vault password file permissions
keep secret scope tight

11. Conditionals, Loops, and Error Handling¶

Conditionals¶

- name: Install SELinux helpers on RedHat
  ansible.builtin.package:
    name: policycoreutils-python-utils
    state: present
  when: ansible_facts['os_family'] == 'RedHat'

Loops¶

- name: Install baseline packages
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - curl
    - vim
    - git

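Prefer loop over the legacy with_* keywords. The package example above is for illustration only: package modules accept a list in name, so passing the whole list in one task is faster than one module call per item.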
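
A sketch of both cases: a native list where the module supports one, and a real loop where each item needs its own module call. The account names are hypothetical.

```yaml
- name: Install baseline packages in one transaction
  ansible.builtin.package:
    name:
      - curl
      - vim
      - git
    state: present

- name: Create service accounts (a per-item loop is the right tool here)
  ansible.builtin.user:
    name: "{{ item.name }}"
    shell: "{{ item.shell }}"
  loop:
    - { name: appuser, shell: /bin/bash }          # hypothetical accounts
    - { name: metrics, shell: /usr/sbin/nologin }
  loop_control:
    label: "{{ item.name }}"    # keep the per-item log line short
```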

`failed_when`¶

- name: Run a health probe script
  ansible.builtin.command: /opt/app/bin/health-probe
  register: probe
  changed_when: false
  # list entries are ANDed: fail only if rc is non-zero and the app is not just warming up
  failed_when:
    - probe.rc != 0
    - "'warming up' not in probe.stdout"

`changed_when`¶

- name: Read current app version
  ansible.builtin.command: cat /opt/app/VERSION
  register: version_out
  changed_when: false

`ignore_errors`¶

Use sparingly. It does not ignore everything. It only ignores a task that ran and returned a failed result. It does not rescue you from undefined variables, syntax errors, connection failures, or broad stupidity.

Better pattern:

- name: Try old service name
  ansible.builtin.command: systemctl is-active legacy-app
  register: legacy_service
  changed_when: false
  failed_when: false
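
The point of failed_when: false is that you take over the decision yourself, so the registered result has to be used somewhere. A minimal sketch of the follow-up, reusing the legacy-app service name from the task above:

```yaml
- name: Probe the old service name without failing the play
  ansible.builtin.command: systemctl is-active legacy-app
  register: legacy_service
  changed_when: false
  failed_when: false

- name: Stop the legacy service only when the probe found it active
  ansible.builtin.service:
    name: legacy-app
    state: stopped
  when: legacy_service.rc == 0    # is-active exits 0 only for an active unit
```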

`block` / `rescue` / `always`¶

- name: Deploy with automatic rollback
  block:
    - name: Deploy release
      ansible.builtin.command: /opt/app/bin/deploy {{ release_id }}

    - name: Check health
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200

  rescue:
    - name: Roll back release
      ansible.builtin.command: /opt/app/bin/rollback

  always:
    - name: Emit deploy summary
      ansible.builtin.debug:
        msg: "Deployment attempt finished"
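
One thing to know about rescue: if the rescue tasks succeed, Ansible clears the failure and the host carries on as if nothing happened. When a rolled-back deploy should still count as a failed run, say so explicitly. A sketch using the ansible_failed_task and ansible_failed_result variables that Ansible sets inside rescue; the deploy and rollback scripts are the hypothetical ones from the example above.

```yaml
- name: Deploy with rollback and an honest exit status
  block:
    - name: Deploy release
      ansible.builtin.command: /opt/app/bin/deploy {{ release_id }}

  rescue:
    - name: Report what failed
      ansible.builtin.debug:
        msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message') }}"

    - name: Roll back release
      ansible.builtin.command: /opt/app/bin/rollback

    - name: Fail the host anyway so the run is not reported as a success
      ansible.builtin.fail:
        msg: "Deploy of {{ release_id }} failed and was rolled back"
```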

Includes: static vs dynamic¶

Feature	Static	Dynamic
tasks	`import_tasks`	`include_tasks`
role	`import_role`	`include_role`

Use dynamic includes when the include itself depends on runtime conditions.
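
Tags are where the static/dynamic split bites hardest. With an import, tags on the statement are inherited by every imported task. With an include, they apply only to the include itself unless you pass them through apply. A sketch, with hypothetical task file names:

```yaml
- name: Static import (tags are copied onto every imported task)
  ansible.builtin.import_tasks: hardening.yml     # hypothetical task file
  tags: [hardening]

- name: Dynamic include chosen at runtime
  ansible.builtin.include_tasks:
    file: "setup-{{ ansible_facts['os_family'] | lower }}.yml"   # hypothetical files
    apply:
      tags: [setup]      # tags the tasks inside the included file
  tags: [setup]          # tags the include statement itself
```

Both tag entries are needed on the dynamic include: without the outer one, --tags setup never reaches the include; without apply, the included tasks are loaded but not selected.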

12. Rolling Changes and Zero-Downtime Thinking¶

Ansible does not give you zero downtime. Your design does. Ansible just enforces the choreography.

The basic rolling pattern¶

- name: Rolling deploy to web tier
  hosts: web
  become: true
  serial: 2
  max_fail_percentage: 0

  pre_tasks:
    - name: Drain node from load balancer
      ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
      delegate_to: localhost
      changed_when: true

  roles:
    - role: app_release

  post_tasks:
    - name: Wait for local health endpoint
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200
      register: local_health
      retries: 20
      delay: 3
      until: local_health.status == 200

    - name: Re-add node to load balancer
      ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
      delegate_to: localhost
      changed_when: true

The key knobs¶

Setting	Why it matters
`serial`	blast radius per batch
`max_fail_percentage`	when to stop the rollout
`any_errors_fatal`	fail the whole play if one host fails
`delegate_to`	run control-plane/API tasks elsewhere
`run_once`	do something one time instead of per host (once per batch when combined with `serial`)
`throttle`	limit concurrency for one task
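
serial also accepts a list, so a rollout can start with a canary and widen from there, and throttle caps concurrency for a single task inside a batch. A sketch, with a hypothetical service name:

```yaml
- name: Widen the rollout as confidence grows
  hosts: web
  become: true
  serial:
    - 1          # canary host first
    - "25%"      # then a quarter of the play's hosts
    - "100%"     # then everything that is left
  max_fail_percentage: 0

  tasks:
    - name: Restart the app, at most two hosts at a time within a batch
      ansible.builtin.service:
        name: myapp          # hypothetical service name
        state: restarted
      throttle: 2
```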

`run_once` example¶

- name: Run DB migration once
  ansible.builtin.command: /opt/app/bin/migrate
  run_once: true
  delegate_to: db1
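
Be careful combining run_once with serial: it is evaluated per batch, so a rolling play would run the migration once for every batch. One way around that, sketched here, is to give the migration its own play ahead of the rolling one. The play layout is an assumption; db1, web, and app_release come from the earlier examples.

```yaml
- name: Migrate the database once, before any rolling batch starts
  hosts: db1
  become: true
  any_errors_fatal: true     # a failed migration must stop the whole run
  tasks:
    - name: Run DB migration
      ansible.builtin.command: /opt/app/bin/migrate

- name: Rolling deploy to web tier
  hosts: web
  become: true
  serial: 2
  roles:
    - role: app_release
```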

Deployment rule¶

Never mark a node healthy because the playbook finished. Mark it healthy because the service is healthy.

13. Debugging¶

When Ansible surprises you, the bug is usually in one of four places:

inventory scope
variable precedence
module behavior
your own assumptions, which were apparently written by a goblin

Commands that pay rent¶

ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --host web1
ansible-playbook playbooks/site.yml --syntax-check
ansible-playbook playbooks/site.yml --check --diff
ansible-playbook playbooks/site.yml --list-hosts
ansible-playbook playbooks/site.yml --list-tags
ansible-playbook playbooks/site.yml --start-at-task 'Render nginx vhost'
ansible-playbook playbooks/site.yml -vvv

Debug variable state¶

- name: Debug important vars
  ansible.builtin.debug:
    msg:
      app_env: "{{ app_env | default('UNSET') }}"
      app_port: "{{ app_port | default('UNSET') }}"
      distribution: "{{ ansible_facts['distribution'] | default('UNKNOWN') }}"

Fast diagnosis patterns¶

Symptom	Usual cause
task skipped unexpectedly	`when` evaluated false
variable has weird value	precedence collision
module says changed every run	non-idempotent task or bad `changed_when`
check mode lies	module lacks or only partially supports check mode
target fails with Python issue	interpreter discovery mismatch
collection/module not found	dependency not declared or installed

Retry-file myth¶

Old playbooks often say:

ansible-playbook site.yml --limit @site.retry

That only works if retry files are enabled. Modern defaults usually have them disabled. Do not build your operating habits around stale .retry files.

Better recovery tools:

--start-at-task
--limit
clear role/play separation
resumable deployment logic

14. Performance at Scale¶

Performance fixes should preserve correctness. A fast wrong playbook is just a more efficient outage.

Big wins¶

1. Disable fact gathering when you do not need it¶

- hosts: localhost
  gather_facts: false

2. Enable pipelining¶

[ssh_connection]
pipelining = True

3. Use fact caching¶

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .cache/facts

4. Avoid shell-heavy loops¶

Do not loop a shell task over a long item list across 500 hosts unless you enjoy self-harm by latency. Every iteration is a separate module execution on every host.

5. Raise forks carefully¶

[defaults]
forks = 30

More forks can help, until the controller, bastion, network, target services, or API rate limits slap you.

Strategy plugins¶

linear: default, predictable, batch-safe
free: hosts proceed independently; good for some workloads, dangerous for others

Async tasks¶

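- name: Kick off long-running task
  ansible.builtin.command: /opt/app/bin/reindex
  async: 3600
  poll: 0
  register: job_result    # holds the job id needed by async_status

- name: Check status later
  ansible.builtin.async_status:
    jid: "{{ job_result.ansible_job_id }}"
  register: job_status
  until: job_status.finished
  retries: 120            # 120 x 30s covers the 3600s async window
  delay: 30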

Performance rule¶

Optimize in this order:

1. reduce unnecessary work
2. use proper modules
3. reduce fact gathering
4. enable pipelining/caching
5. increase concurrency

Not the reverse.

15. Testing and CI/CD¶

This is one of the biggest missing pieces in many Ansible guides. Production-safe Ansible is not just writing playbooks. It is proving they are sane before they touch systems.

Minimum validation stack¶

syntax check
lint
dependency install
Molecule or other test execution
check mode where meaningful
idempotence check

Syntax check¶

ansible-playbook playbooks/site.yml --syntax-check

Lint¶

ansible-lint

Ansible-lint catches a lot of bad habits:

missing FQCNs
unsafe patterns
bad task names
dependency issues
risky latest behavior
broad ignore_errors

Molecule¶

Molecule is the test framework for Ansible roles, playbooks, and collections.

molecule test

Good Molecule suites usually verify:

converge works
second converge is idempotent
resulting system behavior is correct
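
As a rough sketch, the converge and verify playbooks of a default Molecule scenario might look like this. The two YAML documents stand for two separate files. The sketch assumes the role is named nginx, that Molecule can find it on the roles path, and that the test instance runs systemd so service_facts reports an nginx.service entry.

```yaml
# molecule/default/converge.yml: apply the role under test
- name: Converge
  hosts: all
  become: true
  roles:
    - role: nginx
---
# molecule/default/verify.yml: assert on resulting behavior, not on task output
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Collect service state
      ansible.builtin.service_facts:

    - name: Assert nginx is running
      ansible.builtin.assert:
        that:
          - ansible_facts.services['nginx.service'].state == 'running'
```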

Dependency declaration matters¶

If lint or syntax-check cannot resolve your modules, your repository is underspecified.

Declare required roles and collections in requirements files and install them in CI.

Example CI flow¶

ansible-galaxy collection install -r collections/requirements.yml
ansible-galaxy role install -r roles/requirements.yml
ansible-playbook playbooks/site.yml --syntax-check
ansible-lint
molecule test
ansible-playbook playbooks/site.yml --check --diff
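
Wired into a pipeline, that flow might look like the following sketch. It assumes GitHub Actions purely as an example CI system; the workflow name, Python version, and unpinned pip install are illustrative choices. The final --check --diff step is left out because it needs network access and credentials for real targets.

```yaml
# .github/workflows/ansible.yml (assumes GitHub Actions; adapt to your CI system)
name: ansible-validate
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install tooling
        run: pip install ansible-core ansible-lint molecule
      - name: Install declared dependencies
        run: |
          ansible-galaxy collection install -r collections/requirements.yml
          ansible-galaxy role install -r roles/requirements.yml
      - name: Syntax check and lint
        run: |
          ansible-playbook playbooks/site.yml --syntax-check
          ansible-lint
      - name: Run Molecule scenarios
        run: molecule test
```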

Idempotence test mindset¶

A good configuration playbook should usually look like this:

first run: changed
second run: ok, with nothing reported as changed

If the second run still reports changes, investigate why.

16. Execution Environments, Builder, and Navigator¶

This is the biggest modernization in this revision.

What an EE is¶

An Execution Environment is a container image that acts as your Ansible control node.

It packages:

ansible-core
collections
Python dependencies
system packages needed by those collections
config and runtime support

Why you should care¶

Without EEs:

“works on my laptop” nonsense
version drift across engineers and CI
mystery Python packages
AWX/controller mismatch

With EEs:

repeatable controller runtime
same dependencies in laptop/CI/controller
cleaner onboarding
fewer dependency ghost stories

Minimal EE definition¶

---
version: 3
images:
  base_image:
    name: docker.io/redhat/ubi9:latest
dependencies:
  ansible_core:
    package_pip: ansible-core
  ansible_runner:
    package_pip: ansible-runner
  galaxy: collections/requirements.yml
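
Real content usually needs Python and system packages too. A fuller sketch of a version 3 definition follows; the Python package name, the ansible-core pin, and the extra dependencies are assumptions chosen for illustration, so check them against your base image and collections.

```yaml
---
version: 3
images:
  base_image:
    name: docker.io/redhat/ubi9:latest
dependencies:
  python_interpreter:
    package_system: python3.11          # assumed package name on the base image
    python_path: /usr/bin/python3.11
  ansible_core:
    package_pip: "ansible-core>=2.16,<2.17"   # illustrative pin
  ansible_runner:
    package_pip: ansible-runner
  galaxy: collections/requirements.yml
  python:
    - boto3                             # example dependency for amazon.aws content
  system:
    - openssh-clients
```

Pinning ansible-core here is what makes the image reproducible: a rebuild next month resolves to the same minor release instead of whatever is newest.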

Build it¶

# reads execution-environment.yml from the current directory unless -f points elsewhere
ansible-builder build -t my-ee:latest

Run with navigator¶

ansible-navigator run playbooks/site.yml \
  --execution-environment-image my-ee:latest \
  --pull-policy missing \
  --mode stdout
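
Typing those flags on every run gets old. They can live in a settings file at the project root instead, as in this sketch; the image name matches the build step above, and the other values are reasonable starting points rather than requirements.

```yaml
# ansible-navigator.yml in the project root
---
ansible-navigator:
  execution-environment:
    image: my-ee:latest
    pull:
      policy: missing      # use the locally built image instead of pulling
  mode: stdout
  playbook-artifact:
    enable: false          # skip writing run artifacts into the project directory
```

With that file in place, ansible-navigator run playbooks/site.yml picks up the image and mode without extra arguments, and the settings travel with the repository.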

What `ansible-navigator` is¶

ansible-navigator is a CLI/TUI for running, reviewing, and troubleshooting Ansible content, especially with EEs.

Useful subcommands include:

run
doc
config
collections
images
exec

When to adopt EEs¶

Scenario	Recommendation
solo learning on one laptop	optional
team with CI	strongly recommended
AWX / Automation Controller	effectively standard practice
content with nontrivial deps	use EEs

17. Automation Controller and AWX¶

Current terminology:

AWX = upstream open-source project
Automation Controller = enterprise controller component inside Red Hat Ansible Automation Platform

Historical note: older material often says “Tower.” That is legacy naming.

Why use a controller¶

CLI is enough until you need:

RBAC
central credentials
audit trails
scheduling
web/API launches
inventory syncs
standardized execution environments

What a controller actually gives you¶

Capability	CLI	AWX / Controller
manual run	yes	yes
scheduling	crude/external	built in
RBAC	no	yes
credential management	ad hoc	structured
job history	scattered	centralized
inventory sync	manual	built in
standardized runtime	manual discipline	first-class with EEs

Rule¶

Do not adopt a controller because YAML feels important and enterprise-shaped. Adopt it because you need shared execution, governance, or scale.

18. Ansible vs Terraform vs Helm¶

These tools overlap a little and fight a lot in bad designs.

Simple split¶

Tool	Best at
Ansible	configuring and orchestrating existing systems
Terraform	provisioning infrastructure resources and dependency graphs
Helm	packaging/releasing Kubernetes app manifests

Good boundary examples¶

Terraform creates VMs, subnets, security groups, load balancers.
Ansible configures the OS, deploys packages, templates configs, coordinates cutovers.
Helm installs app stacks into Kubernetes.
Ansible can orchestrate Helm, but Helm should still own the release content.
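
A sketch of that last boundary: Ansible decides when and where the release happens, and the chart stays the source of truth for what gets deployed. It assumes the kubernetes.core collection is installed and that the helm binary and a working kubeconfig exist on the control node; the release name, chart path, and namespace are hypothetical.

```yaml
- name: Release the app chart from Ansible, leaving chart content to Helm
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Install or upgrade the release
      kubernetes.core.helm:
        name: myapp                     # hypothetical release name
        chart_ref: ./charts/myapp       # hypothetical local chart path
        release_namespace: apps
        create_namespace: true
        values:
          image:
            tag: "{{ release_id }}"
        wait: true
```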

Anti-pattern¶

Do not make one tool impersonate all the others because you are trying to reduce the number of logos in your architecture diagram.

19. Common Production Patterns¶

Pattern: validate config before reload¶

- name: Push nginx config safely
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: 'nginx -t -c %s'
  notify: reload nginx

Pattern: one-time DB migration during rolling app deploy¶

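- name: Run DB migration once from a safe node
  ansible.builtin.command: /opt/app/bin/migrate
  delegate_to: app1
  # run_once would fire once per serial batch; this condition holds for one host in the whole play
  when: inventory_hostname == ansible_play_hosts_all[0]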

Pattern: drain, change, health-check, re-enable¶

- name: Drain from LB
  ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
  delegate_to: localhost

- name: Deploy config
  ansible.builtin.include_role:
    name: app_release

- name: Wait for health
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
  register: app_health
  retries: 20
  delay: 3
  until: app_health.status == 200

- name: Re-enable in LB
  ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
  delegate_to: localhost

Pattern: assert preconditions before touching anything¶

- name: Assert supported platform
  ansible.builtin.assert:
    that:
      - ansible_facts['os_family'] in ['Debian', 'RedHat']
      - app_port | int >= 1024
    fail_msg: "Unsupported target or invalid app_port"

Pattern: dynamic include by OS family¶

- name: Load OS-specific tasks
  ansible.builtin.include_tasks: "{{ ansible_facts['os_family'] }}.yml"

Pattern: explicit read-only probe task¶

- name: Read current config version
  ansible.builtin.command: cat /etc/myapp/version
  register: cfg_ver
  changed_when: false
  failed_when: cfg_ver.rc != 0

20. Footguns¶

1. Using `shell` for everything¶

You lose idempotence, portability, readability, and check-mode usefulness.

2. Hiding operator-tunable vars in `vars/`¶

You create precedence bugs that look supernatural.

3. Trusting check mode too much¶

Some modules support it fully, some partially, some barely at all.

4. Forgetting `validate`¶

Broken config + automatic restart = self-inflicted outage.

5. Blind `ignore_errors: true`¶

That is not resiliency. That is burying evidence.

6. Running DB migrations per host¶

Congratulations, you invented distributed regret.

7. Depending on random collections installed globally¶

Your repo is now a snowflake.

8. Not pinning or declaring dependencies¶

Sooner or later CI and laptops diverge.

9. Assuming retry files exist¶

Modern defaults usually say no.

10. Printing secrets in debug output¶

Logs are forever. Or long enough to ruin your week.

11. Treating controller runtime as informal¶

If dependencies matter, use EEs.

12. Not testing second-run idempotence¶

Drift loops hide here.

13. Using stale terminology and stale docs blindly¶

“Tower,” old callback plugins, old Galaxy commands, old Python floors - all classic fossil layers.

21. Dense Cheat Sheet¶

Core commands¶

ansible all -i inventory/hosts.yml -m ansible.builtin.ping
ansible-playbook playbooks/site.yml
ansible-playbook playbooks/site.yml --syntax-check
ansible-playbook playbooks/site.yml --check --diff
ansible-playbook playbooks/site.yml --list-hosts
ansible-playbook playbooks/site.yml --list-tags
ansible-playbook playbooks/site.yml --start-at-task 'TASK NAME'
ansible-playbook playbooks/site.yml -vvv
ansible-inventory -i inventory/hosts.yml --graph
ansible-config dump --only-changed
ansible-doc ansible.builtin.template
ansible-lint
molecule test

Galaxy / dependency commands¶

ansible-galaxy role init myrole --init-path roles
ansible-galaxy role install -r roles/requirements.yml
ansible-galaxy collection install -r collections/requirements.yml
ansible-galaxy collection list

Vault commands¶

ansible-vault create inventory/group_vars/prod/vault.yml
ansible-vault edit inventory/group_vars/prod/vault.yml
ansible-vault view inventory/group_vars/prod/vault.yml
ansible-vault encrypt_string --stdin-name db_password   # type the secret, then Ctrl-D
ansible-playbook playbooks/site.yml --vault-id prod@~/.ansible/prod.vault.pass

High-signal playbook patterns¶

# Safe config push
ansible.builtin.template + validate + notify

# Read-only probe
changed_when: false

# Controlled imperative task
ansible.builtin.command + creates/removes

# Rolling deploy
serial + health checks + delegate_to + run_once

# Safer facts usage
ansible_facts['distribution']

# Better modules
ansible.builtin.package / service / copy / template / uri / assert

# Better error handling
failed_when / changed_when / block-rescue

Good defaults to remember¶

FQCNs everywhere
defaults are for knobs, vars are for constants
validate before reload/restart
do not use shell unless you must
do not assume check mode is perfect
declare dependencies
test second-run idempotence
use EEs when the runtime matters

22. Reference Links¶

Official docs used to modernize this guide:

Ansible installation guide: https://docs.ansible.com/projects/ansible-core/devel/installation_guide/intro_installation.html
Ansible getting started: https://docs.ansible.com/ansible/devel/getting_started/index.html
Interpreter discovery: https://docs.ansible.com/ansible/latest/reference_appendices/interpreter_discovery.html
Error handling: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
Check mode and diff mode: https://docs.ansible.com/projects/ansible/latest/playbook_guide/playbooks_checkmode.html
ansible.builtin.command: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html
ansible-galaxy CLI: https://docs.ansible.com/projects/ansible/latest/cli/ansible-galaxy.html
Collections install guide: https://docs.ansible.com/projects/ansible/latest/collections_guide/collections_installing.html
Default callback plugin and YAML result formatting: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/default_callback.html
Config settings reference: https://docs.ansible.com/projects/ansible/latest/reference_appendices/config.html
Windows host management and SSH: https://docs.ansible.com/ansible/latest/os_guide/intro_windows.html
Windows SSH: https://docs.ansible.com/projects/ansible-core/devel/os_guide/windows_ssh.html
Ansible Builder: https://docs.ansible.com/projects/builder/
Execution Environments getting started: https://docs.ansible.com/ansible/latest/getting_started_ee/index.html
Running an EE: https://docs.ansible.com/ansible/latest/getting_started_ee/run_execution_environment.html
Ansible Navigator: https://docs.ansible.com/projects/navigator/
Ansible Lint: https://docs.ansible.com/projects/lint/
Molecule: https://docs.ansible.com/projects/molecule/
Red Hat Ansible Automation Platform docs: https://docs.redhat.com/en/documentation/red_hat_ansible_automation_platform/latest

Final take¶

The strongest mental model for Ansible is this:

Ansible is a convergence engine plus orchestration glue.

Use modules for state, inventory for truth, roles for reuse, handlers for delayed side effects, validations for safety, and execution environments for reproducible control-node runtime.

When people get hurt with Ansible, it is usually not because Ansible is mysterious. It is because they wrote shell scripts in YAML, ignored precedence, skipped validation, hid dependencies, or treated controller runtime as folklore.

Do the opposite and Ansible stays boring - which, in operations, is the highest compliment.