Ansible - The Complete Guide (Revised, Current, Production-Focused)

Audience: Linux/sysadmin, platform, cloud, and operations engineers
Scope: Ansible fundamentals through production operating patterns
Style: opinionated, practical, current-era ansible-core ecosystem
Goal: get from "I know the buzzwords" to "I can ship safe automation without bricking fleets"


What changed in this revision

This rewrite keeps the strong operator-grade material from the original and removes or fixes the parts that would age badly or mislead people.

Fixed

  • outdated ansible-galaxy command examples
  • outdated YAML callback guidance
  • incorrect retry-file assumption
  • overly absolute statements about ignore_errors and check mode
  • dated Tower-only framing
  • unsafe/contradictory Vault examples
  • weak trivia, flashcards, and unsourced case-study filler

Added

  • ansible vs ansible-core
  • Execution Environments (EE)
  • ansible-builder
  • ansible-navigator
  • ansible-lint
  • FQCN guidance
  • modern CI/CD validation flow
  • safer examples using ansible_facts instead of injected fact variables

Table of Contents

  1. What Ansible Is
  2. Pick the Right Package and Install It
  3. Project Layout That Does Not Rot
  4. Inventory
  5. Playbooks
  6. Modules: Declarative First, Imperative Last
  7. Variables, Facts, and Precedence
  8. Templates and Handlers
  9. Roles and Collections
  10. Secrets and Vault
  11. Conditionals, Loops, and Error Handling
  12. Rolling Changes and Zero-Downtime Thinking
  13. Debugging
  14. Performance at Scale
  15. Testing and CI/CD
  16. Execution Environments, Builder, and Navigator
  17. Automation Controller and AWX
  18. Ansible vs Terraform vs Helm
  19. Common Production Patterns
  20. Footguns
  21. Dense Cheat Sheet
  22. Reference Links

1. What Ansible Is

Ansible is agentless automation. You run automation from a control node, Ansible connects to targets, executes modules, and converges them toward the desired state.

The mental model

Control node
  |
  | SSH / WinRM / PSRP / API / network transport
  v
Managed node or device
  |
  v
Module runs -> returns changed/ok/failed + structured result

The five things that matter

Principle | Meaning
Agentless | Usually nothing to install on Linux targets beyond what the module needs
Idempotent | Re-running should not keep changing things
Declarative | Ask for a state, not a sequence of shell commands
Push-based | You initiate changes from a control node
Extensible | Core + collections + plugins + inventories + callbacks

Where Ansible shines

  • fleet configuration
  • package/service/file management
  • OS and middleware standardization
  • rolling deployments
  • orchestration around APIs, network devices, cloud, and VMs
  • “glue” automation across systems

Where Ansible is not magic

  • It is not a general replacement for software engineering.
  • It is not the best tool for full-blown resource graph provisioning.
  • It is not fast if you write everything as shell and re-gather facts every five seconds.
  • It does not make dangerous ideas safe just because they are YAML.

2. Pick the Right Package and Install It

Modern Ansible is not one monolith anymore.

ansible vs ansible-core

Package | What you get | Use when
ansible-core | engine, CLI, builtin content, plugin framework | minimal, controlled environments
ansible | ansible-core plus a curated set of community collections | easier all-in-one workstation install

Practical recommendation

  • Use ansible-core when you want explicit dependencies and clean reproducibility.
  • Use ansible when you want a batteries-included learning/workstation setup.
  • For team-scale reproducibility, move toward Execution Environments.

Install examples

# Preferred for isolated workstation installs
pipx install ansible-core

# Or the broader package
pipx install ansible

# Verify
ansible --version
ansible-config dump --only-changed

Control node and target reality

  • Control node support changes over time. Match your documentation to your installed ansible-core version.
  • POSIX targets usually need Python for most modules.
  • Some targets are exceptions. Network devices often do not need remote Python because modules use other transports.
  • Windows is a different world: WinRM or PSRP is the common transport, and current Ansible also supports SSH against newer Windows builds running OpenSSH.

First ad-hoc commands

ansible all -i inventory/hosts.yml -m ansible.builtin.ping
ansible web -i inventory/hosts.yml -m ansible.builtin.command -a 'uptime'
ansible web -i inventory/hosts.yml -b -m ansible.builtin.package -a 'name=nginx state=present'
ansible web -i inventory/hosts.yml -b -m ansible.builtin.service -a 'name=nginx state=started enabled=true'

Ad-hoc rule of thumb

Ad-hoc commands are for:

  • quick inspection
  • one-off safe changes
  • triage

If you might ever run it twice, it probably wants to be a playbook.


3. Project Layout That Does Not Rot

A sane layout buys you more than clever YAML ever will.

ansible/
├── ansible.cfg
├── inventory/
│   ├── hosts.yml
│   ├── group_vars/
│   └── host_vars/
├── playbooks/
│   ├── site.yml
│   ├── web.yml
│   └── db.yml
├── roles/
│   ├── requirements.yml
│   ├── common/
│   └── nginx/
├── collections/
│   └── requirements.yml
├── files/
├── templates/
├── molecule/
├── .ansible-lint
└── README.md

Minimal ansible.cfg

[defaults]
inventory = inventory/hosts.yml
stdout_callback = default
callback_result_format = yaml
bin_ansible_callbacks = True
interpreter_python = auto_silent
host_key_checking = True
retry_files_enabled = False
forks = 20
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .cache/facts
roles_path = roles

[ssh_connection]
pipelining = True

Why this config

  • callback_result_format = yaml replaces old callback hacks.
  • retry_files_enabled = False matches modern defaults and avoids stale .retry habits.
  • interpreter_python = auto_silent uses interpreter discovery without noisy warnings.
  • pipelining and fact caching improve speed.

4. Inventory

Inventory answers two questions:

  1. Which hosts exist?
  2. What do we know about them?

Static inventory example

all:
  children:
    web:
      hosts:
        web1:
          ansible_host: 10.0.10.11
        web2:
          ansible_host: 10.0.10.12
      vars:
        app_env: production
        http_port: 8080
    db:
      hosts:
        db1:
          ansible_host: 10.0.20.11

Useful inventory variables

Variable | Purpose
ansible_host | actual address to connect to
ansible_user | remote username
ansible_port | non-default SSH port
ansible_connection | ssh, winrm, psrp, local, network_cli, etc.
ansible_python_interpreter | force a target Python path
ansible_become | privilege escalation default

Dynamic inventory

Use inventory plugins when the source of truth is elsewhere.

Common examples:

  • AWS EC2
  • VMware
  • OpenStack
  • Kubernetes
  • constructed inventories from metadata/tags

Example:

plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
filters:
  instance-state-name: running

Good inventory habits

  • Put environment-specific data in group_vars/ and host_vars/.
  • Keep inventory names stable even if IPs change.
  • Group by function and environment.
  • Do not bury secrets in inventory.
  • Use ansible-inventory --graph and --list constantly.
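
As a minimal sketch of the first habit, the inline vars from the static inventory example could live in a group_vars file beside the inventory instead. The path follows the layout from section 3; the nginx_worker_connections value is an illustrative assumption, not a recommendation.

```yaml
# inventory/group_vars/web.yml
# Loaded automatically for every host in the "web" group, because the
# file name matches the group name and the directory sits next to the
# inventory file.
app_env: production
http_port: 8080

# Illustrative value only; tune for your own hosts.
nginx_worker_connections: 1024
```

Running ansible-inventory -i inventory/hosts.yml --host web1 afterwards is a quick way to confirm the values are merged into the host's variables.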

5. Playbooks

A playbook is one or more plays. A play maps hosts to tasks.

Minimal playbook

---
- name: Configure web servers
  hosts: web
  become: true
  gather_facts: true

  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is enabled and started
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

The important play-level knobs

Key | Why it matters
hosts | target scope
become | privilege escalation
gather_facts | speed vs convenience
serial | rolling change size
vars | play-scoped variables
pre_tasks / post_tasks | guardrails and cleanup
handlers | delayed actions triggered by changes
strategy | lockstep or freer execution

Tagging

tasks:
  - name: Install packages
    ansible.builtin.package:
      name: nginx
      state: present
    tags: [packages]

  - name: Push config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: restart nginx
    tags: [config]

Run only what you need:

ansible-playbook playbooks/web.yml --tags config
ansible-playbook playbooks/web.yml --skip-tags packages
ansible-playbook playbooks/web.yml --list-tags
ansible-playbook playbooks/web.yml --list-tasks

6. Modules: Declarative First, Imperative Last

The single biggest quality divider in Ansible code is this:

Use a purpose-built module when one exists. Reach for command or shell only when you must.

Good

- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present

- name: Drop config file
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: '0644'
  notify: restart nginx

- name: Ensure service is up
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

Bad unless you truly need it

- name: Do everything with shell because reasons
  ansible.builtin.shell: |
    apt-get update
    apt-get install -y nginx
    systemctl enable --now nginx

That second example is how YAML cosplay turns into pager duty.

command vs shell

Module | Use when | Avoid when
ansible.builtin.command | run a simple command safely | you need pipes, redirects, shell expansion
ansible.builtin.shell | you truly need shell features | a normal module exists

Make imperative tasks less stupid

- name: Initialize app database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized

That gives command a guardrail and partial check-mode usefulness.
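
Two related guardrails, sketched under assumptions: removes is the mirror image of creates, and changed_when can derive the changed status from command output when no marker file exists. Both script paths and the 'no changes' output string are hypothetical.

```yaml
---
- name: Guarded imperative tasks (sketch)
  hosts: web
  become: true
  gather_facts: false

  tasks:
    # "removes" runs the command only while the named path still exists.
    - name: Clean up legacy spool directory
      ansible.builtin.command:
        cmd: /opt/app/bin/purge-spool   # hypothetical script
        removes: /var/spool/legacy-app

    # No marker file here, so report "changed" based on what the
    # command prints instead of on every run.
    - name: Apply schema updates
      ansible.builtin.command:
        cmd: /opt/app/bin/schema-sync   # hypothetical script
      register: schema_sync
      changed_when: "'no changes' not in schema_sync.stdout"
```

The second task only stays honest if the script's output is stable, so treat output matching as a fallback rather than a first choice.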

Modules worth knowing cold

Category | Modules
packages | package, apt, dnf, yum (legacy), pip
services | service, systemd_service
files | copy, template, file, lineinfile, replace, assemble
users | user, group, authorized_key (in ansible.posix)
commands | command, shell, script, raw
control flow | include_tasks, import_tasks, include_role, import_role, meta
validation | assert, wait_for, uri, stat, slurp
orchestration | set_fact, add_host, group_by, plus the delegate_to task keyword

Use FQCNs

Prefer:

ansible.builtin.package:

not:

package:

Why:

  • clearer provenance
  • fewer name collisions
  • better linting
  • easier documentation lookup


7. Variables, Facts, and Precedence

This is where many playbooks become haunted.

The practical precedence rule

There is a long official precedence chain. In real life, remember this order:

Usually lower -> higher | Typical use
role defaults | safe knobs meant to be overridden
inventory vars / group_vars / host_vars | environment-specific data
play vars | local overrides for one play
role vars (vars/main.yml) | internal role constants; these outrank inventory and play vars
task vars / include vars | narrow-scope overrides
registered vars / set_fact | runtime data
extra vars (-e) | explicit operator override; wins hard

defaults/ vs vars/ in roles

Path | Use for | Avoid putting here
defaults/main.yml | values users should tune | hard constants
vars/main.yml | truly internal constants you do not expect callers to override | ports, versions, feature flags, env-specific values

If operators might need to change it, it belongs in defaults/, not vars/.
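
A minimal sketch of that split for the nginx role, shown as two YAML documents standing in for the two files. Only nginx_listen_port appears elsewhere in this guide; the other variable names and all values are assumptions for illustration.

```yaml
# roles/nginx/defaults/main.yml
# Lowest precedence: inventory, play vars, or -e can override these.
nginx_listen_port: 80
nginx_worker_connections: 1024
---
# roles/nginx/vars/main.yml
# Outranks inventory and play vars, so keep this to internal constants
# that callers are not expected to change.
nginx_conf_dir: /etc/nginx
nginx_service_name: nginx
```

If nginx_listen_port lived in vars/main.yml instead, a group_vars override would silently lose, which is the kind of bug that looks supernatural.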

Facts

Facts are discovered target data.

- name: Show OS family
  ansible.builtin.debug:
    msg: "{{ ansible_facts['os_family'] }}"

Use ansible_facts[...] explicitly. It is clearer and more future-proof than relying on injected top-level fact variables.

Registered variables

- name: Check app health
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
  register: healthcheck

- name: Show response body
  ansible.builtin.debug:
    var: healthcheck.json

Good variable hygiene

  • keep names specific: nginx_worker_connections, not workers
  • keep environment data in inventory, not inside roles
  • avoid global variable soup
  • use assert early for required inputs

- name: Assert required variables exist
  ansible.builtin.assert:
    that:
      - app_name is defined
      - app_port is defined
      - app_port | int > 0
    fail_msg: "Required app variables are missing or invalid"

8. Templates and Handlers

Templates make dynamic config files. Handlers keep restarts from becoming denial-of-service attacks against yourself.

Template example

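- name: Render nginx main config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: '0644'
    validate: 'nginx -t -c %s'
  notify: reload nginx

Why validate matters

Without validate, you can push a broken config and kill a service.

With validate, Ansible runs the command against the candidate file before replacing the real one. The command has to make sense for that file on its own: nginx -t -c %s expects a complete main configuration, so it fits nginx.conf but not a conf.d fragment such as a single vhost file. For fragments, validate through a wrapper config that includes the candidate, or run nginx -t as a separate task before the reload.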

Handler example

handlers:
  - name: reload nginx
    ansible.builtin.service:
      name: nginx
      state: reloaded

Handler behavior that matters

  • A handler runs only if something notified it.
  • A handler runs once per host each time handlers are flushed (by default at the end of pre_tasks, roles/tasks, and post_tasks), even if multiple tasks notify it.
  • Use meta: flush_handlers when you need the restart/reload now, not at the end.

- name: Force handler now
  ansible.builtin.meta: flush_handlers

Template habits

  • keep logic light
  • put complicated logic in vars/filter plugins, not giant Jinja spaghetti
  • validate configs whenever the software supports it
  • prefer reload over restart when safe

9. Roles and Collections

Roles package related tasks. Collections package content at a larger namespace level.

Role structure

roles/
  nginx/
    defaults/main.yml
    vars/main.yml
    tasks/main.yml
    handlers/main.yml
    templates/
    files/
    meta/main.yml

Create a role skeleton

ansible-galaxy role init nginx --init-path roles

Use a role

- name: Configure web servers
  hosts: web
  become: true
  roles:
    - role: nginx
      vars:
        nginx_listen_port: 8080

import_role vs include_role

Feature | import_role | include_role
resolution | parse time | runtime
best for | static composition | conditional or looped inclusion
behavior | more predictable tags/parse-time structure | more flexible
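
The difference can be sketched in one play. The selinux_hardening role name is hypothetical; nginx is the role used throughout this guide.

```yaml
---
- name: Static and dynamic role use (sketch)
  hosts: web
  become: true

  tasks:
    # Static: resolved when the playbook is parsed, so the role's tasks
    # appear in --list-tasks and inherit the tag below.
    - name: Always apply the nginx role
      ansible.builtin.import_role:
        name: nginx
      tags: [nginx]

    # Dynamic: evaluated per host at runtime, so "when" gates the
    # include itself rather than being copied onto every task inside.
    - name: Apply hardening only on RedHat-family hosts
      ansible.builtin.include_role:
        name: selinux_hardening   # hypothetical role name
      when: ansible_facts['os_family'] == 'RedHat'
```

A reasonable default is import_role for the fixed skeleton of a play and include_role only where the decision genuinely depends on runtime data.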

Collections

Collections are the distribution unit for modern Ansible content.

Examples:

  • amazon.aws
  • kubernetes.core
  • ansible.posix
  • community.general

Install collections

ansible-galaxy collection install -r collections/requirements.yml

Example collections/requirements.yml:

collections:
  - name: ansible.posix
  - name: community.general
  - name: amazon.aws

Install roles

ansible-galaxy role install -r roles/requirements.yml

Dependency rule

Do not assume random collections happen to be installed on someone else's workstation. Declare them.
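
Declaring is stronger when it also constrains versions. A sketch of a pinned requirements file follows; the version ranges are placeholders, so substitute the releases you have actually tested.

```yaml
# collections/requirements.yml
# Version ranges below are placeholders, not recommendations.
collections:
  - name: ansible.posix
    version: ">=1.5.0,<2.0.0"
  - name: community.general
    version: ">=8.0.0,<9.0.0"
  - name: amazon.aws
    version: ">=7.0.0,<8.0.0"
```

Ranges allow patch updates while blocking major-version jumps; exact pins trade that convenience for stricter reproducibility.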


10. Secrets and Vault

Vault is for encrypting data you must keep with your automation. It is not a substitute for a full secret-management strategy, but it is far better than plaintext regret.

Basic Vault commands

ansible-vault create inventory/group_vars/prod/vault.yml
ansible-vault edit inventory/group_vars/prod/vault.yml
ansible-vault view inventory/group_vars/prod/vault.yml
ansible-vault rekey inventory/group_vars/prod/vault.yml

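Encrypt a string safely

# Reads the value from stdin: type or paste it, then press Ctrl-D twice
# without pressing Enter (a trailing newline would become part of the secret).
ansible-vault encrypt_string --stdin-name db_password

Do not do either of these:

ansible-vault encrypt_string 'supersecret' --name db_password
echo -n 'supersecret' | ansible-vault encrypt_string --stdin-name db_password

Both put the literal secret on a command line, which leaks it into shell history.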

Split vars from secrets

inventory/
  group_vars/
    prod/
      vars.yml
      vault.yml

# vars.yml
app_db_user: appuser
app_db_password: "{{ vault_app_db_password }}"

# vault.yml (shown decrypted; the file on disk is encrypted with ansible-vault)
vault_app_db_password: supersecret

Multiple vault identities

ansible-playbook playbooks/site.yml \
  --vault-id dev@prompt \
  --vault-id prod@~/.ansible/prod.vault.pass

no_log

- name: Create DB user
  community.postgresql.postgresql_user:
    name: "{{ app_db_user }}"
    password: "{{ app_db_password }}"
  no_log: true

When Vault is not enough

For larger environments, prefer pulling secrets from a real secret store when practical:

  • HashiCorp Vault
  • AWS Secrets Manager
  • cloud KMS-backed patterns
  • controller credential integrations

Hard rules

  • never put plaintext secrets in repo history
  • never pass secrets directly on command lines
  • never print secrets in debug or CI logs
  • restrict vault password file permissions
  • keep secret scope tight

11. Conditionals, Loops, and Error Handling

Conditionals

- name: Install SELinux helpers on RedHat
  ansible.builtin.package:
    name: policycoreutils-python-utils
    state: present
  when: ansible_facts['os_family'] == 'RedHat'

Loops

- name: Install baseline packages
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - curl
    - vim
    - git

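Prefer loop unless a module has a better native bulk option. Package modules are the usual exception: passing the whole list to name lets the package manager resolve everything in one transaction instead of one call per item.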
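
The native bulk option for packages can be sketched as follows, assuming the underlying package manager module (apt, dnf, and similar) accepts a list for name, which the common ones do.

```yaml
---
- name: Baseline packages in one transaction (sketch)
  hosts: all
  become: true

  tasks:
    # One module call with a list, instead of one call per item.
    - name: Install baseline packages
      ansible.builtin.package:
        name:
          - curl
          - vim
          - git
        state: present
```

Keep loop for modules that only take a single item per call, such as creating several users or templating several files.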

failed_when

- name: Run a health probe script
  ansible.builtin.command: /opt/app/bin/health-probe
  register: probe
  changed_when: false
  failed_when:
    - probe.rc != 0
    - "'warming up' not in probe.stdout"

changed_when

- name: Read current app version
  ansible.builtin.command: cat /opt/app/VERSION
  register: version_out
  changed_when: false

ignore_errors

Use sparingly. It does not ignore everything. It only ignores a task that ran and returned a failed result. It does not rescue you from undefined variables, syntax errors, connection failures, or broad stupidity.

Better pattern:

- name: Try old service name
  ansible.builtin.command: systemctl is-active legacy-app
  register: legacy_service
  changed_when: false
  failed_when: false
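
The probe only pays off if something acts on its result. A sketch of the follow-up, assuming the legacy-app unit name from the task above and a systemd host:

```yaml
---
- name: Branch on a probe result (sketch)
  hosts: all
  become: true
  gather_facts: false

  tasks:
    - name: Try old service name
      ansible.builtin.command: systemctl is-active legacy-app
      register: legacy_service
      changed_when: false
      failed_when: false
      check_mode: false   # read-only probe; run it even under --check

    # systemctl is-active exits 0 only when the unit is active.
    - name: Stop the legacy service if it is still running
      ansible.builtin.service:
        name: legacy-app
        state: stopped
      when: legacy_service.rc == 0
```

Unlike ignore_errors, this keeps the run output clean and makes the decision explicit in a when condition.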

block / rescue / always

- block:
    - name: Deploy release
      ansible.builtin.command: /opt/app/bin/deploy {{ release_id }}

    - name: Check health
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200

  rescue:
    - name: Roll back release
      ansible.builtin.command: /opt/app/bin/rollback

  always:
    - name: Emit deploy summary
      ansible.builtin.debug:
        msg: "Deployment attempt finished"

Includes: static vs dynamic

Feature | Static | Dynamic
tasks | import_tasks | include_tasks
role | import_role | include_role

Use dynamic includes when the include itself depends on runtime conditions.


12. Rolling Changes and Zero-Downtime Thinking

Ansible does not give you zero downtime. Your design does. Ansible just enforces the choreography.

The basic rolling pattern

- name: Rolling deploy to web tier
  hosts: web
  become: true
  serial: 2
  max_fail_percentage: 0

  pre_tasks:
    - name: Drain node from load balancer
      ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
      delegate_to: localhost
      become: false  # play-level become would otherwise apply on the control node
      changed_when: true

  roles:
    - role: app_release

  post_tasks:
    - name: Wait for local health endpoint
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200
      register: local_health
      retries: 20
      delay: 3
      until: local_health.status == 200

    - name: Re-add node to load balancer
      ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
      delegate_to: localhost
      become: false  # play-level become would otherwise apply on the control node
      changed_when: true

The key knobs

Setting | Why it matters
serial | blast radius per batch
max_fail_percentage | when to stop the rollout
any_errors_fatal | fail the whole play if one host fails
delegate_to | run control-plane/API tasks elsewhere
run_once | do something one time instead of per host
throttle | limit concurrency for one task
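
The knobs not shown in the rolling pattern above can be sketched together. The myapp service name is hypothetical, and the 25% batch size is an arbitrary illustration.

```yaml
---
- name: Rollout guardrails (sketch)
  hosts: web
  become: true
  serial: "25%"            # batch size as a share of the targeted hosts
  any_errors_fatal: true   # one failed host stops the play for everyone

  tasks:
    # Even inside a batch, restart one host at a time.
    - name: Restart app
      ansible.builtin.service:
        name: myapp   # hypothetical service name
        state: restarted
      throttle: 1
```

serial limits how many hosts are in flight per batch, while throttle narrows a single expensive or risky task further without slowing the rest of the play.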

run_once example

- name: Run DB migration once
  ansible.builtin.command: /opt/app/bin/migrate
  run_once: true
  delegate_to: db1

Deployment rule

Never mark a node healthy because the playbook finished. Mark it healthy because the service is healthy.


13. Debugging

When Ansible surprises you, the bug is usually in one of four places:

  1. inventory scope
  2. variable precedence
  3. module behavior
  4. your own assumptions, which were apparently written by a goblin

Commands that pay rent

ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --host web1
ansible-playbook playbooks/site.yml --syntax-check
ansible-playbook playbooks/site.yml --check --diff
ansible-playbook playbooks/site.yml --list-hosts
ansible-playbook playbooks/site.yml --list-tags
ansible-playbook playbooks/site.yml --start-at-task 'Render nginx vhost'
ansible-playbook playbooks/site.yml -vvv

Debug variable state

- name: Debug important vars
  ansible.builtin.debug:
    msg:
      app_env: "{{ app_env | default('UNSET') }}"
      app_port: "{{ app_port | default('UNSET') }}"
      distribution: "{{ ansible_facts['distribution'] | default('UNKNOWN') }}"

Fast diagnosis patterns

Symptom | Usual cause
task skipped unexpectedly | when evaluated false
variable has weird value | precedence collision
module says changed every run | non-idempotent task or bad changed_when
check mode lies | module lacks or only partially supports check mode
target fails with Python issue | interpreter discovery mismatch
collection/module not found | dependency not declared or installed
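
For the check-mode row, one common mitigation is to let read-only probes run for real under --check so that later tasks see actual data. A sketch, reusing the VERSION path from section 11:

```yaml
---
- name: Make check mode runs more truthful (sketch)
  hosts: web
  gather_facts: false

  tasks:
    # Read-only, so it is safe to execute even when the run uses --check.
    # Without check_mode: false the command would be skipped and the
    # registered result would carry no stdout.
    - name: Read current app version
      ansible.builtin.command: cat /opt/app/VERSION
      register: version_out
      changed_when: false
      check_mode: false

    - name: Refuse to continue without a readable version
      ansible.builtin.assert:
        that:
          - version_out.stdout | trim | length > 0
        fail_msg: "Could not read /opt/app/VERSION"
```

Only apply check_mode: false to tasks that genuinely change nothing; on a mutating task it defeats the point of a dry run.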

Retry-file myth

Old playbooks often say:

ansible-playbook site.yml --limit @site.retry

That only works if retry files are enabled. Modern defaults usually have them disabled. Do not build your operating habits around stale .retry files.

Better recovery tools:

  • --start-at-task
  • --limit
  • clear role/play separation
  • resumable deployment logic


14. Performance at Scale

Performance fixes should preserve correctness. A fast wrong playbook is just a more efficient outage.

Big wins

1. Disable fact gathering when you do not need it

- hosts: localhost
  gather_facts: false

2. Enable pipelining

[ssh_connection]
pipelining = True

3. Use fact caching

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .cache/facts

4. Avoid shell-heavy loops

Do not do this across 500 hosts unless you enjoy self-harm by latency.

5. Raise forks carefully

[defaults]
forks = 30

More forks can help, until the controller, bastion, network, target services, or API rate limits slap you.

Strategy plugins

  • linear: default, predictable, batch-safe
  • free: hosts proceed independently; good for some workloads, dangerous for others

Async tasks

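- name: Kick off long-running task
  ansible.builtin.command: /opt/app/bin/reindex
  async: 3600
  poll: 0
  register: job_result   # holds ansible_job_id for the status check

- name: Check status later
  ansible.builtin.async_status:
    jid: "{{ job_result.ansible_job_id }}"
  register: job_status
  until: job_status.finished
  retries: 120
  delay: 30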

Performance rule

Optimize in this order:

  1. reduce unnecessary work
  2. use proper modules
  3. reduce fact gathering
  4. enable pipelining/caching
  5. increase concurrency

Not the reverse.


15. Testing and CI/CD

This is one of the biggest missing pieces in many Ansible guides. Production-safe Ansible is not just writing playbooks. It is proving they are sane before they touch systems.

Minimum validation stack

  1. syntax check
  2. lint
  3. dependency install
  4. Molecule or other test execution
  5. check mode where meaningful
  6. idempotence check

Syntax check

ansible-playbook playbooks/site.yml --syntax-check

Lint

ansible-lint

Ansible-lint catches a lot of bad habits:

  • missing FQCNs
  • unsafe patterns
  • bad task names
  • dependency issues
  • risky latest behavior
  • broad ignore_errors

Molecule

Molecule is the test framework for Ansible roles, playbooks, and collections.

molecule test

Good Molecule suites usually verify:

  • converge works
  • second converge is idempotent
  • resulting system behavior is correct
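
The behavior check is typically a small verify playbook. The sketch below assumes Molecule's default scenario layout with the Ansible verifier, and a test instance where the role under test makes nginx listen on port 8080; adjust both to your scenario.

```yaml
---
# molecule/default/verify.yml
- name: Verify
  hosts: all
  gather_facts: false

  tasks:
    - name: Wait for the listener
      ansible.builtin.wait_for:
        port: 8080        # assumed listen port for this scenario
        timeout: 30

    # The task fails if the response status is not 200.
    - name: Check that nginx answers locally
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/
        status_code: 200
```

Checking behavior (a port answers, an endpoint returns 200) catches more than re-asserting that a package is installed, which converge already proved.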

Dependency declaration matters

If lint or syntax-check cannot resolve your modules, your repository is underspecified.

Declare required roles and collections in requirements files and install them in CI.

Example CI flow

ansible-galaxy collection install -r collections/requirements.yml
ansible-galaxy role install -r roles/requirements.yml
ansible-playbook playbooks/site.yml --syntax-check
ansible-lint
molecule test
ansible-playbook playbooks/site.yml --check --diff

Idempotence test mindset

A good configuration playbook should usually look like this:

  • first run: changed
  • second run: mostly ok

If the second run still reports changes, investigate why.


16. Execution Environments, Builder, and Navigator

This is the biggest modernization in this revision.

What an EE is

An Execution Environment is a container image that acts as your Ansible control node.

It packages:

  • ansible-core
  • collections
  • Python dependencies
  • system packages needed by those collections
  • config and runtime support

Why you should care

Without EEs:

  • “works on my laptop” nonsense
  • version drift across engineers and CI
  • mystery Python packages
  • AWX/controller mismatch

With EEs:

  • repeatable controller runtime
  • same dependencies in laptop/CI/controller
  • cleaner onboarding
  • fewer dependency ghost stories

Minimal EE definition

---
version: 3
images:
  base_image:
    name: docker.io/redhat/ubi9:latest
dependencies:
  ansible_core:
    package_pip: ansible-core
  ansible_runner:
    package_pip: ansible-runner
  galaxy: collections/requirements.yml

Build it

# Uses execution-environment.yml in the current directory unless -f points elsewhere
ansible-builder build -t my-ee:latest

Run with navigator

ansible-navigator run playbooks/site.yml \
  --execution-environment-image my-ee:latest \
  --mode stdout

What ansible-navigator is

ansible-navigator is a CLI/TUI for running, reviewing, and troubleshooting Ansible content, especially with EEs.

Useful subcommands include:

  • run
  • doc
  • config
  • collections
  • images
  • exec
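
Rather than repeating flags on every run, the same choices can live in a settings file. A minimal sketch, assuming a file named ansible-navigator.yml in the project root and the my-ee:latest image built above; check the key names against the settings reference for your installed version.

```yaml
# ansible-navigator.yml
---
ansible-navigator:
  execution-environment:
    enabled: true
    image: my-ee:latest
    pull:
      policy: missing   # pull only when the image is not present locally
  mode: stdout
```

With this in place, ansible-navigator run playbooks/site.yml should pick up the image and output mode without extra arguments.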

When to adopt EEs

Scenario | Recommendation
solo learning on one laptop | optional
team with CI | strongly recommended
AWX / Automation Controller | effectively standard practice
content with nontrivial deps | use EEs

17. Automation Controller and AWX

Current terminology:

  • AWX = upstream open-source project
  • Automation Controller = enterprise controller component inside Red Hat Ansible Automation Platform

Historical note: older material often says “Tower.” That is legacy naming.

Why use a controller

CLI is enough until you need:

  • RBAC
  • central credentials
  • audit trails
  • scheduling
  • web/API launches
  • inventory syncs
  • standardized execution environments

What a controller actually gives you

Capability | CLI | AWX / Controller
manual run | yes | yes
scheduling | crude/external | built in
RBAC | no | yes
credential management | ad hoc | structured
job history | scattered | centralized
inventory sync | manual | built in
standardized runtime | manual discipline | first-class with EEs

Rule

Do not adopt a controller because YAML feels important and enterprise-shaped. Adopt it because you need shared execution, governance, or scale.


18. Ansible vs Terraform vs Helm

These tools overlap a little and fight a lot in bad designs.

Simple split

Tool | Best at
Ansible | configuring and orchestrating existing systems
Terraform | provisioning infrastructure resources and dependency graphs
Helm | packaging/releasing Kubernetes app manifests

Good boundary examples

  • Terraform creates VMs, subnets, security groups, load balancers.
  • Ansible configures the OS, deploys packages, templates configs, coordinates cutovers.
  • Helm installs app stacks into Kubernetes.
  • Ansible can orchestrate Helm, but Helm should still own the release content.
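
The last boundary can be sketched with the kubernetes.core.helm module. This assumes the kubernetes.core collection, a helm binary, and a working kubeconfig on the node that runs the task; the release name, chart path, and values structure are hypothetical, and release_id is the variable used in the deploy examples earlier.

```yaml
---
- name: Release a chart from Ansible (sketch)
  hosts: localhost
  connection: local
  gather_facts: false

  tasks:
    # Ansible decides when and with which inputs; the chart still owns
    # the manifests.
    - name: Install or upgrade the app release
      kubernetes.core.helm:
        name: myapp                  # hypothetical release name
        chart_ref: ./charts/myapp    # hypothetical chart path
        release_namespace: myapp
        create_namespace: true
        values:
          image:
            tag: "{{ release_id }}"
        wait: true
```

Ansible passes in a handful of values and sequences the release with other steps; templating Kubernetes manifests in Jinja alongside the chart would blur the boundary.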

Anti-pattern

Do not make one tool impersonate all the others because you are trying to reduce the number of logos in your architecture diagram.


19. Common Production Patterns

Pattern: validate config before reload

- name: Push nginx config safely
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    validate: 'nginx -t -c %s'
  notify: reload nginx

Pattern: one-time DB migration during rolling app deploy

- name: Run DB migration once from a safe node
  ansible.builtin.command: /opt/app/bin/migrate
  run_once: true
  delegate_to: app1

Pattern: drain, change, health-check, re-enable

- name: Drain from LB
  ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
  delegate_to: localhost

- name: Deploy config
  ansible.builtin.include_role:
    name: app_release

- name: Wait for health
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/health
    status_code: 200
  register: app_health
  retries: 20
  delay: 3
  until: app_health.status == 200

- name: Re-enable in LB
  ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
  delegate_to: localhost

Pattern: assert preconditions before touching anything

- name: Assert supported platform
  ansible.builtin.assert:
    that:
      - ansible_facts['os_family'] in ['Debian', 'RedHat']
      - app_port | int >= 1024
    fail_msg: "Unsupported target or invalid app_port"

Pattern: dynamic include by OS family

- name: Load OS-specific tasks
  ansible.builtin.include_tasks: "{{ ansible_facts['os_family'] }}.yml"

Pattern: explicit read-only probe task

- name: Read current config version
  ansible.builtin.command: cat /etc/myapp/version
  register: cfg_ver
  changed_when: false
  failed_when: cfg_ver.rc != 0

20. Footguns

1. Using shell for everything

You lose idempotence, portability, readability, and check-mode usefulness.

2. Hiding operator-tunable vars in vars/

You create precedence bugs that look supernatural.

3. Trusting check mode too much

Some modules support it fully, some partially, some barely at all.

4. Forgetting validate

Broken config + automatic restart = self-inflicted outage.

5. Blind ignore_errors: true

That is not resiliency. That is burying evidence.

6. Running DB migrations per host

Congratulations, you invented distributed regret.

7. Depending on random collections installed globally

Your repo is now a snowflake.

8. Not pinning or declaring dependencies

Sooner or later CI and laptops diverge.

9. Assuming retry files exist

Modern defaults usually say no.

10. Printing secrets in debug output

Logs are forever. Or long enough to ruin your week.

11. Treating controller runtime as informal

If dependencies matter, use EEs.

12. Not testing second-run idempotence

Drift loops hide here.

13. Using stale terminology and stale docs blindly

“Tower,” old callback plugins, old Galaxy commands, old Python floors - all classic fossil layers.


21. Dense Cheat Sheet

Core commands

ansible all -i inventory/hosts.yml -m ansible.builtin.ping
ansible-playbook playbooks/site.yml
ansible-playbook playbooks/site.yml --syntax-check
ansible-playbook playbooks/site.yml --check --diff
ansible-playbook playbooks/site.yml --list-hosts
ansible-playbook playbooks/site.yml --list-tags
ansible-playbook playbooks/site.yml --start-at-task 'TASK NAME'
ansible-playbook playbooks/site.yml -vvv
ansible-inventory -i inventory/hosts.yml --graph
ansible-config dump --only-changed
ansible-doc ansible.builtin.template
ansible-lint
molecule test

Galaxy / dependency commands

ansible-galaxy role init myrole --init-path roles
ansible-galaxy role install -r roles/requirements.yml
ansible-galaxy collection install -r collections/requirements.yml
ansible-galaxy collection list

Vault commands

ansible-vault create inventory/group_vars/prod/vault.yml
ansible-vault edit inventory/group_vars/prod/vault.yml
ansible-vault view inventory/group_vars/prod/vault.yml
ansible-vault encrypt_string --stdin-name db_password   # type the secret, then Ctrl-D twice; never echo it in
ansible-playbook playbooks/site.yml --vault-id prod@~/.ansible/prod.vault.pass

High-signal playbook patterns

# Safe config push
ansible.builtin.template + validate + notify

# Read-only probe
changed_when: false

# Controlled imperative task
ansible.builtin.command + creates/removes

# Rolling deploy
serial + health checks + delegate_to + run_once

# Safer facts usage
ansible_facts['distribution']

# Better modules
ansible.builtin.package / service / copy / template / uri / assert

# Better error handling
failed_when / changed_when / block-rescue

Good defaults to remember

  • FQCNs everywhere
  • defaults are for knobs, vars are for constants
  • validate before reload/restart
  • do not use shell unless you must
  • do not assume check mode is perfect
  • declare dependencies
  • test second-run idempotence
  • use EEs when the runtime matters

22. Reference Links

Official docs used to modernize this guide:

  • Ansible installation guide: https://docs.ansible.com/projects/ansible-core/devel/installation_guide/intro_installation.html
  • Ansible getting started: https://docs.ansible.com/ansible/devel/getting_started/index.html
  • Interpreter discovery: https://docs.ansible.com/ansible/latest/reference_appendices/interpreter_discovery.html
  • Error handling: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
  • Check mode and diff mode: https://docs.ansible.com/projects/ansible/latest/playbook_guide/playbooks_checkmode.html
  • ansible.builtin.command: https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html
  • ansible-galaxy CLI: https://docs.ansible.com/projects/ansible/latest/cli/ansible-galaxy.html
  • Collections install guide: https://docs.ansible.com/projects/ansible/latest/collections_guide/collections_installing.html
  • Default callback plugin and YAML result formatting: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/default_callback.html
  • Config settings reference: https://docs.ansible.com/projects/ansible/latest/reference_appendices/config.html
  • Windows host management and SSH: https://docs.ansible.com/ansible/latest/os_guide/intro_windows.html
  • Windows SSH: https://docs.ansible.com/projects/ansible-core/devel/os_guide/windows_ssh.html
  • Ansible Builder: https://docs.ansible.com/projects/builder/
  • Execution Environments getting started: https://docs.ansible.com/en/latest/getting_started_ee/index.html
  • Running an EE: https://docs.ansible.com/ansible/latest/getting_started_ee/run_execution_environment.html
  • Ansible Navigator: https://docs.ansible.com/projects/navigator/
  • Ansible Lint: https://docs.ansible.com/projects/lint/
  • Molecule: https://docs.ansible.com/projects/molecule/
  • Red Hat Ansible Automation Platform docs: https://docs.redhat.com/en/documentation/red_hat_ansible_automation_platform/latest

Final take

The strongest mental model for Ansible is this:

Ansible is a convergence engine plus orchestration glue.

Use modules for state, inventory for truth, roles for reuse, handlers for delayed side effects, validations for safety, and execution environments for reproducible control-node runtime.

When people get hurt with Ansible, it is usually not because Ansible is mysterious. It is because they wrote shell scripts in YAML, ignored precedence, skipped validation, hid dependencies, or treated controller runtime as folklore.

Do the opposite and Ansible stays boring - which, in operations, is the highest compliment.