Quiz: Ansible Deep Dive
15 questions
L1 (4 questions)
1. You define app_version in inventory group_vars, playbook group_vars, and host_vars. All three conflict. Which value wins, and how would you override all of them for a single run?
Show answer
Ansible has 22 levels of variable precedence. For these three: host_vars (most specific) beats playbook group_vars, which beats inventory group_vars. To override everything for a single run, use --extra-vars (-e) on the command line — extra vars have the highest precedence and override all other sources. The precedence order from lowest to highest: role defaults → inventory group_vars → playbook group_vars → inventory host_vars → playbook host_vars → play vars → role vars → task vars → extra vars. When debugging precedence issues, run 'ansible <host> -m debug -a "var=app_version"' against the target host to see the resolved value.
2. A handler named 'restart nginx' is notified by a task, but the handler never executes even though the playbook succeeds. What are three possible reasons?
Show answer
1. The notifying task reported 'ok' (no change) instead of 'changed' — handlers only fire when the task actually changes something. Running the same playbook twice means the second run has no changes, so no handler fires.
2. The notifying task actually failed, with the failure masked by ignore_errors — a failed task never triggers its notify, and ignore_errors lets the playbook still report success.
3. The handler name doesn't match exactly (whitespace, case) — Ansible matches handler names as strings. Fix: use 'listen' topics instead of exact name matching, or use 'meta: flush_handlers' to force handler execution mid-play rather than waiting until end of play.
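The 'listen' and flush_handlers fixes can be sketched together in one play (module arguments and file paths here are illustrative, not from the quiz):

```yaml
# Sketch: handler keyed on a 'listen' topic (immune to name typos),
# plus meta: flush_handlers to force execution mid-play.
- hosts: web
  tasks:
    - name: Deploy nginx config
      ansible.builtin.template:
        src: nginx.conf.j2          # illustrative template name
        dest: /etc/nginx/nginx.conf
      notify: reload web stack      # matches the listen topic, not the handler name

    - name: Run pending handlers now instead of at end of play
      ansible.builtin.meta: flush_handlers

  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
      listen: reload web stack
```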
3. You set async: 3600 and poll: 0 on a long-running backup task. The playbook finishes immediately. How do you verify the backup actually completed?
Show answer
With poll: 0, Ansible fires the task and moves on without waiting (fire-and-forget). The task returns a job ID in ansible_job_id. To check completion:
1. Register the result: 'register: backup_job'.
2. Add a later task using the async_status module: 'async_status: jid={{ backup_job.ansible_job_id }}' with 'register: job_result' and 'until: job_result.finished' with 'retries: 120' and 'delay: 30'. This polls every 30 seconds for up to an hour. Without this follow-up, you have no way to know if the backup succeeded, failed, or is still running.
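The two steps above look like this in task form (the backup script path is a placeholder):

```yaml
- name: Start backup (fire-and-forget)
  ansible.builtin.command: /usr/local/bin/backup.sh   # illustrative path
  async: 3600
  poll: 0
  register: backup_job

- name: Poll until the backup job reports finished
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 120   # 120 retries x 30s delay = up to one hour
  delay: 30
```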
4. Your dynamic inventory plugin for AWS EC2 returns instances grouped by tags, but a newly launched instance with the correct tags does not appear. What do you check?
Show answer
1. Cache: the aws_ec2 inventory plugin caches results. Run with '--flush-cache' or set 'cache: false' in the plugin config.
2. Filters: check the plugin's 'filters' section — if you filter by instance-state-name=running and the instance is still 'pending', it won't appear.
3. Regions: the plugin only queries configured regions — verify the instance region matches.
4. IAM permissions: the credentials need ec2:DescribeInstances permission.
5. keyed_groups: if you use tag-based grouping, check that the tag key/value match exactly (case-sensitive). Debug with 'ansible-inventory --list --yaml' to see what the plugin actually returns.
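A minimal aws_ec2 inventory config touching each of these checkpoints might look like this (region, tag key, and group prefix are assumptions for illustration):

```yaml
# aws_ec2.yml — illustrative inventory plugin config
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1                      # must include the region the instance launched in
filters:
  instance-state-name: running     # a still-'pending' instance is excluded by this
keyed_groups:
  - key: tags.Role                 # case-sensitive: the tag key must be exactly 'Role'
    prefix: role
cache: false                       # disable caching while debugging missing hosts
```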
L2 (7 questions)
1. What is the practical difference between include_role and import_role, and when does it matter?
Show answer
import_role is static: Ansible preprocesses it at playbook parse time. All tasks are visible upfront, tags and when conditions are inherited by all tasks, and errors are caught early. include_role is dynamic: it runs at task execution time. Tasks are loaded on-the-fly, variables can be computed at runtime to decide which role to include, and it supports loops. It matters when:
1. You need conditional role loading based on runtime facts — use include_role.
2. You want --list-tasks to show all tasks — use import_role (dynamic includes are invisible to list-tasks).
3. You need to loop over a role with different vars — use include_role (import_role cannot loop). Default to import_role for predictability; use include_role only when you need runtime dynamism.
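Point 3 in action — looping a role with per-iteration vars, something import_role cannot do (role names and ports are hypothetical):

```yaml
# include_role is a task, so it can loop and take runtime vars.
- hosts: app
  tasks:
    - name: Apply one role per component, decided at runtime
      ansible.builtin.include_role:
        name: "{{ item.role }}"       # hypothetical roles
      vars:
        component_port: "{{ item.port }}"
      loop:
        - { role: api, port: 8080 }
        - { role: worker, port: 9090 }
```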
2. A block of 3 tasks has a rescue section and an always section. Task 2 fails. In what order do remaining tasks execute, and what happens if a rescue task also fails?
Show answer
Execution order: Task 1 (succeeds) → Task 2 (fails) → Task 3 is SKIPPED → rescue tasks execute in order → always tasks execute in order. If a rescue task fails: the remaining rescue tasks are skipped, but the always section STILL executes (always means always). The play is then marked as failed. This mirrors try/catch/finally in programming: rescue = catch, always = finally. The always block is guaranteed to run regardless of success, failure, or rescue failure. Use always for cleanup (removing temp files, releasing locks) that must happen no matter what.
3. You need to deploy a config file that differs per host but shares 80% of its content. Jinja2 templates work but are becoming unreadable. What are your alternatives?
Show answer
Options beyond monolithic Jinja2:
1. Template inheritance — use {% extends "base.conf.j2" %} with {% block %} overrides per host group.
2. assemble module — split config into fragments (00-header, 10-database, 20-cache) and assemble them per host. Each fragment can be conditionally included.
3. Template with defaults — define a base dict in group_vars/all with per-host overrides via host_vars, then use {{ config | combine(host_overrides, recursive=True) }}.
4. ini_file or lineinfile for surgical edits to a shipped default config instead of templating the whole file.
5. For complex configs (nginx, haproxy), use a role that generates config from structured data rather than a flat template.
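Option 3, the combine approach, can be sketched like this (the variable names and keys are illustrative):

```yaml
# group_vars/all.yml — the shared 80% (illustrative keys)
app_config:
  log_level: info
  db:
    host: db.internal
    port: 5432

# host_vars/web01.yml — the per-host 20%
host_overrides:
  log_level: debug

# task: write the deep-merged result
- name: Render merged config
  ansible.builtin.copy:
    content: "{{ app_config | combine(host_overrides | default({}), recursive=True) | to_nice_yaml }}"
    dest: /etc/app/config.yml
```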
4. You run a playbook that works in check mode but fails in real mode. The failing task is a template that writes to /etc/app/config.yml. What are the likely causes?
Show answer
1. A prior task creates the /etc/app/ directory, but that task also reported "ok" in check mode because the directory already existed from a previous run — on a fresh host the directory does not exist and the template fails with "No such file or directory."
2. The template uses a variable set by set_fact or register from a previous task that does not actually execute in check mode (set_fact still runs in check mode, but shell/command tasks are skipped, so their registered results lack stdout and template lookups on them fail).
3. A handler that sets permissions or SELinux context runs only when notified, and check mode does not trigger handlers.
4. File ownership requires a user created by a prior task that was skipped in check mode. Fix: add check_mode: false to tasks that must run even in check mode (e.g., directory creation), or use the stat module with when: not ansible_check_mode to guard dependent tasks.
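The check_mode: false fix might be applied like this (paths and the version command are placeholders):

```yaml
# Tasks forced to run even under --check, so later tasks have what they need.
- name: Ensure /etc/app exists even during check mode
  ansible.builtin.file:
    path: /etc/app
    state: directory
  check_mode: false

- name: Capture command output that check mode would otherwise skip
  ansible.builtin.command: /usr/local/bin/app-version   # illustrative command
  register: app_version_out
  check_mode: false
  changed_when: false      # read-only, so never report 'changed'

- name: Template the config using the registered output
  ansible.builtin.template:
    src: config.yml.j2
    dest: /etc/app/config.yml
```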
5. Your Ansible vault-encrypted vars file works locally but fails in CI with "Decryption failed." The vault password is passed via environment variable. What do you check?
Show answer
1. If the vault password comes from a script, the script must be executable and print ONLY the password; if it comes from a plain file, the file must contain only the password on a single line. Inspect with: cat -A .vault_pass — stray characters such as ^M (Windows line endings) or trailing spaces will break decryption.
2. In CI, the env var may have a trailing newline from the secret injection mechanism — use printf "%s" "$VAULT_PASS" > .vault_pass instead of echo.
3. Vault ID mismatch — if you encrypted with --vault-id dev@prompt but decrypt without specifying the vault ID, it fails. Use --vault-id dev@.vault_pass.
4. The file was re-encrypted with a different password (someone rotated the vault password without updating CI). Verify with: ansible-vault view secrets.yml --vault-password-file .vault_pass.
5. File encoding issues — the encrypted file was modified by a Windows editor that added BOM or changed line endings.
6. What are the security implications of ansible_become and how do you harden privilege escalation in production?
Show answer
1. become: true runs tasks via sudo by default. If ansible_become_password is stored in plaintext in inventory, it is a security risk — encrypt it with vault.
2. Limit sudoers — do not grant ALL permissions. Use /etc/sudoers.d/ with specific commands: "ansible ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx".
3. Use become_method: su or become_method: doas if sudo is not available.
4. become_user should be the minimum-privilege user needed (not always root).
5. For sensitive tasks, use become_flags (e.g. "-H -S"): -H sets HOME to the target user's home directory and -S reads the password from stdin rather than a TTY.
6. Audit trail: AAP logs which user triggered which become task. On the target, sudo logs to /var/log/secure.
7. Avoid become: true at the play level — set it per-task so only tasks that need root get root.
8. In CI, use a dedicated service account with scoped sudo rules, never a personal account.
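Points 2 and 7 combine naturally: per-task become, where the escalated command matches the scoped sudoers rule verbatim (hostnames and URLs are illustrative):

```yaml
# become only where root is needed; the command matches the sudoers rule exactly.
- hosts: web
  become: false
  tasks:
    - name: Check app health (unprivileged)
      ansible.builtin.uri:
        url: http://localhost:8080/health   # illustrative endpoint

    - name: Restart nginx (matches "NOPASSWD: /usr/bin/systemctl restart nginx")
      ansible.builtin.command: /usr/bin/systemctl restart nginx
      become: true
      become_user: root
```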
7. How do you manage Ansible in a mono-repo with multiple teams, each owning different roles and inventories?
Show answer
1. Directory structure: inventories/{team-a,team-b}/ with separate group_vars/host_vars per team. Shared roles in roles/, team-specific roles in roles/team-a-*.
2. Use ansible.cfg per team directory or ANSIBLE_CONFIG env var in CI to point to team-specific config.
3. Collections: publish team-specific modules as internal collections to a private Automation Hub.
4. RBAC: in AAP, create organizations per team with separate credentials and job templates.
5. Testing: each team has their own Molecule test suite. CI runs only affected tests based on changed paths (e.g., paths: roles/team-a-** triggers team-a tests).
6. Linting: enforce ansible-lint profiles per team (some teams may have stricter rules).
7. Shared roles need an OWNERS file and require cross-team review.
8. Pin collection and role versions in requirements.yml per team to prevent one team's updates from breaking another.
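A pinned per-team requirements.yml might look like this (all names, versions, and the internal git URL are hypothetical):

```yaml
# requirements.yml — team-a's pinned dependencies (illustrative)
collections:
  - name: community.general
    version: "==8.6.0"          # hypothetical pin; exact-match to avoid drift
roles:
  - name: team_a_base
    src: git@git.internal:team-a/ansible-role-base.git   # hypothetical repo
    version: "v2.3.1"           # git tag, not a branch, for reproducibility
```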
L3 (4 questions)
1. Your playbook manages 2,000 hosts and takes 45 minutes. Profile shows gather_facts takes 12 minutes and the package install task takes 20 minutes. How do you cut this to under 15 minutes?
Show answer
Attack both bottlenecks.
Facts:
1. Enable fact caching (redis backend, TTL 4 hours). Subsequent runs skip gathering entirely.
2. Use gather_subset: min or gather_subset: ['!all', '!min', 'network'] to collect only what you need (quote the '!' entries — YAML treats a bare ! as a tag).
3. For hosts where you never use facts, set gather_facts: false.
Packages:
1. Increase forks from 5 to 50+ (ansible.cfg or -f 50).
2. Use strategy: free so fast hosts do not wait for slow ones.
3. Use async with poll: 0 + async_status loop for package installs (fire all installs in parallel, poll for completion).
4. Enable SSH pipelining (pipelining = True) to eliminate per-task SSH handshake overhead.
5. Use mitogen strategy plugin (3-7x faster task execution).
6. If hosts have a local mirror, ensure the package manager hits it instead of upstream repos. Combined, this typically achieves 5-10x speedup.
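A play combining several of these tunings could be sketched as follows (timeouts, retry counts, and the app_packages variable are assumptions; forks and pipelining live in ansible.cfg, not the play):

```yaml
- hosts: all
  strategy: free                  # fast hosts don't wait for slow ones
  gather_facts: true
  gather_subset:
    - "!all"
    - "!min"
    - network                     # collect only what the play actually uses
  tasks:
    - name: Install packages in the background
      ansible.builtin.package:
        name: "{{ app_packages }}"   # hypothetical package list variable
        state: present
      async: 1200
      poll: 0
      register: pkg_job

    - name: Wait for installs to finish
      ansible.builtin.async_status:
        jid: "{{ pkg_job.ansible_job_id }}"
      register: pkg_result
      until: pkg_result.finished
      retries: 40
      delay: 30
```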
2. How do you implement a blue-green deployment pattern purely with Ansible (no external orchestrator)?
Show answer
Pattern: maintain two host groups (blue, green).
1. Determine active group by querying LB (delegate_to LB host, check which backend pool is active). Store result in a fact.
2. Deploy to the inactive group (serial: "100%" — all at once, since they are not serving traffic).
3. Run smoke tests against the inactive group (uri module hitting health endpoints).
4. If smoke tests pass, switch LB to point to the newly deployed group (update haproxy config, or call cloud LB API via uri module).
5. Run a canary period with both groups in the pool, then drain the old group. Implementation details: use group_vars to parameterize which group is blue vs green. Use run_once + delegate_to for LB operations. Store deployment state in a file or external KV store so the next run knows which group is active. Failure handling: if smoke tests fail, do not switch LB — the rescue block reverts the deploy on the inactive group.
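A condensed sketch of steps 2–4 (the inactive_group and lb_host variables, the role name, port, and haproxy paths are all illustrative):

```yaml
# Deploy to the idle group, smoke test, then flip the LB once.
- hosts: "{{ inactive_group }}"     # hypothetical var computed in step 1
  serial: "100%"                    # all at once — not serving traffic yet
  tasks:
    - name: Deploy the new release
      ansible.builtin.include_role:
        name: app_deploy            # hypothetical role

    - name: Smoke test each newly deployed host
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200

    - name: Point the LB at the new group (once, on the LB host)
      ansible.builtin.template:
        src: haproxy.cfg.j2
        dest: /etc/haproxy/haproxy.cfg
      delegate_to: "{{ lb_host }}"  # hypothetical LB inventory host
      run_once: true
```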
3. You need to write a custom inventory plugin that queries your internal CMDB API. Walk through the implementation.
Show answer
Create plugins/inventory/cmdb.py in your collection or playbook directory.
1. Subclass BaseInventoryPlugin (and optionally Constructable for keyed_groups/compose support).
2. Set NAME = "namespace.collection.cmdb".
3. Implement verify_file() to recognize your config file (e.g., ends with cmdb.yml).
4. Implement parse(inventory, loader, path, cache) — this is the main method. Read the config YAML with self._read_config_data(path). Call your CMDB API (use requests, handle pagination). For each host: self.inventory.add_host(hostname), self.inventory.set_variable(hostname, key, value), self.inventory.add_group(group), self.inventory.add_child(group, hostname).
5. Add caching: check self.cache before API call, populate on miss.
6. Config file (cmdb.yml): plugin: namespace.collection.cmdb, api_url: https://cmdb.internal, filters: {environment: prod}.
7. Test with ansible-inventory -i cmdb.yml --list --yaml. Common mistakes: forgetting to call super().parse() first, not handling API pagination, not setting ansible_host for hosts where hostname != IP.
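The config file from step 6, expanded into a full sketch (every key besides `plugin` depends on what your hypothetical plugin actually reads from its config):

```yaml
# cmdb.yml — consumed by verify_file()/parse() above (illustrative keys)
plugin: namespace.collection.cmdb
api_url: https://cmdb.internal
filters:
  environment: prod
keyed_groups:              # available if the plugin also mixes in Constructable
  - key: cmdb_role
    prefix: role
compose:
  ansible_host: cmdb_ip    # set ansible_host where hostname != IP
cache: true
```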
4. A playbook using delegate_to to configure a load balancer from within an app server play is causing tasks to run with the wrong variables. Explain why and how to fix it.
Show answer
When you delegate_to a different host, Ansible runs the task on the delegated host but uses the DELEGATING host's variables by default (the inventory_hostname context). This means:
1. ansible_host resolves to the delegated host (correct for SSH), but all other vars (group_vars, host_vars) come from the original host.
2. If the LB host has different vars (e.g., lb_vip, lb_backend_port), they are NOT available unless you explicitly reference hostvars[lb_host].lb_vip. Fix approaches:
1. Use hostvars[] to explicitly pull vars from the target: {{ hostvars[lb_hostname].lb_vip }}.
2. Use delegate_facts: true if the task gathers facts — stores facts under the delegated host, not the delegating host.
3. Consider a separate play targeting the LB group directly instead of delegation, which is cleaner for complex LB configuration.
4. If using loops with delegation, each loop iteration delegates independently — variables from the looping host are available, not from previous delegates.
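Fix 1 in task form — pulling the LB host's own vars explicitly while delegating (lb_hostname, lb_vip, and the haproxy paths are illustrative):

```yaml
# Delegated task that explicitly reads the LB's vars via hostvars[].
- name: Register this app server in the LB backend pool
  ansible.builtin.template:
    src: backend.cfg.j2
    dest: "/etc/haproxy/conf.d/{{ inventory_hostname }}.cfg"
  delegate_to: "{{ lb_hostname }}"   # hypothetical LB inventory host
  vars:
    lb_vip: "{{ hostvars[lb_hostname].lb_vip }}"              # LB's var, not the app server's
    backend_port: "{{ hostvars[lb_hostname].lb_backend_port }}"
```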