# Interview Gauntlet: Ansible Playbook 9x Slower

**Category:** Debugging · **Difficulty:** L2-L3 · **Duration:** 15-20 minutes · **Domains:** Ansible, LDAP

## Round 1: The Opening
Interviewer: "An Ansible playbook that used to run in 5 minutes now takes 45 minutes. Nothing in the playbook code has changed. What do you investigate?"
### Strong Answer
"A 9x slowdown without code changes points to environmental factors. I'd start by identifying where the time is being spent. Ansible has a built-in callback plugin for profiling: set callback_whitelist = timer, profile_tasks in ansible.cfg (or callbacks_enabled in newer versions) and re-run the playbook. The profile_tasks plugin shows the wall-clock time for every task, sorted by duration. This immediately tells me if one task is taking 40 minutes or if all tasks are uniformly slower. If all tasks are uniformly slow, it's likely a connection or authentication issue — SSH setup time per host, or slow DNS resolution for hostnames. If specific tasks are slow, I'd look at what those tasks do. Common slowdown causes: the inventory grew (running against 500 hosts instead of 50 and nobody noticed), fact gathering is slow (hitting every host with setup module), a package repository is unreachable or slow (causing yum/apt tasks to wait for timeouts), or a network-dependent task is hitting a slow external service."
### Common Weak Answers
- "Maybe the network is slow." — Too vague. Which network hop? Between the control node and the managed hosts? Between managed hosts and a package repo?
- "Upgrade Ansible." — Not diagnostic. The question is why it got slower, not how to make Ansible faster.
- "Use
asyncandpollfor parallel execution." — This is an optimization technique, not a diagnostic approach. You need to find the bottleneck first.
## Round 2: The Probe
Interviewer: "You run the profiler and discover that fact gathering — the implicit setup module at the start — takes 35 minutes total. Each host takes about 4 seconds for fact gathering, which is fine, but you have 500 hosts. Walk me through why this is slow and how you fix it."
What the interviewer is testing: Understanding of Ansible's execution model, specifically the interaction between forks, fact gathering, and serial execution.
### Strong Answer
"Ansible runs tasks in batches controlled by the forks setting. The default is 5 forks, meaning Ansible gathers facts from 5 hosts in parallel. With 500 hosts at 4 seconds each and 5 forks: 500 / 5 = 100 batches * 4 seconds = 400 seconds, roughly 7 minutes. But you said 35 minutes, which is about 5x slower than the math suggests. That means either the effective parallelism is less than 5 (some hosts are taking longer, and one slow host in a batch delays the whole batch), or the forks setting is lower than 5. For the fix: first, increase forks in ansible.cfg. Setting forks = 50 would reduce fact gathering to about 40 seconds for 500 hosts. The control node needs enough CPU and file descriptors to handle 50 SSH sessions simultaneously. Second, consider whether you actually need facts from all hosts. If the playbook doesn't use facts, add gather_facts: false at the play level. If you only need specific facts, use gather_subset to limit collection: gather_subset: ['!all', 'network'] gathers only network facts. Third, enable fact caching. Ansible can cache facts in a JSON file or Redis. With fact_caching = jsonfile and fact_caching_timeout = 3600, subsequent runs within an hour skip fact gathering entirely for hosts that were already gathered."
### Trap Alert
If the candidate bluffs here: The interviewer will ask "What's the default forks value in Ansible?" It's 5. This is in the documentation, and `ansible-config dump` shows it as `DEFAULT_FORKS`. It's a commonly asked Ansible question. If you don't remember the exact default, saying "it's low — I think 5 — and I always increase it for large inventories" is the right approach.
## Round 3: The Constraint
Interviewer: "You increase forks to 50 and enable fact caching. The first run is faster, but you notice that each individual host's fact gathering takes 4 seconds when it used to take 0.5 seconds. Something is making the setup module slow on each host. What could cause this?"
### Strong Answer
"The setup module runs on the managed host and collects system information by reading from /proc, /sys, running commands like ip addr, uname, etc. If individual host fact gathering went from 0.5 seconds to 4 seconds, something on the host is slow. The most common cause in my experience: a slow name resolution lookup. The setup module collects the FQDN, which involves a reverse DNS lookup. If the DNS server is slow or the hosts' /etc/nsswitch.conf is configured to query LDAP for host lookups, each fact gather blocks on the LDAP query. I'd check nsswitch.conf on a managed host: if the hosts: line includes ldap or sss (SSSD backed by LDAP), every hostname resolution goes through LDAP. If the LDAP server is overloaded or has high latency, every host waits. Other possibilities: the facter or ohai modules are installed and being invoked alongside setup (Ansible can use these for compatibility), adding overhead. Or a custom fact script in /etc/ansible/facts.d/ is slow — maybe it's querying an API or running a slow command. I'd run time ansible <host> -m setup on a single host and strace the remote process to see where it blocks."
### The Senior Signal
What separates a senior answer: Identifying LDAP as the cause of slow hostname resolution during fact gathering. This is a real-world issue that happens when organizations use SSSD for centralized authentication and include `sss` or `ldap` in the `hosts:` line of `nsswitch.conf`. It's often invisible because normal operations don't trigger hostname lookups at scale, but Ansible fact gathering does. The fix is either to remove `ldap`/`sss` from the `hosts:` line (if hosts are resolvable via DNS) or to fix the LDAP server's performance.
## Round 4: The Curveball
Interviewer: "You're right — it's LDAP. The nsswitch.conf has hosts: files ldap dns. Every hostname resolution queries the LDAP server first, and the LDAP server is on the other side of a WAN link with 200ms latency. But the security team says LDAP must stay in the nsswitch configuration for compliance. How do you fix the Ansible performance without removing LDAP from nsswitch?"
### Strong Answer
"If LDAP must stay in nsswitch, I need to make the LDAP lookups fast instead of eliminating them. Option one: SSSD caching. If the hosts are using SSSD, it already has a caching layer. I'd verify SSSD is running and check its cache settings: sssd.conf should have entry_cache_timeout set appropriately. If SSSD is caching properly, the first lookup hits LDAP but subsequent lookups serve from cache, which is sub-millisecond. The issue might be that the cache is cold or SSSD is configured with too-short TTLs. Option two: nscd (Name Service Cache Daemon). Run nscd on each host to cache nsswitch lookups locally. The hosts cache in nscd can be configured with positive-time-to-live hosts 3600 to cache successful lookups for an hour. Option three: change the nsswitch order to hosts: files dns ldap. DNS is fast (local resolver), so most hostnames resolve via DNS before LDAP is tried. LDAP is still in the chain for compliance but is only hit for names that DNS can't resolve. I'd confirm with the security team that reordering satisfies their requirement while improving performance. Option four: for Ansible specifically, I can set gather_subset: ['!all', '!fqdn', 'network', 'hardware'] to skip the FQDN collection that triggers the hostname lookup, while still collecting the facts the playbook needs."
### Trap Question Variant
The right answer requires balancing technical and organizational constraints. Candidates who say "just remove LDAP from nsswitch" are ignoring the stated constraint. Candidates who say "nothing can be done if LDAP must stay" are giving up too easily. The senior approach finds a way to satisfy both the performance need and the compliance requirement — either through caching, reordering, or narrowing what Ansible collects.
## Round 5: The Synthesis
Interviewer: "This playbook went from 5 minutes to 45 minutes and nobody noticed for two weeks. What does that tell you about how you'd manage Ansible at scale?"
### Strong Answer
"It tells me that operational tooling needs the same observability as production services. If a deployment pipeline's duration tripled, CI/CD dashboards would flag it. But Ansible playbooks often run from a cron job or an engineer's laptop with no monitoring. I'd implement three things. First, playbook runtime monitoring: log the start time, end time, and per-task duration for every playbook run. Ship it to a time-series database and set alerts for significant regression — 'playbook X took 3x longer than its 7-day average.' The ARA (Ansible Run Analysis) project does this well, providing a web dashboard of playbook history with per-task breakdowns. Second, infrastructure dependency awareness: the playbook depends on LDAP, DNS, package repositories, and SSH connectivity. Any of these can degrade and slow down automation. Map these dependencies and monitor their health independently. Third, regular performance baselines: include a 'dry-run timing' step in the CI pipeline that runs the playbook in check mode against a subset of hosts weekly. If the timing changes, investigate before it becomes a 45-minute problem. The meta-lesson is: Ansible is infrastructure, and infrastructure deserves monitoring. When tools slow down invisibly, teams route around them — they start making manual changes instead of running the playbook, which creates drift, which creates bigger problems."
## What This Sequence Tested
| Round | Skill Tested |
|---|---|
| 1 | Ansible profiling and systematic slowdown diagnosis |
| 2 | Understanding Ansible execution model (forks, fact gathering, caching) |
| 3 | Root cause analysis of per-host performance degradation |
| 4 | Problem-solving under organizational constraints (security vs performance) |
| 5 | Operational maturity — monitoring and observability for automation tools |