Fleet Operations Footguns¶
Mistakes that turn a routine change into a fleet-wide outage.
1. Running against all without --limit¶
You meant to test on 5 servers. You ran against 1,500. The command had a typo. Every server is now misconfigured. You have 10 minutes of downtime and 1,500 servers to fix.
Fix: Always use --limit during testing. Use --check (dry run) first. Ansible has no built-in "prompt above N hosts" setting, but ansible-playbook --step forces a confirmation before each task, and a pre-flight assert task can abort a play that targets more hosts than you expected.
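A minimal sketch of the safe sequence, assuming a site.yml playbook and a web inventory group (both names hypothetical):

```shell
# 1. Dry run against a 5-host slice; --diff shows what would change.
ansible-playbook site.yml --limit 'web[0:4]' --check --diff

# 2. Real run against the same slice, only after reviewing the dry-run diff.
ansible-playbook site.yml --limit 'web[0:4]'

# 3. Widen gradually; never jump straight from 5 hosts to the full inventory.
ansible-playbook site.yml --limit 'web'
```

The slice syntax web[0:4] selects the first five hosts of the group, so the canary set is stable between the dry run and the real run.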
War story: GitLab's 2017 database incident was partly caused by an engineer running a command against the wrong server. The intended target was a replica; the actual target was the primary. No --limit, no dry run, no confirmation prompt. 300 GB of production data deleted.
2. No rollback plan¶
You push a config change to the fleet. It breaks the application. You have no snapshot of the previous config, no way to quickly restore it. You spend 2 hours manually reconstructing the old state.
Fix: Before every fleet change, capture the current state. Use rpm -qa, config file checksums, or Ansible's --diff mode. Keep the previous config in version control.
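A minimal pre-change snapshot sketch, assuming SSH access; the host names and config paths are hypothetical:

```shell
# Record the package set and config checksums per host before touching anything.
mkdir -p pre-change
for host in web01 web02; do                        # hypothetical hosts
  ssh "$host" 'rpm -qa | sort'            > "pre-change/$host.packages"
  ssh "$host" 'sha256sum /etc/app/*.conf' > "pre-change/$host.checksums"
done
```

After the change, running the saved checksum list back through sha256sum -c on each host shows exactly which files drifted, and the package list plus version-controlled configs give you a concrete target to restore.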
3. Serial = 100%¶
You run a playbook with no serial setting (defaults to all hosts). A restart task bounces every server simultaneously. All backends go offline. The load balancer has nothing to route to.
Fix: Always set serial for service-affecting changes. Start with 1 (canary), then 5-10%, then increase. Never restart all instances of a service at once.
Default trap: Ansible's default serial is all hosts (100%). If you omit serial: in your playbook, a handler that runs systemctl restart nginx will bounce every nginx instance in your inventory simultaneously. Set serial: 1 for canary, then serial: "25%" for rolling.
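A rolling-restart play can encode the canary-then-batches pattern directly; this sketch uses hypothetical group and service names:

```yaml
# One canary host first, then the rest in 25% batches.
- hosts: webservers
  serial:
    - 1
    - "25%"
  max_fail_percentage: 0       # any failure in a batch aborts the remaining batches
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

Because serial accepts a list, the canary and the rollout live in one play: if the single canary host fails, no other host is touched.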
4. Ignoring partial failures¶
Your fleet script hits errors on 30 hosts but continues. You report "fleet patched" and close the ticket. Those 30 hosts are now running old, vulnerable software. The next security audit finds them.
Fix: Track and report failures explicitly. Exit with a non-zero code on partial failure. Leave a manifest of failed hosts for follow-up.
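A sketch of explicit failure tracking in a fleet loop. Here run_on_host is a stub standing in for the real ssh or ansible call, and the host names are hypothetical:

```shell
# Stub: stands in for ssh/ansible; fails only for "bad-host".
run_on_host() { [ "$1" != "bad-host" ]; }

patch_fleet() {
  local failed=()
  local host
  for host in "$@"; do
    run_on_host "$host" || failed+=("$host")   # record the failure, keep going
  done
  if ((${#failed[@]})); then
    printf '%s\n' "${failed[@]}" > failed-hosts.txt   # manifest for follow-up
    return 1                                          # non-zero on partial failure
  fi
}

patch_fleet web01 web02 bad-host || echo "partial failure; see failed-hosts.txt"
```

The manifest file turns "fleet patched" into a verifiable claim: the ticket stays open until failed-hosts.txt is empty.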
5. SSH agent forwarding across the fleet¶
You forward your SSH agent to hop through a bastion. Your agent is now available on every server you touch. If any server is compromised, the attacker can use your forwarded credentials.
Fix: Use ProxyJump instead of agent forwarding. If you must forward, use ssh -o ForwardAgent=yes selectively, never in ~/.ssh/config globally.
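The ProxyJump pattern can live in ~/.ssh/config; the host pattern and bastion name here are hypothetical:

```
# ~/.ssh/config — reach fleet hosts through the bastion without agent forwarding.
Host fleet-*
    ProxyJump bastion.example.com
    ForwardAgent no
```

The ad-hoc equivalent is ssh -J bastion.example.com fleet-web01. Either way, the SSH connection is tunneled end-to-end and your private keys never become usable from the bastion or the target.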
6. Running destructive commands without idempotency¶
Your fleet script deletes a directory and recreates it. If the script runs twice (cron overlap, manual re-run), the second run deletes the freshly created data from the first run.
Fix: Make every operation idempotent. Check state before acting. Use lock files to prevent concurrent execution. Ansible modules are idempotent by design — use them instead of raw shell commands.
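A sketch of the lock-plus-state-check pattern in portable shell, using mkdir as an atomic lock; all paths are hypothetical:

```shell
# Idempotent, lock-guarded refresh: never delete-and-recreate.
lockdir=/tmp/fleet-demo.lock
datadir=/tmp/fleet-demo-data

if mkdir "$lockdir" 2>/dev/null; then        # atomic: only one concurrent run wins
  trap 'rmdir "$lockdir"' EXIT               # always release the lock on exit
  mkdir -p "$datadir"                        # create only if absent
  [ -e "$datadir/seed" ] || echo seeded > "$datadir/seed"   # check state before acting
else
  echo "another run holds $lockdir; exiting" >&2
fi
```

Running it twice is safe: the second run either waits its turn (lock held) or finds the state already correct and does nothing, which is the property the delete-and-recreate script lacked.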
7. Flat inventory for a large fleet¶
You have one hosts.txt with 1,500 entries. No grouping by role, location, or environment. You can't target just the webservers in DC1 without grepping and piping. During an incident, this costs minutes you don't have.
Fix: Structure your inventory by role, location, and environment. Use Ansible group hierarchies. Make it trivial to target any slice of the fleet.
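A structured inventory makes every slice addressable; the hosts and groups in this sketch are hypothetical:

```yaml
# inventory/hosts.yml — every host reachable by role, site, and environment.
all:
  children:
    webservers:
      hosts:
        web01.dc1.example:
        web02.dc2.example:
    dc1:
      hosts:
        web01.dc1.example:
    prod:
      children:
        webservers:
```

With groups in place, "just the webservers in DC1" is the intersection pattern webservers:&dc1 (for example, ansible 'webservers:&dc1' -m ping), with no grepping or piping during an incident.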
8. No circuit breaker¶
Your fleet script encounters 10 consecutive SSH timeouts but keeps trying the next host. The network switch for that rack is down. You waste 30 minutes waiting for timeouts on hosts that will never connect.
Fix: Implement a circuit breaker. If N consecutive failures occur, halt and report. Something systemic is wrong and continuing won't help.
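A consecutive-failure breaker is a few lines in a fleet loop. In this sketch run_on_host is a stub standing in for ssh, rigged so every host "times out", and the host names are hypothetical:

```shell
# Stop after N consecutive failures instead of waiting out every timeout.
run_on_host() { false; }          # stub: every connection attempt fails

max_consecutive=3
streak=0
for host in rack1-a rack1-b rack1-c rack1-d; do
  if run_on_host "$host"; then
    streak=0                      # any success resets the breaker
  else
    streak=$((streak + 1))
    if [ "$streak" -ge "$max_consecutive" ]; then
      echo "circuit open: $streak consecutive failures, halting at $host" >&2
      break                       # something systemic is wrong; stop and report
    fi
  fi
done
```

The reset-on-success line matters: scattered one-off failures should be recorded (footgun 4), but only an unbroken run of them indicates a systemic problem worth halting for.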
9. Clock skew across the fleet¶
Half your fleet has NTP misconfigured. Clocks drift by minutes or hours. Log correlation becomes impossible. Kerberos auth fails. TLS certificates appear expired. Scheduled cron jobs fire at the wrong time.
Fix: Enforce NTP (chrony) configuration fleet-wide. Monitor clock skew in your observability stack. Alert on drift > 1 second.
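Enforcement can be a short play run fleet-wide. This sketch assumes the stock chrony packages; note the service name differs by family (chronyd on RHEL-like systems, chrony on Debian-like ones):

```yaml
# Hypothetical play: chrony installed, running, and enabled everywhere.
- hosts: all
  become: true
  tasks:
    - name: Ensure chrony is installed
      ansible.builtin.package:
        name: chrony
        state: present
    - name: Ensure the time service is running and enabled
      ansible.builtin.service:
        name: chronyd        # "chrony" on Debian/Ubuntu
        state: started
        enabled: true
```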
Debug clue: If Kerberos auth suddenly fails on a subset of servers, check clock skew first. Kerberos has a default tolerance of 5 minutes.
chronyc tracking shows current offset; chronyc sources shows which NTP servers are reachable. Clock skew is the silent killer of distributed auth.
10. Manual changes on "just one server"¶
Someone SSH'd in and tweaked a config to debug an issue. They forgot to revert it. The next fleet-wide config push skips that file because the task uses a creates: guard, which only acts when the file is absent. That server now has a unique config. Six months later, it breaks in a way no other server does.
Fix: Treat manual changes as a bug. Run drift detection regularly. Use immutable infrastructure patterns where possible. If you must change one server, open a ticket and update the source config.
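Drift detection can be a scheduled no-op run of the normal playbook, assuming a site.yml that describes the desired state (name hypothetical):

```shell
# --check changes nothing; --diff prints every place a host disagrees with
# the source config. A non-empty diff means someone hand-edited a server.
ansible-playbook site.yml --check --diff | tee "drift-$(date +%F).log"
```

Anything that shows up in the diff is either a manual change to revert or a fix that needs to be promoted back into the source config.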