Fleet Operations at Scale - Street-Level Ops¶
Real-world patterns and gotchas from managing 1,500+ server fleets.
Quick Diagnosis Commands¶
# How many hosts are reachable right now?
ansible all -f 100 -m ping --one-line | grep -c SUCCESS
ansible all -f 100 -m ping --one-line | grep UNREACHABLE | awk -F'|' '{print $1}'
# Quick fleet-wide check
ansible webservers -f 50 -m shell -a 'uptime; free -m | grep Mem; df -h / | tail -1' --one-line
# Find hosts with high disk usage
ansible all -f 100 -m shell -a 'df -h / | tail -1' \
| grep -E '[89][0-9]%|100%'
# Find hosts not running a critical service
ansible all -f 100 -m shell -a 'systemctl is-active nginx || echo DEAD' \
| grep DEAD
# Check package version consistency across fleet
ansible webservers -f 50 -m shell -a 'rpm -q openssl' \
    | grep '^openssl' | sort | uniq -c | sort -rn   # keep only version lines; drop per-host headers
# Find hosts with pending reboots (kernel update)
ansible all -f 100 -m shell -a \
'if [ "$(rpm -q kernel --last | head -1 | awk "{print \$1}")" != "kernel-$(uname -r)" ]; then echo REBOOT_NEEDED; fi' \
| grep REBOOT_NEEDED
Scale note: At fleet scale, every serial operation is a multiplier. A 2-second SSH connect timeout across 1,500 hosts is 50 minutes of wall time at forks=1. Always think in terms of parallelism and batch size.
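That multiplier is worth sanity-checking before every large run; a quick back-of-envelope sketch using the numbers above:

```shell
# Rough wall time = hosts x per-host cost / forks (ignores batching overhead)
hosts=1500
per_host=2   # seconds, e.g. an SSH connect timeout
for forks in 1 50 100; do
    echo "forks=${forks}: ~$(( hosts * per_host / forks ))s"
done
# forks=1 gives 3000s, i.e. the 50 minutes quoted above
```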
Gotcha: Ansible Fork Exhaustion¶
You set forks = 200 and target 1,500 hosts. The Ansible control node runs out of memory and the SSH connection table fills up. Half the tasks fail with cryptic connection errors.
Fix: Keep forks reasonable (50-100). Use --forks tuned to your control node's resources. For very large fleets, split into batches by group:
# Four groups run concurrently, each capped at 50 forks; size the total
# (here 4 x 50 = 200 potential connections) to your control node.
for group in dc1-web dc1-db dc2-web dc2-db; do
    ansible "${group}" -f 50 -m command -a 'uptime' &
done
wait
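One caveat with the loop above: a bare wait discards each group's exit status. A sketch that keeps per-group results, with a stub job standing in for the real ansible call:

```shell
# Background each group's job, remember its PID, then wait on each PID
# individually so failures are attributable per group.
host_groups=(dc1-web dc1-db dc2-web dc2-db)
job() { [ "$1" != "dc1-db" ]; }   # stub; substitute: ansible "$1" -f 50 -m command -a 'uptime'
pids=()
for group in "${host_groups[@]}"; do
    job "${group}" &
    pids+=($!)
done
rc=0
for idx in "${!host_groups[@]}"; do
    if ! wait "${pids[$idx]}"; then
        echo "FAILED: ${host_groups[$idx]}"
        rc=1
    fi
done
echo "overall rc=${rc}"
```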
Gotcha: SSH Known Hosts Explosion¶
You reprovision servers and their SSH host keys change. The next fleet-wide command fails on every reprovisioned host with Host key verification failed.
Fix: For ephemeral/reprovisioned hosts, use:
# ansible.cfg
[ssh_connection]
ssh_args = -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null
Security note:
StrictHostKeyChecking=no accepts any key silently, including MITM keys. StrictHostKeyChecking=accept-new only accepts keys for hosts not yet in known_hosts and rejects changed keys. Always prefer accept-new over no. Caveat: combined with UserKnownHostsFile=/dev/null (as above), no key is ever remembered, so accept-new can never reject a changed key either; point UserKnownHostsFile at a real file if you want that protection for longer-lived hosts.
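When only a few hosts were reprovisioned, an alternative is to purge just their stale keys and keep strict checking. A self-contained ssh-keygen -R demo against a throwaway file; in practice point -f at ~/.ssh/known_hosts (or drop -f entirely) and loop over your reprovisioned-host list. The hostnames and keys below are made up:

```shell
# Build a demo known_hosts with two entries
KNOWN_HOSTS="$(mktemp)"
cat > "${KNOWN_HOSTS}" <<'EOF'
web-001 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
web-002 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
EOF
# Remove the stale key for each reprovisioned host
for host in web-001; do
    ssh-keygen -f "${KNOWN_HOSTS}" -R "${host}" > /dev/null 2>&1
done
grep -q '^web-001 ' "${KNOWN_HOSTS}" && echo "still present" || echo "purged web-001"
```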
Gotcha: One Bad Host Blocks the Batch¶
An Ansible playbook with serial: 10 hits a host that hangs on SSH. The entire batch waits for the timeout. With 1,500 hosts, one flaky server per batch means hours of delays.
Fix: Set aggressive SSH timeouts:
[ssh_connection]
ssh_args = -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=3
timeout = 30
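As a last-resort backstop on top of those settings, coreutils timeout can cap the entire run; demoed here with sleep standing in for a wedged ansible invocation:

```shell
# In production: timeout 600 ansible webservers -f 50 -m command -a 'uptime'
# timeout kills the child at the deadline and exits 124.
timeout 1 sleep 5
echo "exit: $?"   # prints exit: 124
```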
Gotcha: Inconsistent Execution Order¶
You run ansible all -m command -a 'reboot'. Ansible doesn't guarantee execution order. Load balancer backends, database replicas, and application servers reboot in random order. Cascading failures follow.
Fix: Use serial in playbooks. Define explicit groups and ordering:
- hosts: databases
serial: 1
# ... reboot and validate
- hosts: appservers
serial: "10%"
# ... reboot and validate
- hosts: loadbalancers
serial: 1
# ... reboot and validate
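A sketch of what "# ... reboot and validate" might expand to, assuming the localhost:8080/health endpoint used elsewhere in this doc; the module names are standard ansible.builtin, but the timeouts and retry counts are illustrative:

```yaml
- hosts: appservers
  serial: "10%"
  tasks:
    - name: Reboot and block until SSH is back
      ansible.builtin.reboot:
        reboot_timeout: 600
    - name: Validate the app before releasing the batch
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health.status == 200
```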
Pattern: Canary-Then-Fleet¶
#!/usr/bin/env bash
set -euo pipefail
CANARY="web-001"
FLEET_GROUP="webservers"
PLAYBOOK="patch-webservers.yml"
log() { echo "[$(date -u '+%H:%M:%S')] $*"; }
# Step 1: Canary
log "Running canary on ${CANARY}..."
ansible-playbook "${PLAYBOOK}" --limit "${CANARY}" -f 1
if ! ssh "${CANARY}" 'curl -sf http://localhost:8080/health' &>/dev/null; then
log "CANARY FAILED! Aborting fleet rollout."
exit 1
fi
log "Canary passed. Waiting 5 minutes for soak..."
sleep 300
# Step 2: Check canary metrics (use --data-urlencode: the PromQL braces
# and quotes are not URL-safe when pasted raw into the query string)
error_rate=$(curl -sfG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=rate(http_errors_total{host=\"${CANARY}\"}[5m])" \
    | jq -r '.data.result[0].value[1] // "0"')
if (( $(echo "${error_rate} > 0.01" | bc -l) )); then
    log "Canary error rate too high (${error_rate}). Aborting."
    exit 1
fi
# Step 3: Fleet rollout
log "Canary healthy. Rolling out to fleet..."
ansible-playbook "${PLAYBOOK}" --limit "${FLEET_GROUP}" \
-e "serial_pct=10" -f 50
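Note that -e "serial_pct=10" only has effect if patch-webservers.yml actually consumes the variable; the assumed contract looks something like:

```yaml
# patch-webservers.yml (sketch): batch size driven by the extra var
- hosts: webservers
  serial: "{{ serial_pct | default(10) }}%"
  tasks:
    - name: Apply the patch set
      ansible.builtin.yum:
        name: '*'
        state: latest
```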
Pattern: Fact Collection and Reporting¶
# Collect fleet facts and build a CSV report
ansible all -f 100 -m setup -a 'filter=ansible_*' --tree /tmp/fleet-facts/
# Parse into a CSV report (header first; each --tree file is raw module JSON)
{
echo '"hostname","os","vcpus","memory","ip"'
for f in /tmp/fleet-facts/*; do
    jq -r '
    .ansible_facts |
    [.ansible_hostname,
     .ansible_distribution + " " + .ansible_distribution_version,
     .ansible_processor_vcpus,
     (.ansible_memtotal_mb | tostring) + "MB",
     .ansible_default_ipv4.address
    ] | @csv
    ' "${f}"
done
} > fleet-inventory-report.csv
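With the CSV in hand, quick fleet-wide slices are one awk away; for example, the OS spread (demoed on two made-up rows, feed fleet-inventory-report.csv in practice):

```shell
# Count hosts per OS (CSV column 2); output is "count os", largest first
printf '%s\n' \
    '"web-001","CentOS 7.9",4,"16384MB","10.0.0.1"' \
    '"web-002","CentOS 7.9",8,"32768MB","10.0.0.2"' \
    | awk -F',' '{print $2}' | sort | uniq -c | sort -rn
```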
Pattern: Safe Batch Script with Abort¶
#!/usr/bin/env bash
set -euo pipefail

BATCH_SIZE=20
MAX_FAILURES=5
PAUSE_BETWEEN_BATCHES=60

# process_host is the per-host action; replace this stub with your change.
process_host() {
    ssh -o ConnectTimeout=10 "$1" 'uptime' > /dev/null
}

declare -a ALL_HOSTS=()
mapfile -t ALL_HOSTS < hosts.txt
total=${#ALL_HOSTS[@]}
failures=0
processed=0

for (( i=0; i<total; i+=BATCH_SIZE )); do
    batch=("${ALL_HOSTS[@]:i:BATCH_SIZE}")
    echo "=== Batch $((i/BATCH_SIZE + 1)): ${#batch[@]} hosts ==="
    for host in "${batch[@]}"; do
        if ! process_host "${host}"; then
            # Not '(( failures++ ))': post-increment returns status 1 when the
            # value is 0, which would kill the script under set -e.
            failures=$(( failures + 1 ))
            echo "FAILED: ${host} (${failures}/${MAX_FAILURES})"
            if (( failures >= MAX_FAILURES )); then
                echo "ABORT: Max failures reached. $(( total - processed )) hosts remaining."
                echo "Resume from host index $(( i + ${#batch[@]} ))"
                exit 1
            fi
        fi
        processed=$(( processed + 1 ))
    done
    if (( i + BATCH_SIZE < total )); then
        echo "Batch complete. Pausing ${PAUSE_BETWEEN_BATCHES}s..."
        sleep "${PAUSE_BETWEEN_BATCHES}"
    fi
done

echo "Complete. ${processed}/${total} processed, ${failures} failures."
Remember: The three pillars of safe fleet operations: idempotent (run twice, same result), resumable (pick up where you left off after failure), observable (know which hosts succeeded, failed, or were skipped). Mnemonic: IRO — I Run Once (correctly, every time).
Pattern: Idempotent Fleet Scripts¶
# Instead of "do X", check "is X already done?"
apply_config() {
local host=$1
# Check current state first
local current_version
current_version=$(ssh "${host}" 'rpm -q myapp --qf "%{VERSION}"' 2>/dev/null)
if [[ "${current_version}" == "${TARGET_VERSION}" ]]; then
echo "${host}: already at ${TARGET_VERSION}, skipping"
return 0
fi
# Apply change
ssh "${host}" "yum install -y myapp-${TARGET_VERSION}"
}
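The check-then-apply shape is what makes "run twice, same result" hold. A local smoke test of the same pattern, with a temp file standing in for the remote rpm state:

```shell
TARGET_VERSION="2.0"
state_file="$(mktemp)"   # stands in for 'rpm -q myapp' on the remote host
apply_config() {
    local current
    current="$(cat "${state_file}" 2>/dev/null || true)"
    if [[ "${current}" == "${TARGET_VERSION}" ]]; then
        echo "already at ${TARGET_VERSION}, skipping"
        return 0
    fi
    echo "${TARGET_VERSION}" > "${state_file}"
    echo "applied ${TARGET_VERSION}"
}
apply_config   # first run: applied 2.0
apply_config   # second run: already at 2.0, skipping
```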
Emergency: Fleet-Wide Incident¶
# 1. Assess scope — how many hosts are affected? (the shell module reports
#    CHANGED on success and FAILED when is-active exits nonzero, so grep
#    FAILED; this also builds the host list the later steps --limit on)
ansible all -f 100 -m shell -a 'systemctl is-active myapp' --one-line \
    | grep FAILED | awk -F'|' '{print $1}' | tr -d ' ' \
    | tee /tmp/affected-hosts.txt | wc -l
# 2. Get logs from affected hosts
ansible all -f 100 -m shell -a 'journalctl -u myapp --since "1 hour ago" --no-pager | tail -5' \
--limit @/tmp/affected-hosts.txt --tree /tmp/incident-logs/
# 3. Apply hotfix to affected hosts only
ansible-playbook hotfix.yml --limit @/tmp/affected-hosts.txt -f 50
# 4. Validate fix
ansible all -f 100 -m uri -a 'url=http://localhost:8080/health' \
--limit @/tmp/affected-hosts.txt
Debug clue: If ansible all -m ping shows a mix of SUCCESS and UNREACHABLE, pipe the output through grep UNREACHABLE | awk '{print $1}' to build a list of problem hosts. Cross-reference with last reboot output on reachable hosts to spot patterns (e.g., all unreachable hosts are in the same rack or VLAN).