Fleet Operations at Scale - Street-Level Ops¶
Real-world patterns and gotchas from managing 1,500+ server fleets.
Quick Diagnosis Commands¶
# How many hosts are reachable right now?
ansible all -f 100 -m ping --one-line | grep -c SUCCESS
ansible all -f 100 -m ping --one-line | grep UNREACHABLE | awk -F'|' '{print $1}'
# Quick fleet-wide check
ansible webservers -f 50 -m shell -a 'uptime; free -m | grep Mem; df -h / | tail -1' --one-line
# Find hosts with high disk usage
ansible all -f 100 -m shell -a 'df -h / | tail -1' \
| grep -E '[89][0-9]%|100%'
# Find hosts not running a critical service
ansible all -f 100 -m shell -a 'systemctl is-active nginx || echo DEAD' \
| grep DEAD
# Check package version consistency across fleet
ansible webservers -f 50 -m shell -a 'rpm -q openssl' \
    | grep '^openssl' | sort | uniq -c | sort -rn   # keep only version lines; drop per-host headers
# Find hosts with pending reboots (kernel update)
ansible all -f 100 -m shell -a \
'if [ "$(rpm -q kernel --last | head -1 | awk "{print \$1}")" != "kernel-$(uname -r)" ]; then echo REBOOT_NEEDED; fi' \
| grep REBOOT_NEEDED
Scale note: At fleet scale, every serial operation is a multiplier. A 2-second SSH connect timeout across 1,500 hosts is 50 minutes of wall time at forks=1. Always think in terms of parallelism and batch size.
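That multiplier is worth sanity-checking before every large run; a quick back-of-envelope sketch using the numbers above:

```shell
# Rough wall time = hosts x per-host cost / forks (ignores batching overhead)
hosts=1500
per_host=2   # seconds, e.g. an SSH connect timeout
for forks in 1 50 100; do
    echo "forks=${forks}: ~$(( hosts * per_host / forks ))s"
done
# forks=1 gives 3000s, i.e. the 50 minutes quoted above
```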
Gotcha: Ansible Fork Exhaustion¶
You set forks = 200 and target 1,500 hosts. The Ansible control node runs out of memory and the SSH connection table fills up. Half the tasks fail with cryptic connection errors.
Fix: Keep forks reasonable (50-100). Use --forks tuned to your control node's resources. For very large fleets, split into batches by group:
# Four groups run concurrently, each capped at 50 forks; size the total
# (here 4 x 50 = 200 potential connections) to your control node.
for group in dc1-web dc1-db dc2-web dc2-db; do
    ansible "${group}" -f 50 -m command -a 'uptime' &
done
wait
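One caveat with the loop above: a bare wait discards each group's exit status. A sketch that keeps per-group results, with a stub job standing in for the real ansible call:

```shell
# Background each group's job, remember its PID, then wait on each PID
# individually so failures are attributable per group.
host_groups=(dc1-web dc1-db dc2-web dc2-db)
job() { [ "$1" != "dc1-db" ]; }   # stub; substitute: ansible "$1" -f 50 -m command -a 'uptime'
pids=()
for group in "${host_groups[@]}"; do
    job "${group}" &
    pids+=($!)
done
rc=0
for idx in "${!host_groups[@]}"; do
    if ! wait "${pids[$idx]}"; then
        echo "FAILED: ${host_groups[$idx]}"
        rc=1
    fi
done
echo "overall rc=${rc}"
```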
Gotcha: SSH Known Hosts Explosion¶
You reprovision servers and their SSH host keys change. The next fleet-wide command fails on every reprovisioned host with Host key verification failed.
Fix: For ephemeral/reprovisioned hosts, use:
# ansible.cfg
[ssh_connection]
ssh_args = -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null
Security note:
StrictHostKeyChecking=no accepts any key silently, including MITM keys. StrictHostKeyChecking=accept-new only accepts keys for hosts not yet in known_hosts and rejects changed keys. Always prefer accept-new over no. Caveat: combined with UserKnownHostsFile=/dev/null (as above), no key is ever remembered, so accept-new can never reject a changed key either; point UserKnownHostsFile at a real file if you want that protection for longer-lived hosts.
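When only a few hosts were reprovisioned, an alternative is to purge just their stale keys and keep strict checking. A self-contained ssh-keygen -R demo against a throwaway file; in practice point -f at ~/.ssh/known_hosts (or drop -f entirely) and loop over your reprovisioned-host list. The hostnames and keys below are made up:

```shell
# Build a demo known_hosts with two entries
KNOWN_HOSTS="$(mktemp)"
cat > "${KNOWN_HOSTS}" <<'EOF'
web-001 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
web-002 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
EOF
# Remove the stale key for each reprovisioned host
for host in web-001; do
    ssh-keygen -f "${KNOWN_HOSTS}" -R "${host}" > /dev/null 2>&1
done
grep -q '^web-001 ' "${KNOWN_HOSTS}" && echo "still present" || echo "purged web-001"
```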
Gotcha: One Bad Host Blocks the Batch¶
An Ansible playbook with serial: 10 hits a host that hangs on SSH. The entire batch waits for the timeout. With 1,500 hosts, one flaky server per batch means hours of delays.
Fix: Set aggressive SSH timeouts:
[ssh_connection]
ssh_args = -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=3
timeout = 30
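As a last-resort backstop on top of those settings, coreutils timeout can cap the entire run; demoed here with sleep standing in for a wedged ansible invocation:

```shell
# In production: timeout 600 ansible webservers -f 50 -m command -a 'uptime'
# timeout kills the child at the deadline and exits 124.
timeout 1 sleep 5
echo "exit: $?"   # prints exit: 124
```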
Gotcha: Inconsistent Execution Order¶
You run ansible all -m command -a 'reboot'. Ansible doesn't guarantee execution order. Load balancer backends, database replicas, and application servers reboot in random order. Cascading failures follow.
Fix: Use serial in playbooks. Define explicit groups and ordering:
- hosts: databases
serial: 1
# ... reboot and validate
- hosts: appservers
serial: "10%"
# ... reboot and validate
- hosts: loadbalancers
serial: 1
# ... reboot and validate
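A sketch of what "# ... reboot and validate" might expand to, assuming the localhost:8080/health endpoint used elsewhere in this doc; the module names are standard ansible.builtin, but the timeouts and retry counts are illustrative:

```yaml
- hosts: appservers
  serial: "10%"
  tasks:
    - name: Reboot and block until SSH is back
      ansible.builtin.reboot:
        reboot_timeout: 600
    - name: Validate the app before releasing the batch
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health.status == 200
```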
Pattern: Canary-Then-Fleet¶
#!/usr/bin/env bash
set -euo pipefail
CANARY="web-001"
FLEET_GROUP="webservers"
PLAYBOOK="patch-webservers.yml"
log() { echo "[$(date -u '+%H:%M:%S')] $*"; }
# Step 1: Canary
log "Running canary on ${CANARY}..."
ansible-playbook "${PLAYBOOK}" --limit "${CANARY}" -f 1
if ! ssh "${CANARY}" 'curl -sf http://localhost:8080/health' &>/dev/null; then
log "CANARY FAILED! Aborting fleet rollout."
exit 1
fi
log "Canary passed. Waiting 5 minutes for soak..."
sleep 300
# Step 2: Check canary metrics (use --data-urlencode: the PromQL braces
# and quotes are not URL-safe when pasted raw into the query string)
error_rate=$(curl -sfG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=rate(http_errors_total{host=\"${CANARY}\"}[5m])" \
    | jq -r '.data.result[0].value[1] // "0"')
if (( $(echo "${error_rate} > 0.01" | bc -l) )); then
    log "Canary error rate too high (${error_rate}). Aborting."
    exit 1
fi
# Step 3: Fleet rollout
log "Canary healthy. Rolling out to fleet..."
ansible-playbook "${PLAYBOOK}" --limit "${FLEET_GROUP}" \
-e "serial_pct=10" -f 50
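Note that -e "serial_pct=10" only has effect if patch-webservers.yml actually consumes the variable; the assumed contract looks something like:

```yaml
# patch-webservers.yml (sketch): batch size driven by the extra var
- hosts: webservers
  serial: "{{ serial_pct | default(10) }}%"
  tasks:
    - name: Apply the patch set
      ansible.builtin.yum:
        name: '*'
        state: latest
```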
Pattern: Fact Collection and Reporting¶
# Collect fleet facts and build a CSV report
ansible all -f 100 -m setup -a 'filter=ansible_*' --tree /tmp/fleet-facts/
# Parse into a CSV report (header first; each --tree file is raw module JSON)
{
echo '"hostname","os","vcpus","memory","ip"'
for f in /tmp/fleet-facts/*; do
    jq -r '
    .ansible_facts |
    [.ansible_hostname,
     .ansible_distribution + " " + .ansible_distribution_version,
     .ansible_processor_vcpus,
     (.ansible_memtotal_mb | tostring) + "MB",
     .ansible_default_ipv4.address
    ] | @csv
    ' "${f}"
done
} > fleet-inventory-report.csv
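With the CSV in hand, quick fleet-wide slices are one awk away; for example, the OS spread (demoed on two made-up rows, feed fleet-inventory-report.csv in practice):

```shell
# Count hosts per OS (CSV column 2); output is "count os", largest first
printf '%s\n' \
    '"web-001","CentOS 7.9",4,"16384MB","10.0.0.1"' \
    '"web-002","CentOS 7.9",8,"32768MB","10.0.0.2"' \
    | awk -F',' '{print $2}' | sort | uniq -c | sort -rn
```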
Pattern: Safe Batch Script with Abort¶
#!/usr/bin/env bash
set -euo pipefail

BATCH_SIZE=20
MAX_FAILURES=5
PAUSE_BETWEEN_BATCHES=60

# process_host is the per-host action; replace this stub with your change.
process_host() {
    ssh -o ConnectTimeout=10 "$1" 'uptime' > /dev/null
}

declare -a ALL_HOSTS=()
mapfile -t ALL_HOSTS < hosts.txt
total=${#ALL_HOSTS[@]}
failures=0
processed=0

for (( i=0; i<total; i+=BATCH_SIZE )); do
    batch=("${ALL_HOSTS[@]:i:BATCH_SIZE}")
    echo "=== Batch $((i/BATCH_SIZE + 1)): ${#batch[@]} hosts ==="
    for host in "${batch[@]}"; do
        if ! process_host "${host}"; then
            # Not '(( failures++ ))': post-increment returns status 1 when the
            # value is 0, which would kill the script under set -e.
            failures=$(( failures + 1 ))
            echo "FAILED: ${host} (${failures}/${MAX_FAILURES})"
            if (( failures >= MAX_FAILURES )); then
                echo "ABORT: Max failures reached. $(( total - processed )) hosts remaining."
                echo "Resume from host index $(( i + ${#batch[@]} ))"
                exit 1
            fi
        fi
        processed=$(( processed + 1 ))
    done
    if (( i + BATCH_SIZE < total )); then
        echo "Batch complete. Pausing ${PAUSE_BETWEEN_BATCHES}s..."
        sleep "${PAUSE_BETWEEN_BATCHES}"
    fi
done

echo "Complete. ${processed}/${total} processed, ${failures} failures."
Remember: The three pillars of safe fleet operations: idempotent (run twice, same result), resumable (pick up where you left off after failure), observable (know which hosts succeeded, failed, or were skipped). Mnemonic: IRO — I Run Once (correctly, every time).
Pattern: Idempotent Fleet Scripts¶
# Instead of "do X", check "is X already done?"
apply_config() {
local host=$1
# Check current state first
local current_version
current_version=$(ssh "${host}" 'rpm -q myapp --qf "%{VERSION}"' 2>/dev/null)
if [[ "${current_version}" == "${TARGET_VERSION}" ]]; then
echo "${host}: already at ${TARGET_VERSION}, skipping"
return 0
fi
# Apply change
ssh "${host}" "yum install -y myapp-${TARGET_VERSION}"
}
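The check-then-apply shape is what makes "run twice, same result" hold. A local smoke test of the same pattern, with a temp file standing in for the remote rpm state:

```shell
TARGET_VERSION="2.0"
state_file="$(mktemp)"   # stands in for 'rpm -q myapp' on the remote host
apply_config() {
    local current
    current="$(cat "${state_file}" 2>/dev/null || true)"
    if [[ "${current}" == "${TARGET_VERSION}" ]]; then
        echo "already at ${TARGET_VERSION}, skipping"
        return 0
    fi
    echo "${TARGET_VERSION}" > "${state_file}"
    echo "applied ${TARGET_VERSION}"
}
apply_config   # first run: applied 2.0
apply_config   # second run: already at 2.0, skipping
```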
Emergency: Fleet-Wide Incident¶
# 1. Assess scope — how many hosts are affected? (the shell module reports
#    CHANGED on success and FAILED when is-active exits nonzero, so grep
#    FAILED; this also builds the host list the later steps --limit on)
ansible all -f 100 -m shell -a 'systemctl is-active myapp' --one-line \
    | grep FAILED | awk -F'|' '{print $1}' | tr -d ' ' \
    | tee /tmp/affected-hosts.txt | wc -l
# 2. Get logs from affected hosts
ansible all -f 100 -m shell -a 'journalctl -u myapp --since "1 hour ago" --no-pager | tail -5' \
--limit @/tmp/affected-hosts.txt --tree /tmp/incident-logs/
# 3. Apply hotfix to affected hosts only
ansible-playbook hotfix.yml --limit @/tmp/affected-hosts.txt -f 50
# 4. Validate fix
ansible all -f 100 -m uri -a 'url=http://localhost:8080/health' \
--limit @/tmp/affected-hosts.txt
Debug clue: If ansible all -m ping shows a mix of SUCCESS and UNREACHABLE, pipe the output through grep UNREACHABLE | awk '{print $1}' to build a list of problem hosts. Cross-reference with last reboot output on reachable hosts to spot patterns (e.g., all unreachable hosts are in the same rack or VLAN).