Portal | Level: L2: Operations | Topics: Fleet Operations, Ansible, Bash / Shell Scripting | Domain: DevOps & Tooling

Fleet Operations at Scale - Primer

Why This Matters

Managing 10 servers is system administration. Managing 1,500 is fleet operations. The techniques are fundamentally different. You can't SSH into 1,500 servers one at a time. You can't review 1,500 sets of logs by hand. Fleet ops is about patterns, automation, observability, and controlled blast radius — it's the discipline of making changes to hundreds or thousands of machines with confidence that nothing breaks.

Core Concepts

The Fleet Mindset

Individual servers are cattle, not pets. You don't name them after characters, you don't SSH in to tweak configs, and you don't have special procedures for specific machines. Every server in a role should be interchangeable. If it's not, you've got configuration drift — and drift is the enemy.

Pets                        Cattle
Unique, hand-configured     Identical, automated
Failure is a crisis         Failure is expected
Repaired when sick          Replaced when sick
Named (db-master-01)        Numbered (db-042)

Name origin: The "pets vs cattle" metaphor was coined by Bill Baker of Microsoft in a 2012 presentation. Gavin McCance at CERN later popularized it. The idea: pets are irreplaceable and get names, cattle are interchangeable and get numbers. Some teams now add a third category — "chickens" — for ephemeral, short-lived workloads like CI runners and serverless functions that exist for seconds.

Inventory Management

Your inventory is the source of truth for what exists, where it is, and what role it plays:

# Simple inventory structure
inventory/
  hosts.yaml          # All hosts with metadata
  groups/
    webservers.txt    # Host lists by role
    databases.txt
    switches.txt
  locations/
    dc1-rack01.txt    # Host lists by location
    dc1-rack02.txt

# hosts.yaml — structured inventory
hosts:
  web-001:
    ip: 10.0.1.101
    bmc: 10.0.99.101
    rack: dc1-r01
    role: webserver
    os: centos-8
    cpu: 32
    ram_gb: 128
    serial: ABC123
  web-002:
    ip: 10.0.1.102
    bmc: 10.0.99.102
    rack: dc1-r01
    role: webserver
    os: centos-8
    cpu: 32
    ram_gb: 128
    serial: DEF456
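A structured inventory like this can already answer fleet questions locally. A minimal sketch of a "how many hosts per role?" query, using awk on inlined sample data (a YAML-aware tool such as yq is the robust choice for real inventories; awk only works because the layout is flat key: value pairs):

```shell
#!/usr/bin/env bash
# Count hosts per role from a hosts.yaml-style inventory.
# Sample data is inlined here; a real run would point at inventory/hosts.yaml.
set -euo pipefail

sample=$(mktemp)
cat > "${sample}" <<'EOF'
hosts:
  web-001:
    role: webserver
  web-002:
    role: webserver
  db-001:
    role: database
EOF

# Tally the value of every "role:" line, most common role first
awk '/^ *role:/ {count[$2]++} END {for (r in count) print count[r], r}' \
    "${sample}" | sort -rn
```

For the sample above this prints "2 webserver" followed by "1 database".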

Dynamic Inventory

Static files don't scale. Generate inventory from your CMDB, cloud APIs, or Kubernetes:

#!/usr/bin/env bash
# Dynamic inventory from CMDB API
curl -sf https://cmdb.internal/api/hosts \
    -H "Authorization: Bearer ${CMDB_TOKEN}" \
    | jq -r '.[] | select(.status == "active") | .hostname' \
    | sort
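To feed a list like this to Ansible, a dynamic inventory script prints JSON (group names mapping to host lists) when invoked with --list. A minimal sketch of that wrapping, with a hard-coded array standing in for the CMDB response above:

```shell
#!/usr/bin/env bash
# Emit hosts in Ansible's dynamic-inventory JSON shape:
# {"<group>": {"hosts": [...]}}
set -euo pipefail

hosts=(web-001 web-002 web-003)   # stand-in for the CMDB query output

printf '{"webservers": {"hosts": ['
for i in "${!hosts[@]}"; do
    (( i > 0 )) && printf ', '    # comma-separate after the first entry
    printf '"%s"' "${hosts[$i]}"
done
printf ']}}\n'
```

A production version would also emit a `_meta` section with per-host variables (IP, rack, role) so Ansible doesn't call the script once per host.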

Parallel Execution

The Problem with Serial Execution

# This takes 2.5 hours for 1,500 hosts at 6 seconds each
for host in "${HOSTS[@]}"; do
    ssh "${host}" 'uptime'
done

Gotcha: The serial loop for host in $HOSTS; do ssh $host; done is not just slow — it is also fragile. If one host hangs (e.g., SSH timeout on a dead machine), the entire loop stalls. Parallel tools like GNU Parallel, xargs -P, or Ansible forks handle timeouts per-host without blocking the rest.

Ansible Forks

# Run against 50 hosts simultaneously
ansible webservers -f 50 -m command -a 'uptime'

# Serial percentage for rolling updates
# In playbook:
# - hosts: webservers
#   serial: "10%"        # 10% of fleet at a time
#   max_fail_percentage: 5

GNU Parallel

# Run a command across all hosts, 20 at a time
cat hosts.txt | parallel -j 20 --tag 'ssh {} uptime'

# With timeout and results logging
cat hosts.txt | parallel -j 20 --tag --timeout 30 \
    --results /tmp/fleet-results/ \
    'ssh -o ConnectTimeout=5 {} "uptime; df -h / | tail -1"'

xargs for Simple Cases

# Ping sweep — 50 concurrent. ssh -n stops ssh from swallowing the rest of
# the host list on stdin, and the host is passed as $1 instead of being
# spliced into the command string.
cat hosts.txt | xargs -P 50 -I{} sh -c \
    'ssh -n -o ConnectTimeout=5 "$1" uptime >/dev/null 2>&1 && echo "$1: OK" || echo "$1: FAIL"' _ {}

Rolling Operations

Never change the entire fleet at once. Rolling operations limit blast radius:

Fleet: 1,500 servers
├── Canary batch: 1 server (0.07%)
│   └── Wait 30 min, validate
├── First batch: 15 servers (1%)
│   └── Wait 15 min, validate
├── Second batch: 150 servers (10%)
│   └── Wait 10 min, validate
└── Remaining: ~1,334 servers
    └── Batches of 150, 5 min gaps
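The batch arithmetic above can be sketched for an arbitrary fleet size with plain integer math:

```shell
#!/usr/bin/env bash
# Derive rolling-batch sizes: canary, 1%, 10%, then the remainder.
set -euo pipefail

fleet=1500
canary=1
batch1=$(( fleet / 100 ))                     # 1%  of 1500 -> 15
batch2=$(( fleet / 10 ))                      # 10% of 1500 -> 150
rest=$(( fleet - canary - batch1 - batch2 ))  # everything left -> 1334

echo "canary=${canary} first=${batch1} second=${batch2} remaining=${rest}"
```

For fleet=1500 this reproduces the plan above: 1, 15, 150, and 1,334 remaining.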

Ansible Rolling Strategy

- hosts: webservers
  serial:
    - 1           # Canary
    - "5%"        # Small batch
    - "25%"       # Larger batches
  max_fail_percentage: 2  # Abort if >2% fail

  pre_tasks:
    - name: Drain from load balancer
      command: /opt/scripts/lb-drain.sh {{ inventory_hostname }}

  tasks:
    - name: Apply update
      yum:
        name: myapp
        state: latest
      notify: restart myapp

  post_tasks:
    - name: Validate health
      uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      retries: 5
      delay: 10

    - name: Re-add to load balancer
      command: /opt/scripts/lb-add.sh {{ inventory_hostname }}

Fleet Observability

Aggregate Health Checks

#!/usr/bin/env bash
# Fleet health dashboard — run every 5 minutes via cron
set -euo pipefail

check_host() {
    local host=$1
    local status

    if ! ssh -o ConnectTimeout=5 "${host}" 'true' 2>/dev/null; then
        echo "${host}|unreachable"
        return
    fi

    # Collect metrics
    local output
    output=$(ssh -o ConnectTimeout=10 "${host}" '
        load=$(cat /proc/loadavg | cut -d" " -f1)
        mem_pct=$(free | awk "/Mem:/{printf \"%.0f\", \$3/\$2*100}")
        disk_pct=$(df / | tail -1 | awk "{print \$5}" | tr -d "%")
        echo "${load}|${mem_pct}|${disk_pct}"
    ')

    IFS='|' read -r load mem disk <<< "${output}"

    if (( disk > 90 )) || (( mem > 95 )); then
        echo "${host}|crit|load=${load},mem=${mem}%,disk=${disk}%"
    elif (( disk > 80 )) || (( mem > 85 )); then
        echo "${host}|warn|load=${load},mem=${mem}%,disk=${disk}%"
    else
        echo "${host}|ok|load=${load},mem=${mem}%,disk=${disk}%"
    fi
}

export -f check_host
cat hosts.txt | parallel -j 50 check_host {} | sort -t'|' -k2 \
    | awk -F'|' '{print; count[$2]++}
        END {printf "summary:"; for (s in count) printf " %s=%d", s, count[s]; print ""}'

Drift Detection

# Compare package versions across the fleet
ansible webservers -f 50 -m command -a 'rpm -q nginx' \
    | sort | uniq -c | sort -rn

# Compare config file checksums
ansible webservers -f 50 -m stat -a 'path=/etc/nginx/nginx.conf' \
    | grep checksum | sort | uniq -c
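Once config copies have been fetched locally (e.g. with Ansible's fetch module), drift shows up as a checksum in the minority. A minimal sketch on fabricated files, one of which has drifted:

```shell
#!/usr/bin/env bash
# Drift report over locally fetched config copies: group files by
# checksum; the majority checksum is the baseline, minorities are drift.
# The three files below are fabricated for illustration.
set -euo pipefail

dir=$(mktemp -d)
printf 'worker_processes 4;\n' > "${dir}/web-001.conf"
printf 'worker_processes 4;\n' > "${dir}/web-002.conf"
printf 'worker_processes 8;\n' > "${dir}/web-003.conf"   # the drifted host

# Count hosts per checksum, most common first
cksum "${dir}"/*.conf | awk '{print $1}' | sort | uniq -c | sort -rn
```

The top line carries the baseline count (2 here); any smaller groups are the hosts to investigate.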

Change Management

Change Windows

Production fleet changes happen in maintenance windows:

Change Classification:
  Standard  → Pre-approved, low risk (e.g., security patches)
              Execute during business hours, rolling, no approval needed
  Normal    → Moderate risk (e.g., config changes, service upgrades)
              Schedule maintenance window, get approval, have rollback plan
  Emergency → Active incident or critical vulnerability
              Execute immediately, document after, post-mortem required

War story: In 2014, a Facebook engineer accidentally deployed a configuration change to every machine in production simultaneously — no canary, no rolling. The change broke the internal configuration management system itself, which meant the rollback mechanism was also broken. It took hours to manually restore. This incident led to widespread adoption of canary + progressive rollout patterns across the industry. The lesson: never deploy a change to the system that deploys changes without a separate rollback path.

Remember: Fleet rollout rule of thumb: 1-10-100. Start with 1 server (canary), then 10% of the fleet, then the remaining 100%. Wait and validate between each step. If any step fails, stop and investigate before proceeding.

Rollback Strategy

Every fleet change needs a rollback plan before execution:

# Before: snapshot the current state
ansible webservers -f 50 -m command -a 'rpm -qa' \
    --tree /tmp/fleet-snapshot/pre-change/

# After: if something breaks, restore
ansible webservers -f 50 -m yum -a 'name=nginx-1.24.0 state=present allow_downgrade=yes'
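Diffing the pre- and post-change snapshots shows exactly what a rollback must undo. A minimal sketch for one host, with two short lists standing in for saved rpm -qa output:

```shell
#!/usr/bin/env bash
# Compare pre-/post-change package snapshots for a single host.
# The two lists below are stand-ins for captured "rpm -qa" output.
set -euo pipefail

pre=$(mktemp); post=$(mktemp)
printf 'nginx-1.24.0\nopenssl-3.0.7\n' > "${pre}"
printf 'nginx-1.25.1\nopenssl-3.0.7\n' > "${post}"

# '<' lines were removed by the change, '>' lines were added.
# diff exits 1 when files differ, so keep errexit from killing the script.
diff <(sort "${pre}") <(sort "${post}") || true
```

Here the diff shows nginx-1.24.0 replaced by nginx-1.25.1, which is precisely the downgrade the rollback command performs.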

Communication Patterns

Phone-Home Architecture

Servers report status to a central collector rather than being polled:

  [Server 1] ──POST /status──→ ┌───────────────┐
  [Server 2] ──POST /status──→ │   Collector   │ → Dashboard
  [Server N] ──POST /status──→ │  (HTTP API)   │ → Alerts
                               └───────────────┘

Advantages over polling:

- Scales better (servers push, the collector just receives)
- Works through NAT and firewalls
- Missing reports = server is down (dead-man's switch)
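The dead-man's switch can be sketched with file modification times: each phone-home touches a file named after the host, and anything not touched recently is presumed down. Paths and the 10-minute threshold here are illustrative, and GNU touch/find options are assumed:

```shell
#!/usr/bin/env bash
# Dead-man's switch: the collector touches one state file per reporting
# host; hosts with stale files have stopped phoning home.
set -euo pipefail

statedir=$(mktemp -d)
touch "${statedir}/web-001"                      # reported just now
touch -d '30 minutes ago' "${statedir}/web-002"  # stale report

# Hosts whose last report is older than 10 minutes
find "${statedir}" -type f -mmin +10 -printf '%f is down\n'
```

Only web-002 is flagged, since its last "report" is 30 minutes old.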

Fleet Command Bus

For ad-hoc commands, use a pull-based pattern:

1. Operator posts command to message queue (tagged with role/group)
2. Agents on servers poll queue for commands matching their role
3. Agents execute and post results back
4. Operator views aggregated results

This is broadly how Salt and MCollective work under the hood; Bolt, by contrast, is agentless and pushes commands over SSH/WinRM.
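The four steps above can be sketched as a toy command bus using a directory as the queue. Real buses use ZeroMQ or message-queue middleware; the queue layout and file names here are invented for illustration:

```shell
#!/usr/bin/env bash
# Toy pull-based command bus: a shared directory stands in for the queue.
set -euo pipefail

queue=$(mktemp -d)
role="webserver"

# 1. Operator posts a command tagged with a role
echo 'echo fleet-ok' > "${queue}/cmd-${role}-001"

# 2-3. An agent polls for commands matching its role, runs them,
#      and posts each result next to its command
for cmd in "${queue}/cmd-${role}-"*; do
    bash "${cmd}" > "${cmd}.result" 2>&1 || true
done

# 4. Operator reads the aggregated results
cat "${queue}"/*.result
```

A real implementation adds authentication, per-host result tagging, and command expiry, but the flow is the same.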

