Portal | Level: L2: Operations | Topics: Fleet Operations, Ansible, Bash / Shell Scripting | Domain: DevOps & Tooling
Fleet Operations at Scale - Primer¶
Why This Matters¶
Managing 10 servers is system administration. Managing 1,500 is fleet operations. The techniques are fundamentally different. You can't SSH into 1,500 servers one at a time. You can't review 1,500 sets of logs by hand. Fleet ops is about patterns, automation, observability, and controlled blast radius — it's the discipline of making changes to hundreds or thousands of machines with confidence that nothing breaks.
Core Concepts¶
The Fleet Mindset¶
Individual servers are cattle, not pets. You don't name them after characters, you don't SSH in to tweak configs, and you don't have special procedures for specific machines. Every server in a role should be interchangeable. If it's not, you've got configuration drift — and drift is the enemy.
| Pets | Cattle |
|---|---|
| Unique, hand-configured | Identical, automated |
| Failure is a crisis | Failure is expected |
| Repaired when sick | Replaced when sick |
| Named (db-master-01) | Numbered (db-042) |
Name origin: The "pets vs cattle" metaphor was coined by Bill Baker of Microsoft in a 2012 presentation. Gavin McCance at CERN later popularized it. The idea: pets are irreplaceable and get names, cattle are interchangeable and get numbers. Some teams now add a third category — "chickens" — for ephemeral, short-lived workloads like CI runners and serverless functions that exist for seconds.
Inventory Management¶
Your inventory is the source of truth for what exists, where it is, and what role it plays:
```
# Simple inventory structure
inventory/
  hosts.yaml          # All hosts with metadata
  groups/
    webservers.txt    # Host lists by role
    databases.txt
    switches.txt
  locations/
    dc1-rack01.txt    # Host lists by location
    dc1-rack02.txt
```
```yaml
# hosts.yaml — structured inventory
hosts:
  web-001:
    ip: 10.0.1.101
    bmc: 10.0.99.101
    rack: dc1-r01
    role: webserver
    os: centos-8
    cpu: 32
    ram_gb: 128
    serial: ABC123
  web-002:
    ip: 10.0.1.102
    bmc: 10.0.99.102
    rack: dc1-r01
    role: webserver
    os: centos-8
    cpu: 32
    ram_gb: 128
    serial: DEF456
```
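With a structured inventory, ad-hoc fleet queries become one-liners. A minimal sketch that lists every webserver IP with plain awk; the inventory is inlined here for illustration, and it assumes (as in the example above) that each host's `ip:` line appears before its `role:` line:

```shell
#!/usr/bin/env bash
# List the IP of every host with role "webserver" in hosts.yaml.
# In practice, read the real file instead of the heredoc.
set -euo pipefail

webserver_ips=$(awk '
  /^    ip:/             { ip = $2 }    # remember the last ip: seen
  /^    role: webserver/ { print ip }   # emit it when the role matches
' <<'EOF'
hosts:
  web-001:
    ip: 10.0.1.101
    role: webserver
  db-001:
    ip: 10.0.2.101
    role: database
  web-002:
    ip: 10.0.1.102
    role: webserver
EOF
)

echo "${webserver_ips}"
```

For anything beyond this, a YAML-aware tool (`yq`, or Ansible's own inventory plugins) is the safer choice; indentation-matching awk breaks the moment the file layout changes.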
Dynamic Inventory¶
Static files don't scale. Generate inventory from your CMDB, cloud APIs, or Kubernetes:
```bash
#!/usr/bin/env bash
# Dynamic inventory from CMDB API
curl -sf https://cmdb.internal/api/hosts \
  -H "Authorization: Bearer ${CMDB_TOKEN}" \
  | jq -r '.[] | select(.status == "active") | .hostname' \
  | sort
```
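To plug a script like this into Ansible directly, it has to emit the JSON shape Ansible expects from a dynamic inventory's `--list` call: top-level groups with a `hosts` array, plus a `_meta.hostvars` map. A sketch that builds that shape from a newline-separated host list without extra tooling; the group name `webservers` and the host names are illustrative:

```shell
#!/usr/bin/env bash
# Emit Ansible dynamic-inventory JSON from a host list.
# In practice the list comes from the CMDB query above; inlined here.
set -euo pipefail

hosts_list=$(printf '%s\n' web-001 web-002 web-003)

# Join hosts into a JSON array body: "web-001","web-002","web-003"
hosts_json=$(printf '%s\n' "${hosts_list}" \
  | awk '{printf "%s\"%s\"", (NR > 1 ? "," : ""), $0}')

inventory_json=$(printf '{"webservers": {"hosts": [%s]}, "_meta": {"hostvars": {}}}' \
  "${hosts_json}")
echo "${inventory_json}"
```

Note the hand-rolled JSON is only safe because hostnames contain no quotes or backslashes; for anything richer, build the document with `jq`.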
Parallel Execution¶
The Problem with Serial Execution¶
```bash
# This takes 2.5 hours for 1,500 hosts at 6 seconds each
for host in "${HOSTS[@]}"; do
  ssh "${host}" 'uptime'
done
```
Gotcha: The serial loop `for host in $HOSTS; do ssh "$host" …; done` is not just slow — it is also fragile. If one host hangs (e.g., an SSH timeout on a dead machine), the entire loop stalls. Parallel tools like GNU Parallel, `xargs -P`, or Ansible forks handle timeouts per-host without blocking the rest.
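Even in pure bash, a bounded background-job pool removes both problems: per-host timeouts and concurrency. A sketch, where `run_check` stands in for the real SSH call so the script runs anywhere; note `wait -n` needs bash 4.3+:

```shell
#!/usr/bin/env bash
# Bounded background-job pool in pure bash.
# run_check stands in for the real call, which would be:
#   ssh -n -o ConnectTimeout=5 "${host}" 'uptime'
# (-n stops ssh from eating the loop's stdin)
set -euo pipefail

run_check() { sleep 0.1; echo "$1: ok"; }

MAX_JOBS=5
outfile=$(mktemp)

while read -r host; do
  while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
    wait -n              # block until one running job exits
  done
  run_check "${host}" >> "${outfile}" &
done < <(printf 'host-%02d\n' 1 2 3 4 5 6 7 8 9 10 11 12)
wait                     # drain the remaining jobs

sort "${outfile}"
```

This is fine for a quick sweep; for anything recurring, the dedicated tools below give you tagging, timeouts, and result capture for free.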
Ansible Forks¶
```bash
# Run against 50 hosts simultaneously
ansible webservers -f 50 -m command -a 'uptime'

# Serial percentage for rolling updates, in a playbook:
# - hosts: webservers
#   serial: "10%"            # 10% of the fleet at a time
#   max_fail_percentage: 5
```
GNU Parallel¶
```bash
# Run a command across all hosts, 20 at a time
cat hosts.txt | parallel -j 20 --tag 'ssh {} uptime'

# With timeout and results logging
cat hosts.txt | parallel -j 20 --tag --timeout 30 \
  --results /tmp/fleet-results/ \
  'ssh -o ConnectTimeout=5 {} "uptime; df -h / | tail -1"'
```
xargs for Simple Cases¶
```bash
# Ping sweep — 50 concurrent
cat hosts.txt | xargs -P 50 -I{} sh -c \
  'ssh -o ConnectTimeout=5 {} uptime 2>/dev/null && echo "{}: OK" || echo "{}: FAIL"'
```
Rolling Operations¶
Never change the entire fleet at once. Rolling operations limit blast radius:
```
Fleet: 1,500 servers
├── Canary batch: 1 server (0.07%)
│   └── Wait 30 min, validate
├── First batch: 15 servers (1%)
│   └── Wait 15 min, validate
├── Second batch: 150 servers (10%)
│   └── Wait 10 min, validate
└── Remaining: ~1,334 servers
    └── Batches of 150, 5 min gaps
```
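The batch schedule above is easy to generate mechanically. A sketch that slices a 1,500-host list into canary / 1% / 10% / remainder batches (host names and percentages follow the diagram; adjust to taste):

```shell
#!/usr/bin/env bash
# Slice a host list into rollout batches: 1 canary, then 1%, then 10%, then the rest.
set -euo pipefail

mapfile -t hosts < <(printf 'web-%04d\n' $(seq 1 1500))
total=${#hosts[@]}

b1=1                    # canary
b2=$(( total / 100 ))   # 1%  -> 15
b3=$(( total / 10 ))    # 10% -> 150
rest=$(( total - b1 - b2 - b3 ))

echo "canary:    ${hosts[0]}"
echo "batch-1%:  ${b2} hosts, starting ${hosts[b1]}"
echo "batch-10%: ${b3} hosts, starting ${hosts[b1 + b2]}"
echo "remainder: ${rest} hosts"
```

The remainder comes out to 1,334 hosts, matching the diagram. Feeding each slice to `ansible-playbook --limit` (or writing per-batch inventory files) turns this into a driver for the rollout.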
Ansible Rolling Strategy¶
```yaml
- hosts: webservers
  serial:
    - 1        # Canary
    - "5%"     # Small batch
    - "25%"    # Larger batches
  max_fail_percentage: 2   # Abort if >2% of a batch fails
  pre_tasks:
    - name: Pull from load balancer
      command: /opt/scripts/lb-drain.sh {{ inventory_hostname }}
  tasks:
    - name: Apply update
      yum:
        name: myapp
        state: latest
      notify: restart myapp
  post_tasks:
    - name: Validate health
      uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health is succeeded
    - name: Re-add to load balancer
      command: /opt/scripts/lb-add.sh {{ inventory_hostname }}
```
Fleet Observability¶
Aggregate Health Checks¶
```bash
#!/usr/bin/env bash
# Fleet health dashboard — run every 5 minutes via cron
set -euo pipefail

check_host() {
  local host=$1
  if ! ssh -o ConnectTimeout=5 "${host}" 'true' 2>/dev/null; then
    echo "${host}|unreachable"
    return
  fi
  # Collect all metrics in a single round-trip
  local output
  output=$(ssh -o ConnectTimeout=10 "${host}" '
    load=$(cut -d" " -f1 /proc/loadavg)
    mem_pct=$(free | awk "/Mem:/{printf \"%.0f\", \$3/\$2*100}")
    disk_pct=$(df / | tail -1 | awk "{print \$5}" | tr -d "%")
    echo "${load}|${mem_pct}|${disk_pct}"
  ')
  local load mem disk
  IFS='|' read -r load mem disk <<< "${output}"
  if (( disk > 90 )) || (( mem > 95 )); then
    echo "${host}|crit|load=${load},mem=${mem}%,disk=${disk}%"
  elif (( disk > 80 )) || (( mem > 85 )); then
    echo "${host}|warn|load=${load},mem=${mem}%,disk=${disk}%"
  else
    echo "${host}|ok|load=${load},mem=${mem}%,disk=${disk}%"
  fi
}
export -f check_host

cat hosts.txt | parallel -j 50 check_host {} | sort -t'|' -k2
```
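The `host|status|metrics` lines pipe naturally into a fleet-level summary, which is what you actually watch at 1,500 hosts. A sketch, assuming the output format of `check_host` above (sample lines inlined; in practice pipe the parallel run's output in):

```shell
#!/usr/bin/env bash
# Roll check_host output up into per-status counts.
set -euo pipefail

summary=$(printf '%s\n' \
  'web-001|ok|load=0.4,mem=61%,disk=42%' \
  'web-002|warn|load=1.2,mem=88%,disk=71%' \
  'web-003|crit|load=9.8,mem=97%,disk=93%' \
  'web-004|ok|load=0.2,mem=40%,disk=35%' \
  'web-005|unreachable' \
  | awk -F'|' '{count[$2]++} END {for (s in count) printf "%s=%d\n", s, count[s]}' \
  | sort)
echo "${summary}"
```

This prints one `status=count` line per status (crit=1, ok=2, unreachable=1, warn=1 for the sample), which is trivially diffable between runs.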
Drift Detection¶
```bash
# Compare package versions across the fleet
ansible webservers -f 50 -m command -a 'rpm -q nginx' \
  | sort | uniq -c | sort -rn

# Compare config file checksums
ansible webservers -f 50 -m stat -a 'path=/etc/nginx/nginx.conf' \
  | grep checksum | sort | uniq -c
```
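Counting unique checksums tells you drift exists; comparing each host against a golden checksum tells you which hosts drifted. A sketch over `host checksum` pairs, inlined for illustration (the checksum values are fake; in practice the pairs come from the stat run above and the golden value from your config repo):

```shell
#!/usr/bin/env bash
# Report hosts whose config checksum differs from the golden value.
set -euo pipefail

golden="a1b2c3"   # checksum of the reviewed, canonical nginx.conf (illustrative)

drifted=$(printf '%s\n' \
  'web-001 a1b2c3' \
  'web-002 a1b2c3' \
  'web-003 ffee99' \
  'web-004 a1b2c3' \
  | awk -v g="${golden}" '$2 != g {print $1}')

echo "drifted: ${drifted:-none}"
```

The drifted list then feeds straight back into a remediation playbook run with `--limit`.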
Change Management¶
Change Windows¶
Production fleet changes happen in maintenance windows:
```
Change Classification:
  Standard  → Pre-approved, low risk (e.g., security patches)
              Execute during business hours, rolling, no approval needed
  Normal    → Moderate risk (e.g., config changes, service upgrades)
              Schedule maintenance window, get approval, have rollback plan
  Emergency → Active incident or critical vulnerability
              Execute immediately, document after, post-mortem required
```
War story: In 2014, a Facebook engineer accidentally deployed a configuration change to every machine in production simultaneously — no canary, no rolling. The change broke the internal configuration management system itself, which meant the rollback mechanism was also broken. It took hours to manually restore. This incident led to widespread adoption of canary + progressive rollout patterns across the industry. The lesson: never deploy a change to the system that deploys changes without a separate rollback path.
Remember: Fleet rollout rule of thumb: 1-10-100. Start with 1 server (canary), then 10% of the fleet, then the remaining 100%. Wait and validate between each step. If any step fails, stop and investigate before proceeding.
Rollback Strategy¶
Every fleet change needs a rollback plan before execution:
```bash
# Before: snapshot the current state
ansible webservers -f 50 -m command -a 'rpm -qa' \
  --tree /tmp/fleet-snapshot/pre-change/

# After: if something breaks, restore
ansible webservers -f 50 -m yum \
  -a 'name=nginx-1.24.0 state=present allow_downgrade=yes'
```
Communication Patterns¶
Phone-Home Architecture¶
Servers report status to a central collector rather than being polled:
```
[Server 1] ──POST /status──→ ┌──────────────┐
[Server 2] ──POST /status──→ │  Collector   │ → Dashboard
[Server N] ──POST /status──→ │  (HTTP API)  │ → Alerts
                             └──────────────┘
```
Advantages over polling:
- Scales better (servers push; the collector just receives)
- Works through NAT/firewalls
- Missing reports = server is down (a dead-man's switch)
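A minimal phone-home agent is just a cron job that builds a JSON status payload and POSTs it. A sketch; the collector URL is hypothetical, and the curl call is shown commented out so the script is safe to run anywhere:

```shell
#!/usr/bin/env bash
# Phone-home agent sketch: build a status payload, then POST it to the collector.
set -euo pipefail

disk_pct=$(df -P / | awk 'NR==2 {gsub(/%/, ""); print $5}')
payload=$(printf '{"host":"%s","ts":%s,"disk_pct":%s}' \
  "$(hostname)" "$(date +%s)" "${disk_pct}")

echo "${payload}"
# In production (collector URL is illustrative):
#   curl -sf -X POST -H 'Content-Type: application/json' \
#        -d "${payload}" https://collector.internal/status
```

On the collector side, the timestamp is what drives the dead-man's switch: any host whose last report is older than a couple of reporting intervals is flagged as down.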
Fleet Command Bus¶
For ad-hoc commands, use a pull-based pattern:
1. Operator posts command to message queue (tagged with role/group)
2. Agents on servers poll queue for commands matching their role
3. Agents execute and post results back
4. Operator views aggregated results
This is roughly how Salt and MCollective work under the hood. (Puppet Bolt, by contrast, is push-based: it connects out to targets over SSH/WinRM with no resident agent.)
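The four steps above can be demonstrated end to end with a directory standing in for the message queue: each command is a file whose name carries the target role tag, and the agent executes only the commands matching its own role. A sketch (file-based queue and the command contents are illustrative):

```shell
#!/usr/bin/env bash
# File-based sketch of a pull-based command bus.
set -euo pipefail

queue=$(mktemp -d)
my_role="webserver"

# 1. Operator posts commands, tagged with the target role in the filename.
echo 'echo reloading nginx'  > "${queue}/cmd-001.webserver"
echo 'echo vacuuming tables' > "${queue}/cmd-002.database"

# 2-3. Agent polls the queue, runs only commands matching its role,
#      and writes each result back next to the command.
results=""
for cmd in "${queue}"/cmd-*."${my_role}"; do
  out=$(bash "${cmd}")
  echo "${out}" > "${cmd}.result"
  results+="${out}"$'\n'
done

# 4. Operator aggregates the .result files.
printf '%s' "${results}"
rm -rf "${queue}"
```

A real bus swaps the directory for a broker (and signs commands so agents can verify the operator), but the role-filtered pull loop is the same shape.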
Wiki Navigation¶
Prerequisites¶
- Ansible Automation (Topic Pack, L1)
- Linux Ops (Topic Pack, L0)
Related Content¶
- RHCE (EX294) Exam Preparation (Topic Pack, L2) — Ansible, Bash / Shell Scripting
- Advanced Bash for Ops (Topic Pack, L1) — Bash / Shell Scripting
- Ansible Automation (Topic Pack, L1) — Ansible
- Ansible Core Flashcards (CLI) (flashcard_deck, L1) — Ansible
- Ansible Deep Dive (Topic Pack, L2) — Ansible
- Ansible Drills (Drill, L1) — Ansible
- Ansible Exercises (Quest Ladder) (CLI) (Exercise Set, L1) — Ansible
- Ansible Lab: Conditionals and Loops (Lab, L1) — Ansible
- Ansible Lab: Facts and Variables (Lab, L0) — Ansible
- Ansible Lab: Install Nginx (Idempotency) (Lab, L1) — Ansible
Pages that link here¶
- Advanced Bash for Ops
- Ansible Automation
- Ansible Deep Dive
- Ansible Drills
- Anti-Primer: Fleet Ops
- Cron & Job Scheduling
- Fleet Operations at Scale
- Master Curriculum: 40 Weeks
- RHCE (EX294) Exam Preparation
- Symptoms: Ansible Playbook Hangs, SSH Agent Forwarding Broken, Root Cause Is Firewall Rule
- Symptoms: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook