Fleet Ops
16 cards — 🟢 3 easy | 🟡 4 medium | 🔴 3 hard
🟢 Easy (3)
1. What is the "cattle, not pets" principle in fleet operations?
Show answer
Servers should be identical, automated, and interchangeable (cattle). They are replaced when sick, not repaired. This contrasts with pet servers that are unique, hand-configured, and treated as irreplaceable, which does not scale.
2. Why is serial execution impractical for fleet operations?
Show answer
Running a command serially on 1,500 hosts at 6 seconds each takes 2.5 hours (1,500 × 6 s = 9,000 s). Parallel execution mechanisms such as Ansible's forks setting, GNU parallel, or xargs -P run the command on many hosts simultaneously, cutting the wall-clock time roughly in proportion to the degree of parallelism.
3. What is the purpose of a fleet inventory and why do static files not scale?
Show answer
An inventory is the source of truth for what exists, where it is, and what role it plays. Static files do not scale because they become stale; instead, generate inventory dynamically from a CMDB, cloud APIs, or Kubernetes.
🟡 Medium (4)
1. Describe the recommended rolling operation batch progression for a fleet of 1,500 servers.
Show answer
Start with a canary batch of 1 server, wait 30 minutes and validate. Then 15 servers (1%), wait 15 minutes. Then 150 servers (10%), wait 10 minutes. Then remaining servers in batches of 150 with 5-minute gaps. This limits blast radius while still completing in reasonable time.
2. How does Ansible implement rolling updates with automatic abort on failures?
Show answer
Use the serial directive with escalating batch sizes (e.g., 1, then 5%, then 25%) and max_fail_percentage (e.g., 2) to abort if more than 2% of hosts fail. Combine with pre_tasks to drain each host from the load balancer and post_tasks to validate health and re-add it.
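The batching and abort behaviour described above can be illustrated with a small Python simulation. This is only a sketch of the idea, not Ansible's own code: the function names (batch_size, make_batches, rolling_update) and the update callback are hypothetical, percentages are assumed to be taken of the total host count and rounded down with a minimum of 1, and the failure threshold is assumed to be evaluated per batch and to trigger only when exceeded.

```python
from typing import Callable, List, Sequence, Tuple, Union

BatchSpec = Union[int, str]  # e.g. 1 or "5%"


def batch_size(spec: BatchSpec, total: int) -> int:
    """Resolve an absolute count or an 'N%' string to a host count (minimum 1)."""
    if isinstance(spec, str) and spec.endswith("%"):
        return max(1, total * int(spec[:-1]) // 100)
    return max(1, int(spec))


def make_batches(hosts: Sequence[str], serial: Sequence[BatchSpec]) -> List[List[str]]:
    """Split hosts into escalating batches; the last spec repeats until all hosts are covered."""
    batches: List[List[str]] = []
    start, i = 0, 0
    while start < len(hosts):
        spec = serial[min(i, len(serial) - 1)]
        size = batch_size(spec, len(hosts))
        batches.append(list(hosts[start:start + size]))
        start += size
        i += 1
    return batches


def rolling_update(
    hosts: Sequence[str],
    serial: Sequence[BatchSpec],
    max_fail_percentage: float,
    update: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Run update() batch by batch and stop once a batch exceeds the failure threshold.

    Returns (attempted_hosts, aborted).
    """
    attempted: List[str] = []
    for batch in make_batches(hosts, serial):
        failures = sum(1 for host in batch if not update(host))
        attempted.extend(batch)
        if 100.0 * failures / len(batch) > max_fail_percentage:
            return attempted, True  # remaining batches are never touched
    return attempted, False


if __name__ == "__main__":
    fleet = [f"web{i:03d}" for i in range(100)]  # hypothetical host names
    attempted, aborted = rolling_update(
        fleet, [1, "5%", "25%"], 2, lambda host: host != "web003"
    )
    print(len(attempted), aborted)
```

In a real playbook the update callback would correspond to the drain, change, validate, and re-add steps run against each host; the point of the sketch is that a single failure in an early, small batch stops the rollout before the large batches start.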
3. How do you detect configuration drift across a fleet?
Show answer
Compare package versions across the fleet (e.g., ansible webservers -m command -a 'rpm -q nginx' | sort | uniq -c) or compare config file checksums (e.g., ansible webservers -m stat -a 'path=/etc/nginx/nginx.conf' | grep checksum | sort | uniq -c). More than one distinct value in the output indicates drift.
4. What is a phone-home architecture and what are its advantages?
Show answer
Servers push status reports to a central collector via HTTP POST rather than being polled. Advantages: scales better (servers push, collector receives), works through NAT/firewalls, and missing reports serve as a dead-man's switch indicating the server is down.
🔴 Hard (3)
1. Describe a fleet-wide aggregate health check script pattern.
Show answer
Use GNU parallel to SSH into all hosts concurrently (e.g., 50 at a time). For each host, collect load average, memory percent, and disk percent. Classify hosts as ok, warn (disk > 80% or mem > 85%), crit (disk > 90% or mem > 95%), or unreachable. Aggregate results into a summary dashboard and run via cron every 5 minutes.
2. How does a fleet command bus (pull-based) pattern work?
Show answer
An operator posts a command to a message queue tagged with a role or group. Agents on servers poll the queue for commands matching their role. Agents execute the command and post results back to the queue. The operator views aggregated results. This is the general pattern behind agent-based tools such as Salt and MCollective; Bolt, by contrast, is agentless and connects to hosts over SSH or WinRM.
3. What should a fleet change rollback strategy include?