Thinking Out Loud: Ansible¶
A senior SRE's internal monologue while working through a real Ansible task. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
A security patch for OpenSSL needs to be rolled out to 150 production servers across 4 environments (dev/staging/prod-us/prod-eu) within 24 hours. The servers run a mix of Ubuntu 20.04 and 22.04. The patch requires a service restart to take effect, and some servers run latency-sensitive services that need coordinated restarts.
The Monologue¶
150 servers, 24-hour window, mixed OS versions, coordinated restarts. This is exactly what Ansible is for. Let me plan the rollout before touching anything.
First, let me verify the current state. Which servers need the patch?
ansible all -i inventory/production -m shell -a "dpkg -l openssl | grep ii" --become -o 2>/dev/null | head -20
Mixed versions. Some on 3.0.2-0ubuntu1.12, some on 3.0.2-0ubuntu1.10. The target is 3.0.2-0ubuntu1.15 (the security patch). Let me count how many actually need updating.
ansible all -i inventory/production -m shell -a "apt list --upgradable 2>/dev/null | grep openssl" --become -o 2>/dev/null | grep -c "openssl"
128 out of 150 need the update. 22 are already patched (probably the newest servers that got the latest image). Good — fewer to update.
Now, I need to think about the rollout order. Security says 24 hours, but I'm not going to blast all 128 servers at once. Rolling updates, environment by environment: dev first, then staging, then prod-us, then prod-eu.
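For orientation, here's roughly what the inventory groups look like (group names match the commands below; host ranges are illustrative):
# inventory/production (sketch; host ranges illustrative)
[dev_servers]
web-dev-[01:12]

[staging_servers]
web-stg-[01:24]

[prod_us_servers]
web-prod-us-[01:64]

[prod_eu_servers]
web-prod-eu-[01:50]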
Mental Model: Progressive Rollout for Config Changes¶
Treat infrastructure changes like code deploys: roll out progressively through environments. If the OpenSSL update breaks something (unlikely but possible — especially with apps that pin TLS versions), you want to catch it in dev, not discover it in production. The time cost of progressive rollout is small; the blast radius reduction is enormous.
Let me write the playbook. I want to handle: package update, service restart, and verification — with different restart strategies for different service types.
cat <<'YAML' > /tmp/openssl-patch.yml
---
- name: OpenSSL Security Patch Rollout
  hosts: "{{ target_group }}"
  become: true
  serial: "{{ serial_count | default('25%') }}"
  max_fail_percentage: 10

  pre_tasks:
    - name: Verify current openssl version
      command: openssl version
      register: pre_version
      changed_when: false
      check_mode: false   # read-only; run even under --check so the report works

    - name: Check if update is available
      apt:
        name: openssl
        state: latest
        update_cache: yes
        cache_valid_time: 3600
      check_mode: yes
      register: update_check

    - name: Skip if already patched
      meta: end_host
      when: not update_check.changed

  tasks:
    - name: Update openssl and the matching libssl package
      apt:
        name:
          - openssl
          # 22.04 ships libssl3; 20.04 ships libssl1.1
          - "{{ 'libssl3' if ansible_distribution_version is version('22.04', '>=') else 'libssl1.1' }}"
        state: latest
      register: update_result

    - name: Restart affected services
      include_tasks: restart-services.yml
      when: update_result.changed

  post_tasks:
    - name: Verify new openssl version
      command: openssl version
      register: post_version
      changed_when: false
      check_mode: false

    - name: Health check
      uri:
        url: "http://localhost:{{ health_port | default(8080) }}/health"
        return_content: yes
        timeout: 10
      register: health
      retries: "{{ health_retries | default(3) }}"
      delay: "{{ health_delay | default(5) }}"
      until: health.status == 200
      when: health_port is defined

    - name: Report
      debug:
        msg: "{{ inventory_hostname }}: {{ pre_version.stdout }} -> {{ post_version.stdout }}"
YAML
Key decisions in this playbook:
- serial: 25% — process 25% of the target group at a time, not all at once
- max_fail_percentage: 10 — if more than 10% fail, stop the rollout entirely
- Check mode pre-check — skip hosts that are already patched
- Health check after restart — verify the server is healthy before moving on
Let me test on dev first.
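A check-mode pass first, assuming the dev hosts live in the same inventory under a dev_servers group:
ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=dev_servers \
  -e health_port=8080 \
  --check --diff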
Check mode looks good — 12 servers would be updated. Let me run it for real on dev.
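Same command, minus the safety flags:
ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=dev_servers \
  -e health_port=8080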
Dev done. All 12 patched, all health checks passed. Let me verify one manually.
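Spot-check on one box (hostname illustrative):
ssh web-dev-03 "openssl version && dpkg -l openssl | grep '^ii'"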
Mental Model: Ansible Serial + max_fail_percentage = Circuit Breaker¶
serial controls the batch size (how many hosts are updated at once). max_fail_percentage is the circuit breaker — if too many hosts fail, Ansible stops the entire play. Together, they prevent a bad update from rolling across the entire fleet. Always use both for production changes. serial: 1 is the safest but slowest; serial: 25% is a good balance for large fleets.
Staging next.
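Same playbook, staging group (name assumed):
ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=staging_servers \
  -e health_port=8080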
Staging done. No failures. Now for production. I need to be more careful here. Let me use a smaller serial count and add the coordinated restart logic for load-balanced services.
For production, I need to:

1. Remove the server from the load balancer before restarting
2. Restart the affected services
3. Wait for the health check to pass
4. Re-add the server to the load balancer
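The coordination goes into the restart-services.yml the playbook already includes. A minimal sketch, assuming an HAProxy load balancer managed over its admin socket; app_backend, lb_host (which would be set in group_vars for the prod groups), and the myservice unit name are all placeholders for whatever the environment actually runs:
cat <<'YAML' > /tmp/restart-services.yml
# Included per-host from the main playbook. LB coordination only runs
# when lb_host is defined, so dev/staging skip it.
- name: Drain host from the load balancer
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"
    backend: app_backend              # placeholder backend name
    socket: /var/run/haproxy.sock
    wait: true
  delegate_to: "{{ lb_host }}"
  when: lb_host is defined

- name: Restart the service
  service:
    name: myservice                   # placeholder service name
    state: restarted

- name: Wait for the service to answer its health check
  uri:
    url: "http://localhost:{{ health_port | default(8080) }}/health"
  register: svc_health
  retries: "{{ health_retries | default(3) }}"
  delay: "{{ health_delay | default(5) }}"
  until: svc_health.status == 200
  when: health_port is defined

- name: Re-add host to the load balancer
  community.general.haproxy:
    state: enabled
    host: "{{ inventory_hostname }}"
    backend: app_backend
    socket: /var/run/haproxy.sock
    wait: true
  delegate_to: "{{ lb_host }}"
  when: lb_host is defined
YAML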
ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=prod_us_servers \
  -e serial_count=5 \
  -e health_port=8080 \
  -v
Running 5 at a time in prod-us. Watching the output...
Hmm — one host barely scraped through the health check after restart. The service came back, but the health endpoint timed out on the first two attempts and only succeeded on the third and final retry. Let me check whether that's normal for this service (it might be a slow starter) or whether the update caused an issue.
ansible prod_us_servers -i inventory/production -m shell -a "journalctl -u myservice --since '10 minutes ago' | grep -i error | head -5" --become --limit "web-prod-us-47"
No errors in the logs. The health check was just slow to respond after restart — the application takes 15 seconds to warm up. My retry config (3 retries, 5 seconds apart) was too tight. Let me adjust for the remaining servers.
ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=prod_eu_servers \
  -e serial_count=5 \
  -e health_port=8080 \
  -e health_retries=6 \
  -v
Prod-EU rolling through. All green. Let me verify the fleet-wide status.
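One way to roll that up in a single line:
ansible all -i inventory/production -m shell \
  -a 'dpkg -l openssl | awk "/^ii/{print \$3}"' \
  --become -o 2>/dev/null | awk '{print $NF}' | sort | uniq -c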
All 128 previously-unpatched servers now running the target version. 22 were already patched. 150/150 on the correct version. Rollout complete.
Let me save this playbook properly and commit it. I always keep patch rollout playbooks — they're reusable templates for the next CVE.
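Roughly, with our config repo layout (paths illustrative):
cp /tmp/openssl-patch.yml /tmp/restart-services.yml playbooks/patching/
git add playbooks/patching/
git commit -m "Add reusable OpenSSL patch rollout playbook (serial + circuit breaker)"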
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Run apt upgrade openssl on each server individually | Use Ansible with a structured playbook for consistent, parallel, auditable rollout | Manual upgrades don't scale, can't be rolled back, and leave no audit trail |
| Blast the update to all 150 servers at once | Roll out progressively: dev -> staging -> prod-us -> prod-eu with serial batches | Progressive rollout catches issues before they affect the entire fleet |
| Not think about post-update health checks | Include health checks with retries as part of the playbook | An update that leaves the service unhealthy is worse than no update |
| Not set max_fail_percentage | Use serial + max_fail_percentage as a circuit breaker | If the update is bad, stop rolling it out to more servers |
Key Heuristics Used¶
- Progressive Rollout: Treat infrastructure patches like code deploys — dev, staging, prod in sequence. Each environment validates the change for the next.
- Serial + Circuit Breaker: Use serial for batch size and max_fail_percentage to stop the rollout if too many hosts fail.
- Post-Change Health Verification: Always verify health after applying changes. An update that breaks the service is a worse outcome than a vulnerable but running service (at least temporarily).
Cross-References¶
- Primer — Ansible playbook structure, inventory management, and module types
- Street Ops — Rollout patterns, ad-hoc commands for fleet inspection, and variable precedence
- Footguns — Running without serial/max_fail_percentage, not testing on dev/staging first, and ansible-playbook without --check