Thinking Out Loud: Ansible

A senior SRE's internal monologue while working through a real Ansible task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

A security patch for OpenSSL needs to be rolled out to 150 production servers across 4 environments (dev/staging/prod-us/prod-eu) within 24 hours. The servers run a mix of Ubuntu 20.04 and 22.04. The patch requires a service restart to take effect, and some servers run latency-sensitive services that need coordinated restarts.

The Monologue

150 servers, 24-hour window, mixed OS versions, coordinated restarts. This is exactly what Ansible is for. Let me plan the rollout before touching anything.

First, let me verify the current state. Which servers need the patch?

ansible all -i inventory/production -m shell -a "dpkg -l openssl | grep ii" --become -o 2>/dev/null | head -20

Mixed versions. Some on 3.0.2-0ubuntu1.12, some on 3.0.2-0ubuntu1.10. The target is 3.0.2-0ubuntu1.15 (the security patch). Let me count how many actually need updating.

ansible all -i inventory/production -m shell -a "apt list --upgradable 2>/dev/null | grep openssl" --become -o 2>/dev/null | grep -c "openssl"

128 out of 150 need the update. 22 are already patched (probably the newest servers that got the latest image). Good — fewer to update.

Now, I need to think about the rollout order. Security says 24 hours, but I'm not going to blast all 128 servers at once. Rolling updates, environment by environment: dev first, then staging, then prod-us, then prod-eu.
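That ordering assumes the inventories are grouped by environment. A sketch of what the production inventory might look like in YAML form (the group and host names here are illustrative, not taken from the real inventory — dev and staging live in their own inventory files):

```yaml
# inventory/production — a sketch only; host ranges and names are invented.
all:
  children:
    prod_us_servers:
      hosts:
        web-prod-us-[01:60]:
    prod_eu_servers:
      hosts:
        web-prod-eu-[01:60]:
```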

Mental Model: Progressive Rollout for Config Changes

Treat infrastructure changes like code deploys: roll out progressively through environments. If the OpenSSL update breaks something (unlikely but possible — especially with apps that pin TLS versions), you want to catch it in dev, not discover it in production. The time cost of progressive rollout is small; the blast radius reduction is enormous.

Let me write the playbook. I want to handle: package update, service restart, and verification — with different restart strategies for different service types.

cat <<'YAML' > /tmp/openssl-patch.yml
---
- name: OpenSSL Security Patch Rollout
  hosts: "{{ target_group }}"
  become: true
  serial: "{{ serial_count | default('25%') }}"
  max_fail_percentage: 10

  pre_tasks:
    - name: Verify current openssl version
      command: openssl version
      register: pre_version
      changed_when: false

    - name: Check if update is available
      apt:
        name: openssl
        state: latest
        update_cache: yes
        cache_valid_time: 3600
      check_mode: yes
      register: update_check

    - name: Skip if already patched
      meta: end_host
      when: not update_check.changed

  tasks:
    - name: Update openssl and libssl
      apt:
        name:
          - openssl
          # Ubuntu 22.04 ships libssl3; 20.04 still uses libssl1.1
          - "{{ 'libssl3' if ansible_facts['distribution_version'] is version('22.04', '>=') else 'libssl1.1' }}"
        state: latest
      register: update_result

    - name: Restart affected services
      include_tasks: restart-services.yml
      when: update_result.changed

  post_tasks:
    - name: Verify new openssl version
      command: openssl version
      register: post_version
      changed_when: false

    - name: Health check
      uri:
        url: "http://localhost:{{ health_port | default(8080) }}/health"
        return_content: yes
        timeout: 10
      register: health
      retries: 3
      delay: 5
      until: health.status == 200
      when: health_port is defined

    - name: Report
      debug:
        msg: "{{ inventory_hostname }}: {{ pre_version.stdout }} -> {{ post_version.stdout }}"
YAML

Key decisions in this playbook:

  • serial: 25% — process 25% of the target group at a time, not all at once
  • max_fail_percentage: 10 — if more than 10% fail, stop the rollout entirely
  • Check mode pre-check — skip hosts that are already patched
  • Health check after restart — verify the server is healthy before moving on

Let me test on dev first.

ansible-playbook /tmp/openssl-patch.yml -i inventory/dev -e target_group=dev_servers --check --diff

Check mode looks good — 12 servers would be updated. Let me run it for real on dev.

ansible-playbook /tmp/openssl-patch.yml -i inventory/dev -e target_group=dev_servers -v

Dev done. All 12 patched, all health checks passed. Let me verify one manually.

ansible 'dev_servers[0]' -i inventory/dev -m shell -a "openssl version" --become

Mental Model: Ansible Serial + max_fail_percentage = Circuit Breaker

serial controls the batch size (how many hosts are updated at once). max_fail_percentage is the circuit breaker — if too many hosts fail, Ansible stops the entire play. Together, they prevent a bad update from rolling across the entire fleet. Always use both for production changes. serial: 1 is the safest but slowest; serial: 25% is a good balance for large fleets.
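One refinement not used above: serial also accepts a list, so the batch size can ramp up as confidence grows. A sketch (group name as in this rollout):

```yaml
# Hypothetical ramp-up: 1 host first, then 5, then 25% of the group per batch.
- hosts: prod_us_servers
  become: true
  serial:
    - 1
    - 5
    - "25%"
  max_fail_percentage: 10
```

The first batch acts as a canary; if it fails, the play stops before touching the wider fleet.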

Staging next.

ansible-playbook /tmp/openssl-patch.yml -i inventory/staging -e target_group=staging_servers -v

Staging done. No failures. Now for production. I need to be more careful here. Let me use a smaller serial count and add the coordinated restart logic for load-balanced services.

For production, I need to:

  1. Remove the server from the load balancer before restarting
  2. Restart the affected services
  3. Wait for the health check to pass
  4. Re-add the server to the load balancer
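Those steps live in the restart-services.yml the playbook includes, which isn't shown above. A minimal sketch of what it might contain — the HAProxy load balancer, its socket path, the backend name, the service name, and the lb_host variable are all assumptions, not taken from the real file:

```yaml
# restart-services.yml — illustrative only.
- name: Drain this host from the load balancer
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"
    backend: web_backend
    socket: /var/run/haproxy.sock
  delegate_to: "{{ lb_host }}"

- name: Restart the affected service
  service:
    name: myservice
    state: restarted

- name: Wait for the service port before re-enabling
  wait_for:
    port: "{{ health_port | default(8080) }}"
    timeout: 60

- name: Re-enable this host in the load balancer
  community.general.haproxy:
    state: enabled
    host: "{{ inventory_hostname }}"
    backend: web_backend
    socket: /var/run/haproxy.sock
  delegate_to: "{{ lb_host }}"
```

Because the play runs with serial batches, at most that batch's worth of hosts is drained at any time.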

ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=prod_us_servers \
  -e serial_count=5 \
  -e health_port=8080 \
  -v

Running 5 at a time in prod-us. Watching the output...

Hmm — one host failed the health check after restart. The service came back but the health endpoint timed out on the first two retries, then succeeded on the third. Let me check if that's normal for this service (it might be a slow starter) or if the update caused an issue.

ansible prod_us_servers -i inventory/production -m shell -a "journalctl -u myservice --since '10 minutes ago' | grep -i error | head -5" --become --limit "web-prod-us-47"

No errors in the logs. The health check was just slow to respond after restart — the application takes 15 seconds to warm up. My retry config (3 retries, 5 seconds apart) was too tight. Let me adjust for the remaining servers.
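One way to loosen the health check without editing hardcoded numbers each time is to parameterize it — a sketch, with variable names of my own invention:

```yaml
# Sketch: make retries/delay overridable with -e on the command line.
- name: Health check
  uri:
    url: "http://localhost:{{ health_port | default(8080) }}/health"
    return_content: yes
    timeout: 10
  register: health
  retries: "{{ health_retries | default(6) }}"
  delay: "{{ health_delay | default(10) }}"
  until: health.status == 200
  when: health_port is defined
```

With that in place, the remaining batches could run with -e health_retries=6 -e health_delay=10 to give slow starters their 15-second warm-up.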

ansible-playbook /tmp/openssl-patch.yml -i inventory/production \
  -e target_group=prod_eu_servers \
  -e serial_count=5 \
  -e health_port=8080 \
  -v

Prod-EU rolling through. All green. Let me verify the fleet-wide status.

ansible all -i inventory/production -m shell -a "openssl version" --become -o 2>/dev/null | cut -d'|' -f4- | sort | uniq -c

All 128 previously-unpatched servers now running the target version. 22 were already patched. 150/150 on the correct version. Rollout complete.
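One subtlety in that fleet-wide check: `ansible -o` prefixes every line with the hostname, so version strings only collapse under uniq -c once that prefix is stripped. A quick illustration on two hypothetical output lines (hostnames invented):

```shell
# Two made-up lines in `ansible -o` output format.
sample='web-prod-us-01 | CHANGED | rc=0 | (stdout) OpenSSL 3.0.2 15 Mar 2022
web-prod-eu-01 | CHANGED | rc=0 | (stdout) OpenSSL 3.0.2 15 Mar 2022'

# Drop everything before the stdout field, then count identical versions.
# Both lines collapse into a single entry with count 2.
printf '%s\n' "$sample" | cut -d'|' -f4- | sort | uniq -c
```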

Let me save this playbook properly and commit it. I always keep patch rollout playbooks — they're reusable templates for the next CVE.

What Made This Senior-Level

  • Junior would run apt upgrade openssl on each server individually; senior uses Ansible with a structured playbook for a consistent, parallel, auditable rollout. Why: manual upgrades don't scale, can't be rolled back, and leave no audit trail.
  • Junior would blast the update to all 150 servers at once; senior rolls out progressively (dev -> staging -> prod-us -> prod-eu) in serial batches. Why: progressive rollout catches issues before they affect the entire fleet.
  • Junior wouldn't think about post-update health checks; senior includes health checks with retries as part of the playbook. Why: an update that leaves the service unhealthy is worse than no update.
  • Junior wouldn't set max_fail_percentage; senior uses serial + max_fail_percentage as a circuit breaker. Why: if the update is bad, stop rolling it out to more servers.

Key Heuristics Used

  1. Progressive Rollout: Treat infrastructure patches like code deploys — dev, staging, prod in sequence. Each environment validates the change for the next.
  2. Serial + Circuit Breaker: Use serial for batch size and max_fail_percentage to stop the rollout if too many hosts fail.
  3. Post-Change Health Verification: Always verify health after applying changes. An update that breaks the service is a worse outcome than a vulnerable but running service (at least temporarily).

Cross-References

  • Primer — Ansible playbook structure, inventory management, and module types
  • Street Ops — Rollout patterns, ad-hoc commands for fleet inspection, and variable precedence
  • Footguns — Running without serial/max_fail_percentage, not testing on dev/staging first, and ansible-playbook without --check