
Postmortem: Ansible Playbook Targets Production Instead of Staging

| Field | Value |
| --- | --- |
| ID | PM-016 |
| Date | 2025-04-08 |
| Severity | SEV-3 |
| Duration | 15m (detection to resolution) |
| Time to Detect | 10m |
| Time to Mitigate | 5m |
| Customer Impact | None — NTP misconfiguration was caught and reversed before clock drift affected any customer-facing service |
| Revenue Impact | None |
| Teams Involved | Platform Engineering, SRE On-Call, Infra Automation |
| Postmortem Author | Deirdre Calloway |
| Postmortem Date | 2025-04-10 |

Executive Summary

On 2025-04-08 at 14:22 UTC, an engineer on the Platform Engineering team ran an NTP configuration playbook against production hosts instead of staging, passing -i inventory/production.yml when -i inventory/staging.yml was intended. The playbook reconfigured 47 production hosts to point their NTP synchronization at a staging NTP server (ntp-staging-01.internal) that is less reliable and has no HA redundancy. The SRE on-call team's production NTP drift monitoring fired within 10 minutes, and the correct NTP configuration was restored from the production inventory within 5 minutes of detection. No clock drift of consequence occurred, and no customer-facing services were affected.

Timeline (All times UTC)

| Time | Event |
| --- | --- |
| 14:15 | Engineer Tomasz Wiśniewski opens a terminal to run configure-ntp.yml against staging to validate a change to NTP server pool selection |
| 14:19 | Tomasz clones the playbook invocation from a Slack snippet shared by a colleague the previous week; the snippet references inventory/production.yml |
| 14:22 | Playbook executes: ansible-playbook -i inventory/production.yml configure-ntp.yml — no confirmation prompt is shown; the playbook begins targeting all 47 hosts in the production [ntp_clients] group |
| 14:23 | Playbook completes successfully; chrony.conf on 47 production hosts now references ntp-staging-01.internal instead of the production NTP pool |
| 14:24 | Hosts begin syncing time from ntp-staging-01.internal; this server has higher jitter and occasional unavailability during staging load tests |
| 14:32 | PagerDuty alert fires: "Production NTP drift > 50ms on 12 hosts" — Alertmanager rule NTPClientDriftHigh triggers; SRE on-call Priya Nambiar acknowledges |
| 14:33 | Priya SSHes to app-prod-07.internal and runs chronyc sources -v; she immediately sees ntp-staging-01.internal as the selected source, which is anomalous for a production host |
| 14:34 | Priya checks the Ansible automation log in Slack #infra-changes and sees Tomasz's 14:22 playbook run against production.yml |
| 14:35 | Priya pages Tomasz directly; Tomasz confirms the wrong inventory was used |
| 14:36 | Tomasz re-runs the playbook with the correct invocation, ansible-playbook -i inventory/production.yml configure-ntp.yml, so that the production inventory's ntp_servers values reapply the correct pool |
| 14:37 | Re-run with the production NTP pool config completes; all 47 hosts now reference ntp-pool-prod.internal |
| 14:38 | chronyc sources -v on sample hosts confirms the correct NTP source; drift subsides |
| 14:40 | NTP drift alert resolves; Priya marks the incident mitigated |
| 14:41 | Tomasz posts an incident summary in #incidents; SEV-3 is declared retroactively |
| 14:45 | Postmortem scheduled for 2025-04-10 |

Impact

Customer Impact

None. NTP misconfiguration at this scale requires sustained clock drift (typically > 5 minutes of desync) before applications begin experiencing timeout mismatches, certificate validation failures, or distributed transaction ordering errors. The 15-minute window of incorrect NTP configuration, combined with the fact that ntp-staging-01.internal was still reachable and providing roughly correct time (just with higher jitter), meant that no host accumulated meaningful drift before the configuration was corrected.

Internal Impact

  • Tomasz Wiśniewski: approximately 30 minutes of unplanned work (the playbook run, the correction, and initial incident documentation)
  • Priya Nambiar (SRE on-call): approximately 45 minutes of on-call investigation and coordination
  • 1 postmortem meeting scheduled (1 hour, 4 attendees)
  • Delayed Tomasz's planned staging validation by approximately 2 hours while the incident was processed

Data Impact

None. NTP configuration is stateless and was fully restored. No data was written to or read from incorrect locations as a result of the clock misconfiguration.

Root Cause

What Happened (Technical)

The immediate cause was a copy-paste error: Tomasz copied an Ansible playbook invocation from a Slack message without verifying the -i flag value. The Slack snippet had been shared in the context of a production runbook demonstration the prior week and explicitly referenced inventory/production.yml. When Tomasz used it as a template for a staging-targeting run, he did not update the inventory path.

The configure-ntp.yml playbook itself is correct. It reads the NTP server pool from a variable (ntp_servers) that is defined in the inventory group vars. inventory/production.yml defines ntp_servers to include the production pool entries; inventory/staging.yml defines a different set including ntp-staging-01.internal. Because the wrong inventory was passed, the playbook applied staging NTP configuration to all hosts in the production [ntp_clients] group.
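The shape of the two inventories described above can be sketched roughly as follows. Only the [ntp_clients] group name and the server/host names mentioned in this postmortem come from the incident; the exact YAML layout and the staging hostname are illustrative assumptions:

```yaml
# Hypothetical sketch of the two inventory files' structure.

# inventory/production.yml
all:
  children:
    ntp_clients:
      hosts:
        app-prod-07.internal:          # one of the 47 production hosts
      vars:
        ntp_servers:
          - ntp-pool-prod.internal     # production NTP pool

# inventory/staging.yml -- identical structure, different values
all:
  children:
    ntp_clients:
      hosts:
        app-stg-01.staging.internal:   # hypothetical staging host
      vars:
        ntp_servers:
          - ntp-staging-01.internal    # staging NTP server, no HA
```

Because the structure and group names are identical, nothing in the playbook itself distinguishes the two runs; only the -i argument does.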

There is no confirmation step in the playbook invocation workflow. Ansible does not prompt "you are targeting 47 production hosts — continue?" by default, and the team has not added a custom pre-task confirmation prompt for production-tagged inventories. The playbook ran to completion in approximately 60 seconds with no human checkpoint.

The inventory/production.yml and inventory/staging.yml files share identical top-level structure and group names. The only meaningful difference visible at a glance is the hostname suffix (.internal vs. .staging.internal) and the NTP variable values buried in the group vars. An engineer scanning the inventory filename quickly could easily miss which environment they are targeting.

Contributing Factors

  1. Similar inventory filenames with no visual guard: inventory/production.yml and inventory/staging.yml differ only by environment name. There is no color coding, no prefix like PROD_ vs. STG_, and no file permission restriction on the production inventory file.

  2. No dry-run (--check) step in the team's standard workflow: The Platform Engineering runbook for applying configuration playbooks does not mandate a --check pass before a live run. Had --check been standard practice, Tomasz would have seen the plan applied to production hosts and likely caught the error.

  3. No confirmation prompt for production-tagged runs: Ansible supports pre-task prompts (vars_prompt) and pause tasks conditioned on inventory metadata. Neither is implemented in this playbook. A simple when: "'production' in group_names" guard with a pause task requiring manual "yes" input would have stopped the run before any changes were applied.
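A minimal version of the guard described in item 3 might look like the sketch below. The group_names condition is the approach named above; the task names, prompt wording, and the assumption that the production inventory places hosts in a group named "production" are illustrative:

```yaml
# Hypothetical pre-task guard: halt and require a typed "yes" before any
# play whose hosts belong to a production-tagged group. Assumes the
# production inventory defines a group literally named "production".
- name: Configure NTP with production confirmation gate
  hosts: ntp_clients
  gather_facts: false
  pre_tasks:
    - name: Pause for confirmation on production inventories
      ansible.builtin.pause:
        prompt: "You are targeting PRODUCTION hosts. Type 'yes' to continue"
      register: confirm
      when: "'production' in group_names"

    - name: Abort unless explicitly confirmed
      ansible.builtin.fail:
        msg: "Production run not confirmed; aborting."
      when:
        - "'production' in group_names"
        - confirm.user_input | default('') != 'yes'
```

On a staging inventory both tasks are skipped, so the gate adds friction only at the point of no return.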

What We Got Lucky About

  1. NTP misconfiguration is one of the slowest-acting production mistakes available. Meaningful clock drift that causes application-level failures typically takes 10–60 minutes depending on the application's tolerance. The 10-minute detection window was fast enough to intervene before any drift-sensitive code path was exercised.

  2. The NTP drift monitoring alert was already in place and tuned conservatively (50ms threshold). If this had been a playbook that modified firewall rules, security group settings, or service discovery configurations, the blast radius would have been immediate and far larger before any alert fired.

  3. The staging NTP server was reachable from production hosts. If ntp-staging-01.internal had been unreachable from the production network segment, hosts would have fallen back to their hardware clock immediately and could have drifted faster, potentially triggering more severe alerts sooner — or not alerting at all if fallback behavior masked the anomaly.

Detection

How We Detected

The SRE team's production NTP monitoring rule (NTPClientDriftHigh) fires when any production host exceeds 50ms of offset from its configured NTP source, sustained for 2 minutes. ntp-staging-01.internal has higher jitter than the production pool (it is co-located with load-test infrastructure that creates network noise), so within 10 minutes of hosts switching to it, 12 hosts crossed the drift threshold. PagerDuty paged Priya at 14:32 UTC.

Why We Didn't Detect Sooner

The alert threshold (50ms sustained for 2 minutes) is appropriate for detecting real NTP problems but is not sensitive enough to catch the moment of reconfiguration. Between 14:23 and 14:32, the hosts were synchronizing to a degraded but functional NTP source. An alert on NTP source hostname change (i.e., "production hosts are now syncing from a non-production server") would have fired immediately at 14:23. No such alert exists.
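The missing source-identity alert could be expressed as a Prometheus-style rule along these lines. Only the intent — "production hosts are syncing from a non-production source" — comes from this postmortem; the metric name chrony_source_selected and its labels are assumptions that would need to match whatever the chrony exporter in use actually emits:

```yaml
# Hypothetical alerting rule: fire when any production host reports a
# selected NTP source outside the approved production pool pattern.
groups:
  - name: ntp-source-identity
    rules:
      - alert: NTPClientUnapprovedSource
        expr: 'chrony_source_selected{env="production", source!~"ntp-pool-prod.*\\.internal"} == 1'
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is syncing time from unapproved source {{ $labels.source }}"
```

Unlike the drift rule, this would have fired at 14:23, the moment the configuration changed, rather than nine minutes later when the symptom crossed a threshold.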

Response

What Went Well

  1. The NTP drift alert was properly tuned and paged the correct on-call engineer without delay.
  2. Priya's first investigative step (checking the NTP source with chronyc sources -v) directly identified the root cause within 1 minute of acknowledging the alert.
  3. The #infra-changes Slack channel had an automatic log of the Ansible playbook invocation, including the exact command and inventory path, making root cause identification immediate.
  4. The remediation (re-running the playbook with the correct configuration) was a single command that took under 2 minutes to execute and verify.

What Went Poorly

  1. There was no production-run confirmation gate in the playbook workflow. A 30-second friction step would have prevented the incident entirely.
  2. The team's runbook for playbook execution does not require a --check pass. This is a known gap that has been discussed but not acted on.
  3. The Slack snippet sharing culture means engineers copy invocations that may reference the wrong environment without noticing; there is no Slack snippet hygiene or annotation convention.

Action Items

| ID | Action | Priority | Owner | Status | Due Date |
| --- | --- | --- | --- | --- | --- |
| AI-016-01 | Add a pre-task confirmation prompt to all playbooks when inventory_hostname is in a group tagged env: production | High | Tomasz Wiśniewski | Open | 2025-04-18 |
| AI-016-02 | Update the Platform Engineering runbook to mandate a --check dry-run before any production playbook execution | High | Deirdre Calloway | Open | 2025-04-15 |
| AI-016-03 | Add an NTP source hostname alert: fire if any production host's NTP source does not match the approved production pool pattern | Medium | Priya Nambiar | Open | 2025-04-25 |
| AI-016-04 | Rename inventory files with explicit env prefixes (PROD-inventory.yml / STG-inventory.yml) and update all docs/scripts | Medium | Deirdre Calloway | Open | 2025-04-22 |
| AI-016-05 | Restrict inventory/production.yml file permissions (read-only for non-privileged users) with a wrapper script for authorized use | Low | Tomasz Wiśniewski | Open | 2025-04-30 |
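AI-016-05's wrapper could start as small as this sketch. Only the inventory/production.yml path convention and the ansible-playbook invocation come from the incident; the function name, argument handling, and prompt wording are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper sketch for AI-016-05: require a typed "yes" before
# invoking ansible-playbook against any inventory whose path mentions
# "production". Non-production inventories pass through without friction.
set -euo pipefail

run_playbook() {
  local inventory="$1" playbook="$2" answer
  if [[ "$inventory" == *production* ]]; then
    read -r -p "Targeting PRODUCTION via ${inventory}. Type 'yes' to continue: " answer
    if [[ "$answer" != "yes" ]]; then
      echo "Aborted: production run not confirmed." >&2
      return 1
    fi
  fi
  ansible-playbook -i "$inventory" "$playbook"
}

# Example: run_playbook inventory/staging.yml configure-ntp.yml
```

Pairing this with the file-permission restriction (so direct ansible-playbook runs against the production inventory fail for non-privileged users) makes the wrapper the path of least resistance rather than an optional courtesy.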

Lessons Learned

  1. Confirmation gates are cheap insurance: A single pause task or vars_prompt requiring a human "yes" before touching production hosts costs 10 seconds and prevents the entire class of wrong-inventory mistakes. Frictionless automation is a virtue in development environments; production automation should have deliberate friction at the point of no return.

  2. Alert on configuration identity, not just configuration outcomes: Alerting on NTP drift detected the symptom. Alerting on "production hosts are syncing from a non-production NTP source" would have detected the cause at the moment it happened. Consider adding source-identity assertions to monitoring wherever the identity of a dependency matters as much as its behavior.

  3. Shared command snippets are living hazards: A Slack message with a working command becomes an authoritative-looking template the moment it is shared. Teams that rely on copy-paste invocations should maintain a canonical runbook location (not Slack) for operational commands, with environment parameters explicitly documented and validated before use.

Cross-References

  • Failure Pattern: Human error / wrong-target execution; environment confusion
  • Topic Packs: Ansible best practices, inventory management, production change controls
  • Runbook: runbooks/ansible/playbook-execution-checklist.md
  • Decision Tree: Triage → NTP anomaly → check chronyc sources → verify against approved NTP pool list → check recent Ansible runs in #infra-changes