
Postmortem: Ansible Playbook Targets Production Instead of Staging

| Field | Value |
| --- | --- |
| ID | PM-016 |
| Date | 2025-04-08 |
| Severity | SEV-3 |
| Duration | 15m (detection to resolution) |
| Time to Detect | 10m |
| Time to Mitigate | 5m |
| Customer Impact | None — NTP misconfiguration was caught and reversed before clock drift affected any customer-facing service |
| Revenue Impact | None |
| Teams Involved | Platform Engineering, SRE On-Call, Infra Automation |
| Postmortem Author | Deirdre Calloway |
| Postmortem Date | 2025-04-10 |

Executive Summary

On 2025-04-08 at 14:22 UTC, an engineer on the Platform Engineering team ran an NTP configuration playbook against production hosts instead of staging, passing -i inventory/production.yml when -i inventory/staging.yml was intended. The playbook reconfigured 47 production hosts to point their NTP synchronization at a staging NTP server (ntp-staging-01.internal) that is less reliable and has no HA redundancy. The SRE on-call team's production NTP drift monitoring fired within 10 minutes, and the correct NTP configuration was restored from the production inventory within 5 minutes of detection. No clock drift of consequence occurred, and no customer-facing services were affected.

Timeline (All times UTC)

| Time | Event |
| --- | --- |
| 14:15 | Engineer Tomasz Wiśniewski opens a terminal to run configure-ntp.yml against staging to validate a change to NTP server pool selection |
| 14:19 | Tomasz clones the playbook invocation from a Slack snippet shared by a colleague the previous week; the snippet references inventory/production.yml |
| 14:22 | Playbook executes: ansible-playbook -i inventory/production.yml configure-ntp.yml — no confirmation prompt is shown; the playbook begins targeting all 47 hosts in the production [ntp_clients] group |
| 14:23 | Playbook completes successfully; chrony.conf on 47 production hosts now references ntp-staging-01.internal instead of the production NTP pool |
| 14:24 | Hosts begin syncing time from ntp-staging-01.internal; this server has higher jitter and occasional unavailability during staging load tests |
| 14:32 | PagerDuty alert fires: "Production NTP drift > 50ms on 12 hosts" — Alertmanager rule NTPClientDriftHigh triggers; SRE on-call Priya Nambiar acknowledges |
| 14:33 | Priya SSHes to app-prod-07.internal and runs chronyc sources -v; she immediately sees ntp-staging-01.internal as the selected source, which is anomalous for a production host |
| 14:34 | Priya checks the Ansible automation log in Slack #infra-changes and sees Tomasz's 14:22 playbook run against production.yml |
| 14:35 | Priya pages Tomasz directly; Tomasz confirms the wrong inventory was used |
| 14:36 | Tomasz re-runs the playbook with the correct invocation, ansible-playbook -i inventory/production.yml configure-ntp.yml, so that the production inventory's ntp_servers values reapply the correct pool |
| 14:37 | Re-run with the production NTP pool config completes; all 47 hosts now reference ntp-pool-prod.internal |
| 14:38 | chronyc sources -v on sample hosts confirms the correct NTP source; drift subsides |
| 14:40 | NTP drift alert resolves; Priya marks the incident mitigated |
| 14:41 | Tomasz posts an incident summary in #incidents; SEV-3 is declared retroactively |
| 14:45 | Postmortem scheduled for 2025-04-10 |

Impact

Customer Impact

None. NTP misconfiguration at this scale requires sustained clock drift (typically > 5 minutes of desync) before applications begin experiencing timeout mismatches, certificate validation failures, or distributed transaction ordering errors. The 15-minute window of incorrect NTP configuration, combined with the fact that ntp-staging-01.internal was still reachable and providing roughly correct time (just with higher jitter), meant that no host accumulated meaningful drift before the configuration was corrected.

Internal Impact

  • Tomasz Wiśniewski: approximately 30 minutes of unplanned work (the playbook run, the correction, and initial incident documentation)
  • Priya Nambiar (SRE on-call): approximately 45 minutes of on-call investigation and coordination
  • 1 postmortem meeting scheduled (1 hour, 4 attendees)
  • Delayed Tomasz's planned staging validation by approximately 2 hours while the incident was processed

Data Impact

None. NTP configuration is stateless and was fully restored. No data was written to or read from incorrect locations as a result of the clock misconfiguration.

Root Cause

What Happened (Technical)

The immediate cause was a copy-paste error: Tomasz copied an Ansible playbook invocation from a Slack message without verifying the -i flag value. The Slack snippet had been shared in the context of a production runbook demonstration the prior week and explicitly referenced inventory/production.yml. When Tomasz used it as a template for a staging-targeting run, he did not update the inventory path.

The configure-ntp.yml playbook itself is correct. It reads the NTP server pool from a variable (ntp_servers) that is defined in the inventory group vars. inventory/production.yml defines ntp_servers to include the production pool entries; inventory/staging.yml defines a different set including ntp-staging-01.internal. Because the wrong inventory was passed, the playbook applied staging NTP configuration to all hosts in the production [ntp_clients] group.
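The shape of the two inventories described above can be sketched roughly as follows. Only the [ntp_clients] group name and the server/host names mentioned in this postmortem come from the incident; the exact YAML layout and the staging hostname are illustrative assumptions:

```yaml
# Hypothetical sketch of the two inventory files' structure.

# inventory/production.yml
all:
  children:
    ntp_clients:
      hosts:
        app-prod-07.internal:          # one of the 47 production hosts
      vars:
        ntp_servers:
          - ntp-pool-prod.internal     # production NTP pool

# inventory/staging.yml -- identical structure, different values
all:
  children:
    ntp_clients:
      hosts:
        app-stg-01.staging.internal:   # hypothetical staging host
      vars:
        ntp_servers:
          - ntp-staging-01.internal    # staging NTP server, no HA
```

Because the structure and group names are identical, nothing in the playbook itself distinguishes the two runs; only the -i argument does.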

There is no confirmation step in the playbook invocation workflow. Ansible does not prompt "you are targeting 47 production hosts — continue?" by default, and the team has not added a custom pre-task confirmation prompt for production-tagged inventories. The playbook ran to completion in approximately 60 seconds with no human checkpoint.

The inventory/production.yml and inventory/staging.yml files share identical top-level structure and group names. The only meaningful difference visible at a glance is the hostname suffix (.internal vs. .staging.internal) and the NTP variable values buried in the group vars. An engineer scanning the inventory filename quickly could easily miss which environment they are targeting.

Contributing Factors

  1. Similar inventory filenames with no visual guard: inventory/production.yml and inventory/staging.yml differ only by environment name. There is no color coding, no prefix like PROD_ vs. STG_, and no file permission restriction on the production inventory file.

  2. No dry-run (--check) step in the team's standard workflow: The Platform Engineering runbook for applying configuration playbooks does not mandate a --check pass before a live run. Had --check been standard practice, Tomasz would have seen the plan applied to production hosts and likely caught the error.

  3. No confirmation prompt for production-tagged runs: Ansible supports pre-task prompts (vars_prompt) and pause tasks conditioned on inventory metadata. Neither is implemented in this playbook. A simple when: "'production' in group_names" guard with a pause task requiring manual "yes" input would have stopped the run before any changes were applied.
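A minimal version of the guard described in item 3 might look like the sketch below. The group_names condition is the approach named above; the task names, prompt wording, and the assumption that the production inventory places hosts in a group named "production" are illustrative:

```yaml
# Hypothetical pre-task guard: halt and require a typed "yes" before any
# play whose hosts belong to a production-tagged group. Assumes the
# production inventory defines a group literally named "production".
- name: Configure NTP with production confirmation gate
  hosts: ntp_clients
  gather_facts: false
  pre_tasks:
    - name: Pause for confirmation on production inventories
      ansible.builtin.pause:
        prompt: "You are targeting PRODUCTION hosts. Type 'yes' to continue"
      register: confirm
      when: "'production' in group_names"

    - name: Abort unless explicitly confirmed
      ansible.builtin.fail:
        msg: "Production run not confirmed; aborting."
      when:
        - "'production' in group_names"
        - confirm.user_input | default('') != 'yes'
```

On a staging inventory both tasks are skipped, so the gate adds friction only at the point of no return.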

What We Got Lucky About

  1. NTP misconfiguration is one of the slowest-acting production mistakes available. Meaningful clock drift that causes application-level failures typically takes 10–60 minutes depending on the application's tolerance. The 10-minute detection window was fast enough to intervene before any drift-sensitive code path was exercised.

  2. The NTP drift monitoring alert was already in place and tuned conservatively (50ms threshold). If this had been a playbook that modified firewall rules, security group settings, or service discovery configurations, the blast radius would have been immediate and far larger before any alert fired.

  3. The staging NTP server was reachable from production hosts. If ntp-staging-01.internal had been unreachable from the production network segment, hosts would have fallen back to their hardware clock immediately and could have drifted faster, potentially triggering more severe alerts sooner — or not alerting at all if fallback behavior masked the anomaly.

Detection

How We Detected

The SRE team's production NTP monitoring rule (NTPClientDriftHigh) fires when any production host exceeds 50ms of offset from its configured NTP source, sustained for 2 minutes. ntp-staging-01.internal has higher jitter than the production pool (it is co-located with load-test infrastructure that creates network noise), so within 10 minutes of hosts switching to it, 12 hosts crossed the drift threshold. PagerDuty paged Priya at 14:32 UTC.

Why We Didn't Detect Sooner

The alert threshold (50ms sustained for 2 minutes) is appropriate for detecting real NTP problems but is not sensitive enough to catch the moment of reconfiguration. Between 14:23 and 14:32, the hosts were synchronizing to a degraded but functional NTP source. An alert on NTP source hostname change (i.e., "production hosts are now syncing from a non-production server") would have fired immediately at 14:23. No such alert exists.
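The missing source-identity alert could be expressed as a Prometheus-style rule along these lines. Only the intent — "production hosts are syncing from a non-production source" — comes from this postmortem; the metric name chrony_source_selected and its labels are assumptions that would need to match whatever the chrony exporter in use actually emits:

```yaml
# Hypothetical alerting rule: fire when any production host reports a
# selected NTP source outside the approved production pool pattern.
groups:
  - name: ntp-source-identity
    rules:
      - alert: NTPClientUnapprovedSource
        expr: 'chrony_source_selected{env="production", source!~"ntp-pool-prod.*\\.internal"} == 1'
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is syncing time from unapproved source {{ $labels.source }}"
```

Unlike the drift rule, this would have fired at 14:23, the moment the configuration changed, rather than nine minutes later when the symptom crossed a threshold.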

Response

What Went Well

  1. The NTP drift alert was properly tuned and paged the correct on-call engineer without delay.
  2. Priya's first investigative step (checking the NTP source with chronyc sources -v) directly identified the root cause within 1 minute of acknowledging the alert.
  3. The #infra-changes Slack channel had an automatic log of the Ansible playbook invocation, including the exact command and inventory path, making root cause identification immediate.
  4. The remediation (re-running the playbook with the correct configuration) was a single command that took under 2 minutes to execute and verify.

What Went Poorly

  1. There was no production-run confirmation gate in the playbook workflow. A 30-second friction step would have prevented the incident entirely.
  2. The team's runbook for playbook execution does not require a --check pass. This is a known gap that has been discussed but not acted on.
  3. The Slack snippet sharing culture means engineers copy invocations that may reference the wrong environment without noticing; there is no Slack snippet hygiene or annotation convention.

Action Items

| ID | Action | Priority | Owner | Status | Due Date |
| --- | --- | --- | --- | --- | --- |
| AI-016-01 | Add a pre-task confirmation prompt to all playbooks when inventory_hostname is in a group tagged env: production | High | Tomasz Wiśniewski | Open | 2025-04-18 |
| AI-016-02 | Update the Platform Engineering runbook to mandate a --check dry-run before any production playbook execution | High | Deirdre Calloway | Open | 2025-04-15 |
| AI-016-03 | Add an NTP source hostname alert: fire if any production host's NTP source does not match the approved production pool pattern | Medium | Priya Nambiar | Open | 2025-04-25 |
| AI-016-04 | Rename inventory files with explicit env prefixes (PROD-inventory.yml / STG-inventory.yml) and update all docs/scripts | Medium | Deirdre Calloway | Open | 2025-04-22 |
| AI-016-05 | Restrict inventory/production.yml file permissions (read-only for non-privileged users) with a wrapper script for authorized use | Low | Tomasz Wiśniewski | Open | 2025-04-30 |
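AI-016-05's wrapper could start as small as this sketch. Only the inventory/production.yml path convention and the ansible-playbook invocation come from the incident; the function name, argument handling, and prompt wording are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper sketch for AI-016-05: require a typed "yes" before
# invoking ansible-playbook against any inventory whose path mentions
# "production". Non-production inventories pass through without friction.
set -euo pipefail

run_playbook() {
  local inventory="$1" playbook="$2" answer
  if [[ "$inventory" == *production* ]]; then
    read -r -p "Targeting PRODUCTION via ${inventory}. Type 'yes' to continue: " answer
    if [[ "$answer" != "yes" ]]; then
      echo "Aborted: production run not confirmed." >&2
      return 1
    fi
  fi
  ansible-playbook -i "$inventory" "$playbook"
}

# Example: run_playbook inventory/staging.yml configure-ntp.yml
```

Pairing this with the file-permission restriction (so direct ansible-playbook runs against the production inventory fail for non-privileged users) makes the wrapper the path of least resistance rather than an optional courtesy.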

Lessons Learned

  1. Confirmation gates are cheap insurance: A single pause task or vars_prompt requiring a human "yes" before touching production hosts costs 10 seconds and prevents the entire class of wrong-inventory mistakes. Frictionless automation is a virtue in development environments; production automation should have deliberate friction at the point of no return.

  2. Alert on configuration identity, not just configuration outcomes: Alerting on NTP drift detected the symptom. Alerting on "production hosts are syncing from a non-production NTP source" would have detected the cause at the moment it happened. Consider adding source-identity assertions to monitoring wherever the identity of a dependency matters as much as its behavior.

  3. Shared command snippets are living hazards: A Slack message with a working command becomes an authoritative-looking template the moment it is shared. Teams that rely on copy-paste invocations should maintain a canonical runbook location (not Slack) for operational commands, with environment parameters explicitly documented and validated before use.

Cross-References

  • Failure Pattern: Human error / wrong-target execution; environment confusion
  • Topic Packs: Ansible best practices, inventory management, production change controls
  • Runbook: runbooks/ansible/playbook-execution-checklist.md
  • Decision Tree: Triage → NTP anomaly → check chronyc sources → verify against approved NTP pool list → check recent Ansible runs in #infra-changes