
Monitoring Migration Footguns

Mistakes that leave you blind during a migration or create a monitoring system nobody trusts.


1. Big-bang cutover from Nagios to Prometheus

You spend a month building Prometheus rules, declare it "ready," and shut down Nagios on a Friday. Over the weekend, three incidents happen. Two are missed because the Prometheus rules had threshold errors. Nobody knows how to use Grafana yet. The team re-enables Nagios on Monday.

Fix: Run both systems in parallel for at least 4-6 weeks. Compare alert fidelity daily. Only cut over alerting once the team is confident and trained. Keep Nagios in read-only mode for another month after cutover.


2. Translating every Nagios check 1:1

You have 350 Nagios checks. You translate all 350 into Prometheus alerting rules. Half of them were noisy in Nagios and are noisy in Prometheus. The team immediately mutes everything, which is worse than having no monitoring.

Fix: Audit every check before translating. Ask: "Did anyone act on this alert in the last 6 months?" If not, do not migrate it. Use the migration as a cleanup opportunity. Start with 50 critical alerts, not 350.
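
The Nagios log itself is a good starting point for the audit: it shows which checks actually fired and how often. A rough sketch, assuming the default log path and the standard `SERVICE ALERT` log format (adjust both for your install):

```bash
# Rank service checks by how often they alerted recently.
# Fields in a SERVICE ALERT line are semicolon-separated;
# the second field is the service description.
grep -h 'SERVICE ALERT' /var/log/nagios/nagios.log \
  | awk -F';' '{print $2}' \
  | sort | uniq -c | sort -rn | head -50
```

Checks that fired hundreds of times with no corresponding human action are prime candidates to drop rather than migrate.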


3. Forgetting to migrate notification routing

You migrate all the checks but not the escalation policies. In Nagios, disk-full alerts went to the ops team and database alerts went to the DBA team. In Alertmanager, everything goes to a single Slack channel. The DBA team misses their alerts because they are buried in ops noise.

Fix: Map Nagios contact groups and notification rules to Alertmanager routes and receivers before cutover. Test routing with amtool to verify alerts reach the right teams.
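
A minimal sketch of that mapping in Alertmanager, assuming your Prometheus alerting rules attach a `team` label; the receiver names and Slack channels here are made-up examples:

```yaml
# Route tree mirroring old Nagios contact groups (sketch).
route:
  receiver: ops-default          # catch-all fallback
  routes:
    - matchers:
        - team = "dba"
      receiver: dba-team
    - matchers:
        - team = "ops"
      receiver: ops-team
receivers:
  - name: ops-default
    slack_configs:
      - channel: "#alerts"
  - name: dba-team
    slack_configs:
      - channel: "#dba-alerts"
  - name: ops-team
    slack_configs:
      - channel: "#ops-alerts"
```

Any alert without a `team` label falls through to the catch-all, so a periodic check for alerts landing in the default receiver catches unmapped rules.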


4. No SNMP migration plan for network devices

Nagios monitored 40 switches and routers via SNMP. You migrate servers to Prometheus but forget network devices. A core switch hits 95% CPU and nobody notices for 3 hours because SNMP checks died with Nagios.

Fix: Deploy snmp_exporter before decommissioning Nagios SNMP checks. Generate custom snmp.yml from your device MIBs. Verify every network device is scraped and key metrics (interface traffic, CPU, memory, errors) are being collected.
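
A sketch of the Prometheus scrape job for snmp_exporter, using the standard relabeling pattern where the device address becomes a query parameter and the exporter is the real scrape target; the device IPs, module name, and exporter address are placeholders for your environment:

```yaml
scrape_configs:
  - job_name: snmp-network
    static_configs:
      - targets:
          - 10.0.0.1      # core switch (example address)
          - 10.0.0.2      # router (example address)
    metrics_path: /snmp
    params:
      module: [if_mib]    # module name from your generated snmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # device becomes ?target= param
      - source_labels: [__param_target]
        target_label: instance         # keep device IP as the instance label
      - target_label: __address__
        replacement: snmp-exporter:9116  # where snmp_exporter actually runs
```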


5. Prometheus scrape interval mismatched to check expectations

Nagios checked critical services every 30 seconds. You set Prometheus scrape interval to 60 seconds. During an incident, the team sees data points 60 seconds apart in Grafana and complains that "the new system is less precise." They lose confidence.

Fix: Set scrape intervals to match or exceed the old check frequency for critical services. 15 seconds is a reasonable default. For high-frequency needs, use 5-10 seconds. Make sure the team understands the new cadence.
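
Scrape intervals can be overridden per job, so only the critical services need the tighter cadence. A sketch, where the job name and target are examples:

```yaml
global:
  scrape_interval: 60s        # coarse default for everything else
scrape_configs:
  - job_name: critical-services
    scrape_interval: 15s      # at least as frequent as the old 30s checks
    static_configs:
      - targets: ["app1:9100"]
```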


6. Deleting Nagios historical data immediately

You shut down Nagios and delete the server. Two months later, someone asks for CPU trend data from 6 months ago for capacity planning. That data only existed in Nagios RRD files. It is gone.

Fix: Keep the old Nagios/Zabbix server running in read-only mode for 6-12 months after migration. Export key historical data to CSV before decommissioning. Communicate the data retention gap to stakeholders.
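
Exporting a series from an RRD file can be done with `rrdtool fetch` before the server goes away. A sketch; the .rrd path, consolidation function, and data-source layout vary per install, so treat this as a starting point:

```bash
# Dump 6 months of averaged CPU data as timestamp/value text.
rrdtool fetch /var/lib/nagios/rrd/host1/cpu.rrd AVERAGE \
  --start now-6months --end now > cpu_host1.txt
```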


7. Not testing Alertmanager routing before cutover

You write Alertmanager config with routes and receivers. You deploy it. You do not test it. The first real alert triggers a route match that sends to a webhook URL that returns 404 because the Slack webhook was rotated last month.

Fix: Test every receiver before cutover. Use amtool alert add to inject test alerts. Verify they arrive in the correct channel/email/PagerDuty. Retest after any Alertmanager config change.
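
Two hedged examples with amtool: the first injects a synthetic alert into a running Alertmanager, the second dry-runs the routing tree against a label set without sending anything. The label names, URL, and config path are assumptions:

```bash
# Fire a synthetic alert end to end, so the real receiver gets hit.
amtool alert add TestDiskFull \
  severity=critical team=ops instance=db1.example.com \
  --annotation='summary=routing test, please ignore' \
  --alertmanager.url=http://localhost:9093

# Dry-run: which route would this label set match?
amtool config routes test team=dba severity=critical \
  --config.file=alertmanager.yml
```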


8. Assuming the team will learn PromQL on their own

You deploy Prometheus and Grafana. You send a link to the PromQL documentation. Nobody reads it. The on-call engineer gets paged, opens Grafana, and has no idea how to query for what they need. They SSH into the server and check manually — exactly what the old system let them avoid.

Fix: Schedule structured training: 4 sessions over 2 weeks. Cover basic PromQL, common queries, Grafana navigation, and alert investigation. Create cheat sheets. Build runbook dashboards that answer the most common on-call questions without requiring PromQL.
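
A few cheat-sheet starters worth including, assuming node_exporter's default metric names:

```promql
# CPU usage per instance, averaged over 5 minutes:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Free disk space as a percentage, per filesystem:
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100

# Which scrape targets are down right now?
up == 0
```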


9. Custom NRPE plugins with no documentation

You have 30 custom NRPE plugins written by someone who left 2 years ago. Each is a Bash or Perl script with no comments, no README, and cryptic variable names. You have no idea what they check or what the thresholds mean. You skip migrating them. One of them was the only check for a critical payment processing service.

Fix: Audit every custom plugin before the migration. Run each one manually and document what it checks. Identify which are critical. Write Prometheus equivalents (custom exporters, textfile collectors, or blackbox probes) for the critical ones.
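
For a simple pass/fail check, the textfile-collector route can be a few lines of shell that node_exporter picks up on its next scrape. A sketch, where the health URL and collector directory are assumptions for your environment:

```bash
#!/bin/sh
# Sketch: replace an undocumented NRPE check with a textfile metric.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
if curl -fsS --max-time 5 http://localhost:8080/health >/dev/null; then
  up=1
else
  up=0
fi
# Write to a temp file and rename, so node_exporter never reads a
# half-written file.
tmp="$TEXTFILE_DIR/payment_service.prom.$$"
cat > "$tmp" <<EOF
# HELP payment_service_up Health endpoint reachable (1) or not (0).
# TYPE payment_service_up gauge
payment_service_up $up
EOF
mv "$tmp" "$TEXTFILE_DIR/payment_service.prom"
```

Run it from cron more often than your scrape interval, and alert on the metric's absence as well as its value, so a broken cron job does not fail silently.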


10. Running parallel systems but not actually comparing them

You run Nagios and Prometheus side by side for 6 weeks. You never systematically compare which alerts fire in which system. You declare success because "nothing major happened." Then you cut over and discover 15 checks that were only in Nagios.

Fix: Run a daily comparison script. Log every alert from both systems. Identify discrepancies. The parallel run is only valuable if you actively compare. Assign someone to review the comparison report daily.
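
The comparison itself can be a small script. A sketch in Python, assuming each system's firing alerts for the day have been exported to a text file with one alert name per line (the file names are examples):

```python
import sys


def load_alerts(path):
    """Read one alert name per line, ignoring blank lines."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def compare(nagios_alerts, prom_alerts):
    """Return (nagios-only, prometheus-only) alert names, sorted."""
    return (sorted(nagios_alerts - prom_alerts),
            sorted(prom_alerts - nagios_alerts))


if __name__ == "__main__" and len(sys.argv) == 3:
    nagios_only, prom_only = compare(load_alerts(sys.argv[1]),
                                     load_alerts(sys.argv[2]))
    print("Fired only in Nagios:     ", nagios_only or "none")
    print("Fired only in Prometheus: ", prom_only or "none")
```

Run it daily from cron against both export files and post the output where the assigned reviewer will see it; "fired only in Nagios" entries are exactly the checks that would go dark at cutover.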