Portal | Level: L2: Operations | Topics: Monitoring Migration, Monitoring Fundamentals, Prometheus, Grafana | Domain: Observability
Monitoring Migration (Legacy to Modern) - Primer¶
Why This Matters¶
If you have been in infrastructure long enough, you have inherited a Nagios or Zabbix installation that someone set up in 2014 and nobody fully understands anymore. The checks work — mostly — but adding new ones requires tribal knowledge, the dashboards are from a different era, and the alerting is a maze of email rules and escalation chains. Moving to Prometheus and Grafana is not just a technology upgrade. It is a chance to rethink what you monitor, how you alert, and how your team interacts with operational data.
I have done this migration three times at different organizations, each with hundreds of Nagios checks and years of accumulated configuration. The technology swap is the easy part. The hard part is the parallel-run period, the team training, the check-by-check equivalence mapping, and the political challenge of convincing people to trust the new system before you turn off the old one.
Core Concepts¶
1. Why Migrate?¶
| Legacy (Nagios/Zabbix) | Modern (Prometheus/Grafana) |
|---|---|
| Check-based (pass/fail) | Metric-based (continuous values) |
| Push or active polling | Pull model (scrape) |
| Plugin scripts per check | Exporters expose all metrics |
| Static thresholds | Dynamic queries (PromQL) |
| Host-centric | Service-centric, label-based |
| Config files / web UI | Config-as-code (YAML) |
| Limited dimensional data | Multi-dimensional labels |
| Custom dashboards are painful | Grafana dashboards are powerful |
| Scaling requires proxies/agents | Federation and remote write |
2. Migration Phases¶
Phase 1: Assessment (2-4 weeks)
├── Inventory all existing checks
├── Classify by type (host, service, app, network)
├── Identify check owners
├── Map checks to Prometheus equivalents
└── Document current alerting rules and escalations
Phase 2: Foundation (2-4 weeks)
├── Deploy Prometheus + Grafana
├── Deploy node_exporter on all hosts
├── Deploy blackbox_exporter for endpoint checks
├── Set up basic recording rules
└── Create initial dashboards
Phase 3: Parallel Run (4-8 weeks)
├── Both systems monitoring simultaneously
├── Compare alert fidelity (same incidents detected?)
├── Tune Prometheus thresholds to match reality
├── Train team on PromQL and Grafana
└── Build confidence in new system
Phase 4: Cutover (1-2 weeks)
├── Route alerting through Prometheus/Alertmanager
├── Disable legacy alerting (keep monitoring passive)
├── Monitor for gaps (incidents legacy caught but new missed)
└── Team on-call uses new system exclusively
Phase 5: Decommission (2-4 weeks)
├── Remove legacy agents (NRPE, Zabbix agent)
├── Archive legacy configuration
├── Shut down legacy server
└── Update documentation and runbooks
3. Mapping Nagios Checks to Prometheus¶
| Nagios Check | Prometheus Equivalent |
|---|---|
| check_disk | node_filesystem_avail_bytes / node_filesystem_size_bytes via node_exporter |
| check_load | node_load1, node_load5, node_load15 via node_exporter |
| check_mem | node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes via node_exporter |
| check_procs | node_procs_running, node_procs_blocked via node_exporter |
| check_swap | node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes |
| check_http | probe_success, probe_http_status_code via blackbox_exporter |
| check_tcp | probe_success with TCP module via blackbox_exporter |
| check_ping | probe_success with ICMP module via blackbox_exporter |
| check_ntp | node_ntp_offset_seconds via node_exporter (ntp collector, not enabled by default) |
| check_mysql | mysql_up, mysql_global_status_* via mysqld_exporter |
| check_postgres | pg_up, pg_stat_* via postgres_exporter |
| Custom NRPE plugin | Custom exporter or pushgateway for batch jobs |
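The last row deserves a concrete sketch. A batch job that used to run as an NRPE plugin can push its result to a Pushgateway, which Prometheus then scrapes like any other target. This is an illustrative fragment; the `pushgateway:9091` hostname is a placeholder for your environment:

```yaml
# Sketch: scraping a Pushgateway so short-lived batch jobs
# (former NRPE cron checks) can still report metrics.
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep the job/instance labels the batch job pushed
    static_configs:
      - targets: ['pushgateway:9091']   # hostname is a placeholder
```

Setting `honor_labels: true` matters here: without it, Prometheus overwrites the pushed `job` and `instance` labels with the Pushgateway's own, and every batch job's metrics collapse into one series set.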
4. Nagios Check to PromQL Translation¶
```yaml
# Nagios: check_disk -w 20% -c 10% -p /
# Translation:
- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"

# Nagios: check_load -w 4,3,2 -c 8,6,4
# Translation (load normalized by CPU count; dropping the cpu and mode
# labels keeps the CPU count joinable against node_load5):
- alert: HighLoad
  expr: node_load5 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 0.75
  for: 10m
  labels:
    severity: warning

# Nagios: check_http -H app.example.com -u /health -t 10
# Translation (via blackbox_exporter):
- alert: EndpointDown
  expr: probe_success{job="blackbox-http"} == 0
  for: 2m
  labels:
    severity: critical
```
5. Exporter Deployment¶
Prometheus
(scrapes every 15-60s)
│
┌────────────┼────────────┐
│ │ │
node_exporter blackbox app metrics
(port 9100) (port 9115) (port 8080/metrics)
│ │ │
OS metrics HTTP/TCP Application
CPU/mem/disk endpoint custom metrics
network/io probes
```yaml
# Prometheus scrape config for node_exporter
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
    # Or use service discovery:
    # ec2_sd_configs:
    #   - region: us-east-1
    #     port: 9100

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com/health
          - https://api.example.com/status
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
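The `module: [http_2xx]` parameter in the scrape job refers to a module defined on the blackbox_exporter side. A minimal `blackbox.yml` covering the HTTP, TCP, and ICMP checks mapped earlier might look like this; the timeout values are illustrative:

```yaml
# blackbox.yml — module definitions referenced by the Prometheus scrape config.
modules:
  http_2xx:
    prober: http
    timeout: 10s              # mirrors Nagios check_http -t 10
    http:
      valid_status_codes: []  # empty list defaults to accepting any 2xx
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
```

One module can serve many targets; the scrape config chooses which module to run per job, so you rarely need more than a handful of modules to replace dozens of check_http/check_tcp/check_ping definitions.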
6. Zabbix to Prometheus Mapping¶
| Zabbix Concept | Prometheus Equivalent |
|---|---|
| Zabbix Agent | node_exporter + app exporters |
| Zabbix Proxy | Prometheus federation or remote write |
| Item (data collection) | Scrape target metric |
| Trigger | Alerting rule (PromQL expression) |
| Template | Recording rules + dashboards |
| Host group | Labels (job, environment, team) |
| Action (alerting) | Alertmanager routes and receivers |
| Screen/Dashboard | Grafana dashboard |
| Discovery rule | Service discovery (EC2, Consul, K8s) |
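The last row — replacing Zabbix discovery rules — does not require a cloud provider. The simplest option is file-based service discovery, where Prometheus watches target files that any inventory tool or script can write. The paths and labels below are illustrative:

```yaml
# prometheus.yml fragment: file-based service discovery.
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # path is a placeholder
        refresh_interval: 5m

# /etc/prometheus/targets/web.yml — written by your inventory tooling.
# - targets: ['web1:9100', 'web2:9100']
#   labels:
#     team: platform          # replaces a Zabbix host group
#     environment: production
```

Prometheus picks up changes to the target files without a restart, which gives you the "hosts appear automatically" behavior teams expect from Zabbix discovery.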
7. Data Migration Considerations¶
What to migrate:
✓ Alert rules (translated to PromQL)
✓ Dashboard layouts (recreated in Grafana)
✓ Contact/escalation policies (Alertmanager routes)
✓ Downtime/maintenance windows (Alertmanager silences)
What NOT to migrate:
✗ Historical metric data (different data model)
✗ Legacy check scripts (replace with exporters)
✗ Agent configurations (replaced by exporters)
Historical data strategy:
Option A: Accept the data break. New system, new data.
Option B: Keep legacy system read-only for N months for historical queries.
Option C: Export key metrics to CSV for compliance/reporting.
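For the contact and escalation policies listed under "what to migrate", a legacy "email the team, page on critical" policy translates into an Alertmanager routing tree. This is a hedged sketch; receiver names, addresses, and the webhook URL are placeholders for your environment:

```yaml
# alertmanager.yml — escalation policy translated from a legacy contact group.
route:
  receiver: ops-email                  # default, like a Nagios catch-all contact
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: ops-pager              # criticals escalate past email
receivers:
  - name: ops-email
    email_configs:
      - to: ops@example.com            # placeholder; needs global SMTP settings
  - name: ops-pager
    webhook_configs:
      - url: http://pager-bridge:8080/alert   # placeholder paging integration
```

Unlike Nagios escalation chains, routing is evaluated top-down against alert labels, so the `severity` label you attach in alerting rules becomes the escalation mechanism.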
8. Team Training Plan¶
Week 1: Foundations
- Prometheus data model (metrics, labels, time series)
- Basic PromQL (rate, sum, avg, histogram_quantile)
- Grafana navigation and basic dashboard creation
Week 2: Operations
- Writing alerting rules
- Alertmanager routing and silences
- Debugging missing metrics and scrape failures
Week 3: Advanced
- Recording rules for performance
- Service discovery configuration
- Federation and high availability
Week 4: On-Call Practice
- Respond to alerts using new tools
- Build runbook dashboards
- Practice common triage workflows in Grafana
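For the Week 3 recording-rules session, it helps to have one worked example on hand: precomputing per-instance CPU utilisation so dashboards and alerts query a cheap precomputed series instead of re-evaluating `rate()` over raw counters. The rule name follows the common `level:metric:operations` convention:

```yaml
# recording-rules.yml — precompute CPU utilisation per instance.
groups:
  - name: node-recording
    interval: 30s
    rules:
      - record: instance:node_cpu_utilisation:avg_rate5m
        expr: |
          1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Dashboards then plot `instance:node_cpu_utilisation:avg_rate5m` directly, which is noticeably cheaper on large fleets than evaluating the full expression per panel refresh.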
Common Pitfalls¶
- Big-bang cutover — Turning off Nagios and turning on Prometheus on the same day. You will miss things. Always parallel-run for at least 4 weeks.
- 1:1 check translation — Copying every Nagios check as a Prometheus alert. Many Nagios checks are redundant or obsolete. The migration is a chance to clean up.
- Ignoring the team — The new system is only useful if the team knows how to use it. Budget training time. People will revert to the old system if the new one is confusing.
- No baseline comparison — Not comparing alert fidelity during parallel run. If Nagios catches an incident that Prometheus misses, you have a gap.
- Migrating custom NRPE plugins as-is — Some Nagios plugins are shell scripts that check very specific things. These need to become either custom exporters or pushgateway jobs, not ported directly.
- Forgetting network device monitoring — Nagios and Zabbix handle SNMP well. Prometheus requires snmp_exporter, which needs MIB configuration. Do not leave network devices unmonitored during the gap.
- Underestimating Prometheus storage — 1,000 targets exposing 500 series each is already 500,000 active time series, and label cardinality can push that into the millions. Plan storage, retention, and cardinality budgets from day one.
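On the network-device pitfall: keeping switches and routers visible during the parallel run is usually one extra scrape job pointing at snmp_exporter, using the same relabeling pattern as blackbox_exporter. A sketch, assuming the standard `if_mib` module from the exporter's generated `snmp.yml`; hostnames are placeholders:

```yaml
# prometheus.yml fragment: SNMP devices via snmp_exporter.
scrape_configs:
  - job_name: 'snmp'
    metrics_path: /snmp
    params:
      module: [if_mib]                     # interface metrics from the default snmp.yml
    static_configs:
      - targets:
          - switch1.example.com            # the SNMP device itself, not the exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target       # device becomes the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116    # exporter host is a placeholder
```

Devices needing vendor MIBs beyond IF-MIB require regenerating `snmp.yml` with the exporter's generator tool, which is the part worth scheduling before cutover, not after.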
Wiki Navigation¶
Prerequisites¶
- Monitoring Fundamentals (Topic Pack, L1)
- Observability Deep Dive (Topic Pack, L2)
Related Content¶
- Monitoring Fundamentals (Topic Pack, L1) — Grafana, Monitoring Fundamentals, Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Grafana, Prometheus
- Observability Architecture (Reference, L2) — Grafana, Prometheus
- Observability Deep Dive (Topic Pack, L2) — Grafana, Prometheus
- Skillcheck: Observability (Assessment, L2) — Grafana, Prometheus
- Track: Observability (Reference, L2) — Grafana, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning (Topic Pack, L2) — Prometheus