Portal | Level: L2: Operations | Topics: Monitoring Migration, Monitoring Fundamentals, Prometheus, Grafana | Domain: Observability

Monitoring Migration (Legacy to Modern) - Primer

Why This Matters

If you have been in infrastructure long enough, you have inherited a Nagios or Zabbix installation that someone set up in 2014 and nobody fully understands anymore. The checks work — mostly — but adding new ones requires tribal knowledge, the dashboards are from a different era, and the alerting is a maze of email rules and escalation chains. Moving to Prometheus and Grafana is not just a technology upgrade. It is a chance to rethink what you monitor, how you alert, and how your team interacts with operational data.

I have done this migration three times at different organizations, each with hundreds of Nagios checks and years of accumulated configuration. The technology swap is the easy part. The hard part is the parallel-run period, the team training, the check-by-check equivalence mapping, and the political challenge of convincing people to trust the new system before you turn off the old one.

Core Concepts

1. Why Migrate?

Legacy (Nagios/Zabbix)              Modern (Prometheus/Grafana)
─────────────────────               ───────────────────────────
Check-based (pass/fail)             Metric-based (continuous values)
Push or active polling              Pull model (scrape)
Plugin scripts per check            Exporters expose all metrics
Static thresholds                   Dynamic queries (PromQL)
Host-centric                        Service-centric, label-based
Config files / web UI               Config-as-code (YAML)
Limited dimensional data            Multi-dimensional labels
Custom dashboards are painful       Grafana dashboards are powerful
Scaling requires proxies/agents     Federation and remote write

2. Migration Phases

Phase 1: Assessment (2-4 weeks)
├── Inventory all existing checks
├── Classify by type (host, service, app, network)
├── Identify check owners
├── Map checks to Prometheus equivalents
└── Document current alerting rules and escalations

Phase 2: Foundation (2-4 weeks)
├── Deploy Prometheus + Grafana
├── Deploy node_exporter on all hosts
├── Deploy blackbox_exporter for endpoint checks
├── Set up basic recording rules
└── Create initial dashboards

Phase 3: Parallel Run (4-8 weeks)
├── Both systems monitoring simultaneously
├── Compare alert fidelity (same incidents detected?)
├── Tune Prometheus thresholds to match reality
├── Train team on PromQL and Grafana
└── Build confidence in new system

Phase 4: Cutover (1-2 weeks)
├── Route alerting through Prometheus/Alertmanager
├── Disable legacy alerting (keep monitoring passive)
├── Monitor for gaps (incidents legacy caught but new missed)
└── Team on-call uses new system exclusively

Phase 5: Decommission (2-4 weeks)
├── Remove legacy agents (NRPE, Zabbix agent)
├── Archive legacy configuration
├── Shut down legacy server
└── Update documentation and runbooks

3. Mapping Nagios Checks to Prometheus

Nagios Check          Prometheus Equivalent
────────────          ─────────────────────
check_disk            node_filesystem_avail_bytes / node_filesystem_size_bytes via node_exporter
check_load            node_load1, node_load5, node_load15 via node_exporter
check_mem             node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes via node_exporter
check_procs           node_procs_running, node_procs_blocked via node_exporter
check_swap            node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes via node_exporter
check_http            probe_success, probe_http_status_code via blackbox_exporter
check_tcp             probe_success with TCP module via blackbox_exporter
check_ping            probe_success with ICMP module via blackbox_exporter
check_ntp             node_timex_offset_seconds via node_exporter (timex collector)
check_mysql           mysql_up, mysql_global_status_* via mysqld_exporter
check_postgres        pg_up, pg_stat_* via postgres_exporter
Custom NRPE plugin    Custom exporter, or Pushgateway for batch jobs

4. Nagios Check to PromQL Translation

# Nagios: check_disk -w 20% -c 10% -p /
# Translation:
- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"

# Nagios: check_load -w 4,3,2 -c 8,6,4
# Translation:
- alert: HighLoad
  # drop both cpu and mode labels so the per-instance core count
  # label-matches node_load5 (count without(cpu) alone leaves mode="idle"
  # on the result and the division returns nothing)
  expr: node_load5 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 0.75
  for: 10m
  labels:
    severity: warning

# Nagios: check_http -H app.example.com -u /health -t 10
# Translation (via blackbox_exporter):
- alert: EndpointDown
  expr: probe_success{job="blackbox-http"} == 0
  for: 2m
  labels:
    severity: critical

5. Exporter Deployment

                  Prometheus
                  (scrapes every 15-60s)
         ┌────────────┼────────────┐
         │            │            │
    node_exporter  blackbox     app metrics
    (port 9100)   (port 9115)  (port 8080/metrics)
         │            │            │
    OS metrics    HTTP/TCP      Application
    CPU/mem/disk  endpoint      custom metrics
    network/io    probes

# Prometheus scrape config for node_exporter
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
    # Or use service discovery:
    # ec2_sd_configs:
    #   - region: us-east-1
    #     port: 9100

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com/health
          - https://api.example.com/status
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
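The `http_2xx` module referenced above is defined on the blackbox_exporter side, not in Prometheus. A minimal `blackbox.yml` sketch; the 10s HTTP timeout mirrors the Nagios `-t 10` flag from section 4, and the TCP/ICMP timeouts are illustrative:

```yaml
# blackbox.yml -- module definitions for blackbox_exporter
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: []   # empty list defaults to 2xx
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
```

Prometheus selects the module via the `module: [http_2xx]` param in the scrape config above; one exporter instance can serve all three probe types.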

6. Zabbix to Prometheus Mapping

Zabbix Concept           Prometheus Equivalent
──────────────           ─────────────────────
Zabbix agent             node_exporter + application exporters
Zabbix proxy             Prometheus federation or remote write
Item (data collection)   Scrape target metric
Trigger                  Alerting rule (PromQL expression)
Template                 Recording rules + dashboards
Host group               Labels (job, environment, team)
Action (alerting)        Alertmanager routes and receivers
Screen/Dashboard         Grafana dashboard
Discovery rule           Service discovery (EC2, Consul, Kubernetes)
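The Action-to-route translation is usually the least familiar part of this mapping. A minimal Alertmanager sketch showing how escalation logic becomes a routing tree; the receiver names, addresses, and the `matchers` labels are illustrative:

```yaml
# alertmanager.yml -- replaces Zabbix actions / Nagios escalation chains
route:
  receiver: team-email            # default catch-all
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="critical"]
      receiver: oncall-pager
    - matchers: [team="db"]
      receiver: db-team-slack

receivers:
  - name: team-email
    email_configs:
      - to: ops@example.com
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <redacted>
  - name: db-team-slack
    slack_configs:
      - channel: '#db-alerts'
```

The first matching child route wins, so order critical/paging routes before broader team routes.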

7. Data Migration Considerations

What to migrate:
  ✓ Alert rules (translated to PromQL)
  ✓ Dashboard layouts (recreated in Grafana)
  ✓ Contact/escalation policies (Alertmanager routes)
  ✓ Downtime/maintenance windows (Alertmanager silences)

What NOT to migrate:
  ✗ Historical metric data (different data model)
  ✗ Legacy check scripts (replace with exporters)
  ✗ Agent configurations (replaced by exporters)

Historical data strategy:
  Option A: Accept the data break. New system, new data.
  Option B: Keep legacy system read-only for N months for historical queries.
  Option C: Export key metrics to CSV for compliance/reporting.
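Option C can be driven from the Prometheus HTTP API: query `/api/v1/query_range` for each metric you need to keep, then flatten the JSON result into CSV rows. A sketch of the flattening step; the response shape is the standard `matrix` result type, and the HTTP fetch is left out so the function stays testable offline:

```python
import csv
import io

def matrix_to_csv(api_result):
    """Flatten a Prometheus query_range 'matrix' result into CSV text.

    api_result is the parsed JSON body of /api/v1/query_range, i.e.
    {"data": {"resultType": "matrix", "result": [...]}}.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["metric", "labels", "timestamp", "value"])
    for series in api_result["data"]["result"]:
        labels = dict(series["metric"])
        name = labels.pop("__name__", "")
        label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        for ts, value in series["values"]:
            writer.writerow([name, label_str, ts, value])
    return out.getvalue()
```

Pair this with daily or weekly query windows for long ranges, since `query_range` caps the number of points it returns per series.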

8. Team Training Plan

Week 1: Foundations
  - Prometheus data model (metrics, labels, time series)
  - Basic PromQL (rate, sum, avg, histogram_quantile)
  - Grafana navigation and basic dashboard creation

Week 2: Operations
  - Writing alerting rules
  - Alertmanager routing and silences
  - Debugging missing metrics and scrape failures

Week 3: Advanced
  - Recording rules for performance
  - Service discovery configuration
  - Federation and high availability

Week 4: On-Call Practice
  - Respond to alerts using new tools
  - Build runbook dashboards
  - Practice common triage workflows in Grafana
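Recording rules (Week 3) precompute expensive expressions so dashboards and alerts stay fast. A minimal sketch reusing the per-CPU load expression from section 4; the rule names follow the standard level:metric:operation naming convention:

```yaml
# recording-rules.yml
groups:
  - name: node-precompute
    interval: 60s
    rules:
      - record: instance:node_cpus:count
        expr: count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})
      - record: instance:node_load5_per_cpu:ratio
        expr: node_load5 / instance:node_cpus:count
```

Alerting rules can then compare `instance:node_load5_per_cpu:ratio > 0.75` directly instead of re-evaluating the aggregation on every rule cycle.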

Common Pitfalls

  1. Big-bang cutover — Turning off Nagios and turning on Prometheus on the same day. You will miss things. Always parallel-run for at least 4 weeks.
  2. 1:1 check translation — Copying every Nagios check as a Prometheus alert. Many Nagios checks are redundant or obsolete. The migration is a chance to clean up.
  3. Ignoring the team — The new system is only useful if the team knows how to use it. Budget training time. People will revert to the old system if the new one is confusing.
  4. No baseline comparison — Not comparing alert fidelity during parallel run. If Nagios catches an incident that Prometheus misses, you have a gap.
  5. Migrating custom NRPE plugins as-is — Some Nagios plugins are shell scripts that check very specific things. These need to become either custom exporters or pushgateway jobs, not ported directly.
  6. Forgetting network device monitoring — Nagios and Zabbix handle SNMP well. Prometheus requires snmp_exporter, which needs MIB configuration. Do not leave network devices unmonitored during the gap.
  7. Underestimating Prometheus storage — 1,000 targets exposing 500 series each is 500,000 active time series, and label churn (deploys, autoscaling, pod restarts) can push the total over a retention window into the millions. Plan storage and retention from day one.
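Pitfall 7 is easy to sanity-check with arithmetic before buying disks. A back-of-the-envelope sketch; the ~1.5 bytes/sample figure is a commonly cited post-compression TSDB average, so treat it as an assumption rather than a guarantee:

```python
def tsdb_storage_bytes(targets, series_per_target, scrape_interval_s,
                       retention_days, bytes_per_sample=1.5):
    """Rough on-disk size estimate for the Prometheus local TSDB."""
    active_series = targets * series_per_target
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return samples_per_second * retention_seconds * bytes_per_sample

# 1,000 targets x 500 series, 15s scrapes, 30-day retention
estimate = tsdb_storage_bytes(1000, 500, 15, 30)
print(f"{estimate / 1e9:.0f} GB")   # roughly 130 GB
```

The estimate excludes WAL overhead and ignores churn, which creates new series without retiring old ones from the index, so size the volume with comfortable headroom.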
