Portal | Level: L2: Operations | Topics: Monitoring Migration, Monitoring Fundamentals, Prometheus, Grafana | Domain: Observability

Monitoring Migration (Legacy to Modern) - Primer

Why This Matters

If you have been in infrastructure long enough, you have inherited a Nagios or Zabbix installation that someone set up in 2014 and nobody fully understands anymore. The checks work — mostly — but adding new ones requires tribal knowledge, the dashboards are from a different era, and the alerting is a maze of email rules and escalation chains. Moving to Prometheus and Grafana is not just a technology upgrade. It is a chance to rethink what you monitor, how you alert, and how your team interacts with operational data.

I have done this migration three times at different organizations, each with hundreds of Nagios checks and years of accumulated configuration. The technology swap is the easy part. The hard part is the parallel-run period, the team training, the check-by-check equivalence mapping, and the political challenge of convincing people to trust the new system before you turn off the old one.

Core Concepts

1. Why Migrate?

Legacy (Nagios/Zabbix)              Modern (Prometheus/Grafana)
─────────────────────               ───────────────────────────
Check-based (pass/fail)             Metric-based (continuous values)
Push or active polling              Pull model (scrape)
Plugin scripts per check            Exporters expose all metrics
Static thresholds                   Dynamic queries (PromQL)
Host-centric                        Service-centric, label-based
Config files / web UI               Config-as-code (YAML)
Limited dimensional data            Multi-dimensional labels
Custom dashboards are painful       Grafana dashboards are powerful
Scaling requires proxies/agents     Federation and remote write

2. Migration Phases

Phase 1: Assessment (2-4 weeks)
├── Inventory all existing checks
├── Classify by type (host, service, app, network)
├── Identify check owners
├── Map checks to Prometheus equivalents
└── Document current alerting rules and escalations

Phase 2: Foundation (2-4 weeks)
├── Deploy Prometheus + Grafana
├── Deploy node_exporter on all hosts
├── Deploy blackbox_exporter for endpoint checks
├── Set up basic recording rules
└── Create initial dashboards

Phase 3: Parallel Run (4-8 weeks)
├── Both systems monitoring simultaneously
├── Compare alert fidelity (same incidents detected?)
├── Tune Prometheus thresholds to match reality
├── Train team on PromQL and Grafana
└── Build confidence in new system

Phase 4: Cutover (1-2 weeks)
├── Route alerting through Prometheus/Alertmanager
├── Disable legacy alerting (keep monitoring passive)
├── Monitor for gaps (incidents legacy caught but new missed)
└── Team on-call uses new system exclusively

Phase 5: Decommission (2-4 weeks)
├── Remove legacy agents (NRPE, Zabbix agent)
├── Archive legacy configuration
├── Shut down legacy server
└── Update documentation and runbooks

3. Mapping Nagios Checks to Prometheus

Nagios Check          Prometheus Equivalent
────────────          ─────────────────────
check_disk            node_filesystem_avail_bytes / node_filesystem_size_bytes via node_exporter
check_load            node_load1, node_load5, node_load15 via node_exporter
check_mem             node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes via node_exporter
check_procs           node_procs_running, node_procs_blocked via node_exporter
check_swap            node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes via node_exporter
check_http            probe_success, probe_http_status_code via blackbox_exporter
check_tcp             probe_success with TCP module via blackbox_exporter
check_ping            probe_success with ICMP module via blackbox_exporter
check_ntp             node_timex_offset_seconds via node_exporter (timex collector)
check_mysql           mysql_up, mysql_global_status_* via mysqld_exporter
check_postgres        pg_up, pg_stat_* via postgres_exporter
Custom NRPE plugin    Custom exporter, or Pushgateway for batch jobs

4. Nagios Check to PromQL Translation

# Nagios: check_disk -w 20% -c 10% -p /
# Translation:
- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk space low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"

# Nagios: check_load -w 4,3,2 -c 8,6,4
# Translation:
- alert: HighLoad
  # drop both cpu and mode labels so the per-instance core count
  # label-matches node_load5 (count without(cpu) alone leaves mode="idle"
  # on the result and the division returns nothing)
  expr: node_load5 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 0.75
  for: 10m
  labels:
    severity: warning

# Nagios: check_http -H app.example.com -u /health -t 10
# Translation (via blackbox_exporter):
- alert: EndpointDown
  expr: probe_success{job="blackbox-http"} == 0
  for: 2m
  labels:
    severity: critical

5. Exporter Deployment

                  Prometheus
                  (scrapes every 15-60s)
         ┌────────────┼────────────┐
         │            │            │
    node_exporter  blackbox     app metrics
    (port 9100)   (port 9115)  (port 8080/metrics)
         │            │            │
    OS metrics    HTTP/TCP      Application
    CPU/mem/disk  endpoint      custom metrics
    network/io    probes

# Prometheus scrape config for node_exporter
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
    # Or use service discovery:
    # ec2_sd_configs:
    #   - region: us-east-1
    #     port: 9100

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com/health
          - https://api.example.com/status
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
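The `http_2xx` module referenced above is defined on the blackbox_exporter side, not in Prometheus. A minimal `blackbox.yml` sketch; the 10s HTTP timeout mirrors the Nagios `-t 10` flag from section 4, and the TCP/ICMP timeouts are illustrative:

```yaml
# blackbox.yml -- module definitions for blackbox_exporter
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: []   # empty list defaults to 2xx
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
```

Prometheus selects the module via the `module: [http_2xx]` param in the scrape config above; one exporter instance can serve all three probe types.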

6. Zabbix to Prometheus Mapping

Zabbix Concept           Prometheus Equivalent
──────────────           ─────────────────────
Zabbix agent             node_exporter + application exporters
Zabbix proxy             Prometheus federation or remote write
Item (data collection)   Scrape target metric
Trigger                  Alerting rule (PromQL expression)
Template                 Recording rules + dashboards
Host group               Labels (job, environment, team)
Action (alerting)        Alertmanager routes and receivers
Screen/Dashboard         Grafana dashboard
Discovery rule           Service discovery (EC2, Consul, Kubernetes)
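The Action-to-route translation is usually the least familiar part of this mapping. A minimal Alertmanager sketch showing how escalation logic becomes a routing tree; the receiver names, addresses, and the `matchers` labels are illustrative:

```yaml
# alertmanager.yml -- replaces Zabbix actions / Nagios escalation chains
route:
  receiver: team-email            # default catch-all
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="critical"]
      receiver: oncall-pager
    - matchers: [team="db"]
      receiver: db-team-slack

receivers:
  - name: team-email
    email_configs:
      - to: ops@example.com
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <redacted>
  - name: db-team-slack
    slack_configs:
      - channel: '#db-alerts'
```

The first matching child route wins, so order critical/paging routes before broader team routes.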

7. Data Migration Considerations

What to migrate:
  ✓ Alert rules (translated to PromQL)
  ✓ Dashboard layouts (recreated in Grafana)
  ✓ Contact/escalation policies (Alertmanager routes)
  ✓ Downtime/maintenance windows (Alertmanager silences)

What NOT to migrate:
  ✗ Historical metric data (different data model)
  ✗ Legacy check scripts (replace with exporters)
  ✗ Agent configurations (replaced by exporters)

Historical data strategy:
  Option A: Accept the data break. New system, new data.
  Option B: Keep legacy system read-only for N months for historical queries.
  Option C: Export key metrics to CSV for compliance/reporting.
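Option C can be driven from the Prometheus HTTP API: query `/api/v1/query_range` for each metric you need to keep, then flatten the JSON result into CSV rows. A sketch of the flattening step; the response shape is the standard `matrix` result type, and the HTTP fetch is left out so the function stays testable offline:

```python
import csv
import io

def matrix_to_csv(api_result):
    """Flatten a Prometheus query_range 'matrix' result into CSV text.

    api_result is the parsed JSON body of /api/v1/query_range, i.e.
    {"data": {"resultType": "matrix", "result": [...]}}.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["metric", "labels", "timestamp", "value"])
    for series in api_result["data"]["result"]:
        labels = dict(series["metric"])
        name = labels.pop("__name__", "")
        label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        for ts, value in series["values"]:
            writer.writerow([name, label_str, ts, value])
    return out.getvalue()
```

Pair this with daily or weekly query windows for long ranges, since `query_range` caps the number of points it returns per series.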

8. Team Training Plan

Week 1: Foundations
  - Prometheus data model (metrics, labels, time series)
  - Basic PromQL (rate, sum, avg, histogram_quantile)
  - Grafana navigation and basic dashboard creation

Week 2: Operations
  - Writing alerting rules
  - Alertmanager routing and silences
  - Debugging missing metrics and scrape failures

Week 3: Advanced
  - Recording rules for performance
  - Service discovery configuration
  - Federation and high availability

Week 4: On-Call Practice
  - Respond to alerts using new tools
  - Build runbook dashboards
  - Practice common triage workflows in Grafana
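Recording rules (Week 3) precompute expensive expressions so dashboards and alerts stay fast. A minimal sketch reusing the per-CPU load expression from section 4; the rule names follow the standard level:metric:operation naming convention:

```yaml
# recording-rules.yml
groups:
  - name: node-precompute
    interval: 60s
    rules:
      - record: instance:node_cpus:count
        expr: count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})
      - record: instance:node_load5_per_cpu:ratio
        expr: node_load5 / instance:node_cpus:count
```

Alerting rules can then compare `instance:node_load5_per_cpu:ratio > 0.75` directly instead of re-evaluating the aggregation on every rule cycle.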

Common Pitfalls

  1. Big-bang cutover — Turning off Nagios and turning on Prometheus on the same day. You will miss things. Always parallel-run for at least 4 weeks.
  2. 1:1 check translation — Copying every Nagios check as a Prometheus alert. Many Nagios checks are redundant or obsolete. The migration is a chance to clean up.
  3. Ignoring the team — The new system is only useful if the team knows how to use it. Budget training time. People will revert to the old system if the new one is confusing.
  4. No baseline comparison — Not comparing alert fidelity during parallel run. If Nagios catches an incident that Prometheus misses, you have a gap.
  5. Migrating custom NRPE plugins as-is — Some Nagios plugins are shell scripts that check very specific things. These need to become either custom exporters or pushgateway jobs, not ported directly.
  6. Forgetting network device monitoring — Nagios and Zabbix handle SNMP well. Prometheus requires snmp_exporter, which needs MIB configuration. Do not leave network devices unmonitored during the gap.
  7. Underestimating Prometheus storage — 1,000 targets exposing 500 series each is 500,000 active time series, and label churn (deploys, autoscaling, pod restarts) can push the total over a retention window into the millions. Plan storage and retention from day one.
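Pitfall 7 is easy to sanity-check with arithmetic before buying disks. A back-of-the-envelope sketch; the ~1.5 bytes/sample figure is a commonly cited post-compression TSDB average, so treat it as an assumption rather than a guarantee:

```python
def tsdb_storage_bytes(targets, series_per_target, scrape_interval_s,
                       retention_days, bytes_per_sample=1.5):
    """Rough on-disk size estimate for the Prometheus local TSDB."""
    active_series = targets * series_per_target
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return samples_per_second * retention_seconds * bytes_per_sample

# 1,000 targets x 500 series, 15s scrapes, 30-day retention
estimate = tsdb_storage_bytes(1000, 500, 15, 30)
print(f"{estimate / 1e9:.0f} GB")   # roughly 130 GB
```

The estimate excludes WAL overhead and ignores churn, which creates new series without retiring old ones from the index, so size the volume with comfortable headroom.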
