Monitoring Migration

10 cards — 🟢 3 easy | 🟡 4 medium | 🔴 3 hard

🟢 Easy (3)

1. What is the fundamental difference between Nagios-style and Prometheus-style monitoring?

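Nagios uses check-based monitoring: each check returns a discrete state (OK, WARNING, CRITICAL, or UNKNOWN), gathered either by active checks that the server schedules or by passive results that hosts submit. Prometheus uses metric-based monitoring: it scrapes continuous numerical time series over HTTP (a pull model), attaches multi-dimensional labels to them, and uses PromQL for dynamic queries and alert expressions.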

2. What is node_exporter, and what type of metrics does it provide?

node_exporter is a Prometheus exporter that runs on each host (default port 9100) and exposes OS-level metrics: CPU, memory, disk, network, and I/O. It replaces Nagios checks such as check_disk and check_load, as well as community plugins such as check_mem.
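
A minimal sketch of a scrape job that collects these metrics. The hostnames, the job name, the 30s interval, and the environment label are illustrative assumptions, not values from any particular deployment.

```yaml
# prometheus.yml (fragment); hostnames and label values are hypothetical
scrape_configs:
  - job_name: node
    scrape_interval: 30s
    static_configs:
      - targets:
          - web01.example.com:9100   # node_exporter default port
          - db01.example.com:9100
        labels:
          environment: production
```

Each target then appears in Prometheus with an `up` series, which gives a basic "is the host being scraped" signal comparable to a host check.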

3. What is the blackbox_exporter used for in a Prometheus deployment?

The blackbox_exporter performs endpoint probes (HTTP, TCP, ICMP, DNS) from outside the target, replacing Nagios checks like check_http, check_tcp, and check_ping. It exposes metrics like probe_success and probe_http_status_code.
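
A sketch of how an HTTP probe might be wired up, following the relabeling pattern documented for the blackbox_exporter. The probed URL and the exporter address (127.0.0.1:9115, its default port) are assumptions; `http_2xx` is a module defined in the example configuration shipped with the exporter.

```yaml
# prometheus.yml (fragment); the target URL and exporter address are hypothetical
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]            # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://www.example.com
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox_exporter
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

With this in place, an alert expression such as `probe_success == 0` plays roughly the role of a failing check_http.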

🟡 Medium (4)

1. What are the five phases of a monitoring migration, and why is the parallel run phase the longest?

Assessment (2-4 weeks), Foundation (2-4 weeks), Parallel Run (4-8 weeks), Cutover (1-2 weeks), Decommission (2-4 weeks). The parallel run is longest because both systems must monitor simultaneously to compare alert fidelity, tune Prometheus thresholds, train the team, and build confidence in the new system before committing.

2. How do you translate a Nagios "check_disk -w 20% -c 10%" check to a Prometheus alerting rule?

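Use PromQL: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20 with "for: 5m" and severity: warning for the -w 20% threshold, plus a second rule with < 10 and severity: critical for the -c 10% threshold. Prometheus evaluates the free-space percentage at every rule evaluation rather than running a point-in-time check script.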
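
The translation can be sketched as a rule file with one rule per Nagios threshold. The group name, alert names, and annotation text are assumptions chosen for illustration; the expression and "for: 5m" come from the answer above.

```yaml
# rules/disk.yml; group and alert names are hypothetical
groups:
  - name: disk
    rules:
      # Mirrors check_disk -w 20%: less than 20% free on /
      - alert: RootDiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% free on / ({{ $labels.instance }})"
      # Mirrors check_disk -c 10%: less than 10% free on /
      - alert: RootDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% free on / ({{ $labels.instance }})"
```

Below 10% free, both alerts fire; an Alertmanager inhibit rule is one common way to suppress the warning while the critical alert is active.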

3. In a Zabbix-to-Prometheus migration, what replaces Zabbix triggers, templates, and host groups?

Zabbix triggers become Prometheus alerting rules (PromQL expressions). Templates become recording rules plus Grafana dashboards. Host groups become labels (job, environment, team) which provide multi-dimensional grouping.
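
As a rough illustration of that mapping, the sketch below stands in for a Zabbix template with one trigger. The file name, rule names, thresholds, and the `environment` label are assumptions; the label is presumed to be attached to targets at scrape time, the way a host group would be assigned in Zabbix.

```yaml
# rules/linux_servers.yml; names and thresholds are hypothetical
groups:
  - name: linux_servers
    rules:
      # Recording rule: precomputes per-instance CPU utilisation (0 to 1),
      # keeping target labels such as environment on the result
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Alerting rule: plays the role of a Zabbix trigger,
      # scoped by label instead of by host group
      - alert: HighCpuUtilisation
        expr: instance:node_cpu_utilisation:ratio{environment="production"} > 0.9
        for: 10m
        labels:
          severity: warning
```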

4. Should you migrate historical metric data from the legacy system to Prometheus? Why or why not?

Generally no. The data models are fundamentally different (check results vs time series), which makes direct migration impractical. Options: accept the data break, keep the legacy system read-only for N months for historical queries, or export key metrics to CSV for compliance/reporting.

🔴 Hard (3)

1. How should custom Nagios NRPE plugins be handled during a migration to Prometheus?

Custom NRPE plugins (typically shell scripts that check one specific condition) should NOT be ported directly. Each one should become either a custom Prometheus exporter that exposes metrics on an HTTP endpoint, or a Pushgateway job for batch/cron work that runs and then pushes its results. This turns point-in-time checks into continuous metric collection.
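
For the Pushgateway path, the Prometheus side might look like the sketch below. The hostname is an assumption (9091 is the Pushgateway's default port); the batch job itself, not shown here, pushes its metrics to the Pushgateway over HTTP.

```yaml
# prometheus.yml (fragment); the Pushgateway hostname is hypothetical
scrape_configs:
  - job_name: pushgateway
    # Keep the job/instance labels set by the pushing batch job
    # instead of overwriting them with the Pushgateway's own
    honor_labels: true
    static_configs:
      - targets:
          - pushgateway.example.com:9091
```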

2. Why is "big-bang cutover" dangerous in a monitoring migration, and what should you do instead?

Turning off Nagios and enabling Prometheus on the same day risks missing incidents because the new system has untested coverage gaps. Instead, run both systems in parallel for at least 4 weeks, comparing alert fidelity to ensure Prometheus catches the same incidents Nagios does, then gradually cut over alerting before decommissioning legacy.
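
One possible way to run Prometheus alerting in parallel without paging anyone twice is to route every Prometheus alert to a non-paging receiver and compare that log against Nagios notifications. This is only a sketch of that idea: the receiver name and webhook URL are hypothetical, and the source does not prescribe this setup.

```yaml
# alertmanager.yml (parallel-run phase); receiver name and URL are hypothetical
route:
  receiver: migration-shadow
  group_by: ['alertname', 'instance']
receivers:
  - name: migration-shadow
    webhook_configs:
      # Alerts are only logged here for comparison; Nagios still pages
      - url: http://alert-log.example.com:8080/prometheus
```

At cutover, the route would be switched to the real paging receivers and the shadow receiver dropped.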

3. What challenge does network device monitoring present during a monitoring migration?

Nagios and Zabbix handle SNMP natively and well. Prometheus requires the snmp_exporter, whose configuration has to be generated from the device MIBs, a non-trivial setup. If this is overlooked, network devices go unmonitored during the migration gap, creating a blind spot for switch/router health, port utilization, and environmental sensors.
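
Once an snmp_exporter configuration exists, the scrape side follows the same relabeling pattern as other probe-style exporters. In this sketch the device address and the exporter address (127.0.0.1:9116, its default port) are assumptions; `if_mib` and `public_v2` are names from the exporter's shipped example configuration, and the `auth` parameter applies to recent releases only.

```yaml
# prometheus.yml (fragment); device and exporter addresses are hypothetical
scrape_configs:
  - job_name: snmp
    metrics_path: /snmp
    params:
      auth: [public_v2]    # assumption: recent snmp_exporter releases
      module: [if_mib]     # interface counters from IF-MIB
    static_configs:
      - targets:
          - 192.0.2.10     # switch or router to poll
    relabel_configs:
      # Pass the device address to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the device address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the snmp_exporter
      - target_label: __address__
        replacement: 127.0.0.1:9116
```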