# Monitoring Migration (Legacy to Modern) - Street-Level Ops
What experienced monitoring engineers know from surviving multiple Nagios-to-Prometheus migrations.
## Quick Diagnosis Commands
# Nagios: check current status
grep -c "current_state=2" /var/log/nagios/status.dat # Critical count
nagiostats # Overall Nagios health
# Zabbix: check server status
zabbix_server -R diaginfo
zabbix_get -s <host> -k agent.ping
# Prometheus: check scrape targets
curl -s http://prometheus:9090/api/v1/targets | python3 -m json.tool | grep -E '"health"|"job"'
# Prometheus: count distinct metric names (for full series cardinality, see /api/v1/status/tsdb)
curl -s http://prometheus:9090/api/v1/label/__name__/values | python3 -c "import sys,json; print(len(json.load(sys.stdin)['data']))"
# Check Prometheus scrape health
curl -s http://prometheus:9090/api/v1/targets | python3 -c "
import sys, json
targets = json.load(sys.stdin)['data']['activeTargets']
for t in targets:
    if t['health'] != 'up':
        print(f\"DOWN: {t['labels'].get('job','?')} {t['scrapeUrl']} - {t.get('lastError','')}\")"
# Compare check counts
echo "Nagios checks: $(cat /etc/nagios/conf.d/*.cfg 2>/dev/null | grep -c 'define service')"
echo "Prometheus rules: $(cat /etc/prometheus/rules/*.yml 2>/dev/null | grep -c 'alert:')"
# Grafana: check datasource connectivity
curl -s -u admin:admin http://grafana:3000/api/datasources | python3 -m json.tool
# node_exporter: verify it's running on a target
curl -s http://target-host:9100/metrics | head -5
# blackbox_exporter: test a probe manually
curl -s "http://blackbox:9115/probe?target=https://app.example.com&module=http_2xx"
## Gotcha: Nagios Checks Have Hidden Dependencies
You inventory 200 Nagios checks and start translating them. Check #47 is check_app_health — a custom NRPE plugin that SSHes to a jump host, runs a curl command, parses JSON, and checks three fields. It took someone 2 days to write. You cannot just replace this with a blackbox probe.
Fix:
# Audit every custom plugin
find /usr/lib/nagios/plugins/custom/ -type f -exec head -5 {} \;
# Categorize custom checks:
# 1. Simple HTTP checks → blackbox_exporter
# 2. Process checks → node_exporter process collector
# 3. Custom data checks → write a custom exporter or textfile collector
# 4. Multi-step checks → consider script_exporter or pushgateway
# For complex checks, use node_exporter textfile collector:
# Write a script that outputs Prometheus metrics to a file
cat > /usr/local/bin/custom-check.sh << 'SCRIPT'
#!/bin/bash
# Emit a Prometheus metric for node_exporter's textfile collector.
# Write to a temp file and rename, so node_exporter never scrapes a half-written file.
OUT=/var/lib/node_exporter/textfile/app_health.prom
RESULT=$(curl -s http://internal-api/status | jq -r '.healthy')
if [ "$RESULT" = "true" ]; then
    echo "app_health_status 1" > "$OUT.tmp"
else
    echo "app_health_status 0" > "$OUT.tmp"
fi
mv "$OUT.tmp" "$OUT"
SCRIPT
chmod +x /usr/local/bin/custom-check.sh
# Run via cron every minute
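The cron entry could be dropped into cron.d like this. Note that node_exporter must also be started with `--collector.textfile.directory=/var/lib/node_exporter/textfile`, matching the path the script writes to:

```
# /etc/cron.d/app-health -- run the textfile check every minute
* * * * * root /usr/local/bin/custom-check.sh
```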
## Gotcha: Alert Fatigue Migrates with the Alerts
You translate all 200 Nagios checks to Prometheus alerts. You now have 200 alerting rules, many of which fire constantly because the thresholds were wrong in Nagios too — people just learned to ignore them.
Fix:
# During migration, audit every alert:
# 1. Has this alert fired in the last 90 days?
# 2. Did anyone take action when it fired?
# 3. Is the threshold still relevant?
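Question 1 can be answered from the Nagios log itself. A sketch, assuming the common Debian log location (adjust the path for your install):

```shell
# Count notifications per host;service from nagios.log lines shaped like:
#   [ts] SERVICE NOTIFICATION: contact;host;service;state;command;output
nagios_notification_counts() {
  awk -F';' '/SERVICE NOTIFICATION/ { n[$2 ";" $3]++ }
             END { for (k in n) print n[k], k }' "$@" | sort -rn
}
# Noisiest 20 services across current + rotated logs
nagios_notification_counts /var/log/nagios/nagios.log* 2>/dev/null | head -20
```

Services with zero notifications in 90 days of logs are candidates for dropping, not translating.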
# Example: Nagios check_load with -w 4 -c 8 on a 32-core server
# This was ALWAYS firing. The threshold was never adjusted for the hardware.
# New alert with context-aware threshold:
- alert: HighCPULoad
  expr: |
    node_load5 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 0.8
  for: 15m # Sustained for 15 minutes, not instant
  labels:
    severity: warning
  annotations:
    summary: "High load on {{ $labels.instance }}"
Treat the migration as a chance to clean up, not a 1:1 translation.
## Gotcha: Parallel Run Shows Different Alert Timing
During the parallel run, Nagios fires an alert 30 seconds before Prometheus. The team panics: "Prometheus is slower." Actually, Nagios checks every 60 seconds and fires immediately. Prometheus scrapes every 15 seconds but has a for: 5m clause requiring 5 minutes of sustained failure.
Fix:
Understand the difference:
Nagios: check_interval=60s, max_check_attempts=3, notification_delay=0
Fires after: 3 * 60s = 3 minutes of failure
Prometheus: scrape_interval=15s, for=5m
Fires after: 5 minutes of continuous failure
Adjust the "for" duration to match operational expectations.
Do NOT set for: 0s to match Nagios speed — that causes flapping alerts.
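If the team expects Nagios-like timing, match the legacy detection window explicitly instead. A sketch for the blackbox probe case, mirroring max_check_attempts=3 at a 60s interval:

```yaml
- alert: EndpointDown
  expr: probe_success == 0
  for: 3m # matches Nagios: max_check_attempts=3 * check_interval=60s
  labels:
    severity: critical
```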
## Gotcha: SNMP Devices Left Unmonitored
You migrate servers to node_exporter, apps to custom exporters, and endpoints to blackbox_exporter. Three weeks later, a network switch runs out of memory and nobody notices because Nagios was doing SNMP checks that were not migrated.
Fix:
# Deploy snmp_exporter for network devices
# Generate snmp.yml from MIBs:
# snmp_exporter generator with your device MIBs
scrape_configs:
  - job_name: 'snmp-switches'
    static_configs:
      - targets:
          - 10.0.0.1 # Switch 1
          - 10.0.0.2 # Switch 2
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: snmp-exporter:9116
Make a checklist of everything Nagios/Zabbix monitored by category. Cross off each category as you confirm Prometheus coverage.
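The generator step mentioned above consumes a small config of its own. A minimal sketch of snmp_exporter's generator.yml; the module name matches the scrape config, and the walk list here is an assumption for basic interface stats:

```yaml
modules:
  if_mib:
    walk:
      - ifTable
      - ifXTable
```

Running the generator against your device MIBs then produces the snmp.yml that snmp_exporter loads at startup.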
## Gotcha: Zabbix Templates Do Not Map to Prometheus Cleanly
Zabbix templates bundle items, triggers, graphs, and discovery rules. There is no 1:1 Prometheus equivalent. Teams try to "convert templates" and get stuck.
Fix: Map at the intent level, not the config level.
Zabbix "Template OS Linux":
  Items:    CPU idle, load, memory, disk, network
  Triggers: CPU > 90%, disk > 80%, load > N
  Graphs:   CPU usage, memory usage, disk I/O

Prometheus equivalent:
  Exporter:   node_exporter (covers all items)
  Rules:      alerting rules in YAML (covers triggers)
  Dashboards: Grafana "Node Exporter Full" dashboard (covers graphs)
  Discovery:  Prometheus service discovery (covers Zabbix discovery)
Do not try to "translate" the template XML. Deploy the exporter and write new rules.
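As an example of mapping at the intent level, a Zabbix "disk > 80%" trigger re-expressed against node_exporter's filesystem metrics (the fstype filter is an assumption to skip pseudo-filesystems):

```yaml
- alert: DiskSpaceLow
  expr: |
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes) < 0.20
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low disk on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```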
## Pattern: Migration Tracking Spreadsheet
| Check Name | Nagios Host/Service | Prometheus Alert | Exporter | Status | Owner |
|-----------------|--------------------|--------------------|---------------|------------|-------|
| check_disk / | all / Disk Root | DiskSpaceLow | node_exporter | Validated | ops |
| check_load | all / CPU Load | HighCPULoad | node_exporter | Validated | ops |
| check_http app | app / HTTP | EndpointDown | blackbox | Testing | dev |
| check_mysql | db1 / MySQL | MysqlDown | mysqld_exp | Not started| dba |
| check_custom_x | web / Custom App | AppHealthCheck | textfile | Needs work | ops |
Track every check through: Inventoried → Mapped → Implemented → Parallel Run → Validated → Legacy Disabled.
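The Status column doubles as a progress metric. A sketch that tallies it from the sheet exported as CSV; the column order (check,nagios,prom_alert,exporter,status,owner) is an assumption, and the inline sample stands in for your real export:

```shell
# Sample export -- replace with your real tracking sheet
cat > /tmp/migration.csv <<'CSV'
check,nagios,prom_alert,exporter,status,owner
check_disk,all/Disk Root,DiskSpaceLow,node_exporter,Validated,ops
check_load,all/CPU Load,HighCPULoad,node_exporter,Validated,ops
check_http,app/HTTP,EndpointDown,blackbox,Testing,dev
CSV
# Count checks per status, plus a total
awk -F',' 'NR > 1 { n[$5]++; total++ }
           END { for (s in n) printf "%-12s %d\n", s, n[s]
                 printf "%-12s %d\n", "TOTAL", total }' /tmp/migration.csv
```

Run it in the weekly migration standup; the numbers either move or the migration has stalled.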
## Pattern: Parallel Run Comparison Script
#!/bin/bash
# compare-alerts.sh - run daily during parallel run period
# Compare active alerts in Nagios vs Prometheus
echo "=== Nagios Active Problems ==="
# Parse Nagios status.dat or use livestatus
curl -s "http://nagios:8080/cgi-bin/statusjson.cgi?query=servicelist&servicestatus=critical" 2>/dev/null | \
python3 -c "import sys,json; [print(f' {s}') for s in json.load(sys.stdin).get('data',{}).get('servicelist',{}).keys()]" || \
echo " (Could not query Nagios)"
echo ""
echo "=== Prometheus Active Alerts ==="
curl -s "http://prometheus:9090/api/v1/alerts" | \
python3 -c "
import sys, json
alerts = json.load(sys.stdin)['data']['alerts']
for a in alerts:
    if a['state'] == 'firing':
        print(f\" {a['labels'].get('alertname','?')}: {a['labels'].get('instance','?')}\")"
echo ""
echo "=== Gaps ==="
echo " Review manually: are there Nagios alerts not in Prometheus?"
echo " Review manually: are there Prometheus alerts not in Nagios?"
## Emergency: Migration Broke Alerting
You cut over alerting to Alertmanager and disabled Nagios notifications. An incident happens. Alertmanager is misconfigured and alerts go nowhere.
# 1. Immediately re-enable Nagios notifications
# Edit Nagios config: enable_notifications=1
# Or via API: ENABLE_NOTIFICATIONS
# 2. Check Alertmanager status
curl -s http://alertmanager:9093/api/v2/status | python3 -m json.tool
curl -s http://alertmanager:9093/api/v2/alerts | python3 -m json.tool
# 3. Check Alertmanager config
amtool check-config /etc/alertmanager/alertmanager.yml
# 4. Common Alertmanager issues:
# - Route matching is wrong (alerts fall through to default)
# - Receiver webhook URL is wrong
# - SMTP credentials expired
# - Slack webhook token revoked
# 5. Test Alertmanager manually
amtool alert add alertname=test severity=critical instance=test-host \
--alertmanager.url=http://alertmanager:9093
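For the route-matching case specifically, amtool can evaluate the routing tree offline and show which receiver a given label set would land on (the label values here are examples):

```shell
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical alertname=HighCPULoad
```

If this prints the default/fallback receiver for labels that should hit a paging receiver, the route matchers are the problem.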
# 6. After fixing, run both systems in parallel for another week
# before attempting cutover again
## Emergency: Prometheus Storage Full
# 1. Check disk usage
df -h /var/lib/prometheus/
# 2. Check TSDB stats
curl -s http://prometheus:9090/api/v1/status/tsdb | python3 -m json.tool
# 3. Reduce retention (prometheus.yml or CLI flag)
# --storage.tsdb.retention.time=15d (default 15d)
# --storage.tsdb.retention.size=50GB
# 4. Delete old blocks manually (emergency only)
# Prometheus stores data in 2-hour blocks under /var/lib/prometheus/
# Delete oldest block directories, then restart
ls -lt /var/lib/prometheus/ | tail -5
# 5. Identify high-cardinality metrics
curl -s http://prometheus:9090/api/v1/status/tsdb | \
python3 -c "
import sys, json
data = json.load(sys.stdin)['data']
for s in sorted(data['seriesCountByMetricName'], key=lambda x: x['value'], reverse=True)[:10]:
    print(f\" {s['value']:>10} {s['name']}\")"
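Once an offender is identified, the TSDB admin API can drop its series. These endpoints only exist when Prometheus runs with --web.enable-admin-api (off by default); the metric name below is a placeholder:

```shell
# 6. Delete series for a high-cardinality metric, then reclaim the space
curl -s -X POST \
  'http://prometheus:9090/api/v1/admin/tsdb/delete_series?match[]=noisy_metric_name'
curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones
```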