
The Observability Migration

Category: The Migration · Domains: monitoring, observability · Read time: ~5 min


Setting the Scene

I was the observability lead at an insurance tech company. We had Nagios. Not "Nagios plus some modern stuff." Just Nagios. Nagios Core 4.4.6 on a single VM with 2,847 service checks, 412 host checks, and a check_interval of 5 minutes for everything. The web UI looked like it was from 2004 because it was. Alert notifications went to a shared email inbox that had 14,000 unread messages. Nobody read them.

The plan: migrate to Prometheus, Grafana, and Alertmanager. Modern stack. Metrics-based. Beautiful dashboards instead of a green/red/yellow table. What could go wrong?

What Happened

Week 1 — I deployed Prometheus using the kube-prometheus-stack Helm chart to our Kubernetes cluster. Set up Grafana with the default dashboards. Immediately got beautiful CPU, memory, and network graphs for every pod. The team was impressed. "This is so much better than Nagios," everyone said. It was. But we hadn't migrated a single alert yet.

Week 2-3 — Alert translation. I exported our Nagios config and started mapping checks to Prometheus alerting rules. check_http became a probe_success metric via Blackbox Exporter. check_disk became node_filesystem_avail_bytes. check_load became node_load1. For the standard checks, it was tedious but mechanical. I wrote 180 alerting rules in two weeks.
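A direct translation looked something like this — a sketch, not our actual rules file; the thresholds and job labels here are illustrative, and note the absence of duration clauses, which is exactly what bit us later:

```yaml
groups:
  - name: migrated-from-nagios
    rules:
      # check_http equivalent: Blackbox Exporter probe result
      - alert: HTTPEndpointDown
        expr: probe_success{job="blackbox-http"} == 0
        labels:
          severity: critical
        annotations:
          summary: "HTTP probe failing for {{ $labels.instance }}"

      # check_disk equivalent, carrying over the old Nagios threshold
      # (usage above 70%) verbatim
      - alert: DiskUsageHigh
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.30
        labels:
          severity: warning

      # check_load equivalent
      - alert: HighLoad
        expr: node_load1 > 4
        labels:
          severity: warning
```

Mechanical, like I said. The rules fire the instant the expression is true, the same way Nagios fired after its retry interval elapsed.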

Week 4 — The custom checks. We had 340 custom Nagios plugins — bash scripts, Python scripts, one inexplicable Perl script that checked SAP transaction codes. These weren't translatable to Prometheus metrics. They were procedural checks: "SSH into this box, run this command, parse the output, decide if it's OK." Prometheus doesn't work like that. I had to decide: rewrite them as exporters, use Blackbox Exporter creatively, or accept that some checks would stay in Nagios.
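For the checks we did rewrite, the lowest-effort pattern was the node_exporter textfile collector: keep the procedural logic, but have it emit a metric instead of a Nagios OK/CRITICAL line. A minimal sketch — the metric name, the directory default, and some_legacy_check_command are all hypothetical placeholders, not our actual plugin:

```shell
#!/usr/bin/env bash
# Wrap a legacy procedural check as a node_exporter textfile-collector
# metric. Run from cron; node_exporter scrapes the .prom file.
set -euo pipefail

# Real deployments point this at the directory given to node_exporter's
# --collector.textfile.directory flag; a local dir here for illustration.
TEXTFILE_DIR="${TEXTFILE_DIR:-./textfile_collector}"
mkdir -p "$TEXTFILE_DIR"

# Run the old check logic; capture exit status instead of parsing
# OK/WARNING/CRITICAL text. 1 = check passed, 0 = check failed,
# matching the convention of Prometheus's "up" metric.
if some_legacy_check_command >/dev/null 2>&1; then
  status=1
else
  status=0
fi

# Write atomically so node_exporter never scrapes a half-written file.
tmp="$(mktemp)"
printf 'sap_tcode_check_ok %d\n' "$status" > "$tmp"
mv "$tmp" "${TEXTFILE_DIR}/sap_tcode_check.prom"
```

The alerting rule then becomes a one-liner (sap_tcode_check_ok == 0), and the procedural logic stays wherever it already lived.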

Week 5-6 — Dual-running. We had both Nagios and Prometheus firing alerts. Alertmanager sent to PagerDuty. Nagios sent to the email inbox AND PagerDuty (because someone had finally connected it). Every real issue generated two alerts, one from each system. Every false positive generated two alerts. The on-call engineer was getting 60+ notifications per shift instead of 30. Alert fatigue went from bad to catastrophic. Three people muted their PagerDuty apps.

Week 7 — I shut off Nagios alerting for any check that had a Prometheus equivalent. Kept Nagios running read-only so teams could compare. This cut the noise by 60%. But 40% of the remaining Prometheus alerts were too sensitive — I'd translated Nagios thresholds directly, and Nagios thresholds were accumulated cruft from five years of "just add a check." We had an alert for disk usage above 70% on a 2TB volume. It fired constantly. Nobody cared. It had been in the email inbox for three years.
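"Read-only" in Nagios terms meant keeping the check scheduled but killing the pages. A sketch of the per-service override (host and service names are illustrative):

```yaml
# Nagios object config, not YAML: keep the check running and visible
# in the UI, but stop it from notifying anyone.
define service {
    use                   generic-service
    host_name             web01
    service_description   HTTP
    check_command         check_http
    notifications_enabled 0
}
```

Flipping notifications_enabled per service, rather than globally, let teams that hadn't been migrated yet keep their pages.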

Week 8-9 — Alert review. I sat down with each team and went through their alerts one by one. "Do you act on this alert? What do you do? Has it ever woken you up? Should it?" We deleted 40% of the alerts. Adjusted thresholds on 30%. Added duration clauses (for: 5m) so an alert had to stay in breach for five minutes before firing, which killed the flapping. The remaining alerts were ones people actually cared about.
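The surviving disk alert ended up looking roughly like this — again a sketch with illustrative values, not our exact rule — the point being the combination of a sane threshold and a duration clause:

```yaml
groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        # Fire on free space ratio, and only after the condition has
        # held for 5 minutes, so transient spikes don't page anyone.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Under 10% disk space on {{ $labels.instance }}"
```

Compare that to the old 70%-on-a-2TB-volume check: a percentage threshold with no duration, on a volume where 30% free meant 600GB of headroom.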

Week 10 — Nagios decommission. I powered off the Nagios VM on a Friday afternoon. Nobody noticed until Monday, and only because someone wanted to look at historical data. We pointed them at Grafana.

The Moment of Truth

Week 6, watching the on-call engineer get 63 PagerDuty notifications in a single 12-hour shift, half from Nagios and half from Prometheus, for the same set of minor issues. The dual-running period was necessary — you can't cut over alerting cold — but we should have been much more aggressive about silencing the old system as the new one came online.

The Aftermath

Three months after decommission, we had 320 alerting rules (down from 2,847 Nagios checks), a 94% reduction in alert volume, and on-call engineers who actually responded to pages because pages meant something. The Grafana dashboards became the default tab in every team's browser. One engineer told me it was the first time in four years she'd trusted the monitoring system.

The Lessons

  1. Migrate alerts gradually by service: Don't run both systems at full volume. Migrate one service's alerts, validate them, then silence the corresponding Nagios checks. Team by team, service by service.
  2. Dual-running is expensive but necessary: You can't cut over alerting without a parallel period, but plan it carefully. Set a hard deadline for each service's dual-run period — 2 weeks max.
  3. Take the chance to reduce alert count: A migration is the perfect excuse to audit every alert. If nobody acts on it, delete it. If it fires daily, it's not an alert — it's a dashboard metric.

What I'd Do Differently

I'd start with the alert audit, not the migration. Go through every Nagios check with the owning team BEFORE writing a single Prometheus rule. Kill the 40% of checks that nobody cares about in Nagios first. Then migrate only the alerts that matter. You're translating 60% of the work instead of 100%, and you skip the dual-running noise for checks that should never have existed.

The Quote

"We had 2,847 Nagios checks and 14,000 unread alert emails. The monitoring system was working perfectly — at being ignored."

Cross-References