
Monitoring Fundamentals Footguns

Mistakes that leave you blind, drown you in noise, or give you false confidence.


1. Monitoring the infrastructure but not the application

CPU is at 5%, memory is fine, and the disk has plenty of space, so every infrastructure graph is green. Meanwhile the application is returning 500 errors to every request because its database connection pool is exhausted. The dashboard says healthy; the customers see errors.

Fix: Always monitor at the application layer: HTTP status codes, error rates, response latency, and health check endpoints. Infrastructure metrics are necessary but not sufficient.
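As a sketch, an application-layer probe needs nothing beyond the standard library (the URL is whatever health endpoint your service exposes; treating only HTTP 200 as healthy is an illustrative choice):

```python
import urllib.request
import urllib.error

def http_health(url, timeout=5.0):
    """Application-layer probe: fails when the application fails,
    even if CPU, memory, and disk on the host all look fine."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, "HTTP %d" % resp.status
    except urllib.error.HTTPError as e:
        # The server answered, but with an error (e.g. 500s from an
        # exhausted connection pool) -- exactly what host metrics miss.
        return False, "HTTP %d" % e.code
    except (urllib.error.URLError, OSError) as e:
        return False, "unreachable: %s" % e
```

Run it on a schedule against a health endpoint and alert on the boolean, alongside error-rate and latency metrics.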


2. Alerting on every metric you collect

You set up alerts on CPU, memory, disk, swap, load average, network throughput, inode usage, context switches, and 30 other metrics per host. A single incident produces 50 alerts. The on-call engineer spends 20 minutes triaging alerts instead of fixing the problem.

Fix: Alert on symptoms (error rate, latency, availability), not causes (CPU, memory). Use dashboards for causes — they are for investigation after the alert fires, not for alerting themselves.
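As an illustration, a single symptom-level rule can replace dozens of per-host cause alerts. A Prometheus sketch, assuming a conventional `http_requests_total` counter labelled by status code (the 1% threshold is illustrative, not a recommendation):

```yaml
groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        # Symptom: the fraction of requests failing -- not the CPU or
        # memory that may or may not be causing it.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 1% of requests are failing"
```

When this fires, the CPU, memory, and saturation dashboards are where you look next, not what pages you.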


3. Using default SNMP community strings in production

Your switches and routers use "public" as the SNMP community string because that is what the vendor shipped. Anyone on the network can query every metric on every device. With write access (community "private"), they can reconfigure the device.

Fix: Change community strings to something unique and non-guessable. Better: use SNMPv3 with authentication and encryption. Restrict SNMP access to your monitoring server's IP via ACLs on the device.
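A Cisco-IOS-flavoured sketch of the fix (exact syntax varies by vendor and OS version; the view, group, user, passwords, and monitoring-server IP below are all placeholders):

```
! SNMPv3 user with authentication (SHA) and encryption (AES), read-only
snmp-server view MONVIEW iso included
snmp-server group MONGROUP v3 priv read MONVIEW access 99
snmp-server user monuser MONGROUP v3 auth sha AuthPass123 priv aes 128 PrivPass123
! ACL 99: only the monitoring server may query SNMP at all
access-list 99 permit 192.0.2.10
! and remove the vendor defaults
no snmp-server community public
no snmp-server community private
```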


4. No monitoring of the monitoring system

Your Nagios server runs out of disk space and crashes. No alerts fire because the thing that sends alerts is the thing that crashed. You find out 4 hours later when someone manually checks a server.

Fix: Use an external monitoring service (UptimeRobot, Healthchecks.io, or a separate lightweight check) to monitor your monitoring server. The monitor that watches the monitors cannot be the same monitor.
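With Healthchecks.io this becomes a dead-man's switch: the monitoring server pings an external URL on a schedule, and the external service alerts when pings stop arriving. A cron sketch (the check UUID is a placeholder issued when you create the check; `hc-ping.com` is the service's ping endpoint):

```
# /etc/cron.d/monitor-the-monitor -- runs ON the Nagios host itself.
# If Nagios's host dies, the pings stop and Healthchecks.io alerts.
*/5 * * * * nagios curl -fsS --max-time 10 https://hc-ping.com/<your-check-uuid> > /dev/null
```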


5. Setting thresholds without baselines

You set the disk warning at 80% because that is what the tutorial said. This server's data volume grows 2% of capacity per day, so by the time the warning fires you have about ten days, which may be less than your lead time to provision storage. Another server has seasonal data and hits 80% every month before its cleanup job runs.

Fix: Establish baselines first. Monitor for a week before setting thresholds. Set thresholds based on how much time you need to respond, not on arbitrary percentages. For disk: alert when projected time to full is less than your response window.
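The disk rule reduces to a small projection. A sketch, assuming roughly linear growth measured over your baseline week:

```python
def days_until_full(used_pct, growth_pct_per_day):
    """Project days until a volume fills, from current usage and a
    measured growth rate, both as a percentage of capacity."""
    if growth_pct_per_day <= 0:
        return float("inf")  # flat or shrinking: no projected fill date
    return (100.0 - used_pct) / growth_pct_per_day

def should_alert(used_pct, growth_pct_per_day, response_window_days):
    """Alert on projected time-to-full, not an arbitrary percentage."""
    return days_until_full(used_pct, growth_pct_per_day) < response_window_days
```

At 2% growth per day, the projection crosses a 14-day response window at 72% used, well before a fixed 80% threshold would fire; a flat volume sitting at 85% never alerts at all.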


6. Check interval too frequent for the monitoring server

You set 10-second check intervals on 500 hosts with 10 checks each. That is 500 checks per second. Nagios cannot keep up. Checks pile up. Some run minutes late. You get stale data and delayed alerts.

Fix: Match check intervals to criticality. Critical services: 30-60 seconds. Standard services: 5 minutes. Capacity metrics: 15 minutes. Calculate total check load before deploying: hosts × checks per host ÷ interval (in seconds) = checks per second.
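The load calculation is one line; worth running before, not after, rollout (a sketch, with intervals in seconds):

```python
def checks_per_second(hosts, checks_per_host, interval_s):
    """Steady-state check rate the monitoring server must sustain."""
    return hosts * checks_per_host / interval_s

# The scenario above: 500 hosts x 10 checks at a 10-second interval.
hot = checks_per_second(500, 10, 10)        # 500.0 checks/second
relaxed = checks_per_second(500, 10, 300)   # ~16.7/s at 5-minute intervals
```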


7. Passive checks with no freshness checking

You set up passive checks for batch jobs that report results to Nagios when they complete. The batch job crashes and never sends a result. Nagios shows the last successful result as current. Nobody notices the job has been dead for a week.

Fix: Enable freshness checking on passive checks. Set check_freshness 1 and freshness_threshold to slightly longer than the expected reporting interval. If no result arrives in time, Nagios marks the check as CRITICAL.
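A sketch of the service definition (the host, service, and `report-stale` command names are placeholders; the command named by `check_command` runs only when freshness expires and should return CRITICAL with a "no result received" message):

```
define service {
    use                     generic-service
    host_name               batch01
    service_description     nightly-export
    active_checks_enabled   0        ; results normally arrive passively
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     90000    ; seconds: a 24h reporting interval, plus slack
    check_command           report-stale
}
```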


8. Dashboard without context

You build a Grafana dashboard that shows CPU, memory, and disk graphs. The graphs show lines going up and down. There are no thresholds marked, no annotations for deployments, no context for what "normal" looks like. During an incident, the on-call engineer stares at the dashboard and cannot tell if the current values are abnormal.

Fix: Add threshold lines to graphs (warning/critical). Add deployment annotations. Include "normal range" context. Build dashboards that answer specific questions ("Is this service healthy?") not generic ones ("Here are some numbers").


9. Nagios notification commands that silently fail

You configure Nagios to send alerts via a custom email script or Slack webhook. The script has a typo. Nagios runs the notification command, it fails, and Nagios logs the failure to a file nobody reads. Critical alerts are not being delivered and nobody knows.

Fix: Test notification commands manually before deploying. Check Nagios notification logs regularly. Send a test notification to every contact group weekly. Monitor notification delivery as its own check.
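One way to make delivery failures visible is to wrap the notification command so failures land somewhere a separate check can watch. A sketch, where `false` stands in for a broken notification script (in a real deployment the function would call your actual script, and the echo would append to a log file monitored by its own Nagios check):

```shell
#!/bin/sh
# Wrapper: run the notification command and surface failures instead of
# letting them vanish into an unread log.
notify() {
    # stand-in for e.g. /usr/local/nagios/libexec/notify-slack.sh "$@"
    false
}

if ! notify "PROBLEM" "web01" "HTTP" "CRITICAL"; then
    # In production: append to a log that a dedicated check alerts on.
    echo "notification FAILED: PROBLEM web01 HTTP CRITICAL"
fi
```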


10. Building your own monitoring instead of using established tools

You decide Nagios is "too old" and Prometheus is "too complex." You write a custom monitoring system in Python that checks HTTP endpoints and sends Slack messages. It works for 3 months until edge cases pile up: check scheduling, state management, alert deduplication, silence windows, escalations, historical data. You have reinvented Nagios, poorly.

Fix: Use established monitoring tools. They have solved problems you have not thought of yet. The learning curve is cheaper than the maintenance cost of a custom system. Customize through plugins, exporters, and dashboards — not by rewriting the core.