
Monitoring Fundamentals - Street-Level Ops

What experienced monitoring engineers know from years of keeping eyes on infrastructure.

Quick Diagnosis Commands

# Check if Nagios is running and healthy
systemctl status nagios
nagiostats | grep -E "Active|Cached|Check"

# Check Nagios current problems (current_state=2 = CRITICAL for services)
grep -B 20 "current_state=2" /var/log/nagios/status.dat
# (-B pulls in the host_name/service_description lines above the state)

# Run a Nagios check manually
/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
/usr/lib/nagios/plugins/check_http -H localhost -p 8080 -u /health
/usr/lib/nagios/plugins/check_load -w 4,3,2 -c 8,6,4
/usr/lib/nagios/plugins/check_procs -w 250 -c 400

# Check Zabbix server status
zabbix_server -R diaginfo
systemctl status zabbix-server

# Test Zabbix agent connectivity
zabbix_get -s target-host -p 10050 -k agent.ping
zabbix_get -s target-host -p 10050 -k system.cpu.load[all,avg1]
zabbix_get -s target-host -p 10050 -k vfs.fs.size[/,pfree]

# Prometheus: check targets
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import sys, json
data = json.load(sys.stdin)['data']['activeTargets']
up = sum(1 for t in data if t['health'] == 'up')
down = sum(1 for t in data if t['health'] != 'up')
print(f'UP: {up}  DOWN: {down}')
for t in data:
    if t['health'] != 'up':
        print(f'  FAIL: {t[\"scrapeUrl\"]} - {t.get(\"lastError\",\"\")[:80]}')"

# SNMP: test device connectivity
snmpwalk -v2c -c public switch01 sysDescr
snmpget -v2c -c public switch01 sysUpTime.0
snmpwalk -v2c -c public switch01 ifOperStatus

# Check node_exporter
curl -s http://target:9100/metrics | grep -E "^node_cpu_seconds_total|^node_memory_MemAvailable"

# Quick system health one-liner
echo "Load: $(cat /proc/loadavg) | Mem: $(free -h | awk '/Mem/{print $3"/"$2}') | Disk: $(df -h / | awk 'NR==2{print $5}')"

Gotcha: Nagios Check Interval vs Retry Interval

You configure check_interval 5 (5 minutes) and max_check_attempts 3. You think the alert fires after 15 minutes of failure. Actually, after the first failure Nagios switches to retry_interval (default: 1 minute). So the alert fires after 5 + 1 + 1 = 7 minutes, not 15.

Fix:

define service {
    check_interval          5    ; Normal check every 5 minutes
    retry_interval          1    ; Retry every 1 minute after failure
    max_check_attempts      3    ; 3 failures before HARD state
    ; Alert fires after: 5 + 1 + 1 = 7 minutes worst case
    ; Not: 5 + 5 + 5 = 15 minutes
}

Understand the state machine: SOFT states use retry_interval, HARD states use check_interval. Notifications only fire on HARD state transitions.
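
The arithmetic generalizes; a quick sanity-check helper for reasoning about time-to-alert (illustrative, not part of Nagios):

```python
def worst_case_alert_minutes(check_interval, retry_interval, max_check_attempts):
    """Worst-case minutes from the start of a failure to a HARD state.

    The failure can begin just after a passing check, so it takes up to
    one full check_interval to catch the first (SOFT) failure, then
    max_check_attempts - 1 retries at retry_interval before HARD.
    """
    return check_interval + (max_check_attempts - 1) * retry_interval

print(worst_case_alert_minutes(5, 1, 3))  # 7, not 15
```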

Gotcha: NRPE Command Not Found

You define a new check in Nagios that calls an NRPE command on a remote host. The check returns UNKNOWN with "NRPE: Command 'check_myapp' not defined."

Fix:

# On the REMOTE host, check NRPE config (including drop-in files)
grep -s "^command\[check_myapp\]" /etc/nagios/nrpe.cfg /etc/nagios/nrpe.d/*.cfg
# Nothing? The command is not defined on the remote side.

# Add the command definition on the remote host:
# /etc/nagios/nrpe.cfg or /etc/nagios/nrpe.d/myapp.cfg
command[check_myapp]=/usr/lib/nagios/plugins/check_http -H localhost -p 8080 -u /health

# Restart NRPE (unit name varies: nrpe on RHEL, nagios-nrpe-server on Debian/Ubuntu)
systemctl restart nrpe

# Test from the Nagios server
/usr/lib/nagios/plugins/check_nrpe -H remote-host -c check_myapp

NRPE commands must be defined on BOTH sides: the check definition on Nagios, the command definition on the remote host.

Debug clue: If an NRPE check returns "CHECK_NRPE: Error - Could not complete SSL handshake," it is almost always an allowed_hosts mismatch in nrpe.cfg on the remote side. The Nagios server's IP must be listed there, and if the server has multiple interfaces, the source IP may not be the one you expect.
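
Both failure modes live in the remote nrpe.cfg, so they are easy to audit mechanically. A minimal sketch (illustrative parser, not the full nrpe.cfg grammar):

```python
import re

def audit_nrpe_config(text):
    """Pull out the two fields worth checking in an nrpe.cfg:
    defined command names and the allowed_hosts list."""
    commands, allowed = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"command\[([^\]]+)\]=", line)
        if m:
            commands.append(m.group(1))
        elif line.startswith("allowed_hosts="):
            allowed = [h.strip() for h in line.split("=", 1)[1].split(",")]
    return commands, allowed

sample = """
allowed_hosts=127.0.0.1,10.0.0.5
command[check_myapp]=/usr/lib/nagios/plugins/check_http -H localhost -p 8080 -u /health
"""
print(audit_nrpe_config(sample))
```

If the Nagios server's source IP is not in the returned allowed_hosts list, that is your SSL handshake failure.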

Gotcha: Zabbix Agent Cannot Connect

Zabbix server shows "Get value from agent failed: cannot connect to host" for multiple hosts.

Fix:

# On the target host, check agent status (zabbix-agent if running the v1 agent)
systemctl status zabbix-agent2

# Check if agent is listening
ss -tlnp | grep 10050

# Check agent config
grep -E "^Server=|^ServerActive=" /etc/zabbix/zabbix_agent2.conf
# Server= must include the Zabbix server IP

# Check firewall
iptables -L -n | grep 10050
firewall-cmd --list-ports | grep 10050

# Test from Zabbix server
zabbix_get -s target-host -p 10050 -k agent.ping
# Should return: 1

# If using a Zabbix proxy, check that the proxy port is reachable too
# (zabbix_get speaks the agent protocol; a proxy listens on 10051 and
# will not answer agent.ping, so test TCP reachability instead)
nc -zv proxy-host 10051

Gotcha: Prometheus Scrape Target Shows as DOWN

Prometheus targets page shows a target with state: down and last error: connection refused.

Fix:

# 1. Is the exporter running on the target?
ssh target-host "ss -tlnp | grep 9100"

# 2. Can Prometheus reach the target?
# From the Prometheus server:
curl -s http://target-host:9100/metrics | head -5

# 3. Check for firewall/security group blocking
# Port 9100 (node_exporter) must be accessible from Prometheus server

# 4. Check scrape config
# Is the target listed correctly in prometheus.yml?
grep -A 5 "target-host" /etc/prometheus/prometheus.yml

# 5. Check for DNS resolution issues
dig +short target-host

# 6. If using service discovery, check discovery status
curl -s http://localhost:9090/api/v1/targets | \
    python3 -c "import sys,json; [print(t['discoveredLabels']) for t in json.load(sys.stdin)['data']['activeTargets'] if 'target-host' in str(t)]"

Gotcha: SNMP Returns Timeout

SNMP queries to a network device hang and then timeout. The device is pingable.

Fix:

# 1. Check community string (v2c)
snmpget -v2c -c public switch01 sysDescr.0
# If timeout: wrong community string or SNMP not enabled

# 2. Check if SNMP is enabled on the device
# (device-specific — check via web UI or console)

# 3. Check SNMP version
# Try v1 if v2c fails:
snmpget -v1 -c public switch01 sysDescr.0

# 4. Check ACL on device
# Many devices restrict SNMP to specific source IPs
# Your monitoring server IP must be in the device's SNMP ACL

# 5. Check for firewall blocking UDP 161
tcpdump -n -i eth0 udp port 161 -c 5

# 6. SNMPv3 (if using authentication)
snmpget -v3 -u monitor_user -l authPriv \
    -a SHA -A 'authpass' -x AES -X 'privpass' \
    switch01 sysDescr.0

Pattern: Building a Monitoring Checklist for a New Service

When deploying a new service, set up monitoring for:

Infrastructure Layer:
[ ] Host alive (ping / agent connectivity)
[ ] CPU utilization (warning 80%, critical 95%)
[ ] Memory utilization (warning 85%, critical 95%)
[ ] Disk space (warning 80%, critical 90%)
[ ] Disk I/O latency (warning >10ms, critical >50ms)
[ ] Network interface errors

Application Layer:
[ ] Process running (check_procs or node_exporter process collector)
[ ] HTTP health endpoint (response code + latency)
[ ] Application error rate (5xx responses)
[ ] Request latency (p95, p99)
[ ] Queue depth (if applicable)
[ ] Connection pool usage

Dependency Layer:
[ ] Database connectivity
[ ] Cache (Redis/Memcached) connectivity
[ ] External API dependencies
[ ] DNS resolution
[ ] Certificate expiry

Business Layer:
[ ] Transaction success rate
[ ] User-facing SLO metrics
[ ] Key business metric (signups, orders, etc.)
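
The warning/critical pairs above follow the standard plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL). A minimal sketch of how a check maps a measured value onto a state:

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def evaluate(value, warn, crit):
    """Map a utilization percentage onto a plugin state using
    warning/critical thresholds like those in the checklist above."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

# e.g. disk space: warning 80%, critical 90%
print(evaluate(85, warn=80, crit=90))  # 1 (WARNING)
```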

Pattern: Nagios Alert Debugging Flow

Alert fires → Check Nagios web UI → Read the status output

If status is stale:
  Check: Is the Nagios scheduler running?
  Check: Is the check_interval too long?
  Check: Is the host/service disabled?

If status is UNKNOWN:
  The plugin crashed or returned bad output
  Run the plugin manually on the Nagios server:
    /usr/lib/nagios/plugins/check_nrpe -H host -c command

  If NRPE check: run it directly on the remote host:
    /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /

If status is CRITICAL but service looks fine:
  Check: Is the threshold correct for this hardware?
  Check: Is the check hitting the right host/port?
  Check: Did someone change the service without updating monitoring?

Pattern: Metric Naming Conventions

Prometheus naming best practices:
  <namespace>_<subsystem>_<name>_<unit>

Examples:
  http_requests_total            (counter, no unit needed)
  http_request_duration_seconds  (histogram, unit = seconds)
  node_memory_MemFree_bytes      (gauge, unit = bytes)
  process_cpu_seconds_total      (counter, unit = seconds)

Units should be base units:
  seconds (not milliseconds)
  bytes (not megabytes)
  meters (not kilometers)

> **Under the hood:** Prometheus stores everything as 64-bit floats. Using base units (seconds, bytes) avoids precision loss at small values and keeps PromQL math consistent. If one metric is in milliseconds and another in seconds, `rate()` comparisons silently produce nonsense.

Counters should end with _total:
  http_requests_total
  disk_reads_total

Use labels for dimensions:
  http_requests_total{method="GET", status="200", handler="/api"}
  NOT: http_get_requests_200_api_total
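
These conventions can be linted mechanically. A rough sketch (the checks approximate the rules above, not the official Prometheus validation):

```python
import re

# Approximation of a legal Prometheus metric name
VALID_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def lint_metric_name(name, kind):
    """Flag common naming mistakes: invalid characters, counters
    missing _total, and non-base units like _milliseconds."""
    problems = []
    if not VALID_NAME.match(name):
        problems.append("invalid characters")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter should end with _total")
    if name.endswith(("_milliseconds", "_megabytes", "_kilometers")):
        problems.append("use base units (seconds, bytes, meters)")
    return problems

print(lint_metric_name("http_request_duration_milliseconds", "histogram"))
```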

Emergency: Monitoring Server Down

# 1. Do not panic. You are blind but services are still running.

# 2. Check monitoring server status
systemctl status nagios    # or zabbix-server, or prometheus
journalctl -u nagios --since "30 minutes ago"

# 3. Common causes:
# - Disk full (monitoring generates a lot of data)
df -h /var/lib/nagios/ /var/lib/prometheus/ /var/lib/zabbix/
# - Out of memory (too many checks/targets)
free -h
dmesg | grep -i "out of memory"
# - Database full (Zabbix)
du -sh /var/lib/mysql/zabbix/

# 4. Quick fix: restart the service
systemctl restart nagios  # or prometheus, zabbix-server

# 5. If disk full:
# - Rotate logs: logrotate -f /etc/logrotate.d/nagios
# - For Prometheus: reduce retention
# - For Zabbix: run housekeeper or truncate history tables

# 6. Set up external monitoring of your monitoring
# Use a free service (UptimeRobot, Healthchecks.io) to ping
# your monitoring server's health endpoint

Emergency: Alert Storm (Hundreds of Alerts at Once)

# 1. Identify the root cause, not individual alerts
# A network switch failure causes 50 host-down alerts
# Focus on the FIRST alert chronologically

# 2. In Nagios: check parent/child host relationships
# If a parent host is DOWN, hosts behind it become UNREACHABLE and
# their alerts are suppressed by default
# If they're not: the `parents` directive is missing from the child hosts

# 3. In Prometheus/Alertmanager:
# Group alerts by common label
curl -s http://alertmanager:9093/api/v2/alerts | \
    python3 -c "import sys,json; alerts=json.load(sys.stdin); \
    [print(f'{a[\"labels\"][\"alertname\"]}: {a[\"labels\"].get(\"instance\",\"?\")}') for a in alerts[:20]]"

# 4. Silence the storm while you work
# Alertmanager: create a silence
amtool silence add alertname=~".*" instance=~"10.0.1.*" \
    --comment="Network switch failure investigation" \
    --duration=2h \
    --alertmanager.url=http://alertmanager:9093

# 5. Fix the root cause (switch, network, DNS, etc.)
# 6. Remove the silence
# 7. Post-incident: add proper dependency/grouping
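
For step 7 on the Alertmanager side, grouping collapses a storm into one notification per failure domain. A hedged sketch of a route (label names like `datacenter` and the receiver name are assumptions; substitute your own):

```yaml
route:
  # One notification per alertname per datacenter, not one per host
  group_by: ['alertname', 'datacenter']
  group_wait: 30s       # wait for related alerts before first notification
  group_interval: 5m    # batch new alerts into an existing group
  repeat_interval: 4h
  receiver: ops-pager   # assumed receiver name
```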