
Portal | Level: L1: Foundations | Topics: Monitoring Fundamentals, Prometheus, Grafana | Domain: Observability

Monitoring Fundamentals - Primer

Why This Matters

You cannot operate what you cannot see. Monitoring is not a nice-to-have bolted onto infrastructure after the fact. It is the nervous system that tells you if your services are alive, healthy, degraded, or on fire. Without monitoring, you find out about outages from your customers. With good monitoring, you find out before your customers do.

This is the L1 foundation — what monitoring is, how the major systems work, what metrics mean, and how to think about alerting. Whether you end up running Nagios on bare metal or Prometheus in Kubernetes, these fundamentals apply everywhere. I started with Nagios on 1,500 bare-metal servers and eventually migrated to Prometheus. The tools changed; the principles did not.

Core Concepts

1. What Monitoring Does

Monitoring answers four questions:
1. Is the service UP?            Availability (health checks, uptime)
2. Is the service FAST?          Latency (response time, percentiles)
3. Is the service CORRECT?       Error rate (5xx, exceptions, failures)
4. Is the service SATURATED?     Utilization (CPU, memory, disk, connections)

These map to the Four Golden Signals (Google SRE):
  Latency     How long requests take
  Traffic     How many requests per second
  Errors      How many requests fail
  Saturation  How full your resources are
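The four signals fall straight out of raw request data. A minimal sketch in Python (the `Request` record, the one-second window, and the connection-pool numbers are all illustrative, not from any real system):

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_s: float  # how long the request took
    status: int        # HTTP status code

def golden_signals(requests, window_s, pool_used, pool_max):
    """Summarize one window of traffic as the four golden signals.
    Latency here is a mean over successful requests; real systems
    track percentiles instead."""
    ok = [r for r in requests if r.status < 500]
    return {
        "traffic": len(requests) / window_s,                  # requests/second
        "errors": (len(requests) - len(ok)) / len(requests),  # failure ratio
        "latency": sum(r.duration_s for r in ok) / len(ok),   # mean seconds
        "saturation": pool_used / pool_max,                   # e.g. connection pool
    }
```

Feed it three requests where one returned a 500 and you get traffic of 3 req/s, a 33% error ratio, and whatever the pool utilization happens to be.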

2. Monitoring Architecture Patterns

Pattern 1: Agent-Based Push (Nagios, Zabbix)
┌─────────┐      push      ┌────────────┐
│ Agent   │───────────────>│ Central    │
│ (NRPE/  │  check result  │ Server     │
│  Zabbix │                │ (Nagios/   │
│  agent) │                │  Zabbix)   │
└─────────┘                └────────────┘

Pattern 2: Pull/Scrape (Prometheus)
┌────────────┐    scrape     ┌────────────┐
│ Exporter   │<──────────────│ Prometheus │
│ (exposes   │   HTTP GET    │ Server     │
│  /metrics) │               │ (pulls     │
└────────────┘               │  metrics)  │
                             └────────────┘

Pattern 3: SNMP Polling (Network devices)
┌──────────┐    SNMP GET    ┌────────────┐
│ Network  │<───────────────│ Monitoring │
│ Device   │   OID values   │ Server     │
│ (switch/ │                │            │
│  router) │                └────────────┘
└──────────┘

3. SNMP (Simple Network Management Protocol)

Name origin: Despite "Simple" in the name, SNMP is notoriously complex. The joke in the industry is that SNMP stands for "Security Not My Problem" (v1/v2c transmit community strings in cleartext) or "Still Not Managing Properly." It was defined in RFC 1157 (1990) and the "Simple" referred to its design being simpler than the competing OSI network management protocol CMIP.

SNMP is the standard for monitoring network devices. Every managed switch, router, firewall, and UPS speaks SNMP.

SNMP Components:
  Manager  → The monitoring server that queries devices
  Agent    → Software on the device that responds to queries
  MIB      → Management Information Base (defines what OIDs mean)
  OID      → Object Identifier (address of a specific metric)
  Community String → Password for SNMP v1/v2c (cleartext!)

Common OIDs:
  .1.3.6.1.2.1.1.1.0        → sysDescr (device description)
  .1.3.6.1.2.1.1.3.0        → sysUpTime
  .1.3.6.1.2.1.2.2.1.10     → ifInOctets (interface input bytes)
  .1.3.6.1.2.1.2.2.1.16     → ifOutOctets (interface output bytes)
  .1.3.6.1.2.1.25.3.3.1.2   → hrProcessorLoad (CPU usage)
# SNMP queries from the command line (net-snmp tools)
snmpwalk -v2c -c public switch01 .1.3.6.1.2.1.1
snmpget -v2c -c public switch01 .1.3.6.1.2.1.1.3.0
# Name-based queries need the corresponding MIB files installed locally:
snmpwalk -v2c -c public switch01 ifDescr
snmpwalk -v2c -c public switch01 ifInOctets

SNMP v3 adds authentication and encryption. Use v3 in production — v2c community strings are transmitted in cleartext.
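Note that ifInOctets/ifOutOctets are cumulative counters, so a single reading tells you nothing; you poll twice and divide the delta by the interval. A sketch of that arithmetic (classic 32-bit counters wrap at 2^32; the sample values are made up):

```python
def counter_rate(prev, curr, interval_s, max_val=2**32):
    """Per-second rate from two samples of a monotonic counter such as
    ifInOctets. Tolerates one wrap of a 32-bit counter; on fast links
    use the 64-bit ifHCInOctets counters instead, where wraps are rare."""
    delta = curr - prev
    if delta < 0:          # counter wrapped past max_val between polls
        delta += max_val
    return delta / interval_s

# 10,000,000 octets in 60 s is about 166.7 kB/s, i.e. ~1.33 Mbit/s
bits_per_second = counter_rate(1_000_000, 11_000_000, 60) * 8
```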

4. Nagios Architecture

Who made it: Nagios was created by Ethan Galstad in 1999 under the name "NetSaint." It was renamed to "Nagios" in 2002 due to a trademark dispute. The name is a recursive acronym: Nagios Ain't Gonna Insist On Sainthood — a nod to its NetSaint origins.

Nagios is the grandfather of infrastructure monitoring. It has been around (as NetSaint) since 1999 and is still running in thousands of environments.

┌──────────────────────────────────────────────────┐
│                  Nagios Server                   │
│  ┌───────────┐  ┌───────────┐  ┌──────────────┐  │
│  │ Scheduler │  │ Check     │  │ Notification │  │
│  │ (runs     │  │ Engine    │  │ Engine       │  │
│  │  checks   │  │ (active/  │  │ (email/      │  │
│  │  on       │  │  passive) │  │  pager/      │  │
│  │  schedule)│  │           │  │  webhook)    │  │
│  └───────────┘  └───────────┘  └──────────────┘  │
│                       │                          │
│       ┌───────────────┼───────────────┐          │
│       │               │               │          │
│  ┌────▼───┐      ┌────▼───┐      ┌────▼───┐      │
│  │ check  │      │ check  │      │ check  │      │  (plugins)
│  │ _disk  │      │ _http  │      │ _load  │      │
│  └────────┘      └────────┘      └────────┘      │
└──────────────────────────────────────────────────┘
        │                               │
   Local check      Remote check via NRPE
                                        │
                                ┌───────▼──────┐
                                │ Remote Host  │
                                │ ┌──────────┐ │
                                │ │ NRPE     │ │
                                │ │ daemon   │ │
                                │ └──────────┘ │
                                └──────────────┘
Nagios check states:
  0 = OK       (green)
  1 = WARNING  (yellow)
  2 = CRITICAL (red)
  3 = UNKNOWN  (orange)

Check types:
  Active  → Nagios initiates the check on schedule
  Passive → External process sends results to Nagios
# Run a Nagios plugin manually
/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
# DISK OK - free space: / 45123 MB (78% inode=93%);| /=12456MB;41234;46388;0;51543

/usr/lib/nagios/plugins/check_http -H app.example.com -u /health -t 10
# HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.045 second response time

/usr/lib/nagios/plugins/check_load -w 4,3,2 -c 8,6,4
# OK - load average: 0.52, 0.48, 0.39|load1=0.520;4.000;8.000;0; ...
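The plugin contract is nothing more than an exit code plus one line of text (optionally with |perfdata after a pipe). A hypothetical custom plugin sketched in Python, using illustrative 80/90 thresholds:

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios state exit codes

def check_value(label, value, warn, crit, unit=""):
    """Map a measurement onto the Nagios state model and print output
    in plugin format: STATE - human text|label=value;warn;crit"""
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    name = ["OK", "WARNING", "CRITICAL"][state]
    print(f"{name} - {label} is {value}{unit}|{label}={value}{unit};{warn};{crit}")
    return state

# In a real plugin you would end with:
#   sys.exit(check_value("disk_used_pct", 84, warn=80, crit=90, unit="%"))
```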

5. Zabbix Architecture

Zabbix adds a web interface, database backend, and more sophisticated data collection than Nagios.

┌────────────────────────────────────────┐
│             Zabbix Server              │
│  ┌────────┐  ┌─────────┐  ┌─────────┐  │
│  │ Poller │  │ Trapper │  │ Alerter │  │
│  └────────┘  └─────────┘  └─────────┘  │
│       │                                │
│  ┌────▼─────────────────────────────┐  │
│  │             Database             │  │
│  │  (PostgreSQL/MySQL/TimescaleDB)  │  │
│  └──────────────────────────────────┘  │
└────────────────────────────────────────┘
        │              │
   Zabbix Agent    Zabbix Proxy
   (on each host)  (for remote sites)
        │              │
   ┌────▼────┐    ┌────▼───────────┐
   │ Target  │    │ Remote Site    │
   │ Host    │    │ ┌───────────┐  │
   └─────────┘    │ │ Zabbix    │  │
                  │ │ Agent     │  │
                  │ └───────────┘  │
                  └────────────────┘
Zabbix concepts:
  Host        A device or server being monitored
  Item        A specific metric being collected (CPU idle, disk free)
  Trigger     A condition that evaluates to PROBLEM or OK
  Template    A reusable set of items, triggers, and graphs
  Discovery   Automatic detection of new items (network interfaces, filesystems)
  Proxy       Collects data on behalf of the server (for remote sites)
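A trigger is just a condition evaluated against the values an item has collected. A rough Python sketch of what an expression like avg(/host/system.cpu.util,5m)>90 boils down to (illustrative only; Zabbix evaluates triggers server-side against its database):

```python
def trigger_fires(samples, window_s, threshold):
    """PROBLEM when the average of an item's values over the last
    window_s seconds exceeds the threshold.

    samples: list of (timestamp_s, value) pairs, newest last."""
    now = samples[-1][0]
    recent = [v for t, v in samples if now - t <= window_s]
    return sum(recent) / len(recent) > threshold
```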

6. Prometheus Pull Model

Prometheus scrapes HTTP endpoints on a schedule.
Each target exposes metrics at /metrics.

Example /metrics output:
  # HELP http_requests_total Total number of HTTP requests
  # TYPE http_requests_total counter
  http_requests_total{method="GET",handler="/health",status="200"} 145823
  http_requests_total{method="POST",handler="/api/v1/users",status="201"} 3421
  http_requests_total{method="GET",handler="/api/v1/users",status="500"} 17

  # HELP http_request_duration_seconds Request latency in seconds
  # TYPE http_request_duration_seconds histogram
  http_request_duration_seconds_bucket{handler="/health",le="0.01"} 140000
  http_request_duration_seconds_bucket{handler="/health",le="0.1"} 145000
  http_request_duration_seconds_bucket{handler="/health",le="1"} 145823
  http_request_duration_seconds_bucket{handler="/health",le="+Inf"} 145823
  http_request_duration_seconds_sum{handler="/health"} 872.45
  http_request_duration_seconds_count{handler="/health"} 145823
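In a real application you would use an official client library (prometheus_client in Python, for example), but the exposition format is plain enough to emit by hand. A stdlib-only sketch serving a single hypothetical counter:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"health_200": 0}  # naive in-process counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            REQUESTS["health_200"] += 1
            body = b"ok"
        elif self.path == "/metrics":
            # Exposition format: HELP/TYPE comments, then one sample per line
            body = (
                "# HELP http_requests_total Total number of HTTP requests\n"
                "# TYPE http_requests_total counter\n"
                f'http_requests_total{{method="GET",handler="/health",'
                f'status="200"}} {REQUESTS["health_200"]}\n'
            ).encode()
        else:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

# To serve it: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Point a scrape job at whatever port you bind and the counter shows up like any other target.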

7. Metric Types

Type       Description                       Example                        Use
Counter    Only goes up (resets on restart)  http_requests_total            Request counts, errors, bytes sent
Gauge      Goes up and down                  node_memory_MemFree_bytes      Temperature, queue depth, CPU usage
Histogram  Counts observations in buckets    http_request_duration_seconds  Latency percentiles (p50, p95, p99)
Summary    Pre-calculated quantiles          go_gc_duration_seconds         Client-side percentiles (less flexible)

Remember: Mnemonic for Prometheus metric types: "Counters Climb, Gauges Go-anywhere, Histograms Have-buckets." Counters only go up (use rate() to make them useful). Gauges go up and down (use directly). Histograms count observations into buckets (use histogram_quantile() for percentiles).

Counter: Use rate() to get per-second values
  rate(http_requests_total[5m])         requests per second over 5 min
  increase(http_requests_total[1h])     total requests in the last hour

Gauge: Use directly or track changes
  node_memory_MemFree_bytes             current free memory
  delta(node_memory_MemFree_bytes[1h])  memory change in last hour

Histogram: Use histogram_quantile()
  histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
   95th percentile latency over 5 minutes
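Under the hood, histogram_quantile() finds the bucket containing the target rank and interpolates linearly inside it. A simplified Python version (it skips the rate() step and PromQL's special handling of the +Inf bucket), run against the /health buckets from the scrape example in section 6:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs, ending with
    float('inf') for the +Inf bucket."""
    rank = q * buckets[-1][1]        # target observation number
    lo_bound, lo_count = 0.0, 0
    for hi_bound, count in buckets:
        if rank <= count:
            # linear interpolation inside the containing bucket
            frac = (rank - lo_count) / (count - lo_count)
            return lo_bound + (hi_bound - lo_bound) * frac
        lo_bound, lo_count = hi_bound, count

p95 = histogram_quantile(0.95, [(0.01, 140000), (0.1, 145000),
                                (1.0, 145823), (float("inf"), 145823)])
# p95 lands inside the first bucket, just under 10 ms
```

This is also why histogram accuracy depends on bucket layout: the estimate can never be more precise than the width of the bucket the quantile falls in.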

8. Alert Design

Good alerts:
  ✓ Actionable      → Someone needs to DO something
  ✓ Timely          → Fires early enough to prevent impact
  ✓ Relevant        → Affects users or SLO
  ✓ Prioritized     → Critical vs warning vs info

Bad alerts:
  ✗ Noisy           → Fires constantly, gets ignored
  ✗ Unactionable    → "CPU at 60%" — so what?
  ✗ Stale           → Threshold from 2018 on 2026 hardware
  ✗ Redundant       → 5 alerts for the same incident
# Alert hierarchy example:
# Symptom-based (preferred):
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% on {{ $labels.instance }}"

# Cause-based (secondary):
- alert: DatabaseConnectionPoolExhausted
  expr: db_connections_active / db_connections_max > 0.95
  for: 5m
  labels:
    severity: warning
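The `for:` clause is what separates a transient spike from a page: the expression must stay true continuously for the whole duration before the alert moves from pending to firing, and any false evaluation resets the clock. A simplified sketch of that lifecycle (real Prometheus evaluates at the rule group's interval):

```python
def alert_states(evals, for_s):
    """State after each evaluation: inactive, pending, or firing.

    evals: list of (timestamp_s, expr_is_true) in evaluation order."""
    states, pending_since = [], None
    for t, breached in evals:
        if not breached:
            pending_since = None          # any false sample resets the timer
            states.append("inactive")
        elif pending_since is None:
            pending_since = t             # first breach: start the clock
            states.append("pending")
        elif t - pending_since >= for_s:
            states.append("firing")       # held for the full duration
        else:
            states.append("pending")
    return states
```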

9. The Monitoring Stack Decision Tree

What are you monitoring?

├── Network devices (switches, routers, firewalls)
│   └── SNMP → Zabbix, LibreNMS, or Prometheus + snmp_exporter
├── Bare metal servers (no containers)
│   ├── Small fleet (<50 servers) → Zabbix or Nagios
│   └── Large fleet (>50 servers) → Prometheus + node_exporter
├── Kubernetes workloads
│   └── Prometheus (native integration) + Grafana
├── Cloud services (AWS, GCP, Azure)
│   └── CloudWatch/Stackdriver + Prometheus (via cloudwatch_exporter)
├── Application metrics
│   ├── Pull-based → Prometheus client library in your app
│   └── Push-based → StatsD → Prometheus via statsd_exporter
└── Logs (not metrics)
    └── ELK/EFK, Loki, or CloudWatch Logs (different layer)

10. Check Intervals and Thresholds

Check interval guidelines:
  5-15s   → Critical real-time services (payment processing)
  30-60s  → Standard production services
  5m      → Batch jobs, capacity metrics
  15-60m  → Compliance checks, certificate expiry

Threshold guidelines:
  Warning  → "We should look at this soon" (next business day)
  Critical → "Someone needs to act NOW" (page on-call)

  Disk:    Warning at 80%, Critical at 90%
  CPU:     Warning at 80% sustained 15m, Critical at 95% sustained 5m
  Memory:  Warning at 85%, Critical at 95%
  Latency: Warning at p95 > 500ms, Critical at p95 > 2s
  Errors:  Warning at 1% error rate, Critical at 5%

Common Pitfalls

  1. Monitoring the server but not the service — CPU is at 5% and memory is fine, but the application is returning 500 errors. Always monitor at the application layer, not just the infrastructure layer.
  2. Alert on every metric — You alert on CPU, memory, disk, swap, load, network, inode usage, and 40 other metrics. The on-call engineer gets 200 alerts for a single incident. Focus on symptoms (error rate, latency) over causes (CPU, memory).
  3. No baseline — You set a threshold of 80% CPU but this server normally runs at 75%. The alert fires constantly. Establish baselines before setting thresholds.
  4. Monitoring only in production — Your staging environment has no monitoring. You deploy a change that causes a memory leak. You find out in production.
  5. Community strings as security — SNMP v2c community strings are transmitted in cleartext. "public" and "private" are the defaults. If you use SNMP, use v3 with authentication.
  6. Check interval too aggressive — Checking every 5 seconds across 1,000 hosts generates 200 checks/second. Your monitoring server becomes the bottleneck. Match interval to criticality.
  7. No monitoring of the monitoring system — Your Nagios server crashes and nobody notices because Nagios was the thing that would have alerted you. Use an external check (Uptime Robot, Pingdom, or a separate lightweight monitor) to watch your monitoring.

Interview tip: The "Four Golden Signals" (Latency, Traffic, Errors, Saturation) come from Google's SRE book (2016). If an interviewer asks "what would you monitor for service X?" start with these four. They cover 90% of production issues and show you think in symptoms, not causes.


Wiki Navigation

Prerequisites

Next Steps