
Portal | Level: L1: Foundations | Topics: Monitoring Fundamentals, Prometheus, Grafana | Domain: Observability

Monitoring Fundamentals - Primer

Why This Matters

You cannot operate what you cannot see. Monitoring is not a nice-to-have bolted onto infrastructure after the fact. It is the nervous system that tells you if your services are alive, healthy, degraded, or on fire. Without monitoring, you find out about outages from your customers. With good monitoring, you find out before your customers do.

This is the L1 foundation — what monitoring is, how the major systems work, what metrics mean, and how to think about alerting. Whether you end up running Nagios on bare metal or Prometheus in Kubernetes, these fundamentals apply everywhere. I started with Nagios on 1,500 bare-metal servers and eventually migrated to Prometheus. The tools changed; the principles did not.

Core Concepts

1. What Monitoring Does

Monitoring answers four questions:
1. Is the service UP?            Availability (health checks, uptime)
2. Is the service FAST?          Latency (response time, percentiles)
3. Is the service CORRECT?       Error rate (5xx, exceptions, failures)
4. Is the service SATURATED?     Utilization (CPU, memory, disk, connections)

These map to the Four Golden Signals (Google SRE):
  Latency     How long requests take
  Traffic     How many requests per second
  Errors      How many requests fail
  Saturation  How full your resources are
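The four signals fall straight out of raw request data. A minimal sketch in Python (the `Request` record, the one-second window, and the connection-pool numbers are all illustrative, not from any real system):

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_s: float  # how long the request took
    status: int        # HTTP status code

def golden_signals(requests, window_s, pool_used, pool_max):
    """Summarize one window of traffic as the four golden signals.
    Latency here is a mean over successful requests; real systems
    track percentiles instead."""
    ok = [r for r in requests if r.status < 500]
    return {
        "traffic": len(requests) / window_s,                  # requests/second
        "errors": (len(requests) - len(ok)) / len(requests),  # failure ratio
        "latency": sum(r.duration_s for r in ok) / len(ok),   # mean seconds
        "saturation": pool_used / pool_max,                   # e.g. connection pool
    }
```

Feed it three requests where one returned a 500 and you get traffic of 3 req/s, a 33% error ratio, and whatever the pool utilization happens to be.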

2. Monitoring Architecture Patterns

Pattern 1: Agent-Based Push (Nagios, Zabbix)
┌─────────┐      push      ┌────────────┐
│ Agent   │───────────────>│ Central    │
│ (NRPE/  │  check result  │ Server     │
│  Zabbix │                │ (Nagios/   │
│  agent) │                │  Zabbix)   │
└─────────┘                └────────────┘

Pattern 2: Pull/Scrape (Prometheus)
┌────────────┐    scrape     ┌────────────┐
│ Exporter   │<──────────────│ Prometheus │
│ (exposes   │   HTTP GET    │ Server     │
│  /metrics) │               │ (pulls     │
└────────────┘               │  metrics)  │
                             └────────────┘

Pattern 3: SNMP Polling (Network devices)
┌──────────┐    SNMP GET    ┌────────────┐
│ Network  │<───────────────│ Monitoring │
│ Device   │   OID values   │ Server     │
│ (switch/ │                │            │
│  router) │                └────────────┘
└──────────┘

3. SNMP (Simple Network Management Protocol)

Name origin: Despite "Simple" in the name, SNMP is notoriously complex. The joke in the industry is that SNMP stands for "Security Not My Problem" (v1/v2c transmit community strings in cleartext) or "Still Not Managing Properly." It was defined in RFC 1157 (1990) and the "Simple" referred to its design being simpler than the competing OSI network management protocol CMIP.

SNMP is the standard for monitoring network devices. Every managed switch, router, firewall, and UPS speaks SNMP.

SNMP Components:
  Manager  → The monitoring server that queries devices
  Agent    → Software on the device that responds to queries
  MIB      → Management Information Base (defines what OIDs mean)
  OID      → Object Identifier (address of a specific metric)
  Community String → Password for SNMP v1/v2c (cleartext!)

Common OIDs:
  .1.3.6.1.2.1.1.1.0        → sysDescr (device description)
  .1.3.6.1.2.1.1.3.0        → sysUpTime
  .1.3.6.1.2.1.2.2.1.10     → ifInOctets (interface input bytes)
  .1.3.6.1.2.1.2.2.1.16     → ifOutOctets (interface output bytes)
  .1.3.6.1.2.1.25.3.3.1.2   → hrProcessorLoad (CPU usage)
# SNMP queries from the command line (net-snmp tools)
snmpwalk -v2c -c public switch01 .1.3.6.1.2.1.1
snmpget -v2c -c public switch01 .1.3.6.1.2.1.1.3.0
# Name-based queries need the corresponding MIB files installed locally:
snmpwalk -v2c -c public switch01 ifDescr
snmpwalk -v2c -c public switch01 ifInOctets

SNMP v3 adds authentication and encryption. Use v3 in production — v2c community strings are transmitted in cleartext.
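Note that ifInOctets/ifOutOctets are cumulative counters, so a single reading tells you nothing; you poll twice and divide the delta by the interval. A sketch of that arithmetic (classic 32-bit counters wrap at 2^32; the sample values are made up):

```python
def counter_rate(prev, curr, interval_s, max_val=2**32):
    """Per-second rate from two samples of a monotonic counter such as
    ifInOctets. Tolerates one wrap of a 32-bit counter; on fast links
    use the 64-bit ifHCInOctets counters instead, where wraps are rare."""
    delta = curr - prev
    if delta < 0:          # counter wrapped past max_val between polls
        delta += max_val
    return delta / interval_s

# 10,000,000 octets in 60 s is about 166.7 kB/s, i.e. ~1.33 Mbit/s
bits_per_second = counter_rate(1_000_000, 11_000_000, 60) * 8
```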

4. Nagios Architecture

Who made it: Nagios was created by Ethan Galstad in 1999 under the name "NetSaint." It was renamed to "Nagios" in 2002 due to a trademark dispute. The name is a recursive acronym: Nagios Ain't Gonna Insist On Sainthood — a nod to its NetSaint origins.

Nagios is the grandfather of infrastructure monitoring. It has been around (as NetSaint) since 1999 and is still running in thousands of environments.

┌──────────────────────────────────────────────────┐
│                  Nagios Server                   │
│  ┌───────────┐  ┌───────────┐  ┌──────────────┐  │
│  │ Scheduler │  │ Check     │  │ Notification │  │
│  │ (runs     │  │ Engine    │  │ Engine       │  │
│  │  checks   │  │ (active/  │  │ (email/      │  │
│  │  on       │  │  passive) │  │  pager/      │  │
│  │  schedule)│  │           │  │  webhook)    │  │
│  └───────────┘  └───────────┘  └──────────────┘  │
│                       │                          │
│       ┌───────────────┼───────────────┐          │
│       │               │               │          │
│  ┌────▼───┐      ┌────▼───┐      ┌────▼───┐      │
│  │ check  │      │ check  │      │ check  │      │  (plugins)
│  │ _disk  │      │ _http  │      │ _load  │      │
│  └────────┘      └────────┘      └────────┘      │
└──────────────────────────────────────────────────┘
        │                               │
   Local check      Remote check via NRPE
                                        │
                                ┌───────▼──────┐
                                │ Remote Host  │
                                │ ┌──────────┐ │
                                │ │ NRPE     │ │
                                │ │ daemon   │ │
                                │ └──────────┘ │
                                └──────────────┘
Nagios check states:
  0 = OK       (green)
  1 = WARNING  (yellow)
  2 = CRITICAL (red)
  3 = UNKNOWN  (orange)

Check types:
  Active  → Nagios initiates the check on schedule
  Passive → External process sends results to Nagios
# Run a Nagios plugin manually
/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
# DISK OK - free space: / 45123 MB (78% inode=93%);| /=12456MB;41234;46388;0;51543

/usr/lib/nagios/plugins/check_http -H app.example.com -u /health -t 10
# HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.045 second response time

/usr/lib/nagios/plugins/check_load -w 4,3,2 -c 8,6,4
# OK - load average: 0.52, 0.48, 0.39|load1=0.520;4.000;8.000;0; ...
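The plugin contract is nothing more than an exit code plus one line of text (optionally with |perfdata after a pipe). A hypothetical custom plugin sketched in Python, using illustrative 80/90 thresholds:

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios state exit codes

def check_value(label, value, warn, crit, unit=""):
    """Map a measurement onto the Nagios state model and print output
    in plugin format: STATE - human text|label=value;warn;crit"""
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    name = ["OK", "WARNING", "CRITICAL"][state]
    print(f"{name} - {label} is {value}{unit}|{label}={value}{unit};{warn};{crit}")
    return state

# In a real plugin you would end with:
#   sys.exit(check_value("disk_used_pct", 84, warn=80, crit=90, unit="%"))
```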

5. Zabbix Architecture

Zabbix adds a web interface, database backend, and more sophisticated data collection than Nagios.

┌────────────────────────────────────────┐
│             Zabbix Server              │
│  ┌────────┐  ┌─────────┐  ┌─────────┐  │
│  │ Poller │  │ Trapper │  │ Alerter │  │
│  └────────┘  └─────────┘  └─────────┘  │
│       │                                │
│  ┌────▼─────────────────────────────┐  │
│  │             Database             │  │
│  │  (PostgreSQL/MySQL/TimescaleDB)  │  │
│  └──────────────────────────────────┘  │
└────────────────────────────────────────┘
        │              │
   Zabbix Agent    Zabbix Proxy
   (on each host)  (for remote sites)
        │              │
   ┌────▼────┐    ┌────▼───────────┐
   │ Target  │    │ Remote Site    │
   │ Host    │    │ ┌───────────┐  │
   └─────────┘    │ │ Zabbix    │  │
                  │ │ Agent     │  │
                  │ └───────────┘  │
                  └────────────────┘
Zabbix concepts:
  Host        A device or server being monitored
  Item        A specific metric being collected (CPU idle, disk free)
  Trigger     A condition that evaluates to PROBLEM or OK
  Template    A reusable set of items, triggers, and graphs
  Discovery   Automatic detection of new items (network interfaces, filesystems)
  Proxy       Collects data on behalf of the server (for remote sites)
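A trigger is just a condition evaluated against the values an item has collected. A rough Python sketch of what an expression like avg(/host/system.cpu.util,5m)>90 boils down to (illustrative only; Zabbix evaluates triggers server-side against its database):

```python
def trigger_fires(samples, window_s, threshold):
    """PROBLEM when the average of an item's values over the last
    window_s seconds exceeds the threshold.

    samples: list of (timestamp_s, value) pairs, newest last."""
    now = samples[-1][0]
    recent = [v for t, v in samples if now - t <= window_s]
    return sum(recent) / len(recent) > threshold
```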

6. Prometheus Pull Model

Prometheus scrapes HTTP endpoints on a schedule.
Each target exposes metrics at /metrics.

Example /metrics output:
  # HELP http_requests_total Total number of HTTP requests
  # TYPE http_requests_total counter
  http_requests_total{method="GET",handler="/health",status="200"} 145823
  http_requests_total{method="POST",handler="/api/v1/users",status="201"} 3421
  http_requests_total{method="GET",handler="/api/v1/users",status="500"} 17

  # HELP http_request_duration_seconds Request latency in seconds
  # TYPE http_request_duration_seconds histogram
  http_request_duration_seconds_bucket{handler="/health",le="0.01"} 140000
  http_request_duration_seconds_bucket{handler="/health",le="0.1"} 145000
  http_request_duration_seconds_bucket{handler="/health",le="1"} 145823
  http_request_duration_seconds_bucket{handler="/health",le="+Inf"} 145823
  http_request_duration_seconds_sum{handler="/health"} 872.45
  http_request_duration_seconds_count{handler="/health"} 145823
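In a real application you would use an official client library (prometheus_client in Python, for example), but the exposition format is plain enough to emit by hand. A stdlib-only sketch serving a single hypothetical counter:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"health_200": 0}  # naive in-process counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            REQUESTS["health_200"] += 1
            body = b"ok"
        elif self.path == "/metrics":
            # Exposition format: HELP/TYPE comments, then one sample per line
            body = (
                "# HELP http_requests_total Total number of HTTP requests\n"
                "# TYPE http_requests_total counter\n"
                f'http_requests_total{{method="GET",handler="/health",'
                f'status="200"}} {REQUESTS["health_200"]}\n'
            ).encode()
        else:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

# To serve it: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Point a scrape job at whatever port you bind and the counter shows up like any other target.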

7. Metric Types

Type       Description                       Example                        Use
Counter    Only goes up (resets on restart)  http_requests_total            Request counts, errors, bytes sent
Gauge      Goes up and down                  node_memory_MemFree_bytes      Temperature, queue depth, CPU usage
Histogram  Counts observations in buckets    http_request_duration_seconds  Latency percentiles (p50, p95, p99)
Summary    Pre-calculated quantiles          go_gc_duration_seconds         Client-side percentiles (less flexible)

Remember: Mnemonic for Prometheus metric types: "Counters Climb, Gauges Go-anywhere, Histograms Have-buckets." Counters only go up (use rate() to make them useful). Gauges go up and down (use directly). Histograms count observations into buckets (use histogram_quantile() for percentiles).

Counter: Use rate() to get per-second values
  rate(http_requests_total[5m])         requests per second over 5 min
  increase(http_requests_total[1h])     total requests in the last hour

Gauge: Use directly or track changes
  node_memory_MemFree_bytes             current free memory
  delta(node_memory_MemFree_bytes[1h])  memory change in last hour

Histogram: Use histogram_quantile()
  histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
   95th percentile latency over 5 minutes
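Under the hood, histogram_quantile() finds the bucket containing the target rank and interpolates linearly inside it. A simplified Python version (it skips the rate() step and PromQL's special handling of the +Inf bucket), run against the /health buckets from the scrape example in section 6:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs, ending with
    float('inf') for the +Inf bucket."""
    rank = q * buckets[-1][1]        # target observation number
    lo_bound, lo_count = 0.0, 0
    for hi_bound, count in buckets:
        if rank <= count:
            # linear interpolation inside the containing bucket
            frac = (rank - lo_count) / (count - lo_count)
            return lo_bound + (hi_bound - lo_bound) * frac
        lo_bound, lo_count = hi_bound, count

p95 = histogram_quantile(0.95, [(0.01, 140000), (0.1, 145000),
                                (1.0, 145823), (float("inf"), 145823)])
# p95 lands inside the first bucket, just under 10 ms
```

This is also why histogram accuracy depends on bucket layout: the estimate can never be more precise than the width of the bucket the quantile falls in.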

8. Alert Design

Good alerts:
  ✓ Actionable      → Someone needs to DO something
  ✓ Timely          → Fires early enough to prevent impact
  ✓ Relevant        → Affects users or SLO
  ✓ Prioritized     → Critical vs warning vs info

Bad alerts:
  ✗ Noisy           → Fires constantly, gets ignored
  ✗ Unactionable    → "CPU at 60%" — so what?
  ✗ Stale           → Threshold from 2018 on 2026 hardware
  ✗ Redundant       → 5 alerts for the same incident
# Alert hierarchy example:
# Symptom-based (preferred):
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% on {{ $labels.instance }}"

# Cause-based (secondary):
- alert: DatabaseConnectionPoolExhausted
  expr: db_connections_active / db_connections_max > 0.95
  for: 5m
  labels:
    severity: warning
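The `for:` clause is what separates a transient spike from a page: the expression must stay true continuously for the whole duration before the alert moves from pending to firing, and any false evaluation resets the clock. A simplified sketch of that lifecycle (real Prometheus evaluates at the rule group's interval):

```python
def alert_states(evals, for_s):
    """State after each evaluation: inactive, pending, or firing.

    evals: list of (timestamp_s, expr_is_true) in evaluation order."""
    states, pending_since = [], None
    for t, breached in evals:
        if not breached:
            pending_since = None          # any false sample resets the timer
            states.append("inactive")
        elif pending_since is None:
            pending_since = t             # first breach: start the clock
            states.append("pending")
        elif t - pending_since >= for_s:
            states.append("firing")       # held for the full duration
        else:
            states.append("pending")
    return states
```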

9. The Monitoring Stack Decision Tree

What are you monitoring?

├── Network devices (switches, routers, firewalls)
│   └── SNMP → Zabbix, LibreNMS, or Prometheus + snmp_exporter
├── Bare metal servers (no containers)
│   ├── Small fleet (<50 servers) → Zabbix or Nagios
│   └── Large fleet (>50 servers) → Prometheus + node_exporter
├── Kubernetes workloads
│   └── Prometheus (native integration) + Grafana
├── Cloud services (AWS, GCP, Azure)
│   └── CloudWatch/Stackdriver + Prometheus (via cloudwatch_exporter)
├── Application metrics
│   ├── Pull-based → Prometheus client library in your app
│   └── Push-based → StatsD → Prometheus via statsd_exporter
└── Logs (not metrics)
    └── ELK/EFK, Loki, or CloudWatch Logs (different layer)

10. Check Intervals and Thresholds

Check interval guidelines:
  5-15s   → Critical real-time services (payment processing)
  30-60s  → Standard production services
  5m      → Batch jobs, capacity metrics
  15-60m  → Compliance checks, certificate expiry

Threshold guidelines:
  Warning  → "We should look at this soon" (next business day)
  Critical → "Someone needs to act NOW" (page on-call)

  Disk:    Warning at 80%, Critical at 90%
  CPU:     Warning at 80% sustained 15m, Critical at 95% sustained 5m
  Memory:  Warning at 85%, Critical at 95%
  Latency: Warning at p95 > 500ms, Critical at p95 > 2s
  Errors:  Warning at 1% error rate, Critical at 5%

Common Pitfalls

  1. Monitoring the server but not the service — CPU is at 5% and memory is fine, but the application is returning 500 errors. Always monitor at the application layer, not just the infrastructure layer.
  2. Alert on every metric — You alert on CPU, memory, disk, swap, load, network, inode usage, and 40 other metrics. The on-call engineer gets 200 alerts for a single incident. Focus on symptoms (error rate, latency) over causes (CPU, memory).
  3. No baseline — You set a threshold of 80% CPU but this server normally runs at 75%. The alert fires constantly. Establish baselines before setting thresholds.
  4. Monitoring only in production — Your staging environment has no monitoring. You deploy a change that causes a memory leak. You find out in production.
  5. Community strings as security — SNMP v2c community strings are transmitted in cleartext. "public" and "private" are the defaults. If you use SNMP, use v3 with authentication.
  6. Check interval too aggressive — Checking every 5 seconds across 1,000 hosts generates 200 checks/second. Your monitoring server becomes the bottleneck. Match interval to criticality.
  7. No monitoring of the monitoring system — Your Nagios server crashes and nobody notices because Nagios was the thing that would have alerted you. Use an external check (Uptime Robot, Pingdom, or a separate lightweight monitor) to watch your monitoring.

Interview tip: The "Four Golden Signals" (Latency, Traffic, Errors, Saturation) come from Google's SRE book (2016). If an interviewer asks "what would you monitor for service X?" start with these four. They cover 90% of production issues and show you think in symptoms, not causes.


Wiki Navigation

Prerequisites

Next Steps