Portal | Level: L1: Foundations | Topics: Monitoring Fundamentals, Prometheus, Grafana | Domain: Observability
Monitoring Fundamentals - Primer¶
Why This Matters¶
You cannot operate what you cannot see. Monitoring is not a nice-to-have bolted onto infrastructure after the fact. It is the nervous system that tells you if your services are alive, healthy, degraded, or on fire. Without monitoring, you find out about outages from your customers. With good monitoring, you find out before your customers do.
This is the L1 foundation — what monitoring is, how the major systems work, what metrics mean, and how to think about alerting. Whether you end up running Nagios on bare metal or Prometheus in Kubernetes, these fundamentals apply everywhere. I started with Nagios on 1,500 bare-metal servers and eventually migrated to Prometheus. The tools changed; the principles did not.
Core Concepts¶
1. What Monitoring Does¶
Monitoring answers four questions:
1. Is the service UP? → Availability (health checks, uptime)
2. Is the service FAST? → Latency (response time, percentiles)
3. Is the service CORRECT? → Error rate (5xx, exceptions, failures)
4. Is the service SATURATED? → Utilization (CPU, memory, disk, connections)
These map to the Four Golden Signals (Google SRE):
Latency → How long requests take
Traffic → How many requests per second
Errors → How many requests fail
Saturation → How full your resources are
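As a concrete illustration (mine, not from the SRE book), three of the four signals can be computed from a window of request records; saturation comes from resource metrics instead. A minimal Python sketch with made-up sample data:

```python
# Hypothetical sample: (latency_seconds, http_status) per request in a 60s window
requests = [(0.020, 200), (0.035, 200), (1.200, 500), (0.050, 201), (0.400, 200)]
window_seconds = 60

latencies = sorted(lat for lat, _ in requests)
p95 = latencies[int(0.95 * (len(latencies) - 1))]   # Latency: crude nearest-rank p95
traffic = len(requests) / window_seconds            # Traffic: requests per second
errors = sum(1 for _, s in requests if s >= 500) / len(requests)  # Error ratio
# Saturation (CPU, memory, connections) comes from resource metrics, not requests
```

In a real system you would never compute these by hand per request; this is what a monitoring system does for you continuously.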
2. Monitoring Architecture Patterns¶
Pattern 1: Agent-Based Push (Nagios, Zabbix)
┌────────┐     push      ┌──────────────┐
│ Agent  │──────────────>│ Central      │
│ (NRPE/ │  check result │ Server       │
│ Zabbix │               │ (Nagios/     │
│ agent) │               │ Zabbix)      │
└────────┘               └──────────────┘
Pattern 2: Pull/Scrape (Prometheus)
┌──────────┐   scrape    ┌──────────────┐
│ Exporter │<────────────│ Prometheus   │
│ (exposes │  HTTP GET   │ Server       │
│ /metrics)│             │ (pulls       │
└──────────┘             │ metrics)     │
                         └──────────────┘
Pattern 3: SNMP Polling (Network devices)
┌─────────┐   SNMP GET   ┌──────────────┐
│ Network │<─────────────│ Monitoring   │
│ Device  │  OID values  │ Server       │
│ (switch/│              │              │
│ router) │              └──────────────┘
└─────────┘
3. SNMP (Simple Network Management Protocol)¶
Name origin: Despite "Simple" in the name, SNMP is notoriously complex. The joke in the industry is that SNMP stands for "Security Not My Problem" (v1/v2c transmit community strings in cleartext) or "Still Not Managing Properly." It was defined in RFC 1157 (1990) and the "Simple" referred to its design being simpler than the competing OSI network management protocol CMIP.
SNMP is the standard for monitoring network devices. Every managed switch, router, firewall, and UPS speaks SNMP.
SNMP Components:
Manager → The monitoring server that queries devices
Agent → Software on the device that responds to queries
MIB → Management Information Base (defines what OIDs mean)
OID → Object Identifier (address of a specific metric)
Community String → Password for SNMP v1/v2c (cleartext!)
Common OIDs:
.1.3.6.1.2.1.1.1.0 → sysDescr (device description)
.1.3.6.1.2.1.1.3.0 → sysUpTime
.1.3.6.1.2.1.2.2.1.10 → ifInOctets (interface input bytes)
.1.3.6.1.2.1.2.2.1.16 → ifOutOctets (interface output bytes)
.1.3.6.1.2.1.25.3.3.1.2 → hrProcessorLoad (CPU usage)
# SNMP queries from command line
snmpwalk -v2c -c public switch01 .1.3.6.1.2.1.1
snmpget -v2c -c public switch01 .1.3.6.1.2.1.1.3.0
snmpwalk -v2c -c public switch01 ifDescr
snmpwalk -v2c -c public switch01 ifInOctets
SNMP v3 adds authentication and encryption. Use v3 in production — v2c community strings are transmitted in cleartext.
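The ifInOctets/ifOutOctets values returned above are cumulative Counter32 values, so bandwidth comes from the difference between two polls divided by the poll interval, accounting for wrap at 2^32. A minimal sketch (the sample values are made up):

```python
COUNTER32_MAX = 2**32  # SNMP Counter32 wraps back to 0 past this value

def octets_per_second(prev, curr, interval_s):
    """Per-second rate from two Counter32 samples, handling a single wrap."""
    delta = curr - prev
    if delta < 0:                  # counter wrapped past 2^32 between polls
        delta += COUNTER32_MAX
    return delta / interval_s

# Two polls 60s apart; the second sample wrapped past 2^32
rate_bps = octets_per_second(4294967000, 704, 60) * 8  # octets -> bits
```

On fast links a Counter32 can wrap more than once between polls, which this cannot detect; that is why 64-bit counters (ifHCInOctets) exist and should be preferred on gigabit and faster interfaces.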
4. Nagios Architecture¶
Who made it: Nagios was created by Ethan Galstad in 1999 under the name "NetSaint." It was renamed to "Nagios" in 2002 due to a trademark dispute. The name is a recursive acronym: Nagios Ain't Gonna Insist On Sainthood -- a nod to its NetSaint origins.
Nagios is the grandfather of infrastructure monitoring. It has been around since 1999 (first as NetSaint) and is still running in thousands of environments.
┌────────────────────────────────────────────────┐
│                 Nagios Server                  │
│ ┌───────────┐ ┌───────────┐ ┌──────────────┐  │
│ │ Scheduler │ │ Check     │ │ Notification │  │
│ │ (runs     │ │ Engine    │ │ Engine       │  │
│ │ checks    │ │ (active/  │ │ (email/      │  │
│ │ on        │ │ passive)  │ │ pager/       │  │
│ │ schedule) │ │           │ │ webhook)     │  │
│ └───────────┘ └───────────┘ └──────────────┘  │
│                     │                          │
│          ┌──────────┼──────────┐               │
│          │          │          │               │
│      ┌───▼──┐   ┌───▼──┐   ┌───▼──┐            │
│      │check │   │check │   │check │ (plugins)  │
│      │_disk │   │_http │   │_load │            │
│      └──────┘   └──────┘   └──────┘            │
└────────────────────────────────────────────────┘
         │                  │
    Local check     Remote check via NRPE
                            │
                    ┌───────▼──────┐
                    │ Remote Host  │
                    │ ┌──────────┐ │
                    │ │ NRPE     │ │
                    │ │ daemon   │ │
                    │ └──────────┘ │
                    └──────────────┘
Nagios check states:
0 = OK (green)
1 = WARNING (yellow)
2 = CRITICAL (red)
3 = UNKNOWN (orange)
Check types:
Active → Nagios initiates the check on schedule
Passive → External process sends results to Nagios
# Run a Nagios plugin manually
/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
# DISK OK - free space: / 45123 MB (78% inode=93%);| /=12456MB;41234;46388;0;51543
/usr/lib/nagios/plugins/check_http -H app.example.com -u /health -t 10
# HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.045 second response time
/usr/lib/nagios/plugins/check_load -w 4,3,2 -c 8,6,4
# OK - load average: 0.52, 0.48, 0.39|load1=0.520;4.000;8.000;0; ...
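Any executable that prints one line of status and exits with the state codes above is a valid Nagios plugin. A minimal sketch of a custom check (the queue-depth metric and thresholds are hypothetical; the output format follows the plugin convention of status text plus perfdata after a pipe):

```python
def check_queue_depth(depth, warn=100, crit=500):
    """Return (exit_code, output) following the Nagios plugin convention."""
    perfdata = f"depth={depth};{warn};{crit};0;"   # value;warn;crit;min;max
    if depth >= crit:
        return 2, f"QUEUE CRITICAL - depth {depth}|{perfdata}"
    if depth >= warn:
        return 1, f"QUEUE WARNING - depth {depth}|{perfdata}"
    return 0, f"QUEUE OK - depth {depth}|{perfdata}"

code, output = check_queue_depth(depth=42)
print(output)   # Nagios reads the first line of stdout; the exit code sets the state
# a real plugin would finish with: sys.exit(code)
```

The exit code (0/1/2/3) is what drives the OK/WARNING/CRITICAL/UNKNOWN state; the text is for the human reading the alert.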
5. Zabbix Architecture¶
Zabbix adds a web interface, database backend, and more sophisticated data collection than Nagios.
┌──────────────────────────────────────────┐
│              Zabbix Server               │
│ ┌────────┐ ┌────────┐ ┌───────────┐     │
│ │ Poller │ │ Trapper│ │ Alerter   │     │
│ └────────┘ └────────┘ └───────────┘     │
│      │                                   │
│ ┌────▼─────────────────────────────────┐ │
│ │              Database                │ │
│ │  (PostgreSQL / MySQL / TimescaleDB)  │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────┘
      │                  │
 Zabbix Agent       Zabbix Proxy
(on each host)    (for remote sites)
      │                  │
 ┌────▼────┐      ┌──────▼─────────┐
 │ Target  │      │ Remote Site    │
 │ Host    │      │ ┌───────────┐  │
 └─────────┘      │ │ Zabbix    │  │
                  │ │ Agent     │  │
                  │ └───────────┘  │
                  └────────────────┘
Zabbix concepts:
Host → A device or server being monitored
Item → A specific metric being collected (CPU idle, disk free)
Trigger → A condition that evaluates to PROBLEM or OK
Template → A reusable set of items, triggers, and graphs
Discovery → Automatic detection of new items (network interfaces, filesystems)
Proxy → Collects data on behalf of the server (for remote sites)
6. Prometheus Pull Model¶
Prometheus scrapes HTTP endpoints on a schedule.
Each target exposes metrics at /metrics.
Example /metrics output:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/health",status="200"} 145823
http_requests_total{method="POST",handler="/api/v1/users",status="201"} 3421
http_requests_total{method="GET",handler="/api/v1/users",status="500"} 17
# HELP http_request_duration_seconds Request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="/health",le="0.01"} 140000
http_request_duration_seconds_bucket{handler="/health",le="0.1"} 145000
http_request_duration_seconds_bucket{handler="/health",le="1"} 145823
http_request_duration_seconds_bucket{handler="/health",le="+Inf"} 145823
http_request_duration_seconds_sum{handler="/health"} 872.45
http_request_duration_seconds_count{handler="/health"} 145823
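The exposition format is plain text, which is part of why the pull model is easy to debug: you can curl the endpoint and read it. A rough parser for simple lines like the ones above (a sketch only — it ignores escaping inside label values and other corners of the real format):

```python
import re

# name, optional {labels}, then a value (number, +/-Inf, or NaN)
METRIC_LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([\d.+eE-]+|[+-]Inf|NaN)$')

def parse_metrics(text):
    """Parse simple Prometheus exposition lines into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):    # skip HELP/TYPE comment lines
            continue
        m = METRIC_LINE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

sample = '''# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 145823'''
parsed = parse_metrics(sample)
```

In practice you would use an official client library rather than parse this yourself; the point is that the wire format is simple enough that you could.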
7. Metric Types¶
| Type | Description | Example | Use |
|---|---|---|---|
| Counter | Only goes up (resets on restart) | http_requests_total | Request counts, errors, bytes sent |
| Gauge | Goes up and down | node_memory_MemFree_bytes | Temperature, queue depth, CPU usage |
| Histogram | Counts observations in buckets | http_request_duration_seconds | Latency percentiles (p50, p95, p99) |
| Summary | Pre-calculated quantiles | go_gc_duration_seconds | Client-side percentiles (less flexible) |
Remember: Mnemonic for Prometheus metric types: "Counters Climb, Gauges Go-anywhere, Histograms Have-buckets." Counters only go up (use rate() to make them useful). Gauges go up and down (use directly). Histograms count observations into buckets (use histogram_quantile() for percentiles).
Counter: Use rate() to get per-second values
rate(http_requests_total[5m]) → requests per second over 5 min
increase(http_requests_total[1h]) → total requests in the last hour
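The reason you reach for rate() instead of subtracting raw samples is counter resets: when a process restarts, the counter drops to zero and a naive delta goes negative. A simplified sketch of the reset handling rate() performs (the real function also extrapolates to the window boundaries):

```python
def simple_rate(samples, window_s):
    """Per-second rate over counter samples, treating any drop as a reset."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # on a reset the counter restarted from 0, so curr IS the increase
        increase += curr - prev if curr >= prev else curr
    return increase / window_s

# counter restarted between the 900 and 50 samples
r = simple_rate([100, 500, 900, 50, 300], window_s=300)
```

A naive (last - first) / window here would give (300 - 100) / 300 and badly undercount; reset-aware accumulation counts all 1100 requests.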
Gauge: Use directly or track changes
node_memory_MemFree_bytes → current free memory
delta(node_memory_MemFree_bytes[1h]) → memory change in last hour
Histogram: Use histogram_quantile()
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
→ 95th percentile latency over 5 minutes
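histogram_quantile() finds the bucket where the target rank falls and linearly interpolates inside it. A simplified reimplementation (it skips some Prometheus edge cases), run against the /health buckets shown earlier:

```python
import math

def hist_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count). Linear interpolation."""
    total = buckets[-1][1]
    rank = q * total
    lower, below = 0.0, 0        # lowest bucket is assumed to start at 0
    for ub, cum in buckets:
        if cum >= rank:
            if math.isinf(ub):   # rank lands in +Inf bucket: no upper bound
                return lower
            return lower + (ub - lower) * (rank - below) / (cum - below)
        lower, below = ub, cum

# Buckets from the /health example above
p95 = hist_quantile(0.95, [(0.01, 140000), (0.1, 145000), (1, 145823),
                           (math.inf, 145823)])
```

For this data the p95 lands inside the first bucket (140000 of 145823 requests finished under 10ms), so the answer is a little under 0.01s. The interpolation is also why bucket boundaries matter: the estimate can never be more precise than the bucket the rank falls in.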
8. Alert Design¶
Good alerts:
✓ Actionable → Someone needs to DO something
✓ Timely → Fires early enough to prevent impact
✓ Relevant → Affects users or SLO
✓ Prioritized → Critical vs warning vs info
Bad alerts:
✗ Noisy → Fires constantly, gets ignored
✗ Unactionable → "CPU at 60%" — so what?
✗ Stale → Threshold from 2018 on 2026 hardware
✗ Redundant → 5 alerts for the same incident
# Alert hierarchy example:
# Symptom-based (preferred):
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% on {{ $labels.instance }}"

# Cause-based (secondary):
- alert: DatabaseConnectionPoolExhausted
  expr: db_connections_active / db_connections_max > 0.95
  for: 5m
  labels:
    severity: warning
9. The Monitoring Stack Decision Tree¶
What are you monitoring?
├── Network devices (switches, routers, firewalls)
│ └── SNMP → Zabbix, LibreNMS, or Prometheus + snmp_exporter
│
├── Bare metal servers (no containers)
│ ├── Small fleet (<50 servers) → Zabbix or Nagios
│ └── Large fleet (>50 servers) → Prometheus + node_exporter
│
├── Kubernetes workloads
│ └── Prometheus (native integration) + Grafana
│
├── Cloud services (AWS, GCP, Azure)
│ └── CloudWatch/Stackdriver + Prometheus (via cloudwatch_exporter)
│
├── Application metrics
│ ├── Pull-based → Prometheus client library in your app
│ └── Push-based → StatsD → Prometheus via statsd_exporter
│
└── Logs (not metrics)
└── ELK/EFK, Loki, or CloudWatch Logs (different layer)
10. Check Intervals and Thresholds¶
Check interval guidelines:
5-15s → Critical real-time services (payment processing)
30-60s → Standard production services
5m → Batch jobs, capacity metrics
15-60m → Compliance checks, certificate expiry
Threshold guidelines:
Warning → "We should look at this soon" (next business day)
Critical → "Someone needs to act NOW" (page on-call)
Disk: Warning at 80%, Critical at 90%
CPU: Warning at 80% sustained 15m, Critical at 95% sustained 5m
Memory: Warning at 85%, Critical at 95%
Latency: Warning at p95 > 500ms, Critical at p95 > 2s
Errors: Warning at 1% error rate, Critical at 5%
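The warning/critical split above maps directly onto a small evaluation function, which is essentially what every monitoring system runs per check. A sketch using the disk thresholds from this section:

```python
def severity(value, warn, crit):
    """Classify a utilization percentage against warning/critical thresholds."""
    if value >= crit:
        return "critical"   # someone needs to act NOW: page on-call
    if value >= warn:
        return "warning"    # look at it soon: next business day
    return "ok"

# Disk thresholds from above: warning at 80%, critical at 90%
state = severity(85, warn=80, crit=90)
```

The hard part is never this function; it is choosing warn/crit values that match the baseline of the system being checked.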
Common Pitfalls¶
- Monitoring the server but not the service — CPU is at 5% and memory is fine, but the application is returning 500 errors. Always monitor at the application layer, not just the infrastructure layer.
- Alert on every metric — You alert on CPU, memory, disk, swap, load, network, inode usage, and 40 other metrics. The on-call engineer gets 200 alerts for a single incident. Focus on symptoms (error rate, latency) over causes (CPU, memory).
- No baseline — You set a threshold of 80% CPU but this server normally runs at 75%. The alert fires constantly. Establish baselines before setting thresholds.
- Monitoring only in production — Your staging environment has no monitoring. You deploy a change that causes a memory leak. You find out in production.
- Community strings as security — SNMP v2c community strings are transmitted in cleartext. "public" and "private" are the defaults. If you use SNMP, use v3 with authentication.
- Check interval too aggressive — Checking every 5 seconds across 1,000 hosts generates 200 checks/second. Your monitoring server becomes the bottleneck. Match interval to criticality.
- No monitoring of the monitoring system — Your Nagios server crashes and nobody notices because Nagios was the thing that would have alerted you. Use an external check (Uptime Robot, Pingdom, or a separate lightweight monitor) to watch your monitoring.
Interview tip: The "Four Golden Signals" (Latency, Traffic, Errors, Saturation) come from Google's SRE book (2016). If an interviewer asks "what would you monitor for service X?" start with these four. They cover 90% of production issues and show you think in symptoms, not causes.
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Next Steps¶
- Monitoring Migration (Legacy to Modern) (Topic Pack, L2)
Related Content¶
- Monitoring Migration (Legacy to Modern) (Topic Pack, L2) — Grafana, Monitoring Fundamentals, Prometheus
- Lab: Prometheus Target Down (CLI) (Lab, L2) — Grafana, Prometheus
- Observability Architecture (Reference, L2) — Grafana, Prometheus
- Observability Deep Dive (Topic Pack, L2) — Grafana, Prometheus
- Skillcheck: Observability (Assessment, L2) — Grafana, Prometheus
- Track: Observability (Reference, L2) — Grafana, Prometheus
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning (Topic Pack, L2) — Prometheus
Pages that link here¶
- Alerting Rules Drills
- Anti-Primer: Monitoring Fundamentals
- Capacity Planning
- Certification Prep: CKA — Certified Kubernetes Administrator
- Certification Prep: CKAD — Certified Kubernetes Application Developer
- Certification Prep: CKS — Certified Kubernetes Security Specialist
- Certification Prep: PCA — Prometheus Certified Associate
- Comparison: Alerting & Paging
- Comparison: Metrics Platforms
- Master Curriculum: 40 Weeks
- Monitoring Fundamentals
- Monitoring Migration (Legacy to Modern)
- Observability
- Observability Architecture
- Observability Skillcheck