How We Got Here: Monitoring Evolution

Arc: Observability · Eras covered: 6 · Timeline: ~2005-2025 · Read time: ~12 min


The Original Problem

In 2005, "monitoring" meant checking if a server was alive. You pinged it. If the ping failed, someone got an email. If the disk filled up, maybe someone noticed before the application crashed — maybe not. The gap between "something is wrong" and "here's what's wrong and why" was filled by SSH'ing into servers and reading log files. Monitoring was reactive, binary (up/down), and focused on infrastructure, not applications.

When an application was slow, you found out from angry users, not from your monitoring system. The idea that you could proactively detect performance degradation, trace it to a specific service, and fix it before users noticed was science fiction.


Era 1: Nagios and Host-Based Monitoring (~2005-2010)

The Solution

Nagios (1999, but mainstream enterprise use ~2005) was the king of infrastructure monitoring. It checked hosts and services via plugins — external executables, typically shell scripts, that returned exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). It had a web UI showing green/red status, email/pager notifications, and an ecosystem of thousands of community plugins.

What It Looked Like

# /etc/nagios/conf.d/webserver.cfg
define host {
    host_name           web01
    alias               Web Server 01
    address             192.168.1.10
    check_command       check-host-alive
    max_check_attempts  3
    notification_period 24x7
    contacts            oncall-team
}

define service {
    host_name           web01
    service_description HTTP
    check_command       check_http!-H web01.example.com -u /healthz -w 2 -c 5
    check_interval      5
    retry_interval      1
    max_check_attempts  3
    notification_period 24x7
}

define service {
    host_name           web01
    service_description Disk Space
    check_command       check_nrpe!check_disk
    check_interval      15
}

Why It Was Better

  • Automated health checking — no more manual SSH sessions to check
  • Notification system with escalation and acknowledgment
  • Extensible via simple shell-script plugins
  • The "green wall" dashboard became the ops team's heartbeat
  • Configuration-as-code (even if it was painful config file syntax)
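The plugin contract behind that extensibility is tiny: print one status line to stdout and exit with 0, 1, or 2. A sketch of a `check_disk`-style plugin in Python (the thresholds and the `check_disk` name here are illustrative — Nagios accepts any executable, not just shell scripts):

```python
"""Minimal Nagios-style plugin sketch: check free disk space."""
import shutil

WARN_PCT = 20   # warn below 20% free (illustrative threshold)
CRIT_PCT = 10   # critical below 10% free (illustrative threshold)

def check_disk(path="/"):
    """Return (exit_code, status_line) following the Nagios plugin contract."""
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    if free_pct < CRIT_PCT:
        return 2, f"CRITICAL - {free_pct:.1f}% free on {path}"
    if free_pct < WARN_PCT:
        return 1, f"WARNING - {free_pct:.1f}% free on {path}"
    return 0, f"OK - {free_pct:.1f}% free on {path}"

code, message = check_disk("/")
print(message)  # Nagios reads the first line of stdout for the status page
# a real plugin would finish with: sys.exit(code)
```

Everything in Era 1 — NRPE checks, community plugins, the green wall — reduces to variations on this exit-code convention.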

Why It Wasn't Enough

  • Pull-based: Nagios server polled every host, which didn't scale past ~5000 hosts
  • No historical data: Nagios told you the current state, not trends
  • Host-centric: no understanding of services, dependencies, or applications
  • Configuration was verbose and error-prone (define every host individually)
  • Notification fatigue: thousands of checks = thousands of alerts during an outage

Legacy You'll Still See

Nagios is still running in enterprises everywhere. Icinga (a Nagios fork) and Naemon modernize the model. The mental model of "check → threshold → alert" is foundational. Many monitoring systems still use Nagios plugins as their check mechanism. If you see check_nrpe on a server, you're in this era.


Era 2: Zabbix and Comprehensive Infrastructure Monitoring (~2008-2013)

The Solution

Zabbix (2004, widespread ~2008) addressed Nagios's limitations with built-in data storage, graphing, auto-discovery, and a powerful template system. It could monitor via agents, SNMP, IPMI, and agentless checks. It stored historical metric data and provided trend analysis, capacity planning, and customizable dashboards.

What It Looked Like

# Zabbix template for Linux servers
# Auto-discovers filesystems, network interfaces, processes
# Built-in items:
#   - system.cpu.util[,idle]     → CPU idle %
#   - vm.memory.size[available]  → Available RAM
#   - vfs.fs.size[/,pfree]       → Free disk %
#   - net.if.in[eth0]            → Network bytes in

# Trigger: alert when disk is 90% full
{Template OS Linux:vfs.fs.size[/,pfree].last()}<10

# Trigger: alert when CPU is high for 5 minutes
{Template OS Linux:system.cpu.util[,idle].avg(5m)}<20
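The two triggers above read naturally once you know that `last()` returns the newest stored sample and `avg(5m)` averages the trailing window — this is what Zabbix's historical storage buys you over Nagios's point-in-time checks. A toy model in Python (`MetricHistory` is illustrative, not Zabbix's implementation):

```python
import time
from collections import deque

class MetricHistory:
    """Toy model of Zabbix trigger functions last() and avg() over stored history."""
    def __init__(self):
        self.samples = deque()  # (timestamp, value) pairs, appended in time order

    def add(self, value, ts=None):
        self.samples.append((ts if ts is not None else time.time(), value))

    def last(self):
        return self.samples[-1][1]

    def avg(self, window_seconds, now=None):
        # assumes at least one sample falls inside the window
        now = now if now is not None else time.time()
        recent = [v for t, v in self.samples if now - t <= window_seconds]
        return sum(recent) / len(recent)

# Trigger "{...vfs.fs.size[/,pfree].last()}<10" is roughly:
pfree = MetricHistory()
pfree.add(8.5)
disk_alert = pfree.last() < 10   # True → the trigger fires
```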

Why It Was Better

  • Historical data storage with configurable retention
  • Auto-discovery: automatically find and monitor new hosts, interfaces, disks
  • Templates: define monitoring once, apply to thousands of hosts
  • Built-in graphing and dashboards
  • SNMP support for network devices (routers, switches, firewalls)

Why It Wasn't Enough

  • Relational database backend (PostgreSQL/MySQL) struggled at large scale
  • Still infrastructure-focused — CPU, RAM, disk, not application metrics
  • Dashboard customization was limited compared to later tools
  • Configuration complexity grew with the number of templates and overrides
  • No concept of application-level tracing or distributed systems

Legacy You'll Still See

Zabbix is widely used in enterprises, especially for network and infrastructure monitoring. It excels at SNMP monitoring and is often the tool of choice for network operations teams. If you join a company with significant on-prem infrastructure, Zabbix is likely in the stack.


Era 3: Graphite and the Metrics Pipeline (~2011-2016)

The Solution

Graphite (2008, widespread ~2011) separated the concerns: collect metrics (StatsD, collectd, Diamond), transport them (Carbon), store them (Whisper), and query/graph them (Graphite-Web). This pipeline model enabled teams to instrument their own applications — not just monitor infrastructure. Custom metrics became possible: requests per second, cart conversions, API latency percentiles.

What It Looked Like

# Application instrumentation with StatsD
import statsd

c = statsd.StatsClient('statsd.example.com', 8125)

# Count events
c.incr('api.requests.total')
c.incr('api.requests.status.200')

# Time operations
with c.timer('api.response_time'):
    result = handle_request()

# Gauge current values
c.gauge('queue.depth', get_queue_depth())

# Graphite query — render a graph
/render?target=stats.api.requests.total
        &target=stats.api.requests.status.500
        &from=-24h
        &format=json

Why It Was Better

  • Application-level metrics: teams could instrument anything
  • StatsD was dead simple: UDP fire-and-forget, five lines of code
  • Pipeline architecture: each component could be scaled independently
  • Community-driven dashboards (Grafana started as a Graphite frontend)
  • Time-series storage optimized for metrics (not a relational database)
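The "dead simple" claim is concrete: StatsD's wire protocol is a single UDP text datagram of the form `<metric>:<value>|<type>`. A minimal client sketch using only the standard library (host and port are placeholders):

```python
import socket

def statsd_payload(name, value, mtype):
    """StatsD wire format: "<metric>:<value>|<type>" — c=counter, ms=timer, g=gauge."""
    return f"{name}:{value}|{mtype}".encode("ascii")

def statsd_send(name, value, mtype, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram, no connection, no ack, no retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_payload(name, value, mtype), (host, port))

statsd_send("api.requests.total", 1, "c")     # counter increment
statsd_send("api.response_time", 320, "ms")   # timing in milliseconds
statsd_send("queue.depth", 42, "g")           # gauge snapshot
```

The UDP choice is the whole design philosophy: instrumentation must never slow down or crash the application, so packets are sent blind and dropped metrics are acceptable.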

Why It Wasn't Enough

  • Whisper storage was disk-intensive and inflexible (fixed retention periods)
  • Single-node Carbon was a bottleneck (Carbon-Relay helped but added complexity)
  • No labels/tags: metric names were hierarchical strings (servers.web01.cpu.idle)
  • No built-in alerting: teams bolted on external tools like Cabot or Seyren
  • Scaling Graphite horizontally was an engineering project

Legacy You'll Still See

Graphite's influence is enormous. StatsD is still widely used. The metric pipeline pattern (collect → transport → store → query) is the standard architecture. Grafana, which now supports dozens of backends, started as a Graphite UI. The hierarchical metric naming convention persists in many older codebases.


Era 4: Prometheus and Cloud-Native Monitoring (~2015-2022)

The Solution

Prometheus (development began at SoundCloud in 2012, public release 2015, CNCF's second project in 2016) was purpose-built for the cloud-native world. Pull-based scraping over HTTP. Labels for multi-dimensional data. A powerful query language (PromQL). Built-in alerting (Alertmanager). Native Kubernetes integration via service discovery. It became the monitoring standard for Kubernetes.

What It Looked Like

# prometheus.yml — scrape configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

# Application instrumentation with Prometheus client
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total requests',
                        ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Request latency',
                            ['method', 'endpoint'])

@app.route('/api/users')
def get_users():
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time():
        result = fetch_users()
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return result

# PromQL — 99th percentile latency
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{job="myapp"}[5m])
)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
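Both queries above lean on `rate()`, which computes the per-second increase of a monotonic counter over a trailing window, surviving counter resets on process restarts. A simplified Python sketch of the idea (real PromQL `rate()` also extrapolates to the window boundaries, which this omits):

```python
def counter_rate(samples, window):
    """Approximate PromQL rate(): per-second increase of a monotonic counter.
    samples: non-empty, time-sorted list of (timestamp, value).
    On a reset (value drops, e.g. process restart), the post-reset value is
    treated as the increase since the reset, as Prometheus does."""
    newest = samples[-1][0]
    in_window = [(t, v) for t, v in samples if newest - t <= window]
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(in_window, in_window[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1  # reset handling
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else 0.0

# counter went 1000 → 1300 over 60s: 300 requests / 60s
samples = [(0, 1000), (30, 1150), (60, 1300)]
print(counter_rate(samples, 60))  # → 5.0
```

Dividing two such rates, as in the error-rate query, then yields a ratio that is robust to restarts and independent of scrape timing.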

Why It Was Better

  • Label-based: multi-dimensional queries without metric name explosion
  • Pull-based with service discovery: automatically finds new targets in K8s
  • PromQL: powerful, expressive query language
  • Built-in alerting with Alertmanager (routing, silencing, inhibition)
  • Kubernetes-native: the standard monitoring stack for K8s
  • Grafana + Prometheus became the de facto dashboard pair

Why It Wasn't Enough

  • Single-node storage: Prometheus doesn't scale horizontally natively
  • Retention limited by disk (15 days default, more requires careful sizing)
  • Pull-based model doesn't work well for short-lived jobs (Pushgateway is a workaround)
  • PromQL learning curve is steep
  • Long-term storage required Thanos or Cortex (significant additional complexity)

Legacy You'll Still See

Prometheus is the current standard for Kubernetes monitoring. The Prometheus + Grafana + Alertmanager stack is ubiquitous. PromQL is a required skill for DevOps/SRE roles. If you're working with Kubernetes, you're working with Prometheus.


Era 5: OpenTelemetry and Unified Observability (~2019-2024)

The Solution

OpenTelemetry (2019, merger of OpenTracing and OpenCensus) created a vendor-neutral standard for telemetry: metrics, traces, and logs through a single SDK and collector. Instead of instrumenting once for Prometheus, once for Jaeger, and once for your logging system, you instrument once with OpenTelemetry and send data wherever you want.

What It Looked Like

# OpenTelemetry auto-instrumentation — zero-code-change metrics and traces
# pip install opentelemetry-distro opentelemetry-exporter-otlp
# opentelemetry-bootstrap -a install

# Or manual instrumentation
from opentelemetry import trace, metrics

tracer = trace.get_tracer("myapp")
meter = metrics.get_meter("myapp")

request_counter = meter.create_counter("http.server.request.count")
request_duration = meter.create_histogram("http.server.request.duration")

@app.route('/api/users')
def get_users():
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", len(users))
        request_counter.add(1, {"method": "GET", "route": "/api/users"})
        # ... handle request

# OpenTelemetry Collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/jaeger:
    endpoint: jaeger:4317
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Why It Was Better

  • Vendor-neutral: instrument once, export to any backend
  • Unified: metrics, traces, and logs through one SDK
  • Auto-instrumentation: get traces for HTTP, gRPC, database calls without code changes
  • Collector as a pipeline: filter, transform, route telemetry data
  • Industry-wide adoption (a flagship CNCF project, with OTLP supported by all major vendors)

Why It Wasn't Enough

  • Still maturing: some language SDKs are more stable than others
  • Logs support lagged behind metrics and traces
  • The collector adds operational complexity
  • Migration from existing instrumentation (Prometheus client, StatsD) is gradual
  • The promise of "correlate metrics, traces, and logs" is partially realized

Legacy You'll Still See

OpenTelemetry is the current direction. Major vendors (Datadog, New Relic, Grafana) support OTLP natively. Most new instrumentation should use OpenTelemetry. The migration from legacy instrumentation is ongoing.


Era 6: eBPF and Zero-Instrumentation Observability (~2022-2025)

The Solution

eBPF (extended Berkeley Packet Filter) allows sandboxed, kernel-verified programs to run inside the Linux kernel without modifying application code or even restarting processes. Tools like Cilium, Pixie (acquired by New Relic), Grafana Beyla, and Parca extract metrics, traces, and profiles from running applications with zero code changes. The kernel observes the application, not the other way around.

What It Looked Like

# Grafana Beyla — auto-instrument any HTTP/gRPC service via eBPF
# No code changes, no SDK, no sidecar
BEYLA_OPEN_PORT=8080 \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 \
beyla

# Pixie (now open-source) — observe all K8s traffic
# Install once, see HTTP requests, SQL queries, DNS calls
px deploy
px run px/http_data  # see all HTTP requests across the cluster

# Parca — continuous profiling via eBPF
# See CPU, memory allocation profiles for every process
# No sampling bias, no instrumentation needed

Why It Was Better

  • Zero instrumentation: no code changes, no SDK, no restart
  • Kernel-level visibility: see things the application can't report
  • Low overhead: eBPF programs are verified and JIT-compiled
  • Language-agnostic: works for Go, Java, Python, Rust, C, anything
  • Continuous profiling: always-on CPU/memory profiling without production impact

Why It Wasn't Enough

  • Linux-only (eBPF is a Linux kernel feature)
  • Kernel version requirements (5.x+ for most features)
  • Limited to what the kernel can see (no application-level business context)
  • eBPF programs are hard to write and debug
  • Cannot replace application-level instrumentation for business metrics
  • Security implications of running code in the kernel

Legacy You'll Still See

eBPF-based observability is the current frontier. Cilium is becoming the default CNI for Kubernetes. Grafana Beyla and Pixie are in early production use. This is the direction, but OpenTelemetry-based instrumentation will remain necessary for application-specific metrics and traces.


Where We Are Now

Prometheus is the standard for metrics. OpenTelemetry is the standard for instrumentation. Grafana is the standard for dashboards. The "three pillars" model (metrics, logs, traces) is being replaced by a unified observability model where signals are correlated. eBPF is adding a fourth signal (continuous profiling) without instrumentation overhead. Most organizations are somewhere between Prometheus-only and full OpenTelemetry adoption.

Where It's Going

AI-assisted root cause analysis — systems that automatically correlate anomalous metrics with relevant traces and logs, and suggest root causes — is the next major capability. The combination of eBPF (system-level signals) and OpenTelemetry (application-level signals) will provide unprecedented visibility. The challenge is making this actionable without drowning teams in data.

The Pattern

Every generation of monitoring reduces the time between "something is wrong" and "here's why." The evolution is from binary checks (up/down) to dimensional metrics to distributed traces to kernel-level observation. Each layer adds depth but also complexity — the winning approach is the one that surfaces the right signal at the right time.

Key Takeaway for Practitioners

Start with Prometheus and Grafana. Add OpenTelemetry when you need traces. Explore eBPF when you need to observe services you can't instrument. The most important monitoring decision isn't the tool — it's defining what "healthy" looks like for your service so you know when it isn't.
