Portal | Level: L2: Operations | Topics: OpenTelemetry, Tracing, Prometheus | Domain: Observability

OpenTelemetry - Primer

Why This Matters

You have logs in Elasticsearch, metrics in Prometheus, and traces in Jaeger. Three vendors, three agents, three config languages, three ways things break at 3 AM. OpenTelemetry (OTel) exists to end this fragmentation. It is the CNCF project that gives you a single, vendor-neutral standard for generating, collecting, and exporting telemetry data — traces, metrics, and logs — from your services.

If you touch infrastructure, OTel changes how you think about observability. Instead of bolting on monitoring after the fact, you instrument once and route signals anywhere. Switch backends without rewriting code. Correlate a slow API response to a database query to a container running hot — across services, languages, and teams.

This is not theoretical. OTel is the second-most-active CNCF project after Kubernetes. If you are not using it yet, you will be.

Timeline: OpenTelemetry was formed in 2019 by merging two competing projects: OpenTracing (a tracing API standard, 2016) and OpenCensus (Google's metrics-and-tracing library, 2018). The merge ended a confusing period in which library authors had to choose between two incompatible instrumentation APIs. The OTel tracing specification reached 1.0 (stable) in 2021; metrics and then logs stabilized in the years after, with exact timing varying by language SDK.


The Three Signals

OTel unifies three core telemetry signals under one umbrella:

┌──────────────────────────────────────────────────┐
│                 YOUR APPLICATION                 │
│                                                  │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐     │
│  │  Traces   │  │  Metrics  │  │   Logs    │     │
│  │           │  │           │  │           │     │
│  │ Spans     │  │ Counters  │  │ Structured│     │
│  │ Context   │  │ Gauges    │  │ Events    │     │
│  │ Timing    │  │ Histos    │  │ Severity  │     │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘     │
│        │              │              │           │
│        └──────────────┼──────────────┘           │
│                       │                          │
│                 OTel SDK / API                   │
└───────────────────────┼──────────────────────────┘
                        │
                 OTel Collector

Traces

A trace is the full journey of a request through your system. It is composed of spans — each span represents a unit of work (an HTTP handler, a DB query, a cache lookup). Spans carry:

  • Trace ID: Shared across all spans in one request
  • Span ID: Unique to this span
  • Parent Span ID: Links child to parent
  • Attributes: Key-value metadata (http.method, db.system, etc.)
  • Events: Timestamped annotations within a span
  • Status: OK, Error, or Unset

Trace ID: abc123
├── Span: API Gateway (120ms)
│   ├── Span: Auth Service (15ms)
│   ├── Span: Order Service (95ms)
│   │   ├── Span: DB Query (40ms)
│   │   └── Span: Cache Lookup (3ms)
│   └── Span: Response Serialization (5ms)
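Spans from different services join into one trace because the IDs above travel between processes in the W3C Trace Context `traceparent` header, which OTel's propagators read and write. A minimal parsing sketch (the helper name is ours, not an OTel API):

```python
# Minimal sketch of parsing a W3C Trace Context "traceparent" header,
# the format OTel propagators use to carry trace/span IDs across
# service boundaries. Layout per the W3C spec:
#   version "00" - 32-hex trace ID - 16-hex span ID - 2-hex flags.

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,       # shared by every span in the trace
        "parent_span_id": span_id,  # the caller's span
        "sampled": int(flags, 16) & 0x01 == 1,  # sampled flag bit
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The receiving service creates its next span as a child of `parent_span_id`, which is how the tree shown above is stitched together across processes.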

Metrics

OTel defines several metric instruments; three cover most use cases:

  Instrument   What It Measures                 Example
  Counter      Monotonically increasing value   http.server.request.count
  Gauge        Point-in-time value              system.memory.usage
  Histogram    Distribution of values           http.server.request.duration

Metrics in OTel use a push model by default (unlike Prometheus pull), but the collector can bridge both worlds.
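To make the histogram instrument concrete, here is a plain-Python sketch of the bucketed aggregation a histogram performs on recorded values (boundaries are illustrative, not the SDK defaults):

```python
# Illustrative sketch of how a histogram instrument aggregates request
# durations into per-bucket counts. Plain Python, not the OTel SDK;
# bucket boundaries chosen for illustration only.
BOUNDARIES = [0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]  # seconds

def record(buckets: list, value: float) -> None:
    """Increment the first bucket whose upper bound holds the value."""
    for i, bound in enumerate(BOUNDARIES):
        if value <= bound:
            buckets[i] += 1
            return
    buckets[-1] += 1  # overflow bucket (> last boundary)

buckets = [0] * (len(BOUNDARIES) + 1)
for duration in [0.003, 0.02, 0.02, 0.7, 12.0]:
    record(buckets, duration)
```

Only the bucket counts (plus sum and count) are exported, which is why a histogram is cheap to ship no matter how many values it records.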

Logs

OTel logs are the newest signal and bridge existing log frameworks (log4j, slog, zerolog) into the OTel ecosystem. The key addition: logs gain trace context. A log line is no longer orphaned text — it links back to the span that produced it.

{
  "timestamp": "2026-03-15T14:30:00Z",
  "severity": "ERROR",
  "body": "connection refused to payments-db",
  "trace_id": "abc123",
  "span_id": "def456",
  "resource": {
    "service.name": "order-service",
    "service.version": "2.4.1"
  }
}
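That `trace_id` field is the payoff: fetching every log line a trace produced, across services, becomes a simple filter. A toy illustration with hypothetical records:

```python
# Toy illustration: once log records carry trace_id, correlating logs
# with a trace is a simple filter. Records below are hypothetical.
logs = [
    {"body": "order received", "trace_id": "abc123", "service.name": "order-service"},
    {"body": "connection refused to payments-db", "trace_id": "abc123", "service.name": "order-service"},
    {"body": "unrelated request", "trace_id": "zzz999", "service.name": "auth-service"},
]

def logs_for_trace(records, trace_id):
    """Return every log record emitted anywhere within one trace."""
    return [r for r in records if r["trace_id"] == trace_id]

matched = logs_for_trace(logs, "abc123")
```

In practice your log backend runs this query for you; the point is that without the trace context there is nothing to filter on.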

Collector Architecture

The OTel Collector is the central nervous system. It receives, processes, and exports telemetry. You can run it as an agent (sidecar/daemonset) or as a gateway (centralized).

┌──────────────────────────────────────────────────────────────────────┐
│                            OTel Collector                            │
│                                                                      │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐  │
│  │ Receivers        │──▶│ Processors       │──▶│ Exporters        │  │
│  │                  │   │                  │   │                  │  │
│  │ - otlp           │   │ - batch          │   │ - otlp           │  │
│  │ - jaeger         │   │ - filter         │   │ - prometheus     │  │
│  │ - prometheus     │   │ - transform      │   │ - jaeger         │  │
│  │ - filelog        │   │ - tail_sampling  │   │ - loki           │  │
│  │ - hostmetrics    │   │ - memory_limiter │   │ - debug          │  │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                           Extensions                           │  │
│  │  - health_check   - pprof   - zpages   - bearertokenauth       │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Receivers

Receivers ingest data. They listen on ports or scrape endpoints:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:9100']
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

Processors

Processors transform data in flight. Order matters — they execute sequentially:

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

Exporters

Exporters send data to backends. You can fan out to multiple:

exporters:
  otlp:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: false
  prometheus:
    endpoint: 0.0.0.0:8889
  debug:
    verbosity: detailed

Pipelines — Wiring It Together

Pipelines connect receivers to processors to exporters, per signal:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp, debug]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, filter, batch]
      exporters: [otlp]

SDK Instrumentation

The OTel SDK is what your application code uses to produce telemetry. There are two layers:

  • API: Stable interfaces. Safe to depend on in libraries.
  • SDK: The implementation. Configured in your application entrypoint.
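The split matters because API calls are harmless no-ops until an application installs an SDK, so a library can instrument itself without forcing an observability stack on its users. The pattern, sketched in plain Python (these are not the actual OTel classes):

```python
# Plain-Python sketch of the API/SDK split: the API exposes a tracer
# interface whose default implementation does nothing, so library code
# can call it unconditionally. Not the real OTel classes.
class NoopSpan:
    def set_attribute(self, key, value):
        pass  # discards everything

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

class NoopTracer:
    """What a library gets until an application installs a real SDK."""
    def start_as_current_span(self, name):
        return NoopSpan()

_provider = NoopTracer()  # an application would swap in a real provider

def library_function():
    # The library instruments unconditionally; without an SDK the
    # cost is near zero and no telemetry leaves the process.
    with _provider.start_as_current_span("library_function") as span:
        span.set_attribute("lib.version", "1.0")
        return "work done"

result = library_function()
```

This is why the takeaway below says "libraries use API, apps configure SDK": the library never imports the SDK at all.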

Auto-Instrumentation

Most languages have auto-instrumentation that patches common libraries:

# Python — zero-code instrumentation:
#   pip install opentelemetry-distro opentelemetry-exporter-otlp
#   opentelemetry-bootstrap -a install
#   opentelemetry-instrument python app.py
#
# Or configure the SDK explicitly in your entrypoint:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "order-service",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

Manual Instrumentation

For custom business logic:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.total", total_amount)

    try:
        result = charge_payment(order_id)
        span.set_attribute("payment.status", "success")
    except PaymentError as e:
        span.set_status(StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise

Semantic Conventions

Semantic conventions are standardized attribute names. They ensure that http.request.method means the same thing whether it comes from Go, Python, or Java.

Key convention namespaces:

  Namespace    Example Attributes
  http.        http.request.method, http.response.status_code
  db.          db.system, db.statement, db.operation
  rpc.         rpc.system, rpc.method, rpc.service
  messaging.   messaging.system, messaging.operation
  server.      server.address, server.port
  service.     service.name, service.version, service.namespace
  deployment.  deployment.environment
  container.   container.id, container.image.name
  k8s.         k8s.pod.name, k8s.namespace.name

Use them. If you invent httpMethod instead of http.request.method, every dashboard and alert that depends on the standard name breaks.

Gotcha: Semantic conventions change between OTel versions. The HTTP conventions underwent a major rename in 2023 (e.g., http.method became http.request.method, http.status_code became http.response.status_code). If you upgrade your SDK and your dashboards break, check the semantic convention migration guides. Pin your convention version in your instrumentation code.
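If you control the pipeline, a key-rewrite shim can bridge the rename during migration. A sketch covering only the two renames named above (`upgrade_attributes` is a hypothetical helper, not an OTel API):

```python
# Hypothetical helper (not an OTel API): rewrite pre-2023 HTTP semantic
# convention keys to their post-rename names. The mapping is limited to
# the renames discussed in the text above.
HTTP_RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
}

def upgrade_attributes(attrs: dict) -> dict:
    """Return a copy of attrs with old convention keys renamed."""
    return {HTTP_RENAMES.get(k, k): v for k, v in attrs.items()}

new_attrs = upgrade_attributes({"http.method": "GET", "http.status_code": 200})
```

The collector's transform processor can do the same rewrite centrally, which avoids touching every service at once.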


Sampling

At scale, you cannot export every span. Sampling reduces volume while preserving signal.

Head Sampling

Decide at trace creation whether to sample:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)
provider = TracerProvider(sampler=sampler, resource=resource)

Pros: Simple, low overhead. Cons: You might drop the one interesting trace.
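Under the hood, ratio-based head sampling is roughly a deterministic threshold test on the trace ID, which is why every SDK in the request path reaches the same decision for the same trace. A simplified sketch (the real TraceIdRatioBased sampler differs in detail):

```python
# Simplified sketch of trace-ID ratio sampling: compare the low 64 bits
# of the trace ID against a threshold derived from the ratio. The real
# TraceIdRatioBased sampler differs in detail but follows this idea.
RATIO = 0.1
BOUND = int(RATIO * (1 << 64))

def should_sample(trace_id: int) -> bool:
    """Deterministic: the same trace ID always yields the same decision."""
    return (trace_id & ((1 << 64) - 1)) < BOUND

low_id = 0x0000000000000001   # far below the threshold
high_id = 0xFFFFFFFFFFFFFFFF  # at the top of the range
```

Because trace IDs are random, about 10% of IDs fall below the bound; because the test is a pure function of the ID, no coordination between services is needed.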

Tail Sampling (Collector-Side)

Decide after the trace completes, based on its content:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

Tail sampling keeps all errors and slow requests, plus 5% of everything else. It requires the collector to buffer complete traces, which costs memory.

Remember: "Head sampling is cheap but blind. Tail sampling is smart but hungry." Head sampling decides before seeing the trace, so it is fast but may drop interesting traces. Tail sampling decides after the trace completes, so it can keep errors and outliers, but it must buffer all spans in memory until the decision is made.

Under the hood: Tail sampling in the collector requires that all spans for a single trace arrive at the same collector instance. In a multi-instance gateway deployment, you need a load-balancing exporter that routes by trace ID. Without this, the collector sees incomplete traces and makes bad sampling decisions.
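The routing requirement boils down to a deterministic function of the trace ID. A simplified sketch using modulo routing (the actual loadbalancing exporter uses a consistent-hash ring so instances can be added without reshuffling everything):

```python
# Simplified sketch of trace-ID-based routing for tail sampling: every
# span of a trace must reach the same collector instance, no matter
# which agent forwarded it. Instance names are hypothetical.
COLLECTORS = ["gateway-0", "gateway-1", "gateway-2"]

def route(trace_id: int) -> str:
    """Pick a collector deterministically from the trace ID."""
    return COLLECTORS[trace_id % len(COLLECTORS)]

# Spans from the same trace, possibly arriving via different agents:
spans = [{"trace_id": 0xABC123, "span": "gateway"},
         {"trace_id": 0xABC123, "span": "db-query"}]
targets = {route(s["trace_id"]) for s in spans}  # all land on one instance
```

Round-robin load balancing in front of the gateway breaks this property, which is exactly the failure mode described above.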


Deployment Models

Model 1: Agent (DaemonSet)               Model 2: Gateway
┌──────────┐  ┌──────────┐               ┌──────────┐  ┌──────────┐
│ App Pod  │  │ App Pod  │               │ App Pod  │  │ App Pod  │
│ ┌──────┐ │  │ ┌──────┐ │               └────┬─────┘  └────┬─────┘
│ │ OTel │ │  │ │ OTel │ │                    │             │
│ │Agent │ │  │ │Agent │ │                    └──────┬──────┘
│ └──┬───┘ │  │ └──┬───┘ │                           │
└────┼─────┘  └────┼─────┘                   ┌───────▼───────┐
     │             │                         │ OTel Gateway  │
     └──────┬──────┘                         │  (Collector)  │
            │                                └───────┬───────┘
    ┌───────▼───────┐                                │
    │    Backend    │                        ┌───────▼───────┐
    └───────────────┘                        │    Backend    │
                                             └───────────────┘

Model 3: Agent + Gateway (recommended for production):

  • DaemonSet agents handle local collection and basic processing
  • The gateway handles tail sampling, enrichment, and fan-out to multiple backends
  • If the gateway goes down, agents can buffer briefly


Resource Detection

Resources describe the entity producing telemetry. OTel can auto-detect:

processors:
  resourcedetection:
    detectors: [env, system, docker, ec2, gcp, azure, k8snode]
    timeout: 5s
    override: false

This automatically populates attributes like host.name, cloud.provider, k8s.pod.name without manual configuration.


Key Takeaways

  1. OTel gives you one SDK, one collector, one wire format for all three signals
  2. The collector is a pipeline: receivers -> processors -> exporters
  3. Instrument with the API, configure with the SDK — libraries use API, apps configure SDK
  4. Semantic conventions are not optional — they are what make cross-service correlation work
  5. Tail sampling at the collector keeps errors and outliers while controlling volume
  6. Start with auto-instrumentation, add manual spans for business logic
  7. Resource attributes are the glue — they tell you where telemetry came from
