OpenTelemetry - Street-Level Ops

Quick Diagnosis Commands

# Check if the collector is running and healthy
# (the health_check extension answers on "/" by default, not "/health")
curl -s http://localhost:13133/ | jq .

# View collector metrics (self-monitoring)
curl -s http://localhost:8888/metrics | grep otelcol

# Check how many spans are being received/exported
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent_spans

# Check for dropped data
curl -s http://localhost:8888/metrics | grep -E "(dropped|refused|failed)"

# Verify OTLP endpoint is listening
ss -tlnp | grep -E '(4317|4318)'

# Test OTLP endpoint with grpcurl
# (the OTLP receiver does not register gRPC reflection, so "list" usually
#  fails with "server does not support the reflection API" -- that error
#  still proves something is answering gRPC; "connection refused" does not)
grpcurl -plaintext localhost:4317 list

# Send a test span via HTTP
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'
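An empty resourceSpans body only proves the endpoint answers; it exercises nothing downstream. This sketch builds a minimal one-span OTLP/JSON payload you can POST instead (field names follow the OTLP JSON encoding; the service and span names are arbitrary placeholders):

```python
import json
import os
import time

# Build a minimal OTLP/HTTP JSON payload with one real span. In the OTLP
# JSON encoding, traceId/spanId are lowercase hex and uint64 timestamps
# are strings.
def make_test_span(service_name="otlp-smoke-test"):
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service_name},
            }]},
            "scopeSpans": [{
                "scope": {"name": "manual-test"},
                "spans": [{
                    "traceId": os.urandom(16).hex(),
                    "spanId": os.urandom(8).hex(),
                    "name": "smoke-test-span",
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now),
                    "endTimeUnixNano": str(now + 1_000_000),  # 1 ms later
                }],
            }],
        }]
    }

print(json.dumps(make_test_span()))
```

POST the output to http://localhost:4318/v1/traces with Content-Type: application/json, then grep the debug exporter's stdout for smoke-test-span.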

# Check collector logs in Kubernetes
kubectl logs -n monitoring -l app=otel-collector --tail=100 -f

# View zpages for live pipeline debugging
# (requires zpages extension enabled)
curl -s http://localhost:55679/debug/tracez | head -50
curl -s http://localhost:55679/debug/pipelinez

Gotcha: Collector Reports Healthy But No Data Flows

The health check extension only confirms the process is running. It does not validate pipeline connectivity.

Diagnosis:

# Check exporter errors specifically
curl -s http://localhost:8888/metrics | grep otelcol_exporter_send_failed_spans

# Enable debug exporter temporarily
# Add to your config:
# exporters:
#   debug:
#     verbosity: detailed
# Then add 'debug' to your pipeline's exporters list

# Check if receivers are actually getting data
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans
# If this is 0, the problem is upstream (apps not sending)

The debug exporter is your best friend. It dumps telemetry to stdout. Use it to verify data is flowing through the pipeline before blaming the backend.
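Spelled out as a working fragment rather than comments (the backend endpoint is a placeholder), the fan-out looks like this:

```yaml
exporters:
  debug:
    verbosity: detailed    # full per-span dump; drop to "basic" once data flows
  otlp:
    endpoint: your-backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp, debug]   # fan out: backend AND stdout
```

Remove the debug exporter again once you are done; detailed verbosity is far too chatty for steady-state production.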


Pattern: Minimal Production Collector Config

Start here. Add complexity only when needed.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp:
    endpoint: your-backend:4317
    tls:
      insecure: false
  debug:
    verbosity: basic

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

Key ordering: memory_limiter MUST come before batch in processors. If batch fills up and memory_limiter is after it, the collector OOMs before the limiter can act.


Gotcha: SDK Environment Variables Silently Ignored

OTel SDKs read from environment variables, but typos produce zero errors:

# Correct
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=order-service
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

# Wrong — and you will get no error, just no data
OTEL_EXPORTER_ENDPOINT=http://collector:4317        # Missing OTLP
OTEL_SERVICE=order-service                           # Missing _NAME
OTEL_RESOURCE_ATTRS=deployment.environment=production # Wrong suffix

Validate with the debug exporter. If your service name shows as "unknown_service", the env var is not being read.
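A stdlib-only sanity check you can run inside the container to catch such typos. The allow-list below is deliberately short and is my own hypothetical helper, not part of any SDK; extend it with the full variable list from the OTel spec:

```python
import os

# Deliberately short allow-list of canonical names; extend with the full
# set from the OTel environment-variable spec for real use.
KNOWN_GOOD = {
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_SERVICE_NAME",
    "OTEL_RESOURCE_ATTRIBUTES",
    "OTEL_PROPAGATORS",
}

def suspicious_otel_vars(env):
    """Return OTEL_*-prefixed names that do not match a known variable."""
    return sorted(
        name for name in env
        if name.startswith("OTEL_") and name not in KNOWN_GOOD
    )

if __name__ == "__main__":
    for name in suspicious_otel_vars(os.environ):
        print(f"suspicious: {name} (typo? SDKs silently ignore unknown vars)")
```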


Pattern: Tail Sampling Without Losing Your Mind

Tail sampling requires the collector to hold complete traces in memory. This is where people get burned.

processors:
  tail_sampling:
    decision_wait: 30s         # how long to buffer spans before deciding
    num_traces: 50000          # max traces held in memory at once
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep errors
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 3000
      # Sample 10% of the rest
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      # Always keep specific operations
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]

Critical constraint: tail sampling must run on a gateway collector, not on per-node agents. All spans for a single trace must hit the same collector instance. Use a load balancing exporter on agents:

# Agent config
exporters:
  loadbalancing:
    routing_key: traceID       # hash by trace ID so every span of a trace hits the same gateway
    protocol:
      otlp:
        # no static endpoint here: the dns resolver supplies one per gateway instance
        tls:
          insecure: true       # assumption for in-cluster traffic; adjust for your environment
    resolver:
      dns:
        hostname: gateway-collector-headless
        port: 4317
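As a back-of-envelope check that the numbers above fit in memory. The bytes-per-span figure is an assumption (real spans vary widely with attribute and event count), so treat this as sizing guidance, not a guarantee:

```python
# Rough memory bound for the tail_sampling buffer. bytes_per_span is an
# assumed average; real spans vary widely with attributes and events.
def tail_sampling_memory_mib(num_traces, avg_spans_per_trace, bytes_per_span=1024):
    return num_traces * avg_spans_per_trace * bytes_per_span / (1024 * 1024)

# The config above buffers up to 50_000 traces. At ~20 spans/trace:
print(round(tail_sampling_memory_mib(50_000, 20)))  # ~977 MiB

# Sanity check: decision_wait * expected_new_traces_per_sec must fit
# inside num_traces, or traces get evicted before a decision is made.
assert 30 * 1000 <= 50_000
```

Size memory_limiter on the gateway accordingly; the tail-sampling buffer comes on top of normal pipeline overhead.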

Gotcha: Prometheus Receiver vs OTLP Metrics

If you are migrating from Prometheus to OTel, be aware of metric naming:

Prometheus: http_requests_total (counter)
OTel:       http.server.request.duration (histogram with different semantics)

The Prometheus receiver scrapes Prometheus endpoints and converts to OTel format. But the names and types do not magically align with OTel semantic conventions. You will have metrics with both naming schemes in your backend.

Use the transform processor to rename if needed:

processors:
  transform:
    metric_statements:
      - context: metric
        statements:
          - set(name, "http.server.request.count") where name == "http_requests_total"

Pattern: Resource Detection in Kubernetes

Auto-populate pod, node, and namespace attributes:

processors:
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.namespace.name
        - k8s.node.name
        - k8s.deployment.name
      labels:
        - tag_name: app.label.team
          key: team
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip

The collector's service account needs RBAC to read pods:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
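The ClusterRole alone grants nothing until it is bound to the collector's service account. A matching binding, assuming the service account is named otel-collector in the monitoring namespace used elsewhere in these notes:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector     # assumed service account name
    namespace: monitoring    # assumed; match your deployment
```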

Gotcha: Collector Contrib vs Core

There are two collector distributions:

  • Core (otelcol): Minimal set of components. Missing most receivers/exporters.
  • Contrib (otelcol-contrib): Batteries included. Has Loki, Kafka, AWS, GCP exporters and dozens of receivers.

If your config references loki exporter and you deployed core, the collector exits immediately with "unknown exporter." Check the image tag in your deployment.

# Check which distribution you're running
kubectl get deployment otel-collector -n monitoring -o jsonpath='{.spec.template.spec.containers[0].image}'

# Core:    otel/opentelemetry-collector:0.96.0
# Contrib: otel/opentelemetry-collector-contrib:0.96.0

Pattern: Graceful Collector Shutdown

When rolling the collector, in-flight data can be lost. Configure shutdown properly:

service:
  telemetry:
    logs:
      level: info
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

In your Kubernetes deployment:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: collector
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]

The batch processor flushes on shutdown, but only if it gets the termination signal and has time to drain. Set terminationGracePeriodSeconds higher than your batch timeout.


Gotcha: Context Propagation Breaks at Boundaries

Debug clue: If your trace shows service A calling service B, but service B appears as a separate root trace (no parent span), context propagation is broken at that boundary. Search the backend for traces with root_span=true and service.name=B -- these orphaned roots are the smoking gun.

Distributed tracing requires context propagation — the trace ID must cross service boundaries. This fails silently at:

  1. HTTP: Missing traceparent header. Note that plain curl does not send one, so grepping curl's own verbose output proves nothing. Instead, inject a known header by hand (this example value comes from the W3C spec) and search your backend for it:

    curl -v http://your-service/api \
      -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'

    If trace 4bf92f3577b34da6a3ce929d0e0e4736 never appears, the service is not extracting incoming context.

  2. Message queues: Trace context must be injected into message headers. Most auto-instrumentation does NOT do this for Kafka/RabbitMQ.

  3. gRPC: Usually works with auto-instrumentation, but custom interceptors can strip metadata.

  4. Cross-language: Ensure all services use the same propagator. Default is W3C TraceContext. If one service uses B3 (Zipkin), traces break.

# Set propagator explicitly
OTEL_PROPAGATORS=tracecontext,baggage
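The receiving side of the tracecontext propagator can be sketched in a few lines (version 00 only; real propagators also handle tracestate and future versions):

```python
import re

# W3C traceparent, version 00: version-traceid-spanid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled) or None if malformed.

    A None here is exactly the silent failure mode above: the receiver
    cannot extract context and starts a fresh root trace.
    """
    m = TRACEPARENT.match(header)
    if m is None:
        return None
    trace_id, span_id, flags = m.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

# Example value from the W3C Trace Context spec:
print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

A B3 header ("X-B3-TraceId" etc.) fails this parse entirely, which is why mixing propagators across services breaks traces.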

Pattern: Filtering High-Volume Noise

Health checks, readiness probes, and internal metrics chatter can drown real signals:

processors:
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/readyz"'
        - 'attributes["http.target"] == "/metrics"'
    metrics:
      metric:
        - 'name == "up"'
    logs:
      log_record:
        - 'severity_number < 9'  # Drop DEBUG and below

Place the filter processor BEFORE the batch processor. There is no point batching data you are about to drop. One caveat: the attribute key depends on which semantic-convention version your SDKs emit; older instrumentation sets http.target, newer instrumentation sets url.path, so match whichever your spans actually carry.
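In pipeline terms, assuming the minimal config from earlier in these notes:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, batch]   # filter before batch
      exporters: [otlp]
```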


Pattern: Debugging Pipeline Issues with zpages

Enable zpages for live introspection without touching logs:

# Pipeline status
curl http://localhost:55679/debug/pipelinez

# Recent traces through the collector
curl http://localhost:55679/debug/tracez

# gRPC call statistics for receivers/exporters
curl http://localhost:55679/debug/rpcz

zpages are for debugging only. Do not expose them outside the cluster.