Ops Archaeology: The Session Store That Keeps Dying

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L1
Estimated time: 15 min
Domains: Kubernetes, Redis, Resource Management


Artifact 1: CLI Output

$ kubectl get pods -n ecommerce -l app=session-cache
NAME                             READY   STATUS             RESTARTS      AGE
session-cache-7f4b8d9c65-xk2mv  0/1     CrashLoopBackOff   43 (2m ago)   6h12m

$ kubectl describe pod session-cache-7f4b8d9c65-xk2mv -n ecommerce | tail -20
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 12 Sep 2024 14:38:02 +0000
      Finished:     Thu, 12 Sep 2024 14:38:47 +0000
    Ready:          False
    Restart Count:  43
    Limits:
      cpu:     250m
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
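
As a reading aid for the Exit Code line above: when a process is terminated by a signal, the reported exit code is conventionally 128 plus the signal number. The sketch below applies that convention; the helper name decode_exit_code and its two-entry signal table are hypothetical and exist only for illustration.

```python
# Hypothetical helper; the name and the small signal table are illustrative only.
KNOWN_SIGNALS = {9: "SIGKILL", 15: "SIGTERM"}


def decode_exit_code(code: int) -> str:
    """Interpret an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        name = KNOWN_SIGNALS.get(signum, "unknown signal")
        return f"terminated by signal {signum} ({name})"
    return f"exited on its own with status {code}"


print(decode_exit_code(137))  # terminated by signal 9 (SIGKILL)
```

A SIGKILL cannot be caught by the process, which is consistent with a container dying 45 seconds after start without logging a shutdown of its own.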

Artifact 2: Metrics

Prometheus query results from 30 minutes before the last crash:

# HELP redis_memory_used_bytes Total number of bytes allocated by Redis
# TYPE redis_memory_used_bytes gauge
redis_memory_used_bytes{instance="session-cache:6379",job="redis-exporter"} 58720256

# HELP redis_connected_clients Number of client connections
# TYPE redis_connected_clients gauge
redis_connected_clients{instance="session-cache:6379",job="redis-exporter"} 312

# HELP redis_keys_total Total number of keys
redis_keys_total{db="db0",instance="session-cache:6379"} 48293

# HELP redis_evicted_keys_total Total number of evicted keys
redis_evicted_keys_total{instance="session-cache:6379"} 0
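
The metric is reported in bytes while the pod limit in Artifact 1 is given in Mi, so a quick conversion helps when comparing them. This is only unit arithmetic on the two numbers already shown, assuming Kubernetes "Mi" means 1024 * 1024 bytes; the variable names are illustrative.

```python
MIB = 1024 * 1024

used_bytes = 58_720_256   # redis_memory_used_bytes from Artifact 2
limit_bytes = 64 * MIB    # container memory limit (64Mi) from Artifact 1

used_mib = used_bytes / MIB
fraction = used_bytes / limit_bytes

print(f"{used_mib:.0f} MiB used of 64 MiB limit ({fraction:.1%})")
# 56 MiB used of 64 MiB limit (87.5%)
```

Keep in mind that this gauge counts bytes allocated by Redis itself. The memory limit is enforced against the whole container, so the real footprint is likely somewhat higher than the figure shown here.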

Artifact 3: Infrastructure Code

# From: values-prod.yaml (Helm)
sessionCache:
  image:
    repository: redis
    tag: "7.2-alpine"
  persistence:
    enabled: false
  auth:
    enabled: false
  master:
    configuration: |-
      maxmemory-policy noeviction
      save ""
      appendonly no

Artifact 4: Log Lines

[2024-09-12T14:37:55Z] redis  | 1:M 12 Sep 2024 14:37:55.812 * Running mode=standalone, port=6379.
[2024-09-12T14:38:41Z] redis  | 1:M 12 Sep 2024 14:38:41.003 # Can't save in background: fork: Cannot allocate memory
[2024-09-12T14:38:18Z] webapp | time="2024-09-12T14:38:18Z" level=warning msg="session store slow" latency_ms=2340 endpoint="/api/checkout"
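
The log lines above are not listed in chronological order. When building a timeline from mixed sources, it can help to sort by the bracketed timestamp first. A minimal sketch, assuming every line starts with a [YYYY-MM-DDTHH:MM:SSZ] prefix as these three do; the function name stamp is illustrative.

```python
from datetime import datetime

# The three log lines from Artifact 4, copied verbatim.
raw = """
[2024-09-12T14:37:55Z] redis  | 1:M 12 Sep 2024 14:37:55.812 * Running mode=standalone, port=6379.
[2024-09-12T14:38:41Z] redis  | 1:M 12 Sep 2024 14:38:41.003 # Can't save in background: fork: Cannot allocate memory
[2024-09-12T14:38:18Z] webapp | time="2024-09-12T14:38:18Z" level=warning msg="session store slow" latency_ms=2340 endpoint="/api/checkout"
"""
lines = raw.strip().splitlines()


def stamp(line: str) -> datetime:
    """Parse the leading [timestamp] prefix of a log line."""
    text = line[1:line.index("]")]
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


ordered = sorted(lines, key=stamp)
for line in ordered:
    print(line)
```

Sorted this way, the webapp latency warning falls between the Redis startup line and the fork failure.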

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?