Ops Archaeology: The Session Store That Keeps Dying
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L1
Estimated time: 15 min
Domains: Kubernetes, Redis, Resource Management
Artifact 1: CLI Output
$ kubectl get pods -n ecommerce -l app=session-cache
NAME                             READY   STATUS             RESTARTS      AGE
session-cache-7f4b8d9c65-xk2mv   0/1     CrashLoopBackOff   43 (2m ago)   6h12m
$ kubectl describe pod session-cache-7f4b8d9c65-xk2mv -n ecommerce | tail -20
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 12 Sep 2024 14:38:02 +0000
      Finished:     Thu, 12 Sep 2024 14:38:47 +0000
    Ready:          False
    Restart Count:  43
    Limits:
      cpu:     250m
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
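When reading a "Last State" block like the one above, it can help to work out how long the container actually survived. The following is a minimal sketch, not part of the original artifacts: it only parses the two timestamps shown above, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamps copied verbatim from the "Last State" block in Artifact 1.
# The format string assumes an English/C locale for the day and month names.
FMT = "%a, %d %b %Y %H:%M:%S %z"
started = datetime.strptime("Thu, 12 Sep 2024 14:38:02 +0000", FMT)
finished = datetime.strptime("Thu, 12 Sep 2024 14:38:47 +0000", FMT)

# How long the container ran between starting and being terminated.
lifetime_s = (finished - started).total_seconds()
print(f"container lifetime: {lifetime_s:.0f} s")
```

Comparing this lifetime against the pod's age and restart count gives a rough sense of how quickly each restart runs into the same wall.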
Artifact 2: Metrics
Prometheus query results from 30 minutes before the last crash:
# HELP redis_memory_used_bytes Total number of bytes allocated by Redis
# TYPE redis_memory_used_bytes gauge
redis_memory_used_bytes{instance="session-cache:6379",job="redis-exporter"} 58720256
# HELP redis_connected_clients Number of client connections
# TYPE redis_connected_clients gauge
redis_connected_clients{instance="session-cache:6379",job="redis-exporter"} 312
# HELP redis_keys_total Total number of keys
redis_keys_total{db="db0",instance="session-cache:6379"} 48293
# HELP redis_evicted_keys_total Total number of evicted keys
redis_evicted_keys_total{instance="session-cache:6379"} 0
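The byte counts above are easier to reason about once they are in the same units as the container limit from Artifact 1. Here is a minimal sketch of that arithmetic, using only values that appear in the artifacts; the variable names are illustrative. Note that `redis_memory_used_bytes` reflects what Redis's allocator reports, while the container limit applies to the whole container's memory, so the two are related but not the same measurement.

```python
# Values copied from Artifacts 1 and 2.
MIB = 1024 * 1024
used_bytes = 58_720_256      # redis_memory_used_bytes
limit_bytes = 64 * MIB       # container memory limit (64Mi)
key_count = 48_293           # redis_keys_total for db0

used_mib = used_bytes / MIB
headroom_mib = (limit_bytes - used_bytes) / MIB
pct_of_limit = 100 * used_bytes / limit_bytes

# Rough average footprint per key, including Redis's own overhead.
bytes_per_key = used_bytes / key_count

print(f"used: {used_mib:.1f} MiB, headroom: {headroom_mib:.1f} MiB "
      f"({pct_of_limit:.1f}% of limit), ~{bytes_per_key:.0f} bytes/key")
```

The per-key figure is only a back-of-the-envelope average; it is useful for estimating how many more sessions fit in the remaining headroom, not as an exact measurement.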
Artifact 3: Infrastructure Code
# From: values-prod.yaml (Helm)
sessionCache:
  image:
    repository: redis
    tag: "7.2-alpine"
  persistence:
    enabled: false
  auth:
    enabled: false
  master:
    configuration: |-
      maxmemory-policy noeviction
      save ""
      appendonly no
Artifact 4: Log Lines
[2024-09-12T14:37:55Z] redis | 1:M 12 Sep 2024 14:37:55.812 * Running mode=standalone, port=6379.
[2024-09-12T14:38:41Z] redis | 1:M 12 Sep 2024 14:38:41.003 # Can't save in background: fork: Cannot allocate memory
[2024-09-12T14:38:18Z] webapp | time="2024-09-12T14:38:18Z" level=warning msg="session store slow" latency_ms=2340 endpoint="/api/checkout"
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?