Ops Archaeology: The Slow Death Nobody Noticed
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L3
Estimated time: 40 min
Domains: Logging, Disk I/O, Observability, Change Management
Artifact 1: CLI Output
$ kubectl get pods -n backend -l app=user-service
NAME READY STATUS RESTARTS AGE
user-service-5c8d7e9f34-b2g4h 1/1 Running 0 18d
user-service-5c8d7e9f34-f5j7k 1/1 Running 0 18d
user-service-5c8d7e9f34-m8n1p 1/1 Running 0 18d
$ kubectl top pods -n backend -l app=user-service
NAME CPU(cores) MEMORY(bytes)
user-service-5c8d7e9f34-b2g4h 890m 412Mi
user-service-5c8d7e9f34-f5j7k 920m 398Mi
user-service-5c8d7e9f34-m8n1p 860m 421Mi
$ ssh node-3 'df -h /var/log'
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 100G 94G 6.0G 94% /var/log
$ ssh node-3 'du -sh /var/log/pods/backend_user-service*/user-service/*.log 2>/dev/null | head -3'
28G /var/log/pods/backend_user-service-5c8d7e9f34-b2g4h_uid1/user-service/0.log
26G /var/log/pods/backend_user-service-5c8d7e9f34-f5j7k_uid2/user-service/0.log
31G /var/log/pods/backend_user-service-5c8d7e9f34-m8n1p_uid3/user-service/0.log
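The `du` above was pointed at a known suspect. When you don't yet know which workload is filling the disk, a generic triage step is to rank every container log file by size. A minimal sketch, assuming the kubelet's default `/var/log/pods/<pod>/<container>/*.log` layout:

```shell
# Rank container log files by size so the biggest writer is obvious.
# $1: root of the pod-log tree (defaults to the kubelet standard path).
top_log_writers() {
  du -s "${1:-/var/log/pods}"/*/*/*.log 2>/dev/null | sort -rn | head -5
}

top_log_writers /var/log/pods
```

On node-3 this would surface the three `user-service` files immediately; on a healthy node it tells you the baseline to compare against.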
Artifact 2: Metrics
# Application latency (user-service) — last 7 days
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-12"} 0.045
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-13"} 0.048
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-14"} 0.052
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-15"} 0.068
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-16"} 0.089
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-17"} 0.142
http_request_duration_seconds{service="user-service",quantile="0.5",date="12-18"} 0.287
# Node disk I/O wait (node-3) — percentage
node_disk_io_time_weighted_seconds_total rate over 7 days:
Dec 12: 2.1%
Dec 13: 2.3%
Dec 14: 3.8%
Dec 15: 8.4%
Dec 16: 14.2%
Dec 17: 22.7%
Dec 18: 38.1%
# Log volume (all user-service pods combined)
container_log_entries_total{pod=~"user-service.*"} rate:
Dec 11: 12,400/min
Dec 12: 124,800/min
Dec 13: 128,100/min
Dec 14: 131,200/min
Dec 18: 142,300/min
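Before proposing anything, it is worth turning these series into a runway estimate. A back-of-envelope sketch, assuming the ~85G shown by `du` accumulated at a steady rate over the 7 days since the log-volume step change:

```shell
# Back-of-envelope runway: bytes accumulated / days elapsed = daily burn;
# space remaining / daily burn = days until the disk is full.
# All inputs are taken from the df/du output and metrics above.
days_left=$(awk 'BEGIN {
  used_gb  = 28 + 26 + 31   # du: per-pod log file sizes
  days     = 7              # Dec 11 -> Dec 18
  avail_gb = 6.0            # df: space left on /var/log
  printf "%.1f", avail_gb / (used_gb / days)
}')
echo "estimated days until /var/log is full: $days_left"
```

Roughly half a day of headroom, which reframes this from "degraded" to "about to page someone", and explains why the kubelet eviction manager is already active.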
Artifact 3: Infrastructure Code
# From: helm/user-service/values.yaml
# Last modified: 2024-12-11 (git blame shows this section changed)
logging:
  level: DEBUG        # Changed from INFO on Dec 11
  format: json
  output: stdout
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi
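If the investigation confirms the Dec 11 edit as the trigger, the smallest reversible change is to put the value back. A sketch of the revert, using the field names from the chart above (whether the team deploys via `helm upgrade` or GitOps is an assumption to verify first):

```yaml
# helm/user-service/values.yaml — proposed revert
logging:
  level: INFO         # was DEBUG since Dec 11
  format: json
  output: stdout
```

Separately, the kubelet's standard `containerLogMaxSize` / `containerLogMaxFiles` settings are worth checking: with rotation capped appropriately, a future verbosity change would cost log retention rather than node stability.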
# From: fluentbit/values.yaml
[OUTPUT]
    Name            es
    Match           kube.*
    Host            elasticsearch.logging
    Port            9200
    Logstash_Format On
    Retry_Limit     5
Artifact 4: Log Lines
[2024-12-18T16:44:12Z] user-service | {"level":"debug","ts":"2024-12-18T16:44:12.847Z","msg":"SQL query executed","query":"SELECT id, email FROM users WHERE id = $1","params":["usr-88291"],"duration_ms":2,"rows":1}
[2024-12-18T16:44:12Z] user-service | {"level":"debug","ts":"2024-12-18T16:44:12.849Z","msg":"Cache lookup","key":"user:usr-88291","hit":true,"ttl_remaining_s":1423}
[2024-12-18T16:44:09Z] kubelet/node-3 | E1218 16:44:09.441891 eviction_manager.go:346] eviction manager: attempting to reclaim ephemeral-storage
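To quantify how much of the volume is DEBUG chatter versus logs worth keeping, you can tally entries per level. A sketch against the JSON format shown above; it reads stdin so it could be fed from `kubectl logs` or the files on node-3 (here it is fed two inline sample lines):

```shell
# Count structured log entries per level; the pattern matches the
# {"level":"..."} field in the format shown in Artifact 4.
count_levels() {
  grep -o '"level":"[a-z]*"' | sort | uniq -c | sort -rn
}

printf '%s\n%s\n' \
  '{"level":"debug","msg":"SQL query executed"}' \
  '{"level":"info","msg":"request served"}' \
  | count_levels
```

If the debug share turns out to be ~90% of entries, that lines up with the tenfold jump in `container_log_entries_total` on Dec 12.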
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?