Ops Archaeology: The Deploy That Didn't Deploy
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L2
Estimated time: 25 min
Domains: Kubernetes, ArgoCD, Container Registry, CI/CD
Artifact 1: CLI Output
$ kubectl get deploy notification-service -n comms -o wide
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                         SELECTOR
notification-service   3/3     3            3           45d   notifier     registry.corp.io/notification-service:latest   app=notification-service
$ kubectl rollout history deployment notification-service -n comms
REVISION  CHANGE-CAUSE
14        <none>
15        <none>
16        kubectl apply (2024-11-20T09:15:03Z)
$ argocd app get comms-notification --output json | jq '{sync: .status.sync.status, health: .status.health.status, revision: .status.sync.revision}'
{
  "sync": "Synced",
  "health": "Healthy",
  "revision": "a4f8c2d1"
}
$ kubectl get pods -n comms -l app=notification-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
notification-service-6b7f9a8d43-h2k4l registry.corp.io/notification-service@sha256:3e7a9f2b...c841
notification-service-6b7f9a8d43-m5n8p registry.corp.io/notification-service@sha256:3e7a9f2b...c841
notification-service-6b7f9a8d43-q1r6s registry.corp.io/notification-service@sha256:3e7a9f2b...c841
Artifact 2: Metrics
# Application version metric (exposed by the app itself)
app_build_info{version="2.3.1",commit="b91e4c7a",build_date="2024-10-08T14:22:00Z"} 1
# Notification delivery metrics (last 24 hours)
notifications_sent_total{channel="email",status="success"} 14293
notifications_sent_total{channel="email",status="failed"} 847
notifications_sent_total{channel="sms",status="success"} 3891
notifications_sent_total{channel="sms",status="failed"} 2104
# Feature flag check (app queries feature flag service)
feature_flag_check_total{flag="sms_provider_v2",result="not_found"} 6012
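The raw delivery counters are easier to compare as per-channel failure rates. The following is a minimal sketch, assuming all four counters cover the same 24-hour window and that failed / (success + failed) is the ratio of interest. The `sent` dictionary and the `failure_rate` helper are illustrative names, not part of the service.

```python
# Counter values copied from the notifications_sent_total lines in Artifact 2.
sent = {
    ("email", "success"): 14293,
    ("email", "failed"): 847,
    ("sms", "success"): 3891,
    ("sms", "failed"): 2104,
}


def failure_rate(channel):
    """Return failed / (success + failed) for one channel."""
    failed = sent[(channel, "failed")]
    total = failed + sent[(channel, "success")]
    return failed / total


for channel in ("email", "sms"):
    print(f"{channel}: {failure_rate(channel):.1%}")
# email: 5.6%
# sms: 35.1%
```

On these numbers SMS fails far more often than email, which is worth keeping in mind when reading the feature-flag lines.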
Artifact 3: Infrastructure Code
# From: k8s/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: comms
resources:
  - ../../base
images:
  - name: registry.corp.io/notification-service
    newTag: latest
Artifact 4: Log Lines
[2024-11-20T10:02:11Z] notification-svc | WARN Feature flag 'sms_provider_v2' not found — falling back to legacy SMS gateway
[2024-11-20T09:15:08Z] argocd | app comms-notification synced to a4f8c2d1: Sync succeeded. Resources: 0 updated, 3 unchanged
[2024-11-19T16:45:33Z] ci-pipeline | [notification-service] Built and pushed v2.5.0 (commit e3b7d9f1) -> registry.corp.io/notification-service:latest sha256:9c4d8e1a...f273
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?
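One possible first check for the Diagnose step is to line up what the pods report against what CI last pushed. The sketch below is only an illustration: it copies the values from Artifacts 1, 2, and 4 by hand, keeps the truncated digests exactly as they appear, and drops the leading "v" from the CI version so the two version strings are comparable. The `running`, `pushed`, and `drift` names are hypothetical helpers, not tooling that exists in the repository.

```python
# Values copied verbatim from the artifacts; the "..." truncation is kept as-is.
running = {
    "digest": "sha256:3e7a9f2b...c841",  # Artifact 1, pod imageID
    "version": "2.3.1",                  # Artifact 2, app_build_info
    "commit": "b91e4c7a",                # Artifact 2, app_build_info
}
pushed = {
    "digest": "sha256:9c4d8e1a...f273",  # Artifact 4, ci-pipeline log
    "version": "2.5.0",                  # Artifact 4, logged as v2.5.0
    "commit": "e3b7d9f1",                # Artifact 4, ci-pipeline log
}


def drift(running, pushed):
    """List the fields where the running image differs from the last pushed one."""
    return [key for key in pushed if running.get(key) != pushed[key]]


mismatches = drift(running, pushed)
print("drift in:", ", ".join(mismatches) if mismatches else "none")
# drift in: digest, version, commit
```

A result like this would suggest comparing it with the ArgoCD sync line in Artifact 4, which reports 0 resources updated, and with the `newTag: latest` entry in Artifact 3.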