Ops Archaeology: The Deploy That Didn't Deploy

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L2
Estimated time: 25 min
Domains: Kubernetes, ArgoCD, Container Registry, CI/CD


Artifact 1: CLI Output

$ kubectl get deploy notification-service -n comms -o wide
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS     IMAGES                                            SELECTOR
notification-service   3/3     3            3           45d    notifier       registry.corp.io/notification-service:latest        app=notification-service

$ kubectl rollout history deployment notification-service -n comms
REVISION  CHANGE-CAUSE
14        <none>
15        <none>
16        kubectl apply (2024-11-20T09:15:03Z)

$ argocd app get comms-notification --output json | jq '{sync: .status.sync.status, health: .status.health.status, revision: .status.sync.revision}'
{
  "sync": "Synced",
  "health": "Healthy",
  "revision": "a4f8c2d1"
}

$ kubectl get pods -n comms -l app=notification-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
notification-service-6b7f9a8d43-h2k4l   registry.corp.io/notification-service@sha256:3e7a9f2b...c841
notification-service-6b7f9a8d43-m5n8p   registry.corp.io/notification-service@sha256:3e7a9f2b...c841
notification-service-6b7f9a8d43-q1r6s   registry.corp.io/notification-service@sha256:3e7a9f2b...c841

Artifact 2: Metrics

# Application version metric (exposed by the app itself)
app_build_info{version="2.3.1",commit="b91e4c7a",build_date="2024-10-08T14:22:00Z"} 1

# Notification delivery metrics (last 24 hours)
notifications_sent_total{channel="email",status="success"} 14293
notifications_sent_total{channel="email",status="failed"} 847
notifications_sent_total{channel="sms",status="success"} 3891
notifications_sent_total{channel="sms",status="failed"} 2104

# Feature flag check (app queries feature flag service)
feature_flag_check_total{flag="sms_provider_v2",result="not_found"} 6012
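The raw delivery counters are easier to compare as per-channel failure rates. The sketch below is only an illustration: the counter values are copied from the metrics above, while the names `SENT`, `failure_rates`, and `RATES` are hypothetical and not part of the system.

```python
# Counter values copied from Artifact 2 (last 24 hours).
# The dictionary layout and all names here are illustrative assumptions.
SENT = {
    ("email", "success"): 14293,
    ("email", "failed"): 847,
    ("sms", "success"): 3891,
    ("sms", "failed"): 2104,
}


def failure_rates(counters):
    """Return {channel: failed / (success + failed)} for each channel."""
    totals = {}
    failed = {}
    for (channel, status), count in counters.items():
        totals[channel] = totals.get(channel, 0) + count
        if status == "failed":
            failed[channel] = failed.get(channel, 0) + count
    return {ch: failed.get(ch, 0) / totals[ch] for ch in totals}


RATES = failure_rates(SENT)
for channel, rate in sorted(RATES.items()):
    print(f"{channel}: {rate:.1%} failed")
# email: 5.6% failed
# sms: 35.1% failed
```

Rates alone do not explain a cause; they only show which channel deserves attention first.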

Artifact 3: Infrastructure Code

# From: k8s/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: comms
resources:
  - ../../base
images:
  - name: registry.corp.io/notification-service
    newTag: latest

Artifact 4: Log Lines

[2024-11-20T10:02:11Z] notification-svc | WARN  Feature flag 'sms_provider_v2' not found, falling back to legacy SMS gateway
[2024-11-20T09:15:08Z] argocd          | app comms-notification synced to a4f8c2d1: Sync succeeded. Resources: 0 updated, 3 unchanged
[2024-11-19T16:45:33Z] ci-pipeline     | [notification-service] Built and pushed v2.5.0 (commit e3b7d9f1) -> registry.corp.io/notification-service:latest sha256:9c4d8e1a...f273

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?
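One check worth considering for the Diagnose step is whether the image digest the pods report matches the digest CI last pushed. The sketch below is a minimal illustration under stated assumptions: it parses the pod listing from Artifact 1 and compares it with the digest in the CI log line from Artifact 4. The digests are kept truncated exactly as they appear in the artifacts, so the comparison is illustrative only; a real check would need the full digests. The names `POD_OUTPUT`, `PUSHED_DIGEST`, `running_digests`, and `stale_pods` are hypothetical.

```python
# Pod name and imageID pairs copied from Artifact 1. The digests are truncated
# in the source and are deliberately left that way here.
POD_OUTPUT = """\
notification-service-6b7f9a8d43-h2k4l   registry.corp.io/notification-service@sha256:3e7a9f2b...c841
notification-service-6b7f9a8d43-m5n8p   registry.corp.io/notification-service@sha256:3e7a9f2b...c841
notification-service-6b7f9a8d43-q1r6s   registry.corp.io/notification-service@sha256:3e7a9f2b...c841
"""

# Digest from the ci-pipeline log line in Artifact 4 (also truncated in the source).
PUSHED_DIGEST = "sha256:9c4d8e1a...f273"


def running_digests(text):
    """Map each pod name to the digest part of its imageID."""
    digests = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pod, image_id = line.split()
        # imageID has the form <repository>@<digest>
        digests[pod] = image_id.split("@", 1)[1]
    return digests


def stale_pods(text, expected_digest):
    """Return the sorted names of pods whose digest differs from expected_digest."""
    return sorted(
        pod
        for pod, digest in running_digests(text).items()
        if digest != expected_digest
    )


for pod in stale_pods(POD_OUTPUT, PUSHED_DIGEST):
    print(f"{pod} is not running {PUSHED_DIGEST}")
```

With the artifact values above, all three pods are reported. Working out why is the rest of the exercise.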