
Production Readiness Review: Answer Key

This document provides model answers at the Strong (3) level for all 50 questions in the assessment. Each answer includes:

  • The expected response from an engineer ready to go on-call
  • Common mistakes that indicate a lower score
  • Links to wiki content for deeper study

Use this after completing the self-assessment to calibrate your scores and identify study targets.


Section 1: Kubernetes Operations

Q1. Pod in CrashLoopBackOff — First 5 Commands

Model Answer (Strong):

# 1. Get pod status and restart count
kubectl get pods -n meridian-prod -l app=order-service -o wide
# Expected: shows one pod in CrashLoopBackOff with restart count, node placement

# 2. Describe the crashing pod for events and conditions
kubectl describe pod <pod-name> -n meridian-prod
# Expected: shows last state (terminated, exit code), events (Back-off restarting),
# resource limits, probe configuration, volume mounts

# 3. Check current logs (may be empty if crash is immediate)
kubectl logs <pod-name> -n meridian-prod --previous
# --previous is critical: shows logs from the LAST crashed container, not the current attempt

# 4. Check events in the namespace for broader context
kubectl get events -n meridian-prod --sort-by='.lastTimestamp' --field-selector involvedObject.name=<pod-name>
# Expected: reveals if this is OOMKilled, failed health probe, image pull issue, etc.

# 5. Check recent deployments/changes
kubectl rollout history deployment/order-service -n meridian-prod
helm history order-service -n meridian-prod
# Expected: reveals if a recent deploy triggered the crash

The key insight is that --previous on kubectl logs is essential because the current container may have crashed before writing any logs. You also need to distinguish between application crashes (exit code 1, check logs), OOM kills (exit code 137, check resource limits), and failed health probes (check probe config and timing).
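
The exit-code arithmetic behind this triage is simple: codes above 128 mean the process died from a signal, signal = code - 128. A quick illustration (hypothetical helper, not part of the playbook):

```python
# Container exit codes above 128 encode a fatal signal:
# exit_code = 128 + signal (9 = SIGKILL, used by the OOM killer; 15 = SIGTERM).
def classify_exit_code(code: int) -> str:
    if code == 0:
        return "clean exit"
    if code > 128:
        sig = code - 128
        known = {9: "OOMKilled / SIGKILL", 15: "SIGTERM (eviction or rollout)"}
        return known.get(sig, f"killed by signal {sig}")
    return "application error (check logs with --previous)"

print(classify_exit_code(137))  # OOMKilled / SIGKILL
print(classify_exit_code(143))  # SIGTERM (eviction or rollout)
print(classify_exit_code(1))    # application error (check logs with --previous)
```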

Common mistakes:

  • Forgetting --previous flag and seeing empty logs
  • Not checking the exit code in describe output (137 = OOM, 1 = app error, 143 = SIGTERM)
  • Jumping to code-level debugging before checking if a recent deploy caused the issue
  • Not checking if only one pod is affected (suggests node-specific issue) vs all pods (suggests code/config issue)

Study: k8s-debugging-playbook/street_ops.md, Case: CrashLoopBackOff


Q2. HPA Not Scaling Despite High CPU

Model Answer (Strong):

The 4 most likely causes, in order:

  1. Metrics server not reporting: The HPA relies on the metrics-server (or Prometheus adapter for custom metrics) to get CPU values. Check:

    kubectl get hpa search-service -n meridian-prod -o yaml
    # Look at status.conditions — if ScalingActive is False, metrics are not available
    kubectl top pods -n meridian-prod -l app=search-service
    # If "metrics not available," metrics-server is broken
    kubectl get pods -n kube-system -l k8s-app=metrics-server
    

  2. Already at maxReplicas: The HPA has a ceiling. Check:

    kubectl get hpa search-service -n meridian-prod
    # Compare REPLICAS to MAXREPLICAS — if equal, it cannot scale further
    

  3. Resource quota or node capacity exhaustion: Even if HPA wants to scale, new pods cannot be scheduled:

    kubectl describe resourcequota -n meridian-prod
    kubectl get events -n meridian-prod --field-selector reason=FailedScheduling
    kubectl describe nodes | grep -A 5 "Allocated resources"
    

  4. CPU requests vs limits mismatch: HPA scales based on requests, not limits or actual usage. If requests are set very high relative to actual usage, the percentage the HPA sees may be low even though real CPU is high:

    kubectl get hpa search-service -n meridian-prod -o jsonpath='{.status.currentMetrics}'
    # Compare the HPA's view of utilization vs what you see in Grafana
    
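The denominator point in cause 4 is easy to verify with toy numbers (illustrative values, not from the cluster):

```python
# The HPA's utilization metric is actual usage divided by the pod's CPU
# *request*; limits and raw node usage never enter the calculation.
def hpa_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    return 100.0 * usage_millicores / request_millicores

usage = 900  # pod is really burning 900m of CPU

# Sane request: the HPA sees 90% and scales against an 80% target.
print(hpa_utilization_pct(usage, request_millicores=1000))  # 90.0

# Inflated request: the same load looks like 22.5%, so no scaling happens.
print(hpa_utilization_pct(usage, request_millicores=4000))  # 22.5
```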

Common mistakes:

  • Not understanding that HPA uses requests as the denominator, not limits
  • Forgetting to check maxReplicas
  • Not checking the Prometheus adapter (for custom metric http_requests_per_second) separately from metrics-server (for CPU)
  • Ignoring the ScalingActive and AbleToScale conditions in HPA status

Study: k8s-ops (HPA)/primer.md, k8s-ops (HPA)/street_ops.md


Q3. Node Drain Pre-Flight Checklist

Model Answer (Strong):

Before running kubectl drain app-worker-3:

  1. Check PDB compliance: The Order Service PDB requires minAvailable=2 of 3. If one pod is already unhealthy, draining this node would violate the PDB and the drain will hang.

    kubectl get pdb -n meridian-prod
    kubectl get pods -n meridian-prod -l app=order-service -o wide
    # Verify all 3 replicas are healthy and at least 1 is on another node
    

  2. Check what else is on the node:

    kubectl get pods --field-selector spec.nodeName=app-worker-3 --all-namespaces
    # Identify DaemonSets (will be ignored), stateful workloads, and pods without controllers
    

  3. Verify spare capacity exists on other nodes:

    kubectl describe nodes | grep -A 5 "Allocated resources"
    # Ensure evicted pods can be scheduled elsewhere
    

  4. Cordon first to prevent new scheduling:

    kubectl cordon app-worker-3
    # Then monitor for a few minutes to ensure existing pods remain healthy
    

  5. Drain with appropriate flags:

    kubectl drain app-worker-3 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=300s
    # --ignore-daemonsets: DaemonSet pods cannot be evicted
    # --delete-emptydir-data: required if pods use emptyDir volumes
    # --grace-period: allow graceful shutdown (important for RabbitMQ consumers)
    # --timeout: fail if drain takes too long (PDB stuck)
    

  6. Monitor the drain: Watch for stuck evictions, especially the RabbitMQ consumer which may need time to finish processing in-flight messages.

  7. Notify the team: For production nodes, announce in the ops channel before draining.

Common mistakes:

  • Not checking PDB compliance before draining
  • Forgetting --ignore-daemonsets (drain fails immediately)
  • Not cordoning first (new pods may be scheduled to the node during drain)
  • Not considering the RabbitMQ consumer's graceful shutdown (messages could be redelivered)
  • Draining during peak hours without checking capacity headroom

Study: k8s-node-lifecycle/primer.md, Case: Drain Blocked by PDB


Q4. PVC Stuck in Pending

Model Answer (Strong):

4 causes in order of likelihood for this architecture (EKS with gp3 EBS volumes):

  1. StorageClass mismatch: The PVC requests a StorageClass that does not match the PV's class, or the StorageClass does not exist.

    kubectl get pvc <name> -n elastic-system -o yaml | grep storageClassName
    kubectl get pv <pv-name> -o yaml | grep storageClassName
    kubectl get storageclass
    

  2. Capacity mismatch: The PVC requests more storage than the PV offers, or the PV's accessModes do not match the PVC's request.

    kubectl get pv <pv-name> -o yaml | grep -A 2 capacity
    kubectl get pvc <name> -o yaml | grep -A 2 resources
    # Also check accessModes: PV has ReadWriteOnce but PVC requests ReadWriteMany (EBS cannot do RWX)
    

  3. Node affinity conflict: For EBS-backed PVs, the PV is bound to a specific AZ. If the pod's node affinity or topology constraints place it in a different AZ, the PVC cannot bind.

    kubectl get pv <pv-name> -o yaml | grep -A 5 nodeAffinity
    kubectl describe pod <es-pod> -n elastic-system | grep -A 3 "Node-Selectors"
    

  4. Label selector mismatch: If the PVC uses a selector with matchLabels, the PV must have those exact labels.

    kubectl get pvc <name> -o yaml | grep -A 5 selector
    kubectl get pv <pv-name> --show-labels
    
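For the AZ constraint in cause 3, a common mitigation is delayed volume binding, so the EBS volume is created in whichever AZ the pod is scheduled into. A sketch (class name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait                 # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer  # provision only after the pod is scheduled
```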

Common mistakes:

  • Assuming "Available" PV means it should bind (availability zone constraints still apply)
  • Not checking accessModes compatibility
  • Forgetting that EBS volumes are AZ-locked
  • Not running kubectl describe pvc, which shows the specific binding failure reason in events

Study: k8s-storage/primer.md, Case: PV Stuck Terminating


Q5. Inter-Namespace Communication Failure After Calico Upgrade

Model Answer (Strong):

This is a systematic diagnostic — you need to separate Calico data plane issues from policy enforcement issues:

  1. Check Calico pod health:

    kubectl get pods -n calico-system
    # All calico-node pods should be Running/Ready
    calicoctl node status
    # Should show all nodes as "up" and peering established
    

  2. Test connectivity at the IP level (bypass NetworkPolicy):

    # Get a pod IP in the rabbitmq namespace
    kubectl get pods -n rabbitmq -o wide
    # Exec into a meridian-prod pod and test direct IP connectivity
    kubectl exec -it <app-pod> -n meridian-prod -- curl -v <rabbitmq-pod-ip>:5672
    

  3. Check if NetworkPolicy is the blocker:

    # Temporarily check Calico's policy evaluation
    calicoctl get networkpolicy -n rabbitmq -o yaml
    calicoctl get globalnetworkpolicy -o yaml
    # Look for policies that reference namespace selectors — the Calico upgrade
    # may have changed how namespace labels are evaluated
    

  4. Check Calico iptables rules on the node:

    # SSH to the node running the RabbitMQ pods
    iptables-save | grep -i "cali-" | grep DROP
    # Look for DROP rules targeting cross-namespace traffic
    

  5. Most likely root cause: The Calico upgrade changed the default iptablesBackend from Legacy to NFT (or vice versa), or the upgrade reset the FelixConfiguration which includes defaultEndpointToHostAction. Check:

    calicoctl get felixconfiguration default -o yaml
    

Common mistakes:

  • Assuming it is a NetworkPolicy issue without testing raw IP connectivity first
  • Not checking Calico's own health and peering status
  • Forgetting that Calico has GlobalNetworkPolicy in addition to namespace-scoped policies
  • Not reviewing the Calico upgrade changelog for breaking changes

Study: k8s-networking/primer.md, Case: CNI Broken After Restart


Q6. ArgoCD OutOfSync But Pods Look Fine

Model Answer (Strong):

3 specific causes:

  1. Metadata drift: ArgoCD compares the full manifest, not just the spec. Common culprits:

     • Labels or annotations added by admission controllers, OPA, or other mutating webhooks
     • kubectl.kubernetes.io/last-applied-configuration annotation mismatch
     • Resource fields that Kubernetes adds by default (e.g., strategy.rollingUpdate defaults)

    argocd app diff auth-service --local /path/to/chart
    # Shows the exact fields that differ

  2. Helm value normalization: Helm templates may produce YAML that Kubernetes normalizes differently (e.g., cpu: "1" vs cpu: 1000m, or empty map {} vs absent field).

    argocd app get auth-service -o yaml | grep -A 10 "sync"
    # Check the diff details in the ArgoCD UI — it highlights the exact line

  3. Resource managed by ArgoCD but also modified externally: Someone ran kubectl edit or kubectl apply directly, adding fields that are not in the Git source.

    kubectl get deployment auth-service -n meridian-prod -o yaml > live.yaml
    helm template auth-service devops/helm/auth-service -f values-prod.yaml > expected.yaml
    diff live.yaml expected.yaml
    

To resolve: if the drift is benign metadata, add the specific field to ArgoCD's ignoreDifferences in the Application spec. If it is a real change, either commit it to Git or argocd app sync auth-service to force the Git version.
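
A minimal ignoreDifferences stanza for benign metadata drift might look like the following (the jsonPointer path is illustrative; point it at the field that actually drifts):

```yaml
# In the ArgoCD Application spec
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/template/metadata/annotations   # e.g., webhook-injected annotations
```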

Common mistakes:

  • Clicking "Sync" without understanding what is different (could revert intentional changes)
  • Not using argocd app diff to see the exact drift
  • Confusing OutOfSync with Degraded (different conditions)
  • Not knowing about ignoreDifferences for expected drift

Study: argocd-gitops/primer.md, argocd-gitops/street_ops.md


Q7. Helm Rollback for Order Service

Model Answer (Strong):

# Check current and previous revisions
helm history order-service -n meridian-prod

# Roll back to the previous revision
helm rollback order-service <previous-revision> -n meridian-prod --wait --timeout=5m

Risks specific to the Order Service:

  1. Database migrations are forward-only. If the bad release included a DB migration (e.g., new column, changed index), rolling back the code means the new code schema expectations no longer match. You must check:

    # Check if a migration ran
    kubectl logs -n meridian-prod -l app=order-service --previous | grep -i migration
    
    If a migration ran, you need a forward migration to undo it, not a Helm rollback of the code.

  2. RabbitMQ message format: If the new version changed the message schema published to RabbitMQ, in-flight messages in the queue may be in the new format. The rolled-back version may fail to process them. Check the DLQ after rollback:

    # Check RabbitMQ management UI or Prometheus metrics
    rabbitmq_queue_messages{queue="orders.events.dlq"}
    

  3. Redis cached data: If the new version cached data in a different format, the rolled-back version may fail to deserialize it. Consider flushing relevant cache keys:

    kubectl exec -it <redis-pod> -- redis-cli --scan --pattern "order:*" | head -20
    # --scan iterates incrementally; KEYS would block Redis on a large keyspace
    

  4. ArgoCD sync: After a Helm rollback, ArgoCD will see the live state as drifting from Git (which still has the new version). You need to either revert the Git commit too, or ArgoCD will try to re-deploy the bad version on next sync.

Verification:

kubectl rollout status deployment/order-service -n meridian-prod
kubectl get pods -n meridian-prod -l app=order-service
# Check error rate in Grafana: should drop back to baseline within 2-3 minutes

Common mistakes:

  • Rolling back Helm without considering database state
  • Forgetting to update the GitOps repo (ArgoCD will re-sync the bad version)
  • Not checking RabbitMQ DLQ for messages stuck in the new format
  • Not communicating to the team that a rollback happened

Study: helm/primer.md, helm/footguns.md, argocd-gitops/street_ops.md


Q8. ResourceQuota Blocking Deployment

Model Answer (Strong):

# 1. See which quotas exist and their utilization
kubectl describe resourcequota -n meridian-prod
# Output shows: Used / Hard for cpu, memory, pods, services, etc.

# 2. Check the specific error
kubectl get events -n meridian-prod --field-selector reason=FailedCreate --sort-by='.lastTimestamp'
# Shows: "exceeded quota: <quota-name>, requested: cpu=500m, used: 3500m, limited: 4000m"

# 3. See what is consuming the quota
kubectl top pods -n meridian-prod --sort-by=cpu
kubectl get pods -n meridian-prod -o custom-columns=NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory

Options to resolve without increasing the quota:

  1. Right-size existing deployments: Some services may have over-provisioned requests. Check actual usage vs requests in Grafana and reduce requests for services with significant headroom.

  2. Reduce replica counts for non-critical services: If Worker or Report Service can temporarily run with fewer replicas.

  3. Check for orphaned resources: Completed Jobs, failed pods not cleaned up, dangling ReplicaSets from previous deployments.

    kubectl get pods -n meridian-prod --field-selector status.phase=Failed
    kubectl get replicasets -n meridian-prod | grep "0         0"
    

  4. Use LimitRange defaults: If the new deployment does not specify requests, the LimitRange defaults may be larger than needed.
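
The LimitRange defaults mentioned in option 4 come from an object like this (values illustrative); if the new deployment omits requests, these are what it is charged against the quota:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits          # illustrative name
  namespace: meridian-prod
spec:
  limits:
    - type: Container
      defaultRequest:           # used when a container omits resources.requests
        cpu: 100m
        memory: 128Mi
      default:                  # used when a container omits resources.limits
        cpu: 500m
        memory: 512Mi
```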

Common mistakes:

  • Not checking both requests and limits quotas (they are separate)
  • Forgetting that init containers also consume quota during startup
  • Not looking at sidecar containers (e.g., PgBouncer, Vault agent), which also count
  • Assuming the quota needs to increase rather than investigating utilization

Study: k8s-ops/primer.md, Case: Resource Quota Blocking Deploy


Q9. Inventory Service Cannot Reach RabbitMQ

Model Answer (Strong):

3 most likely causes:

  1. NetworkPolicy with incorrect namespace selector: The RabbitMQ namespace has a NetworkPolicy that allows ingress from meridian-prod, but the Inventory Service pods may have different labels than the other services that work.

    kubectl get networkpolicy -n rabbitmq -o yaml
    # Check the podSelector and namespaceSelector
    # Compare labels on Inventory Service pods vs Order Service pods (which works)
    kubectl get pods -n meridian-prod -l app=inventory-service --show-labels
    kubectl get pods -n meridian-prod -l app=order-service --show-labels
    

  2. DNS resolution failure for the RabbitMQ service: The Inventory Service may be using a different hostname or the service name resolution fails for this specific pod.

    kubectl exec -it <inventory-pod> -n meridian-prod -- nslookup rabbitmq.rabbitmq.svc.cluster.local
    kubectl exec -it <inventory-pod> -n meridian-prod -- nc -zv rabbitmq.rabbitmq.svc.cluster.local 5672
    

  3. RabbitMQ virtual host or user permissions: The Inventory Service may be using different credentials or targeting a different vhost than the working services.

    # Check the Inventory Service's RabbitMQ connection string (from Vault/Secret)
    kubectl get secret inventory-service-rabbitmq -n meridian-prod -o jsonpath='{.data.url}' | base64 -d
    # Compare with a working service's connection string
    kubectl get secret order-service-rabbitmq -n meridian-prod -o jsonpath='{.data.url}' | base64 -d
    

Common mistakes:

  • Assuming network connectivity without testing at the pod level
  • Not comparing the working services' config against the broken service's config
  • Forgetting that NetworkPolicies are additive — if there is no policy allowing the Inventory Service's specific labels, it is blocked
  • Not checking the RabbitMQ management interface for connection attempts

Study: k8s-networking/primer.md, Case: Service No Endpoints


Q10. Node NotReady Diagnosis

Model Answer (Strong):

Walk-through from Kubernetes level to OS level:

# 1. Kubernetes level: Check node conditions
kubectl describe node data-node-1
# Look at Conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, Ready
# Look at Events: check for "NodeNotReady" and "NodeStatusUnknown"
# The Reason field tells you: kubelet stopped posting status

# 2. Check if kubelet is running (if you can SSH to the node)
ssh data-node-1
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" --no-pager | tail -50
# Common kubelet failure reasons: certificate expiry, disk pressure, PLEG errors

# 3. Check system resources
free -h                    # Memory exhaustion?
df -h                      # Disk full? (/ and /var/lib/kubelet)
top -bn1 | head -20        # CPU/load issues?

# 4. Check container runtime
systemctl status containerd
crictl ps                  # Are containers still running?
crictl info                # Runtime health

# 5. Check networking
ip link show               # Are interfaces up?
ip route show              # Default route present?
ping -c 3 <api-server-ip>  # Can the node reach the API server?

# 6. Check system logs
dmesg | tail -50           # Kernel messages (OOM, hardware errors, NIC driver issues)
journalctl --since "10 minutes ago" -p err

The most common causes for a node going NotReady 10 minutes ago:

  • Kubelet crashed or was OOM-killed
  • Container runtime (containerd) crash
  • Network partition (node cannot reach API server)
  • Disk pressure triggered eviction and kubelet cannot function
  • Hardware failure (NIC, disk)

Common mistakes:

  • Only checking the Kubernetes level and not SSH-ing to the node
  • Not checking the container runtime separately from kubelet
  • Forgetting that NotReady means the API server has not heard from the kubelet — the node itself may be fine but network partitioned
  • Not checking dmesg for hardware-level issues (especially on data nodes)

Study: k8s-node-lifecycle/primer.md, Case: Node Pressure Evictions, linux-ops/primer.md


Section 2: Observability

Q11. Prometheus Metrics Gap Without TargetDown Alert

Model Answer (Strong):

4 possible explanations:

  1. Pods were restarted/rescheduled during the gap. If all Order Service pods restarted simultaneously (e.g., node drain, OOM kills), there was a window with no /metrics endpoint. But TargetDown should have fired. Unless...

  2. TargetDown alert has a for duration longer than the gap. If the TargetDown rule is for: 15m and the gap was 10 minutes, the alert never fired because the condition cleared before the for duration elapsed.

    kubectl get prometheusrules -n monitoring -o yaml | grep -A 10 "TargetDown"
    # Check the "for" field
    

  3. Prometheus itself restarted during the gap. If the Prometheus pod was restarted (OOM, node drain), it would not have scraped anything during its downtime, and it cannot alert on its own unavailability.

    kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o wide
    kubectl describe pod <prometheus-pod> -n monitoring | grep -A 5 "Last State"
    # Check if there was a restart in the last 30 minutes
    

  4. Scrape target relabeling changed. A config change to the ServiceMonitor or PodMonitor dropped the Order Service targets. The target disappeared cleanly (not "down" — just gone), so TargetDown did not fire.

    # Check current targets in Prometheus UI: /targets
    # Look for the Order Service target — is it present?
    kubectl get servicemonitor -n meridian-prod -l app=order-service -o yaml
    

How to determine which occurred:

  • Check Prometheus's own prometheus_tsdb_head_samples_appended_total — if it also has a gap, Prometheus itself was down
  • Check up{job="order-service"} — if the time series is absent (not 0, but absent), the target was removed
  • Check pod events/restarts for the Order Service during the gap window
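
The down-versus-absent distinction maps to two different PromQL checks (job label follows the naming used above):

```promql
# Target still in the scrape config but failing its scrapes:
up{job="order-service"} == 0

# Target removed from the config entirely (TargetDown cannot fire):
absent(up{job="order-service"})
```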

Common mistakes:

  • Assuming Prometheus is always available (it can be the problem)
  • Not understanding the difference between a target being "down" (up=0) and a target being absent
  • Forgetting that for clauses delay alert firing
  • Not checking the Prometheus UI targets page

Study: prometheus-deep-dive/primer.md, prometheus-deep-dive/footguns.md


Q12. PromQL for 95th Percentile Latency

Model Answer (Strong):

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{
    namespace="meridian-prod",
    app="order-service"
  }[1h])) by (le, path)
)

Key points:

  • histogram_quantile operates on the _bucket metric (not _sum or _count)
  • rate() must be applied before sum() to handle counter resets
  • The [1h] range matches the "over the last hour" requirement
  • by (le, path) preserves the le (less-than-or-equal) label required by histogram_quantile and the path label for the per-endpoint breakdown
  • You must include le in the by clause or histogram_quantile will not work
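
For contrast, the _sum/_count form yields the mean latency, not p95; the shape is worth knowing because it is the common wrong answer (same metric naming convention assumed):

```promql
  sum(rate(http_request_duration_seconds_sum{namespace="meridian-prod", app="order-service"}[1h])) by (path)
/
  sum(rate(http_request_duration_seconds_count{namespace="meridian-prod", app="order-service"}[1h])) by (path)
```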

Common mistakes:

  • Using _sum / _count for average instead of percentile (gives mean, not p95)
  • Forgetting to include le in the by clause
  • Applying sum before rate (incorrect with counter metrics)
  • Using irate instead of rate (irate uses only the last two samples, noisy for hourly windows)
  • Not filtering by namespace (would include staging metrics if both clusters report to the same Prometheus)

Study: prometheus-deep-dive/primer.md, prometheus-deep-dive/street_ops.md


Q13. Grafana Shows "No Data" for Loki Panels

Model Answer (Strong):

Troubleshooting path from Grafana back to Loki:

  1. Grafana data source configuration:

    Grafana → Configuration → Data Sources → Loki
    Click "Save & Test" → does it show "Data source connected"?
    If not: check the URL, authentication, and network connectivity from Grafana pods to Loki.
    

  2. Loki health check:

    kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
    # Check all pods are Running (read, write, backend in scalable mode)
    kubectl exec -it <grafana-pod> -n monitoring -- curl -s http://loki-read.monitoring:3100/ready
    # Should return "ready"
    

  3. Loki ingestion check:

    # Check if Loki is receiving logs
    kubectl exec -it <grafana-pod> -n monitoring -- curl -s http://loki-read.monitoring:3100/loki/api/v1/labels
    # Should return labels like namespace, pod, app
    # If empty: logs are not being ingested
    

  4. Log shipper (Promtail/Fluentbit) health:

    kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
    # DaemonSet should have one pod per node, all Running
    kubectl logs <promtail-pod> -n monitoring --tail=20
    # Check for errors sending to Loki
    

  5. S3 backend connectivity: Loki stores chunks in S3. If S3 access is broken:

    kubectl logs <loki-write-pod> -n monitoring --tail=50 | grep -i "error\|s3\|storage"
    

  6. Time range mismatch: Grafana may be querying a time range outside Loki's retention. Check the panel's time picker and Loki's retention config (30 days).

Common mistakes:

  • Not testing the data source connection in Grafana first
  • Forgetting that Loki uses separate read and write pods (the read path may be broken while write is fine)
  • Not checking the log shipper (Promtail/Fluentbit) — Loki does not pull logs, they must be pushed
  • Assuming the labels have not changed (a Promtail config change could rename labels, breaking saved queries)

Study: logging/primer.md, log-pipelines/primer.md, Case: Disk Full Runaway Logs Loki


Q14. Alert Fired But On-Call Not Paged

Model Answer (Strong):

5 most likely failure points in the pipeline:

Prometheus → Alertmanager → PagerDuty integration → PagerDuty routing → PagerDuty notification

  1. Alertmanager did not receive the alert: Prometheus may have fired the alert internally but failed to send it to Alertmanager.

    # Check Prometheus alerts page: /alerts
    # The alert shows as "firing" but check the Alertmanager status
    kubectl logs <prometheus-pod> -n monitoring | grep -i alertmanager
    # Look for "send" errors
    

  2. Alertmanager routing sent it to the wrong receiver: The routing tree may have matched a catch-all route (e.g., Slack info) before the PagerDuty route.

    # Check the Alertmanager config
    kubectl get secret alertmanager-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
    # Trace the label matchers in the routing tree
    # Use amtool: amtool config routes test --config.file=alertmanager.yaml severity=critical service=order-service
    

  3. Alertmanager silenced or inhibited the alert:

    # Check active silences
    kubectl exec -it <alertmanager-pod> -n monitoring -- amtool silence query
    # Check inhibition rules — a node-level alert may inhibit pod-level alerts
    

  4. PagerDuty integration failure: The webhook to PagerDuty may be failing (expired API key, wrong integration key, network issue).

    kubectl logs <alertmanager-pod> -n monitoring | grep -i "pagerduty\|error\|fail"
    # Look for HTTP 4xx/5xx responses from PagerDuty API
    

  5. PagerDuty schedule/escalation misconfiguration: The alert reached PagerDuty but the on-call schedule has the wrong person, or the notification rules are set to "email only" instead of push/SMS/call.

     • Check the PagerDuty incident log: the incident may exist but was not escalated
     • Check the on-call schedule: the primary may have an expired override

Common mistakes:

  • Not checking Alertmanager silences (someone may have silenced during maintenance and forgotten to remove it)
  • Not tracing the routing tree manually (route order matters; first match wins)
  • Assuming PagerDuty is always reachable (check webhook delivery logs in PagerDuty)
  • Not knowing about inhibition rules

Study: alerting-rules/primer.md, observability-deep-dive/primer.md


Q15. Adding a Custom Metric

Model Answer (Strong):

Full path from code to dashboard:

  1. Instrument the code (Python/FastAPI with prometheus-client):

    from prometheus_client import Counter
    
    orders_processed = Counter(
        'orders_processed_total',
        'Total orders processed',
        ['tenant_id']  # CAUTION: see cardinality note below
    )
    
    # In the order processing handler:
    orders_processed.labels(tenant_id=tenant.id).inc()
    

  2. Expose via existing /metrics endpoint (already configured in the app via prometheus-client).

  3. Verify Prometheus scrapes it:

    kubectl port-forward svc/order-service -n meridian-prod 8080:8080
    curl localhost:8080/metrics | grep orders_processed
    

  4. Create Grafana dashboard panel:

    sum(rate(orders_processed_total{namespace="meridian-prod"}[5m])) by (tenant_id)
    

  5. Cardinality concern: With 400 active tenants, the tenant_id label creates 400 time series per metric. This is manageable. But if the metric had labels like tenant_id x order_type x payment_method, the cardinality could explode to 400 x 5 x 4 = 8,000 series from one metric. The general rule: keep label cardinality under 1,000 per metric.

If cardinality is a concern, use a recording rule to pre-aggregate:

- record: orders_processed:rate5m:total
  expr: sum(rate(orders_processed_total[5m]))
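
The cardinality arithmetic from point 5, checked explicitly (label counts are the figures used in the answer):

```python
# Time-series count multiplies across label dimensions.
tenants, order_types, payment_methods = 400, 5, 4

single_label_series = tenants                                 # tenant_id only
multi_label_series = tenants * order_types * payment_methods  # all three labels

print(single_label_series)  # 400: within the ~1,000-series rule of thumb
print(multi_label_series)   # 8000: well past it, from a single metric
```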

Common mistakes:

  • Adding high-cardinality labels (user ID, order ID, etc.) — this kills Prometheus
  • Not understanding that the metric needs to be exposed at the /metrics endpoint
  • Forgetting that Prometheus pulls (scrapes); it does not receive pushes
  • Creating a Gauge when the metric should be a Counter (or vice versa)
  • Not considering the ServiceMonitor/PodMonitor configuration (Prometheus may not be scraping the right port)

Study: prometheus-deep-dive/primer.md, monitoring-fundamentals/primer.md


Q16. Incomplete Traces in Tempo

Model Answer (Strong):

4 likely causes for the Fulfillment Service span being missing:

  1. Context propagation broken: The Fulfillment Service is not receiving or extracting the trace context headers from the incoming RabbitMQ message. Unlike HTTP (where headers are automatic with OpenTelemetry auto-instrumentation), message queues require explicit propagation.

    Check: Does the Order Service inject trace context into RabbitMQ message headers?
    Check: Does the Fulfillment Service extract trace context from the message?
    RabbitMQ instrumentation is NOT automatic — it needs the OTel messaging semantic conventions.
    

  2. Sampling dropped the span: Tempo uses 10% head-based sampling for normal traffic. The API Gateway and Order Service spans were sampled (same trace ID), but if the Fulfillment Service made its own sampling decision independently, it may have dropped it.

    Fix: Ensure sampling is done at the head (API Gateway) and propagated —
    downstream services should respect the parent's sampling decision.
    

  3. OTel collector pipeline issue: The Fulfillment Service may be sending spans to a different OTel collector endpoint, or the collector may be dropping spans due to queue overflow.

    kubectl logs <otel-collector-pod> -n monitoring | grep -i "dropped\|error\|fulfillment"
    

  4. Clock skew between nodes: If the Fulfillment Service's node has clock drift, the span timestamps may fall outside Tempo's search window even though the trace ID is correct.

    # Check NTP on the node running Fulfillment Service
    timedatectl status
    chronyc tracking
    
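The manual propagation that cause 1 calls for amounts to carrying a W3C traceparent header in the message properties. A stdlib-only sketch of the idea (illustrative; a real service would use OpenTelemetry's propagation API rather than hand-rolling this):

```python
# Publisher side: write the current trace context into the RabbitMQ
# message headers so the consumer can continue the same trace.
def inject_traceparent(headers: dict, trace_id: str, span_id: str) -> dict:
    # W3C format: version-traceid-spanid-flags; "01" marks the trace as sampled,
    # and downstream services should respect that decision instead of re-sampling.
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers

# Consumer side: extract the context so the Fulfillment span joins the trace.
def extract_traceparent(headers: dict):
    value = headers.get("traceparent")
    if value is None:
        return None  # missing header means broken propagation: an orphaned span
    _version, trace_id, parent_span_id, flags = value.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "sampled": flags == "01"}

headers = inject_traceparent({}, trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
                             span_id="00f067aa0ba902b7")
ctx = extract_traceparent(headers)
print(ctx["trace_id"])  # same trace ID on both sides of the queue
```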

Common mistakes:

  • Assuming auto-instrumentation covers RabbitMQ (it does not for most languages without explicit plugin)
  • Not understanding head-based vs tail-based sampling implications
  • Checking only the application logs and not the OTel collector logs
  • Not considering clock skew as a cause of missing spans

Study: tracing/primer.md, opentelemetry/primer.md


Q17. Suspected Memory Leak

Model Answer (Strong):

  1. Confirm the pattern in Grafana:

    container_memory_working_set_bytes{namespace="meridian-prod", pod=~"report-service.*"}
    
    Look for a steady upward slope over days without flattening (classic leak pattern vs normal cache growth which levels off).

  2. Check if the container has been OOM-killed before:

    kubectl get pods -n meridian-prod -l app=report-service -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
    kubectl describe pod <report-pod> -n meridian-prod | grep -A 3 "Last State"
    

  3. On the node — check process-level memory:

    # SSH to the node
    # Find the container's PID
    crictl ps | grep report-service
    crictl inspect <container-id> | jq '.info.pid'
    
    # Check process memory maps
    cat /proc/<pid>/smaps_rollup
    pmap -x <pid> | tail -5
    
    # Check cgroup memory stats
    cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.stat
    

  4. Inside the container — language-specific profiling:

    # For Python (Report Service): tracemalloc is in the standard library,
    # no pip install needed. Enable it at startup (PYTHONTRACEMALLOC=1) or in code.
    # If the service already exposes a debug endpoint:
    kubectl exec -it <report-pod> -n meridian-prod -- curl localhost:8080/debug/memory
    # Take two snapshots 10 minutes apart and diff
    
    # Alternatively, py-spy dumps thread stacks (shows what the process is doing, not heap contents):
    kubectl exec -it <report-pod> -n meridian-prod -- py-spy dump --pid 1
    

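Since the Report Service is Python, the snapshot-diff workflow can be sketched with the stdlib tracemalloc module (a minimal sketch; the bytearray loop stands in for whatever is leaking):

```python
import tracemalloc

tracemalloc.start()
snap1 = tracemalloc.take_snapshot()

# Simulate a leak: objects retained in a module-level list
leak = []
for _ in range(1000):
    leak.append(bytearray(1024))

snap2 = tracemalloc.take_snapshot()

# Which code paths grew between the two snapshots?
top = snap2.compare_to(snap1, "lineno")
for stat in top[:3]:
    print(stat)
```

In production you would take the two snapshots ~10 minutes apart inside the running process; the top entries point at the allocation site that keeps growing.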
  5. Determine urgency:

    Current usage / Memory limit = utilization %
    Growth rate = (current - yesterday) / 24h
    Time to OOM = (limit - current) / growth rate
    
    If time-to-OOM is less than 24h, restart the pod now and investigate in staging.
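The urgency arithmetic above can be wrapped in a quick helper (a minimal sketch; the byte values are invented):

```python
def hours_to_oom(current_bytes, limit_bytes, yesterday_bytes):
    """Estimate hours until the container hits its memory limit,
    assuming the last 24h growth rate continues linearly."""
    growth_per_hour = (current_bytes - yesterday_bytes) / 24.0
    if growth_per_hour <= 0:
        return float("inf")  # not growing: no projected OOM
    return (limit_bytes - current_bytes) / growth_per_hour

GiB = 1024 ** 3
# Invented numbers: 3.2 GiB now, 2.0 GiB a day ago, 4 GiB limit
eta = hours_to_oom(3.2 * GiB, 4 * GiB, 2.0 * GiB)
print(f"{eta:.1f}h to projected OOM")  # under 24h: restart now, debug in staging
```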

Common mistakes: - Confusing container_memory_usage_bytes with container_memory_working_set_bytes (the former includes cache) - Not checking the cgroup memory stats (Grafana may not show the full picture) - Jumping to code profiling before confirming the leak pattern with data - Not knowing that Python has garbage collection but can still leak via circular references or C extension memory

Study: linux-memory-management/primer.md, linux-performance/primer.md, continuous-profiling/primer.md


Q18. Metrics Cardinality Explosion

Model Answer (Strong):

  1. Identify the offending metric:

    # Count time series per metric name
    sort_desc(count by (__name__)({job="kubernetes-pods"}))
    # Compare against last week's snapshot to find the grower
    
    # Or use the TSDB status page: /api/v1/status/tsdb
    # Shows "seriesCountByMetricName" and "labelValueCountByLabelName"
    

  2. Find the high-cardinality label:

    # For the offending metric, count unique label values
    count(count by (some_label)(offending_metric_name))
    # Test each label until you find the one with unexpectedly many values
    
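To make "cardinality = unique label-value combinations" concrete, a toy sketch (the label sets are invented):

```python
def cardinality(series):
    """Number of unique label-value combinations = number of time series."""
    return len({tuple(sorted(s.items())) for s in series})

# Invented label sets for one metric:
series = [
    {"path": "/orders", "status": "200", "pod": "report-7f9c"},
    {"path": "/orders", "status": "200", "pod": "report-8a1d"},  # pod churned
    {"path": "/orders", "status": "500", "pod": "report-7f9c"},
]
print(cardinality(series))  # 3

# Dropping the churning label (what labeldrop does) collapses series:
without_pod = [{k: v for k, v in s.items() if k != "pod"} for s in series]
print(cardinality(without_pod))  # 2
```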

  3. Common causes in this architecture:

  - A developer added a label like request_id, user_id, or order_id to a metric
  - Dynamic pod names being used as label values (each restart creates a new series)
  - A service started exporting go_* or python_* runtime metrics with high-cardinality process labels

  4. Fix options:

  - Drop the high-cardinality label at scrape time using metric_relabel_configs:
    # In the ServiceMonitor: labeldrop removes the label but keeps the metric
    metricRelabelings:
      - regex: 'request_id'
        action: labeldrop
    
  - Fix at source: Remove the high-cardinality label from the application code
  - Use recording rules to pre-aggregate and drop the raw metric
  - Set a per-target series limit in the scrape config:
    sample_limit: 1000
    

Common mistakes: - Not knowing about the TSDB status endpoint - Not understanding that "cardinality" means unique combinations of label values, not just one label - Dropping the entire metric instead of just the problematic label - Not realizing that old time series (from deleted pods) are kept until Prometheus retention expires

Study: prometheus-deep-dive/footguns.md, prometheus-deep-dive/street_ops.md


Section 3: Networking

Q19. Asia Latency — CDN, DNS, or Backend?

Model Answer (Strong):

Systematic isolation approach:

  1. Tempo trace comparison: Pull a trace from an Asian user and a US user for the same API endpoint. The backend processing time (visible as span durations) should be the same. The difference is in the network layers that Tempo does not capture.

  2. DNS resolution time:

    # From an Asian vantage point (or use a tool like dig from multiple regions)
    dig api.meridian.io +stats
    # Look at "Query time" — should be <50ms with Route53
    # If slow: DNS is the issue (check Route53 latency-based routing, or missing EDNS)
    

  3. CDN cache hit ratio:

    # Check CloudFront access logs or CloudFront metrics in CloudWatch
    # For API calls (not cacheable), CDN adds latency without benefit
    # If API calls are routing through the CDN: the CDN PoP may be far from the backend
    # Check the X-Cache header: curl -I https://api.meridian.io/health
    

  4. TLS handshake overhead:

    curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.meridian.io/health
    # From Asia: if TLS is ~1s but connect is ~300ms, the extra 700ms is TLS round-trips
    # TLS 1.3 = 1 RTT, TLS 1.2 = 2 RTTs. Each RTT Asia→US-East is ~200-300ms
    

  5. Network path:

    mtr -rw api.meridian.io
    # Shows each hop and latency — identifies if traffic is taking an inefficient route
    

Root cause is likely: TLS handshake (multiple round trips at 200ms+ RTT) combined with lack of a regional CDN PoP or backend endpoint. Fix: enable TLS 1.3 (1 RTT instead of 2), deploy CDN with SSL termination at the edge, or add an API endpoint in a closer region.
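The RTT arithmetic behind this conclusion can be sketched (a rough model; the 250ms RTT and 50ms backend time are assumptions):

```python
def first_byte_ms(rtt_ms, tls_rtts, backend_ms=50):
    """Rough time-to-first-byte: TCP handshake (1 RTT) + TLS handshake
    (1 RTT for TLS 1.3, 2 for TLS 1.2) + request/response (1 RTT) + backend."""
    return rtt_ms * (1 + tls_rtts + 1) + backend_ms

rtt = 250  # assumed Asia -> us-east-1 round trip
print(first_byte_ms(rtt, tls_rtts=2))  # TLS 1.2: 1050 ms
print(first_byte_ms(rtt, tls_rtts=1))  # TLS 1.3: 800 ms, one full RTT saved
```

The same request from a US user (rtt around 20ms) lands near 130ms either way, which is why the gap only shows up for distant users.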

Common mistakes: - Blaming the backend without checking trace data - Not using curl -w timing breakdown (guessing instead of measuring) - Not understanding that TLS adds multiple round trips proportional to latency - Forgetting that the CDN may not help for API calls (only static content)

Study: dns-deep-dive/primer.md, tls/primer.md, networking-troubleshooting/street_ops.md


Q20. TLS Certificate Renewal (Let's Encrypt / DNS-01)

Model Answer (Strong):

  1. Check the Certificate resource status:

    kubectl get certificate -n ingress -l app=ingress-nginx
    kubectl describe certificate api-meridian-io-tls -n ingress
    # Look at: Status, Ready condition, Events
    # "OrderFailed" means the ACME order did not complete
    

  2. Check the CertificateRequest and Order:

    kubectl get certificaterequest -n ingress --sort-by='.metadata.creationTimestamp'
    kubectl describe order <order-name> -n ingress
    # The Order status will show which Challenge failed
    

  3. Check the Challenge:

    kubectl get challenges -n ingress
    kubectl describe challenge <challenge-name> -n ingress
    # For DNS-01: check if the TXT record was created in Route53
    

  4. Common DNS-01 failure causes:

  - IAM permissions: The cert-manager pod's IRSA role lost access to modify Route53.
    kubectl logs -n cert-manager -l app=cert-manager --tail=50 | grep -i "route53\|access\|denied"
    
  - Wrong hosted zone: cert-manager is trying to create the TXT record in the wrong Route53 hosted zone.
  - Propagation timeout: The DNS TXT record was created but Let's Encrypt's resolver cannot see it yet (propagation delay).
  - Rate limiting: Let's Encrypt has rate limits (50 certificates per registered domain per week). Check the Order failure reason.

  5. Manual fix if needed:

    # Delete the failed order to allow retry
    kubectl delete order <order-name> -n ingress
    # cert-manager will create a new order automatically
    
    # If the ClusterIssuer is broken, check its status
    kubectl describe clusterissuer letsencrypt-prod
    

  6. Verify the fix:

    kubectl get certificate api-meridian-io-tls -n ingress -w
    # Watch until Ready: True
    # Verify the cert is actually valid:
    echo | openssl s_client -connect api.meridian.io:443 -servername api.meridian.io 2>/dev/null | openssl x509 -noout -dates
    
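To turn the openssl output into a days-to-expiry check, a small parser sketch (the date in the example is invented; cert-manager should renew well before expiry, so this is only for verification):

```python
from datetime import datetime, timezone

def days_remaining(not_after_line, now=None):
    """Parse the notAfter= line from `openssl x509 -noout -dates`
    and return days until expiry."""
    value = not_after_line.split("=", 1)[1]
    expiry = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).total_seconds() / 86400

line = "notAfter=Jun  1 12:00:00 2025 GMT"  # invented date
ref = datetime(2025, 5, 2, 12, 0, 0, tzinfo=timezone.utc)
print(round(days_remaining(line, now=ref)))  # 30
```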

Common mistakes: - Trying to renew the certificate manually instead of letting cert-manager handle it - Not checking the Challenge resource (the answer is usually there) - Not knowing about Let's Encrypt rate limits - Forgetting to check IRSA/IAM permissions for Route53 access - Panicking and creating a self-signed cert (breaks trust chain)

Study: cert-manager/primer.md, tls-certificates-ops/primer.md, Case: DNS TLS cert-manager


Q21. Ingress 502 for One Service

Model Answer (Strong):

Diagnosis path:

  1. Check Ingress-NGINX logs:

    kubectl logs -n ingress -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep billing
    # Look for: "upstream prematurely closed connection", "connect() failed", "no live upstreams"
    

  2. 5 most common causes of 502 in this scenario:

a. Service has no endpoints: The Billing Service's Service object has no endpoints (label selector mismatch, all pods failing readiness probes).

kubectl get endpoints billing-service -n meridian-prod
# If ENDPOINTS column is <none>: no pods match the selector
kubectl get pods -n meridian-prod -l app=billing-service

b. Readiness probe failures: Pods exist but are not ready (failing readiness probe), so they are removed from endpoints.

kubectl describe pod <billing-pod> -n meridian-prod | grep -A 5 "Readiness"

c. Pod port mismatch: The Service targets port 8080 but the Billing Service container listens on a different port.

kubectl get svc billing-service -n meridian-prod -o yaml | grep -A 3 ports
kubectl get pods -n meridian-prod -l app=billing-service -o jsonpath='{.items[0].spec.containers[0].ports}'

d. Backend timeout: The Billing Service is responding, but too slowly. Ingress-NGINX has a default proxy timeout (60s) that may be exceeded for long billing operations.

kubectl get ingress -n meridian-prod -o yaml | grep -A 5 "billing"
# Check for proxy-read-timeout annotations

e. NetworkPolicy blocking Ingress-NGINX to Billing Service: If the Billing Service has a NetworkPolicy that allows traffic from Kong but not directly from Ingress-NGINX (in cases where the ingress path bypasses Kong).
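The five causes above map to distinct strings in the Ingress-NGINX error log; a toy classifier sketch (patterns are common NGINX upstream errors, not an exhaustive list):

```python
def classify_502(log_line):
    """Map an Ingress-NGINX error-log fragment to a likely cause."""
    rules = [
        ("no live upstreams", "Service has no ready endpoints"),
        ("connect() failed", "pod not listening: port mismatch, crash, or NetworkPolicy"),
        ("upstream prematurely closed", "app closed the connection mid-request"),
        ("upstream timed out", "backend slower than proxy-read-timeout"),
    ]
    for pattern, cause in rules:
        if pattern in log_line:
            return cause
    return "unknown: read the full error log line"

print(classify_502("connect() failed (111: Connection refused) while connecting to upstream"))
```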

Common mistakes: - Assuming 502 means the service is down (it means the upstream is unreachable or returning errors) - Not checking endpoints (the most common cause) - Not reading Ingress-NGINX access logs (they show the exact upstream error) - Forgetting about readiness probes removing pods from endpoints

Study: k8s-services-and-ingress/primer.md, nginx-web-servers/primer.md, Case: Service No Endpoints


Q22. In-Flight Requests During Rolling Update

Model Answer (Strong):

The full sequence:

  1. New pod starts: Kubernetes creates a new pod (maxSurge=1). The new pod begins running but is not in the Service endpoints until its readiness probe passes.

  2. Readiness probe passes: The new pod is added to the Endpoints object. Ingress-NGINX's upstream list is updated to include the new pod.

  3. Old pod receives SIGTERM: Kubernetes sends SIGTERM to the old pod and simultaneously removes it from the Endpoints object.

  4. Race condition window: There is a brief window between the pod being removed from endpoints and Ingress-NGINX learning about it. During this window:

  - New connections may still be routed to the old pod by Ingress-NGINX (which has not updated its upstream list yet)
  - The old pod should continue serving in-flight requests during terminationGracePeriodSeconds

  5. Graceful shutdown in the Order Service:

  - The application receives SIGTERM
  - It should stop accepting new connections but finish processing in-flight requests
  - It has terminationGracePeriodSeconds (default 30s, should be set longer for order processing) to complete
  - After the grace period, Kubernetes sends SIGKILL

  6. Ingress-NGINX behavior:

  - NGINX detects the endpoint removal and removes the upstream from its config
  - Any new request to the removed upstream gets retried to another upstream (if proxy-next-upstream is configured)
  - In-flight requests to the old pod complete normally (the TCP connection is already established)

  7. Best practice for zero-downtime:

  - Add a preStop lifecycle hook with a small sleep (5-10s) to delay SIGTERM, giving Ingress-NGINX time to update:
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]
    
  - Set terminationGracePeriodSeconds to at least 60s for order processing
  - Ensure the application handles SIGTERM gracefully
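The SIGTERM handling can be sketched in Python (a minimal sketch; a real service would also stop its HTTP listener and wait for in-flight handlers to drain):

```python
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Flip a flag instead of exiting: stop accepting new work, let in-flight
    # requests finish, then exit before SIGKILL ends the grace period.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the kubelet signalling the container's main process:
os.kill(os.getpid(), signal.SIGTERM)
print("draining:", shutting_down)  # draining: True
```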

Common mistakes: - Not understanding the race condition between endpoint removal and Ingress-NGINX upstream update - Assuming SIGTERM immediately kills the process (it is a signal the application should handle) - Not knowing about the preStop hook technique to prevent dropped connections - Confusing readiness probes (new pod) with the shutdown process (old pod) - Setting terminationGracePeriodSeconds too short for the application's drain time

Study: k8s-pods-and-scheduling/primer.md, k8s-ops (Probes)/primer.md, k8s-services-and-ingress/primer.md


Q23. MTU Issue with Elasticsearch Bulk Indexing

Model Answer (Strong):

Detection:

  1. Symptom pattern: Small requests work (< 1400 bytes), large bulk indexing requests fail. This strongly suggests MTU/PMTU issues where large packets are dropped.

  2. Test with specific packet sizes:

    # From the Search Service pod to an Elasticsearch pod
    kubectl exec -it <search-pod> -n meridian-prod -- ping -M do -s 1400 <es-pod-ip>
    # -M do = don't fragment; -s sets the ICMP payload (total packet = payload + 28 bytes)
    # If 1400 fails but 1300 works, there is an MTU issue
    # Reduce -s until you find the largest payload that passes
    

  3. Check Calico VXLAN overhead:

    Standard ethernet MTU: 1500
    VXLAN overhead: 50 bytes (8 VXLAN header + 8 UDP header + 20 IP header + 14 Ethernet)
    Effective pod MTU with VXLAN: 1450
    
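The overhead arithmetic as a one-line helper:

```python
# VXLAN encapsulation overhead per packet:
VXLAN_OVERHEAD = 20 + 8 + 8 + 14  # outer IP + UDP + VXLAN header + inner Ethernet

def pod_mtu(host_mtu, overhead=VXLAN_OVERHEAD):
    """Largest packet a pod can send without fragmentation on the underlay."""
    return host_mtu - overhead

print(pod_mtu(1500))  # 1450
print(pod_mtu(9000))  # 8950 with jumbo frames
```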

  4. Verify current MTU settings:

    # On the node
    ip link show | grep mtu
    # Check vxlan.calico interface
    ip link show vxlan.calico
    
    # In a pod
    kubectl exec -it <search-pod> -n meridian-prod -- cat /sys/class/net/eth0/mtu
    

  5. The problem: New data nodes may have been added with a different MTU on the host interface (e.g., 1500 instead of 9000 jumbo frames that other nodes use), or the Calico VXLAN MTU was not adjusted to account for the overlay.

Fix:

# Option 1: Set Calico MTU explicitly via the installation resource
kubectl edit installation default
# Set spec.calicoNetwork.mtu to 1450 (or 8950 if using jumbo frames)

# Option 2: For manifest-based installs (no operator), set veth_mtu in the
# calico-config ConfigMap in kube-system, then restart the calico-node DaemonSet
kubectl edit configmap calico-config -n kube-system   # veth_mtu: "1450"

# After changing: pods need to be restarted to pick up the new MTU
# Rolling restart all affected workloads
kubectl rollout restart deployment/search-service -n meridian-prod

Common mistakes: - Using ping without -M do (fragmentation hides the problem) - Not accounting for VXLAN overhead (50 bytes) - Changing only the host MTU without changing Calico's MTU configuration - Forgetting to restart pods after MTU change (existing network interfaces keep old MTU)

Study: mtu/primer.md, Case: MTU Blackhole TLS Stalls


Q24. Slow DNS Resolution in Pods

Model Answer (Strong):

3 most likely causes:

  1. CoreDNS overloaded or unhealthy:

    kubectl get pods -n kube-system -l k8s-app=kube-dns
    kubectl top pods -n kube-system -l k8s-app=kube-dns
    # Check if CoreDNS pods are CPU-throttled or near memory limit
    
    # Check CoreDNS metrics
    kubectl exec -it <prometheus-pod> -n monitoring -- curl -s 'http://localhost:9090/api/v1/query?query=coredns_dns_request_duration_seconds_sum'
    

  2. ndots:5 default causing excessive lookups: By default, Kubernetes sets ndots:5 in /etc/resolv.conf. For queries like rabbitmq.rabbitmq.svc.cluster.local, each dot adds a search domain lookup attempt.

    kubectl exec -it <app-pod> -n meridian-prod -- cat /etc/resolv.conf
    # Shows: search meridian-prod.svc.cluster.local svc.cluster.local cluster.local
    # options ndots:5
    # A query for "redis.example.com" (2 dots < 5) first walks the 3 search domains: 3 failed lookups before the real one
    
    # Fix: Set ndots:2 in the pod spec or use FQDNs (with trailing dot):
    # dnsConfig:
    #   options:
    #     - name: ndots
    #       value: "2"
    
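The ndots behavior can be sketched (simplified glibc resolver logic; the search list is the one from the resolv.conf shown above):

```python
def lookup_order(name, search_domains, ndots=5):
    """Candidate names the glibc resolver tries, in order (simplified).
    With fewer than `ndots` dots, search domains are tried first."""
    if name.endswith("."):
        return [name]  # trailing dot = FQDN, single lookup
    searched = [f"{name}.{d}" for d in search_domains]
    if name.count(".") < ndots:
        return searched + [name]
    return [name] + searched

search = ["meridian-prod.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in lookup_order("redis.example.com", search):
    print(candidate)  # 3 doomed cluster-suffixed lookups before the real name
```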

  3. conntrack table full or UDP race condition: Linux has a known issue with UDP DNS and conntrack where identical DNS queries (A and AAAA sent simultaneously) get source NAT'd to the same port, causing one to be dropped.

    # On the node:
    dmesg | grep "conntrack: table full"
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    
    # Fix for DNS race: use single-request-reopen in dnsConfig:
    # dnsConfig:
    #   options:
    #     - name: single-request-reopen
    

Common mistakes: - Not checking /etc/resolv.conf inside the pod (the search domains and ndots setting are invisible from the node) - Blaming the upstream DNS when the issue is CoreDNS or the ndots setting - Not knowing about the conntrack/UDP race condition (very common in Kubernetes) - Not considering CoreDNS autoscaling (the default 2 replicas may not be enough for large clusters)

Study: dns-deep-dive/primer.md, dns-ops/primer.md, Case: CoreDNS Timeout Pod DNS


Q25. NetworkPolicy Blocking Prometheus Scraping

Model Answer (Strong):

Why it broke: The new NetworkPolicy allows ingress only from pods in meridian-prod namespace. Prometheus runs in the monitoring namespace. The RabbitMQ Prometheus exporter runs as a sidecar or separate pod in the rabbitmq namespace. The NetworkPolicy blocks traffic from monitoring namespace, preventing Prometheus from scraping the exporter's /metrics endpoint.

Corrected NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rabbitmq-ingress
  namespace: rabbitmq
spec:
  podSelector: {}  # Apply to all pods in rabbitmq namespace
  policyTypes:
    - Ingress
  ingress:
    # Allow application traffic from meridian-prod namespace
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: meridian-prod
      ports:
        - protocol: TCP
          port: 5672   # AMQP
        - protocol: TCP
          port: 15672  # Management UI
    # Allow Prometheus scraping from monitoring namespace
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9419   # RabbitMQ Prometheus exporter port
        - protocol: TCP
          port: 15692  # RabbitMQ built-in Prometheus metrics

Key points: - NetworkPolicies are additive — multiple ingress rules are OR'd together - The kubernetes.io/metadata.name label is automatically applied to namespaces in modern Kubernetes - You need to explicitly allow the metrics port, not just the AMQP port - The RabbitMQ exporter typically runs on port 9419 or the built-in metrics on 15692

Common mistakes: - Forgetting that NetworkPolicy default deny blocks ALL ingress, including monitoring - Not allowing the specific metrics port (allowing 5672 does not help Prometheus) - Using pod selectors instead of namespace selectors (Prometheus pods have different labels) - Not testing the NetworkPolicy before applying to production

Study: k8s-networking/primer.md, Case: Grafana Empty Prometheus NetworkPolicy


Section 4: Linux & Infrastructure

Q26. Disk Usage at 95% on Kubernetes Node

Model Answer (Strong):

# 1. Quick overview (safe, no IO impact)
df -h /
# Confirms the root filesystem is the problem

# 2. Find largest directories (safe, sequential read)
du -sh /var/lib/* 2>/dev/null | sort -rh | head -10
du -sh /var/log/* 2>/dev/null | sort -rh | head -10

# 3. Check container-specific storage
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ 2>/dev/null
du -sh /var/lib/kubelet/pods/ | sort -rh | head -5

5 common culprits on Kubernetes worker nodes:

  1. Container images: Unused images accumulate. Check with crictl images | wc -l. Clean with crictl rmi --prune (only removes unreferenced images).

  2. Container logs (stdout/stderr): If the container runtime logging driver stores logs on disk (the default), verbose services can fill disk.

    find /var/log/pods/ -name "*.log" -size +100M
    # These are the container stdout/stderr logs that kubectl logs reads
    

  3. emptyDir volumes: Pods with emptyDir volumes write to the node's disk. A misconfigured pod can write unlimited data.

    du -sh /var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/ | sort -rh | head -5
    

  4. Kubelet garbage collection not running: The kubelet has image and container garbage collection settings. If thresholds are set too high, cleanup does not trigger.

    # Check kubelet GC settings
    ps aux | grep kubelet | grep -o "image-gc-high-threshold=[^ ]*"
    

  5. System journal (journald): Accumulates over time if not configured with a size limit.

    journalctl --disk-usage
    # Clean if large: journalctl --vacuum-size=500M
    
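For spot checks where running du against a huge tree is undesirable, a pure-Python equivalent (a sketch; it ignores hardlink dedup and sparse files):

```python
import os
import tempfile

def dir_size_bytes(path):
    """Rough `du -s` equivalent: sum file sizes under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished mid-walk (common under /var/lib/kubelet)
    return total

d = tempfile.mkdtemp()
with open(os.path.join(d, "pod.log"), "w") as fh:
    fh.write("x" * 4096)
print(dir_size_bytes(d))  # 4096
```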

Common mistakes: - Running find / -size +100M which traverses all mount points and can impact IO - Deleting files directly instead of using proper cleanup tools (e.g., truncating active log files) - Not checking container images (the single largest consumer on most nodes) - Ignoring the kubelet eviction thresholds — at 90% the kubelet starts evicting pods automatically

Study: linux-ops/primer.md, disk-and-storage-ops/primer.md, Case: Runaway Logs Fill Disk


Q27. OOM Killer Investigation

Model Answer (Strong):

  1. Find which process was killed:

    dmesg | grep -i "oom\|killed process"
    # Output example:
    # Out of memory: Killed process 4521 (java) total-vm:8388608kB, anon-rss:6291456kB, file-rss:0kB
    
    journalctl -k | grep -i oom
    
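The kernel's OOM line can be parsed programmatically for alerting (the regex matches the example format above; exact fields vary slightly across kernel versions):

```python
import re

OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\).*?anon-rss:(?P<rss_kb>\d+)kB"
)

def parse_oom(line):
    """Extract pid, command name, and anonymous RSS from a dmesg OOM line."""
    m = OOM_RE.search(line)
    if not m:
        return None
    return {"pid": int(m["pid"]), "comm": m["comm"], "rss_kb": int(m["rss_kb"])}

line = "Out of memory: Killed process 4521 (java) total-vm:8388608kB, anon-rss:6291456kB, file-rss:0kB"
print(parse_oom(line))  # {'pid': 4521, 'comm': 'java', 'rss_kb': 6291456}
```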

  2. Determine if it was container-level or system-level:

Container-level OOM (most common in Kubernetes): The container exceeded its resources.limits.memory. The cgroup memory limit was hit.

# Check the pod events
kubectl describe pod <es-pod> -n elastic-system | grep -A 3 "Last State"
# If reason: OOMKilled and exit code 137, it was container memory limit

# Check the cgroup limit
cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.max_usage_in_bytes

System-level OOM: The entire node ran out of memory. All cgroups exhausted.

# Check node memory at the time of the OOM
free -h
# Check if other pods were also killed
kubectl get events --all-namespaces --field-selector reason=OOMKilling --sort-by='.lastTimestamp'

  3. Why Elasticsearch was selected (for system-level OOM):

    # The OOM killer uses oom_score_adj
    # Kubernetes sets different scores: guaranteed pods get -997, best-effort get 1000
    cat /proc/<pid>/oom_score_adj
    # Elasticsearch likely has the highest oom_score because it uses the most memory
    

  4. Fix for Elasticsearch:

  - If container OOM: increase resources.limits.memory (Elasticsearch JVM heap should be 50% of container memory)
  - If system OOM: the node is over-committed. Either reduce total requests across all pods or add nodes
  - Check Elasticsearch JVM settings: -Xms and -Xmx should match and be 50% of the container limit

Common mistakes: - Not distinguishing between container-level OOM (cgroup limit) and system-level OOM (node memory) - Not checking dmesg (the kernel OOM killer details are there, not in application logs) - Blaming the application without checking if the memory limit is appropriate - For Elasticsearch: setting JVM heap equal to the container limit (leaves no room for off-heap memory and filesystem cache)

Study: linux-memory-management/primer.md, Case: OOM Killer Events


Q28. Ansible Playbook Failure Triage

Model Answer (Strong):

  1. Check the retry file:

    # Ansible writes a .retry file for failed hosts when retry_files_enabled
    # is set in ansible.cfg (it is off by default in modern Ansible)
    cat upgrade.retry
    # Contains the 3 hostnames that failed
    

  2. Run only on failed hosts with verbose output:

    ansible-playbook upgrade.yml --limit @upgrade.retry -vvv
    # --limit @file reads hosts from the retry file
    # -vvv gives maximum verbosity to see the exact failure point
    

  3. Investigate the failures:

    # Check if the failure is at a specific task
    ansible-playbook upgrade.yml --limit @upgrade.retry --start-at-task="<failing task name>" -vvv
    
    # Common failure patterns for OS patching:
    # - Package manager lock held (dpkg/yum lock)
    # - Network connectivity to package repo
    # - Disk space insufficient for update
    # - SSH timeout (host unreachable)
    

  4. Ad-hoc diagnostics on failed hosts:

    # Check connectivity
    ansible -i inventory/ failed_host1,failed_host2,failed_host3 -m ping
    
    # Check disk space (ad-hoc commands take the retry file via --limit, not as the host pattern)
    ansible all -i inventory/ --limit @upgrade.retry -m shell -a "df -h /"
    
    # Check if a previous upgrade is still running
    ansible all -i inventory/ --limit @upgrade.retry -m shell -a "fuser /var/lib/dpkg/lock-frontend 2>/dev/null"
    

  5. If 3 of 12 hosts fail consistently: Look for a common factor:

  - Same site/rack? (network issue)
  - Same OS version? (package compatibility)
  - Same hardware model? (firmware/driver issue)

Common mistakes: - Re-running the entire playbook on all 12 hosts (wastes time and risks the 9 successful hosts) - Not checking the retry file - Not using --limit and --start-at-task for targeted debugging - Not checking if the 3 failed hosts share a common attribute

Study: ansible/primer.md, ansible-deep-dive/primer.md, Case: Ansible SSH Agent Firewall


Q29. High Load Average, Low CPU

Model Answer (Strong):

What it indicates: Load average counts processes that are runnable (R) AND in uninterruptible sleep (D state). Load 45 on a 4-core machine with 30% CPU means most of those processes are in D state, waiting on IO, not CPU.
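The R vs D split can be made concrete with a toy breakdown (the `ps -eo stat=` snapshot below is invented):

```python
from collections import Counter

def state_breakdown(stat_fields):
    """First letter of each STAT field (`ps -eo stat=`): R runnable,
    S sleeping, D uninterruptible sleep. R and D both count toward load."""
    return Counter(s[0] for s in stat_fields)

# Invented snapshot from a wedged data node:
stats = ["R"] * 2 + ["D+"] * 41 + ["S"] * 120 + ["Ssl"] * 30
c = state_breakdown(stats)
print(c["R"] + c["D"], "tasks contribute to load; only", c["R"], "need CPU")
```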

3 possible explanations:

  1. IO-bound processes (disk): Many processes waiting on disk reads/writes.

    iostat -xz 1 5
    # Look at %util, await, and avgqu-sz
    # If disk utilization is near 100% or await is high: disk is the bottleneck
    
    vmstat 1 5
    # Check 'b' column (processes blocked on IO) — should be high
    # Check 'wa' column (IO wait %) — should be significant
    

  2. NFS or network filesystem hanging: A mounted NFS share is unresponsive, causing all processes accessing it to enter D state.

    mount | grep nfs
    ls /path/to/nfs/mount  # If this hangs: NFS is the cause
    
    # Check for D-state processes
    ps aux | awk '$8 ~ /D/ {print}'
    # ps aux has no WCHAN column; use ps -eo pid,stat,wchan:32,comm or:
    cat /proc/<pid>/wchan
    

  3. Kernel/driver issue: A device driver or kernel operation is blocking.

    dmesg | tail -50
    # Look for I/O errors, SCSI timeouts, NVMe errors
    
    # Check for task hung messages
    dmesg | grep "hung_task\|blocked for more than"
    

For this specific architecture (data-node): The most likely cause is Elasticsearch performing heavy disk IO (merges, flushes) and the underlying EBS volume hitting its IOPS limit.

# Check EBS IOPS in CloudWatch, or:
cat /sys/block/nvme1n1/stat  # IO stats for the EBS device

Common mistakes: - Assuming high load = high CPU (load includes IO wait) - Not checking the b column in vmstat (blocked processes) - Not checking for D-state processes specifically - Not understanding that NFS hangs can cascade system-wide

Study: linux-performance/primer.md, linux-performance/street_ops.md


Q30. NTP Verification and Failure Impact

Model Answer (Strong):

Verify NTP on all nodes:

# On each node (or via Ansible for all nodes at once):
# For chronyd (default on modern Linux):
chronyc tracking
# Key fields: "System time" offset should be <100ms, "Leap status" should be "Normal"

chronyc sources -v
# Should show reachable NTP servers with low offset

# For systemd-timesyncd:
timedatectl status
# "System clock synchronized: yes" and "NTP service: active"

# Bulk check across all nodes via Ansible:
ansible all -m shell -a "chronyc tracking | grep 'System time'"

Maximum acceptable clock skew: - General: < 100ms for most applications - Vault: Lease durations and token TTLs rely on consistent time. Skew > 1 minute causes premature token expiry or late revocation. - TLS: Certificate validity windows use system time. Skew > 5 minutes can cause certificate validation failures. - Prometheus: Timestamps on metrics must be close to real time. Skew > 1 minute causes out-of-order samples, which Prometheus rejects. - RabbitMQ: Cluster partition healing uses time-based heuristics. Skew > 1 minute can cause split-brain resolution failures.

What breaks first: Typically Vault token validation or TLS certificate checks break first, because they use absolute timestamps. Prometheus breaks next because it rejects out-of-order samples. RabbitMQ cluster health degrades last.
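The skew tolerances above can be collapsed into a checklist helper (threshold values are the rough figures from this answer, not authoritative limits):

```python
# Rough skew tolerances from this answer, in seconds
TOLERANCE_S = {
    "general": 0.1,
    "vault": 60,
    "tls": 300,
    "prometheus": 60,
    "rabbitmq": 60,
}

def at_risk(offset_s):
    """Components whose tolerance the measured chrony offset exceeds."""
    return sorted(name for name, tol in TOLERANCE_S.items() if abs(offset_s) > tol)

print(at_risk(0.02))  # []: healthy node
print(at_risk(90))    # 90s skew: everything except TLS (5 min window) is at risk
```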

Common mistakes: - Not checking NTP at all (assuming cloud instances handle it) - Checking only one node instead of all nodes - Not knowing which components are time-sensitive - Confusing NTP reachability with actual synchronization (a server can be reachable but drifting)

Study: Case: Time Sync Skew Breaks App, Case: HPA Flapping Clock Skew NTP


Q31. Terraform Plan Shows Unexpected RDS Recreation

Model Answer (Strong):

4 possible causes:

  1. Parameter changed outside Terraform: Someone modified the RDS instance via the AWS console (engine version, instance class, storage type), and Terraform sees the drift.

    terraform plan -target=aws_db_instance.main -detailed-exitcode
    # Read the diff carefully — which attribute triggers the replacement?
    # Attributes marked "forces replacement" cause recreation
    

  2. Provider version upgrade changed defaults: A terraform-provider-aws upgrade may have changed default values or resource schema.

    cat .terraform.lock.hcl | grep "provider.*aws" -A 5
    # Compare with the version used in the last successful apply
    terraform plan -out=plan.tfplan
    terraform show -json plan.tfplan | jq '.resource_changes[] | select(.address == "aws_db_instance.main") | .change.actions'
    # ["delete", "create"] confirms replacement; inspect .change.before/.after for the trigger
    

  3. State corruption or manual state edit: The state file was modified or a terraform state mv was done incorrectly.

    terraform state show aws_db_instance.main
    # Does the ID match the actual RDS instance?
    aws rds describe-db-instances --db-instance-identifier <id> --query 'DBInstances[0].DBInstanceIdentifier'
    

  4. Force-new attribute changed: Certain attributes like identifier, engine, availability_zone, or snapshot_identifier force replacement. A seemingly innocent change to a related attribute (like the subnet group) can cascade.

Safe investigation without applying:

# Always use -target to limit scope
terraform plan -target=aws_db_instance.main -no-color > rds_plan.txt

# Check state vs reality
terraform state show aws_db_instance.main > state_view.txt
aws rds describe-db-instances --db-instance-identifier <id> > aws_view.json

# Compare specific force-new attributes
diff <(grep "identifier\|engine\|availability_zone\|snapshot" state_view.txt) \
     <(jq '.DBInstances[0] | {DBInstanceIdentifier, Engine, AvailabilityZone}' aws_view.json)

# Check for drift without modifying anything: -refresh-only plans only state updates
# (plain `terraform refresh` rewrites state immediately; avoid it during investigation)
terraform plan -refresh-only -target=aws_db_instance.main
terraform plan -target=aws_db_instance.main
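The force-new check can be sketched as a dict diff (FORCE_NEW below is an illustrative subset; the authoritative list is in the provider schema, marked "forces replacement" in plan output):

```python
FORCE_NEW = {"identifier", "engine", "availability_zone", "snapshot_identifier"}

def replacement_triggers(state_attrs, planned_attrs):
    """Split changed attributes into replacement-forcing vs in-place updatable."""
    changed = {k for k in state_attrs if planned_attrs.get(k) != state_attrs[k]}
    return sorted(changed & FORCE_NEW), sorted(changed - FORCE_NEW)

state = {"identifier": "orders-db", "engine": "postgres", "instance_class": "db.r6g.large"}
plan = {"identifier": "orders-db", "engine": "postgres", "instance_class": "db.r6g.xlarge"}
print(replacement_triggers(state, plan))      # ([], ['instance_class']): in-place change

plan_bad = dict(plan, identifier="orders-db-v2")
print(replacement_triggers(state, plan_bad))  # identifier change forces recreation
```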

Common mistakes: - Running terraform apply to "see what happens" (this would destroy and recreate the production database) - Not using -target to limit the scope of investigation - Not checking if a provider upgrade changed resource behavior - Panicking and running terraform import without understanding the root cause

Study: terraform/primer.md, terraform-deep-dive/primer.md, Case: Terraform State Lock DynamoDB


Q32. Kernel Parameter Rollout Strategy

Model Answer (Strong):

Safest rollout strategy:

  1. Do not change the host directly if Kubernetes pods need it. For net.core.somaxconn, Kubernetes supports pod-level sysctls that can be set without changing the host:

    # In the Ingress-NGINX pod spec:
    securityContext:
      sysctls:
        - name: net.core.somaxconn
          value: "4096"
    
    However, net.core.somaxconn is a namespaced sysctl in the kernel, so it only affects that pod's network namespace. This is the preferred approach.

  2. If host-level change is required (e.g., the sysctl is not namespaced):

a. Stage 1: Test on one non-production node

# Via Ansible on one node:
sysctl -w net.core.somaxconn=4096  # Temporary, lost on reboot
# Verify: sysctl net.core.somaxconn
# Monitor for 1 hour for any issues

b. Stage 2: Make persistent on the test node

echo "net.core.somaxconn = 4096" >> /etc/sysctl.d/99-kubernetes.conf
sysctl --system  # Reload all sysctl configs

c. Stage 3: Roll out to all nodes via Ansible in batches

ansible-playbook sysctl-rollout.yml --limit "app-workers:&batch1" --check
ansible-playbook sysctl-rollout.yml --limit "app-workers:&batch1"
# Wait, verify, then batch2, batch3...

d. Stage 4: Add to node provisioning template (Terraform user_data or AMI) so new nodes get the setting.
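A minimal sketch of what sysctl-rollout.yml might contain, assuming the ansible.posix collection is installed; the group name and batch size mirror the commands above and are illustrative:

```yaml
# Hypothetical sysctl-rollout.yml
- hosts: app-workers
  become: true
  serial: "25%"          # Ansible's built-in batching; pairs with --limit
  tasks:
    - name: Set and persist net.core.somaxconn
      ansible.posix.sysctl:
        name: net.core.somaxconn
        value: "4096"
        sysctl_file: /etc/sysctl.d/99-kubernetes.conf
        state: present
        reload: true     # apply the value immediately after writing the file
```

The sysctl module both applies the value and writes the persistence file, collapsing stages 1 and 2 into one idempotent task.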

  3. Kubernetes interaction: The kubelet must whitelist unsafe sysctls via its --allowed-unsafe-sysctls flag before pod-level sysctls are accepted:
    # In kubelet config:
    allowedUnsafeSysctls:
      - "net.core.somaxconn"
    
    Only the short allowlist of safe sysctls (e.g. net.ipv4.ip_local_port_range, net.ipv4.tcp_syncookies) skips this step. net.core.somaxconn is namespaced but still classed as unsafe, so it must be whitelisted.

Common mistakes: - Changing all nodes at once (no rollback if it causes issues) - Only setting via sysctl -w without persisting in /etc/sysctl.d/ - Not knowing that Kubernetes supports pod-level sysctls - Not updating the node provisioning template (new nodes will not have the change) - Not knowing the difference between safe and unsafe sysctls in Kubernetes

Study: linux-kernel-tuning/primer.md, ansible/primer.md


Section 5: Security

Q33. Vault Token Expired — External Secrets Operator

Model Answer (Strong):

Immediate fix:

  1. Check the External Secrets Operator (ESO) logs:

    kubectl logs -n vault -l app=external-secrets --tail=50
    # Look for: "permission denied" or "token is expired"
    

  2. Re-authenticate ESO with Vault:

    # ESO uses Kubernetes auth method — the service account token is used to get a Vault token
    # Check the SecretStore or ClusterSecretStore resource
    kubectl get clustersecretstore vault-backend -o yaml
    # Verify the auth path, role, and service account
    
    # Restart the ESO pod to trigger re-authentication
    kubectl rollout restart deployment/external-secrets -n vault
    

  3. Verify secrets are syncing:

    kubectl get externalsecret -n meridian-prod
    # Check the STATUS column — should show "SecretSynced"
    kubectl describe externalsecret auth-service-secrets -n meridian-prod
    

Long-term prevention:

  1. Use Kubernetes auth method (not token auth): Kubernetes auth tokens auto-renew via the service account token. If ESO is using a static Vault token, migrate to Kubernetes auth:

    # ClusterSecretStore with Kubernetes auth:
    apiVersion: external-secrets.io/v1beta1
    kind: ClusterSecretStore
    metadata:
      name: vault-backend
    spec:
      provider:
        vault:
          server: "https://vault.vault.svc:8200"
          path: "secret"
          auth:
            kubernetes:
              mountPath: "kubernetes"
              role: "external-secrets"
              serviceAccountRef:
                name: external-secrets
                namespace: vault
    

  2. Set up Vault token TTL monitoring:

    # Alert when token expiry is within 24 hours
    vault_token_ttl{role="external-secrets"} < 86400
    

  3. Configure automatic token renewal in Vault: The Kubernetes auth backend issues renewable tokens. Ensure the token TTL and max TTL are appropriate:

    vault read auth/kubernetes/role/external-secrets
    # token_ttl should be ~1h, token_max_ttl ~24h, with periodic renewal
    

Common mistakes: - Using a static Vault token for ESO (should use Kubernetes auth for auto-renewal) - Restarting the Auth Service instead of ESO (the Auth Service just reads Kubernetes Secrets created by ESO) - Not checking the ClusterSecretStore status (it shows the auth state) - Manually creating Kubernetes secrets to work around the issue (skips rotation)

Study: hashicorp-vault/primer.md, hashicorp-vault/street_ops.md, secrets-management/primer.md


Q34. Container Running as Root — Fix Without Downtime

Model Answer (Strong):

  1. Assess the current state:

    kubectl exec -it <fulfillment-pod> -n meridian-prod -- id
    # uid=0(root) gid=0(root)
    kubectl exec -it <fulfillment-pod> -n meridian-prod -- ls -la /data/
    # Shows files owned by root
    

  2. Fix the Dockerfile (build pipeline):

    # Add a non-root user
    RUN addgroup --system --gid 1000 appgroup && \
        adduser --system --uid 1000 --ingroup appgroup appuser
    
    # Change ownership of the data directory
    RUN chown -R appuser:appgroup /data/
    
    USER 1000
    

  3. Fix the Kubernetes deployment (add securityContext):

    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      fsGroup: 1000  # This changes ownership of mounted volumes
    

  4. Handle the volume ownership problem: The existing PV has files owned by root. Using fsGroup: 1000 in the pod spec will cause Kubernetes to chown the volume on pod startup. For large volumes this can be slow. Alternative: use an init container:

    initContainers:
      - name: fix-permissions
        image: busybox
        command: ["sh", "-c", "chown -R 1000:1000 /data"]
        volumeMounts:
          - name: data
            mountPath: /data
        securityContext:
          runAsUser: 0  # Init container runs as root to fix permissions
    
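On newer clusters (the field went GA in Kubernetes 1.23), fsGroupChangePolicy can replace the init container for steady-state restarts: the recursive chown only runs when the volume root's ownership does not already match. A sketch:

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown when ownership already matches
```

After the first successful chown, subsequent pod starts skip the walk entirely, which matters on large volumes.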

  5. Deploy without downtime:

  a. Build and push the new image
  b. Update the Helm values with the securityContext
  c. The rolling update replaces pods one at a time
  d. Each new pod runs the init container to fix permissions, then starts as non-root
  e. Old pods continue serving as root until replaced

Common mistakes: - Changing securityContext without fixing file ownership (pod cannot read its own data) - Forgetting fsGroup for mounted volumes - Not testing in staging first (permission issues can be subtle) - Using chown -R on a large volume in the main container (blocks startup)

Study: linux-hardening/primer.md, container-images/primer.md, security-basics/primer.md


Q35. OPA/Gatekeeper Rejecting Deployment

Model Answer (Strong):

  1. Get the full rejection message:

    kubectl apply -f deployment.yaml --dry-run=server 2>&1
    # The error message includes the constraint name and violation details
    

  2. Find the constraint:

    kubectl get constraints
    kubectl describe k8srequiredresourcelimits.constraints.gatekeeper.sh require-resource-limits
    # Shows the constraint spec, match criteria, and parameters
    

  3. Common gotchas:

  a. Init containers: The constraint applies to ALL containers, not just the main one. An init container without resource limits will trigger the violation.
  b. Sidecar containers: Vault agent injector, Envoy proxy, or log shipper sidecars added by mutating webhooks may not have resource limits.
    kubectl get deployment <name> -n staging -o jsonpath='{.spec.template.spec.initContainers[*].name}'
    kubectl get deployment <name> -n staging -o jsonpath='{.spec.template.spec.containers[*].name}'
    # Check ALL containers for resource limits
    
  c. Ephemeral containers: Debug containers added via kubectl debug may not have limits.

  4. Debug the constraint template:

    # Check the Rego policy
    kubectl get constrainttemplate k8srequiredresourcelimits -o jsonpath='{.spec.targets[0].rego}'
    # Read the Rego to understand exactly what it checks
    

  5. Temporary workaround (if urgent):

    # Add an exemption in the constraint (not recommended long-term)
    kubectl edit k8srequiredresourcelimits require-resource-limits
    # Add to spec.match.excludedNamespaces or spec.match.labelSelector
    
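The init/sidecar gotcha can be checked in one pass by walking every container array in the rendered manifest. A hedged sketch with jq; the inline JSON stands in for `kubectl get deployment <name> -n staging -o json`:

```shell
# Sample manifest: an init container without limits plus a main container with limits
manifest='{"spec":{"template":{"spec":{
  "initContainers":[{"name":"wait-for-db","resources":{}}],
  "containers":[{"name":"app","resources":{"limits":{"cpu":"500m","memory":"256Mi"}}}]}}}}'
echo "$manifest" | jq -r '
  .spec.template.spec
  | ((.initContainers // []) + .containers)[]
  | "\(.name)\thas_limits=\(.resources.limits != null)"'
```

Any line printing has_limits=false is a container that will trip the constraint.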

Common mistakes: - Only checking the main container's resource limits (forgetting init containers and sidecars) - Not knowing that mutating webhooks inject sidecars BEFORE Gatekeeper validates (so the developer does not see the sidecar in their YAML) - Disabling the constraint instead of fixing the deployment - Not using --dry-run=server which triggers admission webhooks (client dry-run does not)

Study: policy-engines/primer.md, k8s-rbac/primer.md


Q36. Secret Committed to Git — Incident Response

Model Answer (Strong):

Immediate actions (first 15 minutes):

  1. Revoke the key immediately:

    aws iam delete-access-key --access-key-id AKIA... --user-name <username>
    # Do this BEFORE anything else — the key is compromised the moment it is pushed
    

  2. Generate a new key and rotate:

    aws iam create-access-key --user-name <username>
    # Update the key in Vault (where it should have been all along)
    vault kv put secret/aws/keys access_key=<new> secret_key=<new>
    

  3. Scrub from git history:

    # Remove from the public repo (but the key is already burned)
    # Note: git filter-branch is deprecated upstream; prefer git filter-repo
    # or the BFG below for anything non-trivial
    git filter-branch --force --index-filter \
      "git rm --cached --ignore-unmatch <file-with-secret>" \
      --prune-empty --tag-name-filter cat -- --all
    git push --force --all
    # Or use BFG Repo-Cleaner (faster):
    bfg --replace-text secrets.txt repo.git
    
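Until the scanners in the prevention section are wired up, a crude grep catches the classic long-term access key ID shape (AKIA followed by 16 uppercase alphanumerics). The sample file and AWS's documented example key are illustrative:

```shell
# Create a sample file standing in for repository contents
printf 'aws_access_key_id = AKIAIOSFODNN7EXAMPLE\n' > sample.txt
# Use -r on a real working tree; the AKIA prefix marks long-term IAM user keys
grep -r -n -E 'AKIA[0-9A-Z]{16}' sample.txt
rm sample.txt
```

This will not catch secret access keys (40 random base64-ish characters with no fixed prefix), which is why dedicated entropy-based scanners are still needed.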

Investigation (next 1-2 hours):

  1. Check for unauthorized usage:

    # CloudTrail: Check if the key was used between exposure and revocation
    aws cloudtrail lookup-events \
      --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA... \
      --start-time "2024-01-01T00:00:00Z"
    
    # Check S3 access logs for the key
    # Check RDS audit logs for any connections
    

  2. Assess blast radius:

  a. What IAM permissions did the key have? (S3 and RDS per the question)
  b. Was any data accessed or modified?
  c. Were any new resources created (crypto mining is common)?

Remediation and prevention:

  1. Pre-commit hooks: Add detect-secrets or git-secrets pre-commit hook to the repository.

  2. GitHub secret scanning: Enable GitHub Advanced Security secret scanning alerts.

  3. Vault integration for CI/CD: Ensure CI/CD fetches credentials from Vault at runtime, never from env vars or config files in the repo.

  4. Postmortem: Document the incident, timeline, and prevention measures.

Common mistakes: - Trying to scrub from git history BEFORE revoking the key (the key is compromised instantly) - Thinking a git revert removes the secret (it is still in history) - Not checking CloudTrail for unauthorized usage (the key may have been used) - Not scanning for other secrets in the repository (if one was there, others might be too) - Treating this as low severity because "we caught it quickly" (automated scanners find exposed keys within minutes)

Study: secrets-management/primer.md, secrets-management/footguns.md, security-basics/primer.md


Q37. Verifying mTLS Between Services

Model Answer (Strong):

How mTLS is configured in this architecture:

  1. Vault PKI engine issues short-lived certificates (90-day rotation)
  2. cert-manager requests certificates from the Vault PKI issuer
  3. Certificates are stored as Kubernetes Secrets
  4. Application pods mount the certificate secrets and use them for TLS

Verification steps:

  1. Check that certificates exist and are valid:

    kubectl get certificate -n meridian-prod -l app=order-service
    kubectl describe certificate order-service-tls -n meridian-prod
    # Check: Ready condition, Not After (expiry), Issuer
    
    # Inspect the actual certificate
    kubectl get secret order-service-tls -n meridian-prod -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
    # Verify: Issuer is the internal CA, Subject matches the service
    

  2. Verify traffic is actually encrypted (from inside a pod):

    # Test the connection from Order Service to Inventory Service
    kubectl exec -it <order-pod> -n meridian-prod -- \
      openssl s_client -connect inventory-service:8443 \
      -cert /certs/tls.crt -key /certs/tls.key -CAfile /certs/ca.crt
    # Should show successful TLS handshake with the internal CA
    
    # Verify that plaintext connection is rejected
    kubectl exec -it <order-pod> -n meridian-prod -- \
      curl -v http://inventory-service:8080/health
    # Should fail or be rejected (if the service only listens on TLS port)
    

  3. Network-level verification with tcpdump:

    # On the node, capture traffic between the pods
    tcpdump -i any -n host <order-pod-ip> and host <inventory-pod-ip> -A -c 50
    # If mTLS is working: payload should be encrypted (random bytes, not readable HTTP)
    # If not: you will see plaintext HTTP headers and JSON
    

What failure looks like: - Certificate expired: tls: certificate has expired - CA mismatch: tls: certificate signed by unknown authority - No client certificate presented: tls: client didn't provide a certificate - Wrong hostname: tls: certificate is valid for X, not Y
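The expiry case, the most common of these, can be checked non-interactively with openssl x509 -checkend. The throwaway self-signed certificate keeps the sketch self-contained; in practice, point it at the mounted tls.crt:

```shell
# Generate a short-lived self-signed cert standing in for the mounted tls.crt
openssl req -x509 -newkey rsa:2048 -nodes -days 2 \
  -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem \
  -subj "/CN=inventory-service" 2>/dev/null
# -checkend N exits 0 if the cert is still valid N seconds from now
openssl x509 -in /tmp/demo-cert.pem -noout -checkend 86400 && echo "valid for 24h+"
openssl x509 -in /tmp/demo-cert.pem -noout -enddate
rm -f /tmp/demo-key.pem /tmp/demo-cert.pem
```

The non-zero exit code from -checkend makes this easy to wire into a cron job or health check that alerts before certificates lapse.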

Common mistakes: - Checking only the Certificate resource status without verifying actual traffic encryption - Not testing that plaintext connections are rejected (mTLS is meaningless if the service also accepts unencrypted traffic) - Confusing external TLS (Let's Encrypt) with internal mTLS (Vault PKI) - Not checking the CA chain (the Vault PKI CA must be trusted by both sides)

Study: tls-pki/primer.md, tls-certificates-ops/primer.md, hashicorp-vault/primer.md


Q38. cert-manager Vault PKI Issuer Failure

Model Answer (Strong):

Diagnosis path:

  1. Check the ClusterIssuer status:

    kubectl describe clusterissuer vault-pki
    # Look at Conditions: Ready should be True
    # If False: the reason and message tell you what failed
    

  2. Check cert-manager controller logs:

    kubectl logs -n cert-manager -l app=cert-manager --tail=100
    # Look for errors mentioning vault, PKI, authentication
    

  3. Check Vault health:

    kubectl exec -it vault-0 -n vault -- vault status
    # Is it sealed? If sealed, the auto-unseal (KMS) may have failed
    # Check: "Sealed: false", "HA Enabled: true", "Active Node Address"
    
    kubectl get pods -n vault
    # All 3 pods should be Running
    

  4. Check Vault PKI engine:

    kubectl exec -it vault-0 -n vault -- vault secrets list
    # Verify the PKI engine is mounted at the expected path
    
    kubectl exec -it vault-0 -n vault -- vault read pki/roles/cert-manager
    # Check the role configuration: allowed_domains, ttl, max_ttl
    

  5. Check Kubernetes auth in Vault:

    kubectl exec -it vault-0 -n vault -- vault read auth/kubernetes/config
    # Verify the kubernetes_host and ca_cert are correct
    
    kubectl exec -it vault-0 -n vault -- vault read auth/kubernetes/role/cert-manager
    # Verify: bound_service_account_names, bound_service_account_namespaces
    

  6. Test authentication manually:

    # Get a cert-manager service account token (Kubernetes 1.24+ no longer
    # auto-creates token Secrets; request a short-lived token directly)
    TOKEN=$(kubectl create token cert-manager -n cert-manager)
    
    # Try to authenticate with Vault
    kubectl exec -it vault-0 -n vault -- vault write auth/kubernetes/login \
      role=cert-manager \
      jwt=$TOKEN
    

Most likely causes (given it worked 3 days ago): - Vault leader election changed and the new leader has different auth config - Kubernetes service account token rotated (auto-rotation in newer K8s versions) - Vault auto-unseal KMS key was rotated or permissions changed - PKI CA certificate expired or CRL is stale

Common mistakes: - Checking cert-manager without checking Vault (the problem is usually on the Vault side) - Not knowing how Kubernetes auth works in Vault (service account token exchange) - Restarting cert-manager without fixing the underlying Vault issue - Not checking if Vault itself is healthy (sealed state blocks everything)

Study: cert-manager/primer.md, hashicorp-vault/street_ops.md, Case: DNS TLS cert-manager


Section 6: CI/CD & DevOps Tooling

Q39. GitHub Actions Docker Build Failure

Model Answer (Strong):

First 3 diagnostic steps:

  1. Read the full error output in the GitHub Actions log:

    Click on the failed step → Expand the error output
    The pip install error will show the specific package and failure reason
    Common: version conflict, package yanked from PyPI, build dependency missing
    

  2. Check if a dependency was unpinned:

    # Check requirements.txt
    git diff HEAD~1 requirements.txt
    # If no diff: an unpinned dependency released a broken version overnight
    # e.g., "requests>=2.28" pulled in 2.32 which requires Python 3.12
    

  3. Try to reproduce locally:

    docker build -t inventory-service:test .
    # If it fails locally too: the issue is in the Dockerfile or dependencies
    # If it works locally: the issue is in the CI environment (cache, network, secrets)
    

Most common causes:

  1. Unpinned dependency released breaking version: A transitive dependency published a new version that is incompatible. Fix: pin all dependencies with exact versions or use a lock file.

  2. Docker build cache invalidation: The CI cache expired overnight, causing a fresh pull of all layers. A new base image (python:3.11-slim) was published with breaking changes. Fix: pin the base image digest.

  3. PyPI/network availability: Transient network failure to PyPI. Fix: add retry logic or use a private mirror.

  4. Build arg or secret not available: If the build step requires --build-arg or --secret that was changed in the workflow or repository secrets.

Common mistakes: - Re-running the workflow without investigating (if it is a flaky dependency, it will fail again) - Assuming the code changed when the dependencies did not (transitive deps are the usual culprit) - Not pinning the base image (OS-level packages in the base image can also break pip installs) - Not reading the full error output (pip errors are verbose but the actual failure is near the end)

Study: github-actions/primer.md, docker/primer.md, container-images/primer.md


Q40. kubectl Manual Patch vs ArgoCD

Model Answer (Strong):

The core conflict: ArgoCD manages the Auth Service declaratively from Git. A manual kubectl apply introduced state that is not in Git. ArgoCD's next sync will revert it.

Resolution options:

  1. If the change should persist: Commit it to the GitOps repository:

    # Add the environment variable to the Helm values
    cd argocd-manifests/
    vim charts/auth-service/values-prod.yaml
    # Add the env var
    git add . && git commit -m "feat: add debug env var to auth-service"
    git push
    # ArgoCD will sync and the change persists in a GitOps-compliant way
    

  2. If the change was temporary (debugging done): Let ArgoCD revert it:

    argocd app sync auth-service
    # ArgoCD applies the Git version, removing the manual change
    

  3. If the change is needed NOW but ArgoCD keeps reverting before you can commit:

    # Temporarily disable auto-sync
    argocd app set auth-service --sync-policy none
    # Make your manual change
    kubectl apply -f ...
    # When done debugging, commit to Git and re-enable auto-sync
    argocd app set auth-service --sync-policy automated
    argocd app sync auth-service
    

Key principle: In a GitOps workflow, Git is the source of truth. Manual kubectl changes are ephemeral. Any change that needs to persist must go through Git.

Common mistakes: - Leaving auto-sync disabled after debugging (drift accumulates) - Not understanding that ArgoCD compares live state to Git, not to the last sync - Using kubectl apply with --force to override ArgoCD (creates a sync loop) - Not documenting why the debug env var was added (next person removes it without understanding)

Study: argocd-gitops/primer.md, argocd-gitops/street_ops.md


Q41. Helm Chart Breaking Change Rollback

Model Answer (Strong):

  1. Immediate rollback in staging:

    helm history search-service -n staging
    # Find the last working revision
    helm rollback search-service <previous-revision> -n staging --wait --timeout=5m
    

  2. Verify the rollback:

    kubectl rollout status deployment/search-service -n staging
    kubectl get pods -n staging -l app=search-service
    # Confirm the service is back to the working version
    

  3. Revert the GitOps commit:

    cd argocd-manifests/
    git revert HEAD  # Revert the commit that changed the values
    git push
    # ArgoCD will sync staging back to the previous Helm values
    

  4. How the migration should have been handled:

  a. Backward-compatible values: The chart should have supported both the old and new field names during a transition period, using default and coalesce in templates:
    # In the Helm template:
    {{ .Values.newFieldName | default .Values.oldFieldName }}
    
  b. Staged rollout: Deploy the chart with backward compatibility first, then update the values, then remove backward compatibility in a subsequent release.
  c. Staging validation: ArgoCD should auto-sync staging. If staging fails, the commit should not be promoted to prod.
  d. Values schema validation: Use values.schema.json in the Helm chart to validate required fields before deployment.

Common mistakes: - Rolling back Helm in the cluster without reverting the Git commit (ArgoCD will re-apply the broken version) - Not checking if the rollback introduces its own issues (if the old values have the deprecated field name and the chart was also updated, rollback may not work) - Deploying breaking chart changes without a migration path - Not testing chart upgrades with helm diff upgrade before applying

Study: helm/primer.md, helm/footguns.md


Q42. Container Registry Cleanup

Model Answer (Strong):

  1. Identify what can be deleted:

    # List all tags for a repository
    gh api orgs/meridian/packages/container/order-service/versions --paginate | jq '.[].metadata.container.tags'
    
    # Safe to delete:
    # - Images older than 30 days that are NOT deployed to any cluster
    # - Untagged images (failed builds, superseded by tagged versions)
    # - Feature branch images (pr-* tags) after PR merge
    
    # NOT safe to delete:
    # - Any image currently deployed to prod, staging, or DR
    # - The last 5 tags per service (rollback targets)
    

  2. Check what is currently deployed:

    kubectl get deployments -n meridian-prod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
    kubectl get deployments -n staging -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
    # Build a "do not delete" list from these outputs
    
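With the registry tag list and the deployed-image list in hand, the deletion candidates are a set difference, which sort and comm compute without extra tooling. The sample tag lists are illustrative:

```shell
# All registry tags vs tags currently deployed somewhere (prod/staging/DR)
printf 'v1.40\nv1.41\nv1.42\n' | sort > all_tags.txt
printf 'v1.42\n'               | sort > deployed_tags.txt
# comm -23: lines only in the first file, i.e. candidates for deletion
comm -23 all_tags.txt deployed_tags.txt
rm -f all_tags.txt deployed_tags.txt
```

Both inputs must be sorted for comm; feed the real lists through the same sort before comparing.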

  3. Manual cleanup:

    # Delete old untagged versions
    gh api -X DELETE orgs/meridian/packages/container/order-service/versions/<version-id>
    

  4. Automate going forward:

    # Add a cleanup job in GitHub Actions
    - name: Delete old container images
      uses: actions/delete-package-versions@v4
      with:
        package-name: order-service
        package-type: container
        min-versions-to-keep: 10
        delete-only-untagged-versions: true
    

Also configure retention policies: - Keep last 10 tagged versions per service - Delete untagged versions older than 7 days - Delete PR images 24 hours after PR merge

Common mistakes: - Deleting images that are currently deployed (breaks rollbacks and pod restarts) - Not checking DR cluster (images needed there may be older) - Deleting by age alone without checking deployment status - Not automating cleanup (the problem will recur)

Study: github-actions/primer.md, container-images/primer.md, ci-cd-patterns/primer.md


Q43. Canary Deployment with Argo Rollouts

Model Answer (Strong):

  1. Convert the Order Service Deployment to a Rollout:

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: order-service
      namespace: meridian-prod
    spec:
      replicas: 3
      strategy:
        canary:
          canaryService: order-service-canary
          stableService: order-service-stable
          trafficRouting:
            nginx:
              stableIngress: order-service-ingress
              # Ingress-NGINX uses annotations to split traffic
          steps:
            - setWeight: 10    # Send 10% of traffic to canary
            - pause: {duration: 5m}
            - analysis:
                templates:
                  - templateName: order-service-success-rate
            - setWeight: 30
            - pause: {duration: 5m}
            - analysis:
                templates:
                  - templateName: order-service-success-rate
            - setWeight: 60
            - pause: {duration: 5m}
            - setWeight: 100   # Full rollout
    

  2. Traffic splitting with Ingress-NGINX:

  a. Argo Rollouts creates a canary Ingress with the nginx.ingress.kubernetes.io/canary: "true" and canary-weight: "10" annotations
  b. Ingress-NGINX routes the specified percentage to the canary Service

  3. Analysis template (rollback trigger):

    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: order-service-success-rate
    spec:
      metrics:
        - name: success-rate
          interval: 60s
          count: 5
          successCondition: result[0] >= 0.95
          provider:
            prometheus:
              address: http://prometheus.monitoring:9090
              query: |
                sum(rate(http_request_duration_seconds_count{
                  namespace="meridian-prod",
                  app="order-service",
                  status=~"2.."
                }[2m])) /
                sum(rate(http_request_duration_seconds_count{
                  namespace="meridian-prod",
                  app="order-service"
                }[2m]))
    

  4. Rollback trigger: If the success rate drops below 95% during any analysis phase, Argo Rollouts automatically aborts the rollout and scales the canary to 0.

Common mistakes: - Not creating separate stable and canary Service objects (traffic splitting requires distinct backends) - Using deployment strategy: canary (which just pauses between batches) instead of Argo Rollouts (which does traffic splitting) - Not defining analysis metrics (canary without analysis is just a slow rollout) - Forgetting that Ingress-NGINX canary annotations have limitations (no per-user targeting)

Study: progressive-delivery/primer.md, argocd-gitops/primer.md, Case: Canary Deploy Wrong Backend Ingress


Q44. Database Migration with Zero-Downtime Deployment

Model Answer (Strong):

The challenge: Adding a non-nullable column to a 50M-row table requires: - The column exists before the new code deploys (new code expects it) - The old code must not break when the column appears (backward compatibility) - The migration on a 50M-row table may lock the table

Safe orchestration (expand-contract pattern):

  1. Phase 1: Expand (before new code deploys)

    -- Add the column as NULLABLE first (metadata-only change with a brief lock;
    -- in PostgreSQL 11+ even ADD COLUMN with a constant DEFAULT avoids a rewrite)
    ALTER TABLE invoices ADD COLUMN new_field TEXT;
    
    -- Backfill in batches (no lock, runs alongside live traffic)
    UPDATE invoices SET new_field = 'default_value'
      WHERE id BETWEEN 1 AND 1000000;
    -- Repeat in batches of 1M with COMMIT between batches
    
    Run this as a Kubernetes Job or pre-deploy hook, NOT as part of the application startup.
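The batching loop can be sketched in shell, emitting one UPDATE per 1M-row window (in practice, pipe the output to psql; the max_id value is illustrative and would come from SELECT max(id)):

```shell
max_id=5000000    # illustrative; use SELECT max(id) FROM invoices in practice
batch=1000000
start=1
while [ "$start" -le "$max_id" ]; do
  end=$((start + batch - 1))
  # Each statement is its own transaction when fed to psql one at a time,
  # so locks are held only for the duration of a single batch
  echo "UPDATE invoices SET new_field = 'default_value' WHERE id BETWEEN $start AND $end;"
  start=$((end + 1))
done
```

Adding a short sleep between batches further reduces replication lag and lock contention on a busy table.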

  2. Phase 2: Deploy new code

  a. The new code writes to the new column and reads from it
  b. The new code handles NULL gracefully (some rows may not be backfilled yet)
  c. Rolling update replaces pods one at a time

  3. Phase 3: Contract (after all pods are on new version)

    -- Verify all rows are backfilled
    SELECT COUNT(*) FROM invoices WHERE new_field IS NULL;
    
    -- Add NOT NULL safely. A plain SET NOT NULL holds an ACCESS EXCLUSIVE lock
    -- for the entire validating table scan. Instead: add a NOT VALID check
    -- constraint (brief lock), VALIDATE it (weaker lock, writes continue),
    -- then SET NOT NULL. PostgreSQL 12+ skips the scan because the validated
    -- constraint already proves the column is non-null.
    ALTER TABLE invoices ADD CONSTRAINT invoices_new_field_not_null
      CHECK (new_field IS NOT NULL) NOT VALID;
    ALTER TABLE invoices VALIDATE CONSTRAINT invoices_new_field_not_null;
    ALTER TABLE invoices ALTER COLUMN new_field SET NOT NULL;
    ALTER TABLE invoices DROP CONSTRAINT invoices_new_field_not_null;
    

  4. Rollback plan:

  a. If Phase 2 fails: roll back the code. The nullable column remains but causes no harm.
  b. If Phase 3 fails: the column stays nullable. Schedule a maintenance window.
  c. The old code ignores the new column (it does not SELECT * — it selects specific columns).

Implementation in Helm/ArgoCD:

# Use a pre-sync Job in ArgoCD annotations
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded

Common mistakes: - Running ALTER TABLE ADD COLUMN NOT NULL DEFAULT 'x' on 50M rows (locks the table for minutes in older PostgreSQL versions) - Making the migration part of the application startup (blocks rolling update, runs on every pod) - Not backfilling in batches (one large UPDATE locks the table) - Not testing rollback (deploying migration + code together makes rollback impossible) - Assuming Django/SQLAlchemy migrate is safe for large tables (it often is not)

Study: database-ops/primer.md, ci-cd-patterns/primer.md, postgresql/primer.md


Section 7: Cross-Domain & Incident Response

Q45. 3 AM Alert — First 10 Minutes

Model Answer (Strong):

Minute 0-1: Acknowledge and orient

- Acknowledge the PagerDuty alert (stops escalation timer)
- Open phone: PagerDuty app → Alert details → Link to Grafana dashboard
- Join Slack #incidents channel
- Post: "Ack'd OrderServiceErrorRateHigh. Investigating. @oncall-secondary FYI."

Minute 1-3: Check the SLO dashboard

- Open Grafana: /d/slo-burn-rate
- Check: Is this a sudden spike or gradual degradation?
- Check: What percentage of requests are failing? (5% = alert threshold, but is it climbing?)
- Check: Which endpoints are affected? (all or specific?)

Minute 3-5: Check recent changes

# Was there a recent deploy?
helm history order-service -n meridian-prod
kubectl get events -n meridian-prod --sort-by='.lastTimestamp' | head -20
# Was there an infrastructure change?
# Check ArgoCD: any recent syncs?
argocd app get order-service

Minute 5-7: Check the Order Service directly

kubectl get pods -n meridian-prod -l app=order-service
kubectl logs -n meridian-prod -l app=order-service --tail=50 --since=15m | grep -i error
# Check downstream dependencies
kubectl exec -it <order-pod> -n meridian-prod -- curl -s http://localhost:8080/health

Minute 7-9: Check dependencies

# These checks assume the client binaries exist in the app image; otherwise
# use kubectl debug with a tooling image
# PostgreSQL
kubectl exec -it <order-pod> -- pg_isready -h <rds-endpoint>
# RabbitMQ
kubectl get pods -n rabbitmq
# Redis
kubectl exec -it <order-pod> -- redis-cli -h <redis-endpoint> ping

Minute 9-10: Decision point

If root cause is clear → fix it
If recent deploy → rollback: helm rollback order-service <prev> -n meridian-prod
If dependency is down → escalate to the dependency owner
If unclear → escalate to secondary, continue investigation
Post status update to #incidents regardless

Communication template:

Status: Investigating
Impact: ~5% of orders failing (estimated X orders per minute affected)
Timeline: Alert fired at 03:17, investigating since 03:18
Next update: In 10 minutes or when status changes

Common mistakes: - Investigating for 30 minutes without posting a status update - Not acknowledging the alert (escalation timer fires, wakes up secondary unnecessarily) - Deep-diving into logs before checking if a deploy caused it (rollback is faster than debugging) - Not checking dependencies (the Order Service may be healthy but PostgreSQL is the problem) - Trying to fix the root cause at 3 AM instead of mitigating first

Study: incident-command/primer.md, incident-triage/primer.md, incident-command/street_ops.md


Q46. Correlating Multiple Simultaneous Alerts

Model Answer (Strong):

Correlation approach:

  1. Timeline alignment: Check the exact timestamp each alert started. If they started within seconds of each other, they likely share a root cause. If one preceded the others by minutes, it may be the cause.

  2. Dependency mapping: Map the alerts to the architecture:

    NodeDiskPressure on data-node-2
        └── data-node-2 runs Elasticsearch
            └── ElasticsearchClusterYellow (lost replica on data-node-2)
                └── SearchServiceLatencyHigh (degraded Elasticsearch performance)
    

  3. Most likely common root cause: The disk on data-node-2 filled up. This caused:

    - Elasticsearch on that node to stop writing (disk watermark exceeded)
    - Elasticsearch cluster to go yellow (replica shards on data-node-2 became unavailable)
    - Search queries to slow down (fewer shards to serve queries, no replicas for read distribution)

  4. Verification:

    # Check data-node-2 disk
    ssh data-node-2 df -h
    # Check Elasticsearch disk watermark
    curl -s http://elasticsearch:9200/_cluster/settings | jq '.persistent.cluster.routing.allocation.disk'
    # Check which shards are on data-node-2
    curl -s "http://elasticsearch:9200/_cat/shards?v&h=index,shard,prirep,state,node" | grep data-node-2
    # Note: the URL must be quoted, or the shell interprets & as a background operator
    

  5. Treat as one incident, not three. The root cause is disk pressure on one node. Fixing that resolves all three alerts.
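The timeline-alignment step can be done mechanically against the Alertmanager API. A sketch, assuming Alertmanager's standard `/api/v2/alerts` JSON shape (an array of alerts with `startsAt` and `labels`); the URL is a placeholder:

```shell
# List active alerts oldest-first; the earliest startsAt is the lead suspect.
sort_alerts() {   # reads Alertmanager /api/v2/alerts JSON on stdin
  jq -r 'sort_by(.startsAt)[] | "\(.startsAt)  \(.labels.alertname)"'
}

# curl -s http://alertmanager:9093/api/v2/alerts | sort_alerts   # URL is a placeholder
```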

Common mistakes: - Treating each alert as a separate incident (wastes time, misses the root cause) - Focusing on the highest-severity alert without checking if a lower-severity alert is the cause - Not using the architecture diagram to trace dependencies - Assuming coincidence when the timing is too close

Study: incident-triage/primer.md, incident-command/primer.md, Case: Alert Storm Flapping Healthchecks


Q47. Postmortem Template and Process

Model Answer (Strong):

Template:

# Incident Postmortem: Order Service Outage — [Date]

## Summary
- Duration: 23 minutes (HH:MM — HH:MM UTC)
- Impact: ~3,400 orders affected, estimated $X revenue impact
- Severity: Sev1
- Detection: Automated alert (OrderServiceErrorRateHigh)
- Resolution: [one-line description]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fires |
| HH:MM | On-call acknowledges |
| ... | ... |
| HH:MM | Service restored |
| HH:MM | All-clear declared |

## Root Cause
[Clear, specific explanation. Not "human error."]

## Contributing Factors
[What made this worse or harder to detect?]

## Trigger
[The specific event that started the incident]

## Detection
- How was it detected? (alert, customer report, internal)
- Could we have detected it sooner? How?

## Response
- What went well?
- What could have been better?

## Action Items
| # | Action | Owner | Priority | Due Date |
|---|--------|-------|----------|----------|
| 1 | ... | ... | P1 | ... |

## Lessons Learned
[What did we learn that we did not know before?]

Who should attend: - Primary and secondary on-call during the incident - Engineers who participated in the response - Engineering manager (observer, not to assign blame) - Product manager (to understand customer impact) - Optional: SRE/Platform team lead, affected service owner

Timeline reconstruction: - Start from PagerDuty and Slack logs (timestamped automatically) - Cross-reference with Grafana annotations and deploy history - Ask each participant to fill in their actions with timestamps - Use the observability stack to verify timing (Prometheus queries, Loki logs)

Distinguishing root cause, contributing factors, and triggers: - Trigger: The specific event that started the incident (e.g., "a deploy at 14:23 introduced a bug in order validation") - Root cause: The underlying systemic issue (e.g., "no integration test covers the order validation path that was broken") - Contributing factors: Things that made it worse (e.g., "the canary analysis did not include order success rate", "the on-call runbook was outdated")
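Timeline reconstruction can be partly automated: Grafana's `/api/annotations` endpoint returns timestamped events that map directly onto the postmortem's timeline table. A sketch, assuming the standard annotation fields (`time` in epoch milliseconds, `text`); the URL and token are placeholders:

```shell
# Convert Grafana annotations JSON (stdin) into "| HH:MM | Event |" table rows.
annotations_to_rows() {
  jq -r '.[] | "| \((.time / 1000 | floor | todate)[11:16]) | \(.text) |"'
}

# curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
#   "https://grafana.example/api/annotations?from=<start-ms>&to=<end-ms>" | annotations_to_rows
```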

Common mistakes: - Blaming individuals ("Bob deployed the broken code" — blameless postmortems focus on systems) - Not assigning owners and due dates to action items (they never get done) - Confusing trigger with root cause (fixing only the trigger means the systemic issue remains) - Holding the postmortem weeks later when details are forgotten (within 48 hours)

Study: postmortem-slo/primer.md, incident-command/primer.md


Q48. DR Failover Process

Model Answer (Strong):

Step-by-step failover:

  1. Assess the situation (2 minutes):

    - Confirm us-east-1 is actually down (not just our infrastructure)
    - Check AWS status page, Twitter, other services in the region
    - Confirm DR cluster in eu-west-1 is healthy
    

  2. Communicate (1 minute):

    Post to #incidents: "us-east-1 regional outage confirmed. Initiating DR failover."
    Notify engineering manager and VP Engineering (Sev1)
    Update StatusPage: "We are aware of the issue and are failing over to our DR site."
    

  3. Database promotion (5-7 minutes):

    # Promote the RDS read replica in eu-west-1 to primary
    aws rds promote-read-replica --db-instance-identifier meridian-dr-replica --region eu-west-1
    # This takes 3-5 minutes
    # During this time, writes are impossible — the system is read-only
    
    # Update the database connection strings in Vault
    vault kv put secret/meridian/database host=<new-primary-endpoint> ...
    # Restart ESO to pick up new secrets
    kubectl rollout restart deployment/external-secrets -n vault --context=dr-cluster
    

  4. DNS cutover (1-2 minutes):

    # Route53 health checks should auto-failover if configured
    # If not automatic:
    aws route53 change-resource-record-sets --hosted-zone-id <zone> --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.meridian.io",
          "Type": "A",
          "AliasTarget": {
            "DNSName": "<eu-west-1-nlb-dns>",
            "HostedZoneId": "<nlb-zone>",
            "EvaluateTargetHealth": true
          }
        }
      }]
    }'
    # TTL is 60s, so propagation takes 1-2 minutes
    

  5. Service verification (3-5 minutes):

    # Check all services are running in DR
    kubectl get pods --all-namespaces --context=dr-cluster | grep -v Running
    # Run smoke tests
    curl https://api.meridian.io/health  # Should hit DR cluster now
    # Check error rate in DR cluster Grafana
    

  6. Cache warming (5-10 minutes):

    - Redis in DR is cold (not replicated from prod)
    - First requests will be slower (cache misses)
    - Session cache is empty: all users will need to re-authenticate
    

Expected data loss: - PostgreSQL: RPO < 1 minute (async replication with ~1s lag); up to about a second of committed transactions may be lost. - RabbitMQ: all in-flight messages are lost (not replicated to DR); order events in the queue must be reconciled from the database after recovery. - Redis: all cached data is lost, but no durable data is lost (the cache is not authoritative); expect cold-cache latency and forced re-authentication. - Elasticsearch: indexes must be rebuilt from PostgreSQL in DR; search may be stale until reindexing completes (~45 minutes).
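The RabbitMQ reconciliation noted above reduces to a set difference: orders confirmed in PostgreSQL but never seen by Fulfillment must be replayed. A minimal sketch using `comm` on sorted ID lists; the export queries and file names are hypothetical:

```shell
# to_replay <confirmed-ids-file> <fulfilled-ids-file> — order IDs confirmed in
# the database but absent from fulfillment; both files must be sorted.
to_replay() {
  comm -23 "$1" "$2"
}

# Hypothetical exports (table and column names are assumptions):
#   psql -c "COPY (SELECT id FROM orders WHERE status='confirmed' ORDER BY id) TO STDOUT" > confirmed.txt
#   psql -c "COPY (SELECT order_id FROM fulfillments ORDER BY order_id) TO STDOUT" > fulfilled.txt
#   to_replay confirmed.txt fulfilled.txt   # feed these IDs back into the pipeline
```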

Total RTO: ~15 minutes (DNS propagation + DB promotion + verification)

Common mistakes: - Not confirming the regional outage first (could be a localized issue that does not require DR) - Promoting the DB replica before confirming DR cluster health - Forgetting about session state in Redis (all users are logged out) - Not planning for RabbitMQ message loss (in-flight orders need reconciliation) - Not communicating to customers via StatusPage

Study: disaster-recovery/primer.md, backup-restore/primer.md


Q49. Onboarding a New On-Call Engineer

Model Answer (Strong):

10 most important things, in order:

  1. How to acknowledge and escalate alerts: Show them PagerDuty, the escalation matrix, and the Slack #incidents channel. This is literally the first thing they will need at 3 AM.

  2. The architecture overview: Walk through architecture.md together. Focus on the data flow: user request -> CDN -> Ingress -> Kong -> services -> data stores. They need to know the dependency chain.

  3. Key Grafana dashboards: Show them the System Overview, SLO Burn Rate, and Order Pipeline dashboards. Teach them to read the golden signals (latency, traffic, errors, saturation).

  4. The 5 most common incidents and their runbooks: Based on the last quarter's incidents:

    - Pod CrashLoopBackOff (check logs, check deploy, rollback)
    - High error rate (check dependencies, check recent changes)
    - Database connection exhaustion (check PgBouncer, check pool size)
    - Certificate expiry (check cert-manager, check issuer)
    - Node NotReady (check kubelet, check system resources)

  5. How to roll back a deploy: helm rollback + Git revert + ArgoCD sync. Practice this once in staging.

  6. How to access the cluster: kubectl contexts (prod, staging, DR), VPN access, bastion host for node SSH, Vault UI for secrets. Verify they can actually connect.

  7. The deployment pipeline: How code gets from Git to production (GitHub Actions -> GHCR -> ArgoCD -> Helm -> cluster). They need to know this to debug deploy failures.

  8. Database access and safety: How to connect to the PostgreSQL read replica for investigation (never the primary during an incident). PgBouncer connection pooling. Read-only credentials.

  9. Communication expectations: When to post to #incidents, when to update StatusPage, when to escalate, how to write a status update.

  10. Where to find help: On-call runbooks, team wiki, secondary on-call phone number, engineering manager phone number, Slack channels for specific domains.

Then: have them shadow for one full rotation before going solo. During shadow, they handle alerts with the primary available for guidance.
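The "verify they can actually connect" step can start with a mechanical prerequisite check before touching any cluster. A sketch that reports missing CLIs; extend with VPN and Vault checks as needed:

```shell
# missing_tools <tool>... — print each named CLI that is not on PATH
missing_tools() {
  local t
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || echo "$t"
  done
}

# During onboarding:
#   missing_tools kubectl helm vault aws
# An empty result means the binaries are present; then verify contexts:
#   kubectl config get-contexts   # expect prod, staging, and DR entries
```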

Common mistakes: - Starting with the architecture without covering alerting first (they need to know what to do when paged before they understand why) - Giving them documentation to read instead of walking through it together - Not verifying they can access the cluster (VPN, kubectl contexts, Vault) - Overwhelming with details instead of focusing on the 5 most common incidents - Not having them practice a rollback in staging

Study: incident-command/primer.md, runbook-craft/primer.md


Q50. Biggest Operational Risk and Mitigation

Model Answer (Strong):

The single biggest risk: RabbitMQ is a single point of failure for the entire order pipeline, and it is not replicated to DR.

Why this is the biggest risk:

  • Probability: Medium-high. RabbitMQ is self-managed (not a managed service like RDS), runs on in-cluster storage, and has known failure modes during network partitions (even with pause-minority mode).

  • Impact: Critical. If RabbitMQ goes down:

    - Order events stop flowing to Fulfillment Service (orders confirmed but not fulfilled)
    - Inventory updates stop (stock levels become stale, causing overselling)
    - Notification Service stops (customers are not notified)
    - Worker Service jobs stall
    - The Order Service circuit breaker trips after 5s, causing order creation failures
    - 6 of 8 microservices are directly affected

  • Detection difficulty: Medium. We have RabbitMQ metrics, but queue backup can grow silently for minutes before error rates trigger alerts. The dead-letter queue fills up last.

  • DR gap: RabbitMQ is not replicated to the DR cluster. During a regional failover, all in-flight messages are lost. This means orders in the pipeline between "confirmed" and "fulfilled" are in an unknown state.

Mitigation plan:

  1. Short-term (1-2 weeks):

    - Add alerting on queue depth growth rate (catch backup before consumers die)
    - Implement client-side retry with a persistent outbox pattern in the Order Service (orders are never lost even if RabbitMQ is down)
    - Document the RabbitMQ failure runbook

  2. Medium-term (1-2 months):

    - Migrate from self-managed RabbitMQ to Amazon MQ or Amazon SQS (managed service, higher availability, cross-region replication)
    - Add an order reconciliation job that compares database state with fulfillment state (catches dropped messages)

  3. Long-term (1 quarter):

    - Implement event sourcing for the order pipeline (events stored in PostgreSQL, RabbitMQ becomes a delivery optimization, not the source of truth)
    - Add chaos engineering tests for RabbitMQ failure
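The short-term alerting item can be prototyped before it reaches Prometheus: sample queue depth twice and compute the growth rate. A sketch; the queue name and threshold are assumptions, and `rabbitmqctl list_queues name messages` is the standard depth query:

```shell
# growth_per_sec <prev_depth> <curr_depth> <interval_s> — queue growth in messages/sec
growth_per_sec() {
  awk -v p="$1" -v c="$2" -v s="$3" 'BEGIN { print (c - p) / s }'
}

# Sampling sketch (run inside the RabbitMQ pod; queue name is a placeholder):
#   d0=$(rabbitmqctl list_queues name messages | awk '$1=="orders.events"{print $2}')
#   sleep 60
#   d1=$(rabbitmqctl list_queues name messages | awk '$1=="orders.events"{print $2}')
#   growth_per_sec "$d0" "$d1" 60   # sustained positive values mean consumers are falling behind
```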

Alternative answer: A strong answer could also argue for PostgreSQL as the biggest risk (single writer, 60-120s failover, 50M-row migration lock risk) or Vault (sealed Vault = all secrets unavailable = all services fail). The key is the reasoning, not the specific choice.

Common mistakes: - Picking a low-probability risk (e.g., "S3 outage") without considering probability - Not explaining impact in terms of customer/business effect - Proposing a mitigation without a timeline and effort estimate - Not considering the DR gap (many risks are amplified by the DR limitations) - Picking a generic answer ("security breach") without tying it to this specific architecture

Study: disaster-recovery/primer.md, chaos-engineering/primer.md, incident-triage/primer.md, risk-management-and-safety-thinking/primer.md