Diagnostic Questions

Before revealing the investigation path:
- The alert says the `payment-service` container is OOMKilled. What is your first troubleshooting step? Would you immediately profile the application's memory usage, or check something else first?
- `kubectl top pod --containers` shows the app using 118Mi/256Mi and the sidecar using 127Mi/128Mi. How does Kubernetes decide which container to OOMKill when the pod is under memory pressure? Could the OOMKill attribution in the event be misleading?
- The sidecar memory usage started climbing 2 hours ago, coinciding with a Helm deployment. What commands would you run to determine what changed in the sidecar's configuration between the old and new deployments?
- The root cause is a trace sampling change from 1% to 100% in the Helm values. Why is the correct fix in the Helm values file (DevOps tooling) rather than increasing the sidecar's memory limit (Kubernetes) or reconfiguring the tracing backend (observability)?
- What guardrails would you put in place to prevent an observability configuration change from causing a production outage? Consider both the deployment pipeline and runtime monitoring.
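One useful triage habit for the first two questions: OOM kills are enforced at the *container* cgroup limit, so compare each container's usage against its own limit rather than its absolute usage. A minimal sketch (pod and namespace names are hypothetical; the `kubectl` commands are shown as comments since they need a live cluster):

```shell
#!/bin/sh
# First-pass triage, assuming a pod named payment-service-abc123 in
# namespace "payments" (both names are placeholders):
#
#   kubectl -n payments describe pod payment-service-abc123      # look for Last State: OOMKilled, Exit Code 137
#   kubectl -n payments top pod payment-service-abc123 --containers
#   kubectl -n payments get events --field-selector reason=OOMKilling
#
# The per-container arithmetic from the numbers in the question shows why
# attribution can mislead: the app uses more memory in absolute terms, but
# the sidecar is the one pinned against its own limit.
app_used=118;     app_limit=256      # Mi, from kubectl top
sidecar_used=127; sidecar_limit=128
echo "app:     $((app_used * 100 / app_limit))% of its limit"       # 46%
echo "sidecar: $((sidecar_used * 100 / sidecar_limit))% of its limit" # 99%
```

A container at 99% of its limit will be killed by the kernel the moment it allocates past 128Mi, regardless of what its neighbors are doing.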
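For the "what changed?" question, Helm keeps each release revision, so the two configurations can be retrieved and diffed directly. A sketch, with a simulated diff at the end (release name, namespace, revision numbers, and the `tracing.samplingRate` key are all assumptions for illustration):

```shell
#!/bin/sh
# Retrieve and compare values across Helm revisions (needs a live cluster,
# so shown as comments; names and revision numbers are hypothetical):
#
#   helm -n payments history payment-service
#   helm -n payments get values payment-service --revision 41 > /tmp/old.yaml
#   helm -n payments get values payment-service --revision 42 > /tmp/new.yaml
#   diff -u /tmp/old.yaml /tmp/new.yaml
#
# Simulated result of that diff, using a hypothetical samplingRate key:
printf 'tracing:\n  samplingRate: 0.01\n' > /tmp/old.yaml
printf 'tracing:\n  samplingRate: 1.0\n'  > /tmp/new.yaml
diff -u /tmp/old.yaml /tmp/new.yaml || true   # diff exits non-zero when files differ
```

Diffing the rendered manifests (`helm get manifest ... --revision N`) works the same way and also catches changes that come from the chart rather than the values file.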
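On the pipeline side of the guardrails question, one option is a CI check that rejects a values change exceeding a sampling threshold before it ever deploys. A sketch under stated assumptions (the file path, the `samplingRate` key, and the 0.05 ceiling are all hypothetical):

```shell
#!/bin/sh
# Pipeline guardrail sketch: fail the deploy if trace sampling is set above
# a ceiling. In CI the candidate file would come from `helm template` or the
# PR's values file; here we fabricate one for illustration.
printf 'tracing:\n  samplingRate: 1.0\n' > /tmp/values-candidate.yaml

max_rate="0.05"
rate=$(awk '/samplingRate:/ {print $2}' /tmp/values-candidate.yaml)
awk -v r="$rate" -v m="$max_rate" 'BEGIN { exit (r <= m) ? 0 : 1 }' \
  && echo "sampling rate $rate OK" \
  || echo "sampling rate $rate exceeds $max_rate: refusing to deploy"
```

On the runtime side, the complementary guardrail is an alert that fires when a container's working-set memory approaches its limit (e.g. a Prometheus rule over `container_memory_working_set_bytes` divided by the configured limit), so a slow climb like the sidecar's is caught before the kernel intervenes.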