K8S Probes

12 cards — 🟢 4 easy | 🟡 4 medium | 🔴 4 hard

🟢 Easy (4)

1. What does a Kubernetes liveness probe determine?

Answer: Whether the container is still alive. If the liveness probe fails, Kubernetes kills and restarts the container.

2. What happens when a readiness probe fails?

Answer: The pod is removed from Service endpoints so it stops receiving traffic, but it is NOT restarted. Traffic is routed to other healthy pods.

3. What are the four probe mechanisms Kubernetes supports?

Answer: httpGet (HTTP GET returning 2xx/3xx), tcpSocket (TCP port is open), exec (command exits 0), and grpc (gRPC health check, Kubernetes 1.24+).
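
The four mechanisms can be sketched in a pod spec like this (container names, ports, and paths are illustrative, not from the cards):

```yaml
# Illustrative probe configurations — one mechanism per container.
containers:
  - name: web
    livenessProbe:
      httpGet:            # passes on any 2xx/3xx response
        path: /healthz
        port: 8080
  - name: cache
    livenessProbe:
      tcpSocket:          # passes if the port accepts a TCP connection
        port: 6379
  - name: worker
    livenessProbe:
      exec:               # passes if the command exits with status 0
        command: ["cat", "/tmp/healthy"]
  - name: api
    livenessProbe:
      grpc:               # gRPC health-checking protocol (Kubernetes 1.24+)
        port: 9090
```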

4. What is the purpose of a startup probe?

Answer: It tells Kubernetes the container is still booting. While the startup probe is running, liveness and readiness probes are disabled. Once it succeeds, it never runs again and the other probes take over.
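
A minimal sketch of a startup probe guarding a slow-booting container (endpoint and values are illustrative):

```yaml
# Allow up to 30 failures × 10s period = 300s for boot.
# Liveness is suppressed until the startup probe succeeds once.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```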

🟡 Medium (4)

1. Why is checking database connectivity in a liveness probe dangerous?

Answer: If the database goes down, all pods fail liveness simultaneously, Kubernetes restarts them all, they reconnect at once and thundering-herd the database, and the cycle repeats — a cascading restart storm.

2. What does failureThreshold control, and how does it interact with periodSeconds?

Answer: failureThreshold is the number of consecutive probe failures before Kubernetes takes action. Combined with periodSeconds, it sets the detection window: e.g., failureThreshold=3 and periodSeconds=10 means roughly 30 seconds of failures before restart or endpoint removal.
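
The example from the card as a probe fragment (endpoint and port are illustrative):

```yaml
# 3 consecutive failures × 10s period ≈ 30-second detection window
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```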

3. How do readiness probes interact with rolling deployments?

Answer: New pods must pass their readiness probe before receiving traffic and before old pods are terminated. If new pods never become ready (e.g., broken config), the rollout stalls and old pods continue serving — preventing a bad deploy from causing downtime.

4. Before startup probes existed, how did operators handle slow-starting containers, and why was that approach fragile?

Answer: They set a large initialDelaySeconds on the liveness probe. This was fragile because if the app started faster, detection of a truly dead container was delayed; if it started slower, the liveness probe would kill it during boot.
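
The old workaround looked like this (the delay value is an illustrative guess, which was exactly the problem):

```yaml
# Pre-startup-probe pattern: hard-code an estimate of boot time.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 120   # too high → slow failure detection; too low → killed during boot
  periodSeconds: 10
```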

🔴 Hard (4)

1. How should liveness and readiness probe endpoints differ in what they check, and why?

Answer: Liveness should only verify the process is alive and responsive (no dependency checks) — it answers "should I restart?" Readiness should check dependencies, cache warmth, and load — it answers "can I serve traffic?" Using the same endpoint for both causes dependency failures to trigger unnecessary restarts.
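
A sketch of the split, assuming the app exposes two hypothetical endpoints (/livez checks only the process, /readyz checks dependencies):

```yaml
# Separate endpoints so a database outage fails readiness, not liveness.
livenessProbe:
  httpGet:
    path: /livez      # process alive and responsive only
    port: 8080
readinessProbe:
  httpGet:
    path: /readyz     # dependencies, cache warmth, load
    port: 8080
```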

2. A pod shows RESTARTS=5 and readiness probe failures in kubectl describe. Walk through how you would debug this.

Answer: 1) kubectl describe pod to read probe failure events and identify which probe is failing. 2) Check lastState for exit codes (e.g., 137 = OOMKilled). 3) kubectl exec into the pod and curl the probe endpoint manually to see the actual response. 4) kubectl logs --previous to read logs from the crashed container. 5) Verify the probe port matches the container port and the endpoint returns the expected status code.

3. Why can JVM garbage collection cause liveness probe failures, and how do you mitigate it?

Answer: Full GC pauses can stop the JVM for several seconds. If timeoutSeconds is set too low (e.g., 1s) and a GC pause exceeds that, the probe times out and counts as a failure. Mitigation: increase timeoutSeconds to exceed the worst-case GC pause, use a startup probe for slow JVM boot, and tune GC to reduce pause times.
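
A GC-tolerant probe configuration might look like this (values are illustrative; the timeout should be sized from your own worst-case pause measurements):

```yaml
# Startup probe absorbs slow JVM boot; generous timeout absorbs GC pauses.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 5       # must exceed the worst-case GC pause
  periodSeconds: 10
  failureThreshold: 3     # one stray pause doesn't trigger a restart
```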

4. What is the constraint on successThreshold for liveness and startup probes, and why does it matter for readiness?

Answer: For liveness and startup probes, successThreshold must be 1 (Kubernetes rejects other values). Only readiness probes can require multiple consecutive successes before the pod is added back to endpoints. This matters because you may want a readiness probe to confirm stability (e.g., successThreshold=2) before resuming traffic after a transient failure.
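
The readiness example from the card as a fragment (endpoint and timings are illustrative):

```yaml
# Require two consecutive successes before re-adding the pod to endpoints.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  successThreshold: 2    # valid only for readiness; liveness/startup must use 1
  failureThreshold: 3
  periodSeconds: 5
```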