
K8S Advanced Ops


32 cards — 🟢 3 easy | 🟡 13 medium | 🔴 9 hard

🟢 Easy (3)

1. What is the standard three-command diagnostic flow when a pod is misbehaving?

1) kubectl describe pod <pod> — check Events, conditions, and container state.
2) kubectl logs <pod> [-c <container>] [--previous] — read application stdout/stderr.
3) kubectl get events --field-selector involvedObject.name=<pod> — see cluster-level events. This describe-logs-events flow covers 90% of initial triage.

2. A Service has no endpoints even though pods are running. What do you check?

kubectl get endpoints <service> shows an empty address list. Compare the Service's selector (kubectl get svc <service> -o wide) against the pod labels (kubectl get pods --show-labels). A mismatch between the Service selector and pod labels is the most common cause. Also verify the pods are in the same namespace as the Service and that they are Ready (readiness probe passing).
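A minimal sketch of a correctly wired Service and Deployment (all names and the image are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web           # hypothetical name
  namespace: default  # must be the same namespace as the pods
spec:
  selector:
    app: web          # must exactly match the pod template labels below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web      # a typo here (e.g., app: webb) leaves the Service with no endpoints
    spec:
      containers:
        - name: web
          image: nginx:1.25
```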

3. What is the difference between resource requests and limits, and which one affects scheduling?

Requests are guaranteed resources the scheduler uses for bin-packing — a pod is scheduled only if a node has enough allocatable capacity for the request. Limits are the max the container can use; exceeding CPU limits causes throttling, exceeding memory limits causes OOMKill.
Best practice: set requests close to actual usage and limits as a safety ceiling.
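A typical container resources stanza following that practice (values are illustrative):

```yaml
resources:
  requests:          # what the scheduler reserves for bin-packing
    cpu: 250m
    memory: 256Mi
  limits:            # enforcement ceiling: CPU is throttled, memory is OOMKilled
    cpu: "1"
    memory: 512Mi
```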

🟡 Medium (13)

1. A Deployment rollout is stuck at 1/3 updated replicas. How do you diagnose it?

kubectl rollout status deployment/<name> shows the stall. kubectl describe deployment <name> reveals the reason under Conditions (e.g., ProgressDeadlineExceeded). Check the new ReplicaSet's pods with kubectl get rs, then kubectl describe pod on the stuck pods — usually a crash, image pull failure, or resource quota exhaustion.

2. Pods restart every 90 seconds but application logs show no errors. What is the most likely cause?

A misconfigured liveness probe. The probe path may return non-200, the port may be wrong, or timeoutSeconds is too short for the endpoint. kubectl describe pod will show 'Liveness probe failed' events with the HTTP status or connection error. Fix the probe spec, not the app, if the app is actually healthy.
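A sketch of a liveness probe with the fields that commonly cause this (path, port, and timings are hypothetical):

```yaml
livenessProbe:
  httpGet:
    path: /healthz       # must return 2xx/3xx from the container
    port: 8080           # must match the container's listening port
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 3      # too short a timeout causes false failures under load
  failureThreshold: 3    # kubelet restarts the container after 3 consecutive failures
```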

3. A new node is joined to the cluster but no pods schedule onto it. kubectl describe node shows a taint. What do you do?

Check the taint with kubectl describe node <node> | grep Taints. If the taint is intentional (e.g., dedicated=gpu:NoSchedule), add a matching toleration to pods that should run there. If accidental, remove it: kubectl taint nodes <node> dedicated=gpu:NoSchedule-. The trailing minus sign removes the taint.
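A toleration matching the example taint above would look like this in the pod spec:

```yaml
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule   # pod may now schedule onto nodes tainted dedicated=gpu:NoSchedule
```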

4. A PVC is stuck in Pending state. What are the common causes?

1) No PV matches the PVC's storageClassName, access mode, or capacity request.
2) The StorageClass provisioner is misconfigured or not installed.
3) The cloud provider hit a quota or zone availability limit.
4) The PVC requests an access mode (e.g., ReadWriteMany) the provisioner does not support. Check the Events in kubectl describe pvc <name> for the specific error.

5. Pods can reach a Service by ClusterIP but DNS resolution for <service>.<namespace>.svc.cluster.local fails. What do you investigate?

1) Check CoreDNS pods are running: kubectl -n kube-system get pods -l k8s-app=kube-dns.
2) Verify the pod's /etc/resolv.conf points to the kube-dns ClusterIP.
3) Test from inside a pod: nslookup <service>.<namespace>.svc.cluster.local.
4) Check CoreDNS logs for errors.
5) Confirm no NetworkPolicy is blocking UDP/TCP 53 to kube-dns.

6. DNS lookups from pods are slow, adding 5+ seconds of latency. What is the likely cause?

The default ndots:5 in pod resolv.conf causes the resolver to try multiple search domains before querying the absolute name. For external domains, each attempt times out against cluster DNS. Fix: set dnsConfig.options ndots:2 in the pod spec, or always use FQDNs with a trailing dot (e.g., api.example.com.) to bypass search expansion.
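The dnsConfig fix in the pod spec looks like this:

```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # only names with fewer than 2 dots go through search-domain expansion
```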

7. Pods are being evicted with the message 'The node was low on resource: ephemeral-storage'. What is happening?

Kubelet's eviction manager detected ephemeral storage usage exceeding the threshold (default ~85%). It evicts pods in order of priority and usage. Check node conditions: kubectl describe node <node> | grep -A 5 Conditions. Clean up: remove unused images (crictl rmi --prune), check for pods writing large files to emptyDir, and set resource limits on ephemeral-storage in pod specs.
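Ephemeral-storage limits go in the same resources stanza as CPU and memory (values are illustrative):

```yaml
resources:
  requests:
    ephemeral-storage: 1Gi
  limits:
    ephemeral-storage: 4Gi   # the offending pod is evicted when it exceeds its own limit,
                             # rather than triggering node-wide pressure eviction
```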

8. A container is OOMKilled but the application memory profiler shows usage well below the limit. Why?

The OOM limit applies to the entire cgroup, not just the heap. It includes RSS, page cache, tmpfs mounts, and child processes. Also, the JVM or runtime may allocate off-heap memory (NIO buffers, thread stacks). Check kubectl describe pod for the Last State showing OOMKilled with exit code 137, and compare against actual usage with cat /sys/fs/cgroup/memory/memory.usage_in_bytes (cgroup v1) or /sys/fs/cgroup/memory.current (cgroup v2) inside the container.

9. HPA keeps scaling to max replicas even when average CPU is low. What could cause this?

1) One pod is spiking and the average is skewed by replica count.
2) The metric source is wrong (using total CPU instead of per-pod average).
3) Readiness probe failures cause fewer ready pods, inflating per-pod averages.
4) A recent deploy created pods that are initializing and consuming startup CPU. Check kubectl describe hpa and kubectl top pods to correlate.

10. You try to drain a node but it hangs. kubectl drain shows 'evicting pod ... Cannot evict pod as it would violate the pod's disruption budget'. How do you proceed?

A PodDisruptionBudget (PDB) is blocking eviction because draining would reduce available replicas below minAvailable or above maxUnavailable. Options:
1) Scale up the deployment first so draining one node stays within budget.
2) Check if another node already has disrupted pods.
3) As a last resort, delete the PDB temporarily (kubectl delete pdb <name>), drain, then recreate it.

11. A pod is stuck in Init:0/2 status. How do you debug it?

kubectl describe pod shows init container status and events. kubectl logs <pod> -c <init-container> shows the init container's output. Init containers run sequentially — 0/2 means the first init container has not completed. Common causes: waiting on a dependency (DNS, service, database), wrong command, or missing ConfigMap/Secret volume.
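A sketch of the common "wait for a dependency" pattern that gets stuck at Init:0/2 (service name and image are hypothetical):

```yaml
initContainers:
  - name: wait-for-db        # blocks at Init:0/2 until the db Service resolves
    image: busybox:1.36
    command: ['sh', '-c', 'until nslookup db.default.svc.cluster.local; do sleep 2; done']
  - name: run-migrations     # only runs after wait-for-db completes
    image: busybox:1.36
    command: ['sh', '-c', 'echo migrations done']
```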

12. You create a resource but cannot find it with kubectl get. What namespace-related mistakes should you check?

1) The resource was created in a different namespace — always use -n or --all-namespaces.
2) Your kubeconfig context has a default namespace set that differs from where the resource lives.
3) The resource is cluster-scoped (e.g., ClusterRole, PV, Node) and does not appear with -n. Check with kubectl api-resources --namespaced=false.
4) RBAC may hide resources you lack permission to list.

13. You delete a pod managed by a Deployment but it immediately reappears. Why?

The Deployment's ReplicaSet controller continuously reconciles actual state to desired state. When you delete a pod, the controller detects the replica count is below spec.replicas and creates a replacement. To actually remove the workload, delete or scale down the Deployment (kubectl scale deployment <name> --replicas=0), not individual pods.

🔴 Hard (9)

1. You run kubectl rollout undo but the previous version also had issues. How do you roll back to a specific revision?

kubectl rollout history deployment/<name> lists revisions with change-cause annotations. kubectl rollout undo deployment/<name> --to-revision=<n> targets a specific known-good revision. Always confirm with kubectl rollout status and check pod readiness before declaring recovery.

2. An app takes 120 seconds to initialize. Liveness probe kills it before startup completes. How do you fix this without removing the liveness probe?

Add a startup probe with a generous failureThreshold × periodSeconds window (e.g., failureThreshold: 30, periodSeconds: 5 = 150s). The startup probe blocks liveness and readiness probes until it succeeds. This keeps fast-restart protection while allowing slow cold starts.
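The startup/liveness pairing from the answer, sketched in the container spec (path and port are hypothetical):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 × 5s = up to 150s allowed for cold start
  periodSeconds: 5
livenessProbe:           # only starts firing after the startup probe succeeds
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```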

3. Pods are Pending cluster-wide after a control plane upgrade. All worker nodes show a NoSchedule taint. What happened and how do you recover?

The upgrade likely re-applied node-role.kubernetes.io/control-plane:NoSchedule and may have incorrectly tainted workers. Verify with kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints. Remove errant taints from workers: kubectl taint nodes <node> <key>:NoSchedule-. For control-plane nodes, add tolerations only for system-critical pods.

4. You need to expand a PVC from 10Gi to 50Gi but the pod won't restart. What is the process?

1) Verify the StorageClass has allowVolumeExpansion: true.
2) Edit the PVC: kubectl patch pvc <name> -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'.
3) For file-system volumes, the resize completes on the next pod mount — you must delete and recreate the pod (not the PVC).
4) Check kubectl describe pvc <name> for the FileSystemResizePending condition.
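A StorageClass that permits expansion; the name is hypothetical and the provisioner is provider-specific (the AWS EBS CSI driver is shown as one example):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-ssd         # hypothetical name
provisioner: ebs.csi.aws.com   # provider-specific CSI driver
allowVolumeExpansion: true     # required before any PVC using this class can grow
```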

5. A node goes NotReady and multiple pods are evicted simultaneously. How do you investigate and prevent recurrence?

1) kubectl describe node — check Conditions for MemoryPressure, DiskPressure, PIDPressure.
2) Check kubelet logs on the node: journalctl -u kubelet.
3) For memory: identify pods without memory limits (they can consume unbounded memory).
4) Prevent: set memory requests and limits on all pods, configure ResourceQuotas per namespace, and consider LimitRanges to enforce defaults.
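A LimitRange that enforces the defaults mentioned in step 4 (namespace and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults
  namespace: production   # hypothetical namespace
spec:
  limits:
    - type: Container
      default:            # applied when a container declares no limit
        memory: 512Mi
      defaultRequest:     # applied when a container declares no request
        memory: 256Mi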

6. HPA shows <unknown>/80% for the CPU target and won't scale. What is wrong?

The HPA cannot read metrics. Common causes:
1) metrics-server is not installed or not running.
2) The target Deployment pods have no CPU requests set — HPA needs requests to compute utilization percentage.
3) metrics-server cannot reach kubelets (firewall or certificate issue). Verify: kubectl top pods (should return data), and kubectl describe hpa for Conditions and events.
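For reference, an autoscaling/v2 HPA targeting CPU utilization (names are hypothetical); the <unknown> reading appears when the target pods declare no CPU requests:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web         # its pods must set resources.requests.cpu
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```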

7. A rolling update is stuck because the PDB minAvailable equals the replica count. Why is this a problem and how do you fix it?

If minAvailable equals replicas (e.g., 3/3), the controller cannot evict any old pod to make room for a new one — a deadlock. Fix: set minAvailable to replicas-1 (or use maxUnavailable: 1 instead). This allows the rolling update to terminate one old pod at a time while maintaining minimum availability.
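The maxUnavailable variant of the fix (names are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1     # avoids the deadlock of minAvailable == replicas
  selector:
    matchLabels:
      app: web
```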

8. You deploy an Envoy sidecar with your app container. After a rolling update, requests fail for a few seconds. What is the likely cause and how do you fix it?

The app container starts receiving traffic before the Envoy sidecar is ready (race condition). Fix:
1) Use a startup/readiness probe on the sidecar.
2) In Kubernetes 1.28+, use the native sidecar feature (restartPolicy: Always in initContainers) which guarantees sidecar readiness before the main container starts.
3) Alternatively, add a postStart lifecycle hook that polls the sidecar's health endpoint.
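A sketch of option 2, the native sidecar pattern (image tags, the app name, and the probe endpoint are assumptions — Envoy's admin readiness endpoint is commonly /ready on port 9901, but verify against your Envoy config):

```yaml
initContainers:
  - name: envoy
    image: envoyproxy/envoy:v1.28-latest
    restartPolicy: Always    # marks this init container as a native sidecar (1.28+)
    startupProbe:
      httpGet:
        path: /ready         # assumed Envoy admin endpoint
        port: 9901
containers:
  - name: app
    image: my-app:1.0        # hypothetical; starts only after the sidecar's startup probe passes
```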

9. Pods from the same Deployment keep landing on the same node, causing a single point of failure. How do you spread them across nodes?

Use pod topology spread constraints: topologySpreadConstraints with topologyKey: kubernetes.io/hostname, maxSkew: 1, and whenUnsatisfiable: DoNotSchedule. Alternatively, use pod anti-affinity with requiredDuringSchedulingIgnoredDuringExecution matching the app label. Topology spread constraints are more flexible than anti-affinity because they allow fine-grained skew control across zones and nodes.
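The constraint from the answer, placed in the pod template spec (the app label is hypothetical):

```yaml
topologySpreadConstraints:
  - maxSkew: 1                            # at most 1 replica difference between nodes
    topologyKey: kubernetes.io/hostname   # spread across nodes; use topology.kubernetes.io/zone for zones
    whenUnsatisfiable: DoNotSchedule      # hard constraint; ScheduleAnyway makes it best-effort
    labelSelector:
      matchLabels:
        app: web                          # hypothetical app label
```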