K8s Advanced Ops
32 cards — 🟢 3 easy | 🟡 13 medium | 🔴 9 hard
🟢 Easy (3)
1. What is the standard three-command diagnostic flow when a pod is misbehaving?
Show answer
1) kubectl describe pod <pod>
2) kubectl logs <pod> (add --previous if the container has restarted)
3) kubectl get events --field-selector involvedObject.name=<pod>
2. A Service has no endpoints even though pods are running. What do you check?
Show answer
kubectl get endpoints <service>. If it is empty, the Service's spec.selector does not match any pod labels, or the matching pods are failing their readiness probe (only Ready pods become endpoints).
3. What is the difference between resource requests and limits, and which one affects scheduling?
Show answer
Requests are guaranteed resources the scheduler uses for bin-packing — a pod is scheduled only if a node has enough allocatable capacity for the request. Limits are the max the container can use; exceeding CPU limits causes throttling, exceeding memory limits causes OOMKill. Best practice: set requests close to actual usage and limits as a safety ceiling.
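A minimal container resources stanza showing the split (values are illustrative):

```yaml
resources:
  requests:          # scheduler uses these for bin-packing decisions
    cpu: "250m"
    memory: "256Mi"
  limits:            # enforcement ceiling: CPU is throttled, memory is OOMKilled
    cpu: "500m"
    memory: "512Mi"
```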
🟡 Medium (13)
1. A Deployment rollout is stuck at 1/3 updated replicas. How do you diagnose it?
Show answer
kubectl rollout status deployment/<name> to confirm it is stuck, then kubectl describe deployment <name> and inspect the new ReplicaSet's pods: common culprits are image pull errors, failing readiness probes, and insufficient cluster capacity for the surge pods.
2. Pods restart every 90 seconds but application logs show no errors. What is the most likely cause?
Show answer
A misconfigured liveness probe. The probe path may return non-200, the port may be wrong, or timeoutSeconds is too short for the endpoint. kubectl describe pod <pod> shows the probe failure events.
3. A new node joins the cluster but no pods schedule onto it. kubectl describe node shows a taint. What do you do?
Show answer
Check the taint with kubectl describe node <node>. If it is unintended, remove it: kubectl taint nodes <node> <key>:<effect>- (the trailing dash removes the taint). If it is intentional, add a matching toleration to the pods that should schedule there.
4. A PVC is stuck in Pending state. What are the common causes?
Show answer
1) No PV matches the PVC's storageClassName, access mode, or capacity request.
2) The StorageClass provisioner is misconfigured or not installed.
3) The cloud provider hit a quota or zone availability limit.
4) The PVC requests a mode (ReadWriteMany) the provisioner does not support. Check kubectl describe pvc <name> for the provisioner's error events.
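Each of the causes above maps to a field in the claim itself; a minimal PVC whose every field must be satisfiable by some PV or provisioner (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim            # illustrative name
spec:
  storageClassName: standard  # must name an installed StorageClass
  accessModes:
    - ReadWriteOnce           # provisioner must support this mode
  resources:
    requests:
      storage: 10Gi           # must fit an available PV or the quota
```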
5. Pods can reach a Service by its ClusterIP, but DNS resolution of the service name fails. How do you debug it?
Show answer
1) Check CoreDNS pods are running: kubectl -n kube-system get pods -l k8s-app=kube-dns.
2) Verify the pod's /etc/resolv.conf points to the kube-dns ClusterIP.
3) Test from inside a pod: nslookup <service>.<namespace>.svc.cluster.local.
4) Check CoreDNS logs for errors.
5) Confirm no NetworkPolicy is blocking UDP/TCP 53 to kube-dns.
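For point 5, a cluster with a default-deny egress policy needs an explicit rule allowing DNS; one possible shape (the empty podSelector applies to all pods in the namespace, and the kubernetes.io/metadata.name label is set automatically on namespaces since Kubernetes 1.22):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns             # illustrative name
spec:
  podSelector: {}             # all pods in this namespace
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```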
6. DNS lookups from pods are slow, adding 5+ seconds of latency. What is the likely cause?
Show answer
The default ndots:5 in the pod's resolv.conf causes the resolver to try every search domain before querying the absolute name. For external domains, each attempt times out against cluster DNS. Fix: set dnsConfig.options ndots:2 in the pod spec, or always use FQDNs with a trailing dot (e.g., api.example.com.) to bypass search expansion.
7. Pods are being evicted with the message 'The node was low on resource: ephemeral-storage'. What is happening?
Show answer
Kubelet's eviction manager detected ephemeral-storage usage crossing its eviction threshold (by default, nodefs.available below 10%). It evicts pods in order of priority and usage. Check node conditions with kubectl describe node <node> (look for DiskPressure) and find the offender: heavy container logs, writable layers, or emptyDir volumes. Setting ephemeral-storage requests and limits on pods prevents recurrence.
8. A container is OOMKilled but the application memory profiler shows usage well below the limit. Why?
Show answer
The OOM limit applies to the entire cgroup, not just heap. It includes RSS, page cache, tmpfs mounts, and child processes. Also, the JVM or runtime may allocate off-heap memory (NIO buffers, thread stacks). Check kubectl describe pod for the exact Last State OOMKilled exit code 137 and compare against the actual RSS with cat /sys/fs/cgroup/memory/memory.usage_in_bytes inside the container (cgroup v1 path; v2 uses memory.current).
9. HPA keeps scaling to max replicas even when average CPU is low. What could cause this?
Show answer
1) One pod is spiking and the average is skewed by replica count.
2) The metric source is wrong (using total CPU instead of per-pod average).
3) Readiness probe failures cause fewer ready pods, inflating per-pod averages.
4) A recent deploy created pods that are initializing and consuming startup CPU. Check kubectl describe hpa and kubectl top pods to correlate.
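For reference, a v2 HPA with a per-pod average utilization target, which is computed as usage divided by the pod's CPU request (names and values are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa               # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # per-pod average of usage/request
```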
10. You try to drain a node but it hangs. kubectl drain shows 'evicting pod ... Cannot evict pod as it would violate the pod's disruption budget'. How do you proceed?
Show answer
A PodDisruptionBudget (PDB) is blocking eviction because draining would reduce available replicas below minAvailable (or push disruptions above maxUnavailable). Options:
1) Scale up the deployment first so draining one node stays within budget.
2) Check whether another node already has disrupted pods and wait for them to become Ready.
3) As a last resort, delete the PDB temporarily (kubectl delete pdb <name>) and recreate it after the drain.
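A budget written with maxUnavailable leaves drain headroom by construction; a sketch (name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb               # illustrative name
spec:
  maxUnavailable: 1           # always allows one voluntary eviction, so drains progress
  selector:
    matchLabels:
      app: web                # assumption: pod label
```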
11. A pod is stuck in Init:0/2 status. How do you debug it?
Show answer
kubectl describe pod <pod> shows which init container is currently running or failing, and kubectl logs <pod> -c <init-container> gives that container's output. Init:0/2 means the first of two init containers has not yet completed; check its exit code, image, and any dependency (a service or volume) it waits on.
12. You create a resource but cannot find it with kubectl get. What namespace-related mistakes should you check?
Show answer
1) The resource was created in a different namespace — always use -n <namespace> or --all-namespaces when searching.
2) Your kubeconfig context has a default namespace set that differs from where the resource lives.
3) The resource is cluster-scoped (e.g., ClusterRole, PV, Node) and does not appear with -n. Check with kubectl api-resources --namespaced=false.
4) RBAC may hide resources you lack permission to list.
13. You delete a pod managed by a Deployment but it immediately reappears. Why?
Show answer
The Deployment's ReplicaSet controller continuously reconciles actual state to desired state. When you delete a pod, the controller detects the replica count is below spec.replicas and creates a replacement. To actually remove the workload, delete or scale down the Deployment (kubectl scale deployment <name> --replicas=0 or kubectl delete deployment <name>).
🔴 Hard (9)
1. You run kubectl rollout undo but the previous version also had issues. How do you roll back to a specific revision?
Show answer
kubectl rollout history deployment/<name> lists revisions (add --revision=<n> to inspect one); then kubectl rollout undo deployment/<name> --to-revision=<n> rolls back to that specific revision.
2. An app takes 120 seconds to initialize. Liveness probe kills it before startup completes. How do you fix this without removing the liveness probe?
Show answer
Add a startup probe with a generous failureThreshold × periodSeconds window (e.g., failureThreshold: 30, periodSeconds: 5 = 150s). The startup probe blocks liveness and readiness probes until it succeeds. This keeps fast-restart protection while allowing slow cold starts.
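A sketch of the probe pairing, assuming an HTTP health endpoint (path and port are assumptions):

```yaml
containers:
- name: app                   # illustrative name
  livenessProbe:
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 3       # fast restarts once the app is up
  startupProbe:
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 5
    failureThreshold: 30      # 30 × 5s = 150s budget for cold start
```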
3. Pods are Pending cluster-wide after a control plane upgrade. All worker nodes show a NoSchedule taint. What happened and how do you recover?
Show answer
The upgrade likely re-applied node-role.kubernetes.io/control-plane:NoSchedule and may have incorrectly tainted workers. Verify with kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints. Remove errant taints from workers: kubectl taint nodes <node> node-role.kubernetes.io/control-plane:NoSchedule- (the trailing dash removes the taint).
4. You need to expand a PVC from 10Gi to 50Gi but the pod won't restart. What is the process?
Show answer
1) Verify the StorageClass has allowVolumeExpansion: true.
2) Edit the PVC: kubectl patch pvc <name> -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'.
3) For file-system volumes, the resize happens on the next pod mount — you must delete and recreate the pod (not the PVC).
4) Check kubectl describe pvc for the FileSystemResizePending condition.
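Expansion is opt-in at the StorageClass level; a sketch, assuming the AWS EBS CSI provisioner (name and provisioner are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable            # illustrative name
provisioner: ebs.csi.aws.com  # assumption: AWS EBS CSI driver
allowVolumeExpansion: true    # required before any PVC resize is accepted
```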
5. A node goes NotReady and multiple pods are evicted simultaneously. How do you investigate and prevent recurrence?
Show answer
1) kubectl describe node — check Conditions for MemoryPressure, DiskPressure, PIDPressure.
2) Check kubelet logs on the node: journalctl -u kubelet.
3) For memory: identify pods without memory limits (they can consume unbounded memory).
4) Prevent: set memory requests and limits on all pods, configure ResourceQuotas per namespace, and consider LimitRanges to enforce defaults.
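For point 4, a LimitRange can supply defaults for containers that declare nothing (values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults          # illustrative name
spec:
  limits:
  - type: Container
    default:                  # applied as the limit when none is set
      memory: 512Mi
    defaultRequest:           # applied as the request when none is set
      memory: 256Mi
```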
6. HPA shows <unknown> in the TARGETS column and never scales. What do you check?
Show answer
The HPA cannot read metrics. Common causes:
1) metrics-server is not installed or not running.
2) The target Deployment pods have no CPU requests set — HPA needs requests to compute utilization percentage.
3) metrics-server cannot reach kubelets (firewall or certificate issue). Verify: kubectl top pods (should return data), and kubectl describe hpa <name> for FailedGetResourceMetric events.
7. A rolling update is stuck because the PDB minAvailable equals the replica count. Why is this a problem and how do you fix it?
Show answer
If minAvailable equals replicas (e.g., 3/3), the controller cannot evict any old pod to make room for a new one — a deadlock. Fix: set minAvailable to replicas-1 (or use maxUnavailable: 1 instead). This allows the rolling update to terminate one old pod at a time while maintaining minimum availability.
8. You deploy an Envoy sidecar with your app container. After a rolling update, requests fail for a few seconds. What is the likely cause and how do you fix it?
Show answer
The app container starts receiving traffic before the Envoy sidecar is ready (a race condition). Fix:
1) Use a startup/readiness probe on the sidecar.
2) In Kubernetes 1.28+, use the native sidecar feature (restartPolicy: Always in initContainers) which guarantees sidecar readiness before the main container starts.
3) Alternatively, add a postStart lifecycle hook that polls the sidecar's health endpoint.
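A sketch of option 2, the native sidecar pattern (Kubernetes 1.28+), where the kubelet waits for the sidecar's startup probe before starting the main container; image tags, port, and probe path are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar      # illustrative name
spec:
  initContainers:
  - name: envoy               # restartPolicy: Always makes this a native sidecar
    image: envoyproxy/envoy:v1.29.0   # illustrative tag
    restartPolicy: Always
    startupProbe:
      httpGet: { path: /ready, port: 9901 }  # assumption: Envoy admin readiness endpoint
  containers:
  - name: app
    image: myapp:1.0          # illustrative image
```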
9. Pods from the same Deployment keep landing on the same node, causing a single point of failure. How do you spread them across nodes?
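A topology spread constraint in the Deployment's pod template spreads replicas across nodes; a sketch, assuming the pods carry the label app: web:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread across individual nodes
    whenUnsatisfiable: DoNotSchedule      # hard constraint; use ScheduleAnyway for soft
    labelSelector:
      matchLabels:
        app: web                          # assumption: pod label
```

Pod anti-affinity (podAntiAffinity with topologyKey kubernetes.io/hostname) achieves a similar effect.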