Grokdevops Training

50 cards: 🟢 11 easy | 🟡 29 medium | 🔴 10 hard

🟢 Easy (11)

1. A pod is in ImagePullBackOff. Can you check its logs? Why or why not?

Show answer

No. The container never started, so there are no container logs. Use `kubectl describe pod <pod>` to see the pull error in Events. Check the image name: `kubectl get deploy <deployment> -o jsonpath='{.spec.template.spec.containers[0].image}'`

2. When should you use kubectl logs --previous?

Show answer

When a container has crashed and restarted. Current logs may be empty (new container just started). --previous shows logs from the LAST terminated container instance. Only works if there was a previous instance.

3. What is the correct order for a break/fix debugging cycle?

Show answer

1. Observe symptoms (kubectl get pods, logs, events)
2. Form hypothesis (ranked by likelihood)
3. Test hypothesis (targeted command)
4. Fix (minimal change)
5. Verify (confirm symptom is gone)
6. Teardown (clean up any debug artifacts)

4. What two flags should chaos scripts support for safety?

Show answer

--dry-run (preview what would change without applying) and --yes (explicit confirmation required before destructive action). This prevents accidental chaos in production.

Remember: --dry-run = preview, --yes = confirm. Both flags together mean show me what would happen, then do it.

Analogy: Like a pilot pre-flight checklist — never skip the safety checks before introducing controlled failure.
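The two safety flags can be sketched with Python's `argparse`. This is a minimal illustration, not the training system's actual script: the function name `plan_actions` and the step names it returns are hypothetical, and the behaviour when both flags are combined follows this card's reading (preview, then apply).

```python
import argparse

def plan_actions(argv):
    """Return the ordered steps a chaos script would take for the given flags."""
    parser = argparse.ArgumentParser(description="Hypothetical chaos script")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview what would change without applying")
    parser.add_argument("--yes", action="store_true",
                        help="explicitly confirm the destructive action")
    args = parser.parse_args(argv)

    steps = []
    if args.dry_run:
        steps.append("preview")
    if args.yes:
        steps.append("apply")
    if not steps:
        # Neither flag given: refuse rather than guess the operator's intent.
        steps.append("refuse")
    return steps

print(plan_actions([]))
print(plan_actions(["--dry-run"]))
print(plan_actions(["--dry-run", "--yes"]))
```

Refusing by default means a bare invocation can never break anything; the operator has to opt in to either a preview or the real action.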

5. How do you test a Helm upgrade before applying it?

Show answer

`helm upgrade <release> <chart> -f values.yaml --dry-run` renders templates and validates against the K8s API without creating any resources. Add `--debug` for verbose template output.

Gotcha: --dry-run validates against the cluster API but does NOT create resources. Use `helm template` for offline rendering without cluster access.

Remember: --dry-run + --debug = full rendered output with values. Essential for debugging template issues.

6. How does a Kubernetes Service know which pods to route traffic to?

Show answer

Via label selectors. The Service's spec.selector must match labels on the target pods. Only pods that are Ready (readiness probe passing) are included in the Service's endpoints.

Remember: Service selector to Pod labels to Endpoints. A mismatch at any point = no traffic routing.

Debug clue: `kubectl get endpoints <service>` shows which pod IPs are in the pool. Empty = selector mismatch or no ready pods.
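The selector-to-endpoints chain can be modeled in a few lines of Python. This is a simplified sketch, assuming equality-based selectors only; the pod names, IPs, and the helper names `selector_matches` and `endpoints_for` are made up for illustration.

```python
def selector_matches(selector, pod_labels):
    """True when every key/value in the Service selector appears in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())

def endpoints_for(selector, pods):
    """Return the IPs of pods that both match the selector and are Ready."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"] and selector_matches(selector, pod["labels"])
    ]

# Hypothetical pods: one healthy match, one unready match, one label mismatch.
pods = [
    {"ip": "10.42.0.5", "ready": True, "labels": {"app": "grokdevops"}},
    {"ip": "10.42.0.6", "ready": False, "labels": {"app": "grokdevops"}},
    {"ip": "10.42.0.7", "ready": True, "labels": {"app": "other"}},
]

print(endpoints_for({"app": "grokdevops"}, pods))
print(endpoints_for({"app": "grokdevop"}, pods))  # typo in selector: empty pool
```

Only the first pod ends up in the pool, which mirrors the two causes of an empty endpoints list named above: a selector mismatch or no ready pods.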

7. How do you make a locally-built Docker image available in k3s?

Show answer

`docker save <image>:<tag> | sudo k3s ctr images import -`. k3s uses containerd (not Docker), so Docker images must be explicitly imported. Alternative: push to a registry and pull.

Gotcha: k3s uses containerd, not Docker daemon. `docker images` and `k3s crictl images` are separate image stores.

Remember: In production, always use a registry (Docker Hub, GHCR, ECR). Local import is for dev/testing only.

8. How do you test if a service account has a specific permission?

Show answer

`kubectl auth can-i <verb> <resource> -n <namespace> --as=system:serviceaccount:<namespace>:<name>`. Returns 'yes' or 'no'. Example: `kubectl auth can-i list pods -n grokdevops --as=system:serviceaccount:grokdevops:default`

Remember: `kubectl auth can-i --list` shows ALL permissions for the current user. Add -n for namespace-scoped.

Gotcha: Service account format is `system:serviceaccount:<namespace>:<name>`. Missing the prefix = wrong identity.

9. How do you see what Kubernetes manifests a Helm chart would generate without deploying?

Show answer

`helm template <release> <chart> -f values.yaml` renders all templates to stdout. No cluster interaction needed. Useful for debugging template issues before upgrade.

Remember: `helm template` = offline rendering (no cluster needed). `helm install --dry-run` = server-side rendering (validates against cluster API).

Gotcha: `helm template` does not evaluate lookup functions — they always return empty without a cluster connection.

10. What is the Loki query to see logs from the grokdevops namespace?

Show answer

`{namespace="grokdevops"}`. Add filters: `{namespace="grokdevops"} |= "error"` for lines containing 'error'. Use `|~` for regex matching.

Remember: LogQL syntax: {label="value"} for stream selection, |= for contains, != for exclude, |~ for regex. Pipe operators chain left to right.

Gotcha: Label values must be quoted in LogQL. Unquoted values cause parse errors.
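A small helper makes the quoting rule hard to forget. This is a sketch only: `logql_query` is a hypothetical function, and it assumes label values and filter strings contain no double quotes or backslashes that would need escaping.

```python
def logql_query(labels, contains=None, regex=None):
    """Build a LogQL query: a quoted stream selector plus optional line filters."""
    selector = ", ".join(f'{name}="{value}"' for name, value in labels.items())
    query = "{" + selector + "}"
    if contains is not None:
        query += f' |= "{contains}"'  # keep lines containing the string
    if regex is not None:
        query += f' |~ "{regex}"'     # keep lines matching the regex
    return query

print(logql_query({"namespace": "grokdevops"}))
print(logql_query({"namespace": "grokdevops"}, contains="error"))
```

The second call prints `{namespace="grokdevops"} |= "error"`, the same filter shown in the answer above.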

11. How do you see recent events in a namespace, sorted by time?

Show answer

`kubectl get events -n grokdevops --sort-by='.lastTimestamp' | tail -20`. Events are ephemeral (default TTL: 1 hour). Check early in your investigation or evidence may be gone.

Gotcha: K8s events have a default TTL of 1 hour. If you investigate too late, evidence is gone. Capture events early in triage.

Remember: For persistent event storage, send events to a log aggregator (Loki, Elasticsearch) via an events exporter.

🟡 Medium (29)

1. A pod is in CrashLoopBackOff. What are your first 3 commands?

Show answer

1. `kubectl get pods -n grokdevops` -- check status and restart count
2. `kubectl logs -n grokdevops deploy/grokdevops --previous` -- get crash logs
3. `kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops` -- check events and exit code

2. What does exit code 137 mean in a Kubernetes pod?

Show answer

Exit code 137 = 128 + 9 (SIGKILL). In Kubernetes this usually means the Linux kernel's OOM killer terminated the container because it exceeded its cgroup memory limit, although any SIGKILL produces the same code. Confirm with: `kubectl describe pod <pod> | grep OOMKilled`
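The 128 + signal arithmetic is easy to check with Python's standard `signal` module. The helper name `describe_exit_code` is hypothetical, and the signal names assume a Linux host.

```python
import signal

def describe_exit_code(code):
    """Translate a container exit code into a short human-readable description."""
    if code == 0:
        return "exited normally"
    if code > 128:
        # Codes above 128 conventionally mean "terminated by signal (code - 128)".
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit status {code} (no matching signal)"
    return f"application error (exit status {code})"

print(describe_exit_code(137))
print(describe_exit_code(143))
print(describe_exit_code(1))
```

137 decodes to SIGKILL and 143 to SIGTERM, so the same rule also explains the exit code of a container that was stopped gracefully.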

3. What is the difference between readiness and liveness probes in terms of what K8s does when they fail?

Show answer

Readiness probe failure: pod is removed from Service endpoints (no traffic routed to it), but pod keeps running. Liveness probe failure: kubelet restarts the container. Key insight: readiness gates traffic, liveness gates lifecycle.

4. A deployment shows 'Progressing' for 20 minutes. What's happening to old pods?

Show answer

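Old ReplicaSet pods continue serving traffic. During a RollingUpdate, K8s won't terminate old pods (beyond the maxUnavailable budget) until new ones pass readiness probes. If new pods never become ready, the remaining old pods stay running indefinitely. Once progressDeadlineSeconds (default 600s) elapses, the Deployment is only flagged as having failed to progress; nothing is rolled back or removed.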

5. A Service shows 0 endpoints. What do you check?

Show answer

1. `kubectl get endpoints -n grokdevops` -- confirm empty
2. Compare service selector to pod labels: `kubectl get svc <service> -o yaml | grep -A3 selector` vs `kubectl get pods --show-labels`
3. Check if pods are Ready (unready pods are excluded from endpoints)

6. How does helm rollback work under the hood?

Show answer

Helm reads the manifests from a previous revision (stored as a Secret in the namespace), and re-applies them via a 3-way merge. Rollback creates a NEW revision -- it doesn't delete the failed one. The revision history is append-only.

7. Logs stopped appearing in Grafana Loki. What's the first thing to check?

Show answer

Check if Promtail pods are running: `kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail`. Promtail is a DaemonSet that collects logs from nodes. If pods are missing, check the DaemonSet for nodeSelector/toleration issues.

8. In a namespace with no NetworkPolicies, what traffic is allowed?

Show answer

All traffic (ingress and egress) is allowed by default. NetworkPolicies are additive: once the first policy selects a pod for a direction, everything in that direction that no policy explicitly allows is denied for that pod. Pods that no policy selects stay fully open.

Remember: No NetworkPolicy = allow all. First policy = default deny for that direction. Most counter-intuitive K8s networking concept.

Gotcha: NetworkPolicies require a CNI that supports them (Calico, Cilium). Flannel does NOT enforce NetworkPolicies.
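The allow-by-default and additive behaviour can be modeled in Python. This is a deliberately reduced sketch: each policy is represented only as a set of allowed (protocol, port) pairs, destination selectors are ignored, and the function name `egress_allowed` is hypothetical.

```python
def egress_allowed(policies, protocol, port):
    """Decide whether a pod may open an outbound connection.

    `policies` holds only the egress policies that select this pod; each
    policy is modeled as a set of allowed (protocol, port) pairs.
    """
    if not policies:
        # No policy selects the pod, so the default applies: allow everything.
        return True
    # Policies are additive: one matching rule in any policy is enough.
    return any((protocol, port) in allowed for allowed in policies)

web_only = {("TCP", 443)}
dns = {("UDP", 53), ("TCP", 53)}

print(egress_allowed([], "UDP", 53))               # True
print(egress_allowed([web_only], "UDP", 53))       # False
print(egress_allowed([web_only, dns], "UDP", 53))  # True
```

The middle case is the classic surprise: adding a policy that allows only HTTPS silently blocks DNS until a second rule permits port 53.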

9. What is the default RBAC behavior in Kubernetes?

Show answer

Deny-by-default. Without a Role/ClusterRole + RoleBinding/ClusterRoleBinding granting access, all API requests are denied. Use `kubectl auth can-i <verb> <resource> --as=<user>` to test permissions.

Remember: RBAC = deny by default. NetworkPolicy = allow by default. Opposite defaults — a common source of confusion.

Gotcha: The system:anonymous user and the system:unauthenticated group have some discovery permissions by default.

10. What is the full DNS name (FQDN) for a service called 'grokdevops' in namespace 'grokdevops'?

Show answer

grokdevops.grokdevops.svc.cluster.local. Format: `<service>.<namespace>.svc.cluster.local`. Short names work within the same namespace due to search domains in /etc/resolv.conf.

Remember: DNS format: `<service>.<namespace>.svc.cluster.local`. Within the same namespace, just the service name works.

Gotcha: Cross-namespace calls need at least `<service>.<namespace>`. The full FQDN with trailing dot bypasses search domain expansion.

11. Most CRITICAL CVEs in a Trivy scan come from what layer?

Show answer

The OS base image layer (apt/apk packages). The most impactful fix is usually updating the base image (e.g., python:3.12-slim-bookworm instead of python:3.9-slim-buster), not patching individual packages.

Remember: Fix base image first, then application dependencies. In most scans the bulk of CVEs comes from the OS layer.

Gotcha: Distroless and Alpine images have far fewer CVEs than Debian/Ubuntu base images. Consider switching for production.

12. What causes 'configuration drift' in a GitOps-managed cluster?

Show answer

Manual changes via kubectl (scale, edit, set env) that bypass the Git-managed desired state. In a GitOps setup (ArgoCD/Flux), these changes will be reverted on the next reconciliation loop. Fix: always go through Git, never kubectl directly in production.

13. What does the --atomic flag do in helm upgrade?

Show answer

If the upgrade fails (pods don't become ready within --timeout), Helm automatically rolls back to the previous revision. Without --atomic, a failed upgrade leaves the release in 'failed' state and you must manually rollback.

14. A slow-starting application keeps getting killed by the liveness probe. What setting do you adjust?

Show answer

Increase initialDelaySeconds on the liveness probe, or better: use a startupProbe. startupProbe disables liveness and readiness probes until the app signals it has started. This is preferred for slow-starting apps (e.g., Java).

15. What's the difference between pathType: Prefix and pathType: Exact in an Ingress rule?

Show answer

Prefix: matches the URL path prefix (e.g., /api matches /api, /api/v1, /api/users). Exact: matches only the exact path (e.g., /api matches only /api, not /api/v1). Most apps should use Prefix.
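The difference is easiest to see when Prefix matching is done per path segment rather than per character. The sketch below follows the Kubernetes Ingress rules in simplified form (it ignores host matching and rule precedence); `path_matches` is a hypothetical helper.

```python
def path_matches(rule_path, path_type, request_path):
    """Check a request path against an Ingress rule path for the given pathType."""
    if path_type == "Exact":
        return request_path == rule_path
    if path_type == "Prefix":
        # Compare segment by segment, so /api matches /api/v1 but not /apiv1.
        rule_parts = [part for part in rule_path.split("/") if part]
        request_parts = [part for part in request_path.split("/") if part]
        return request_parts[:len(rule_parts)] == rule_parts
    raise ValueError(f"unsupported pathType: {path_type}")

print(path_matches("/api", "Prefix", "/api/v1"))  # True
print(path_matches("/api", "Prefix", "/apiv1"))   # False
print(path_matches("/api", "Exact", "/api/v1"))   # False
```

The `/apiv1` case is the one that trips people up: Prefix is not a plain string prefix test.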

16. kubectl top pods returns 'error: Metrics API not available'. What's missing?

Show answer

metrics-server is not installed. Install: `kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml`. For k3s, also add `--kubelet-insecure-tls` arg.

Gotcha: metrics-server requires TLS access to kubelets. In k3s/minikube, add --kubelet-insecure-tls to bypass self-signed cert issues.

Remember: metrics-server provides the Metrics API for `kubectl top` and HPA. Without it, neither works.

17. What should an incident forensics bundle contain?

Show answer

Pod status, events, logs (current + previous), describe output, resource usage (top), HPA status, and service endpoints. Capture BEFORE fixing so evidence isn't lost. Use: `make incident-forensics`

Remember: Capture BEFORE fixing — evidence disappears after restart. The forensics bundle is your incident black box recorder.

Gotcha: `kubectl logs --previous` only works if the container has restarted. Capture current logs too.

18. What's the difference between resource requests and limits?

Show answer

Request: guaranteed minimum. Used for scheduling (K8s places pod on node with enough capacity). Limit: maximum allowed. Enforced by cgroups (CPU throttled, memory OOMKilled). A pod can use more CPU than requested (up to limit) but will be killed if it exceeds memory limit.

19. How do you verify CoreDNS is working?

Show answer

1. Check pods: `kubectl get pods -n kube-system -l k8s-app=kube-dns`
2. Test resolution: `kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local`
3. Check logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`

20. What is progressDeadlineSeconds and what's the default?

Show answer

Default: 600 seconds (10 minutes). If a deployment's rollout doesn't make progress for this duration, K8s sets the Deployment's Progressing condition to False with reason ProgressDeadlineExceeded. 'Progress' means at least one new pod became Ready. This doesn't auto-rollback -- it just updates the status condition.

21. How often does Prometheus scrape targets by default?

Show answer

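It depends on the setup. Upstream Prometheus defaults to a global scrape_interval of 1 minute; the Prometheus Operator, which ServiceMonitors imply, defaults to 30 seconds. After changing ServiceMonitor labels, wait at least one scrape interval for Prometheus to detect the new config and scrape the target.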

Gotcha: Changing scrape interval affects PromQL functions like rate(). Use $__rate_interval in Grafana to auto-adjust.

Remember: 15s interval = more granularity but more storage. 60s = less storage but may miss short spikes.

22. What's the key architectural difference between how Prometheus/Promtail collect data vs Tempo?

Show answer

Prometheus scrapes (pulls) metrics from /metrics endpoints. Promtail tails (pushes) log files to Loki. Tempo is receive-only: the application must push traces via OTLP protocol. Tempo doesn't collect -- it waits.

23. During a RollingUpdate, when does K8s terminate old pods?

Show answer

Beyond the maxUnavailable budget, only after new pods pass readiness probes and are added to Service endpoints. maxUnavailable controls how many pods can be unavailable at once during the rollout. maxSurge controls how many extra pods can exist above the desired count. Default: 25% each.
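To see what 25% means for a concrete replica count, the percentages can be converted to pod numbers. The sketch assumes the rounding described in the Kubernetes Deployment documentation (maxSurge rounds up, maxUnavailable rounds down); the function name `rolling_update_budget` is hypothetical.

```python
def rolling_update_budget(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Convert percentage settings into absolute pod counts for a rollout."""
    # Integer arithmetic avoids floating-point rounding surprises.
    max_surge = -(-replicas * max_surge_pct // 100)           # round up
    max_unavailable = replicas * max_unavailable_pct // 100   # round down
    return {
        "max_surge": max_surge,
        "max_unavailable": max_unavailable,
        "max_total_pods": replicas + max_surge,
        "min_available_pods": replicas - max_unavailable,
    }

print(rolling_update_budget(10))
print(rolling_update_budget(4))
```

With 10 replicas and the defaults, a rollout may run up to 13 pods in total and must keep at least 8 available.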

24. What is the recommended investigation loop in this training system?

Show answer

1. `make incident YES=1` -- inject a random failure
2. `make investigate` -- see step-by-step investigation plan
3. Use kubectl/helm to gather evidence
4. `make hint` if stuck (progressive hints 1-4)
5. Fix the issue
6. `make incident-resolve` -- mark resolved and record time

25. A DaemonSet shows DESIRED=0. What's wrong?

Show answer

No nodes are eligible: none match the DaemonSet's nodeSelector/affinity, or every node carries a taint the pods don't tolerate. Check: `kubectl get daemonset <name> -o yaml | grep -A5 nodeSelector`. Remove the impossible selector or add matching labels to nodes.

Debug clue: DESIRED=0 means the DaemonSet controller found zero eligible nodes. Check nodeSelector, tolerations, and node labels.

Remember: DaemonSets run one pod per matching node. 0 matching nodes = 0 desired pods.

26. How do you extract a specific field from a Kubernetes resource using kubectl?

Show answer

`kubectl get <resource> <name> -o jsonpath='{.spec.template.spec.containers[0].image}'`. Use python3 -m json.tool to format JSON output. For multiple fields, use custom-columns: `-o custom-columns='NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image'`

Remember: JSONPath starts with . for the root. Array access: [0]. Wildcard: [*]. Recursive: ..

Gotcha: JSONPath in kubectl uses single quotes around the expression. Escape carefully in shell scripts.

27. You updated a ConfigMap. Why don't pods see the new values?

Show answer

Pods using ConfigMap via envFrom or env don't auto-restart when the ConfigMap changes. You must restart the deployment: `kubectl rollout restart deployment/<name>`. ConfigMaps mounted as volumes DO auto-update (after kubelet sync period, ~1 minute), but the app must re-read the file.

28. An Ingress resource exists but returns 404. First thing to check after the Ingress spec?

Show answer

Check if the ingress controller is running: `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik` (k3s default) or `-l app.kubernetes.io/name=ingress-nginx`. No controller = Ingress resources are ignored.

Remember: Ingress resource = routing rules. Ingress controller = software that implements them. Without a controller, rules are ignored.

Gotcha: k3s ships with Traefik by default. If you disabled it at install (--disable traefik), install your own.

29. In helm rollback <release> 0, what does revision 0 mean?

Show answer

Revision 0 means 'roll back to the previous revision' (one before current). It's a shortcut. To roll back to a specific revision, use: `helm rollback <release> <revision>`

Gotcha: After rollback, always verify with `helm status` and `kubectl get pods`. Rollback creates a NEW revision — it does not delete the failed one.

Remember: `helm history <release>` shows all revisions including rollbacks.

🔴 Hard (10)

1. HPA shows '<unknown>/50%' for CPU. What are the two most likely causes?

Show answer

1. metrics-server is not installed (no CPU metrics available). Check: `kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes`
2. Deployment has no CPU resource requests defined. HPA calculates percentage as actual/request * 100 -- no request = undefined.
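The percentage calculation, and why a missing request leaves it undefined, can be sketched as follows. The function names are hypothetical, usage is given for a single pod in millicores, and the replica formula is the one from the Kubernetes HPA documentation with its tolerance band left out for brevity.

```python
import math

def cpu_utilization_pct(usage_millicores, request_millicores):
    """CPU utilization as a percentage of the request, or None if undefined."""
    if not request_millicores:
        # No request set: there is nothing to divide by, so HPA reports <unknown>.
        return None
    return usage_millicores / request_millicores * 100

def desired_replicas(current_replicas, current_pct, target_pct):
    """Replica count the HPA would aim for, ignoring the tolerance band."""
    return math.ceil(current_replicas * current_pct / target_pct)

print(cpu_utilization_pct(150, 200))   # 75.0
print(cpu_utilization_pct(150, None))  # None
print(desired_replicas(2, 75.0, 50))   # 3
```

A pod using 150m against a 200m request sits at 75%, so with a 50% target two replicas would be scaled to three.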

2. A Helm release is stuck in 'pending-upgrade' state. How do you recover?

Show answer

Run: `helm rollback <release> <revision> -n <namespace>`. The pending-upgrade state means the previous upgrade never completed. Rolling back to a known-good revision resets the state.

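Gotcha: If rollback also fails, you may need to manually delete the broken Helm secret: `kubectl delete secret sh.helm.release.v1.<release>.v<revision> -n <namespace>`.

Remember: Helm stores release state in Kubernetes Secrets. A stuck state means the newest release Secret still records a pending status, typically because the Helm process was interrupted mid-operation.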

3. Prometheus shows no data for your app. ServiceMonitor exists. What's the most likely issue?

Show answer

Label selector mismatch. Two things must match: (1) Prometheus's serviceMonitorSelector must find the ServiceMonitor, (2) ServiceMonitor's spec.selector.matchLabels must match the Service's labels. Check both: `kubectl get servicemonitor -o yaml` and `kubectl get svc --show-labels`

4. You applied a NetworkPolicy and now DNS doesn't work. Why?

Show answer

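Adding a NetworkPolicy enables default-deny for the specified direction (ingress/egress) on the pods it selects. If you created an egress policy without allowing port 53 to the cluster DNS pods in kube-system, DNS queries are blocked. Fix: add an egress rule allowing port 53 over both UDP and TCP (DNS falls back to TCP for large responses), either to kube-system or to any namespace.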

5. What's the difference between OOMKilled and pod eviction?

Show answer

OOMKilled: container exceeded its cgroup memory limit (set by resources.limits.memory). Kernel kills the process (exit 137). Eviction: kubelet removes pods when NODE memory is under pressure. OOMKilled is per-container; eviction is per-node.

6. Where does Helm store release information?

Show answer

As Secrets in the release's namespace (default driver). Each revision is a separate Secret named `sh.helm.release.v1.<release>.v<revision>`. This is why `helm list` works without a local state file -- it reads from the cluster.

7. What is the ndots setting in /etc/resolv.conf and why does it matter?

Show answer

Default ndots:5 in K8s means any name with fewer than 5 dots is first tried with search domain suffixes before absolute lookup. This means 'google.com' (1 dot) gets tried as google.com.grokdevops.svc.cluster.local first. Can cause slow DNS if external names are used frequently.
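The lookup order the resolver follows can be reproduced in a few lines. This is a simplified model of the resolver's search-list behaviour, assuming the three search domains a pod in the grokdevops namespace would typically get; `lookup_order` is a hypothetical helper.

```python
def lookup_order(name, search_domains, ndots=5):
    """Return the candidate names in the order the resolver would try them."""
    if name.endswith("."):
        # Trailing dot: the name is already absolute, so no search expansion.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") < ndots:
        # Too few dots: try every search domain before the name itself.
        return expanded + [name]
    return [name] + expanded

search = ["grokdevops.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in lookup_order("google.com", search):
    print(candidate)
print(lookup_order("google.com.", search))
```

For `google.com` the absolute name is only the fourth attempt, which is where the extra latency comes from; the trailing-dot form skips the three failed lookups.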

8. How does Kubernetes enforce memory limits at the Linux level?

Show answer

K8s sets a cgroup memory limit for each container (memory.limit_in_bytes on cgroup v1, memory.max on cgroup v2). When the container's memory usage exceeds this limit and cannot be reclaimed, the kernel's OOM killer sends SIGKILL (signal 9). Exit code = 128+9 = 137. This is a kernel-level enforcement, not a K8s decision.

9. How long does HPA wait before scaling down after load decreases?

Show answer

Default stabilization window for scale-down is 5 minutes (300 seconds). This prevents flapping. Scale-up has no stabilization window by default (0 seconds), so it can act on the next controller sync, which runs every 15 seconds by default. Both windows are configurable via HPA behavior spec.

Remember: Scale-up = next 15s sync by default, scale-down = 5min default. Asymmetric by design to prevent flapping.

Gotcha: HPA behavior spec (K8s 1.18+) allows customizing scale-up/down policies, stabilization windows, and rate limits.
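The scale-down stabilization window works by remembering recent recommendations and acting on the highest one. The sketch below models just that rule; the function name `stabilized_scale_down` and the sample timeline are hypothetical.

```python
def stabilized_scale_down(recommendations, now, window_seconds=300):
    """Pick the replica count to apply when scaling down.

    `recommendations` is a list of (timestamp_seconds, replica_count) pairs.
    Only entries inside the stabilization window count, and the highest wins.
    """
    recent = [count for ts, count in recommendations if now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendation inside the stabilization window")
    return max(recent)

# Load drops over four minutes: 10 -> 6 -> 3 recommended replicas.
history = [(0, 10), (120, 6), (240, 3)]

print(stabilized_scale_down(history, now=240))  # 10
print(stabilized_scale_down(history, now=400))  # 6
```

At t=240 the older recommendation of 10 is still inside the window, so nothing is removed yet; only once it ages out does the replica count start to fall.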

10. A pod is Running but not Ready. Does it receive traffic from the Service?

Show answer

No. Kubernetes removes non-Ready pods from Service endpoints. The kube-proxy/iptables rules won't route traffic to pods that haven't passed their readiness probe. The pod stays running but is effectively invisible to the Service.