
Kubernetes Debugging Playbook — Trivia & Interesting Facts

Surprising, historical, and little-known facts about debugging Kubernetes.


kubectl describe is more useful than kubectl get for debugging 90% of the time

kubectl get shows current state. kubectl describe shows current state plus the event stream — the sequence of scheduling decisions, image pulls, probe failures, mount errors, and OOM kills that explain how the resource reached its current state. Events are stored for only 1 hour by default (configurable via --event-ttl on kube-apiserver), so debugging must happen promptly or the evidence vanishes.
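A typical first-look sequence, sketched with illustrative names (pod web-0, namespace prod):

```shell
# Current state only: phase, restart count, age
kubectl get pod web-0 -n prod

# Current state plus the event stream: scheduling decisions,
# image pulls, probe failures, mount errors, OOM kills
kubectl describe pod web-0 -n prod

# Events expire after the TTL; on self-managed control planes the
# window can be widened with the API server flag, e.g. --event-ttl=4h
```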


Kubernetes events are not logs — they are first-class API objects

Events are stored in etcd as Event resources, not in a log file. You can query them with kubectl get events --sort-by=.metadata.creationTimestamp, filter by namespace, and even set up alerts on specific event reasons. However, because events are stored in etcd, a cluster with very high event volume can put significant pressure on etcd, which is why the default TTL is only 1 hour.
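Because events are ordinary API objects, the usual list-and-filter machinery applies. A couple of command sketches:

```shell
# All events cluster-wide, oldest first, most recent at the bottom
kubectl get events -A --sort-by=.metadata.creationTimestamp

# Narrow to a single reason, e.g. scheduling failures in one namespace
kubectl get events -n prod --field-selector reason=FailedScheduling
```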


The "Pending" pod state is almost always a scheduling problem

When a pod is stuck in Pending, the scheduler cannot find a node that satisfies its requirements. The top causes are: insufficient CPU/memory on any node, node selectors or affinity rules matching zero nodes, PersistentVolumeClaim bound to a PV in a different availability zone, and taints with no matching tolerations. kubectl describe pod shows the scheduler's exact reason in the Events section.
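As an illustration of the last cause, a minimal pod sketch that tolerates a hypothetical dedicated=gpu taint (all names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job            # illustrative name
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
  tolerations:             # without this block, a node tainted
  - key: "dedicated"       # dedicated=gpu:NoSchedule is excluded from
    operator: "Equal"      # scheduling and the pod can sit in Pending
    value: "gpu"
    effect: "NoSchedule"
```

With the tolerations block removed, kubectl describe pod would show a FailedScheduling event naming the untolerated taint.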


DNS resolution failures cause more mysterious Kubernetes outages than network partitions

CoreDNS is the default DNS server in Kubernetes, and even though it usually runs with multiple replicas, it is a critical shared dependency for all service discovery. If CoreDNS pods are overwhelmed, under-resourced, or misconfigured, every service-to-service call in the cluster fails with cryptic connection timeout errors. The debugging path (kubectl exec -it <pod> -- nslookup kubernetes.default) is simple but non-obvious, and many teams spend hours investigating application code before checking DNS.
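The check sequence, sketched as commands (the k8s-app=kube-dns label is what CoreDNS pods carry even though the server is CoreDNS, a historical naming quirk):

```shell
# First question in any "connection timed out" mystery: does DNS work at all?
kubectl exec -it <pod> -- nslookup kubernetes.default

# If that fails, inspect the CoreDNS pods themselves
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```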


kubectl debug can attach to nodes, not just pods

kubectl debug node/<node-name> creates a pod on the specified node with the host filesystem mounted at /host, giving you a shell on any node without SSH access. The feature is especially valuable on managed Kubernetes services (EKS, GKE, AKS) where SSH to nodes is impossible or discouraged. The debug pod shares the host's process and network namespaces (hostPID: true, hostNetwork: true) and bind-mounts the node's root filesystem.
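A command sketch (busybox is an arbitrary image choice; any image with a shell works):

```shell
# Interactive debugging pod on the node; an --image must be supplied
kubectl debug node/<node-name> -it --image=busybox

# Inside the debug pod, the node's filesystem is mounted at /host;
# chroot into it to act as if logged in on the node itself
chroot /host
```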


The most common kubectl typo has a dedicated alias

kubectl is so frequently mistyped that k is the alias recommended in the official documentation, which includes instructions for setting up alias k=kubectl and extending bash/zsh completion to cover the alias. Power users go further: kgp for kubectl get pods, kd for kubectl describe, and kl for kubectl logs. The kubectl name is itself relatively young: it replaced kubecfg, the original Kubernetes CLI.
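The documented setup plus the common extensions, as shell rc-file lines (the completion line is the bash variant from the Kubernetes docs; zsh differs):

```shell
alias k=kubectl
complete -o default -F __start_kubectl k   # make bash completion follow the alias
alias kgp='kubectl get pods'
alias kd='kubectl describe'
alias kl='kubectl logs'
```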


JSONPath and custom-columns in kubectl are criminally underused

kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' extracts specific fields without piping through jq. Custom columns (kubectl get pods -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount) produce clean tabular output. These features eliminate the fragile grep | awk | cut pipelines that break whenever output formatting changes.
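Beyond single-field extraction, jsonpath has a range/end iteration syntax that produces one line per item, which pairs naturally with custom-columns. A sketch:

```shell
# One pod per line with its node, via jsonpath's range syntax
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'

# Custom columns read the same paths but print a header row
kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,RESTARTS:.status.containerStatuses[0].restartCount
```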


The kubectl plugin system lets you add any debugging command

Any executable named kubectl-<name> on your PATH becomes a kubectl subcommand. kubectl sniff (packet capture), kubectl tree (resource ownership), kubectl neat (clean YAML output), and kubectl images (list images in a cluster) are all community plugins installable via krew, the kubectl plugin manager. Krew hosts over 200 plugins as of 2024.
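The dispatch mechanism is nothing more than a PATH lookup, which is easy to verify with a throwaway plugin (the name kubectl-hello and the /tmp location are invented for illustration):

```shell
# Any executable named kubectl-<name> on PATH becomes `kubectl <name>`
cat > /tmp/kubectl-hello <<'EOF'
#!/bin/sh
echo "hello from a kubectl plugin"
EOF
chmod +x /tmp/kubectl-hello

# With /tmp on PATH, `kubectl hello` would dispatch to this script;
# invoking it directly shows the same output kubectl would relay
/tmp/kubectl-hello
```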


Resource quotas and LimitRanges cause silent deployment failures

If a namespace has a ResourceQuota and your deployment does not specify resource requests/limits, pod creation fails quietly: the Deployment controller creates a ReplicaSet, the ReplicaSet tries to create pods, and the pods are rejected at admission. The Deployment itself surfaces the problem only as a ReplicaFailure condition in its status, not as an obvious error. You must drill down: kubectl describe deployment -> kubectl describe replicaset -> read the FailedCreate event quoting the quota violation. This multi-level indirection is a common debugging trap.
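A minimal quota that sets exactly this trap (names and values are illustrative): once it exists, any pod in the namespace that omits cpu/memory requests and limits is refused at admission.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: prod          # illustrative namespace
spec:
  hard:
    requests.cpu: "4"      # once any of these hard limits exist, pods
    requests.memory: 8Gi   # that fail to declare the corresponding
    limits.cpu: "8"        # requests/limits are rejected at admission,
    limits.memory: 16Gi    # visible only on the ReplicaSet's events
```

After applying it, kubectl describe replicaset on the affected ReplicaSet shows the FailedCreate event carrying the quota message.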


Kubernetes audit logs record every API call — if you enable them

By default, most managed Kubernetes services have audit logging disabled or set to minimal levels. When enabled at the RequestResponse level, audit logs capture the complete request and response body for every API call. This is invaluable for debugging "who deleted my pod" or "what changed this ConfigMap" mysteries, but the volume is enormous — a busy cluster can generate gigabytes of audit logs per hour.
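On clusters where you control the API server, the trade-off between evidence and volume is expressed in an audit Policy file. A sketch (the rule selection is illustrative, not a recommended production policy):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Full request/response bodies answer "what changed this ConfigMap?"
- level: RequestResponse
  resources:
  - group: ""
    resources: ["configmaps"]
# "Who deleted my pod?" usually needs only metadata: caller, verb, time
- level: Metadata
  verbs: ["delete"]
  resources:
  - group: ""
    resources: ["pods"]
# Record nothing else, to keep log volume manageable
- level: None
```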


Port-forward is a debugging lifeline but it is fragile

kubectl port-forward creates a TCP tunnel from your local machine to a pod, service, or deployment. It is the fastest way to access a database, admin UI, or debug endpoint without creating an Ingress. However, port-forward connections are tunneled through the API server and are fragile — they drop on network hiccups, pod restarts, and sometimes spontaneously. For persistent access, a proper Service or Ingress is always more reliable.
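A command sketch (svc/postgres is an illustrative target; the retry loop is a common stopgap for the fragility, not a substitute for a proper Service or Ingress):

```shell
# Tunnel local port 5432 to the service's port 5432 via the API server
kubectl port-forward svc/postgres 5432:5432

# The tunnel dies with the pod or a network hiccup;
# a crude keep-alive loop simply restarts it
while true; do kubectl port-forward svc/postgres 5432:5432; sleep 1; done
```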