Practical Kubernetes Ops - Street Ops¶
What experienced K8s operators know that gets asked in interviews and matters at 2am.
Node Drain: What Actually Happens¶
When you run kubectl drain, here's the real sequence:
- Node is cordoned (SchedulingDisabled)
- For each pod on the node:
  a. Kubernetes checks PDBs - if evicting this pod would violate a PDB, it waits
  b. The pod's `preStop` hook runs (if configured)
  c. The pod receives a SIGTERM
  d. The `terminationGracePeriodSeconds` timer (default: 30s) bounds the whole shutdown, preStop hook included
  e. If the pod doesn't exit by the deadline, it gets SIGKILL
  f. The pod is rescheduled on another node (if it has a controller)
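The fields involved in that sequence can be sketched in a pod spec. A minimal sketch, assuming an nginx image; the pod name, sleep duration, and grace period are illustrative, not from this doc:

```shell
# Config sketch - illustrative values
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
spec:
  terminationGracePeriodSeconds: 60    # SIGKILL deadline; covers preStop + SIGTERM wait
  containers:
  - name: app
    image: nginx
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]   # runs before SIGTERM; gives LBs time to drain
EOF
```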
Things that block drain:
- PDB with minAvailable: 100% or maxUnavailable: 0 - nothing can be evicted
- DaemonSet pods (use --ignore-daemonsets)
- Bare pods (no controller) - use --force, but they won't come back
- Stuck finalizers on pods - the pod won't delete until the finalizer is resolved
- Pod with terminationGracePeriodSeconds: 3600 - drain will wait up to an hour
Gotcha: A PDB with `maxUnavailable: 0` on a single-replica deployment makes the node un-drainable. The drain will wait forever because evicting the only pod would violate the PDB. Always ensure PDBs allow at least one pod to be evicted, or your node maintenance will hang.
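To spot drain-blocking PDBs before you start, you can check each PDB's current allowed disruptions. A sketch using the policy/v1 status shape:

```shell
# List PDBs that currently allow zero disruptions - these will block a drain
kubectl get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
```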
Pro tip: Always test drain on a non-critical node first. Set a --timeout to avoid waiting forever:
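A bounded drain might look like this (node name and timeout value are illustrative):

```shell
# Abort the drain if it cannot finish in two minutes
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data --timeout=120s
# A timed-out drain leaves the node cordoned - uncordon it if you give up
kubectl uncordon node-1
```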
The "It's Always DNS" Debugging Flow¶
DNS failures are the most common K8s networking issue:
# 1. Can the pod resolve anything?
kubectl exec -it <pod> -- nslookup kubernetes.default
# If this fails, CoreDNS is broken
# 2. Can the pod resolve the specific service?
kubectl exec -it <pod> -- nslookup <service>.<namespace>.svc.cluster.local
# If this fails, check service exists and has endpoints
# 3. Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Are they Running? Check logs:
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# 4. Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
# Look for forward rules, custom zones
# 5. Check resolv.conf in the pod
kubectl exec -it <pod> -- cat /etc/resolv.conf
# Should point to the kube-dns service IP (usually 10.96.0.10)
# 6. Check the service has endpoints
kubectl get endpoints <service> -n <namespace>
# Empty endpoints = no matching pods (check labels)
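If the failing pod's image has no DNS tools, a throwaway debug pod helps. A sketch assuming the nicolaka/netshoot utility image; the service and namespace names are illustrative:

```shell
# One-off pod with dig/nslookup/tcpdump; deleted automatically on exit
kubectl run dns-debug --rm -it --restart=Never --image=nicolaka/netshoot -- \
  dig web-frontend.production.svc.cluster.local
```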
Resource Limits: The Gotchas¶
CPU limits cause throttling, not OOMKill:
- A pod exceeding its CPU limit gets throttled (runs slower), not killed
- This shows up as increased latency, not crashes
- Check for throttling: kubectl top pod <pod> only shows current usage; actual throttle counts live in the container's cgroup stats (nr_throttled in cpu.stat) or the container_cpu_cfs_throttled_periods_total cAdvisor metric
Memory limits cause OOMKill:
- A pod exceeding its memory limit gets killed immediately (OOMKilled)
- Check: kubectl describe pod <pod> | grep -A 3 "Last State"
- The pod's exit code will be 137 (128 + SIGKILL signal 9)
Requests vs Limits:
- Request: guaranteed minimum. Used for scheduling decisions.
- Limit: maximum allowed. Enforced at runtime.
- A pod with requests but no limits can use all available node resources (noisy neighbor)
- A pod with limits but no requests: request defaults to limit (over-provisions)
Remember: Exit code 137 = 128 + 9 (SIGKILL) = OOMKilled. Exit code 143 = 128 + 15 (SIGTERM) = graceful shutdown. Mnemonic: 137 = murdered by the OOM killer, 143 = asked politely to leave.
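Both codes are easy to verify locally with any POSIX shell, no cluster needed:

```shell
# Kill a background sleep with each signal, then read the exit code back
(sleep 30 & pid=$!; kill -KILL "$pid"; wait "$pid"); echo "after SIGKILL: $?"   # 137
(sleep 30 & pid=$!; kill -TERM "$pid"; wait "$pid"); echo "after SIGTERM: $?"   # 143
```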
The OOM pattern:
# Find OOMKilled pods
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | "\(.metadata.namespace)/\(.metadata.name)"'
# Check a pod's memory limit vs actual usage
kubectl top pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
Rolling Update Gotchas¶
maxSurge and maxUnavailable:
strategy:
rollingUpdate:
maxSurge: 1 # 1 extra pod during update
maxUnavailable: 0 # Never reduce below desired count
- maxSurge: 1, maxUnavailable: 0 = safest. New pod starts and passes readiness before the old one terminates. Slower.
- maxSurge: 0, maxUnavailable: 1 = old pod terminates first, then new one starts. Faster but briefly reduced capacity.
- maxSurge: 25%, maxUnavailable: 25% = default. Balances speed and availability.
Readiness probes are critical during rollouts:
- Without a readiness probe, Kubernetes considers the pod ready as soon as the container starts
- A pod that starts but takes 30s to load config will receive traffic and return 500s
- Always have readiness probes. Always.
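One way to retrofit a readiness probe without editing YAML files is a JSON patch. The deployment name, endpoint path, port, and timings below are illustrative assumptions:

```shell
# Add a readiness probe to the first container of a deployment
kubectl patch deployment web-frontend -n production --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/readinessProbe",
   "value":{"httpGet":{"path":"/ready","port":8080},
            "initialDelaySeconds":5,"periodSeconds":5,"failureThreshold":3}}]'
```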
Deployment stuck in "Progressing":
# Check rollout status
kubectl rollout status deployment/<name> -n <ns>
# If stuck, check the new ReplicaSet's pods
kubectl get rs -n <ns> -l app=<name>
kubectl describe rs <new-rs> -n <ns>
# Common causes:
# - Image doesn't exist (ImagePullBackOff)
# - Readiness probe failing on new pods
# - Insufficient resources to schedule new pods
# - PDB blocking eviction of old pods
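If the new ReplicaSet can never become healthy, rolling back is usually faster than debugging forward (deployment name is illustrative):

```shell
# Roll back to the previous revision
kubectl rollout undo deployment/web-frontend -n production
# Or inspect history and pin a specific revision
kubectl rollout history deployment/web-frontend -n production
kubectl rollout undo deployment/web-frontend -n production --to-revision=2
```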
Quick Diagnostic Commands¶
# Cluster health at a glance
kubectl get nodes -o wide
kubectl get pods -A | grep -v Running | grep -v Completed
# Resource pressure
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# Recent events (things that happened)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Pods that have restarted
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.restartCount > 0) | "\(.metadata.namespace)/\(.metadata.name): \(.status.containerStatuses[0].restartCount) restarts"' | sort -t: -k2 -rn
# Pending pods (not scheduled)
kubectl get pods -A --field-selector=status.phase=Pending
# Find what's consuming the most CPU/memory
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20
Upgrade Checklist (Production)¶
Before upgrading a production cluster:
- Read the release notes for breaking changes
- Check version skew rules (kubelet, kubectl, API server)
- Back up etcd: `etcdctl snapshot save`
- Review PDBs: `kubectl get pdb -A` (ensure none block drains)
- Verify monitoring is working (you need to see the upgrade in metrics)
- Upgrade control plane first, then workers one at a time
- After each node: verify `kubectl get nodes` shows Ready + correct version
- Run smoke tests after all nodes are upgraded
- Monitor for 24 hours before declaring success
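The etcd backup step might look like this on a kubeadm-style cluster; the endpoint, certificate paths, and backup location are assumptions that vary by distribution and etcdctl version:

```shell
# Snapshot etcd, then verify the snapshot is readable
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db -w table
```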
HPA Operations¶
Create and Check HPA¶
# Quick HPA from the command line
kubectl autoscale deployment web-frontend --cpu-percent=70 --min=3 --max=50 -n production
# Watch HPA status in real time
kubectl get hpa -n production -w
# Detailed status with conditions and events
kubectl describe hpa web-frontend -n production
Diagnose "Unknown" Metrics¶
# Step 1: Is metrics-server running?
kubectl get pods -n kube-system -l k8s-app=metrics-server
# Step 2: Is the metrics API registered?
kubectl get apiservices | grep metrics
# Step 3: Can you get raw metrics?
kubectl top pods -n production
# Step 4: Do your pods have resource requests?
kubectl get pod -n production -l app=web-frontend \
-o jsonpath='{.items[0].spec.containers[0].resources.requests}'
# {} means no requests — HPA CANNOT compute utilization without them
Default trap: HPA computes utilization as `current_usage / request`. Without resource requests the denominator is zero, and HPA shows `<unknown>` for the metric. This is the number one reason HPA "doesn't work": the fix is always to set resource requests on the target pods.
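The scaling math can be sketched with shell arithmetic; all numbers here are made up:

```shell
# Per-pod utilization: usage / request (millicores)
usage_m=350; request_m=500
echo "utilization: $((100 * usage_m / request_m))%"     # prints "utilization: 70%"

# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
current=4; current_pct=105; target_pct=70
desired=$(( (current * current_pct + target_pct - 1) / target_pct ))
echo "desired replicas: $desired"                       # prints "desired replicas: 6"
```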
Emergency: Disable Autoscaling¶
# Freeze at current count
kubectl patch hpa web-frontend -n production -p '{"spec":{"minReplicas":10,"maxReplicas":10}}'
# Or delete the HPA (preserves current replica count)
kubectl delete hpa web-frontend -n production
Probe Operations¶
Check Probe Status¶
# See if probes are causing restarts
kubectl describe pod myapp-abc123 | grep -A3 "Liveness\|Readiness\|Startup"
# Check events for probe failures
kubectl describe pod myapp-abc123 | grep -i "unhealthy\|probe failed\|killing"
Test Probe Endpoints Manually¶
# Exec into pod and hit health endpoints
kubectl exec -it myapp-abc123 -- curl -v http://localhost:8080/healthz
kubectl exec -it myapp-abc123 -- curl -v http://localhost:8080/ready
Diagnose Cascading Restart Storm¶
# Check if liveness and readiness use the same endpoint
kubectl get pod myapp-abc123 -o json | jq '{
liveness: .spec.containers[0].livenessProbe.httpGet.path,
readiness: .spec.containers[0].readinessProbe.httpGet.path
}'
# BAD if same endpoint. Fix: separate /healthz (liveness) and /ready (readiness)
Under the hood: Liveness probes answer "is this process fundamentally broken?" — if yes, kill and restart it. Readiness probes answer "can this pod handle traffic right now?" — if no, remove it from the service endpoint list but keep it running. Using the same endpoint for both means a pod that is temporarily overloaded (should just stop receiving traffic) gets killed instead, causing a restart storm.
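The separation can be applied as a JSON patch giving each probe its own endpoint; the paths, port, and timings here are illustrative assumptions:

```shell
# Point liveness and readiness at separate endpoints
kubectl patch deployment myapp -n production --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/livenessProbe",
   "value":{"httpGet":{"path":"/healthz","port":8080},"periodSeconds":10,"failureThreshold":3}},
  {"op":"add","path":"/spec/template/spec/containers/0/readinessProbe",
   "value":{"httpGet":{"path":"/ready","port":8080},"periodSeconds":5,"failureThreshold":2}}]'
```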
Emergency: Disable a Problematic Probe¶
# If a broken liveness probe is causing a restart storm, patch it out
kubectl patch deployment myapp -n production --type=json \
-p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
Quick Reference¶
- Cheatsheet: Kubernetes-Core
- Deep Dive: Kubernetes Pod Lifecycle
- Deep Dive: Kubernetes Scheduler