Portal | Level: L2: Operations | Topics: Node Lifecycle & Maintenance, Probes (Liveness/Readiness), CrashLoopBackOff, OOMKilled | Domain: Kubernetes
Practical Kubernetes Ops - Primer¶
Why This Matters¶
Most K8s training teaches you to create deployments and services. Real ops work is about maintaining, upgrading, and debugging clusters under pressure. This is what interviewers ask about and what you'll actually do on the job.
Worker Node Upgrade Workflow¶
The standard process for upgrading a Kubernetes worker node:
1. Cordon -> Mark node unschedulable (no new pods)
2. Drain -> Evict existing pods (respecting PDBs)
3. Upgrade -> Update kubelet, container runtime, etc.
4. Uncordon -> Mark node schedulable again
Cordon¶
kubectl cordon node-2
# Node is now SchedulingDisabled - existing pods keep running
# New pods won't be scheduled here
Drain¶
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data
# Evicts all pods (except DaemonSets)
# Pods with PDBs will block if eviction would violate the budget
# --delete-emptydir-data: allow eviction of pods using emptyDir volumes
What can go wrong during drain:
- PodDisruptionBudget (PDB) prevents eviction: the budget says "must have at least N pods available" and draining would violate it
- Pod without a controller (bare pod): use --force to evict, but it won't be rescheduled
- Pod with local storage: use --delete-emptydir-data
- Drain timeout: pod won't terminate (stuck finalizer, long graceful shutdown)
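The failure modes above map to drain flags. A sketch; the node name, grace period, and timeout values are illustrative and should be tuned to your workloads:

```shell
# Force-evict bare pods (they will NOT be rescheduled anywhere)
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data --force

# Bound the wait: give each pod 30s to shut down, give the whole drain 2 minutes
kubectl drain node-2 --ignore-daemonsets --grace-period=30 --timeout=120s

# Find which budget is blocking: look for ALLOWED DISRUPTIONS of 0
kubectl get pdb -A
```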
Upgrade¶
# On the node itself:
apt-get update && apt-get install -y kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
systemctl daemon-reload
systemctl restart kubelet
Uncordon¶
kubectl uncordon node-2
# Node is schedulable again - new pods can land here
Version Skew Rules¶
Kubernetes has strict version compatibility rules:
- kubelet: may be older than the API server but never newer; upstream allows up to three minor versions of skew as of v1.28 (two before that), though staying within one is common practice (e.g. API server 1.30, kubelet 1.29 is OK)
- kubectl: within 1 minor version of the API server (either direction)
- Control plane components: must be upgraded before workers
- Upgrade order: etcd -> API server -> controller-manager/scheduler -> kubelet
Practical rule: always upgrade control plane first, then workers one at a time.
Remember the version skew mnemonic: "Control plane leads, workers follow, one version gap max." Never skip minor versions when upgrading the control plane (1.28 to 1.30 is not supported; go 1.28 -> 1.29 -> 1.30).
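On kubeadm clusters, the upgrade order maps to commands like these (a sketch; the target version is illustrative):

```shell
# First control plane node: preflight checks, then the upgrade itself
kubeadm upgrade plan
kubeadm upgrade apply v1.30.4   # API server, controller-manager, scheduler (and kubeadm-managed etcd)

# Remaining control plane nodes and each worker (after cordon/drain)
kubeadm upgrade node
```

Note that kubeadm upgrades the control plane components and kubelet configuration; the kubelet package itself is still updated through the OS package manager.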
PodDisruptionBudgets (PDBs)¶
PDBs tell Kubernetes how many pods of a workload must remain available during voluntary disruptions (drains, rolling updates):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # At least 2 pods must be up
  # OR: maxUnavailable: 1  (at most 1 pod can be down)
  selector:
    matchLabels:
      app: api
Key: PDBs only apply to voluntary disruptions (node drain, rolling update). They do NOT prevent involuntary disruptions (node crash, OOM kill).
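Before draining, it is worth checking what the budgets currently allow (the namespace here is assumed):

```shell
kubectl get pdb -n prod
# The ALLOWED DISRUPTIONS column shows how many more pods may be
# voluntarily evicted right now; 0 means a drain will block on this workload
```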
Cluster Maintenance¶
etcd Snapshots¶
etcd stores all cluster state, so backing it up is critical.
Gotcha: etcd snapshots only capture etcd data, not node-level state (kubelet config, certificates, local volumes). A full disaster recovery plan also needs the PKI certificates from /etc/kubernetes/pki/ and the kubelet configuration.
# Create snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot
etcdctl snapshot status /backup/etcd-snapshot.db
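The restore path is the half people forget to rehearse. A sketch, assuming a kubeadm cluster running etcd as a static pod (paths are illustrative):

```shell
# Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# Then edit the static pod manifest (/etc/kubernetes/manifests/etcd.yaml)
# so the /var/lib/etcd hostPath volume points at /var/lib/etcd-restored;
# kubelet notices the change and restarts etcd on the restored data
```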
Rollout Strategies¶
# Check rollout status
kubectl rollout status deployment/api -n prod
# View rollout history
kubectl rollout history deployment/api -n prod
# Undo last rollout
kubectl rollout undo deployment/api -n prod
# Undo to specific revision
kubectl rollout undo deployment/api -n prod --to-revision=3
# Pause/resume rollout (for canary-style manual approval)
kubectl rollout pause deployment/api -n prod
kubectl rollout resume deployment/api -n prod
Debugging Cheatsheet¶
CrashLoopBackOff¶
# 1. Check pod events
kubectl describe pod <pod-name> -n <ns> | tail -20
# 2. Check current container logs
kubectl logs <pod-name> -n <ns>
# 3. Check PREVIOUS container logs (crashed container)
kubectl logs <pod-name> -n <ns> --previous
# 4. Common causes:
# - App exits immediately (bad entrypoint/command)
# - Missing config (ConfigMap/Secret not mounted)
# - Port conflict
# - OOMKilled (check resources section in describe)
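If the container dies too fast to exec into, kubectl debug gets you a shell alongside it (a sketch; the image and names are placeholders):

```shell
# Attach an ephemeral debug container that shares the crashing
# container's process namespace
kubectl debug -it <pod-name> -n <ns> --image=busybox --target=<container-name>

# Or clone the pod with the entrypoint overridden so it stays up for inspection
kubectl debug <pod-name> -n <ns> --copy-to=<pod-name>-debug -- sleep 1d
```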
Probes Failing¶
# Check which probe is failing
kubectl describe pod <pod-name> | grep -A 5 "Liveness\|Readiness\|Startup"
# Check pod events for probe failure messages
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name>
# Common fixes:
# - initialDelaySeconds too short (app not ready yet)
# - Wrong port or path
# - timeoutSeconds too short for slow endpoints
# - periodSeconds too aggressive
Node Pressure¶
# Check node conditions
kubectl describe node <node> | grep -A 5 "Conditions"
# Memory pressure: node is running low on memory
# Disk pressure: node filesystem is running low on space
# PID pressure: too many processes
# Check resource usage
kubectl top nodes
kubectl top pods -n <ns> --sort-by=memory
# Find pods using the most resources
kubectl top pods -A --sort-by=cpu | head -20
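Under memory or disk pressure the kubelet evicts pods, and those show up as Failed pods with reason Evicted:

```shell
# List failed (including evicted) pods across the cluster
kubectl get pods -A --field-selector=status.phase=Failed

# The describe output records which resource triggered the eviction
kubectl describe pod <evicted-pod> -n <ns> | grep -iA 3 "reason"
```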
Network Policy Debugging¶
# List all network policies
kubectl get networkpolicy -A
# Check if a specific pod is affected
kubectl describe networkpolicy <policy-name> -n <ns>
# Test connectivity from a pod
kubectl exec -it <pod> -n <ns> -- curl -v --connect-timeout 5 http://<target>:<port>
# Common issue: default deny policy with no matching allow rule
# Check: does the pod's labels match the policy's podSelector?
# Check: does the target match the policy's egress rules?
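For the default-deny case, the fix is an explicit allow policy whose podSelector matches the target pods. A minimal sketch (namespace, labels, and port are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api            # the pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend   # only pods with this label may connect
    ports:
    - protocol: TCP
      port: 8080
```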
HPA / Autoscaling¶
Kubernetes workloads rarely experience constant load. Traffic spikes during business hours, batch jobs surge at end-of-month, and marketing campaigns drive unpredictable bursts. Manual scaling is slow, error-prone, and does not work at 3 AM. The HorizontalPodAutoscaler (HPA) is the primary mechanism Kubernetes provides for matching pod count to actual demand.
HorizontalPodAutoscaler v2¶
The HPA controller runs in the kube-controller-manager. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), it queries metrics, computes the desired replica count, and patches the target workload's .spec.replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
The formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). For example, 4 replicas at 90% average CPU with a 70% target gives ceil(4 * 90/70) = ceil(5.14) = 6 replicas.
Default trap: HPA requires resources.requests to be set on pods for CPU-based scaling. If requests are not set, HPA shows <unknown> for the current metric and never scales. This is the number-one reason HPA "doesn't work."
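The fix is setting requests in the target workload's pod template; HPA computes utilization as a percentage of the requested value (the numbers here are illustrative):

```yaml
resources:
  requests:
    cpu: 250m       # 70% utilization target means ~175m actual usage per pod
    memory: 256Mi
  limits:
    memory: 512Mi
```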
Metrics Server¶
The HPA relies on the Metrics API for pod resource usage. The most common provider is metrics-server.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl top nodes && kubectl top pods -n production
Custom and External Metrics¶
| Metric Type | Source | Use Case |
|---|---|---|
| Resource | metrics-server | CPU, memory utilization |
| Pods | custom metrics adapter | Per-pod app metrics (RPS, latency) |
| External | external metrics adapter | Cloud services (SQS depth, Pub/Sub backlog) |
| Object | custom metrics adapter | Metrics from a specific k8s object (Ingress RPS) |
Common adapters: Prometheus Adapter, Datadog Cluster Agent, KEDA.
Scaling Behavior¶
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Min
The stabilization window prevents flapping by looking back over a window of recommendations. For scale-down, it defaults to 300 seconds. For scale-up, the default is 0 (react immediately).
VPA (Vertical Pod Autoscaler)¶
VPA adjusts container resource requests and limits instead of replica count. VPA and HPA should not target the same metric — if HPA scales on CPU, do not let VPA adjust CPU requests.
HPA Debugging¶
kubectl describe hpa web-frontend -n production
kubectl autoscale deployment web --cpu-percent=70 --min=3 --max=20
kubectl get hpa -w
Key checks: are current metrics populated or <unknown>? Do pods have resources.requests.cpu set? Is metrics-server healthy?
HPA Key Takeaways¶
- Always set resource requests on pods that HPA targets.
- Prefer CPU over memory as the primary scaling metric.
- Use stabilization windows to prevent flapping.
- Custom metrics via Prometheus Adapter or KEDA unlock scaling on business signals.
- HPA cannot scale to zero natively — use KEDA or Knative for that.
Probes (Liveness / Readiness / Startup)¶
Kubernetes probes determine whether your pod receives traffic and whether it gets restarted. Probes are the most commonly misconfigured part of a Kubernetes deployment.
Three Probe Types¶
| Probe | Question It Answers | Failure Action |
|---|---|---|
| Liveness | Is this container still alive? | Kill and restart the container |
| Readiness | Can this container serve traffic right now? | Remove from Service endpoints |
| Startup | Has this container finished starting up? | Keep waiting (block liveness/readiness) |
Probe Mechanisms¶
- httpGet — Send HTTP GET. Success = 2xx/3xx. Use for web servers, APIs.
- tcpSocket — Open TCP connection. Success = port open. Use for databases, brokers.
- exec — Run command. Success = exit code 0. Use for custom checks.
- grpc — gRPC health check (K8s 1.24+). Use for gRPC services.
Probe Parameters¶
| Parameter | Default | Description |
|---|---|---|
| initialDelaySeconds | 0 | Wait before first probe |
| periodSeconds | 10 | Probe frequency |
| timeoutSeconds | 1 | Timeout per probe |
| successThreshold | 1 | Consecutive successes needed |
| failureThreshold | 3 | Consecutive failures before action |
Liveness vs Readiness Design¶
Liveness should be fast, local, and dependency-free: a /healthz handler that returns 200 whenever the process is up, with no database or downstream calls.
Readiness SHOULD check dependencies:
@app.get("/ready")
def readiness():
try:
db.execute("SELECT 1")
except Exception:
return JSONResponse(status_code=503, content={"reason": "database unavailable"})
return {"status": "ready"}
War story: A team configured liveness probes to check the database connection. When the database had a 30-second blip, Kubernetes killed and restarted all 50 API pods simultaneously. All 50 reconnected at once, overwhelming the database connection pool and causing a cascading failure that lasted 20 minutes. The fix was moving the database check to the readiness probe and making liveness a simple /healthz that returns 200 if the process is alive.
The cardinal sin is checking dependency health in liveness. When the database goes down, every pod fails liveness, Kubernetes restarts them all, they thundering-herd the database on reconnect — cascading restart storm.
Startup Probes¶
Startup probes replace the fragile initialDelaySeconds pattern. Until the startup probe succeeds, liveness and readiness probes are disabled:
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 60   # 5 * 60 = 300 seconds max startup time
Complete Pod Spec Example¶
containers:
- name: myapp
  image: myapp:1.4.2
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 5
    failureThreshold: 60
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    timeoutSeconds: 3
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 3
Probe Gotchas¶
- Liveness == Readiness (same endpoint) — Dependency outage triggers restarts. Always separate.
- Timeout too short for GC pauses — JVM full GC can pause for seconds.
- No startup probe on slow-starting apps — Must guess initialDelaySeconds.
- Probing the wrong port — Common when copying manifests between projects.
See Also¶
- Deep dives: Pod Lifecycle, K8s Networking, K8s Scheduler
- Cheatsheet: Kubernetes Core, K8s YAML Patterns
- Drills: kubectl Drills
- Skillcheck: Kubernetes, Under the Covers
- Runbooks: CrashLoopBackOff, OOMKilled, PVC Stuck, ImagePullBackOff
Wiki Navigation¶
Prerequisites¶
- Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1)
Next Steps¶
- API Gateways & Ingress (Topic Pack, L2)
- Argo Workflows (Topic Pack, L2)
- ArgoCD & GitOps (Topic Pack, L2)
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2)
- Case Study: CNI Broken After Restart (Case Study, L2)
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2)
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2)
- Case Study: CrashLoopBackOff No Logs (Case Study, L1)
Related Content¶
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — CrashLoopBackOff, HPA / Autoscaling, OOMKilled
- Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1) — HPA / Autoscaling, Kubernetes Networking, Probes (Liveness/Readiness)
- Skillcheck: Kubernetes (Assessment, L1) — HPA / Autoscaling, Probes (Liveness/Readiness)
- Skillcheck: Kubernetes Under the Covers (Assessment, L2) — Kubernetes Networking, Node Lifecycle & Maintenance
- Track: Kubernetes Core (Reference, L1) — Kubernetes Networking, Probes (Liveness/Readiness)
- API Gateways & Ingress (Topic Pack, L2) — Kubernetes Networking
- Case Study: CNI Broken After Restart (Case Study, L2) — Kubernetes Networking
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Networking
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2) — Kubernetes Networking
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Node Lifecycle & Maintenance
Pages that link here¶
- Anti-Primer: Kubernetes Ops
- Argo Workflows
- ArgoCD & GitOps
- Certification Prep: CKA — Certified Kubernetes Administrator
- Certification Prep: CKAD — Certified Kubernetes Application Developer
- Certification Prep: CKS — Certified Kubernetes Security Specialist
- Certification Prep: PCA — Prometheus Certified Associate
- Comparison: Container Orchestrators
- Comparison: Managed Kubernetes
- Incident Replay: CNI Broken After Node Restart
- Incident Replay: CoreDNS Timeout — Pod DNS Resolution Failing
- Incident Replay: CrashLoopBackOff with No Logs
- Incident Replay: DaemonSet Blocks Node Eviction
- Incident Replay: ImagePullBackOff — Registry Authentication Failure
- Incident Replay: Node Drain Blocked by PDB