
Portal | Level: L2: Operations | Topics: Node Lifecycle & Maintenance, Probes (Liveness/Readiness), CrashLoopBackOff, OOMKilled | Domain: Kubernetes

Practical Kubernetes Ops - Primer

Why This Matters

Most K8s training teaches you to create deployments and services. Real ops work is about maintaining, upgrading, and debugging clusters under pressure. This is what interviewers ask about and what you'll actually do on the job.

Worker Node Upgrade Workflow

The standard process for upgrading a Kubernetes worker node:

1. Cordon    -> Mark node unschedulable (no new pods)
2. Drain     -> Evict existing pods (respecting PDBs)
3. Upgrade   -> Update kubelet, container runtime, etc.
4. Uncordon  -> Mark node schedulable again

Cordon

kubectl cordon node-2
# Node is now SchedulingDisabled - existing pods keep running
# New pods won't be scheduled here

Drain

kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data
# Evicts all pods (except DaemonSets)
# Pods with PDBs will block if eviction would violate the budget
# --delete-emptydir-data: allow eviction of pods using emptyDir volumes

What can go wrong during drain:

  • PodDisruptionBudget (PDB) prevents eviction: the budget says "must have at least N pods available" and draining would violate it
  • Pod without a controller (bare pod): use --force to evict, but it won't be rescheduled
  • Pod with local storage: use --delete-emptydir-data
  • Drain timeout: a pod won't terminate (stuck finalizer, long graceful shutdown)

Upgrade

# On the node itself (kubelet/kubectl packages are usually held to prevent accidental upgrades):
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet=1.30.4-1.1 kubectl=1.30.4-1.1
apt-mark hold kubelet kubectl
systemctl daemon-reload
systemctl restart kubelet

Uncordon

kubectl uncordon node-2
# Node is schedulable again

Version Skew Rules

Kubernetes has strict version compatibility rules:

  • kubelet: may be older than the API server, never newer. Up to three minor versions older is supported as of 1.28 (two on earlier releases); e.g., API server 1.30 with kubelet 1.29 is fine
  • kubectl: within 1 minor version of the API server (either direction)
  • Control plane components: must be upgraded before workers
  • Upgrade order: etcd -> API server -> controller-manager/scheduler -> kubelet

Practical rule: always upgrade control plane first, then workers one at a time.

Remember the version skew mnemonic: "Control plane leads, workers follow." Upgrade order: etcd, API server, controller-manager/scheduler, then workers. Never skip minor versions when upgrading (1.28 to 1.30 is not supported; go 1.28 to 1.29 to 1.30).
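As a quick sanity check, the skew rules can be encoded in a few lines of Python. This is a hypothetical helper, not part of any kubectl tooling; versions are (major, minor) tuples, and the kubelet skew limit follows the upstream policy as of 1.28:

```python
def kubelet_ok(apiserver: tuple, kubelet: tuple, max_skew: int = 3) -> bool:
    """kubelet may be older than the API server (up to max_skew minors), never newer."""
    if apiserver[0] != kubelet[0]:
        return False
    return 0 <= apiserver[1] - kubelet[1] <= max_skew

def kubectl_ok(apiserver: tuple, kubectl: tuple) -> bool:
    """kubectl must be within one minor version of the API server, either direction."""
    return apiserver[0] == kubectl[0] and abs(apiserver[1] - kubectl[1]) <= 1

print(kubelet_ok((1, 30), (1, 29)))   # True: one minor behind is fine
print(kubelet_ok((1, 30), (1, 31)))   # False: kubelet may never be newer
print(kubectl_ok((1, 30), (1, 31)))   # True: within one minor either way
```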

PodDisruptionBudgets (PDBs)

PDBs tell Kubernetes how many pods of a workload must remain available during voluntary disruptions (drains, rolling updates):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # At least 2 pods must be up
  # OR: maxUnavailable: 1  # At most 1 pod can be down
  selector:
    matchLabels:
      app: api

Key: PDBs only apply to voluntary disruptions (node drain, rolling update). They do NOT prevent involuntary disruptions (node crash, OOM kill).
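The controller's arithmetic behind a blocked drain is simple. A sketch of how disruptionsAllowed is derived for a minAvailable PDB (simplified: the real controller also handles percentages and maxUnavailable):

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    # An eviction is permitted only while it leaves min_available pods healthy.
    return max(0, healthy_pods - min_available)

print(disruptions_allowed(3, 2))  # 1 eviction allowed
print(disruptions_allowed(2, 2))  # 0 -> drain blocks until another pod is healthy
```

This is why a 2-replica workload with minAvailable: 2 makes a node drain hang: allowed disruptions is permanently zero until a third replica comes up elsewhere.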

Cluster Maintenance

etcd Snapshots

Gotcha: etcd snapshots only capture etcd data, not node-level state (kubelet config, certificates, local volumes). A full disaster recovery plan also needs the PKI certificates from /etc/kubernetes/pki/ and the kubelet configuration.

etcd stores all cluster state. Backing it up is critical:

# Create snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify snapshot (on newer etcd releases, prefer: etcdutl snapshot status)
etcdctl snapshot status /backup/etcd-snapshot.db
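A backup is only useful if you can restore it. A hedged sketch of the restore flow on a kubeadm-style control plane (paths match the snapshot command above; the exact manifest locations depend on your setup):

```shell
# Stop the API server first (kubeadm: move its static pod manifest out of
# /etc/kubernetes/manifests/), then restore into a FRESH data directory:
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored

# Point the etcd static pod (or systemd unit) at /var/lib/etcd-restored,
# restore the API server manifest, then verify the cluster comes back:
kubectl get nodes
```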

Rollout Strategies

# Check rollout status
kubectl rollout status deployment/api -n prod

# View rollout history
kubectl rollout history deployment/api -n prod

# Undo last rollout
kubectl rollout undo deployment/api -n prod

# Undo to specific revision
kubectl rollout undo deployment/api -n prod --to-revision=3

# Pause/resume rollout (for canary-style manual approval)
kubectl rollout pause deployment/api -n prod
kubectl rollout resume deployment/api -n prod

Debugging Cheatsheet

CrashLoopBackOff

# 1. Check pod events
kubectl describe pod <pod-name> -n <ns> | tail -20

# 2. Check current container logs
kubectl logs <pod-name> -n <ns>

# 3. Check PREVIOUS container logs (crashed container)
kubectl logs <pod-name> -n <ns> --previous

# 4. Common causes:
#    - App exits immediately (bad entrypoint/command)
#    - Missing config (ConfigMap/Secret not mounted)
#    - Port conflict
#    - OOMKilled (check resources section in describe)
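The "BackOff" part of CrashLoopBackOff is an exponential restart delay: the kubelet waits 10s, 20s, 40s, and so on between restarts, capped at five minutes (the counter resets after the container runs cleanly for a while). A quick sketch of the delay schedule:

```python
def backoff_seconds(restart_count: int, base: int = 10, cap: int = 300) -> int:
    # Delay doubles with each crash, capped at 5 minutes.
    return min(base * 2 ** restart_count, cap)

for n in range(6):
    print(n, backoff_seconds(n))  # 10, 20, 40, 80, 160, 300
```

This is why a crashing pod seems to "do nothing" for minutes at a time: it is simply waiting out the back-off before the next restart attempt.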

Probes Failing

# Check which probe is failing
kubectl describe pod <pod-name> | grep -A 5 "Liveness\|Readiness\|Startup"

# Check pod events for probe failure messages
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name>

# Common fixes:
# - initialDelaySeconds too short (app not ready yet)
# - Wrong port or path
# - timeoutSeconds too short for slow endpoints
# - periodSeconds too aggressive

Node Pressure

# Check node conditions
kubectl describe node <node> | grep -A 5 "Conditions"

# Memory pressure: node is running low on memory
# Disk pressure: node filesystem is running low on space
# PID pressure: too many processes

# Check resource usage
kubectl top nodes
kubectl top pods -n <ns> --sort-by=memory

# Find pods using the most resources
kubectl top pods -A --sort-by=cpu | head -20

Network Policy Debugging

# List all network policies
kubectl get networkpolicy -A

# Check if a specific pod is affected
kubectl describe networkpolicy <policy-name> -n <ns>

# Test connectivity from a pod
kubectl exec -it <pod> -n <ns> -- curl -v --connect-timeout 5 http://<target>:<port>

# Common issue: default deny policy with no matching allow rule
# Check: does the pod's labels match the policy's podSelector?
# Check: does the target match the policy's egress rules?
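The "default deny with no matching allow rule" failure mode typically involves a pair of policies like these (namespace, labels, and port are illustrative):

```yaml
# Default deny: empty podSelector selects every pod in the namespace, allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: prod
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
---
# Matching allow rule: without this, ingress to app=api pods is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
```

If connectivity fails, compare the source pod's labels against the `from.podSelector` and the target port against the policy's `ports` list.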

HPA / Autoscaling

Kubernetes workloads rarely experience constant load. Traffic spikes during business hours, batch jobs surge at end-of-month, and marketing campaigns drive unpredictable bursts. Manual scaling is slow, error-prone, and does not work at 3 AM. The HorizontalPodAutoscaler (HPA) is the primary mechanism Kubernetes provides for matching pod count to actual demand.

HorizontalPodAutoscaler v2

The HPA controller runs in the kube-controller-manager. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), it queries metrics, computes the desired replica count, and patches the target workload's .spec.replicas.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

The formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)).
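Worked through in code (replica counts and utilization numbers are illustrative):

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    # desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
    return math.ceil(current * (current_metric / target_metric))

# 4 pods averaging 90% CPU against a 70% target -> scale up to 6.
print(desired_replicas(4, 90, 70))  # 6
# Below target, the same formula scales down: 6 pods at 35% -> 3.
print(desired_replicas(6, 35, 70))  # 3
```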

Default trap: HPA requires resources.requests to be set on pods for CPU-based scaling. If requests are not set, HPA shows <unknown> for the current metric and never scales. This is the number-one reason HPA "doesn't work."
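The fix is to set requests on every container in the target workload (values here are illustrative):

```yaml
# Without resources.requests.cpu, "CPU Utilization" is undefined and HPA shows <unknown>.
containers:
- name: web
  image: web-frontend:1.8.0
  resources:
    requests:
      cpu: 250m        # utilization percentage is measured against this value
      memory: 256Mi
    limits:
      memory: 512Mi
```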

Metrics Server

The HPA relies on the Metrics API for pod resource usage. The most common provider is metrics-server.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl top nodes && kubectl top pods -n production

Custom and External Metrics

Metric Type   Source                      Use Case
Resource      metrics-server              CPU, memory utilization
Pods          custom metrics adapter      Per-pod app metrics (RPS, latency)
External      external metrics adapter    Cloud services (SQS depth, Pub/Sub backlog)
Object        custom metrics adapter      Metrics from a specific k8s object (Ingress RPS)

Common adapters: Prometheus Adapter, Datadog Cluster Agent, KEDA.
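A Pods-type metric slots into the same autoscaling/v2 HPA spec. This fragment assumes a custom metrics adapter (e.g., Prometheus Adapter) exposing a hypothetical http_requests_per_second metric:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # hypothetical; must be exposed by your adapter
    target:
      type: AverageValue
      averageValue: "100"              # scale so each pod handles ~100 RPS on average
```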

Scaling Behavior

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Min

The stabilization window prevents flapping by looking back over a window of recommendations. For scale-down, it defaults to 300 seconds. For scale-up, the default is 0 (react immediately).
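The scale-down window can be sketched as "act on the highest recommendation seen in the last N seconds" (a simplification of the controller's actual bookkeeping):

```python
def stabilized_scale_down(window_recommendations: list[int]) -> int:
    # Scale-down uses the MAX recommendation in the window, so a brief
    # dip in load does not immediately shrink the workload.
    return max(window_recommendations)

window = [10, 9, 4, 9, 10]   # one noisy low sample inside the 300s window
print(stabilized_scale_down(window))  # 10 -> no scale-down yet
```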

VPA (Vertical Pod Autoscaler)

VPA adjusts container resource requests and limits instead of replica count. VPA and HPA should not target the same metric — if HPA scales on CPU, do not let VPA adjust CPU requests.

HPA Debugging

kubectl describe hpa web-frontend -n production   # events + current vs target metrics
kubectl autoscale deployment web --cpu-percent=70 --min=3 --max=20   # imperative HPA for quick tests
kubectl get hpa -w   # watch replica decisions live

Key checks: are current metrics populated or <unknown>? Do pods have resources.requests.cpu set? Is metrics-server healthy?

HPA Key Takeaways

  1. Always set resource requests on pods that HPA targets.
  2. Prefer CPU over memory as the primary scaling metric.
  3. Use stabilization windows to prevent flapping.
  4. Custom metrics via Prometheus Adapter or KEDA unlock scaling on business signals.
  5. HPA cannot scale to zero natively — use KEDA or Knative for that.

Probes (Liveness / Readiness / Startup)

Kubernetes probes determine whether your pod receives traffic and whether it gets restarted. Probes are the most commonly misconfigured part of a Kubernetes deployment.

Three Probe Types

Probe       Question It Answers                           Failure Action
Liveness    Is this container still alive?                Kill and restart the container
Readiness   Can this container serve traffic right now?   Remove from Service endpoints
Startup     Has this container finished starting up?      Keep waiting (block liveness/readiness)

Probe Mechanisms

  • httpGet — Send HTTP GET. Success = 2xx/3xx. Use for web servers, APIs.
  • tcpSocket — Open TCP connection. Success = port open. Use for databases, brokers.
  • exec — Run command. Success = exit code 0. Use for custom checks.
  • grpc — gRPC health check (K8s 1.24+). Use for gRPC services.

Probe Parameters

Parameter             Default   Description
initialDelaySeconds   0         Wait before first probe
periodSeconds         10        Probe frequency
timeoutSeconds        1         Timeout per probe
successThreshold      1         Consecutive successes needed
failureThreshold      3         Consecutive failures before action
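These parameters combine into a worst-case detection time. A quick approximation (it ignores probe execution time and jitter):

```python
def seconds_until_action(period: int, failure_threshold: int) -> int:
    # Roughly failure_threshold consecutive failed probes, one per period.
    return failure_threshold * period

# Defaults: 3 failures x 10s period ~= 30s before a liveness restart fires.
print(seconds_until_action(period=10, failure_threshold=3))  # 30
```

Tune with this in mind: a liveness probe at periodSeconds: 5, failureThreshold: 2 restarts after roughly 10 seconds of failure, which may be too aggressive for an app with long GC pauses.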

Liveness vs Readiness Design

Liveness should be fast, local, and dependency-free:

from fastapi import FastAPI  # FastAPI assumed; handlers are illustrative
app = FastAPI()

@app.get("/healthz")
def liveness():
    return {"status": "alive"}

Readiness SHOULD check dependencies:

from fastapi.responses import JSONResponse

@app.get("/ready")
def readiness():
    try:
        db.execute("SELECT 1")  # db: your application's database client
    except Exception:
        return JSONResponse(status_code=503, content={"reason": "database unavailable"})
    return {"status": "ready"}

War story: A team configured liveness probes to check the database connection. When the database had a 30-second blip, Kubernetes killed and restarted all 50 API pods simultaneously. All 50 reconnected at once, overwhelming the database connection pool and causing a cascading failure that lasted 20 minutes. The fix was moving the database check to the readiness probe and making liveness a simple /healthz that returns 200 if the process is alive.

The cardinal sin is checking dependency health in liveness. When the database goes down, every pod fails liveness, Kubernetes restarts them all, they thundering-herd the database on reconnect — cascading restart storm.

Startup Probes

Startup probes replace the fragile initialDelaySeconds pattern. Until the startup probe succeeds, liveness and readiness probes are disabled:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 60  # 5 * 60 = 300 seconds max startup time

Complete Pod Spec Example

containers:
  - name: myapp
    image: myapp:1.4.2
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 60
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3

Probe Gotchas

  • Liveness == Readiness (same endpoint) — Dependency outage triggers restarts. Always separate.
  • Timeout too short for GC pauses — JVM full GC can pause for seconds.
  • No startup probe on slow-starting apps — Must guess initialDelaySeconds.
  • Probing the wrong port — Common when copying manifests between projects.

Wiki Navigation

Prerequisites

  • Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1)
