
Thinking Out Loud: Kubernetes Ops

A senior SRE's internal monologue while working through a real cluster operations task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

It's Monday morning and the weekly cluster health review is flagged. Three namespaces are using more resources than budgeted, the etcd database is approaching its storage limit, and there's a stale canary deployment that's been sitting at 10% traffic split for two weeks. Time for some cluster hygiene.

The Monologue

Monday cluster review. Let me get a quick picture of the overall cluster state before diving into specifics.

kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,VERSION:.status.nodeInfo.kubeletVersion'

8 nodes, all Ready, all on 1.29.2. Good. Let me check the resource situation across namespaces. The alert said three namespaces are over budget.

kubectl top nodes

Cluster-wide: 67% CPU, 72% memory. Not critical, but higher than I'd like for a Monday morning. Let me find the over-consumers.

kubectl get resourcequotas --all-namespaces -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,CPU-USED:.status.used.requests\.cpu,CPU-HARD:.status.hard.requests\.cpu,MEM-USED:.status.used.requests\.memory,MEM-HARD:.status.hard.requests\.memory'

The analytics, ml-experiments, and staging namespaces are at 95%, 110%, and 88% of their CPU quotas. ML-experiments is OVER quota — how is that even possible? Oh wait, they probably created pods before the quota was applied. Existing pods are grandfathered.

Mental Model: ResourceQuotas Are Admission Controls, Not Enforcement

ResourceQuotas only prevent the creation of NEW pods that would exceed the budget. Existing pods that were created before the quota (or before a quota reduction) continue to run. This means a namespace can be "over quota" if the quota was tightened after pods were already running. To truly enforce limits, you need LimitRanges plus quotas.
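To get actual per-pod enforcement alongside the quota, a LimitRange injects default requests into containers that don't declare any (so every pod counts against the quota) and caps what a single container can ask for. A minimal sketch; the name and values below are illustrative, not this cluster's policy:

```shell
# Hypothetical LimitRange for the ml-experiments namespace (values are
# illustrative). defaultRequest is applied to containers that omit
# resource requests; max rejects any single container above the cap.
cat > limitrange-ml-experiments.yaml <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: ml-experiments
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    default:
      cpu: 500m
      memory: 512Mi
    max:
      cpu: "2"
      memory: 4Gi
EOF
# On the real cluster this would be applied with:
#   kubectl apply -f limitrange-ml-experiments.yaml
grep -c 'type: Container' limitrange-ml-experiments.yaml
```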

Let me deal with the ml-experiments namespace first since it's over quota.

kubectl get pods -n ml-experiments --sort-by=.metadata.creationTimestamp -o custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp,CPU-REQ:.spec.containers[*].resources.requests.cpu,STATUS:.status.phase'

There are 15 experiment pods, some from 3 weeks ago, all still running. This namespace is meant for short-lived experiments. These should have TTLs. Let me check if there's a team Slack channel... yeah, I'll ping them. But for now, I'll tag the old pods for visibility.
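The tagging can be scripted: compute each pod's age from its creationTimestamp and label anything past a cutoff. A sketch of the age math, using fixed timestamps so it's checkable; the `cleanup=candidate` label and 7-day cutoff are my conventions, not the team's:

```shell
# Age in whole days between a pod's creationTimestamp (RFC 3339, as kubectl
# prints it) and a reference time. Uses GNU date (-d); BSD date differs.
pod_age_days() {
  created_epoch=$(date -u -d "$1" +%s)
  ref_epoch=$(date -u -d "$2" +%s)
  echo $(( (ref_epoch - created_epoch) / 86400 ))
}

age=$(pod_age_days "2024-03-04T09:00:00Z" "2024-03-25T09:00:00Z")
echo "age: ${age}d"
if [ "$age" -gt 7 ]; then
  # For a real pod this would be:
  #   kubectl label pod <name> -n ml-experiments cleanup=candidate
  echo "stale: would label cleanup=candidate"
fi
```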

Now, the etcd storage issue. This is the more urgent one.

kubectl exec -it etcd-master-1 -n kube-system -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out=table

DB size: 7.2GB against our 8GB quota (--quota-backend-bytes; etcd's default is 2GB, and 8GB is the suggested ceiling). We're at 90%. Etcd getting full is a cluster-killer: once the quota is hit, etcd raises a NOSPACE alarm and rejects writes, the API server can't persist anything, and nothing works. Let me find out what's consuming the space.

kubectl exec etcd-master-1 -n kube-system -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key get / --prefix --keys-only | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20

Top consumers: /registry/events — 2.1M entries. Events. Kubernetes events are stored in etcd and they accumulate. The default event TTL is 1 hour, but if the API server isn't garbage-collecting properly...
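That sed|sort|uniq pipeline is worth a sanity check before trusting its numbers; here it is run against a few fabricated stand-in keys instead of the real /registry dump:

```shell
# Strip the last path segment from each etcd key, then count keys per
# prefix, busiest first. The sample keys below are illustrative.
printf '%s\n' \
  /registry/events/default/pod-a.17b1 \
  /registry/events/default/pod-b.17b2 \
  /registry/events/kube-system/dns.17b3 \
  /registry/pods/default/web-0 \
  /registry/services/specs/default/web \
| sed 's|/[^/]*$||' | sort | uniq -c | sort -rn
```

The top line correctly shows `/registry/events/default` with 2 keys, matching how the real run surfaced /registry/events as the biggest consumer.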

Let me check the kube-apiserver event TTL setting.

kubectl get pods -n kube-system -l component=kube-apiserver -o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n' | grep event

No --event-ttl flag set. So it's using the default 1 hour. But 2.1M events in etcd suggests the GC isn't keeping up, or something is generating events at a very high rate.

kubectl get events --all-namespaces --sort-by=.lastTimestamp --no-headers | wc -l

8,400 events visible through the API. That's not abnormal for the 1-hour window. So the live event volume is fine; the bloat must be old revisions of long-since-deleted events that were never compacted away in etcd.
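If high live event volume had been the culprit, grouping events by their source object would point at the noisy ones. A sketch of that grouping, run here against fabricated sample lines in place of real `kubectl get events` output:

```shell
# Count events per namespace/object. The two columns mimic what
# `kubectl get events -A --no-headers -o custom-columns=...` would emit;
# the sample data is fabricated.
events='default pod/web-0
default pod/web-0
default pod/web-0
kube-system pod/dns-1
default deploy/api'
printf '%s\n' "$events" | awk '{print $1, $2}' | sort | uniq -c | sort -rn | head -5
```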

Mental Model: Etcd Compaction and Defragmentation

Etcd stores all revisions of every key. Deleting a key doesn't free space — it marks the revision as a tombstone. You need compaction (remove old revisions) and then defragmentation (reclaim the space) to actually reduce the DB size. A growing etcd DB often means compaction isn't running, not that there's too much current data.

REV=$(kubectl exec etcd-master-1 -n kube-system -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out=json | jq '.[0].Status.header.revision')
kubectl exec etcd-master-1 -n kube-system -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key compact "$REV"

Compaction done (etcdctl compact returns once it's applied). That marks old revisions reclaimable, but the DB file hasn't shrunk yet; defragmentation does that. Defrag briefly blocks the member while it runs, so in a multi-member cluster you'd do one member at a time.

kubectl exec -it etcd-master-1 -n kube-system -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key defrag

Let me check the DB size after defrag.

kubectl exec -it etcd-master-1 -n kube-system -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out=table

DB size: 2.1GB. Down from 7.2GB. That's a massive compaction. The auto-compaction wasn't running. I need to set that up — --auto-compaction-mode=periodic --auto-compaction-retention=1h on the etcd flags. I'll do that as a maintenance change this week.
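The maintenance change can be sketched now. The two flags are etcd's own auto-compaction options; the static pod manifest path is the usual kubeadm location, which is an assumption about this cluster. Writing to a scratch file here; on the real control-plane node you'd edit the manifest in place and the kubelet restarts etcd automatically:

```shell
# Flags to add under the etcd container's command in the static pod
# manifest (typically /etc/kubernetes/manifests/etcd.yaml on kubeadm
# clusters -- an assumption). Periodic mode compacts every retention window.
cat > etcd-flags-snippet.yaml <<'EOF'
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h
EOF
grep -c 'auto-compaction' etcd-flags-snippet.yaml
```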

Now, the stale canary. Someone started a canary rollout of the search-service two weeks ago and never promoted or rolled back.

kubectl get rollout search-service -n commerce -o yaml 2>/dev/null || \
kubectl get deployment search-service -n commerce -o yaml | grep -A 5 strategy

It's using Argo Rollouts. Let me check the status.

kubectl argo rollouts get rollout search-service -n commerce

Status: Paused at step 2/5 (setWeight: 10%). It's been paused for 14 days. 10% of traffic is going to v2.8.0, 90% to v2.7.5. This is essentially running an A/B test for two weeks with no analysis. Either promote or abort.

Let me check if there are any error rate differences between the two versions.

kubectl exec -it prometheus-0 -n monitoring -- promtool query instant http://localhost:9090 'sum(rate(http_requests_total{service="search-service",code=~"5.."}[1h])) by (version) / sum(rate(http_requests_total{service="search-service"}[1h])) by (version)'

Both versions show 0.1% error rate. No difference. The canary is fine — it just needs to be promoted. But I'm not going to promote someone else's canary without talking to them. I'll message the search team and give them a deadline: promote by end of day or I abort.
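That comparison is easy to make mechanical, so a future canary doesn't sit unanalyzed for two weeks. A sketch of a promotion gate; the 0.05-percentage-point tolerance is my choice, not an established SLO:

```shell
# Compare baseline vs canary error rates (as fractions, e.g. 0.001 = 0.1%).
# Verdict is "promote" if the canary is within tolerance of the baseline,
# "investigate" otherwise. The 0.0005 tolerance is an illustrative choice.
canary_verdict() {
  awk -v base="$1" -v canary="$2" 'BEGIN {
    delta = canary - base
    if (delta < 0) delta = -delta
    verdict = (delta <= 0.0005) ? "promote" : "investigate"
    print verdict
  }'
}
canary_verdict 0.001 0.001   # both at 0.1%
canary_verdict 0.001 0.004   # canary 4x worse
```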

Alright, cluster review done. Etcd is healthy, I'll ping the over-quota teams, and the canary has a deadline. Let me document this in the weekly ops log.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Not have a regular cluster health review | Run a structured weekly review covering resources, storage, and stale state | Preventive maintenance catches issues before they become incidents |
| Panic when etcd is at 90% | Know that compaction + defragmentation will likely recover most of the space | Understanding etcd's revision storage model turns a scary alert into a routine operation |
| Not notice the stale canary | Review in-flight rollouts as part of the health check | Stale canaries waste resources and create an ambiguous state that complicates future deploys |
| Think ResourceQuotas enforce limits on existing pods | Know quotas are admission controls only and plan enforcement accordingly | This misunderstanding leads to false confidence in resource governance |

Key Heuristics Used

  1. Structured Cluster Review: Regularly check node health, resource utilization, etcd storage, and in-flight operations to catch issues before they escalate.
  2. Etcd Compaction Model: Etcd DB growth usually means compaction isn't running. Compact then defragment to reclaim space, then ensure auto-compaction is configured.
  3. Stale State Is a Liability: In-flight canaries, forgotten experiments, and unclaimed resources create operational ambiguity. Set deadlines and clean up.

Cross-References

  • Primer — Cluster architecture, etcd role, resource management, HPA, and probes
  • Street Ops — Etcd maintenance, resource inspection, HPA debugging, and probe tuning
  • Footguns — Etcd storage limits, ResourceQuota grandfathering, HPA flapping, and probe death spirals

Thinking Out Loud: HPA Tuning

A senior SRE's internal monologue while working through a real HPA tuning task.

The Situation

The checkout service has been slow during the daily traffic peak (12:00-13:00). The HPA is configured but pods aren't scaling fast enough — by the time new pods are ready, the peak is half over.

The Monologue

HPA not scaling fast enough. Let me start by understanding what the HPA is doing today.

kubectl describe hpa checkout-service -n commerce

Events show scaling up in steps of 3-4 pods every 5 minutes. The traffic spike is steep. Going from 3 to 12 pods takes 15+ minutes when scaling in steps.

Mental Model: HPA Scaling Velocity vs Traffic Ramp Rate

If your traffic ramps faster than the HPA can provision capacity, increase the HPA's scaling velocity via behavior policies. Other options: pre-scale on a schedule, lower the target utilization percentage, raise minimum replicas, or scale on a leading-indicator metric.
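The pre-scale option, for reference, can be as simple as a CronJob that patches minReplicas up shortly before the known daily peak. A sketch; the schedule, image, service account, and RBAC are all illustrative, and a matching job would lower minReplicas again after the peak:

```shell
# Hypothetical scheduled pre-scale for the 12:00 peak: bump the HPA's
# minReplicas at 11:45. Names, schedule, and RBAC setup are illustrative.
cat > prescale-checkout.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-checkout
  namespace: commerce
spec:
  schedule: "45 11 * * *"   # 15 minutes before the daily peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher   # needs RBAC to patch HPAs
          restartPolicy: Never
          containers:
          - name: patch
            image: bitnami/kubectl:1.29
            command: ["kubectl", "patch", "hpa", "checkout-service",
                      "-n", "commerce", "--type=merge",
                      "-p", "{\"spec\":{\"minReplicas\":12}}"]
EOF
grep -c 'minReplicas' prescale-checkout.yaml
```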

The fix: use the behavior field to allow faster scale-up. Scale up aggressively (allow doubling every 60s), scale down conservatively (5-minute stabilization, 10% per minute).

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60

With minReplicas raised to 6 and the aggressive scale-up policy: minute 0 = 6 replicas, minute 1 = 12. One minute of scaling lag instead of 15.
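The velocity math generalizes and is worth checking before shipping the config: with a Percent=100 policy over periodSeconds=60, replicas double each minute until the target is reached.

```shell
# Minutes for the HPA to reach a target replica count when the scale-up
# policy allows doubling (type: Percent, value: 100, periodSeconds: 60).
# Ignores pod startup time, which adds a constant on top.
minutes_to_target() {
  replicas=$1; target=$2; minutes=0
  while [ "$replicas" -lt "$target" ]; do
    replicas=$(( replicas * 2 ))
    minutes=$(( minutes + 1 ))
  done
  echo "$minutes"
}
echo "from 6 to 12: $(minutes_to_target 6 12) min"
echo "from 3 to 12: $(minutes_to_target 3 12) min"
```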

Mental Model: The Asymmetric Scaling Principle

Scale up aggressively, scale down conservatively. Over-provisioning briefly is a few cents. Under-provisioning during a spike is failed requests and revenue loss.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Increase maxReplicas | Analyze scaling velocity — the problem is speed, not ceiling | HPA was reaching the right target eventually; it couldn't get there fast enough |
| Set symmetric policies | Scale up aggressively, scale down conservatively | Cost asymmetry demands it |
| Not calculate total lag | Add up HPA period + policy period + pod startup time | Knowing the math predicts whether the config will work |

Thinking Out Loud: Probe Configuration

A senior SRE's internal monologue while working through a real probes issue.

The Situation

The Java-based recommendation-engine service experiences rolling restarts during peak hours. No crashes, no OOM — just Kubernetes restarting "healthy" pods.

The Monologue

Pods killed with no crash and no OOM. That screams liveness probe failure.

kubectl describe pod recommendation-engine-abc -n ml | grep -i "unhealthy\|probe failed\|killing"
# "Liveness probe failed: HTTP probe failed with statuscode: 503"

HTTP GET on /health, timeout 1 second. A Java app with a 1-second timeout during peak. GC pauses or thread pool exhaustion makes the health endpoint slow, and Kubernetes kills it for being "unhealthy" when it's just busy.

Mental Model: Liveness vs Readiness — The Cardinal Rule

Liveness = "Is this process deadlocked beyond recovery?" Readiness = "Can it handle traffic right now?" If liveness fails because the app is under load, you have the wrong thresholds. Liveness failures REDUCE capacity during the moment you need it most — a death spiral.

Fix: separate concerns. Make liveness lenient (5s timeout, 15s period, threshold 5 = 75s budget). Add readiness probe that's stricter (2s timeout, 5s period, threshold 3). Readiness failure removes from traffic (cheap, reversible). Liveness failure kills the pod (expensive, destructive).
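A sketch of the split configuration, matching the budgets above (liveness: 15s period x 5 failures = 75s before a kill; readiness: 5s period x 3 failures, roughly 15s before traffic is pulled). The `/health/live` and `/health/ready` endpoint paths and port are assumptions about this service:

```shell
# Lenient liveness vs strict readiness for the recommendation-engine
# container spec. Endpoint paths and port 8080 are assumptions.
cat > probes-snippet.yaml <<'EOF'
livenessProbe:
  httpGet: { path: /health/live, port: 8080 }
  timeoutSeconds: 5
  periodSeconds: 15
  failureThreshold: 5
readinessProbe:
  httpGet: { path: /health/ready, port: 8080 }
  timeoutSeconds: 2
  periodSeconds: 5
  failureThreshold: 3
EOF
grep -c 'Probe:' probes-snippet.yaml
```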

Mental Model: Probe Thresholds as Circuit Breakers

Readiness = circuit breaker (remove from rotation, recoverable). Liveness = last resort (kill and restart). Set readiness sensitive, liveness lenient.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Think "pods are restarting, app must be crashing" | Recognize the liveness probe failure pattern | No OOM + no crash = probe-driven restart |
| Tighten liveness to make it "more reliable" | Loosen liveness and add readiness | Strict liveness under load causes death spiral |
| Only fix Kubernetes config | Also advise fixing the app's health endpoint threading | Kubernetes config is a bandaid; real fix is in app architecture |