The Autoscaler That Almost Bankrupted Us

Category: The Close Call · Domains: kubernetes, finops · Read time: ~5 min


Setting the Scene

We ran a 60-node GKE cluster on Google Cloud. Our platform team had recently rolled out Horizontal Pod Autoscalers (HPAs) across all production services to handle traffic spikes without manual intervention. The setup was working great — during Black Friday, our checkout service scaled from 12 to 35 pods seamlessly. Everyone was happy. Too happy, maybe, because nobody questioned the defaults.

One of our QA engineers, Dana, was running a load test against our staging environment using Locust. Staging shared the same GKE cluster as production, separated by namespaces. The load test was designed to simulate 50,000 concurrent users hammering the product catalog API.

What Happened

Dana kicked off the load test at 3 PM on a Thursday and left for a dentist appointment. She planned to stop it at 5 PM. She forgot. The load test ran through the night.

The HPA for the catalog service had been configured with minReplicas: 3 and targetCPUUtilizationPercentage: 60. But nobody had set maxReplicas. When you don't set maxReplicas in a Kubernetes HPA, it doesn't default to something reasonable. The API server default was, at the time, maxReplicas: 2147483647 — the max value of a 32-bit integer.
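For reference, this is roughly what the fixed manifest looks like with an explicit ceiling. The `minReplicas` and `targetCPUUtilizationPercentage` values are the ones from our config; the target Deployment name is illustrative:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-hpa
  namespace: staging
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog              # illustrative deployment name
  minReplicas: 3
  maxReplicas: 5               # the ceiling that was missing
  targetCPUUtilizationPercentage: 60
```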

By midnight, the catalog service had scaled to 85 pods. By 3 AM, it was at 190. Each pod requested 2 CPU cores and 4GB of RAM. The cluster autoscaler, doing exactly what it was designed to do, started adding nodes. By 6 AM, the cluster had grown from 60 to 210 nodes, all n2-standard-8 instances at $0.38/hour each.

At 7:14 AM, our Slack channel #cloud-costs lit up. A custom bot we'd built using the GCP Billing API posted: "Daily projected spend has exceeded $12,400. Yesterday's actual: $4,100. Current burn rate: $80,256/day." The bot included a breakdown by namespace and service. The catalog service in the staging namespace was responsible for 94% of the spike.

I was on the train when I saw the message. I ran kubectl get hpa -n staging from my phone using Termius. The output showed REPLICAS: 412 / 2147483647. Four hundred and twelve pods. For a load test nobody was watching.

I killed the load test with kubectl delete deployment locust-master -n staging and then scaled the HPA down manually: kubectl patch hpa catalog-hpa -n staging -p '{"spec":{"maxReplicas":5}}'. The cluster autoscaler started draining nodes within minutes.

The Moment of Truth

The Slack cost bot caught it 16 hours into the runaway scale. Without it, the load test would have run until Dana returned from the dentist — the next day. At $80K/day, a full 24-hour runaway would have cost more than our monthly cloud budget. The actual damage: about $6,200 in wasted compute. Painful, but survivable.

The Aftermath

We set maxReplicas on every HPA in the cluster — no exceptions. Our platform team wrote an OPA Gatekeeper policy that rejects any HPA manifest without an explicit maxReplicas value. We also added automatic termination to all load test configurations: Locust jobs now have a --run-time 2h flag and are wrapped in a Kubernetes Job with activeDeadlineSeconds: 7200. Finally, we moved staging to a separate cluster with a hard billing budget cap of $500/day.
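The load-test wrapper looks roughly like this. It's a sketch, not our exact manifest: the image tag, user count, and target host are illustrative, and in practice the locustfile is mounted in via a ConfigMap. The key line is `activeDeadlineSeconds`, which makes Kubernetes kill the Job even if Locust's own `--run-time` somehow fails to:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: locust-load-test
  namespace: staging
spec:
  activeDeadlineSeconds: 7200    # hard kill after 2h, regardless of Locust's state
  backoffLimit: 0                # don't restart a failed run
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: locust
          image: locustio/locust
          args:
            - "--headless"
            - "--run-time"
            - "2h"               # Locust's own timeout, belt to the Job's suspenders
            - "--users"
            - "50000"
            - "--host"
            - "http://catalog.staging.svc.cluster.local"  # illustrative target
```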

The Lessons

  1. Always set max replicas: An HPA without maxReplicas is an open credit card with no limit. Kubernetes defaults are not safe defaults. Every autoscaler configuration must have an explicit ceiling, reviewed and approved by both the platform team and finance.
  2. Cost monitoring is as important as performance monitoring: We had Prometheus, Grafana, and PagerDuty for uptime. But the cost bot in Slack was the thing that actually saved us. If you're not monitoring spend in real time, you're flying blind in the cloud.
  3. Load tests need automatic termination: A load test without a timeout is a runaway process. It should have a maximum duration, a maximum request count, and ideally a cost ceiling. Never rely on a human remembering to stop it.
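A minimal sketch of the Gatekeeper policy from lesson 1: a ConstraintTemplate whose Rego rejects any HPA whose spec omits `maxReplicas`. The template and package names are illustrative, and a production version would also enforce an upper bound on the value:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: hparequiremaxreplicas
spec:
  crd:
    spec:
      names:
        kind: HpaRequireMaxReplicas
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package hparequiremaxreplicas

        # Reject any HPA that does not set an explicit maxReplicas
        violation[{"msg": msg}] {
          not input.review.object.spec.maxReplicas
          msg := "HPA manifests must set an explicit spec.maxReplicas"
        }
```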

What I'd Do Differently

I'd implement ResourceQuota objects per namespace from day one, capping total CPU and memory for staging at a fraction of production. Even if the HPA goes wild, it can't exceed the namespace quota. Belt and suspenders — the HPA cap stops the pods, the resource quota stops the namespace, and the billing alert stops the wallet.
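The quota itself is a few lines of YAML. The numbers below are illustrative, not our actual limits; the point is that the sum of all pod requests in the namespace can never exceed them, no matter what the HPA asks for:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-compute-cap
  namespace: staging
spec:
  hard:
    requests.cpu: "64"       # total CPU requests across all pods in staging
    requests.memory: 128Gi
    limits.cpu: "96"
    limits.memory: 192Gi
```

With this in place, pod number 33 (at 2 CPU per pod) simply fails to schedule with a quota-exceeded error instead of triggering the cluster autoscaler.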

The Quote

"The cloud will give you exactly as much infrastructure as you ask for. The problem is when you forget to stop asking."
