The Kubernetes Migration That Took a Year¶
Category: The Migration · Domains: kubernetes, containers · Read time: ~5 min
Setting the Scene¶
I was the lead SRE at a mid-size e-commerce company running 60 services on a fleet of EC2 instances managed by Ansible. We had Chef recipes nobody understood, deploy scripts held together with bash and hope, and a CEO who'd just come back from KubeCon very excited. The directive: "Get us on Kubernetes by Q2." That was January. Q2 came and went. So did Q3. And Q4.
We had a 12-person platform team, a shared EKS cluster in staging, and a migration spreadsheet that started with 60 rows and ended with 200 once we discovered all the sidecar processes nobody had documented.
What Happened¶
Week 1-2 — We containerized our first stateless API gateway. Wrote a Dockerfile, built a Helm chart, deployed to staging. Took two days. We high-fived. "We'll be done by March," I said. I was an idiot.
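For a truly stateless service, the first pass really is that small. A minimal sketch of the kind of Dockerfile involved — assuming a Node-style service purely for illustration; the base image, port, and entrypoint are invented, not our actual gateway:

```dockerfile
# Multi-stage build: compile/install in one stage, ship a slim runtime image
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

FROM node:18-alpine
WORKDIR /app
COPY --from=build /app .
EXPOSE 8080
USER node                      # don't run as root in the pod
CMD ["node", "server.js"]
```

The Helm chart on top was the stock `helm create` scaffold with liveness and readiness probes filled in. That was the trap, of course: the two-day win was representative of the easy fifteen services, not the other forty-five.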
Month 2 — The first 15 stateless services migrated smoothly. We had a `helm upgrade --install` pipeline in GitHub Actions, rolling deployments, health checks. I wrote a blog post draft titled "How We Migrated to Kubernetes in 90 Days." Never published it.
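The pipeline was nothing exotic. A sketch of the shape, with placeholder chart, namespace, and repo names — not our actual workflow:

```yaml
# .github/workflows/deploy.yml (all names illustrative)
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy via Helm
        run: |
          helm upgrade --install my-service ./charts/my-service \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait --timeout 5m --atomic
```

The `--atomic` flag rolls the release back automatically if the upgrade fails, which pairs nicely with the tested-rollback discipline we eventually adopted for everything.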
Month 3 — We hit the stateful services. Our order processing system used local disk for a write-ahead log. Our search service had a 200GB Elasticsearch cluster with custom shard allocation. Our legacy billing system talked to a local PostgreSQL instance over a Unix socket. None of this was going to "just work" in a pod.
Month 4-6 — We fought with PersistentVolumeClaims, EBS CSI drivers, and StatefulSets. The Elasticsearch migration alone took six weeks. Our PV reclaim policy was set to Delete by default, which we discovered when a node replacement nuked a volume and took our staging data with it. I added a `kubectl get pv` check to my morning routine.
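The durable fix was a StorageClass that retains volumes instead of deleting them when the claim goes away. Roughly, for the EBS CSI driver (class name and parameters illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain              # illustrative name
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain           # the default, Delete, was the whole problem
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```

Existing PVs can also be patched in place with `kubectl patch pv <name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'`, which is worth doing before you touch a single node.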
Month 7-9 — The billing system. It used shared memory segments to talk to a co-located process. We had to refactor the IPC layer to use gRPC, which meant touching code nobody had modified since 2019. The original developer had left. His comments were in a mix of English and what I think was Portuguese.
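Most of that refactor was giving names to the structs that had been flying over the shared-memory segment. A hedged sketch of what the gRPC contract might look like — the service, method, and field names here are invented for illustration, not the real billing API:

```protobuf
syntax = "proto3";

package billing;

// Replaces the struct previously passed over a shared-memory segment.
message ChargeRequest {
  string order_id = 1;
  int64 amount_cents = 2;
  string currency = 3;
}

message ChargeResponse {
  bool accepted = 1;
  string ledger_entry_id = 2;
}

service BillingGateway {
  rpc SubmitCharge(ChargeRequest) returns (ChargeResponse);
}
```

Once the contract existed, the two processes no longer had to be co-located, which is what made the service schedulable as an ordinary pod at all.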
Month 10-12 — Long tail. A service that shelled out to wkhtmltopdf. A cron job that assumed /var/log persisted across restarts. A monitoring agent that scraped /proc on the host. Each one was a week of discovery, refactoring, testing, and praying. We finished in December, eleven months after the January kickoff and two quarters past "Q2."
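The /var/log assumption is the archetypal long-tail fix: a container's filesystem vanishes on restart, so anything the job expects to survive has to be mounted in explicitly. A sketch of the CronJob shape, with every name illustrative and the PVC assumed to be pre-provisioned:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report          # illustrative
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/report:latest   # illustrative
              volumeMounts:
                - name: logs
                  mountPath: /var/log/report   # persists across runs now
          volumes:
            - name: logs
              persistentVolumeClaim:
                claimName: report-logs         # pre-provisioned PVC
```

The host /proc scraper went the other way: it became a DaemonSet with a hostPath mount, which is the one pattern where "reads the host" is actually the design.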
The Moment of Truth¶
The real moment wasn't the final cutover. It was month 5, when I sat in a meeting and said out loud: "This is going to take a year." The room went quiet. The CTO looked uncomfortable. But once we accepted the timeline, we stopped cutting corners. We built proper runbooks, tested rollbacks for every service, and ran dual-stack for the critical path. The last three months were actually the smoothest.
The Aftermath¶
We hit our revised December deadline. Stateless services had been running in production on EKS for eight months by then without incident. The stateful services needed babysitting for another two months. We decommissioned the last EC2 instance on February 3rd. I took a week off.
The Lessons¶
- Stateful workloads triple your timeline: Every service with local state, shared memory, or filesystem assumptions is a mini-project unto itself. Budget accordingly.
- Migrate stateless first: Get your easy wins deployed and stable while you figure out the hard stuff. It builds confidence and proves the platform.
- Have a rollback plan for each service: We rolled back four services during migration. Each one had a tested procedure. The ones without tested rollback plans were the ones that kept me up at night.
What I'd Do Differently¶
I'd spend the first month doing a deep audit of every service's runtime assumptions: filesystem access, IPC mechanisms, host-level dependencies, local state. I'd classify each service as "container-ready," "needs refactoring," or "needs redesign" before writing a single Dockerfile. And I'd add 50% padding to every estimate involving StatefulSets.
The Quote¶
"The Kubernetes migration plan survived exactly until we met our first StatefulSet."
Cross-References¶
- Topic Packs: Kubernetes Ops, Containers Deep Dive, K8s Storage
- Case Studies: Kubernetes Ops