# The Kubernetes Migration That Took a Year
Topics: migration planning, bare metal → VMs → containers → K8s, blast radius, strangler fig
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic understanding of VMs and containers helpful
## The Mission
The CTO announces: "We're moving to Kubernetes." Engineering estimates 3 months. It takes 14 months. Six services are still on VMs a year later. The K8s cluster has had three production incidents caused by the migration itself. And the team is exhausted.
This isn't a Kubernetes tutorial. It's a lesson about migration — how to move a running production system from one platform to another without dropping everything in transit. The principles apply whether you're migrating to K8s, to the cloud, or between cloud providers.
## Why Migrations Take Longer Than Expected

### The Iceberg Problem
The visible work: containerize services, write manifests, deploy to K8s.
The hidden work (the other 80%):
```
Visible (20%):
├── Write Dockerfiles
├── Write K8s manifests
└── Deploy to cluster

Hidden (80%):
├── Persistent storage (databases, file uploads, sessions)
├── Networking (service discovery, DNS, firewalls between old and new)
├── Secrets management (move from files/env to K8s Secrets or Vault)
├── Monitoring and alerting (rewrite dashboards, adjust alerts)
├── CI/CD pipelines (rebuild for container workflow)
├── Local development workflow (everyone needs to learn containers)
├── Log aggregation (stdout/stderr vs log files, new pipeline)
├── Cron jobs (crontab → K8s CronJobs or systemd timers)
├── Stateful services (databases, message queues — hardest to move)
├── Compliance and security (new attack surface, new audit requirements)
├── Team training (everyone needs Kubernetes knowledge)
└── Edge cases (that one service with hardcoded file paths, the batch job
    that needs 50GB of temp space, the legacy app that only runs on
    CentOS 6)
```
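The cron-job item above is a good illustration of hidden work: a one-line crontab entry becomes a full manifest. A minimal sketch, assuming a hypothetical nightly report script (the name, image, and schedule are illustrative, not from the original system):

```yaml
# Hypothetical: the crontab entry
#   0 2 * * * /usr/local/bin/nightly-report.sh
# becomes a K8s CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"        # same five-field cron syntax
  concurrencyPolicy: Forbid    # don't start a new run if the last one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/nightly-report:1.0
              command: ["/usr/local/bin/nightly-report.sh"]
```

Note what the one-liner never had to answer: what image does the script run in, what happens on overlap, and who restarts it on failure. Each crontab line forces those decisions during migration.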
Mental Model: Migrating to Kubernetes is like renovating a house while living in it. You can't knock down all the walls at once — you move room by room, keeping the house livable the whole time. The "easy" part (painting) takes 20% of the budget. The hidden part (plumbing, electrical, foundation) takes 80%.
## The Strangler Fig Pattern
Don't migrate everything at once. Wrap the old system with the new one, routing traffic gradually:
```
Phase 1: Deploy K8s cluster alongside existing VMs
         Old system handles 100% of traffic
Phase 2: Migrate one stateless service (the simplest one)
         K8s handles 5% of traffic (one service), VMs handle 95%
Phase 3: Migrate more stateless services
         K8s handles 30%, VMs handle 70%
Phase 4: Migrate stateful services (databases stay last)
         K8s handles 80%, VMs handle 20%
Phase 5: Decommission VMs
         K8s handles 100%
```
Name Origin: The strangler fig is a tree that grows around a host tree, gradually replacing it. The old tree dies slowly as the fig takes over. Martin Fowler named this software pattern in 2004. The key insight: you never have a "big bang" cutover. The new system gradually replaces the old one, and at every step, you can stop or reverse.
### Migration order
- Stateless services first (API servers, web frontends) — easiest, no data to migrate
- Workers and batch jobs second — need to handle queue consumers on both platforms
- Databases and stateful services last — hardest, highest risk, most planning needed
## The Mistakes Everyone Makes

### Mistake 1: "Let's rewrite while we migrate"
"Since we're containerizing, let's also refactor to microservices and switch to GraphQL and adopt event sourcing."
Result: A migration that should take 3 months takes 14 because you're doing 4 things at once. Each change introduces bugs. You can't tell if the bug is from the migration or the rewrite.
Rule: Migrate first, improve later. Move the existing code to containers exactly as-is. Once it's stable on K8s, then refactor. One change at a time.
### Mistake 2: Migrating the database first
Databases are stateful, high-risk, and the hardest to move. Starting with them means the highest-risk work happens when the team has the least K8s experience.
Rule: Databases move last. Keep them on VMs (or managed services like RDS) even after everything else is on K8s. Many production systems run K8s workloads talking to external databases permanently.
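Keeping the database external doesn't mean hardcoding its address in every pod. One common pattern is an `ExternalName` Service, which gives workloads a stable in-cluster DNS name that resolves to the managed database; a sketch, with a hypothetical RDS hostname:

```yaml
# Hypothetical: app pods connect to "orders-db" in-cluster; K8s DNS
# returns a CNAME to the external RDS instance. The hostname below
# is illustrative, not a real endpoint.
apiVersion: v1
kind: Service
metadata:
  name: orders-db
spec:
  type: ExternalName
  externalName: orders.abc123.us-east-1.rds.amazonaws.com
```

If the database is ever replaced or moved, only this one object changes; application config stays pointed at `orders-db`.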
### Mistake 3: No hybrid connectivity plan
During migration, services on K8s need to talk to services on VMs. Without planning this:
```
K8s pod → needs database → database is on VM 10.0.1.50
How does the pod reach it?
```

- Does it use the VM's IP directly? (breaks if VM changes IP)
- DNS? (which DNS — CoreDNS in K8s or the VM's resolver?)
- VPN? (latency, configuration)
Rule: Plan hybrid networking before migrating the first service. Service discovery between old and new platforms must work seamlessly.
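One standard way to bridge the two worlds is a Service without a selector, backed by a manually managed Endpoints object pointing at the VM. A sketch using the VM IP from the example above (service name and port are illustrative):

```yaml
# A Service with no pod selector, so K8s creates no endpoints for it...
apiVersion: v1
kind: Service
metadata:
  name: legacy-db
spec:
  ports:
    - port: 5432
---
# ...and an Endpoints object (same name as the Service) pointing at the VM.
apiVersion: v1
kind: Endpoints
metadata:
  name: legacy-db
subsets:
  - addresses:
      - ip: 10.0.1.50   # the VM from the example above
    ports:
      - port: 5432
```

Pods connect to `legacy-db:5432` via normal cluster DNS. If the VM's IP changes, you update one Endpoints object instead of every client. (Newer clusters can express the same idea with an `EndpointSlice`.)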
### Mistake 4: No rollback per service
"We migrated service A to K8s. It's having problems. Can we move it back to the VM?"
If the VM was decommissioned, the CI/CD pipeline was dismantled, and the config was deleted — no. You're stuck.
Rule: Keep the old deployment working until the new one has been stable for at least 2 weeks. Rollback must be one command, not a project.
## The Migration Checklist Per Service
Before migration:
□ Dockerfile works and passes all tests
□ K8s manifests (Deployment, Service, Ingress) written and reviewed
□ Resource requests and limits set (not guessed — measured from current metrics)
□ Health checks (readiness + liveness) implemented and tested
□ Secrets moved to K8s Secrets or Vault
□ Logging to stdout/stderr (not files)
□ Monitoring dashboards updated for K8s metrics
□ CI/CD pipeline builds and pushes container image
□ Hybrid networking tested (K8s pod can reach VM dependencies)
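Three of the checklist items — measured resources, readiness, and liveness — land in the same manifest. A sketch of the relevant Deployment fragment, with hypothetical values (the resource numbers stand in for ones you would read off the VM's current metrics):

```yaml
# Hypothetical Deployment fragment. Paths, ports, and resource values
# are illustrative — the point is where each checklist item lives.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels: { app: api-server }
  template:
    metadata:
      labels: { app: api-server }
    spec:
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0
          resources:
            requests:              # taken from observed typical usage on the VM
              cpu: 250m
              memory: 256Mi
            limits:                # headroom above the observed peak, not a guess
              cpu: "1"
              memory: 512Mi
          readinessProbe:          # gate traffic until the app can actually serve
            httpGet: { path: /healthz/ready, port: 8080 }
            periodSeconds: 5
          livenessProbe:           # restart on genuine hangs, not on slow requests
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 15
            periodSeconds: 10
```

A common failure mode is making the liveness probe as strict as the readiness probe, which turns a brief dependency outage into a restart loop; keep liveness deliberately lenient.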
During migration:
□ Deploy to K8s (0% traffic)
□ Run smoke tests against K8s deployment
□ Shift 5% of traffic to K8s (canary)
□ Monitor for 24 hours
□ Shift 50% of traffic
□ Monitor for 48 hours
□ Shift 100% of traffic
□ Keep VM running as fallback for 2 weeks
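The 5% → 50% → 100% shifts above need a traffic splitter. One way to sketch this, assuming the front door is ingress-nginx and the old VM backend is still reachable through its own Ingress (hostnames and service names are hypothetical):

```yaml
# Hypothetical canary Ingress: sends 5% of traffic for api.example.com
# to the new in-cluster Service while the existing (VM-backed) Ingress
# keeps the remaining 95%. Requires a primary Ingress for the same host.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # raise to 50, then 100
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-server
                port: { number: 80 }
```

The weight is a single annotation, so shifting traffic — or rolling it back to 0 — is one `kubectl` edit, which is exactly the "rollback must be one command" property from Mistake 4. If your edge is an external load balancer instead, do the same split there.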
After migration:
□ Decommission VM (only after 2-week bake period)
□ Update runbooks for K8s operations
□ Train on-call team on K8s debugging for this service
□ Delete old CI/CD pipeline and configs
## Flashcard Check
Q1: Migration estimated 3 months, took 14. What went wrong?
The iceberg problem. Containerizing services (visible) is 20% of the work. Persistent storage, networking, secrets, monitoring, CI/CD, training — the hidden 80%.
Q2: What is the strangler fig pattern?
Gradually replace the old system by routing traffic to the new one, service by service. At every step, you can stop or reverse. No big-bang cutover.
Q3: Which services should you migrate first?
Stateless services (API servers, web frontends). Easiest, no data to migrate. Databases and stateful services migrate last.
Q4: "Let's rewrite while we migrate" — why is this a mistake?
You're doing 4 things at once. Each introduces bugs. You can't tell if failures are from the migration or the rewrite. Migrate first, improve later.
## Takeaways
- The hidden work is 80%. Networking, secrets, monitoring, CI/CD, training, compliance, edge cases. Don't estimate based on "write Dockerfiles and manifests."
- Strangler fig, not big bang. Migrate one service at a time. Keep the old system running. Rollback must be one command.
- Stateless first, databases last. Build K8s experience on low-risk services before touching the most critical ones.
- Migrate first, improve later. Don't rewrite and migrate simultaneously. Move the code as-is, stabilize, then refactor.
- Plan hybrid networking from day one. K8s pods and VMs will coexist for months. Service discovery between them must be seamless.
## Related Lessons

- What Happens When You `kubectl apply` — the K8s side of the migration
- Deploy a Web App From Nothing — building the deployment layer by layer
- The Rollback That Wasn't — when migration rollback fails