From Monolith to Misery
Category: The Migration | Domains: microservices, architecture | Read time: ~5 min
Setting the Scene
I was an SRE at a B2B SaaS company with a Rails monolith that had been growing for seven years. It was 400,000 lines of Ruby, deployed as a single process on each of 8 beefy VMs behind an HAProxy load balancer. Deploys took 45 minutes; the test suite took 90. But it worked. Customers were happy. P99 latency was 120ms.
Then the new VP of Engineering arrived from a FAANG company and declared we were "decomposing the monolith." Not into 5 services. Not into 10. Into 40. He had a spreadsheet with bounded contexts.
What Happened
Month 1 — We identified service boundaries and drew architecture diagrams on whiteboards. It looked elegant. Clean boxes, clean arrows. We picked gRPC for inter-service communication, Kubernetes for orchestration, and Istio for service mesh. The "modern stack."
Months 2-3 — We extracted the first 8 services: user auth, billing, notifications, product catalog, search, analytics, admin, and reporting. Each got its own repo, its own CI pipeline, and its own database. What used to be a method call — User.find(id) — became an HTTP request traversing a load balancer, a service mesh sidecar, and a network hop.
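To make that shift concrete, here is a minimal sketch of what changes for the caller. The names (`find_user_local`, `find_user_remote`, `fetch_over_network`) are illustrative, not from our codebase: the point is that the local call's entire failure surface is an exception in the same process, while the remote version forces every caller to budget for latency and partial failure.

```ruby
require "timeout"

# Hypothetical in-process lookup: one method call, and its whole failure
# surface is an exception raised in the same process.
def find_user_local(users, id)
  users.fetch(id) # raises KeyError if missing
end

# The same lookup as a remote call: now every caller must handle timeouts
# and partial failure. `fetch_over_network` stands in for an HTTP/gRPC
# client call.
def find_user_remote(id, timeout_s: 0.2, fetch_over_network:)
  Timeout.timeout(timeout_s) { fetch_over_network.call(id) }
rescue Timeout::Error
  nil # the caller must now decide: retry? fall back? fail the request?
end

users = { 1 => "ada" }
puts find_user_local(users, 1)                                           # => ada
puts find_user_remote(1, fetch_over_network: ->(id) { users[id] }).inspect # => "ada"
```

The `nil` on timeout is the uncomfortable part: every call site inherits a new decision that simply did not exist when this was a method call.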
Month 4 — Latency started creeping up. A single page load that previously made 3 database queries now made 12 network calls across 6 services. P99 went from 120ms to 800ms. Some requests hit 2 seconds when a downstream service was slow. Customers noticed.
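The latency creep is arithmetic, not bad luck: sequential calls add their medians, and the tail compounds because the chance that no hop is slow shrinks with every hop. A back-of-envelope sketch, with assumed (not measured) per-hop numbers:

```ruby
# Back-of-envelope tail math: with more sequential hops, best-case latency
# grows linearly and the odds of hitting at least one slow hop compound.
per_hop_p50_ms = 5.0   # assumed median latency per network call
per_hop_slow_p = 0.01  # assumed chance any single hop hits its slow path

[3, 12].each do |hops|
  best_case  = hops * per_hop_p50_ms
  p_any_slow = 1 - (1 - per_hop_slow_p)**hops
  printf("%2d hops: ~%.0fms best case, %.1f%% of requests hit a slow hop\n",
         hops, best_case, p_any_slow * 100)
end
```

With these assumptions, going from 3 to 12 hops roughly quadruples the floor and pushes the share of requests touching a slow hop from about 3% to about 11% — which is exactly how a 120ms P99 becomes 800ms.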
Months 5-6 — We added caching, circuit breakers, and retry logic. Every service needed a Resilience4j wrapper. We introduced distributed tracing with Jaeger because nobody could debug a request that touched 8 services. The team spent more time on infrastructure plumbing than on features. Our deploy frequency actually dropped — coordinating changes across 40 repos was harder than deploying one monolith.
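For readers who haven't had to bolt this on, here is the circuit-breaker pattern in miniature — a toy Ruby analogue of what libraries like Resilience4j provide, not our production wrapper: after a run of consecutive failures the breaker opens and fails fast, giving the downstream service room to recover instead of hammering it.

```ruby
# Minimal circuit breaker: open after `threshold` consecutive failures,
# fail fast while open, and allow a retry after `cooldown_s` seconds.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cooldown_s: 30)
    @threshold, @cooldown_s = threshold, cooldown_s
    @failures, @opened_at = 0, nil
  end

  def call
    if @opened_at && (Time.now - @opened_at) < @cooldown_s
      raise OpenError, "failing fast, breaker open"
    end
    result = yield
    @failures, @opened_at = 0, nil # any success closes the breaker
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end
end
```

Each remote call gets wrapped — `breaker.call { client.get(...) }` — and that is the plumbing tax: every one of 40 services needs this around every dependency, plus tuning of thresholds and cooldowns.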
Month 7 — An incident. The billing service had a connection pool leak. It cascaded. The retry storms from 6 upstream services brought down the user auth service. Auth going down took out everything. We had a 47-minute outage that would never have happened with the monolith, because the billing code and auth code shared a process and a connection pool.
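The retry storm is the part worth internalizing. Naive retries multiply load exactly when a dependency is least able to absorb it. The discipline that damps this — exponential backoff with full jitter, plus a per-caller retry budget so callers shed load rather than amplify it — looks roughly like this sketch (the helper name and budget shape are illustrative):

```ruby
# Retries with exponential backoff, full jitter, and a shared retry budget.
# When the budget is spent, callers fail immediately instead of piling on.
def with_retries(max_attempts: 3, base_delay_s: 0.1, budget:, rng: Random.new)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError
    raise if attempt >= max_attempts || budget[:remaining] <= 0
    budget[:remaining] -= 1
    sleep(rng.rand * base_delay_s * (2**(attempt - 1))) # full jitter
    retry
  end
end
```

In the incident above, the 6 upstream services each retried without budgets or backoff, so a connection pool leak in billing became a synchronized flood into auth.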
Month 8 — The VP left. We quietly started merging services back together. We went from 40 to 12. The 12 that survived were genuinely independent domains: auth, billing, notifications, and a few others that had different scaling profiles.
The Moment of Truth
I was standing in the postmortem for the cascade failure, looking at a Jaeger trace that spanned 14 services and 23 network hops for what used to be a single Rails controller action. Someone said, "We replaced function calls with network calls and called it architecture." Nobody laughed, because it was true.
The Aftermath
With 12 services, things stabilized. P99 came back down to 200ms — still worse than the monolith but acceptable. Deploy frequency actually improved because the remaining services had real independence. The lesson became part of our onboarding: "We tried 40 microservices. It was bad. Here's why 12 is the right number for us."
The Lessons
- Start with 3-5 services, not 40: Extract only when you have a clear scaling or deployment reason. If two things always deploy together, they're one service.
- Distributed systems are harder than monoliths: Network calls fail. Latency compounds. Debugging spans services. You need tracing, circuit breakers, and retry budgets from day one.
- Measure before you split: If your monolith's P99 is 120ms, that's your baseline. Any decomposition that makes it worse needs justification.
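Establishing that baseline is cheap. A minimal nearest-rank percentile over latency samples is enough to pin down the number you will hold every decomposition to (the function and the sample data here are illustrative):

```ruby
# Nearest-rank percentile estimate over a set of latency samples.
# Run this against the monolith's request latencies before extracting
# anything; that number is the bar every split has to clear.
def percentile(samples, pct)
  sorted = samples.sort
  rank = ((pct / 100.0) * (sorted.length - 1)).round
  sorted[rank]
end

latencies_ms = Array.new(1000) { |i| 50 + i / 10.0 } # stand-in samples
puts percentile(latencies_ms, 99)
```

In production you would feed this from real request logs or a histogram metric, but even this crude version turns "it feels slower" into a number you can put in a design review.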
What I'd Do Differently
I'd start by identifying the 3 services that genuinely need independent scaling or deployment cycles — usually auth, async processing, and one domain-specific hotspot. Extract those. Run them for 6 months. Then decide if you need more. I'd also require every proposed service boundary to answer: "What happens when this network call fails?" If the answer is "the whole request fails," it's not a real boundary.
The Quote
"We turned a 120ms monolith into a 2-second distributed system and called it progress."
Cross-References
- Topic Packs: Distributed Systems, API Gateways, Service Mesh