Interview Gauntlet: Multi-Region Kubernetes Deployment
Category: System Design | Difficulty: L3 | Duration: 15-20 minutes | Domains: Kubernetes, Networking
Round 1: The Opening
Interviewer: "Design a multi-region Kubernetes deployment for an e-commerce platform. The system needs to serve users in North America, Europe, and Asia-Pacific with low latency."
Strong Answer:
"I'd run a separate Kubernetes cluster in each region — us-east-1, eu-west-1, and ap-southeast-1 on AWS, or equivalent regions in GCP/Azure. Each cluster runs the full application stack: API servers, background workers, and a regional read replica of the database. Traffic routing happens at the DNS layer using Route 53 latency-based routing or a global load balancer like AWS Global Accelerator or Cloudflare. Users hit the nearest region automatically based on network latency. For stateless services, this is straightforward — each region runs the same deployment manifests. For stateful data, I'd use a primary-secondary database topology: writes go to the primary region (say, us-east-1), reads go to the local replica. This means writes from Europe and APAC incur cross-region latency, but reads are fast. If we need multi-write, we'd look at CockroachDB or DynamoDB Global Tables, but that adds consistency complexity. Deployments are staggered: roll out to one region, validate with synthetic checks and error budget monitoring, then promote to the next region."
Common Weak Answers:
- "Run one big cluster across regions." — Kubernetes doesn't work well with high-latency etcd communication. Cross-region etcd is a recipe for split-brain and performance degradation.
- "Just use a CDN." — CDNs cache static content but don't solve the compute or data locality problem for dynamic API requests.
- No mention of the data layer — Multi-region compute is easy; multi-region data is the hard part. Skipping this shows the candidate hasn't thought through the real challenge.
Round 2: The Probe
Interviewer: "The primary database is in us-east-1 and it goes down. What's your failover strategy, and how long are users affected?"
What the interviewer is testing: Whether the candidate understands the difference between failover strategies and can reason about RTO/RPO trade-offs.
Strong Answer:
"It depends on whether we've set up automated or manual failover, and what database we're running. For AWS RDS Multi-AZ, failover within the region is automatic and takes 1-2 minutes — but that's within-region, not cross-region. For cross-region failover of the primary database, I'd use RDS Cross-Region Read Replicas with manual promotion, or Aurora Global Database which supports managed failover. With Aurora Global Database, the RPO is typically under 1 second (replication lag) and the RTO is under 1 minute for managed planned failover, or 1-5 minutes for unplanned failover. During failover, I'd need to update the application's write endpoint. If we're using Route 53 health checks on the database endpoint, we can automate the DNS cutover, but DNS TTL adds propagation time — even with a 30-second TTL, some resolvers ignore low TTLs, so clients may hold the old record for 60 seconds or more. The total user impact window is roughly: detection time (1-2 minutes with health checks) + failover time (1-5 minutes) + DNS propagation (30-60 seconds) + connection pool refresh (depends on driver configuration, usually under 30 seconds). So a realistic worst case is 5-8 minutes of write unavailability for users not in the failover region. Reads continue serving from local replicas throughout."
Trap Alert:
If the candidate bluffs here: The interviewer will ask "What's the replication lag between us-east-1 and eu-west-1 for Aurora Global Database?" A reasonable answer is "typically 50-200ms for Aurora Global, but I'd need to measure under our specific write throughput to know the actual RPO." Claiming "zero data loss" for async replication is a red flag.
Round 3: The Constraint
Interviewer: "Management says the cross-region database replication costs are too high — $15k/month just for the Aurora Global Database replicas. Budget is being cut to $5k/month for the entire multi-region data strategy. What do you do?"
Strong Answer:
"At $5k/month total, Aurora Global Database is off the table. I'd rethink the data architecture. First question: do all regions actually need a database replica, or can we solve most of the latency problem with aggressive caching? For an e-commerce platform, product catalog reads massively outnumber writes. I'd put a Redis or ElastiCache cluster in each region for catalog data, product pages, and user session data, with a 5-minute TTL for catalog items. The application in EU and APAC reads from local cache for 95% of requests and falls back to a cross-region API call to the primary database for cache misses and writes. This costs roughly $500-1000/month per region for a modest Redis cluster, well within budget. For the write path, I'd accept the cross-region latency to us-east-1 — for cart operations and order placement, an extra 100-200ms is noticeable but tolerable. I'd use connection pooling (PgBouncer) in each region to manage the cross-region connections efficiently. For failover, instead of a hot replica, I'd rely on automated daily database backups to each region's S3 bucket. The RTO goes from 5 minutes to 30-60 minutes, and the RPO goes from near-zero to potentially 24 hours, but at $5k/month, that's the trade-off. I'd make sure the business understands this: we're trading recovery time for cost."
The Senior Signal:
What separates a senior answer: Willingness to challenge the architecture rather than just finding a cheaper version of the same thing. The cache-first approach is fundamentally different from the replica approach, and recognizing that most e-commerce reads can be served from cache (with acceptable staleness) shows practical design thinking. Also: explicitly stating the degraded RTO/RPO and getting business sign-off rather than silently accepting increased risk.
Round 4: The Curveball
Interviewer: "A new regulation requires that European user data must not leave the EU. How does this affect your architecture?"
Strong Answer:
"This is a data residency requirement and it changes the architecture significantly. European user data — account info, order history, payment data — must be stored in the EU. That means the EU region can't just be a cache or a read replica of a US primary; it needs its own primary database for EU users. I'd shard by user geography: when a user creates an account, their region is determined (by billing address, IP geolocation, or explicit selection), and their data is routed to the corresponding regional database. The application needs a routing layer that maps user IDs to their home region. This is a significant refactor if the application wasn't designed for it. For the transition, I'd start with a geo-router service that sits in front of the database layer and directs queries to the correct regional database based on user ID prefix or a lookup table. Cross-region data queries — like a US admin dashboard viewing EU user data — would need to go through the EU database, adding latency but maintaining compliance. I'd also need to audit the log pipeline, backup system, and any analytics pipeline to ensure EU user data doesn't flow to US-based systems. This is where data classification becomes critical."
Trap Question Variant:
The right answer involves acknowledging uncertainty. Candidates who confidently design a GDPR-compliant architecture without saying "I'd involve legal counsel to validate the specific requirements" are overstepping. Data residency regulations vary by country, industry, and data type. The technical design is necessary but not sufficient — legal review is mandatory. "I know the broad requirement, and here's my technical approach, but the specific regulatory interpretation needs legal sign-off" is the strongest answer.
Round 5: The Synthesis
Interviewer: "You've gone from a simple multi-region deploy to dealing with cost constraints and data residency. When is multi-region not worth it?"
Strong Answer:
"Multi-region isn't worth it when the complexity exceeds the value. Specifically: if your user base is concentrated in one geography and the latency difference between single-region and multi-region is 50-100ms for the distant users, that's rarely worth the operational overhead. If your team is fewer than 5 engineers, the operational burden of managing multiple clusters, cross-region data replication, and regional compliance will slow down feature development more than the uptime improvement is worth. And if your revenue doesn't justify it — multi-region done properly doubles or triples your infrastructure cost. A $50k/year SaaS doesn't need $30k/year in multi-region infrastructure. The honest framework is: multi-region exists to solve two problems — latency and availability. For latency, a CDN plus edge caching solves 80% of the problem at 10% of the cost. For availability, a well-architected single-region deployment with multi-AZ redundancy gives you 99.95%+ uptime, which is sufficient for most businesses. Multi-region becomes essential when: you have contractual SLAs above 99.99%, your revenue makes even 5 minutes of downtime cost more than the multi-region infrastructure, or regulations require data residency. For everyone else, it's premature optimization of availability."
What This Sequence Tested:
| Round | Skill Tested |
|---|---|
| 1 | Multi-region architecture fundamentals and data strategy |
| 2 | Database failover mechanics and RTO/RPO reasoning |
| 3 | Cost-constrained design and creative alternatives |
| 4 | Data residency compliance and honest scoping of expertise |
| 5 | Pragmatic judgment about when complexity isn't worth it |