# Interview Gauntlet: Managed Database or Self-Hosted?
**Category:** Architecture Trade-offs · **Difficulty:** L2-L3 · **Duration:** 15-20 minutes · **Domains:** Database, Cloud Infrastructure
## Round 1: The Opening
Interviewer: "Your team needs a PostgreSQL database for a new production service. Should you use a managed service like AWS RDS or self-host PostgreSQL on EC2 or Kubernetes?"
### Strong Answer
"For most teams, I'd default to managed (RDS) and only consider self-hosting if there's a specific reason. RDS provides automated backups, point-in-time recovery, automated minor version upgrades, Multi-AZ failover, read replicas with a few clicks, monitoring via CloudWatch, and storage autoscaling. You'd need a full-time DBA or dedicated SRE time to replicate this manually on EC2 or Kubernetes. The cost premium for RDS over equivalent EC2 instances is roughly 30-40% — but you're buying operational hours, not just compute. A 4-person team self-hosting PostgreSQL will spend 10-20% of an engineer's time on database operations: patching, backup verification, failover testing, storage management, and incident response. That engineering cost usually exceeds the RDS premium. That said, I'd self-host if: the cost at scale is prohibitive (RDS costs grow faster than EC2 at high instance sizes), we need database features that RDS restricts (custom extensions, specific kernel tuning, logical replication setups that RDS doesn't support), or compliance requires running on specific infrastructure (on-prem, air-gapped). The question I'd ask the team: 'Do you want to be in the business of operating PostgreSQL, or the business of building your product?'"
### Common Weak Answers
- "Always use managed — why would you self-host?" — This ignores legitimate reasons for self-hosting (cost, customization, compliance).
- "Self-host for control." — Vague. What specific control do you need that RDS doesn't provide? Most teams overestimate how much control they'll actually use.
- "Run it on Kubernetes with a PostgreSQL operator." — Running stateful databases on Kubernetes adds significant complexity. This needs justification, not assumption.
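The default-to-managed logic from the strong answer can be sketched as a tiny decision helper. This is purely illustrative: the function name and trigger flags are mine, not any real AWS or PostgreSQL API.

```python
# Hypothetical decision helper encoding the Round 1 criteria:
# default to managed, self-host only when a concrete trigger applies.
def recommend_deployment(cost_prohibitive_at_scale: bool = False,
                         needs_restricted_features: bool = False,
                         compliance_requires_own_infra: bool = False) -> str:
    """Return 'self-hosted' only if a specific reason exists; else 'managed'."""
    if (cost_prohibitive_at_scale
            or needs_restricted_features
            or compliance_requires_own_infra):
        return "self-hosted"
    return "managed"
```

The point of writing it this way: "self-hosted" is never the default branch. Each flag maps to one of the three legitimate reasons named above (cost at scale, restricted features, compliance), and an answer that can't name which flag is true shouldn't pick self-hosting.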
## Round 2: The Probe
Interviewer: "The team is growing and now has 5 RDS instances across services. The monthly RDS bill is $12,000. The finance team asks if self-hosting would be cheaper. Walk me through the cost comparison."
What the interviewer is testing: The ability to do a realistic TCO (Total Cost of Ownership) analysis that includes both infrastructure and labor costs.
### Strong Answer
"$12,000/month for 5 RDS instances — so roughly $2,400/instance average. Let me estimate the self-hosted equivalent. First, the compute cost: RDS pricing is roughly 30-40% above equivalent EC2 pricing. A db.r6g.xlarge RDS instance (4 vCPU, 32 GB) costs about $0.48/hour ($350/month). The equivalent r6g.xlarge EC2 instance costs about $0.25/hour ($180/month). So the raw compute savings would be about 40%, bringing $12,000 down to $7,200 — a $4,800/month saving. But then add the hidden costs. Storage: RDS includes managed storage with IOPS provisioning. On EC2, you manage EBS volumes yourself — similar cost but you're responsible for snapshots, sizing, and IOPS allocation. Backup management: RDS automates daily snapshots and PITR. On EC2, you'd write and maintain backup scripts (pg_basebackup + WAL archiving to S3), test restore procedures regularly, and monitor backup success. I'd estimate 2-4 hours/month of engineering time. High availability: RDS Multi-AZ is a checkbox. On EC2, you'd set up streaming replication, a failover mechanism (Patroni, repmgr, or pg_auto_failover), and a virtual IP or DNS update for the failover. Setup is 2-3 days; ongoing maintenance is 2-4 hours/month. Monitoring: RDS feeds CloudWatch automatically. On EC2, you'd configure and maintain a monitoring stack (pg_stat_statements, Prometheus postgres_exporter, alerting rules). The total engineering cost: I'd estimate 20-30 hours/month for a 5-instance fleet, including patching, backup verification, failover testing, and incident response. At $100-150/hour fully-loaded engineering cost, that's $2,000-4,500/month. Net savings: $4,800 compute savings minus $2,000-4,500 engineering cost = $300-2,800 net savings. It's break-even to marginally better, not the dramatic savings people expect. And that doesn't account for the risk: when RDS has an issue, AWS fixes it. When your self-hosted database has an issue at 3 AM, your engineer fixes it."
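The arithmetic in that answer can be captured in a small cost model. All figures are the illustrative estimates from the answer itself (40% compute discount, 20-30 ops hours/month, $100-150/hour fully loaded), not AWS pricing data.

```python
# Rough TCO sketch mirroring the numbers above.
def self_hosting_net_savings(rds_monthly: float,
                             compute_discount: float,
                             ops_hours_per_month: float,
                             loaded_hourly_rate: float) -> float:
    """Net monthly savings of self-hosting vs RDS (positive = self-hosting cheaper)."""
    compute_savings = rds_monthly * compute_discount   # EC2 vs RDS price gap
    ops_cost = ops_hours_per_month * loaded_hourly_rate  # engineering time
    return compute_savings - ops_cost

# The scenario above: $12,000/month, ~40% discount,
# 20-30 engineering hours at $100-150/hour.
pessimistic = self_hosting_net_savings(12_000, 0.40, 30, 150)  # ≈ $300
optimistic = self_hosting_net_savings(12_000, 0.40, 20, 100)   # ≈ $2,800
```

Putting the labor term next to the compute term makes the conclusion obvious: the engineering cost is the same order of magnitude as the compute savings, which is why the result is break-even rather than dramatic.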
### Trap Alert
If the candidate bluffs here: The interviewer will ask "What does 'point-in-time recovery' actually require if you self-host?" PITR requires continuous WAL archiving to a durable store (S3), a base backup to restore from, and then replaying WAL files up to the target timestamp: restore the base backup, set `restore_command` and `recovery_target_time`, create a `recovery.signal` file (PostgreSQL 12+), and start the server. Note that `pg_restore` is for logical dumps produced by `pg_dump` and plays no role in PITR. The process is well-documented but easy to get wrong if you haven't tested it. RDS does this automatically; self-hosted requires regular restore drills.
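In configuration terms, self-hosted PITR looks roughly like the sketch below, assuming wal-g as the archiving tool and an S3 bucket configured via the `WALG_S3_PREFIX` environment variable (both illustrative choices; pgBackRest or raw `archive_command` scripts work the same way):

```ini
# postgresql.conf on the primary: continuously archive WAL to S3.
# Assumes wal-g is installed and WALG_S3_PREFIX points at the bucket.
archive_mode = on
archive_command = 'wal-g wal-push %p'

# On the restore host: restore a base backup first, then set these
# and replay WAL up to the target time.
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2024-06-01 03:10:00 UTC'
# Then create an empty recovery.signal file in the data directory
# and start PostgreSQL (v12+); it replays WAL until the target.
```

The restore drill the text recommends is exactly this sequence executed end-to-end on a scratch host, timed, with the recovered data verified.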
## Round 3: The Constraint
Interviewer: "The application grows. You now have a 2 TB PostgreSQL database doing 10,000 transactions per second. RDS works but costs $8,000/month for a single instance. Scaling up to the next instance type doubles the cost. What are your options?"
### Strong Answer
"At 2 TB and 10k TPS, we're hitting the limits where vertical scaling on RDS becomes very expensive. There are several options beyond just scaling up the instance. First, read replicas: if a significant portion of the 10k TPS is reads, offload them to read replicas. RDS for PostgreSQL supports up to 15 read replicas (the limit was 5 for years, so check current docs). The application needs to route read queries to the replica endpoint (many frameworks support this: Django's DATABASE_ROUTERS, SQLAlchemy's per-engine binds). This can absorb 60-80% of read load without changing the primary. Second, connection pooling: if the TPS is limited by connection overhead, adding PgBouncer in front of RDS can increase effective throughput by 3-5x by reusing connections. RDS Proxy is the managed option but has its own cost. Third, data partitioning: if the 2 TB table has a natural partition key (like date for time-series data, or tenant_id for multi-tenant), PostgreSQL native partitioning can improve query performance and enable partition-level operations (drop old partitions instead of DELETE). Fourth, caching layer: put Redis or ElastiCache in front of the database for frequently-read, infrequently-changed data. A well-designed cache can reduce database TPS by 70-90%. Fifth, at this scale, it's worth evaluating Aurora PostgreSQL, which has a different storage architecture (shared distributed storage) and is marketed at up to 3x the throughput of standard PostgreSQL at a similar price point. Aurora also supports up to 15 read replicas with replica lag typically in the tens of milliseconds."
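The read-replica routing mentioned above can be sketched as a Django database router. The hook names (`db_for_read`, `db_for_write`) are Django's documented router interface; the alias names `default` and `replica` are assumptions about the `DATABASES` setting.

```python
# Minimal sketch of a Django database router that sends reads to a
# replica alias and writes to the primary. A router is a plain class,
# so this needs no Django import.
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"   # route SELECTs to the read-replica endpoint

    def db_for_write(self, model, **hints):
        return "default"   # all writes go to the primary

    def allow_relation(self, obj1, obj2, **hints):
        return True        # both aliases point at the same dataset

# settings.py would register it:
# DATABASE_ROUTERS = ["myapp.routers.PrimaryReplicaRouter"]
```

One caveat a strong candidate would mention: replica lag means a read issued immediately after a write may not see it, so read-your-own-writes paths should pin to the primary.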
### The Senior Signal
What separates a senior answer: Not immediately jumping to "shard the database" or "switch to NoSQL." At 2 TB and 10k TPS, PostgreSQL is nowhere near its architectural limits — the issue is the provisioned resources, not the database engine. Read replicas, connection pooling, caching, and partitioning are all changes that extend PostgreSQL's runway without the massive engineering cost of sharding or a database migration. Knowing when the database engine is the bottleneck vs when the infrastructure is the bottleneck is a key senior skill.
## Round 4: The Curveball
Interviewer: "A compliance audit requires that your database can be restored to any point in the last 90 days with no more than 1 minute of data loss (RPO of 1 minute). RDS automated backups have a maximum retention of 35 days. How do you meet this requirement?"
### Strong Answer
"RDS automated backups max out at 35 days, so we need a supplementary strategy for the 36-90 day window. Options: first, RDS manual snapshots. Manual snapshots persist until explicitly deleted and aren't subject to the retention window. I'd create a daily manual snapshot and manage retention with a Lambda function or AWS Backup that deletes snapshots older than 90 days. But manual snapshots are point-in-time snapshots at the moment of creation; they don't provide PITR between snapshots. Second, AWS Backup: supports RDS and can enforce a 90-day retention policy for snapshot-based restore points, but I'd verify its continuous-backup (PITR) retention, which is subject to the same 35-day cap, meaning beyond 35 days you get snapshot-level restore points only. Third, a self-managed WAL archiving pipeline: configure PostgreSQL's WAL archiving (if using an EC2-based setup) to ship WAL files to S3 with a 90-day lifecycle policy. Combined with a periodic base backup, this provides PITR to any second within the 90-day window. For the RPO of 1 minute: RDS in Multi-AZ mode synchronously replicates to a standby, so failover loses zero data. For PITR, the RPO is bounded by the WAL archiving frequency. RDS uploads transaction logs to S3 every 5 minutes, so the default PITR RPO is 5 minutes, not 1 minute. To achieve 1-minute RPO, I'd need to verify that the WAL archiving latency is under 1 minute, or use a streaming replication setup to a separate region where WAL is applied continuously."
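The 90-day snapshot retention described above boils down to a selection rule; writing it as a pure function keeps it testable without AWS. In a real Lambda you would feed it the results of boto3's `describe_db_snapshots` and call `delete_db_snapshot` on each returned ID; the function name and tuple shape here are illustrative.

```python
# Sketch of the 90-day manual-snapshot retention rule, isolated from
# AWS API calls so the logic itself can be unit-tested.
from datetime import datetime, timedelta

def snapshots_to_delete(snapshots, now, retention_days=90):
    """Return IDs of manual snapshots older than the retention window.

    `snapshots` is a list of (snapshot_id, created_at) tuples,
    e.g. extracted from rds.describe_db_snapshots(SnapshotType="manual").
    """
    cutoff = now - timedelta(days=retention_days)
    return [sid for sid, created in snapshots if created < cutoff]
```

Keeping the rule pure also makes the compliance story auditable: the retention window is a single named parameter rather than logic buried in API plumbing.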
### Trap Question Variant
The right answer is "I need to verify RDS's actual WAL archiving latency." Most people assume RDS PITR gives you second-level granularity, and the PITR restore does let you specify a target time to the second. But the actual RPO depends on how frequently WAL is archived to S3, which has some lag. Candidates who confidently claim "RDS has zero RPO for PITR" haven't tested a recovery under pressure. Saying "I'd need to verify the actual archiving latency in our configuration" is the honest, correct answer.
## Round 5: The Synthesis
Interviewer: "You've been managing databases for years. What's the one mistake you see teams make most often with their database infrastructure?"
### Strong Answer
"Not testing backups. Every team has backups running — RDS automated snapshots, pg_dump cron jobs, WAL archiving. But very few teams regularly test the restore process. I've seen teams discover during an actual outage that their backup files are corrupted, that the restore process takes 6 hours instead of the assumed 30 minutes, or that the restored database is missing recent transactions because the WAL archiving was silently failing. The fix is simple but requires discipline: schedule a monthly restore drill. Spin up a new RDS instance from the latest snapshot, verify the data is complete and consistent (row counts on key tables, check the latest transaction timestamp), measure the restore time, and document the results. Automate this with a Lambda or a CI pipeline so it happens without human intervention. The related mistake is not knowing your RTO. Teams say 'we have backups' without knowing how long a full restore takes. A 2 TB database restoration from an RDS snapshot takes 30-60 minutes for the volume restore plus application reconnection time. If the business expects 5-minute recovery, you need a hot standby (Multi-AZ) or read replica promotion, not a backup restore. The backup is for disaster recovery; the standby is for high availability. Confusing these two is how teams end up with hour-long outages when they promised minute-level recovery."
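The verification step of the monthly restore drill (row counts on key tables, latest transaction timestamp) can be sketched as a comparison function. The table names, tolerance, and function name are assumptions; in practice the inputs would come from queries against the production instance and the freshly restored one.

```python
# Sketch of the restore-drill check described above: compare row counts
# and the newest transaction timestamp between production and the
# restored copy, returning a list of problems (empty = drill passed).
def verify_restore(prod_counts: dict, restored_counts: dict,
                   prod_latest_ts, restored_latest_ts,
                   max_lag_seconds: float = 300.0) -> list:
    problems = []
    for table, expected in prod_counts.items():
        got = restored_counts.get(table)
        if got != expected:
            problems.append(
                f"{table}: expected {expected} rows, restored copy has {got}")
    lag = (prod_latest_ts - restored_latest_ts).total_seconds()
    if lag > max_lag_seconds:
        problems.append(
            f"restored copy is {lag:.0f}s behind production "
            f"(allowed: {max_lag_seconds:.0f}s)")
    return problems
```

Wrapping this in a scheduled job that spins up an instance from the latest snapshot, runs the checks, records the restore time, and tears the instance down is the automation the answer calls for.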
## What This Sequence Tested
| Round | Skill Tested |
|---|---|
| 1 | Managed vs self-hosted trade-off evaluation |
| 2 | Total cost of ownership analysis including labor |
| 3 | Database scaling strategies beyond vertical scaling |
| 4 | Backup/recovery compliance requirements and RPO mechanics |
| 5 | Operational wisdom about database reliability |