AWS RDS: The Managed Database Tradeoff
- lesson
- aws-rds
- aurora
- postgresql
- mysql
- multi-az-failover
- read-replicas
- rds-proxy
- performance-insights
- backup/restore
- encryption
- iam-database-auth
- connection-pooling
- maintenance-windows

# AWS RDS — The Managed Database Tradeoff

Topics: AWS RDS, Aurora, PostgreSQL, MySQL, Multi-AZ failover, read replicas, RDS Proxy, Performance Insights, backup/restore, encryption, IAM database auth, connection pooling, maintenance windows
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (SQL basics helpful but not required)
## The Mission

It's 3:07 AM. PagerDuty fires:

```text
CRITICAL: RDS instance "prod-api-db" — FreeStorageSpace below 2 GB
CloudWatch Alarm: rds-storage-critical
Current value: 1.8 GB (was 48 GB six days ago)
```

Your production PostgreSQL RDS instance is eating storage at a terrifying rate. The application is still running, but at this pace you have maybe 90 minutes before writes start failing. Nobody deployed anything unusual — or so the Slack thread claims.

You need to diagnose why storage is disappearing, buy yourself time, fix the root cause, and then set things up so this can't sneak up on you again. Along the way, you'll learn how RDS actually works under the hood — what AWS manages for you, what it hides from you, and where the "managed" in "managed database" quietly becomes "your problem."

By the end of this lesson you'll understand:

- How RDS storage, snapshots, and auto-scaling work (and why storage can vanish)
- Multi-AZ failover: what happens in those 60 seconds
- When Aurora is worth the premium (and when it isn't)
- How to use RDS Proxy, Performance Insights, and parameter groups
- The maintenance window trap that reboots databases without asking
## Part 1: First Response — Storage Is Vanishing
Before anything: how bad is it, and what's consuming space?
```bash
# Check current free storage (CloudWatch — you can't SSH into RDS)
# (date -d is GNU date; on macOS use: date -u -v-7d +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=prod-api-db \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Minimum \
  --output table
```
You can't `df -h` or `du -sh` on RDS. That's the first tradeoff of managed databases: you trade SSH access for operational convenience. Everything goes through the AWS API, CloudWatch, or SQL.
```bash
# Connect to the RDS instance
psql "host=prod-api-db.abc123.us-east-1.rds.amazonaws.com \
      port=5432 user=admin dbname=appdb sslmode=require"
```

```sql
-- What's eating storage? Check table sizes.
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 15;

-- Check for dead tuple bloat (MVCC strikes again)
SELECT relname, n_dead_tup, n_live_tup,
       round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
```
Under the Hood: RDS PostgreSQL uses the same MVCC mechanism as self-hosted PostgreSQL. Updates don't modify rows in place — they create new row versions and mark old ones dead. `VACUUM` reclaims dead rows. But here's the RDS twist: autovacuum runs with RDS-managed parameter defaults that are often too conservative for write-heavy workloads. A batch job that updates millions of rows can generate dead tuples faster than autovacuum cleans them.
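To make the mismatch concrete, here's a toy back-of-the-envelope model (illustrative numbers, not RDS defaults): a nightly batch UPDATE creates dead tuples faster than a fixed autovacuum reclaim rate removes them.

```python
def simulate_bloat(nights, updated_per_night, vacuumed_per_day, dead=0):
    """Toy model of dead-tuple growth: each UPDATE leaves one dead row
    version behind; autovacuum reclaims at a fixed daily rate."""
    history = []
    for _ in range(nights):
        dead += updated_per_night               # batch UPDATE creates dead tuples
        dead = max(0, dead - vacuumed_per_day)  # autovacuum cleans some of them
        history.append(dead)
    return history

# Hypothetical rates: 10M rows updated nightly, autovacuum reclaims only 2M/day
growth = simulate_bloat(nights=6, updated_per_night=10_000_000,
                        vacuumed_per_day=2_000_000)
print(growth[-1])  # 48000000
```

With these hypothetical rates, bloat compounds by 8 million tuples a night, which is how a table quietly accumulates tens of millions of dead rows in under a week.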
### The Diagnosis
You find it: a session_events table with 2 million live rows and 48 million dead tuples.
An analytics batch job has been running nightly UPDATE statements across the entire table,
and autovacuum hasn't kept up.
### The Fix (Buy Time First)

```sql
-- Manual vacuum on the bloated table (doesn't lock reads)
VACUUM VERBOSE session_events;

-- Check WAL (transaction log) directory size — another storage consumer
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- Check for old replication slots holding WAL
SELECT slot_name, active, pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```
Gotcha: Inactive replication slots are the silent storage killer on RDS PostgreSQL. If you created a logical replication slot for a migration and forgot to drop it, WAL files accumulate indefinitely. AWS won't auto-drop them. A single orphaned slot can consume hundreds of gigabytes. Once you've confirmed a slot is unused, drop it with `SELECT pg_drop_replication_slot('slot_name');`.
### Enable Storage Auto-Scaling (So This Doesn't Page You Again)

```bash
aws rds modify-db-instance \
  --db-instance-identifier prod-api-db \
  --max-allocated-storage 200 \
  --apply-immediately
```

| Parameter | What it does |
|---|---|
| `--allocated-storage` | Current allocated storage (floor) |
| `--max-allocated-storage` | Auto-scaling ceiling — RDS scales storage up automatically when free space drops below 10% |

Storage auto-scaling only goes up, never down; think of it like a ratchet. Once RDS grows your volume from 50 GB to 120 GB, you keep paying for 120 GB (shrinking means migrating to a new, smaller instance). Set `--max-allocated-storage` to a sensible ceiling so a runaway process can't grow your volume (and your bill) toward the 64 TB maximum.
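The ratchet behavior is easy to model. A simplified sketch (the real service also factors in predicted growth and cooldown periods, which are ignored here):

```python
import math

def autoscale(allocated_gb, used_gb, max_allocated_gb):
    """Simplified model of RDS storage auto-scaling: when free space drops
    below 10%, grow by the larger of 10 GiB or 10% of the current size,
    never past the ceiling -- and never back down."""
    while used_gb > 0.9 * allocated_gb and allocated_gb < max_allocated_gb:
        step = max(10, math.ceil(allocated_gb * 0.10))
        allocated_gb = min(allocated_gb + step, max_allocated_gb)
    return allocated_gb

print(autoscale(allocated_gb=50, used_gb=48, max_allocated_gb=200))  # 60
# Usage later drops to 10 GB, but allocation never shrinks:
print(autoscale(allocated_gb=60, used_gb=10, max_allocated_gb=200))  # 60
```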
### Flashcard Check #1
| Question | Answer |
|---|---|
| Can you SSH into an RDS instance? | No. RDS is a managed service — you interact through the AWS API, CloudWatch, and SQL connections. No shell access. |
| What are the two most common causes of RDS storage consumption? | Dead tuple bloat (MVCC — fix with VACUUM) and orphaned replication slots retaining WAL files. Transaction logs and automated backup snapshots also consume space. |
| Does RDS storage auto-scaling reduce storage when usage drops? | No. Storage auto-scaling only increases. Once allocated, you pay for it permanently. |
## Part 2: How RDS Actually Works — What AWS Manages (And What It Hides)
RDS is an EC2 instance running a database engine, wrapped in AWS automation. You don't see the instance, but it's there — a real machine in a real availability zone.
```text
┌──────────────────────────────────────┐
│           Your Application           │
│      (connects via endpoint DNS)     │
└──────────────┬───────────────────────┘
               │ port 5432 (PostgreSQL)
               ▼
┌──────────────────────────────────────┐
│          RDS Endpoint (DNS)          │
│     prod-api-db.abc123.us-east-1     │
│          .rds.amazonaws.com          │
└──────────────┬───────────────────────┘
               │
     ┌─────────┴─────────┐
     ▼                   ▼
┌──────────┐        ┌──────────┐
│ Primary  │  sync  │ Standby  │   ← Multi-AZ
│  AZ-1a   │ ─────▶ │  AZ-1b   │
│ (active) │        │(passive) │
└──────────┘        └──────────┘
     │
     │ async
     ▼
┌──────────┐
│   Read   │   ← Read Replica
│ Replica  │
│  AZ-1c   │
└──────────┘
```
| AWS manages | You manage |
|---|---|
| OS/engine patching | Query performance (indexes, plans) |
| Automated backups + snapshots | Schema migrations |
| Multi-AZ failover | Connection pooling / pool sizing |
| CloudWatch, Enhanced Monitoring | Parameter tuning |
| EBS storage, encryption | Application-level backups (logical dumps) |
| Hardware replacement | Replication slot cleanup |
Mental Model: Think of RDS like an apartment vs. a house. The landlord handles the roof, plumbing, and electrical. But you're still responsible for not overflowing the bathtub. RDS handles infrastructure; you handle the database inside.
## Part 3: Engine Choices — Six Databases, One Service
RDS supports six engines. The choice matters more than most people realize.
| Engine | Max Storage | Read Replicas | Best For |
|---|---|---|---|
| PostgreSQL | 64 TB | 15 | Feature-rich SQL, JSONB, extensions (PostGIS, pg_trgm) |
| MySQL | 64 TB | 15 | Web apps, WordPress, legacy LAMP stacks |
| MariaDB | 64 TB | 15 | MySQL-compatible with community governance |
| Oracle | 64 TB | 5 | Enterprise apps with Oracle-specific PL/SQL |
| SQL Server | 16 TB | 5 | .NET shops, SSRS, SSIS integration |
| Aurora | 128 TB | 15 | High-throughput PostgreSQL/MySQL with AWS-optimized storage |
Trivia: RDS launched in October 2009 with MySQL as its only engine. PostgreSQL support didn't arrive until November 2013 — four years later. Today PostgreSQL is the fastest-growing RDS engine and the default choice for new projects.
### Aurora vs. Standard RDS — The Decision Matrix
Aurora is not "RDS but better." It's a fundamentally different storage architecture.
| Factor | Standard RDS | Aurora |
|---|---|---|
| Storage | Single EBS volume (gp3/io2) | Distributed across 6 copies in 3 AZs |
| Replication lag | Seconds to minutes (async) | Typically <100ms (shared storage) |
| Failover time | ~60 seconds (DNS flip) | ~30 seconds (faster promotion) |
| Storage auto-scale | Manual or auto-scale (up only) | Automatic, 10 GB increments to 128 TB |
| Read replicas | Up to 15 (async, each with its own storage) | Up to 15, share storage with writer |
| Cost | ~$X/hr for compute + storage | ~20% more than equivalent RDS |
| Serverless | No | Yes (Aurora Serverless v2) |
| Global database | No (cross-region read replicas only) | Yes (<1 second cross-region replication) |
| Cloning | Snapshot + restore (slow) | Copy-on-write clone (seconds, near-zero storage) |
**Choose standard RDS when:**

- Budget is tight and workload is modest
- You need Oracle or SQL Server
- You want a simpler mental model

**Choose Aurora when:**

- You need fast failover and minimal replication lag
- Your storage needs are large or unpredictable
- You want instant database cloning for dev/test
- You're running a multi-region architecture
Under the Hood: Aurora's storage layer is a distributed, log-structured system shared across all instances (writer + readers). When you write to Aurora, data goes to 6 copies across 3 AZs; a write commits once 4 of the 6 copies acknowledge it, and reads need 3 of 6. That means Aurora can lose an entire AZ (two copies) without losing write availability, and an AZ plus one additional copy without losing read availability or data. Standard RDS writes to a single EBS volume and synchronously replicates to the standby — a fundamentally different durability model.
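The quorum rules above reduce to a two-line check. A minimal sketch of Aurora's 4-of-6 write / 3-of-6 read quorum:

```python
def aurora_availability(healthy_copies: int) -> dict:
    """Aurora keeps 6 storage copies across 3 AZs.
    Writes need a 4/6 quorum; reads need 3/6."""
    return {
        "can_write": healthy_copies >= 4,
        "can_read": healthy_copies >= 3,
    }

# Lose one AZ (2 copies) plus one more copy -> 3 healthy: reads survive, writes pause
print(aurora_availability(3))  # {'can_write': False, 'can_read': True}
# Lose one AZ only -> 4 healthy: writes still commit
print(aurora_availability(4))  # {'can_write': True, 'can_read': True}
```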
## Part 4: Multi-AZ Failover — The 60-Second DNS Flip
Multi-AZ is the single most important RDS setting for production databases. Here's what actually happens when the primary fails.
### The Failover Sequence

```text
T+0s      Primary instance in AZ-1a becomes unreachable
          (crash, AZ failure, hardware issue, or scheduled maintenance)

T+0-5s    RDS health check detects failure

T+5-30s   RDS promotes the synchronous standby in AZ-1b to primary
          - Standby applies any remaining transaction logs
          - Standby opens for connections

T+30-60s  RDS flips the DNS CNAME record
          - prod-api-db.abc123.us-east-1.rds.amazonaws.com
          - Now points to the new primary in AZ-1b

T+60s+    Applications reconnect using the same endpoint
          - Stale DNS cache = connection failures until TTL expires
```
Gotcha: The RDS endpoint DNS record has a TTL of 5 seconds, but many applications cache DNS longer than that. Java's default DNS caching is effectively infinite in some JVM configurations, and Python's `psycopg2` holds the resolved IP for the lifetime of the connection. If your app holds long-lived connections, it'll keep talking to the old (dead) primary IP after failover. You need connection retry logic or a connection pooler that handles reconnection.
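One way to survive the DNS flip is retry logic that re-resolves the endpoint on every attempt. A minimal sketch, with `connect` standing in for your driver's connect call (the fake driver below is purely illustrative):

```python
import socket
import time

def connect_with_retry(endpoint, connect, attempts=5, base_delay=0.5):
    """Retry a DB connection after failover: re-resolve DNS on each
    attempt (never reuse a cached IP) and back off exponentially.
    `connect(ip)` is a stand-in for your driver's connect call."""
    last_err = None
    for attempt in range(attempts):
        try:
            ip = socket.gethostbyname(endpoint)  # fresh lookup each time
            return connect(ip)
        except OSError as err:
            last_err = err
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise last_err

# Fake driver that refuses twice (old primary) then succeeds (new primary):
calls = {"n": 0}
def fake_connect(ip):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection refused")
    return f"connected via {ip}"

print(connect_with_retry("localhost", fake_connect, base_delay=0.01))
```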
### How to Convert Single-AZ to Multi-AZ

```bash
aws rds modify-db-instance \
  --db-instance-identifier prod-api-db \
  --multi-az \
  --apply-immediately
```
What happens behind the scenes:

1. RDS takes a snapshot of the primary
2. Restores it to a new standby instance in another AZ
3. Enables synchronous replication between them
This is a zero-downtime operation. The snapshot is taken from the primary, and sync replication starts after the standby catches up. No reboot required.
Interview Bridge: "How does RDS Multi-AZ failover work?" is a common AWS interview question. The answer: synchronous replication to a standby in another AZ, automatic DNS flip on failure (~60 seconds), zero data loss because the standby is always up-to-date. Contrast this with read replicas, which use asynchronous replication and are NOT automatic failover targets.
### Flashcard Check #2
| Question | Answer |
|---|---|
| What's the difference between Multi-AZ and a read replica? | Multi-AZ: synchronous replication, automatic failover, standby not readable. Read replica: asynchronous, manual promotion, readable for offloading queries. |
| How long does RDS Multi-AZ failover typically take? | ~60 seconds. The standby is promoted, then the DNS CNAME is flipped to point to the new primary. |
| Is converting from Single-AZ to Multi-AZ a downtime event? | No. RDS takes a snapshot, restores a standby, and enables sync replication — all while the primary keeps serving traffic. |
| How many copies of data does Aurora maintain? | 6 copies across 3 availability zones. It can lose an entire AZ without losing write availability, and an AZ plus one additional copy without losing read availability. |
## Part 5: Parameter Groups and the Reboot Trap

RDS uses parameter groups to manage database configuration. This is where the `postgresql.conf` or `my.cnf` equivalent lives.
```bash
# List parameter groups
aws rds describe-db-parameter-groups --output table

# View current values for a parameter group
aws rds describe-db-parameters \
  --db-parameter-group-name prod-postgres16 \
  --query "Parameters[?ParameterName=='shared_buffers' || \
           ParameterName=='max_connections' || \
           ParameterName=='work_mem']" \
  --output table

# Modify a parameter
aws rds modify-db-parameter-group \
  --db-parameter-group-name prod-postgres16 \
  --parameters "ParameterName=work_mem,ParameterValue=65536,ApplyMethod=immediate"
```
Parameters have two apply types:
| Apply Type | Takes Effect | Requires Reboot | Examples |
|---|---|---|---|
| `dynamic` | Immediately after modification | No | `work_mem`, `log_min_duration_statement` |
| `static` | After reboot | Yes | `shared_buffers`, `max_connections`, `wal_buffers` |
Gotcha: After modifying a static parameter, RDS shows the instance status as "pending-reboot." It will NOT apply the change until you manually reboot — or until the next maintenance window reboots it for you. Engineers often change `max_connections`, see that it still reads the old value, and assume the change failed. It didn't fail. It's waiting for a reboot.
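A deploy script can avoid this confusion by splitting a change set by apply type before calling the API. A hedged sketch (the `STATIC_PARAMS` set below is a partial, hand-maintained list for illustration, not an AWS API):

```python
# Static parameters only take effect after a reboot; dynamic ones apply
# immediately. Split a change set so the reboot-dependent half is explicit.
STATIC_PARAMS = {"shared_buffers", "max_connections", "wal_buffers"}  # partial list

def classify_changes(changes: dict) -> dict:
    applied_now = {k: v for k, v in changes.items() if k not in STATIC_PARAMS}
    needs_reboot = {k: v for k, v in changes.items() if k in STATIC_PARAMS}
    return {"applied_now": applied_now, "needs_reboot": needs_reboot}

plan = classify_changes({"work_mem": "65536", "max_connections": "500"})
print(plan["needs_reboot"])  # {'max_connections': '500'}
```

If `needs_reboot` is non-empty, the script can schedule (or at least announce) the reboot instead of letting the maintenance window spring it on you.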
### Option Groups

Option groups are for engine-specific features that need to be enabled at the instance level. Examples: Oracle Enterprise Manager, SQL Server native backup/restore, the MySQL memcached plugin. PostgreSQL uses extensions instead of option groups — you enable them with `CREATE EXTENSION`.
## Part 6: Backups — Your Safety Net (If You Test It)
RDS provides three backup mechanisms. Use all of them.
### 1. Automated Snapshots

```bash
# Check backup retention and window
aws rds describe-db-instances \
  --db-instance-identifier prod-api-db \
  --query "DBInstances[0].{Retention:BackupRetentionPeriod,Window:PreferredBackupWindow}" \
  --output table
```
- Taken daily during the backup window (configurable, default is an AWS-chosen 30-min slot)
- Retention: 1–35 days (default: 7 days)
- Transaction logs backed up every 5 minutes
- Deleted when the RDS instance is deleted (unless you take a final snapshot)
### 2. Manual Snapshots

```bash
# Create a manual snapshot (persists until you delete it)
aws rds create-db-snapshot \
  --db-instance-identifier prod-api-db \
  --db-snapshot-identifier prod-api-db-before-migration-20260323
```
- Persist until manually deleted — even after the RDS instance is gone
- Take one before every migration, engine upgrade, or major config change
- Can be shared with other AWS accounts or copied to other regions
### 3. Point-in-Time Recovery (PITR)

```bash
# Restore to a specific timestamp (creates a NEW instance)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier prod-api-db \
  --target-db-instance-identifier prod-api-db-pitr-recovery \
  --restore-time "2026-03-23T02:45:00Z" \
  --db-instance-class db.r6g.xlarge \
  --multi-az
```
PITR restores the most recent automated snapshot taken before your target time, then replays transaction logs up to that time. You can target any second inside the retention window, but because transaction logs are shipped every 5 minutes, the latest restorable time usually trails the present by up to 5 minutes.
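The restore-base selection can be sketched as: pick the newest snapshot at or before the target, and the replay window is whatever remains. A minimal illustration (the snapshot times are hypothetical):

```python
from datetime import datetime, timedelta

def pitr_plan(snapshot_times, target):
    """PITR = restore the newest automated snapshot taken at or before
    the target time, then replay transaction logs from there up to it."""
    eligible = [t for t in snapshot_times if t <= target]
    if not eligible:
        raise ValueError("target precedes the oldest retained snapshot")
    base = max(eligible)
    return {"base_snapshot": base, "replay_window": target - base}

# Nightly snapshots at 03:00, 7-day retention (hypothetical schedule):
snaps = [datetime(2026, 3, d, 3, 0) for d in range(17, 24)]
plan = pitr_plan(snaps, datetime(2026, 3, 23, 14, 21))
print(plan["replay_window"])  # 11:21:00
```

The longer the replay window, the longer the restore takes, which is one reason frequent snapshots make recovery faster.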
War Story: A team at a mid-size SaaS company ran a data migration that accidentally deleted 3 months of customer records. They had RDS automated backups enabled, so they used PITR to restore to 2 minutes before the migration ran. The catch: PITR creates a new instance with a new endpoint. They had to update application configs, DNS records, and security groups to point to the restored instance. The recovery took 40 minutes of database restore time plus 2 hours of application reconfiguration. The lesson: practice the full PITR workflow before you need it — not just the restore command but the entire cutover process.
## Part 7: RDS Proxy — Connection Pooling Without PgBouncer
If you run Lambda functions or microservices that connect to RDS, you have a connection problem. Each Lambda invocation opens a new database connection. At scale, that's thousands of connections slamming a database that performs best with 20–50 actual connections.
```text
Without RDS Proxy:                 With RDS Proxy:

┌──────┐ ┌──────┐ ┌──────┐         ┌──────┐ ┌──────┐ ┌──────┐
│Lambda│ │Lambda│ │Lambda│         │Lambda│ │Lambda│ │Lambda│
│  #1  │ │  #2  │ │  #3  │         │  #1  │ │  #2  │ │  #3  │
└──┬───┘ └──┬───┘ └──┬───┘         └──┬───┘ └──┬───┘ └──┬───┘
   │        │        │                │        │        │
   ▼        ▼        ▼                └────────┼────────┘
┌──────────────────────┐                       ▼
│  RDS (overwhelmed)   │              ┌─────────────────┐
│  500+ connections    │              │    RDS Proxy    │
│  context-switching   │              │  pools to ~30   │
│  memory exhaustion   │              │   connections   │
└──────────────────────┘              └────────┬────────┘
                                               ▼
                                      ┌──────────────────┐
                                      │    RDS (happy)   │
                                      │  30 connections  │
                                      └──────────────────┘
```
### RDS Proxy vs. PgBouncer
| Feature | RDS Proxy | PgBouncer |
|---|---|---|
| Managed by | AWS | You |
| Setup | AWS API/Console | Install, configure pgbouncer.ini |
| Pool mode | Transaction-level (pin-based) | Session, transaction, or statement |
| IAM auth | Yes | No |
| Failover-aware | Yes (auto-routes to new primary) | No (needs DNS update or HAProxy) |
| Cost | Per vCPU of proxy | Free (open source) |
| Latency | ~1ms added | ~0.5ms added |
| Best for | Lambda, serverless, IAM-centric shops | Self-hosted, k8s, maximum control |
Gotcha: RDS Proxy "pins" a connection when you use features that require session-level state — prepared statements, advisory locks, SET commands, temporary tables. A pinned connection can't be reused by other clients, which defeats the purpose of pooling. Check your app's connection behavior before assuming RDS Proxy will solve connection exhaustion.
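A toy model shows why pinning hurts: every pinned connection is dedicated to one client, shrinking the shared pool everyone else multiplexes over (the numbers below are illustrative, not RDS Proxy internals):

```python
def pool_capacity(pool_size, pinned, clients):
    """Toy model of proxy multiplexing: pinned connections serve exactly
    one client each; the rest are shared across all remaining clients."""
    shared = pool_size - pinned
    if shared <= 0:
        return {"shared_connections": 0, "clients_per_connection": float("inf")}
    return {
        "shared_connections": shared,
        "clients_per_connection": round((clients - pinned) / shared, 1),
    }

# 30-connection pool, no pinning: 1000 Lambdas multiplex over 30 connections
print(pool_capacity(30, 0, 1000))   # {'shared_connections': 30, 'clients_per_connection': 33.3}
# Heavy SET / prepared-statement use pins 25 of them: pooling mostly defeated
print(pool_capacity(30, 25, 1000))  # {'shared_connections': 5, 'clients_per_connection': 195.0}
```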
## Part 8: Monitoring — Performance Insights and Enhanced Monitoring
CloudWatch gives you the basics (CPU, free storage, connections). Performance Insights tells you why the database is slow.
### Essential CloudWatch Alarms

| Metric | Alarm Threshold | Why |
|---|---|---|
| `FreeStorageSpace` | < 10% of allocated | Storage full = writes fail |
| `CPUUtilization` | > 80% sustained | Performance degradation |
| `DatabaseConnections` | > 80% of `max_connections` | Connection exhaustion |
| `ReplicaLag` | > 30 seconds | Stale reads, failover risk |
| `FreeableMemory` | < 10% of instance RAM | Buffer pool pressure, OOM risk |
| `SwapUsage` | > 0 consistently | Instance class too small |
### Performance Insights
Performance Insights is a visual query profiler built into the RDS console. It shows you what queries are running, how much time they spend waiting, and what they're waiting for.
The key metric is DB Load — the number of active sessions averaged over time. If DB Load exceeds the number of vCPUs, the database is saturated.
```text
DB Load = 12 (instance has 4 vCPUs)
├── CPU: 4         ← fully saturated
├── I/O wait: 5    ← queries waiting for disk
├── Lock wait: 2   ← queries waiting for row locks
└── Other: 1
```
Reading this: "12 sessions want to do work, but only 4 can run at a time. 5 are stuck waiting for disk. 2 are stuck waiting for locks." Fix the I/O first (missing indexes causing full table scans?), then investigate the locks.
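The arithmetic above is just an average of active-session samples compared to the vCPU count. A minimal sketch (Performance Insights does this sampling for you; the numbers here are illustrative):

```python
def db_load(samples, vcpus):
    """DB Load = average number of active sessions across samples.
    Load above the vCPU count means sessions are queueing for
    CPU, I/O, or locks rather than running."""
    load = sum(samples) / len(samples)
    return {"db_load": load, "saturated": load > vcpus}

# Active-session counts sampled once per second over 10 seconds:
print(db_load([12, 11, 13, 12, 12, 12, 11, 13, 12, 12], vcpus=4))
# {'db_load': 12.0, 'saturated': True}
```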
```bash
# Enable Performance Insights
# (731 days = 2-year retention, a paid tier; the free tier retains 7 days)
aws rds modify-db-instance \
  --db-instance-identifier prod-api-db \
  --enable-performance-insights \
  --performance-insights-retention-period 731 \
  --apply-immediately
```
### Enhanced Monitoring
Enhanced Monitoring adds OS-level metrics (process list, memory breakdown, file system usage) at up to 1-second granularity.
The monitoring stack: CloudWatch tells you CPU is high. Enhanced Monitoring tells you which processes are consuming CPU. Performance Insights tells you which queries are causing it.
### Flashcard Check #3
| Question | Answer |
|---|---|
| What is DB Load in Performance Insights? | The average number of active sessions. If DB Load > vCPUs, the database is saturated and sessions are waiting. |
| What's the difference between CloudWatch metrics and Enhanced Monitoring? | CloudWatch: hypervisor-level, 60-second minimum granularity. Enhanced Monitoring: OS-level (process list, memory breakdown), down to 1-second granularity. |
| Why is RDS Proxy especially useful for Lambda functions? | Lambda invocations each open a new database connection. Without pooling, concurrent invocations overwhelm the database. RDS Proxy multiplexes thousands of Lambda connections into a small pool. |
## Part 9: Encryption and IAM Database Authentication

### Encryption at Rest

RDS encrypts storage using AWS KMS. Encryption covers EBS storage, automated backups, snapshots, and read replicas. Enable `--storage-encrypted` at instance creation.
Gotcha: You cannot encrypt an existing unencrypted RDS instance in place. The procedure: create a snapshot, copy it with encryption enabled, restore a new instance from the encrypted copy, then migrate your application to the new endpoint. This means a DNS change and a maintenance window.
### Encryption in Transit

```bash
# Force SSL connections (PostgreSQL parameter group)
aws rds modify-db-parameter-group \
  --db-parameter-group-name prod-postgres16 \
  --parameters "ParameterName=rds.force_ssl,ParameterValue=1,ApplyMethod=immediate"
```
### IAM Database Authentication
Instead of a static username/password stored in a config file, IAM auth generates short-lived (15-minute) authentication tokens.
```bash
# Generate an IAM auth token
TOKEN=$(aws rds generate-db-auth-token \
  --hostname prod-api-db.abc123.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username iam_app_user \
  --region us-east-1)

# Connect with the token as the password (quote it — tokens contain special characters)
PGPASSWORD="$TOKEN" psql \
  "host=prod-api-db.abc123.us-east-1.rds.amazonaws.com \
   port=5432 user=iam_app_user dbname=appdb sslmode=require"
```
Why use IAM auth? No long-lived credentials to rotate. The token is valid for 15 minutes and is tied to the IAM identity of the caller. Combine with IAM roles on EC2 or Lambda for credential-free database access.
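Because tokens expire after 15 minutes, long-running processes need to regenerate them before reconnecting. A small helper sketch for deciding when to refresh (the one-minute safety margin is an assumption, not an AWS requirement):

```python
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(minutes=15)  # IAM DB auth tokens live 15 minutes

def token_is_fresh(issued_at, now=None, safety_margin=timedelta(minutes=1)):
    """Return True while the token is safely usable. Refresh a little
    early so a token doesn't expire mid-connection-attempt."""
    now = now or datetime.now(timezone.utc)
    return now - issued_at < TOKEN_LIFETIME - safety_margin

issued = datetime(2026, 3, 23, 3, 0, tzinfo=timezone.utc)
print(token_is_fresh(issued, now=issued + timedelta(minutes=10)))              # True
print(token_is_fresh(issued, now=issued + timedelta(minutes=14, seconds=30)))  # False
```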
## Part 10: Maintenance Windows — When AWS Reboots Your Database
This is where the "managed" in managed database bites.
RDS has a preferred maintenance window — a weekly time slot when AWS can apply patches, minor version upgrades, and OS updates. Some of these require a reboot.
```bash
# Check your maintenance window
aws rds describe-db-instances \
  --db-instance-identifier prod-api-db \
  --query "DBInstances[0].PreferredMaintenanceWindow"
# "sun:06:00-sun:06:30"
```
War Story: A production database with Multi-AZ enabled had its maintenance window set to the default (an AWS-chosen slot that happened to be Tuesday at 4 AM UTC — peak hours for the company's Australian customers). AWS applied a minor engine patch that required a reboot. Multi-AZ meant the standby rebooted first, then a failover occurred, then the old primary rebooted. Total disruption: two failovers and ~2 minutes of connection drops. The application's connection pool didn't handle the failover gracefully, so users saw errors for 8 minutes. The fix: set the maintenance window to a genuine low-traffic period, enable connection retry logic, and test failover before AWS does it for you.
### Pending Maintenance Actions

```bash
# Check what maintenance is pending
aws rds describe-pending-maintenance-actions \
  --output table

# Apply maintenance immediately (on YOUR terms)
aws rds apply-pending-maintenance-action \
  --resource-identifier arn:aws:rds:us-east-1:123456789012:db:prod-api-db \
  --apply-action system-update \
  --opt-in-type immediate
```
You can defer maintenance, but not forever. AWS will eventually force critical security patches. Better to apply them on your schedule, during your maintenance window, with your team watching — than to have AWS surprise you.
## Part 11: Aurora Deep Dive — When Standard RDS Isn't Enough

### Aurora Serverless v2
Scales compute automatically based on load. No capacity planning, no right-sizing.
Set --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=32 and use
db.serverless as the instance class. MinCapacity=0.5 means half an ACU (Aurora Capacity
Unit, ~2 GB RAM). At idle, you pay almost nothing. Under load, it scales to 32 ACUs
(~64 GB RAM). Scaling takes seconds, not minutes.
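The capacity math is simple: one ACU corresponds to roughly 2 GiB of RAM, with CPU and network scaling proportionally. A trivial helper to sanity-check a scaling range (the 2 GiB/ACU figure is AWS's published approximation):

```python
def acu_to_resources(acus):
    """Aurora Serverless v2 capacity: one ACU is roughly 2 GiB of RAM
    (CPU and network scale in proportion)."""
    return {"acus": acus, "ram_gib": acus * 2}

print(acu_to_resources(0.5))  # {'acus': 0.5, 'ram_gib': 1.0}  -- the minimum
print(acu_to_resources(32))   # {'acus': 32, 'ram_gib': 64}    -- the example ceiling
```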
### Aurora Global Database
Cross-region replication with typically less than 1 second of lag. Use cases: disaster recovery (promote a secondary region to primary) and serving reads closer to users in other geographies.
### Aurora Cloning
Cloning uses copy-on-write (use --restore-type copy-on-write with
restore-db-cluster-to-point-in-time). A 500 GB production database clones in seconds
and consumes near-zero additional storage until data diverges. This is a game-changer
for creating staging environments.
Trivia: Aurora's copy-on-write cloning is similar to how Linux's `fork()` system call works — the child process shares memory pages with the parent until one of them writes, triggering a copy of just that page. The same principle, applied at the storage layer.
## Part 12: Common Operational Nightmares (And How to Survive Them)

### Nightmare 1: Storage Full at 3 AM (Our Mission)

Covered in Part 1. Checklist: dead tuple bloat, orphaned replication slots, VACUUM, drop unused slots, enable storage auto-scaling.
### Nightmare 2: Maintenance Window Surprise

Symptoms: application errors during a time nobody was deploying. Check with `aws rds describe-events --source-identifier prod-api-db --source-type db-instance`. Prevention: set your window intentionally, subscribe to event notifications, handle reconnection gracefully.
### Nightmare 3: Parameter Group Requires Reboot

Changed `max_connections` but the old value still shows? Check `aws rds describe-db-instances --query "DBInstances[0].PendingModifiedValues"`. Static parameters require a reboot. With Multi-AZ, rebooting with `--force-failover` fails over to the standby first, cutting the disruption to roughly the 60-second failover instead of a full restart.
### Nightmare 4: Read Replica Falling Behind
Monitor ReplicaLag in CloudWatch. Common causes: replica on a smaller instance class than
the primary, long-running queries on the replica blocking replay, or heavy write load
exceeding the replica's apply rate. Fix: match the replica instance class to the primary,
kill long queries on the replica, or switch to Aurora (shared storage eliminates replay lag).
### Flashcard Check #4
| Question | Answer |
|---|---|
| Can you encrypt an existing unencrypted RDS instance in place? | No. You must snapshot, copy the snapshot with encryption, then restore from the encrypted snapshot. This means a new endpoint and a cutover. |
| What happens during an RDS maintenance window with Multi-AZ? | Standby is patched first, then a failover occurs, then the old primary is patched. This means ~two brief connection disruptions, not one. |
| How does Aurora cloning work? | Copy-on-write: the clone shares storage pages with the original. Only pages that change are copied, so cloning is nearly instant and uses minimal extra storage. |
| What does IAM database authentication provide? | 15-minute auth tokens instead of static passwords. Tokens are generated via the AWS API using IAM credentials, eliminating long-lived database passwords. |
## Exercises

### Exercise 1: Decode the Alarm (2 minutes)

You receive this CloudWatch alarm:

```text
Metric: AWS/RDS DatabaseConnections
Threshold: > 160
Current: 187
Instance: prod-api-db (db.r6g.large, max_connections=200)
```

What's happening? What's your immediate action and your long-term fix?

**Answer**

The database is at 93% of its connection limit. At 200, new connections will be refused. **Immediate:** terminate idle connections (query `pg_stat_activity` for long-idle sessions and kill them with `pg_terminate_backend`). **Long-term:** deploy RDS Proxy or PgBouncer, reduce `max_connections` on the database (counterintuitive but correct — let the pooler handle multiplexing), and investigate which service is leaking connections.

### Exercise 2: PITR Decision (5 minutes)
A developer ran `DELETE FROM orders WHERE status = 'pending'` at 14:22 UTC instead of `DELETE FROM orders WHERE status = 'pending' AND created_at < '2025-01-01'`. They deleted 45,000 current orders instead of 1,200 old ones.
Your RDS automated backups have 7-day retention. Describe the recovery steps.
**Answer**

1. Note the exact time: the recovery target is `2026-03-23T14:21:00Z` (one minute before the DELETE)
2. Restore PITR to a new instance (`restore-db-instance-to-point-in-time` with that `--restore-time`)
3. From the recovered instance, export just the `orders` table
4. Import the missing rows into production (merge carefully — 22 minutes of new data exist)
5. Don't swap the entire database — you'd lose 22 minutes of everything else

Key insight: PITR operates at the instance level even when you only need one table. You usually restore to a separate instance and then surgically extract the data you need.

### Exercise 3: Aurora vs. RDS Decision (10 minutes)
Requirements: PostgreSQL, 50 GB growing to 500 GB, 80/20 read/write ratio, < 30 seconds failover, developers want DB clones for testing, budget-conscious but reliability-first.
Make the case for both. Which do you choose and why?
**Answer**

**Standard RDS:** cheaper at small scale, simpler mental model, Multi-AZ gives ~60-second failover.

**Aurora:** auto-scaling storage (50 GB to 500 GB seamlessly), ~30-second failover (meeting the requirement), copy-on-write cloning (developers get full clones in seconds), up to 15 read replicas.

**Decision:** Aurora. The < 30-second failover requirement eliminates standard RDS, cloning addresses the developer need, and storage auto-scaling handles the growth. The ~20% premium is justified.

## Cheat Sheet
### RDS CLI Quick Reference

| Task | Command |
|---|---|
| List instances | `aws rds describe-db-instances --query "DBInstances[].DBInstanceIdentifier"` |
| Check storage | `aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name FreeStorageSpace ...` |
| Take snapshot | `aws rds create-db-snapshot --db-instance-identifier X --db-snapshot-identifier Y` |
| Enable Multi-AZ | `aws rds modify-db-instance --db-instance-identifier X --multi-az` |
| Restore PITR | `aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier X --target-db-instance-identifier Y --restore-time Z` |
| Check events | `aws rds describe-events --source-identifier X --source-type db-instance` |
| Pending maintenance | `aws rds describe-pending-maintenance-actions` |
| Reboot (with failover) | `aws rds reboot-db-instance --db-instance-identifier X --force-failover` |
### RDS vs. Aurora Decision Quick-Test

```text
Need Oracle/SQL Server?          → Standard RDS (Aurora doesn't support them)
Need < 30s failover?             → Aurora
Storage > 64 TB?                 → Aurora (128 TB ceiling)
Need instant DB cloning?         → Aurora
Budget-constrained, small DB?    → Standard RDS
Need serverless auto-scaling?    → Aurora Serverless v2
Just need "managed PostgreSQL"?  → Start with standard RDS, migrate to Aurora if needed
```
## Takeaways

- **RDS trades SSH access for operational convenience.** You can't `df -h`, but you get automated backups, patching, and failover. Know what you're giving up and what you're getting.
- **Storage auto-scaling only goes up.** Enable it with a ceiling, but also fix the root cause of storage growth (dead tuples, orphaned replication slots, uncontrolled logging).
- **Multi-AZ is not optional for production.** The cost is ~2x compute, but the alternative is hours of downtime during AZ failures or maintenance events.
- **PITR creates a new instance, not a rollback.** Practice the full recovery workflow — restore, extract, merge — before you need it at 3 AM.
- **Performance Insights > CloudWatch for database debugging.** CloudWatch tells you the database is slow. Performance Insights tells you which query, which wait event, and why.
- **Set your maintenance window deliberately.** The default is AWS-chosen, and AWS will reboot your database during it. Make sure it's your actual low-traffic period.
## Related Lessons
- The Database That Wouldn't Start — PostgreSQL/MySQL startup failures and recovery
- AWS EC2 — The Virtual Server You Never See — The compute layer under RDS
- AWS IAM — The Permissions Puzzle — IAM roles for RDS access and Proxy auth
- AWS VPC — The Network You Can't See — Security groups and subnet placement for RDS
- Redis in Production — ElastiCache for caching in front of RDS
- The Backup Nobody Tested — Why untested backups are not backups