# Interview Gauntlet: Disk Usage on Prod Database
**Category:** Incident Response · **Difficulty:** L2-L3 · **Duration:** 15-20 minutes · **Domains:** Database, Replication
## Round 1: The Opening
Interviewer: "You get a disk usage alert at 3 AM — the production PostgreSQL database server is at 92% disk utilization. It was at 60% yesterday. What do you do?"
### Strong Answer
"First, I'd check the growth rate to understand urgency. If it jumped from 60% to 92% in 24 hours, I have hours, not days. I'd SSH in and run df -h to confirm which filesystem, then du -sh /var/lib/postgresql/*/ to see whether it's the data directory or something else. The most common culprits for sudden disk growth on a PostgreSQL server are: WAL (Write-Ahead Log) files accumulating in pg_wal/, bloated tables from a vacuum not running, temporary files from large queries, or log files if verbose logging is enabled. I'd check du -sh /var/lib/postgresql/14/main/pg_wal/ specifically — WAL accumulation is the most common cause of sudden growth because it means either archiving is failing or replication is lagging. Quick mitigation while investigating: if there are old log files or temp files, I can safely remove those. But I would not touch WAL files without understanding why they're accumulating — deleting WAL files can break replication and point-in-time recovery."
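The urgency check in the answer above is simple extrapolation, and it is worth being able to do it on paper. A minimal sketch, assuming an illustrative 500 GB volume and linear growth (the function name and all figures are assumptions for the example, not from the scenario):

```python
# Back-of-the-envelope triage: hours until the disk fills at the current
# growth rate. Disk size and utilization figures are illustrative.

def hours_until_full(disk_gb: float, used_pct_yesterday: float,
                     used_pct_now: float, hours_elapsed: float = 24.0) -> float:
    """Linearly extrapolate disk growth to 100% utilization."""
    growth_gb = disk_gb * (used_pct_now - used_pct_yesterday) / 100.0
    rate_gb_per_hour = growth_gb / hours_elapsed
    free_gb = disk_gb * (100.0 - used_pct_now) / 100.0
    return free_gb / rate_gb_per_hour

# 60% -> 92% on a 500 GB volume in 24 hours: 160 GB written in a day
# (~6.7 GB/h) with 40 GB free, so roughly 6 hours of runway.
print(round(hours_until_full(500, 60, 92), 1))  # -> 6.0
```

Linear extrapolation is the floor, not the ceiling: if the growth is WAL from a failing archiver, the rate can accelerate with write load, so the real deadline may be sooner.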
### Common Weak Answers
- "I'd increase the disk size." — This buys time but doesn't address the root cause. If WAL files are growing at 30% per day, a bigger disk just delays the same alert.
- "I'd delete old data from the database." — Extremely risky at 3 AM without understanding what data is expendable. Also, in PostgreSQL, deleting rows doesn't immediately free disk space: plain VACUUM only marks the space as reusable, and returning it to the OS requires VACUUM FULL, which takes an exclusive lock.
- "I'd restart PostgreSQL." — Doesn't free disk space and risks making things worse if the database takes a long time to recover.
## Round 2: The Probe
Interviewer: "You check and it's not the data tables — the pg_wal/ directory has grown from 2 GB to 80 GB. What's happening and how do you fix it?"
What the interviewer is testing: Understanding of PostgreSQL WAL mechanics — specifically why WAL files accumulate and the implications of various interventions.
### Strong Answer
"WAL files accumulate when they can't be recycled. PostgreSQL recycles WAL files after they've been both checkpointed and either archived (if archiving is enabled) or consumed by all replication slots. I'd check three things. First, SELECT * FROM pg_stat_archiver; — if last_failed_time is recent and failed_count is climbing, WAL archiving is failing. Maybe the archive destination (S3, a network share) is unreachable or full. Second, SELECT * FROM pg_replication_slots; — if there's a replication slot with a restart_lsn that's far behind pg_current_wal_lsn(), that slot is preventing WAL cleanup. This is the most common cause in my experience. Third, SHOW wal_keep_size; — if this is set very high, PostgreSQL is retaining WAL files even if nothing else requires them. For immediate remediation: if it's a stale replication slot from a decommissioned replica, I'd drop it with SELECT pg_drop_replication_slot('slot_name');. The WAL files will be cleaned up at the next checkpoint. If the slot is for an active replica that's lagging, I need to understand why the replica is behind before dropping it."
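"Far behind" in the slot check above is measurable in bytes. A pg_lsn value like `16/B374D848` is two hex halves of a 64-bit WAL byte position, so slot lag can be computed client-side; this sketch mirrors what `pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)` would return on the server (the LSN values below are illustrative):

```python
# pg_lsn text form is 'hi/lo': two 32-bit hex halves of a 64-bit byte offset.

def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn ('hi/lo' in hex) to a 64-bit byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL pinned between a slot's restart_lsn and the current LSN."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)

# A slot pinned 80 GiB behind the current WAL position (values illustrative):
lag = slot_lag_bytes("16/B374D848", "2/B374D848")
print(lag // 2**30)  # -> 80 (GiB)
```

This is why a ~2 GB pg_wal growing to 80 GB points almost immediately at a slot: every byte between restart_lsn and the current LSN must stay on disk until the slot advances or is dropped.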
### Trap Alert

If the candidate bluffs here: The interviewer will ask "What's the difference between wal_keep_size and min_wal_size?" wal_keep_size (formerly wal_keep_segments) is the minimum amount of WAL retained for streaming replication regardless of checkpoint status. min_wal_size is the threshold below which old WAL files are recycled at checkpoints rather than removed, so the pg_wal directory stays at least that large. Mixing these up suggests textbook knowledge without operational experience. It's fine to say "I always have to look up which parameter does what — let me check the docs."
## Round 3: The Constraint
Interviewer: "It's a replication slot. The replica is 70 GB behind the primary. You check the replica and it's running, but replication throughput is only 5 MB/s when it needs to be 50 MB/s to catch up before the primary runs out of disk. What's bottlenecking the replica?"
### Strong Answer
"Replication throughput is constrained by either network, disk I/O on the replica, or the replay (apply) speed. I'd check each. Network: iperf3 between primary and replica — if the network link is only 50 Mbps, that's our ceiling. And 5 MB/s is 40 Mbps, which is suspiciously close to a throttled link. I'd check for QoS policies, a saturated network interface, or a firewall doing deep packet inspection. Disk I/O on the replica: iostat -x 1 on the replica to check if the disk is at 100% utilization. If the replica is using spinning disks or an EBS volume at its IOPS/throughput limit, the WAL replay speed is capped by write throughput. I'd check aws ec2 describe-volumes if it's EBS — gp2 volumes have a throughput ceiling that depends on volume size. A 100 GB gp2 volume maxes out at ~128 MB/s burst, which should be fine, but the baseline for a small volume is much lower. Replay speed: SELECT * FROM pg_stat_replication; on the primary shows write_lsn vs flush_lsn vs replay_lsn per standby (on the replica itself, compare pg_last_wal_receive_lsn() with pg_last_wal_replay_lsn()). If write and flush are keeping up but replay is behind, the bottleneck is applying the WAL — which can happen if the WAL contains heavy index updates that are slow to replay."
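The unit conversion and the catch-up feasibility check in the answer above are both one-liners worth scripting; the MB/s-versus-Mbps confusion is an easy 8x trap at 3 AM. A minimal sketch (function names and the WAL generation rate below are illustrative assumptions):

```python
# Sanity-check replication throughput numbers: unit conversion plus a
# catch-up estimate. The replica only closes the gap at the rate by which
# its receive throughput exceeds the primary's WAL generation rate.

def mbytes_to_mbits(mb_per_s: float) -> float:
    """Convert megabytes/sec to megabits/sec (the classic 8x trap)."""
    return mb_per_s * 8

def catchup_hours(lag_gb: float, replica_mb_per_s: float,
                  primary_wal_mb_per_s: float) -> float:
    """Hours for the replica to catch up, or infinity if it never will."""
    net = replica_mb_per_s - primary_wal_mb_per_s
    if net <= 0:
        return float("inf")
    return (lag_gb * 1024) / net / 3600

print(mbytes_to_mbits(5))                    # -> 40 (5 MB/s is a 40 Mbps link)
print(catchup_hours(70, 5, 10))              # -> inf (falling further behind)
print(round(catchup_hours(70, 50, 10), 2))   # -> 0.5 (70 GB at a net 40 MB/s)
```

The non-obvious term is the WAL generation rate: a replica receiving at exactly the primary's write rate never catches up, which is why "throughput is positive" is not the same as "lag is shrinking".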
### The Senior Signal

What separates a senior answer: Distinguishing between the stages of replication (receive, write/flush, replay) and knowing how to check each independently via pg_stat_replication on the primary or the pg_last_wal_*_lsn() functions on the replica. Also: recognizing that EBS volumes have throughput limits that scale with volume size — this is a common "invisible" bottleneck that only becomes apparent under sustained write loads.
## Round 4: The Curveball
Interviewer: "You discover the network is saturated — but not by replication. A backup job runs at 2 AM every night and consumes 80% of the replica's network bandwidth. The backup was configured by the DBA who left the company 6 months ago. There's no documentation. Do you kill the backup?"
### Strong Answer
"I would not kill the backup without understanding what depends on it. But I need to act fast because the primary is filling up. My approach: first, check if the backup can be throttled rather than killed. Tools like pg_basebackup support --max-rate for bandwidth limiting, and rsync can use --bwlimit. If the backup command is running via a cron job, I can see exactly what it's doing from ps aux | grep -i backup or the crontab. If I can throttle it to 20% of bandwidth and give replication the other 80%, both can proceed. If I can't throttle it and the primary is going to run out of disk in the next 2 hours, then yes, I'd kill the backup — but only after confirming that yesterday's backup completed successfully, so we still have a recent backup. I'd check the backup destination for the most recent successful backup and its timestamp. Then I'd document everything I did and why, file a ticket to reschedule the backup window to avoid conflicting with peak replication demand, and — critically — write a runbook for this scenario so the next on-call engineer doesn't have to rediscover this at 3 AM."
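The throttle-versus-kill decision above reduces to arithmetic: after capping the backup, does replication get enough bandwidth to clear the lag before the primary fills? A sketch under illustrative assumptions (the link capacity, throttle cap, and WAL generation rate are all made up for the example):

```python
# Throttle vs. kill: with the backup rate-limited (e.g. pg_basebackup
# --max-rate or rsync --bwlimit), does replication still catch up in time?
# All figures below are illustrative assumptions.

def replication_rate_after_throttle(link_mb_per_s: float,
                                    backup_cap_mb_per_s: float) -> float:
    """Bandwidth left for replication once the backup is rate-limited."""
    return max(link_mb_per_s - backup_cap_mb_per_s, 0.0)

def primary_survives(lag_gb: float, fill_hours: float,
                     replica_mb_per_s: float, wal_gen_mb_per_s: float) -> bool:
    """True if the replica clears the lag (releasing the slot's WAL)
    before the primary's disk-full deadline."""
    net = replica_mb_per_s - wal_gen_mb_per_s
    if net <= 0:
        return False
    return (lag_gb * 1024) / net / 3600 <= fill_hours

# A 60 MB/s link with the backup throttled to 12 MB/s leaves 48 MB/s:
rate = replication_rate_after_throttle(60, 12)
print(rate)                                 # -> 48.0
print(primary_survives(70, 2.0, rate, 10))  # -> True: throttling is enough
```

If the check comes back False, that is the evidence for escalating to killing the backup, and the numbers go straight into the incident documentation.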
### Trap Question Variant
The right answer involves admitting uncertainty. "I don't know what downstream systems depend on this backup — it could be feeding a data warehouse, a compliance archive, or a disaster recovery process. Killing it might fix the immediate problem but cause a different failure tomorrow morning when someone discovers the backup didn't complete." Candidates who immediately say "just kill it" are showing bias toward action over safety. Candidates who refuse to kill it under any circumstances are showing bias toward inaction. The senior answer weighs the risks and takes a bounded action (throttle first, kill only if needed, document everything).
## Round 5: The Synthesis
Interviewer: "This whole incident chain — WAL growth, replication lag, network contention with backups — was preventable. What monitoring and operational practices would you put in place?"
### Strong Answer
"Multiple layers. First, monitoring: alert on WAL directory size (not just total disk), replication lag in both bytes and time, and replication slot age. The replication slot alert is the critical one — a slot that's more than 10 GB behind or 1 hour behind should page. Second, capacity planning: the backup job should have a dedicated network path or at least be scheduled to not conflict with peak replication demand. If the replica is the backup source (which is a good practice), it needs enough network bandwidth for both replication and backup simultaneously — this is a sizing requirement that should be documented. Third, runbooks: every critical cron job needs a documented owner, a description of what it does, where the output goes, and what happens if it fails or is killed. The fact that the DBA left and took all the knowledge with them is an organizational failure. Fourth, guardrails: max_slot_wal_keep_size in PostgreSQL 13+ lets you set a maximum WAL size per slot. If the replica falls too far behind, the slot is invalidated and WAL files are cleaned up, which protects the primary at the cost of requiring a full replica rebuild. Whether to enable this depends on whether you'd rather have the primary run out of disk (total outage) or lose the replica (reduced redundancy). I'd choose to protect the primary."
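The guardrail settings mentioned above live in postgresql.conf. A minimal sketch with illustrative values (the actual limits must be sized against your disk budget and expected replica downtime; the parameter names are real, the numbers are assumptions):

```ini
# postgresql.conf -- illustrative guardrail values, not recommendations.

# PostgreSQL 13+: a replication slot that retains more than this much WAL
# is invalidated and its WAL released. Protects the primary's disk at the
# cost of forcing a rebuild of the fallen-behind replica.
max_slot_wal_keep_size = 50GB

# Minimum WAL kept for streaming replication even without a slot.
wal_keep_size = 1GB
```

The trade-off is exactly the one described above: with these limits set, a badly lagging replica loses its slot instead of taking the primary down with it.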
### What This Sequence Tested
| Round | Skill Tested |
|---|---|
| 1 | Structured triage for disk pressure on a database server |
| 2 | PostgreSQL WAL mechanics and replication slot management |
| 3 | Replication performance debugging across network, disk, and replay layers |
| 4 | Judgment under pressure when operational knowledge is missing |
| 5 | Preventive monitoring design and operational maturity |