Pattern: Missing Point-in-Time Recovery¶
ID: FP-027 Family: Silent Corruption Frequency: Common Blast Radius: Single Service Detection Difficulty: Subtle
The Shape¶
An application has daily full backups but no continuous WAL/binlog archiving that would allow recovery to an arbitrary point in time. When a logical error occurs (accidental DELETE, bad migration, corruption introduced gradually), the only recovery options are "restore from yesterday's backup (losing a day of data)" or "no recovery." PITR fills this gap: with WAL archiving, recovery is possible to any point in time, limited only by WAL retention period.
How You'll See It¶
In Linux/Infrastructure¶
Bad migration runs at 3pm: DELETE FROM orders WHERE status='pending' without a WHERE
clause selects all orders. Realized at 3:15pm. Without PITR: options are "restore from
midnight backup (lose 15 hours)" or "manually reconstruct from application logs." With
PITR: restore to 2:59pm (1 minute before migration).
In Kubernetes¶
A Kubernetes operator accidentally deletes all ConfigMaps in a namespace (kubectl delete cm --all).
Without PITR (etcd snapshots don't give fine-grained recovery): restore the most recent
etcd snapshot (potentially hours old); lose all ConfigMap changes since the snapshot.
With frequent etcd snapshots + velero: restore to within minutes.
In CI/CD¶
An automated migration script has a bug that slowly corrupts data over 3 days (bad index, silent truncation). Discovered on day 3. Without PITR: no way to know which point in time was the last uncorrupted state. With PITR: restore to "before first bad migration" and replay clean transactions.
The Tell¶
The team needs to recover to a specific point in time (not the last full backup). Only full backups exist; no continuous WAL/binlog archiving is configured. Recovery options are binary: "full restore" or "nothing."
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Needs application-level undo | Missing PITR at database level | Application has no undo; the only option is database-level restore |
| Backup is sufficient | Backup loses hours of data | Time of corruption is hours after last backup; PITR would recover to within minutes |
| PITR exists | WAL archiving not tested | PITR configuration exists but was never validated via restore test |
The Fix (Generic)¶
- Immediate: Document data loss window; restore from the nearest backup before the corruption; use application logs or transaction IDs to reconstruct lost data manually.
- Short-term: Enable WAL archiving (Postgres:
archive_mode=on,archive_command); configure binlog retention (MySQL:binlog_expire_logs_seconds). - Long-term: Implement and test PITR regularly; define RPO targets and validate that WAL archiving meets them; use managed database services with built-in PITR (RDS, Cloud SQL) if self-managing WAL archiving is a burden.
Real-World Examples¶
- Example 1:
UPDATE users SET verified=falserun without WHERE clause at 2pm. Last backup: midnight. Without PITR: 14-hour data loss. With PITR (enabled retroactively in drill): would have been 2 minutes of loss. - Example 2: Multi-tenant SaaS: tenant data corrupted by bad multi-tenancy bug deployed at 10am. Discovered at 4pm. No PITR. Had to restore 6-hour-old backup and manually patch 6 hours of tenant writes from audit logs.
War Story¶
A developer ran
DELETE FROM sessionsthinking they were on the staging database. They were on production. 2.3 million active sessions gone. All users logged out. Without PITR, the only recovery was the midnight backup — logging everyone out for the day and losing 6 hours of session data. We restored from midnight and accepted the loss. The next week we enabled WAL archiving to S3. RPO went from "6 hours" to "30 seconds." Cost: $40/month in S3.
Cross-References¶
- Topic Packs: database-ops, backup-restore
- Footguns: database-ops/footguns.md — "No PITR (point-in-time recovery)"
- Related Patterns: FP-025 (untested backup — related DR gap), FP-026 (replication lag — PITR helps recover from lag-based data loss too)