Skip to content

Pattern: Missing Point-in-Time Recovery

ID: FP-027 Family: Silent Corruption Frequency: Common Blast Radius: Single Service Detection Difficulty: Subtle

The Shape

An application has daily full backups but no continuous WAL/binlog archiving that would allow recovery to an arbitrary point in time. When a logical error occurs (accidental DELETE, bad migration, corruption introduced gradually), the only recovery options are "restore from yesterday's backup (losing a day of data)" or "no recovery." PITR fills this gap: with WAL archiving, recovery is possible to any point in time, limited only by WAL retention period.

How You'll See It

In Linux/Infrastructure

Bad migration runs at 3pm: DELETE FROM orders WHERE status='pending' without a WHERE clause selects all orders. Realized at 3:15pm. Without PITR: options are "restore from midnight backup (lose 15 hours)" or "manually reconstruct from application logs." With PITR: restore to 2:59pm (1 minute before migration).

In Kubernetes

A Kubernetes operator accidentally deletes all ConfigMaps in a namespace (kubectl delete cm --all). Without PITR (etcd snapshots don't give fine-grained recovery): restore the most recent etcd snapshot (potentially hours old); lose all ConfigMap changes since the snapshot. With frequent etcd snapshots + velero: restore to within minutes.

In CI/CD

An automated migration script has a bug that slowly corrupts data over 3 days (bad index, silent truncation). Discovered on day 3. Without PITR: no way to know which point in time was the last uncorrupted state. With PITR: restore to "before first bad migration" and replay clean transactions.

The Tell

The team needs to recover to a specific point in time (not the last full backup). Only full backups exist; no continuous WAL/binlog archiving is configured. Recovery options are binary: "full restore" or "nothing."

Common Misdiagnosis

Looks Like But Actually How to Tell the Difference
Needs application-level undo Missing PITR at database level Application has no undo; the only option is database-level restore
Backup is sufficient Backup loses hours of data Time of corruption is hours after last backup; PITR would recover to within minutes
PITR exists WAL archiving not tested PITR configuration exists but was never validated via restore test

The Fix (Generic)

  1. Immediate: Document data loss window; restore from the nearest backup before the corruption; use application logs or transaction IDs to reconstruct lost data manually.
  2. Short-term: Enable WAL archiving (Postgres: archive_mode=on, archive_command); configure binlog retention (MySQL: binlog_expire_logs_seconds).
  3. Long-term: Implement and test PITR regularly; define RPO targets and validate that WAL archiving meets them; use managed database services with built-in PITR (RDS, Cloud SQL) if self-managing WAL archiving is a burden.

Real-World Examples

  • Example 1: UPDATE users SET verified=false run without WHERE clause at 2pm. Last backup: midnight. Without PITR: 14-hour data loss. With PITR (enabled retroactively in drill): would have been 2 minutes of loss.
  • Example 2: Multi-tenant SaaS: tenant data corrupted by bad multi-tenancy bug deployed at 10am. Discovered at 4pm. No PITR. Had to restore 6-hour-old backup and manually patch 6 hours of tenant writes from audit logs.

War Story

A developer ran DELETE FROM sessions thinking they were on the staging database. They were on production. 2.3 million active sessions gone. All users logged out. Without PITR, the only recovery was the midnight backup — logging everyone out for the day and losing 6 hours of session data. We restored from midnight and accepted the loss. The next week we enabled WAL archiving to S3. RPO went from "6 hours" to "30 seconds." Cost: $40/month in S3.

Cross-References

  • Topic Packs: database-ops, backup-restore
  • Footguns: database-ops/footguns.md — "No PITR (point-in-time recovery)"
  • Related Patterns: FP-025 (untested backup — related DR gap), FP-026 (replication lag — PITR helps recover from lag-based data loss too)