Decision Tree: Which Database for This Workload?¶
Category: Architecture Decisions
Starting question: "What type of database should we use for this workload?"
Estimated traversal: 4-5 minutes
Domains: databases, storage, architecture, performance, data-modeling
The Tree¶
What type of database should we use for this workload?
│
├── What is the primary access pattern?
│ │
│ ├── Key-value lookups (get by ID, no joins, sub-ms latency required)
│ │ └── Is this a cache / session store (data is reconstructable if lost)?
│ │ ├── Yes → DECISION: Redis (cache mode, no persistence required)
│ │ └── No →
│ │ └── Do you need persistence + single-digit ms reads at scale?
│ │ ├── Yes → DECISION: Redis with AOF/RDB persistence
│ │ └── No → DECISION: DynamoDB / Cassandra (depending on scale)
│ │
│ ├── Relational queries (joins, aggregations, ad-hoc filtering across columns)
│ │ └── Do you need ACID transactions across multiple tables?
│ │ ├── Yes →
│ │ │ └── Will you exceed 10TB of data or 50k writes/sec in 2 years?
│ │ │ ├── Yes → DECISION: CockroachDB / Spanner (distributed SQL)
│ │ │ └── No → DECISION: PostgreSQL (general-purpose relational)
│ │ └── No →
│ │ └── Is read throughput the bottleneck (> 100k reads/sec)?
│ │ ├── Yes → DECISION: PostgreSQL + read replicas or Aurora
│ │ └── No → DECISION: PostgreSQL (default relational choice)
│ │
│ ├── Document / JSON (flexible schema, nested objects, no complex joins)
│ │ └── Do you need multi-document ACID transactions?
│ │ ├── Yes → DECISION: PostgreSQL with JSONB (transactions + flexibility)
│ │ └── No →
│ │ └── Is the schema likely to change frequently?
│ │ ├── Yes → DECISION: MongoDB (flexible schema, rapid iteration)
│ │ └── No → DECISION: PostgreSQL JSONB (stability + SQL power)
│ │
│ ├── Time-series (metrics, events, IoT, logs indexed by timestamp)
│ │ └── Is this for infrastructure metrics / monitoring specifically?
│ │ ├── Yes → DECISION: Prometheus + Thanos / VictoriaMetrics (purpose-built)
│ │ └── No →
│ │ └── Do you need SQL-compatible queries with time-series extensions?
│ │ ├── Yes → DECISION: TimescaleDB (PostgreSQL + time-series)
│ │ └── No → DECISION: InfluxDB / QuestDB (purpose-built TSDB)
│ │
│ ├── Full-text search / log analytics (keyword search, faceting, aggregations on text)
│ │ └── Is this purely log analytics (structured log ingestion + querying)?
│ │ ├── Yes → DECISION: OpenSearch / Loki (log-optimized)
│ │ └── No →
│ │ └── Do you need fuzzy search + relevance ranking?
│ │ ├── Yes → DECISION: Elasticsearch / OpenSearch
│ │ └── No → DECISION: PostgreSQL full-text search (if data is in PG already)
│ │
│ └── Graph traversals (relationship queries: friends-of-friends, dependency graphs)
│ └── Are relationships the primary query pattern (not a secondary filter)?
│ ├── Yes → DECISION: Neo4j / Amazon Neptune (native graph)
│ └── No →
│ └── Do you have < 10M nodes and moderate graph queries?
│ ├── Yes → DECISION: PostgreSQL with recursive CTEs (avoid extra DB)
│ └── No → DECISION: Neo4j (scale justifies dedicated graph DB)
Node Details¶
Check 1: Primary Access Pattern¶
How to assess: Write out the 5 most frequent queries your application will execute. Classify each as key-value (get/set by single key), relational (join 2+ tables or aggregate), document (fetch nested JSON object by ID), time-series (filter by time range + metric name), search (keyword or fuzzy text match), or graph (traverse relationships N hops).

What you're looking for: The access pattern that represents 60%+ of query volume, or the pattern your SLO is most sensitive to. One workload may need multiple databases; decide based on the primary query path first.

Common pitfall: Choosing a database based on the data format rather than the query pattern. Your data might be JSON, but if you always fetch it by ID with no cross-document queries, Redis or DynamoDB may outperform MongoDB significantly.
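One way to make this check concrete is to tally estimated query volume by pattern. The query names and volume shares below are hypothetical:

```python
from collections import Counter

# Hypothetical classification of a service's most frequent queries,
# weighted by estimated share of total query volume.
query_mix = [
    ("get_user_by_id", "key-value", 0.45),
    ("get_session", "key-value", 0.25),
    ("monthly_revenue_report", "relational", 0.05),
    ("search_products", "search", 0.15),
    ("list_orders_for_user", "relational", 0.10),
]

volume_by_pattern = Counter()
for _name, pattern, share in query_mix:
    volume_by_pattern[pattern] += share

# The dominant pattern is what the database choice should optimize for.
primary, share = volume_by_pattern.most_common(1)[0]
print(primary, round(share, 2))  # → key-value 0.7
```

Here key-value lookups dominate at 70% of volume, so the key-value branch of the tree applies even though some queries are relational.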
Check 2: ACID Transaction Requirements¶
How to assess: Identify operations in your application that must be atomic. Can a partially completed operation leave data in an inconsistent state that cannot be detected or repaired? Example: deducting inventory and creating an order must both succeed or both fail.

What you're looking for: Operations where partial failure creates irrecoverable data inconsistency. If you can use idempotent retries or compensating transactions to recover from partial failures, you may not need ACID.

Common pitfall: Assuming ACID is always necessary. Most read-heavy workloads (product catalogs, user profiles, content feeds) do not require ACID transactions and can tolerate eventual consistency. Over-indexing on ACID leads to unnecessary coupling and scaling constraints.
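The idempotent-retry alternative mentioned above can be sketched in a few lines; every name here is illustrative, not a real library API. The idea is that a client retrying the whole operation after a timeout cannot double-apply the inventory deduction:

```python
# Minimal sketch of making retries safe without ACID: an idempotency key
# deduplicates replays of the same logical operation.
processed = {}               # idempotency_key -> result (stands in for a dedupe table)
inventory = {"sku-1": 10}
orders = []

def place_order(idempotency_key, sku, qty):
    if idempotency_key in processed:   # replay: return the prior result unchanged
        return processed[idempotency_key]
    inventory[sku] -= qty
    order = {"sku": sku, "qty": qty}
    orders.append(order)
    processed[idempotency_key] = order
    return order

place_order("req-42", "sku-1", 2)
place_order("req-42", "sku-1", 2)      # client retry after a timeout
print(inventory["sku-1"], len(orders))  # → 8 1 (applied exactly once)
```

In a real system the dedupe table lives in the datastore itself, but the recovery property is the same: partial failure plus retry converges to exactly-once effect.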
Check 3: Scale Thresholds¶
How to assess: Estimate (a) total data volume in 24 months, (b) peak writes per second, and (c) peak reads per second. Use order-of-magnitude estimates. 10TB is the rough threshold where single-node PostgreSQL starts to require significant tuning; 50k writes/sec is where replication lag on a single primary becomes a scaling concern.

What you're looking for: Scale that exceeds what a single well-tuned PostgreSQL primary with read replicas can handle. Most applications never reach this scale. Build with the simpler option and migrate when you have actual scale data.

Common pitfall: Premature scale optimization. Most startups and mid-size systems never exceed 1TB of structured data or 10k writes/sec. Choosing distributed SQL (CockroachDB, Spanner) for a workload that will live comfortably on a single Postgres instance adds operational complexity with no benefit.
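A back-of-envelope estimate along these lines fits in a few lines of arithmetic; all inputs below are illustrative assumptions:

```python
# Back-of-envelope scale estimate for the 24-month horizon above.
rows_per_day = 2_000_000
avg_row_bytes = 500            # assumed, including index overhead
days = 365 * 2                 # the 24-month horizon

total_tb = rows_per_day * avg_row_bytes * days / 1e12
peak_writes_per_sec = rows_per_day / 86_400 * 10   # assume 10x peak-to-average

print(f"{total_tb:.2f} TB, ~{peak_writes_per_sec:.0f} writes/sec peak")
# → 0.73 TB, ~231 writes/sec peak
```

Even a seemingly busy 2M-rows/day workload lands two orders of magnitude below the 10TB / 50k writes/sec thresholds, which is why the tree defaults to single-node PostgreSQL.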
Check 4: Schema Stability¶
How to assess: How often do you expect to add, rename, or remove fields in the data model during the next 6 months? Are these changes driven by product iteration (frequent) or infrastructure requirements (infrequent)?

What you're looking for: Frequent schema changes during rapid product iteration. Document databases shine when the schema is in flux. As the data model stabilizes, the relational model's constraints and query capabilities typically provide more value.

Common pitfall: Treating "flexible schema" as a permanent benefit. A document database's lack of schema enforcement is valuable during exploration but becomes a liability at scale: you end up with inconsistently shaped documents, an application-layer validation burden, and the loss of database-level constraint guarantees.
Check 5: Write vs Read Ratio¶
How to assess: Estimate the ratio of write operations to read operations for this workload. Examples: user session data = 1:1 (write on login, read on every request); audit log = 100:1 (many writes, rare reads); product catalog = 1:1000 (rare writes, constant reads).

What you're looking for: Extreme ratios that favor specialized databases. Write-heavy workloads (> 10:1) benefit from databases optimized for write throughput (Cassandra, DynamoDB). Read-heavy workloads (> 100:1) benefit from caching layers (Redis) or read replicas (PostgreSQL, Aurora).

Common pitfall: Designing for balanced read/write when the actual ratio is extreme. A 10,000:1 read/write ratio (common for configuration or catalog data) means the write path is not a bottleneck; invest in the read path instead.
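The thresholds above can be expressed as a simple classifier; the example workloads are hypothetical:

```python
# Classify a workload by its read/write ratio, using the thresholds from
# the check above (> 10:1 writes = write-heavy, > 100:1 reads = read-heavy).
def classify(reads_per_sec, writes_per_sec):
    if writes_per_sec > 10 * reads_per_sec:
        return "write-heavy: consider Cassandra / DynamoDB"
    if reads_per_sec > 100 * writes_per_sec:
        return "read-heavy: consider caching or read replicas"
    return "balanced: a general-purpose database is fine"

print(classify(reads_per_sec=5000, writes_per_sec=5))    # catalog-like workload
print(classify(reads_per_sec=1, writes_per_sec=300))     # audit-log-like workload
```

The point is not the exact cutoffs but that the classification is mechanical once you have the two estimates.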
Check 6: Horizontal Sharding Requirements¶
How to assess: Based on your scale estimate, determine whether you need to partition data across multiple nodes for write throughput or storage. A single PostgreSQL node with NVMe storage handles ~10-50k writes/sec and multiple TBs without sharding.

What you're looking for: A genuine need to distribute writes across multiple nodes, not just to distribute reads (which can be satisfied with read replicas on a single-primary setup).

Common pitfall: Adding sharding complexity before exhausting vertical scaling options. A well-tuned PostgreSQL instance on cloud infrastructure with NVMe storage handles workloads that surprise many engineers. Measure before sharding.
Check 7: Team Expertise¶
How to assess: Survey the team: how many engineers have run this database type in production, have responded to an incident on it, can write a performant query for the chosen database, and can perform an emergency failover?

What you're looking for: At least one engineer with production experience in the chosen database type. The operational complexity of a new database technology is routinely underestimated.

Common pitfall: Choosing a database because it is technically interesting, not because the team can operate it under production pressure. A database you know well in production is almost always better than an unfamiliar one that is theoretically optimal.
Terminal Actions¶
Decision: PostgreSQL¶
Choose: PostgreSQL, preferably managed (RDS, Cloud SQL, Supabase, Aurora PostgreSQL-compatible).

Why: PostgreSQL is the default relational database for most workloads. It handles ACID transactions, complex joins, JSONB documents, full-text search, and (via extensions) time-series data, making it the best single database for a team that cannot yet characterize its query patterns precisely. Its extension ecosystem (pgvector, PostGIS, TimescaleDB) allows it to grow into specialized use cases.

Next step: Start with managed PostgreSQL. Right-size connection pooling (PgBouncer or RDS Proxy). Plan indexes up front for your primary access patterns. Do not add a second database until PostgreSQL demonstrably cannot satisfy a specific requirement.
Decision: Redis¶
Choose: Redis, managed (ElastiCache, Upstash, Redis Cloud) or self-hosted.

Why: Redis is the default cache and session store. Sub-millisecond latency, rich data structures (sorted sets, pub/sub, streams), and a simple operational model make it the right choice for caching, session storage, rate limiting, and leaderboards.

Next step: Decide on persistence requirements (cache = no persistence, session store = AOF persistence). Configure maxmemory and an eviction policy for cache use cases. For session stores, configure at least 2 replicas for HA. Never use Redis as your primary datastore for business-critical data that cannot be reconstructed.
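To illustrate the cache semantics you would configure in Redis (TTL expiry via EXPIRE, bounded memory via maxmemory), here is a minimal in-process sketch in plain Python. It demonstrates the behavior, it is not a Redis client:

```python
import time

# Minimal TTL cache sketch: entries expire lazily on read, mirroring the
# key-expiry semantics a Redis cache provides natively.
class TTLCache:
    def __init__(self):
        self._store = {}            # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:   # expired: treat as a cache miss
            del self._store[key]
            return None
        return value

cache = TTLCache()
cache.set("session:abc", {"user_id": 7}, ttl_seconds=0.05)
print(cache.get("session:abc"))     # within TTL → {'user_id': 7}
time.sleep(0.06)
print(cache.get("session:abc"))     # after TTL → None
```

The "data is reconstructable if lost" property from the tree is exactly this: a miss (or a wiped cache) is recoverable by refetching from the source of truth.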
Decision: MongoDB¶
Choose: MongoDB, managed (MongoDB Atlas) preferred.

Why: MongoDB is justified for document-centric workloads where schemas are in flux during rapid product iteration, documents are accessed holistically (fetch the whole document, rarely join), and the data is naturally nested (product catalogs, CMS content, user-generated content).

Next step: Define a document schema even though MongoDB does not enforce one by default. Use MongoDB schema validation to enforce minimum structure at the database level. Treat multi-document transactions as an exception rather than a core pattern: design your data model so that multi-document operations are rarely necessary.
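As a sketch of what database-level schema validation looks like, here is a MongoDB $jsonSchema validator expressed as the Python dict you would pass when creating the collection; the field names are illustrative:

```python
# A $jsonSchema validator enforcing minimum document structure at the
# database level, despite MongoDB's otherwise flexible schema.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["sku", "name", "price"],
        "properties": {
            "sku": {"bsonType": "string"},
            "name": {"bsonType": "string"},
            "price": {"bsonType": ["double", "int"], "minimum": 0},
        },
    }
}

# With pymongo this would be applied as:
#   db.create_collection("products", validator=validator)
print(validator["$jsonSchema"]["required"])
```

Documents may still carry extra ad-hoc fields, but the required core shape is enforced by the database rather than by every writer in application code.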
Decision: Cassandra / DynamoDB¶
Choose: Apache Cassandra (self-hosted or Astra DB managed) or DynamoDB (AWS managed).

Why: Wide-column stores are the right choice for extremely write-heavy workloads (IoT sensor data, activity feeds, event logs) at horizontal scale. Both provide tunable consistency and linear write throughput scalability.

Next step: Design the data model around your query patterns before writing a line of code. Cassandra and DynamoDB do not allow ad-hoc queries; every query must be anticipated in the schema design. Getting this wrong requires a full data migration to fix.
Decision: TimescaleDB or InfluxDB¶
Choose: TimescaleDB for SQL-compatible time-series with complex analytics; InfluxDB or QuestDB for high-ingest, simple time-series queries.

Why: Time-series databases optimize for insert throughput and time-range scans at the cost of ad-hoc relational query flexibility. Timestamp-based partitioning and automatic data compression make them dramatically more efficient than general-purpose databases for metric storage.

Next step: Define your retention policy and downsampling strategy before ingesting data. Decide on cardinality limits (number of unique tag combinations); high cardinality is a major performance bottleneck in most TSDBs.
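Downsampling is the kind of rollup a TSDB retention policy automates. The core operation, averaging raw samples into fixed time buckets, can be sketched as follows (the sample data is illustrative):

```python
from collections import defaultdict

# Average raw (timestamp_seconds, value) samples into 60-second buckets,
# the basic rollup behind TSDB downsampling / continuous aggregates.
samples = [(0, 10.0), (30, 20.0), (65, 40.0), (90, 50.0), (130, 5.0)]

buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts // 60 * 60].append(value)   # floor each timestamp to its bucket

rollup = {bucket: sum(vs) / len(vs) for bucket, vs in sorted(buckets.items())}
print(rollup)   # → {0: 15.0, 60: 45.0, 120: 5.0}
```

Deciding the bucket width and how long raw samples are retained before only the rollup survives is exactly the retention/downsampling policy the next step asks you to define up front.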
Decision: Elasticsearch / OpenSearch¶
Choose: Elasticsearch or OpenSearch, managed (AWS OpenSearch Service, Elastic Cloud) preferred.

Why: Elasticsearch is the right choice when full-text search, faceted navigation, or log analytics at scale is the primary query pattern. Its inverted index is orders of magnitude faster than LIKE queries in relational databases for text search.

Next step: Never use Elasticsearch as your primary datastore. Use it as a search index alongside a primary relational database. Design a data pipeline that writes to both and handles sync failures gracefully.
Decision: Neo4j¶
Choose: Neo4j or Amazon Neptune, managed preferred.

Why: Graph databases are the right choice when the relationships between entities are as important as the entities themselves, and when you need multi-hop traversals (friends of friends, dependency chains, recommendation paths) that would require self-joins or recursive CTEs in a relational database.

Next step: Validate that graph queries are genuinely the primary access pattern. If you need graph traversals for < 10% of queries, implement them as recursive CTEs in PostgreSQL and avoid the operational overhead of a dedicated graph database.
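For the recursive-CTE alternative mentioned above, here is a friends-of-friends traversal run against in-memory SQLite as a stand-in (PostgreSQL's WITH RECURSIVE syntax is essentially the same); the table and data are illustrative:

```python
import sqlite3

# A 2-hop "friends of friends" traversal using a recursive CTE, the
# relational fallback when graph queries are a minority access pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friendship (person TEXT, friend TEXT);
    INSERT INTO friendship VALUES
        ('alice', 'bob'), ('bob', 'carol'), ('carol', 'dave');
""")

rows = conn.execute("""
    WITH RECURSIVE reachable(person, depth) AS (
        SELECT friend, 1 FROM friendship WHERE person = 'alice'
        UNION
        SELECT f.friend, r.depth + 1
        FROM friendship f JOIN reachable r ON f.person = r.person
        WHERE r.depth < 2                      -- cap traversal at 2 hops
    )
    SELECT person, depth FROM reachable ORDER BY depth, person
""").fetchall()

print(rows)   # → [('bob', 1), ('carol', 2)]
```

This stays perfectly reasonable for bounded-depth traversals on moderate graphs; it is deep, unbounded, or high-fan-out traversals where a native graph database starts to pay for itself.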
Warning: Polyglot Persistence Overhead¶
When: You are considering adding a third or fourth distinct database technology to your stack.

Risk: Each database type requires separate expertise, separate monitoring, separate backup procedures, and separate incident runbooks. Operational complexity multiplies; teams routinely underestimate the cost of maintaining 4+ distinct database technologies.

Mitigation: Default to PostgreSQL for new workloads even when a specialized database would be marginally better. Add a specialized database only when PostgreSQL demonstrably cannot meet a specific requirement (e.g., you need sub-millisecond latency → add Redis; you need massive write throughput → add Cassandra).
Warning: Not Validated at Scale¶
When: You have chosen a database based on theoretical performance characteristics without load testing at your expected peak.

Risk: Database performance is highly workload-specific. A database that performs excellently in benchmarks for a generic workload may perform poorly for your specific query patterns, cardinality, and data distribution.

Mitigation: Load test with realistic data shapes and query distributions before committing to a database for a new workload. Use production-representative data volumes if possible; many database behaviors only emerge at scale.
Edge Cases¶
- Multi-model requirements in a single service: If a service genuinely needs both relational and document data, PostgreSQL JSONB columns allow both in one database. Avoid the operational overhead of two separate databases when one well-configured PostgreSQL instance can satisfy both access patterns.
- Vector/embedding storage for ML/AI workloads: PostgreSQL with pgvector handles vector similarity search at moderate scale (< 10M vectors). For large-scale vector search (> 100M vectors), purpose-built options (Pinecone, Weaviate, Qdrant) provide better performance. This is an emerging category where the tooling is evolving rapidly.
- Existing team expertise overrides optimal choice: If your team has deep PostgreSQL expertise and marginal MongoDB advantage for a workload, PostgreSQL is the correct choice. Operations under pressure require familiarity. A "suboptimal" database you know well beats an "optimal" database you don't.
- GDPR / data deletion requirements: If you need to delete user data reliably and completely (GDPR right to erasure), some database choices are better than others. Cassandra tombstone deletion is eventually consistent and can leave data visible for hours. PostgreSQL and most ACID databases provide immediate, verifiable deletion.
- Read-your-writes consistency: If your application requires that a user who just wrote data immediately reads their own write, be careful with asynchronous replication setups. PostgreSQL async replicas, DynamoDB eventual consistency mode, and MongoDB secondary reads all violate read-your-writes consistency. Design for this or use synchronous replication where required.
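For the vector-storage edge case above, the core operation behind pgvector and the dedicated vector stores is nearest-neighbor search over embeddings. A brute-force sketch with cosine similarity, on tiny illustrative vectors:

```python
import math

# Brute-force nearest-neighbor search by cosine similarity; at moderate
# scale this (with an index) is what pgvector provides inside PostgreSQL.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
query = [0.85, 0.12, 0.0]

best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)   # → doc-b
```

Purpose-built vector databases earn their place when brute force or a single-node index stops scaling; the math itself is the same everywhere.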
Cross-References¶
- Topic Packs: Databases, Data Modeling, Storage
- Related trees: Managed vs Self-Hosted, Sync vs Async Communication, Monolith vs Microservices