
MongoDB: The Document Database That Surprised Everyone


Topics: BSON and the document model, replica sets (elections, oplog, read preference), sharding (shard key selection, chunks, balancer), WiredTiger storage engine, indexes (compound, covered queries, explain plans), write concern and read concern, connection pooling, backup/restore, aggregation pipeline, MongoDB's evolution from schemaless hype to ACID transactions
Level: L1-L2 (Foundations to Operations)
Time: 60-90 minutes
Prerequisites: None (everything is explained from scratch)


The Mission

It is 2:17 AM. Your phone lights up with a PagerDuty alert:

CRITICAL: MongoDB replica set rs-prod-0 — secondary mongo3 replication lag exceeding 4 hours. Oplog window is 6 hours.

You do the math in your head. The secondary is 4 hours behind. The oplog only holds 6 hours of history. If the lag keeps growing at this rate, in about 2 hours the oplog will wrap around — the oldest entries the secondary still needs will be overwritten. At that point, mongo3 cannot catch up incrementally. It will need a full initial sync — a complete copy of the entire dataset. On your 800 GB cluster, that takes 12-14 hours and saturates the network link between data centers.

You have a 2-hour window to fix this. The clock is ticking.

But before you can fix it, you need to understand what you are looking at. So let's start from the beginning — and work our way to the 2 AM save.


What Even Is MongoDB?

The 60-Second History

In 2007, Dwight Merriman and Eliot Horowitz — both from the online advertising company DoubleClick (later acquired by Google) — founded a company called 10gen. At DoubleClick, they had struggled to scale relational databases for the firehose of ad impression data. They wanted a database designed for developer speed and horizontal scale from the start.

Name Origin: "MongoDB" comes from "humongous." The name reflects the original design goal: handling massive datasets. The company renamed from 10gen to MongoDB Inc. in 2013 because the database had become more famous than the company.

Trivia: A 2010 satirical YouTube video titled "MongoDB is Web Scale" mocked the NoSQL hype, with characters dismissing relational databases because MongoDB "turns off journaling for speed." The video became so well-known that MongoDB engineers cited it in conference talks as motivation for improving durability guarantees.

The evolution tells the real story:

| Year | Version | What Changed | Why It Mattered |
|------|---------|--------------|-----------------|
| 2009 | 1.0 | First release | Document model, auto-sharding |
| 2012 | 2.2 | Aggregation framework | Finally could do GROUP BY without MapReduce |
| 2012 | — | Default write concern changed from "fire and forget" to w:1 | Apps had been silently losing writes for years |
| 2014 | — | WiredTiger acquisition | Replaced the disastrous MMAPv1 engine |
| 2015 | 3.2 | WiredTiger becomes default, document validation | From "schemaless" to "schema when you want it" |
| 2018 | 4.0 | Multi-document ACID transactions | The feature everyone said MongoDB would never have |
| 2020 | 4.4 | Hedged reads, refinable shard keys | Mature operational tooling |
| 2023 | 7.0 | Queryable encryption, column store indexes | Enterprise and analytics features |

Mental Model: MongoDB's evolution mirrors the five stages of grief — except the grief was the engineering community's, and the stages were: denial ("we don't need schemas"), anger ("we lost data!"), bargaining ("maybe just validate some fields"), depression ("we need transactions"), acceptance ("it's actually a pretty good database now").


BSON and the Document Model

MongoDB stores data as documents — JSON-like objects with a twist. The actual storage format is BSON (Binary JSON), which adds type information and length prefixes.

// A document in MongoDB
{
  "_id": ObjectId("6579a4b8c3e21a0012345678"),
  "username": "alice",
  "email": "alice@example.com",
  "created_at": ISODate("2024-01-15T10:00:00Z"),
  "profile": {
    "city": "Berlin",
    "department": "platform-eng"
  },
  "tags": ["admin", "beta"],
  "login_count": NumberLong(4217)
}

Notice what is different from a relational row:

  • Nested objects (profile.city) — no JOIN needed
  • Arrays (tags) — a first-class citizen, not a separate table
  • No fixed schema — the next document could have completely different fields
  • _id is auto-generated if you do not provide one: a 12-byte ObjectId encoding a timestamp, a 5-byte random value (a machine identifier and process ID in pre-3.4 drivers), and a 3-byte counter
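
That 12-byte layout is easy to poke at with a few lines of stdlib Python. This is a sketch using the example document's _id from above; the helper objectid_timestamp is illustrative, not a driver API:

```python
from datetime import datetime, timezone

def objectid_timestamp(oid_hex: str) -> datetime:
    """The first 4 bytes of an ObjectId are a big-endian Unix timestamp."""
    seconds = int.from_bytes(bytes.fromhex(oid_hex[:8]), "big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

oid = "6579a4b8c3e21a0012345678"
print(objectid_timestamp(oid))  # creation time baked into the _id (late 2023 here)
print(oid[8:18], oid[18:])      # 5-byte random/machine portion, 3-byte counter
```

This is why sorting on _id roughly sorts by insertion time, and why ObjectIds generated in the same second still never collide within a process.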

Under the Hood: BSON is actually larger than equivalent JSON for small documents. {"a": 1} is 7 bytes in JSON but 16 bytes in BSON — the type markers and length prefixes add overhead. BSON pays for itself on larger documents with binary data, dates, and precise number types that JSON cannot natively represent.
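
You can verify that size claim without a driver by hand-assembling the BSON. Here the value is encoded as an 8-byte double (BSON type 0x01), which is how a JavaScript shell encodes the number 1; this is a minimal sketch, not a real encoder:

```python
import struct

def bson_one_double(key: str, value: float) -> bytes:
    """Build the BSON for a one-field document whose value is a double."""
    element = b"\x01" + key.encode() + b"\x00" + struct.pack("<d", value)
    body = element + b"\x00"                        # elements + terminating NUL
    return struct.pack("<i", 4 + len(body)) + body  # int32 total-length prefix

doc = bson_one_double("a", 1.0)
print(len('{"a":1}'), "bytes of JSON vs", len(doc), "bytes of BSON")  # 7 vs 16
```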

Documents live in collections (analogous to tables). Collections live in databases. Since MongoDB 3.2, you can add JSON Schema validation — and you should:

db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["username", "email"],
      properties: {
        username: { bsonType: "string" },
        email: { bsonType: "string", pattern: "^.+@.+$" }
      }
    }
  }
})

Gotcha: "Schemaless" was always marketing. Your application code enforces a schema whether you like it or not — every user.email access assumes the field exists. After enough incidents caused by rogue documents missing expected fields, most mature deployments add schema validation. The schema just moved from the database to the application and back.


Replica Sets: How MongoDB Stays Alive

A replica set is a group of mongod processes (typically 3 or 5 — always odd) that maintain the same dataset. One member is the primary (accepts reads and writes). The rest are secondaries (replicate from the primary, can serve reads).

┌──────────┐        oplog tailing        ┌──────────────┐
│ Primary  │ ──────────────────────────── │ Secondary 1  │
│ (mongo1) │                              │ (mongo2)     │
└──────────┘ ──────────────────────────── └──────────────┘
              ─────────────────────────── ┌──────────────┐
                                          │ Secondary 2  │
                                          │ (mongo3)     │
                                          └──────────────┘

Elections: Who Becomes Primary?

When the primary goes down, secondaries detect its absence via heartbeats (every 2 seconds) and hold an election using a Raft-like consensus protocol. A majority vote is required — 2 of 3, or 3 of 5.

// Check the current state of every member
rs.status()
// Key fields: members[n].stateStr (PRIMARY/SECONDARY/RECOVERING),
//             members[n].health (1=up, 0=down),
//             members[n].optimeDate (last applied oplog entry)

Elections typically complete in 10-30 seconds. During that window, no writes are accepted.
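
The majority arithmetic also explains the "always odd" rule. A quick sketch (plain Python, no driver) shows that adding a fourth member buys zero extra fault tolerance over three:

```python
def majority(members: int) -> int:
    """Votes needed to win an election."""
    return members // 2 + 1

def failures_tolerated(members: int) -> int:
    """Members you can lose while still being able to elect a primary."""
    return members - majority(members)

for n in (1, 2, 3, 4, 5, 6, 7):
    print(f"{n} members: majority = {majority(n)}, can lose {failures_tolerated(n)}")
# 3 and 4 members both tolerate exactly one failure — the 4th vote is wasted
```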

Interview Bridge: "How does MongoDB maintain high availability?" Replica sets, majority voting, automatic failover, the 10-30 second election window, and w: "majority" write concern to ensure writes survive failover.

The Oplog: Replication's Backbone

The oplog (operations log) is a capped collection in the local database on each member. Every write to the primary is recorded in the oplog. Secondaries tail this collection to replicate changes — exactly like tailing a log file.

// Check oplog window on the primary
rs.printReplicationInfo()
// Output: "log length start to end: 21600secs (6hrs)"

// Check how far behind each secondary is
rs.printSecondaryReplicationInfo()   // named rs.printSlaveReplicationInfo() before MongoDB 4.4
// Output: "syncedTo: ... 14400 secs (4 hrs) behind the primary"

The oplog window is how far back in time the oplog reaches. If a secondary falls behind by more than the oplog window, it can no longer catch up incrementally — the entries it needs have been overwritten. It must do a full initial sync (copy the entire dataset).

Remember: Oplog window = your safety margin for secondary downtime. Rule of thumb: size the oplog for 2x your longest planned maintenance window. An 8-hour maintenance window means at least a 16-hour oplog window.
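
Because the oplog is a fixed-size circular buffer, its window scales roughly linearly with its size at a constant write rate, so sizing is a simple proportion. A sketch — the helper is hypothetical, and the numbers are the 2 AM scenario's, used illustratively:

```python
def oplog_size_for_window(current_size_mb: float, current_window_hours: float,
                          target_window_hours: float) -> float:
    """Scale oplog size to hit a target window, assuming a steady write rate."""
    mb_per_hour = current_size_mb / current_window_hours
    return mb_per_hour * target_window_hours

# 51200 MB currently covers 6.15 hours; target = 2x a ~6-hour maintenance window
needed = oplog_size_for_window(51200, current_window_hours=6.15,
                               target_window_hours=12.3)
print(f"resize to ~{needed:.0f} MB")  # ~102400 MB
```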

Read Preference: Where Do Reads Go?

// In a connection string
"mongodb://mongo1,mongo2,mongo3/mydb?replicaSet=rs0&readPreference=secondaryPreferred"
| Read Preference | Behavior | Use Case |
|-----------------|----------|----------|
| primary | All reads from primary (default) | Strong consistency |
| primaryPreferred | Primary if available, else secondary | Resilient strong consistency |
| secondary | Always from secondaries | Analytics, reporting (may be stale) |
| secondaryPreferred | Secondaries preferred | Offload reads, accept staleness |
| nearest | Lowest network latency | Geo-distributed clusters |

Gotcha: Reading from secondaries means reading potentially stale data. If a user writes their profile and immediately reads it back from a secondary, they might see the old value. This is the classic read-after-write consistency problem. For user-facing reads after writes, use primary read preference.


Flashcard Check #1

Cover the answers. Test yourself.

| Question | Answer |
|----------|--------|
| What does BSON stand for? | Binary JSON — includes type information and length prefixes |
| How many votes are needed to elect a primary in a 3-member replica set? | 2 (a majority) |
| What happens when a secondary falls behind the oplog window? | It must do a full initial sync — copy the entire dataset |
| What read preference gives you strong consistency? | primary (the default) |
| What is the default heartbeat interval between replica set members? | 2 seconds |

Write Concern: The Durability Dial

Write concern controls how many replica set members must acknowledge a write before MongoDB tells your application "success." This is one of the most critical — and most misunderstood — settings in MongoDB.

// w:0 — fire and forget. Returns immediately. No idea if write succeeded.
db.payments.insertOne({ amount: 500, user: "alice" }, { writeConcern: { w: 0 } })

// w:1 — primary acknowledges (the default before MongoDB 5.0). Lost if primary crashes before replication.
db.payments.insertOne({ amount: 500, user: "alice" }, { writeConcern: { w: 1 } })

// w:"majority" — majority acknowledges. Survives any single-node failure.
db.payments.insertOne(
  { amount: 500, user: "alice" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)
| Write Concern | Durability | Latency | When to Use |
|---------------|------------|---------|-------------|
| w: 0 | None — you do not even know if the server received it | Fastest | Metrics, logs, telemetry you can afford to drop |
| w: 1 | Survives a primary process crash (journaled), but acknowledged writes are lost if the primary fails before replicating them | Low | Pre-5.0 default; acceptable for non-critical data |
| w: "majority" | Survives any single-node failure | Higher (~2x) | Payments, user records, anything important |

War Story: Before 2012, MongoDB's driver default was effectively w: 0 — fire and forget. Applications silently lost writes. The 2012 change to w: 1 was one of the most important changes in MongoDB's history. Many early "MongoDB lost my data" stories trace back to the old default.

Read Concern: The Freshness Dial

| Read Concern | What You See | Trade-off |
|--------------|--------------|-----------|
| "local" (default) | Latest data on the member you query | May read data that gets rolled back |
| "majority" | Only data committed to a majority | Safe from rollback, slightly stale |
| "linearizable" | Most recent majority-committed data | Strongest guarantee, highest latency |

The safe default for most applications: w: "majority" + readConcern: "majority" = no data loss and no reading data that later disappears.
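
The failure mode the write concern table warns about can be acted out with a toy three-node model. This is pure Python and deliberately simplified — no elections, rollback files, or journaling — just the acknowledgment-versus-replication ordering:

```python
class ToyReplicaSet:
    """Three nodes; node 0 is primary, replication is modeled explicitly."""
    def __init__(self):
        self.nodes = [[], [], []]

    def write(self, doc, w):
        self.nodes[0].append(doc)      # primary applies the write
        if w == "majority":
            self.nodes[1].append(doc)  # a secondary confirms before the ack
        return "ok"                    # the acknowledgment the application sees

    def crash_primary_and_failover(self):
        self.nodes[0] = None           # primary gone; node 1 is elected
        return self.nodes[1]           # un-replicated writes are rolled back

rs = ToyReplicaSet()
rs.write({"payment": 500}, w=1)            # acked, never replicated
rs.write({"payment": 900}, w="majority")   # acked only after a secondary has it
print(rs.crash_primary_and_failover())     # [{'payment': 900}] — the w:1 write is gone
```

Both writes returned "ok" to the application; only the majority-acknowledged one survived the failover.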


Sharding: When One Replica Set Is Not Enough

When a single replica set cannot handle your data volume or write throughput, you shard — distribute data across multiple replica sets.

               ┌──────────────────────┐
               │   mongos (router)    │  ← your application connects here
               └──────────┬───────────┘
               ┌──────────▼───────────┐
               │   Config Servers     │  ← stores which chunks live on which shard
               │   (3-member RS)      │
               └───────┬──────┬───────┘
                       │      │
          ┌────────────▼┐  ┌──▼───────────┐
          │   Shard 1   │  │   Shard 2    │
          │ (rs-shard1) │  │ (rs-shard2)  │  ← each shard is a replica set
          └─────────────┘  └──────────────┘

The application connects to mongos (the router), which consults the config servers to route queries. This is transparent to the application.

The Shard Key: The Most Important Decision You Will Make

The shard key determines which shard a document lives on. Choose wrong, and you are in for months of pain.

// Enable sharding on a database
sh.enableSharding("mydb")

// Shard with a ranged key
sh.shardCollection("mydb.orders", { customer_id: 1 })

// Shard with a hashed key
sh.shardCollection("mydb.events", { event_id: "hashed" })

Ranged = adjacent key values on the same shard. Great for range queries, terrible for monotonically increasing keys. Hashed = even distribution via hash. Great for write throughput, terrible for range queries (scatter-gather across all shards).
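
The contrast is easy to demonstrate: route 100 monotonically increasing keys to four shards by range and by hash. The chunk boundaries here are hypothetical, and hashlib.md5 stands in for MongoDB's actual hashed-index function (which is also MD5-derived, taking 64 bits of the digest):

```python
import hashlib
from collections import Counter

SHARDS = 4
chunk_bounds = [250, 500, 750]      # hypothetical pre-split range boundaries

def ranged_shard(key: int) -> int:
    """Ranged sharding: the key's position among the boundaries picks the shard."""
    return sum(key >= b for b in chunk_bounds)

def hashed_shard(key: int) -> int:
    """Hashed sharding: a uniform hash of the key picks the shard."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS

new_keys = range(1000, 1100)        # e.g. ever-increasing timestamps
print("ranged:", Counter(ranged_shard(k) for k in new_keys))  # all on the last shard
print("hashed:", Counter(hashed_shard(k) for k in new_keys))  # spread across shards
```

Every ranged insert lands on the highest-range shard — the hotspot from the war story below — while the hashed assignment spreads the same keys out.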

War Story: A team sharded their events collection on { created_at: 1 }. Every new event had a greater timestamp than the last — so 100% of writes landed on one shard. That shard's disk I/O was at 95% while the other three sat at 8%. The fix: resharding the entire 2 TB collection on a compound key, a 36-hour migration running old and new collections in parallel.

| Shard Key Choice | Inserts | Range Queries | Danger |
|------------------|---------|---------------|--------|
| { _id: 1 } (ObjectId) | All to last shard | Good | Hotspot |
| { _id: "hashed" } | Even | Scatter-gather | No range queries |
| { customer_id: 1, created_at: 1 } | Good | Great | Best general-purpose |
| { country: 1 } | 30 chunks max | Good within country | Low cardinality, jumbo chunks |

Gotcha: Before MongoDB 5.0 you could not change a shard key after creation (4.4 added refineCollectionShardKey, which can only append suffix fields). Fixing a bad key meant dump, drop, reshard, restore. MongoDB 5.0's reshardCollection automates this but still rewrites the entire dataset. Get the shard key right the first time.

The Balancer

The balancer moves chunks between shards to keep distribution even. Each migration copies data over the network — heavy balancing can saturate inter-shard bandwidth.

sh.isBalancerRunning()              // Is it active?
db.orders.getShardDistribution()    // How even is the spread?
sh.stopBalancer()                   // Pause during bulk loads
sh.startBalancer()                  // Resume

WiredTiger: The Engine Under the Hood

Since MongoDB 3.2, WiredTiger is the default storage engine. It replaced the infamous MMAPv1, which used memory-mapped files and had database-level locking — meaning a write to any collection in a database blocked writes to every other collection.

Trivia: WiredTiger was an independent company created by Keith Bostic (one of the original BSD Unix developers) and Michael Cahill. MongoDB acquired WiredTiger in 2014. The engine brought document-level concurrency control, compression, and a proper cache — transforming MongoDB from a toy into a credible production database.

What it brought: document-level concurrency (two writes to different documents happen in parallel — the single biggest improvement over MMAPv1), compression (snappy by default, zlib/zstd available), and a proper internal cache separate from the OS page cache.

# mongod.conf — WiredTiger tuning
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8        # Default: 50% of RAM minus 1 GB (min 256 MB)
    collectionConfig:
      blockCompressor: snappy  # Options: snappy, zlib, zstd, none
    indexConfig:
      prefixCompression: true

// Check cache pressure at runtime
var cache = db.serverStatus().wiredTiger.cache
print("Cache used: " + Math.round(cache["bytes currently in the cache"] / 1024/1024) + " MB")
print("Cache max:  " + Math.round(cache["maximum bytes configured"] / 1024/1024) + " MB")
print("Dirty bytes: " + Math.round(cache["tracked dirty bytes in the cache"] / 1024/1024) + " MB")
// If dirty bytes are consistently >20% of max, you have write pressure

Under the Hood: When you delete a large fraction of a collection's documents, WiredTiger marks the freed space as reusable inside the collection's file — but the file itself does not shrink, so the OS sees no reclaimed disk space. (Dropping a whole collection does remove its file.) To return space to the OS, run db.runCommand({ compact: "collection_name" }) — which blocks operations on the collection in versions before 4.4 — or do a mongodump/mongorestore cycle.


Indexes: Making Queries Fast

Without an index, every query does a COLLSCAN — reads every document. On a hundred million documents, that is not slow; it is an outage.

// Single field — the basics
db.orders.createIndex({ customer_id: 1 })       // ascending
db.orders.createIndex({ created_at: -1 })        // descending

// Compound index — order matters (leftmost prefix rule)
db.orders.createIndex({ customer_id: 1, status: 1, created_at: -1 })
// Covers queries on: (customer_id), (customer_id + status),
//                    (customer_id + status + created_at)
// Does NOT cover: (status alone), (created_at alone)

// TTL index — auto-delete documents after expiry
db.sessions.createIndex({ created_at: 1 }, { expireAfterSeconds: 3600 })

// Partial index — only index a subset of documents
db.orders.createIndex(
  { customer_id: 1, created_at: -1 },
  { partialFilterExpression: { status: "active" } }
)
// Smaller, faster, less memory — only indexes active orders

// Text index — full-text search
db.articles.createIndex({ title: "text", body: "text" })

// Find unused indexes (wasting RAM and slowing writes)
db.orders.aggregate([{ $indexStats: {} }]).toArray()
  .filter(i => i.accesses.ops == 0)

Remember: Compound indexes follow the leftmost prefix rule — like a phone book sorted by (last name, first name, city). You can look up everyone named "Smith," or everyone named "Alice Smith," but you cannot efficiently look up everyone in "Berlin" without scanning the whole book. Put your most-filtered field first.
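
The phone-book analogy maps directly onto how a compound index is stored: one list of key tuples in sorted order. A small simulation with Python's bisect (the data is hypothetical) shows why prefix lookups are two binary searches while non-prefix lookups degenerate into a scan:

```python
import bisect

# A compound index is, conceptually, a sorted list of key tuples
index = sorted([
    ("cust_a", "active", 3), ("cust_a", "pending", 1),
    ("cust_b", "active", 2), ("cust_c", "pending", 5),
    ("cust_c", "pending", 7), ("cust_d", "active", 4),
])

def prefix_lookup(customer_id: str):
    """Leftmost-prefix query: binary-search the bounds, read the range."""
    lo = bisect.bisect_left(index, (customer_id,))
    hi = bisect.bisect_right(index, (customer_id, "\uffff", float("inf")))
    return index[lo:hi]

def non_prefix_lookup(status: str):
    """status is not a leftmost prefix: nothing to do but scan every entry."""
    return [entry for entry in index if entry[1] == status]

print(prefix_lookup("cust_c"))      # O(log n) + a short range read
print(non_prefix_lookup("active"))  # O(n) over the whole index
```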

Reading explain() Output

explain() is your X-ray machine. It shows exactly how MongoDB plans to execute a query.

// Run an explained query
db.orders.explain("executionStats").find({
  customer_id: "cust_abc123",
  status: "pending"
})

The output that matters:

{
  "executionSuccess": true,
  "nReturned": 5,
  "executionTimeMillis": 3,
  "totalKeysExamined": 5,
  "totalDocsExamined": 5,
  "executionStages": {
    "stage": "FETCH",
    "inputStage": {
      "stage": "IXSCAN",
      "keyPattern": { "customer_id": 1, "status": 1 },
      "indexName": "customer_id_1_status_1"
    }
  }
}
| Field | Good Value | Bad Value |
|-------|------------|-----------|
| stage | IXSCAN (index scan) | COLLSCAN (full collection scan) |
| totalKeysExamined vs nReturned | Equal or close | 10,000 keys examined, 5 returned |
| executionTimeMillis | < 100 for OLTP queries | > 1000 |

Remember: The mnemonic for a healthy explain plan: "Keys equals Returns, Stage says Scan." If totalKeysExamined is much larger than nReturned, your index is scanning more entries than needed — add or refine a compound index.

Covered query — when the index contains all fields the query needs, MongoDB answers entirely from the index without fetching documents. Look for "totalDocsExamined": 0 in the explain output. Use projections that exclude _id and only request indexed fields.


Flashcard Check #2

| Question | Answer |
|----------|--------|
| What is the default write concern in modern MongoDB? | w: "majority" since 5.0; w: 1 (acknowledged by primary only) before that |
| What write concern should you use for payment records? | w: "majority" — survives any single-node failure |
| Why is sharding on { created_at: 1 } dangerous? | Monotonically increasing — all writes go to one shard (hotspot) |
| What does COLLSCAN mean in an explain plan? | Full collection scan — no index is being used |
| What is a covered query? | A query answered entirely from the index, with zero document fetches |
| What replaced MMAPv1 as MongoDB's storage engine? | WiredTiger (default since MongoDB 3.2) |

The Aggregation Pipeline

The aggregation pipeline is MongoDB's answer to SQL's GROUP BY, JOIN, and subqueries. Data flows through stages, each transforming the output of the previous one.

// Revenue by status for orders this year
db.orders.aggregate([
  { $match: { created_at: { $gte: ISODate("2025-01-01") } } },
  { $group: { _id: "$status", count: { $sum: 1 }, total: { $sum: "$amount" } } },
  { $sort: { total: -1 } }
])

// Lookup (like a LEFT JOIN)
db.orders.aggregate([
  { $lookup: { from: "customers", localField: "customer_id",
               foreignField: "_id", as: "customer" } },
  { $unwind: "$customer" },
  { $project: { amount: 1, "customer.name": 1 } }
])

Mental Model: Think of the aggregation pipeline like Unix pipes. $match is grep, $group is awk, $sort is sort, $project is cut, $lookup is join. Put $match as early as possible — just like you would put grep early in a pipe to reduce the data flowing through later stages.
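
To make the analogy concrete, here is the revenue-by-status pipeline from above replayed over plain Python dicts (illustrative data; each stage is one transformation, exactly like a pipe):

```python
from itertools import groupby
from operator import itemgetter
from datetime import date

orders = [
    {"status": "paid",    "amount": 120, "created_at": date(2025, 2, 1)},
    {"status": "pending", "amount": 80,  "created_at": date(2025, 3, 5)},
    {"status": "paid",    "amount": 300, "created_at": date(2024, 12, 9)},
    {"status": "paid",    "amount": 50,  "created_at": date(2025, 6, 2)},
]

# $match — filter early, shrink everything downstream
matched = [o for o in orders if o["created_at"] >= date(2025, 1, 1)]

# $group — count and total per status
grouped = []
for status, group in groupby(sorted(matched, key=itemgetter("status")),
                             key=itemgetter("status")):
    docs = list(group)
    grouped.append({"_id": status, "count": len(docs),
                    "total": sum(d["amount"] for d in docs)})

# $sort — by total, descending
result = sorted(grouped, key=itemgetter("total"), reverse=True)
print(result)
# [{'_id': 'paid', 'count': 2, 'total': 170}, {'_id': 'pending', 'count': 1, 'total': 80}]
```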


Connection Pooling

# Python (pymongo) — connection pool tuning
from pymongo import MongoClient

client = MongoClient(
    "mongodb://user:pass@mongo1,mongo2,mongo3/mydb?replicaSet=rs0",
    maxPoolSize=50,         # max connections per host (default: 100)
    minPoolSize=5,          # pre-warmed connections
    maxIdleTimeMS=60000,    # close idle connections after 60s
    waitQueueTimeoutMS=5000 # error if no connection available after 5s
)

Gotcha: The number one cause of pool exhaustion: creating a new MongoClient per request instead of sharing one across the application lifecycle. Each client opens maxPoolSize connections to every member. One client per process. That is it.
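
One way to enforce the one-client rule is to memoize client construction at module scope. A sketch of the pattern using only the stdlib — FakeMongoClient is a stand-in for pymongo.MongoClient, which is what you would actually cache:

```python
from functools import lru_cache

class FakeMongoClient:
    """Stand-in for pymongo.MongoClient: expensive to create, owns a pool."""
    def __init__(self, uri: str):
        self.uri = uri

@lru_cache(maxsize=None)
def get_client(uri: str) -> FakeMongoClient:
    """Every caller in the process shares one client — and one pool."""
    return FakeMongoClient(uri)

a = get_client("mongodb://mongo1,mongo2,mongo3/mydb?replicaSet=rs0")
b = get_client("mongodb://mongo1,mongo2,mongo3/mydb?replicaSet=rs0")
print(a is b)  # True: one client per process, not one per request
```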


Backup and Restore

# Dump from a secondary (never hammer the primary with backup reads)
mongodump \
  --uri="mongodb://mongo1,mongo2,mongo3/mydb?replicaSet=rs0&readPreference=secondary" \
  --oplog --out=/backup/$(date +%F_%H%M)

# Restore to a different cluster for verification
mongorestore --uri="mongodb://test-mongo:27017" \
  --oplogReplay --drop /backup/2026-03-22_0300/
| Tool | Speed | Best For |
|------|-------|----------|
| mongodump/mongorestore | Slow | Small/medium datasets, cross-version migration |
| Filesystem snapshots (LVM/EBS/ZFS) | Fast | Large datasets, quick recovery |
| Atlas continuous backup | Transparent | Atlas deployments (PITR built-in) |
| Ops Manager / Cloud Manager | Incremental | Self-hosted enterprise |

Gotcha: mongodump on a large collection reads everything through the buffer cache. Running it on the primary during business hours evicts hot pages and spikes read latency. Always dump from a secondary with readPreference=secondary.


Back to the Mission: The Oplog Is Closing In

Now you have the knowledge. Let's save mongo3.

Step 1: Assess the Damage

// Connect to the primary
mongosh "mongodb://mongo1:27017/admin"

// Check replica set status
rs.status().members.map(m => ({
  name: m.name,
  state: m.stateStr,
  lag: m.optimeDate,
  lagSec: m.stateStr === "SECONDARY"
    ? Math.round((new Date() - m.optimeDate) / 1000)
    : "N/A"
}))

Output:

[
  { name: "mongo1:27017", state: "PRIMARY",   lag: ..., lagSec: "N/A" },
  { name: "mongo2:27017", state: "SECONDARY", lag: ..., lagSec: 12 },
  { name: "mongo3:27017", state: "SECONDARY", lag: ..., lagSec: 14832 }
]

mongo3 is 14,832 seconds behind — about 4.1 hours. Let's check the oplog window:

// On the primary — how much oplog history do we have?
rs.printReplicationInfo()
// configured oplog size:   51200MB
// log length start to end: 22140secs (6.15hrs)

6.15 hours of oplog, 4.1 hours of lag. That gives us ~2 hours before the oplog wraps.

Step 2: Find What Is Causing the Lag

// Connect to mongo3 directly and check what is consuming resources
mongosh "mongodb://mongo3:27017/admin"
db.currentOp({ secs_running: { $gt: 5 } })

// You find a long-running index build — kicked off on the primary 6 hours ago
db.currentOp({ "msg": { $regex: /Index Build/ } })
// "msg": "Index Build: scanning collection"
// "progress": { "done": 412000000, "total": 890000000 }

An index build on 890 million documents, 46% complete on the secondary. It is consuming all the I/O bandwidth, blocking oplog application.

Step 3: Buy Time — Resize the Oplog

You cannot speed up the index build, but you can make the oplog bigger so it does not wrap before the build finishes.

// Connect to the primary
mongosh "mongodb://mongo1:27017/admin"

// Resize oplog from 50 GB to 100 GB (online, no restart required — MongoDB 3.6+)
db.adminCommand({ replSetResizeOplog: 1, size: 102400 })
// { "ok": 1 }

// Verify
rs.printReplicationInfo()
// configured oplog size:   102400MB
// log length start to end: 22140secs (6.15hrs) — will grow as new writes accumulate

This buys you time. The oplog will now hold roughly 12 hours of history at current write rates.

Step 4: Reduce Write Pressure and Monitor

Pause any batch jobs or ETL processes generating writes. Then watch the lag:

// Monitor lag every 30 seconds on the primary
rs.status().members.filter(m => m.name === "mongo3:27017").map(m => ({
  lagSec: Math.round((new Date() - m.optimeDate) / 1000),
  state: m.stateStr
}))

Once the index build completes on mongo3, oplog application speeds up dramatically and the lag shrinks. With the larger oplog, you have a comfortable margin.

Gotcha: On MongoDB before 4.2, foreground index builds on the primary blocked all operations on the secondary during replay. MongoDB 4.2+ uses hybrid index builds that allow concurrent reads/writes, but they still consume significant I/O. Always check db.currentOp() for index builds when diagnosing replication lag.


Flashcard Check #3

| Question | Answer |
|----------|--------|
| What command resizes the oplog without a restart? | db.adminCommand({ replSetResizeOplog: 1, size: <MB> }) (MongoDB 3.6+) |
| What happens when a secondary falls behind the oplog window? | It needs a full initial sync — a complete copy of the dataset |
| What is a common cause of sudden replication lag spikes? | Long-running index builds, large bulk writes, or I/O saturation on the secondary |
| How do you check replication lag numerically? | rs.status().members — compare optimeDate to current time |
| Where should you run mongodump — primary or secondary? | Always from a secondary to avoid impacting primary performance |

Exercises

Exercise 1: Read an explain() plan (2 minutes)

Given this explain output, identify the problems:

{
  "nReturned": 3,
  "totalKeysExamined": 0,
  "totalDocsExamined": 4200000,
  "executionStages": { "stage": "COLLSCAN" }
}
Answer: Three problems:

1. **COLLSCAN** — no index is being used at all
2. **totalKeysExamined: 0** — confirms no index involvement
3. **4.2 million docs examined to return 3** — scanning the entire collection for 3 results

Fix: create an index on the fields used in the query's filter.

Exercise 2: Choose a shard key (5 minutes)

You have an events collection with these fields:

  • event_id (UUID)
  • tenant_id (500 possible values)
  • created_at (timestamp)
  • event_type (20 possible values)

Most queries filter by tenant_id and a created_at range. Write throughput is 50,000 inserts/second. Which shard key would you choose and why?

Answer: Best choice: `{ tenant_id: 1, created_at: 1 }` (compound key). Why:

  • **tenant_id first**: matches the most common query filter, so most queries target a single shard
  • **created_at second**: provides high cardinality within each tenant for chunk splitting
  • **Not `created_at` alone**: monotonically increasing = hotspot
  • **Not `tenant_id` alone**: only 500 values = low cardinality, jumbo chunks
  • **Not hashed**: would destroy range query performance on `created_at`

The compound key gives you targeted queries AND even write distribution.

Exercise 3: Diagnose a write concern problem (5 minutes)

Your application runs MongoDB 4.x with default settings. The primary crashes and a secondary is elected. After recovery, the last 4 seconds of payment records are missing. What went wrong? How do you fix it?

Answer: With write concern `w: 1` (the default before MongoDB 5.0), writes are acknowledged by the primary only. Those writes were confirmed but never replicated before the crash, so the new primary rolled them back. Fix: use `w: "majority"`:
db.payments.insertOne(
  { amount: 500, user: "alice" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)
// Or set as client default: mongodb://host/db?w=majority&wtimeout=5000

Cheat Sheet

Quick Diagnosis

| What to Check | Command |
|---------------|---------|
| Replica set health | rs.status() |
| Replication lag | rs.printSecondaryReplicationInfo() (rs.printSlaveReplicationInfo() before 4.4) |
| Oplog window | rs.printReplicationInfo() |
| Slow operations | db.currentOp({ secs_running: { $gt: 5 } }) |
| Cache pressure | db.serverStatus().wiredTiger.cache |
| Connection pool | db.serverStatus().connections |
| Collection sizes | db.collection.stats({ scale: 1048576 }) |
| Shard distribution | sh.status() / db.coll.getShardDistribution() |
| Index usage | db.coll.aggregate([{ $indexStats: {} }]) |
| Profiler | db.setProfilingLevel(1, { slowms: 100 }) |

Write Concern Quick Reference

| Level | Meaning | Data Loss Risk |
|-------|---------|----------------|
| w: 0 | Fire and forget | Total |
| w: 1 | Primary acknowledged (default before 5.0) | Lost on primary crash before replication |
| w: "majority" | Majority acknowledged (default since 5.0) | Survives any single-node failure |

Emergency Commands

// Kill all slow queries over 30 seconds (excluding the admin and local databases)
db.currentOp({ secs_running: { $gt: 30 }, ns: { $not: /^(admin|local)\./ } }).inprog
  .forEach(op => db.killOp(op.opid))

// Resize oplog online (MongoDB 3.6+)
db.adminCommand({ replSetResizeOplog: 1, size: 102400 })

// Step down primary / force reconfig (data loss risk!)
rs.stepDown(120)
rs.reconfig({ _id: "rs0", members: [{ _id: 0, host: "mongo1:27017" }] }, { force: true })

Takeaways

  • Write concern w: "majority" is not optional for data you care about. With w: 1 (the default before MongoDB 5.0), acknowledged writes can vanish if the primary crashes before replication.

  • The shard key is the most permanent decision in your schema. A monotonically increasing key creates a write hotspot. A low-cardinality key creates jumbo chunks. A compound key with your most common filter field first is usually the right answer.

  • The oplog is your replication safety net. Size it for 2x your longest maintenance window. Monitor it. When a secondary falls behind the oplog window, the recovery path is a full initial sync that can take hours or days.

  • COLLSCAN on a large collection is not a performance issue — it is an incident. Use explain("executionStats") to verify index usage. The compound index leftmost prefix rule is the key to efficient queries.

  • MongoDB's history matters for operations. Understanding that it lacked transactions until 4.0, that write concern defaulted to fire-and-forget until 2012, and that MMAPv1 had database-level locking explains why so many legacy deployments have configuration problems.

  • One MongoClient per application process. Creating clients per request is the most common cause of connection pool exhaustion.