MongoDB: The Document Database That Surprised Everyone
Topics: BSON and the document model, replica sets (elections, oplog, read preference), sharding (shard key selection, chunks, balancer), WiredTiger storage engine, indexes (compound, covered queries, explain plans), write concern and read concern, connection pooling, backup/restore, aggregation pipeline, MongoDB's evolution from schemaless hype to ACID transactions
Level: L1-L2 (Foundations to Operations)
Time: 60-90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It is 2:17 AM. Your phone lights up with a PagerDuty alert:
CRITICAL: MongoDB replica set rs-prod-0 — secondary mongo3 replication lag exceeding 4 hours. Oplog window is 6 hours.
You do the math in your head. The secondary is 4 hours behind. The oplog only holds 6 hours
of history. If the lag keeps growing at this rate, in about 2 hours the oplog will wrap
around — the oldest entries the secondary still needs will be overwritten. At that point,
mongo3 cannot catch up incrementally. It will need a full initial sync — a complete
copy of the entire dataset. On your 800 GB cluster, that takes 12-14 hours and saturates the
network link between data centers.
You have a 2-hour window to fix this. The clock is ticking.
But before you can fix it, you need to understand what you are looking at. So let's start from the beginning — and work our way to the 2 AM save.
What Even Is MongoDB?¶
The 60-Second History¶
In 2007, Dwight Merriman and Eliot Horowitz — both from the online advertising company DoubleClick (later acquired by Google) — founded a company called 10gen. At DoubleClick, they had struggled to scale relational databases for the firehose of ad impression data. They wanted a database designed for developer speed and horizontal scale from the start.
Name Origin: "MongoDB" comes from "humongous." The name reflects the original design goal: handling massive datasets. The company renamed from 10gen to MongoDB Inc. in 2013 because the database had become more famous than the company.
Trivia: A 2010 satirical YouTube video titled "MongoDB is Web Scale" mocked the NoSQL hype, with characters dismissing relational databases because MongoDB "turns off journaling for speed." The video became so well-known that MongoDB engineers cited it in conference talks as motivation for improving durability guarantees.
The evolution tells the real story:
| Year | Version | What Changed | Why It Mattered |
|---|---|---|---|
| 2009 | 1.0 | First release | Document model (replica sets and auto-sharding followed in 1.6) |
| 2012 | 2.2 | Aggregation framework | Finally could do GROUP BY without MapReduce |
| 2012 | — | Default write concern changed from "fire and forget" to w:1 | Apps had been silently losing writes for years |
| 2014 | — | WiredTiger acquisition | Replaced the disastrous MMAPv1 engine |
| 2015 | 3.2 | WiredTiger becomes default, document validation | From "schemaless" to "schema when you want it" |
| 2018 | 4.0 | Multi-document ACID transactions | The feature everyone said MongoDB would never have |
| 2020 | 4.4 | Hedged reads, refinable shard keys | Mature operational tooling |
| 2023 | 7.0 | Queryable encryption, column store indexes | Enterprise and analytics features |
Mental Model: MongoDB's evolution mirrors the five stages of grief — except the grief was the engineering community's, and the stages were: denial ("we don't need schemas"), anger ("we lost data!"), bargaining ("maybe just validate some fields"), depression ("we need transactions"), acceptance ("it's actually a pretty good database now").
BSON and the Document Model¶
MongoDB stores data as documents — JSON-like objects with a twist. The actual storage format is BSON (Binary JSON), which adds type information and length prefixes.
// A document in MongoDB
{
"_id": ObjectId("6579a4b8c3e21a0012345678"),
"username": "alice",
"email": "alice@example.com",
"created_at": ISODate("2024-01-15T10:00:00Z"),
"profile": {
"city": "Berlin",
"department": "platform-eng"
},
"tags": ["admin", "beta"],
"login_count": NumberLong(4217)
}
Notice what is different from a relational row:
- Nested objects (`profile.city`) — no JOIN needed
- Arrays (`tags`) — a first-class citizen, not a separate table
- No fixed schema — the next document could have completely different fields
- `_id` is auto-generated if you do not provide one: a 12-byte ObjectId encoding a timestamp, a machine/process identifier (a 5-byte random value since MongoDB 3.4), and a counter
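That embedded timestamp is easy to extract yourself. A sketch in plain Python (no driver needed) that decodes the creation time out of the sample document's `_id`:

```python
import struct
from datetime import datetime, timezone

def objectid_timestamp(oid_hex: str) -> datetime:
    """The first 4 bytes of an ObjectId are a big-endian Unix timestamp (seconds)."""
    seconds = struct.unpack(">I", bytes.fromhex(oid_hex)[:4])[0]
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# The _id from the sample document above: created mid-December 2023
ts = objectid_timestamp("6579a4b8c3e21a0012345678")
print(ts.isoformat())
```

This is why sorting on `_id` roughly sorts by insertion time, and why ObjectIds make poor ranged shard keys.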
Under the Hood: BSON is actually larger than equivalent JSON for small documents. `{"a":1}` is 7 bytes in JSON but 16 bytes in BSON — the type markers and length prefixes add overhead. BSON pays for itself on larger documents with binary data, dates, and precise number types that JSON cannot natively represent.
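You can verify the 7-vs-16-byte claim by hand-rolling the encoding. A sketch, assuming the value is stored as a double (type byte 0x01), which is how JSON numbers usually land in BSON:

```python
import struct

def bson_single_double(key: str, value: float) -> bytes:
    """Hand-rolled BSON for a one-field document whose value is a double."""
    # element: type byte 0x01, cstring key, 8-byte little-endian double
    element = b"\x01" + key.encode() + b"\x00" + struct.pack("<d", value)
    body = element + b"\x00"                        # document terminator
    return struct.pack("<i", 4 + len(body)) + body  # length prefix counts itself

doc = bson_single_double("a", 1)
print(len('{"a":1}'), len(doc))  # 7 16
```

The 4-byte length prefix is what lets MongoDB skip over documents without parsing them.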
Documents live in collections (analogous to tables). Collections live in databases. Since MongoDB 3.2, you can add JSON Schema validation — and you should:
db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["username", "email"],
properties: {
username: { bsonType: "string" },
email: { bsonType: "string", pattern: "^.+@.+$" }
}
}
}
})
Gotcha: "Schemaless" was always marketing. Your application code enforces a schema whether you like it or not — every `user.email` access assumes the field exists. After enough incidents caused by rogue documents missing expected fields, most mature deployments add schema validation. The schema just moved from the database to the application and back.
Replica Sets: How MongoDB Stays Alive¶
A replica set is a group of mongod processes (typically 3 or 5 — always odd) that
maintain the same dataset. One member is the primary (accepts reads and writes). The rest
are secondaries (replicate from the primary, can serve reads).
┌──────────┐ oplog tailing ┌──────────────┐
│ Primary │ ──────────────────────────── │ Secondary 1 │
│ (mongo1) │ │ (mongo2) │
└──────────┘ ──────────────────────────── └──────────────┘
─────────────────────────── ┌──────────────┐
│ Secondary 2 │
│ (mongo3) │
└──────────────┘
Elections: Who Becomes Primary?¶
When the primary goes down, secondaries detect its absence via heartbeats (every 2 seconds) and hold an election using a Raft-like consensus protocol. A majority vote is required — 2 of 3, or 3 of 5.
// Check the current state of every member
rs.status()
// Key fields: members[n].stateStr (PRIMARY/SECONDARY/RECOVERING),
// members[n].health (1=up, 0=down),
// members[n].optimeDate (last applied oplog entry)
Elections typically complete in 10-30 seconds. During that window, no writes are accepted.
Interview Bridge: "How does MongoDB maintain high availability?" Replica sets, majority voting, automatic failover, the 10-30 second election window, and `w: "majority"` write concern to ensure writes survive failover.
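The majority arithmetic is worth internalizing. A quick sketch:

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary."""
    return voting_members // 2 + 1

def fault_tolerance(voting_members: int) -> int:
    """How many members can fail while a primary can still be elected."""
    return voting_members - majority(voting_members)

for n in (3, 4, 5):
    print(n, majority(n), fault_tolerance(n))
# A 4-member set tolerates no more failures than a 3-member set,
# which is why replica sets use an odd number of voting members.
```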
The Oplog: Replication's Backbone¶
The oplog (operations log) is a capped collection in the local database on each
member. Every write to the primary is recorded in the oplog. Secondaries tail this
collection to replicate changes — exactly like tailing a log file.
// Check oplog window on the primary
rs.printReplicationInfo()
// Output: "log length start to end: 21600secs (6hrs)"
// Check how far behind each secondary is
rs.printSecondaryReplicationInfo()  // named rs.printSlaveReplicationInfo() before MongoDB 4.4
// Output: "syncedTo: ... 14400 secs (4 hrs) behind the primary"
The oplog window is how far back in time the oplog reaches. If a secondary falls behind by more than the oplog window, it can no longer catch up incrementally — the entries it needs have been overwritten. It must do a full initial sync (copy the entire dataset).
Remember: Oplog window = your safety margin for secondary downtime. Rule of thumb: size the oplog for 2x your longest planned maintenance window. An 8-hour maintenance window means at least a 16-hour oplog window.
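The rule of thumb turns into a one-line sizing formula. A sketch, assuming a steady oplog churn rate (the GB/hour figure is something you would measure on your own cluster, not a MongoDB constant):

```python
def required_oplog_gb(maintenance_window_hours: float,
                      oplog_churn_gb_per_hour: float,
                      safety_factor: float = 2.0) -> float:
    """Size the oplog to cover safety_factor x the longest planned downtime,
    given the observed rate at which oplog entries are generated."""
    return maintenance_window_hours * safety_factor * oplog_churn_gb_per_hour

# 8-hour maintenance window, oplog growing ~3 GB/hour: need at least 48 GB
print(required_oplog_gb(8, 3))
```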
Read Preference: Where Do Reads Go?¶
// In a connection string
"mongodb://mongo1,mongo2,mongo3/mydb?replicaSet=rs0&readPreference=secondaryPreferred"
| Read Preference | Behavior | Use Case |
|---|---|---|
| `primary` | All reads from primary (default) | Strong consistency |
| `primaryPreferred` | Primary if available, else secondary | Resilient strong consistency |
| `secondary` | Always from secondaries | Analytics, reporting (may be stale) |
| `secondaryPreferred` | Secondaries preferred | Offload reads, accept staleness |
| `nearest` | Lowest network latency | Geo-distributed clusters |
Gotcha: Reading from secondaries means reading potentially stale data. If a user writes their profile and immediately reads it back from a secondary, they might see the old value. This is the classic read-after-write consistency problem. For user-facing reads after writes, use the `primary` read preference.
Flashcard Check #1¶
Cover the answers. Test yourself.
| Question | Answer |
|---|---|
| What does BSON stand for? | Binary JSON — includes type information and length prefixes |
| How many votes are needed to elect a primary in a 3-member replica set? | 2 (a majority) |
| What happens when a secondary falls behind the oplog window? | It must do a full initial sync — copy the entire dataset |
| What read preference gives you strong consistency? | primary (the default) |
| What is the default heartbeat interval between replica set members? | 2 seconds |
Write Concern: The Durability Dial¶
Write concern controls how many replica set members must acknowledge a write before MongoDB tells your application "success." This is one of the most critical — and most misunderstood — settings in MongoDB.
// w:0 — fire and forget. Returns immediately. No idea if write succeeded.
db.payments.insertOne({ amount: 500, user: "alice" }, { writeConcern: { w: 0 } })
// w:1 — primary acknowledges (the default before MongoDB 5.0). Lost if primary crashes before replication.
db.payments.insertOne({ amount: 500, user: "alice" }, { writeConcern: { w: 1 } })
// w:"majority" — majority acknowledges. Survives any single-node failure.
db.payments.insertOne(
{ amount: 500, user: "alice" },
{ writeConcern: { w: "majority", wtimeout: 5000 } }
)
| Write Concern | Durability | Latency | When to Use |
|---|---|---|---|
| `w: 0` | None — you do not even know if the server received it | Fastest | Metrics, logs, telemetry you can afford to drop |
| `w: 1` | Survives primary disk failure (journaled) but not primary crash before replication | Low | Pre-5.0 default; acceptable for non-critical data |
| `w: "majority"` | Survives any single-node failure | Higher (~2x) | Payments, user records, anything important (the implicit default since 5.0) |
War Story: Before 2012, MongoDB's driver default was effectively `w: 0` — fire and forget. Applications silently lost writes. The 2012 change to `w: 1` was one of the most important changes in MongoDB's history. Many early "MongoDB lost my data" stories trace back to the old default.
Read Concern: The Freshness Dial¶
| Read Concern | What You See | Trade-off |
|---|---|---|
| `"local"` (default) | Latest data on the member you query | May read data that gets rolled back |
| `"majority"` | Only data committed to a majority | Safe from rollback, slightly stale |
| `"linearizable"` | Most recent majority-committed data | Strongest guarantee, highest latency |
The safe default for most applications: w: "majority" + readConcern: "majority" = no
data loss and no reading data that later disappears.
Sharding: When One Replica Set Is Not Enough¶
When a single replica set cannot handle your data volume or write throughput, you shard — distribute data across multiple replica sets.
┌──────────────────────┐
│ mongos (router) │ ← your application connects here
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Config Servers │ ← stores which chunks live on which shard
│ (3-member RS) │
└───────┬──────┬───────┘
│ │
┌────────────▼┐ ┌──▼───────────┐
│ Shard 1 │ │ Shard 2 │
│ (rs-shard1) │ │ (rs-shard2) │ ← each shard is a replica set
└─────────────┘ └──────────────┘
The application connects to mongos (the router), which consults the config servers to route queries. This is transparent to the application.
The Shard Key: The Most Important Decision You Will Make¶
The shard key determines which shard a document lives on. Choose wrong, and you are in for months of pain.
// Enable sharding on a database
sh.enableSharding("mydb")
// Shard with a ranged key
sh.shardCollection("mydb.orders", { customer_id: 1 })
// Shard with a hashed key
sh.shardCollection("mydb.events", { event_id: "hashed" })
Ranged = adjacent key values on the same shard. Great for range queries, terrible for monotonically increasing keys. Hashed = even distribution via hash. Great for write throughput, terrible for range queries (scatter-gather across all shards).
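You can see the hotspot effect in a toy simulation: plain Python, no MongoDB involved. The chunk boundaries here are frozen, which is exactly the situation the moment new, monotonically increasing keys start arriving:

```python
import hashlib

SHARDS = 4

def ranged_shard(key, boundaries):
    """Ranged sharding: route the key to the chunk whose range contains it."""
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

def hashed_shard(key):
    """Hashed sharding: hash the key, spread evenly across shards."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % SHARDS

# New keys are monotonically increasing (think timestamps), but the chunk
# boundaries were set when the data was smaller.
keys = range(1000, 2000)
boundaries = [250, 500, 750]

ranged_counts = [0] * SHARDS
hashed_counts = [0] * SHARDS
for k in keys:
    ranged_counts[ranged_shard(k, boundaries)] += 1
    hashed_counts[hashed_shard(k)] += 1

print(ranged_counts)  # [0, 0, 0, 1000]: every insert hits the last shard
print(hashed_counts)  # roughly 250 per shard
```

The balancer will eventually split and move chunks, but it can never move them faster than a monotonic key piles writes onto the newest chunk.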
War Story: A team sharded their events collection on
{ created_at: 1 }. Every new event had a greater timestamp than the last — so 100% of writes landed on one shard. That shard's disk I/O was at 95% while the other three sat at 8%. The fix: resharding the entire 2 TB collection on a compound key, a 36-hour migration running old and new collections in parallel.
| Shard Key Choice | Inserts | Range Queries | Danger |
|---|---|---|---|
| `{ _id: 1 }` (ObjectId) | All to last shard | Good | Hotspot |
| `{ _id: "hashed" }` | Even | Scatter-gather | No range queries |
| `{ customer_id: 1, created_at: 1 }` | Good | Great | Best general-purpose |
| `{ country: 1 }` | 30 chunks max | Good within country | Low cardinality, jumbo chunks |
Gotcha: Before MongoDB 5.0, the shard key was effectively immutable after creation (4.4 added the ability to refine a key by appending suffix fields). Fixing a bad key meant dump, drop, reshard, restore. MongoDB 5.0's `reshardCollection` automates this but still copies the entire dataset. Get the shard key right the first time.
The Balancer¶
The balancer moves chunks between shards to keep distribution even. Each migration copies data over the network — heavy balancing can saturate inter-shard bandwidth.
sh.isBalancerRunning() // Is it active?
db.orders.getShardDistribution() // How even is the spread?
sh.stopBalancer() // Pause during bulk loads
sh.startBalancer() // Resume
WiredTiger: The Engine Under the Hood¶
Since MongoDB 3.2, WiredTiger is the default storage engine. It replaced the infamous MMAPv1, which used memory-mapped files and had database-level locking — meaning a write to any collection in a database blocked writes to every other collection.
Trivia: WiredTiger was an independent company created by Keith Bostic (one of the original BSD Unix developers) and Michael Cahill. MongoDB acquired WiredTiger in 2014. The engine brought document-level concurrency control, compression, and a proper cache — transforming MongoDB from a toy into a credible production database.
What it brought: document-level concurrency (two writes to different documents happen in parallel — the single biggest improvement over MMAPv1), compression (snappy by default, zlib/zstd available), and a proper internal cache separate from the OS page cache.
# mongod.conf — WiredTiger tuning
storage:
engine: wiredTiger
wiredTiger:
engineConfig:
cacheSizeGB: 8 # Default: 50% of RAM minus 1 GB (min 256 MB)
collectionConfig:
blockCompressor: snappy # Options: snappy, zlib, zstd, none
indexConfig:
prefixCompression: true
// Check cache pressure at runtime
var cache = db.serverStatus().wiredTiger.cache
print("Cache used: " + Math.round(cache["bytes currently in the cache"] / 1024/1024) + " MB")
print("Cache max: " + Math.round(cache["maximum bytes configured"] / 1024/1024) + " MB")
print("Dirty bytes: " + Math.round(cache["tracked dirty bytes in the cache"] / 1024/1024) + " MB")
// If dirty bytes are consistently >20% of max, you have write pressure
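The 20% rule in that comment can be captured as a tiny helper. A sketch in plain Python, fed the two stat names from `db.serverStatus().wiredTiger.cache` shown above:

```python
def write_pressure(cache: dict, threshold: float = 0.20) -> bool:
    """Flag sustained write pressure: dirty bytes above ~20% of the configured max."""
    dirty = cache["tracked dirty bytes in the cache"]
    limit = cache["maximum bytes configured"]
    return dirty / limit > threshold

stats = {"tracked dirty bytes in the cache": 3 * 1024**3,   # 3 GB dirty
         "maximum bytes configured": 8 * 1024**3}            # 8 GB cache
print(write_pressure(stats))  # True: 37% of the cache is dirty
```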
Under the Hood: When you mass-delete documents from a large collection, WiredTiger reclaims the space internally for reuse — but the OS does not see the freed disk space; the data files stay the same size. (Dropping a collection outright does remove its files.) To actually return disk space to the OS, you need `db.runCommand({ compact: "collection_name" })` (I/O-heavy, and blocking in older versions) or a mongodump/mongorestore cycle.
Indexes: Making Queries Fast¶
Without an index, every query does a COLLSCAN — reads every document. On a hundred million documents, that is not slow; it is an outage.
// Single field — the basics
db.orders.createIndex({ customer_id: 1 }) // ascending
db.orders.createIndex({ created_at: -1 }) // descending
// Compound index — order matters (leftmost prefix rule)
db.orders.createIndex({ customer_id: 1, status: 1, created_at: -1 })
// Covers queries on: (customer_id), (customer_id + status),
// (customer_id + status + created_at)
// Does NOT cover: (status alone), (created_at alone)
// TTL index — auto-delete documents after expiry
db.sessions.createIndex({ created_at: 1 }, { expireAfterSeconds: 3600 })
// Partial index — only index a subset of documents
db.orders.createIndex(
{ customer_id: 1, created_at: -1 },
{ partialFilterExpression: { status: "active" } }
)
// Smaller, faster, less memory — only indexes active orders
// Text index — full-text search
db.articles.createIndex({ title: "text", body: "text" })
// Find unused indexes (wasting RAM and slowing writes)
db.orders.aggregate([{ $indexStats: {} }]).toArray()
.filter(i => i.accesses.ops == 0)
Remember: Compound indexes follow the leftmost prefix rule — like a phone book sorted by (last name, first name, city). You can look up everyone named "Smith," or everyone named "Alice Smith," but you cannot efficiently look up everyone in "Berlin" without scanning the whole book. Put your most-filtered field first.
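A toy model of the rule, in plain Python. This is a simplification covering equality filters only (the real query planner has more cases), but it captures why field order in a compound index matters:

```python
def can_use_index(index_fields, query_fields):
    """Simplified leftmost-prefix check: the set of equality-filtered fields
    must match some prefix of the compound index."""
    prefix = index_fields[:len(query_fields)]
    return set(prefix) == set(query_fields)

idx = ["customer_id", "status", "created_at"]
print(can_use_index(idx, ["customer_id"]))            # True
print(can_use_index(idx, ["status", "customer_id"]))  # True: filter order in the query doesn't matter
print(can_use_index(idx, ["status"]))                 # False: skips the first index field
```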
Reading explain() Output¶
explain() is your X-ray machine. It shows exactly how MongoDB plans to execute a query.
// Run an explained query
db.orders.explain("executionStats").find({
customer_id: "cust_abc123",
status: "pending"
})
The output that matters:
{
"executionSuccess": true,
"nReturned": 5,
"executionTimeMillis": 3,
"totalKeysExamined": 5,
"totalDocsExamined": 5,
"executionStages": {
"stage": "FETCH",
"inputStage": {
"stage": "IXSCAN",
"keyPattern": { "customer_id": 1, "status": 1 },
"indexName": "customer_id_1_status_1"
}
}
}
| Field | Good Value | Bad Value |
|---|---|---|
| `stage` | `IXSCAN` (index scan) | `COLLSCAN` (full collection scan) |
| `totalKeysExamined` vs `nReturned` | Equal or close | 10,000 keys examined, 5 returned |
| `executionTimeMillis` | < 100 for OLTP queries | > 1000 |
Remember: The mnemonic for a healthy explain plan: "Keys equals Returns, Stage says Scan." If `totalKeysExamined` is much larger than `nReturned`, your index is scanning more entries than needed — add or refine a compound index.
Covered query — when the index contains all fields the query needs, MongoDB answers
entirely from the index without fetching documents. Look for "totalDocsExamined": 0 in the
explain output. Use projections that exclude _id and only request indexed fields.
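In the same spirit as the prefix checker, here is a simplified model of "covered": a plain-Python sketch (equality filters and simple projections only; the real planner has more nuance):

```python
def is_covered(index_fields, filter_fields, projection_fields):
    """A query is covered when every field it filters on and returns
    lives in the index, and _id is excluded from the projection."""
    needed = set(filter_fields) | set(projection_fields)
    return "_id" not in projection_fields and needed <= set(index_fields)

idx = {"customer_id", "status"}
print(is_covered(idx, {"customer_id"}, {"customer_id", "status"}))  # True: index answers everything
print(is_covered(idx, {"customer_id"}, {"customer_id", "email"}))   # False: email forces a FETCH
```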
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What is the default write concern in modern MongoDB? | `w: "majority"` since MongoDB 5.0; `w: 1` (primary-only acknowledgment) before that |
| What write concern should you use for payment records? | `w: "majority"` — survives any single-node failure |
| Why is sharding on `{ created_at: 1 }` dangerous? | Monotonically increasing — all writes go to one shard (hotspot) |
| What does COLLSCAN mean in an explain plan? | Full collection scan — no index is being used |
| What is a covered query? | A query answered entirely from the index, with zero document fetches |
| What replaced MMAPv1 as MongoDB's storage engine? | WiredTiger (default since MongoDB 3.2) |
The Aggregation Pipeline¶
The aggregation pipeline is MongoDB's answer to SQL's GROUP BY, JOIN, and subqueries.
Data flows through stages, each transforming the output of the previous one.
// Revenue by status for orders this year
db.orders.aggregate([
{ $match: { created_at: { $gte: ISODate("2025-01-01") } } },
{ $group: { _id: "$status", count: { $sum: 1 }, total: { $sum: "$amount" } } },
{ $sort: { total: -1 } }
])
// Lookup (like a LEFT JOIN)
db.orders.aggregate([
{ $lookup: { from: "customers", localField: "customer_id",
foreignField: "_id", as: "customer" } },
{ $unwind: "$customer" },
{ $project: { amount: 1, "customer.name": 1 } }
])
Mental Model: Think of the aggregation pipeline like Unix pipes. `$match` is `grep`, `$group` is `awk`, `$sort` is `sort`, `$project` is `cut`, `$lookup` is `join`. Put `$match` as early as possible — just like you would put `grep` early in a pipe to reduce the data flowing through later stages.
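The pipe analogy can be made literal with ordinary functions. A plain-Python sketch of `$match`, `$group` with `$sum`, and `$sort` over an in-memory list (no MongoDB involved):

```python
orders = [
    {"status": "paid", "amount": 50},
    {"status": "paid", "amount": 30},
    {"status": "pending", "amount": 20},
]

def match(docs, pred):
    """Like $match: keep only documents passing the predicate."""
    return [d for d in docs if pred(d)]

def group_sum(docs, key, field):
    """Like $group with $sum: one output doc per distinct key value."""
    totals = {}
    for d in docs:
        totals[d[key]] = totals.get(d[key], 0) + d[field]
    return [{"_id": k, "total": v} for k, v in totals.items()]

# match -> group -> sort, exactly like the stages in the pipeline above
result = sorted(group_sum(match(orders, lambda d: d["amount"] > 10), "status", "amount"),
                key=lambda d: -d["total"])
print(result)  # [{'_id': 'paid', 'total': 80}, {'_id': 'pending', 'total': 20}]
```

Each stage consumes the previous stage's output, which is why filtering early shrinks all downstream work.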
Connection Pooling¶
# Python (pymongo) — connection pool tuning
from pymongo import MongoClient
client = MongoClient(
"mongodb://user:pass@mongo1,mongo2,mongo3/mydb?replicaSet=rs0",
maxPoolSize=50, # max connections per host (default: 100)
minPoolSize=5, # pre-warmed connections
maxIdleTimeMS=60000, # close idle connections after 60s
waitQueueTimeoutMS=5000 # error if no connection available after 5s
)
Gotcha: The number one cause of pool exhaustion: creating a new `MongoClient` per request instead of sharing one across the application lifecycle. Each client can open up to `maxPoolSize` connections to every member. One client per process. That is it.
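A minimal sketch of the "one client per process" pattern: plain Python with a stand-in object where the real `MongoClient` construction would go, so the shape is testable without a server:

```python
import threading

_client = None
_lock = threading.Lock()

def get_client():
    """Lazily create ONE client per process; every caller gets the same instance."""
    global _client
    if _client is None:             # fast path, no lock taken
        with _lock:
            if _client is None:     # double-checked: only one thread constructs it
                _client = object()  # stand-in for MongoClient("mongodb://...", maxPoolSize=50)
    return _client

print(get_client() is get_client())  # True: the pool is shared, not recreated
```

In practice most drivers make this even simpler: construct the client once at module import or app startup and pass it around.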
Backup and Restore¶
# Dump from a secondary (never hammer the primary with backup reads)
mongodump \
--uri="mongodb://mongo1,mongo2,mongo3/mydb?replicaSet=rs0&readPreference=secondary" \
--oplog --out=/backup/$(date +%F_%H%M)
# Restore to a different cluster for verification
mongorestore --uri="mongodb://test-mongo:27017" \
--oplogReplay --drop /backup/2026-03-22_0300/
| Tool | Speed | Best For |
|---|---|---|
| `mongodump`/`mongorestore` | Slow | Small/medium datasets, cross-version migration |
| Filesystem snapshots (LVM/EBS/ZFS) | Fast | Large datasets, quick recovery |
| Atlas continuous backup | Transparent | Atlas deployments (PITR built-in) |
| Ops Manager / Cloud Manager | Incremental | Self-hosted enterprise |
Gotcha: `mongodump` on a large collection reads everything through the buffer cache. Running it on the primary during business hours evicts hot pages and spikes read latency. Always dump from a secondary with `readPreference=secondary`.
Back to the Mission: The Oplog Is Closing In¶
Now you have the knowledge. Let's save mongo3.
Step 1: Assess the Damage¶
// Connect to the primary
mongosh "mongodb://mongo1:27017/admin"
// Check replica set status
rs.status().members.map(m => ({
name: m.name,
state: m.stateStr,
lag: m.optimeDate,
lagSec: m.stateStr === "SECONDARY"
? Math.round((new Date() - m.optimeDate) / 1000)
: "N/A"
}))
Output:
[
{ name: "mongo1:27017", state: "PRIMARY", lag: ..., lagSec: "N/A" },
{ name: "mongo2:27017", state: "SECONDARY", lag: ..., lagSec: 12 },
{ name: "mongo3:27017", state: "SECONDARY", lag: ..., lagSec: 14832 }
]
mongo3 is 14,832 seconds behind — about 4.1 hours. Let's check the oplog window:
// On the primary — how much oplog history do we have?
rs.printReplicationInfo()
// configured oplog size: 51200MB
// log length start to end: 22140secs (6.15hrs)
6.15 hours of oplog, 4.1 hours of lag. That gives us ~2 hours before the oplog wraps.
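The back-of-the-envelope math, using the numbers above. Worst case: the secondary applies nothing while the index build runs, so lag grows one second per second and the margin shrinks linearly:

```python
oplog_window_s = 22140   # from rs.printReplicationInfo()
lag_s = 14832            # mongo3's lag, computed from rs.status()

# Seconds until the oldest entry mongo3 still needs gets overwritten
margin_s = oplog_window_s - lag_s
print(margin_s, "seconds =", round(margin_s / 3600, 2), "hours until the oplog wraps")
```

New writes do extend the window as the oplog churns, but on a busy cluster they also push old entries out; treating the margin as fixed is the conservative read.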
Step 2: Find What Is Causing the Lag¶
// Connect to mongo3 directly and check what is consuming resources
mongosh "mongodb://mongo3:27017/admin"
db.currentOp({ secs_running: { $gt: 5 } })
// You find a long-running index build — kicked off on the primary 6 hours ago
db.currentOp({ "msg": { $regex: /Index Build/ } })
// "msg": "Index Build: scanning collection"
// "progress": { "done": 412000000, "total": 890000000 }
An index build on 890 million documents, 46% complete on the secondary. It is consuming all the I/O bandwidth, blocking oplog application.
Step 3: Buy Time — Resize the Oplog¶
You cannot speed up the index build, but you can make the oplog bigger so it does not wrap before the build finishes.
// Connect to the primary
mongosh "mongodb://mongo1:27017/admin"
// Resize oplog from 50 GB to 100 GB (online, no restart required — MongoDB 3.6+)
db.adminCommand({ replSetResizeOplog: 1, size: 102400 })
// { "ok": 1 }
// Verify
rs.printReplicationInfo()
// configured oplog size: 102400MB
// log length start to end: 22140secs (6.15hrs) — will grow as new writes accumulate
This buys you time. The oplog will now hold roughly 12 hours of history at current write rates.
Step 4: Reduce Write Pressure and Monitor¶
Pause any batch jobs or ETL processes generating writes. Then watch the lag:
// Monitor lag every 30 seconds on the primary
rs.status().members.filter(m => m.name === "mongo3:27017").map(m => ({
lagSec: Math.round((new Date() - m.optimeDate) / 1000),
state: m.stateStr
}))
Once the index build completes on mongo3, oplog application speeds up dramatically and
the lag shrinks. With the larger oplog, you have a comfortable margin.
Gotcha: On MongoDB before 4.2, foreground index builds on the primary blocked all operations on the secondary during replay. MongoDB 4.2+ uses hybrid index builds that allow concurrent reads/writes, but they still consume significant I/O. Always check `db.currentOp()` for index builds when diagnosing replication lag.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What command resizes the oplog without a restart? | db.adminCommand({ replSetResizeOplog: 1, size: <MB> }) (MongoDB 3.6+) |
| What happens when a secondary falls behind the oplog window? | It needs a full initial sync — a complete copy of the dataset |
| What is a common cause of sudden replication lag spikes? | Long-running index builds, large bulk writes, or I/O saturation on the secondary |
| How do you check replication lag numerically? | rs.status().members — compare optimeDate to current time |
| Where should you run mongodump — primary or secondary? | Always from a secondary to avoid impacting primary performance |
Exercises¶
Exercise 1: Read an explain() plan (2 minutes)¶
Given this explain output, identify the problems:
{
"nReturned": 3,
"totalKeysExamined": 0,
"totalDocsExamined": 4200000,
"executionStages": { "stage": "COLLSCAN" }
}
Answer
Three problems:

1. **COLLSCAN** — no index is being used at all
2. **totalKeysExamined: 0** — confirms no index involvement
3. **4.2 million docs examined to return 3** — scanning the entire collection for 3 results

Fix: create an index on the fields used in the query's filter.

Exercise 2: Choose a shard key (5 minutes)¶
You have an events collection with these fields:
- event_id (UUID)
- tenant_id (500 possible values)
- created_at (timestamp)
- event_type (20 possible values)
Most queries filter by tenant_id and a created_at range. Write throughput is 50,000
inserts/second. Which shard key would you choose and why?
Answer
Best choice: `{ tenant_id: 1, created_at: 1 }` (compound key)

Why:
- **tenant_id first**: matches the most common query filter, so most queries target a single shard
- **created_at second**: provides high cardinality within each tenant for chunk splitting
- **Not `created_at` alone**: monotonically increasing = hotspot
- **Not `tenant_id` alone**: only 500 values = low cardinality, jumbo chunks
- **Not hashed**: would destroy range query performance on `created_at`

The compound key gives you targeted queries AND even write distribution.

Exercise 3: Diagnose a write concern problem (5 minutes)¶
Your application uses default settings. The primary crashes and a secondary is elected. After recovery, the last 4 seconds of payment records are missing. What went wrong? How do you fix it?
Answer
Write concern `w: 1` (the default before MongoDB 5.0) means writes are acknowledged by the primary only. Those writes were confirmed but never replicated before the crash, and the newly elected primary rolled them back. Fix: use `w: "majority"` so a write is acknowledged only after it has reached a majority of members.

Cheat Sheet¶
Quick Diagnosis¶
| What to Check | Command |
|---|---|
| Replica set health | rs.status() |
| Replication lag | rs.printSecondaryReplicationInfo() (pre-4.4: rs.printSlaveReplicationInfo()) |
| Oplog window | rs.printReplicationInfo() |
| Slow operations | db.currentOp({ secs_running: { $gt: 5 } }) |
| Cache pressure | db.serverStatus().wiredTiger.cache |
| Connection pool | db.serverStatus().connections |
| Collection sizes | db.collection.stats({ scale: 1048576 }) |
| Shard distribution | sh.status() / db.coll.getShardDistribution() |
| Index usage | db.coll.aggregate([{ $indexStats: {} }]) |
| Profiler | db.setProfilingLevel(1, { slowms: 100 }) |
Write Concern Quick Reference¶
| Level | Meaning | Data Loss Risk |
|---|---|---|
| `w: 0` | Fire and forget | Total |
| `w: 1` | Primary acknowledged | Lost on primary crash before replication |
| `w: "majority"` | Majority acknowledged | Survives any single-node failure |
Emergency Commands¶
// Kill all slow queries over 30 seconds
db.currentOp({ secs_running: { $gt: 30 }, ns: { $not: /^(admin|local)\./ } }).inprog
.forEach(op => db.killOp(op.opid))
// Resize oplog online (MongoDB 3.6+)
db.adminCommand({ replSetResizeOplog: 1, size: 102400 })
// Step down primary / force reconfig (data loss risk!)
rs.stepDown(120)
rs.reconfig({ _id: "rs0", members: [{ _id: 0, host: "mongo1:27017" }] }, { force: true })
Takeaways¶
- Write concern `w: "majority"` is not optional for data you care about. A `w: 1` write can vanish if the primary crashes before replication.
- The shard key is the most permanent decision in your schema. A monotonically increasing key creates a write hotspot. A low-cardinality key creates jumbo chunks. A compound key with your most common filter field first is usually the right answer.
- The oplog is your replication safety net. Size it for 2x your longest maintenance window. Monitor it. When a secondary falls behind the oplog window, the recovery path is a full initial sync that can take hours or days.
- COLLSCAN on a large collection is not a performance issue — it is an incident. Use `explain("executionStats")` to verify index usage. The compound index leftmost prefix rule is the key to efficient queries.
- MongoDB's history matters for operations. Understanding that it lacked transactions until 4.0, that write concern defaulted to fire-and-forget until 2012, and that MMAPv1 had database-level locking explains why so many legacy deployments have configuration problems.
- One `MongoClient` per application process. Creating clients per request is the most common cause of connection pool exhaustion.
Related Lessons¶
- MySQL Operations: The Database You Inherited — the relational counterpart: replication lag, EXPLAIN, schema migrations
- The Split-Brain Nightmare — when two nodes both think they are primary
- The Backup Nobody Tested — "we have backups" is not "we can restore"
- Understanding Distributed Systems Without a PhD — consensus, replication, partition tolerance