Quiz: Distributed Systems Fundamentals

7 questions

L1 (4 questions)

1. What is the CAP theorem and how does it apply to choosing a database for a microservices architecture?

Show answer

CAP states that a distributed system can guarantee at most two of three properties: Consistency (every read sees the latest write), Availability (every request gets a non-error response), and Partition tolerance (the system keeps working despite network splits). Since network partitions are unavoidable, the real choice is how the system behaves during a partition: CP (consistent but may reject requests, e.g., etcd, ZooKeeper) vs AP (available but may serve stale data, e.g., Cassandra, DynamoDB in eventual-consistency mode). Choose based on your tolerance for stale reads vs rejected requests.

2. What is the split-brain problem and how do distributed systems prevent it?

Show answer

Split-brain occurs when a network partition causes two subsets of nodes to each believe they are the active cluster, leading to conflicting writes and data divergence. Prevention mechanisms: quorum-based voting (majority required to accept writes — a 3-node cluster tolerates 1 failure), fencing (STONITH/fence agents power off unreachable nodes), lease-based leadership (leader must renew lease within timeout), and witness/tiebreaker nodes in even-numbered clusters.
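The quorum arithmetic behind this answer can be sketched in a few lines. This is a minimal illustration, not taken from any particular system; the function names `majority`, `tolerated_failures`, and `has_quorum` are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition side may accept writes only if it holds a majority."""
    return reachable_nodes >= majority(cluster_size)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, majority(size), tolerated_failures(size))
    # A 4-node cluster split 2/2: neither side has quorum.
    print(has_quorum(2, 4))
```

Because two disjoint sides of a partition can never both hold a strict majority, at most one side keeps accepting writes. A 4-node cluster tolerates no more failures than a 3-node one, and an even 2/2 split leaves neither side with quorum, which is why even-sized clusters are often given a witness or tiebreaker.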

3. What are idempotent operations and why are they critical in distributed systems?

Show answer

An idempotent operation has the same effect whether it is executed once or many times. This is critical because network failures force retries in distributed systems: without idempotency, a retry could create a duplicate order, charge a card twice, or increment a counter extra times. Techniques: use unique request IDs (idempotency keys) that the server deduplicates, prefer PUT (set to value) over POST (create new), and design state machines where re-applying a transition is a no-op if already applied.
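As a rough sketch of server-side deduplication with idempotency keys, consider the hypothetical in-memory `PaymentService` below. The class, its method names, and the receipt shape are all assumptions for illustration; a real service would persist keys in durable storage and expire them.

```python
class PaymentService:
    """Hypothetical in-memory service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges: list[dict] = []           # side effects actually performed
        self._seen: dict[str, dict] = {}        # idempotency key -> stored receipt

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key returns the original receipt
        # instead of charging again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]

        receipt = {"charge_id": len(self.charges) + 1, "amount_cents": amount_cents}
        self.charges.append(receipt)
        self._seen[idempotency_key] = receipt
        return receipt


if __name__ == "__main__":
    svc = PaymentService()
    first = svc.charge("order-42", 500)
    retry = svc.charge("order-42", 500)   # e.g. the client timed out and retried
    print(first == retry, len(svc.charges))
```

The client generates the key once per logical operation and reuses it on every retry, so the server can tell a retry apart from a genuinely new request.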

4. What is a circuit breaker pattern and when should you use it in a microservices architecture?

Show answer

A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold (e.g., 50% failure rate over 10 seconds). When open, calls fail immediately without contacting the service, preventing cascade failures and giving the downstream time to recover. After a timeout, it enters the half-open state and allows a test request: if the request succeeds the breaker closes again, and if it fails the breaker reopens. Use it for any synchronous call to another service, external API, or database. Libraries: Hystrix (deprecated), Resilience4j, Polly. Always pair with fallback behavior (cached data, degraded response, queue for later).
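The closed, open, and half-open states can be sketched as a small state machine. This is a simplified, single-threaded illustration and not the API of Hystrix, Resilience4j, or Polly: it trips on a count of consecutive failures rather than a failure rate, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # timeout elapsed: let one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise

        # Any success resets the breaker to closed.
        self.failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call in half-open reopens immediately.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would typically catch `CircuitOpenError` and serve the fallback (cached data or a degraded response) rather than propagate the error.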

L2 (3 questions)

1. What is the difference between Raft and Paxos consensus algorithms, and why is Raft more commonly used in modern infrastructure tools?

Show answer

Both achieve consensus in distributed systems but differ in design philosophy. Paxos is theoretically elegant but notoriously hard to implement correctly — it separates the protocol into phases that are difficult to map to real code. Raft was designed for understandability: it breaks consensus into leader election, log replication, and safety, with a single strong leader. Raft is used in etcd, Consul, CockroachDB, and TiKV because engineers can reason about and debug it. The performance difference is negligible for most workloads.

2. Explain the difference between strong consistency, eventual consistency, and causal consistency. When would you choose each?

Show answer

Strong consistency: reads always return the latest write (linearizability). Use for financial transactions, distributed locks. Eventual consistency: given no new writes, all replicas converge to the same value. Use for DNS, social media feeds, caches. Causal consistency: preserves cause-and-effect ordering (if A causes B, everyone sees A before B) but allows concurrent operations to be seen in different orders. Use for collaborative editing, comment threads. Moving from strong to causal to eventual consistency trades ordering guarantees for lower latency and higher availability.

3. What is a vector clock and how does it differ from a Lamport timestamp for tracking causality?

Show answer

Lamport timestamps assign a single incrementing counter to events. They yield an ordering consistent with causality (a total order once ties are broken by node ID), but comparing two timestamps cannot tell you whether the events are causally related or concurrent. Vector clocks maintain a counter per node (e.g., [A:3, B:2, C:1]) so you can determine if event X happened-before event Y (every component of X's vector is <= the corresponding component of Y's, and at least one is strictly smaller) or if they are concurrent (neither vector dominates the other). Trade-off: vector clocks grow with the number of nodes. Used in Dynamo-style databases for conflict detection. Modern alternatives include dotted version vectors and hybrid logical clocks.
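The comparison rule can be sketched as follows, representing each vector clock as a dict from node name to counter, with missing nodes treated as 0. The function names `happened_before`, `compare`, and `merge` are hypothetical, and this is an illustration rather than the implementation used by any specific database.

```python
def happened_before(x: dict, y: dict) -> bool:
    """True if x <= y component-wise and x is strictly smaller somewhere."""
    nodes = x.keys() | y.keys()
    no_greater = all(x.get(n, 0) <= y.get(n, 0) for n in nodes)
    strictly_less = any(x.get(n, 0) < y.get(n, 0) for n in nodes)
    return no_greater and strictly_less


def compare(x: dict, y: dict) -> str:
    """Classify the relationship between two vector clocks."""
    if happened_before(x, y):
        return "before"
    if happened_before(y, x):
        return "after"
    nodes = x.keys() | y.keys()
    if all(x.get(n, 0) == y.get(n, 0) for n in nodes):
        return "equal"
    return "concurrent"  # neither dominates: a conflict to resolve


def merge(x: dict, y: dict) -> dict:
    """Component-wise maximum, applied when a node receives another's clock."""
    return {n: max(x.get(n, 0), y.get(n, 0)) for n in x.keys() | y.keys()}


if __name__ == "__main__":
    a = {"A": 3, "B": 2, "C": 1}
    b = {"A": 3, "B": 3, "C": 1}
    c = {"A": 4, "B": 2, "C": 1}
    print(compare(a, b))  # a is dominated by b
    print(compare(b, c))  # each is ahead on a different node
```

In the usage above, `a` happened before `b`, while `b` and `c` are concurrent because `b` is ahead on node B and `c` is ahead on node A. A single Lamport counter could not distinguish that second case from a causal ordering.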