Distributed Systems Interview Questions: A Deep Dive for Software Engineers
Distributed systems interviews are where the gap between experience and coursework shows most clearly. A candidate who has shipped a real replication protocol will pause when asked about leader election, because they know the subtle failure modes; a candidate who has only read about it will rattle off Raft terminology without noticing the ambiguity in the question.
This guide focuses on what senior interviewers actually probe: consistency semantics under partition, the cost of coordination, how real systems implement idempotency and exactly-once semantics, and the difference between theory and the compromises that show up in production code. You will see sample questions, concrete examples, and the traps that separate engineers who have read the papers from those who understand them.
Table of Contents
- CAP, PACELC, and Real-World Trade-Offs
- Consistency Models from Linearizable to Eventual
- Consensus: Raft, Paxos, and Why They Exist
- Replication Strategies
- Partitioning and Sharding
- Leader Election and Split Brain
- Clock Skew, Hybrid Logical Clocks, and Ordering
- Idempotency and Retries with Backoff
- Sagas, 2PC, and 3PC
- Event Sourcing and CQRS
- Common Mistakes Candidates Make
- FAQ
- Conclusion
CAP, PACELC, and Real-World Trade-Offs
Sample question: "Explain CAP. Is a single-region Postgres cluster CP or AP?"
CAP says that in the presence of a network partition, a distributed system can provide either consistency or availability, not both. The common trap is that CAP is about partition behavior only. When the network is healthy, every system can be both consistent and available. The interesting question is always: what do you sacrifice when a partition happens?
PACELC extends this: if there is a Partition, choose between Availability and Consistency; Else, choose between Latency and Consistency. Most production systems have opinions on both axes. DynamoDB is PA/EL, meaning it favors availability during partitions and low latency otherwise, at the cost of consistency on both axes. Spanner is PC/EC: it favors consistency always, paying the latency of consensus even during normal operation.
A single-region Postgres with synchronous replicas is CP: if the replica becomes unreachable, the primary stops accepting writes, sacrificing availability for consistency. With asynchronous replicas it is more like AP, because you can lose acknowledged writes on failover.
The best candidates resist simplistic labels. Real systems are rarely purely CP or AP; they expose knobs (quorum size, synchronous vs asynchronous replication) that let operators slide along the spectrum per workload.
Consistency Models from Linearizable to Eventual
Sample question: "Rank these in strength: causal, linearizable, sequential, eventual. What operations can you perform on a linearizable register but not a sequentially consistent one?"
From strongest to weakest: linearizable, sequential, causal, eventual.
Linearizability says every operation appears to take effect instantaneously at some point between its invocation and its response, and operations appear in real-time order. It is the gold standard. A compare-and-swap on a linearizable register gives you a lock; on a sequentially consistent one, it does not, because sequential consistency does not respect real-time order across clients.
Sequential consistency says there is some total order consistent with each client's program order, but that order need not match real time. Causal consistency says operations that are causally related (in the happens-before sense) appear in the same order to all observers, but concurrent operations may appear in different orders. Eventual consistency just says replicas converge if writes stop.
Client A: write(x, 1) ----ack----
Client B: read(x) -> ?
Under linearizability, if A's ack arrived before B's read started (in real time), B must see x = 1. Under causal consistency, B can see x = 0 if no causal message passed between the clients. Because full linearizability is expensive, many production systems instead expose weaker but useful session guarantees such as read-your-writes and monotonic reads.
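To make one of those session guarantees concrete, here is a minimal read-your-writes sketch; the primary/replica objects and their version methods are hypothetical, not any real client library:
# Sketch: read-your-writes via session version tracking (hypothetical interfaces).
class Session:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.last_write_version = 0   # highest version this session has written

    def write(self, key, value):
        # primary.write is assumed to return the log position / version of the write
        self.last_write_version = self.primary.write(key, value)

    def read(self, key):
        # Any replica that has already applied our last write is safe to read from;
        # otherwise fall back to the primary.
        for r in self.replicas:
            if r.applied_version() >= self.last_write_version:
                return r.read(key)
        return self.primary.read(key)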
Consensus: Raft, Paxos, and Why They Exist
Sample question: "Walk me through Raft leader election. What happens if two candidates get the same number of votes?"
Consensus is how N nodes agree on a single value despite up to F of them crashing, where N >= 2F + 1. Paxos solved it first; Raft solved it in a way humans can understand. Both offer the same guarantees: agreement and validity hold unconditionally (safety), while termination (liveness) holds only under partial-synchrony assumptions.
Raft has three roles: follower, candidate, and leader. A follower that hears nothing from a leader within its election timeout becomes a candidate, increments its term, votes for itself, and sends RequestVote RPCs. A candidate wins if it gets a majority. Two candidates can split the vote; in that case both time out (with randomized timeouts to reduce the chance of a repeat) and a new election happens with a higher term.
State transitions in Raft
follower -------- election timeout --------> candidate
candidate ------- majority votes ----------> leader
candidate ------- higher term seen --------> follower
leader ---------- higher term seen --------> follower
The critical correctness property is the Log Matching Property: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index. Raft enforces this by having the leader include the previous entry's index and term in every AppendEntries RPC; a follower rejects the call if its own log does not match, and the leader decrements its nextIndex for that follower and retries.
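To make the consistency check concrete, here is a minimal follower-side sketch (names are illustrative, not taken from any particular implementation):
# Sketch: follower-side AppendEntries consistency check (illustrative names).
from collections import namedtuple

Entry = namedtuple("Entry", ["term", "command"])

def append_entries(log, prev_index, prev_term, entries, leader_term, current_term):
    # Reject stale leaders outright.
    if leader_term < current_term:
        return False
    # Log Matching check: we must already hold an entry at prev_index with prev_term.
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index].term != prev_term:
            return False  # leader will decrement nextIndex and retry
    # Simplification: drop the suffix and append the leader's entries.
    # A production follower truncates only from the first actually conflicting entry.
    del log[prev_index + 1:]
    log.extend(entries)
    return True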
Paxos has a similar core but separates proposers, acceptors, and learners, and its multi-decree variant, Multi-Paxos, optimizes by letting a stable leader skip the prepare phase. The well-known complaint is that Paxos is harder to implement correctly because the original paper describes only single-decree consensus and leaves multi-decree operation and leader election to the reader.
Replication Strategies
Sample question: "Compare synchronous, asynchronous, and semi-synchronous replication. What does each give up?"
Synchronous replication waits for all replicas (or a quorum) to acknowledge before responding to the client. It preserves durability at the cost of latency and availability: if a replica is slow or unreachable, writes stall.
Asynchronous replication acknowledges as soon as the primary persists the write, then forwards to replicas. It is fast and available but you can lose acknowledged writes if the primary dies before replication catches up.
Semi-synchronous is a middle ground used by MySQL: wait for at least one replica to acknowledge, but not all. This preserves durability against single-node failure while avoiding the worst-case latency of full synchronous.
Quorum-based systems (Dynamo, Cassandra) generalize this. With N replicas, write quorum W, and read quorum R, every read overlaps the most recent acknowledged write when W + R > N; that gives you up-to-date reads, though not full linearizability in every failure case. With N=3, W=2, R=2 you get that overlap and tolerate one node failure. Tuning W and R lets operators trade latency for consistency per operation.
# Quorum write pseudo-code: returns True once W replicas acknowledge.
# Real coordinators fan out to all replicas in parallel; this sketch is sequential.
def quorum_write(key, value, replicas, W):
    acks = 0
    for r in replicas:
        try:
            r.write(key, value, timeout=100)  # timeout in ms
            acks += 1
            if acks >= W:
                return True
        except TimeoutError:
            continue  # slow or dead replica; keep trying the rest
    return False  # quorum not met; the caller should surface an error
Real systems add hinted handoff (the coordinator accepts a write even if a replica is down, with the intent to replay it later) and read repair (on a read, if replicas disagree, write back the latest value).
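The read path is the mirror image. A minimal sketch of a quorum read with read repair, assuming the same invented replica interface where a read returns a (value, version) pair:
# Sketch: quorum read with read repair (hypothetical replica interface).
def quorum_read(key, replicas, R):
    responses = []
    for r in replicas[:R]:
        responses.append((r, r.read(key)))  # each read returns (value, version)
    # Pick the freshest value by version.
    _, (value, version) = max(responses, key=lambda pair: pair[1][1])
    # Repair any replica that returned a stale version.
    for r, (v, ver) in responses:
        if ver < version:
            r.write(key, value, version)
    return value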
Partitioning and Sharding
Sample question: "You have a user table growing by a terabyte a month. Design a sharding strategy. What goes wrong with naive hash partitioning?"
Partitioning splits data across nodes so no single node is the bottleneck. The three common strategies are range, hash, and directory.
Range partitioning (by user_id, by date) preserves locality for range queries but tends to create hot shards when new data concentrates at one end of the range. Time-series databases often use range-by-time plus hash-by-entity to avoid the tail hotspot.
Hash partitioning distributes evenly but destroys locality. Range queries require scatter-gather across all shards.
Consistent hashing mitigates the biggest problem of naive hash partitioning: when you add or remove a node, only a small fraction of keys move, not every key. The trick is to hash both keys and nodes onto a ring and assign each key to the next node clockwise.
# Consistent hashing with virtual nodes
import bisect
import hashlib

def stable_hash(s):
    # Python's built-in hash() is salted per process, so different machines would
    # disagree on placement; use a real digest instead.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, nodes, vnodes=150):
        self.ring = {}                          # ring position -> physical node
        for node in nodes:
            for v in range(vnodes):             # each node owns many ring positions
                self.ring[stable_hash(f"{node}#{v}")] = node
        self.sorted_keys = sorted(self.ring)

    def get(self, key):
        h = stable_hash(key)
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
Virtual nodes smooth out imbalance when the number of physical nodes is small. Without them, a three-node ring is usually badly imbalanced: three random positions can carve the ring into very unequal arcs, and many virtual nodes per physical node average that variance out.
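A quick check of the claim that only a small fraction of keys move when a node is added, using the class above (node and key names are made up):
# Sketch: measure how many keys move when a fourth node joins the ring.
keys = [f"user-{i}" for i in range(10_000)]
before = ConsistentHash(["node-a", "node-b", "node-c"])
after = ConsistentHash(["node-a", "node-b", "node-c", "node-d"])
moved = sum(1 for k in keys if before.get(k) != after.get(k))
print(f"{moved / len(keys):.1%} of keys moved")  # roughly a quarter, not all of them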
Resharding is where candidates trip. Moving data while the system is live requires double-writing, backfill, cutover, and cleanup, with monitoring at each stage. The hardest part is not the data move; it is ensuring the application does not read from one shard and write to another during the transition.
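One common shape for the live migration, as a rough sketch; the phase flag and store interfaces here are hypothetical, not any particular product's API:
# Sketch: dual-write wrapper used during a live resharding migration.
class MigratingStore:
    def __init__(self, old_store, new_store):
        self.old = old_store
        self.new = new_store
        self.phase = "dual_write"   # dual_write -> backfill -> cutover -> cleanup

    def write(self, key, value):
        self.old.write(key, value)          # old shards stay authoritative
        try:
            self.new.write(key, value)      # best-effort copy to the new layout
        except Exception:
            pass                            # backfill will repair missed writes

    def read(self, key):
        # Reads stay on the old shards until backfill is verified, so the
        # application never reads one shard layout while writing another.
        store = self.new if self.phase == "cutover" else self.old
        return store.read(key)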
Leader Election and Split Brain
Sample question: "How do you prevent split brain in a primary-replica database when a network partition separates the primary from the replica?"
Split brain happens when both sides of a partition think they are the leader, accept writes, and diverge. The standard defense is a quorum: no side should act as leader without a majority of nodes acknowledging it. With three nodes split 2-1, only the side with 2 can continue.
Fencing tokens make this robust. When a leader is elected, it gets a monotonically increasing token. Every write to storage includes the token; storage rejects writes with an older token. If an old leader tries to write after being demoted, storage rejects it regardless of whether the old leader noticed its demotion.
Leader A (token 5) writes to storage.
Network partition. New leader B elected with token 6.
A recovers, tries to write with token 5.
Storage rejects because 5 < 6. No split brain.
ZooKeeper and etcd are popular external coordinators for leader election precisely because implementing this yourself and getting it right across every edge case is painful.
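A minimal sketch of the storage-side check; the interface is invented for illustration:
# Sketch: storage that rejects writes carrying a stale fencing token.
class FencedStorage:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token} < {self.highest_token}")
        self.highest_token = token   # remember the newest leader we have seen
        self.data[key] = value

storage = FencedStorage()
storage.write("config", "v1", token=5)        # old leader A
storage.write("config", "v2", token=6)        # new leader B
try:
    storage.write("config", "v3", token=5)    # A writes after the partition heals
except PermissionError as e:
    print(e)                                  # rejected; no split brain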
Clock Skew, Hybrid Logical Clocks, and Ordering
Sample question: "Why cannot you use Date.now() on different machines to order events?"
Clocks on different machines drift. NTP keeps them within tens of milliseconds typically, but gaps of seconds happen, and smearing during leap seconds can make a clock run slower or faster than wall time. An event stamped t+10ms on machine A can appear before an event stamped t-5ms on machine B even if A's event really happened later.
Logical clocks (Lamport timestamps) fix ordering but do not match wall time: each node increments its counter on every event and takes the max with any received message's counter. Vector clocks go further and detect concurrency: if clock X dominates clock Y, X happened after Y; if neither dominates, they are concurrent.
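A Lamport clock is only a few lines; this is a bare sketch:
# Sketch: Lamport logical clock.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Local event or message send: advance the counter.
        self.counter += 1
        return self.counter

    def observe(self, received_counter):
        # Message receive: jump past anything the sender had seen, then tick.
        self.counter = max(self.counter, received_counter) + 1
        return self.counter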
Hybrid logical clocks (HLC) are the best of both: a combined (physical, logical) pair that stays close to wall-clock time yet is monotonic and respects causality. CockroachDB and YugabyteDB use them to get causality-respecting timestamps without Spanner's TrueTime hardware.
# Simplified HLC
import time

class HLC:
    def __init__(self):
        self.pt = 0  # physical component (wall-clock ns)
        self.lt = 0  # logical counter, breaks ties when the wall clock stalls

    def now(self):
        # Local event or send: take the wall clock if it moved forward,
        # otherwise bump the logical counter.
        wall = time.time_ns()
        if wall > self.pt:
            self.pt, self.lt = wall, 0
        else:
            self.lt += 1
        return (self.pt, self.lt)

    def update(self, remote):
        # Receive: advance past both our own clock and the sender's timestamp.
        wall = time.time_ns()
        max_pt = max(self.pt, remote[0], wall)
        if max_pt == self.pt == remote[0]:
            self.lt = max(self.lt, remote[1]) + 1
        elif max_pt == self.pt:
            self.lt += 1
        elif max_pt == remote[0]:
            self.lt = remote[1] + 1
        else:
            self.lt = 0
        self.pt = max_pt
        return (self.pt, self.lt)
Idempotency and Retries with Backoff
Sample question: "You are building a payments API. How do you make retries safe?"
Network failures mean retries are unavoidable. Retries without idempotency double-charge customers. The standard pattern is idempotency keys: the client generates a unique key per logical operation and includes it on every retry. The server stores the key with the result of the first successful execution and returns that result on replay.
POST /charges HTTP/1.1
Idempotency-Key: 8f2a1e0c-01af-4d8e-9a3f-8f3a1b2c4d5e
Content-Type: application/json
{"amount": 2999, "currency": "usd", "customer": "cus_1"}The server must persist the idempotency key atomically with the side effect, otherwise a crash between the two leaves the system in a state where a retry causes duplicate work. Stripe's approach is to store (key, request body hash, response) and fail if a retry comes in with the same key but a different body.
Retry policy matters as much as idempotency. Exponential backoff with jitter prevents thundering herds when many clients retry simultaneously after a service recovers.
import random

def backoff_with_jitter(attempt, base=0.1, cap=10.0):
    return random.uniform(0, min(cap, base * 2 ** attempt))
Full jitter (above) often outperforms equal jitter in practice because it spreads retries more uniformly across the window. Do not retry non-idempotent requests. Do not retry on 4xx (client errors); they will not succeed next time. Do retry on 429 (respecting Retry-After) and 5xx.
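Putting the pieces together, a retry loop might look like the following sketch, where TransientError stands in for whatever retryable failures your client surfaces:
# Sketch: retrying an idempotent call with capped, jittered exponential backoff.
import time

class TransientError(Exception):
    """Placeholder for retryable failures: timeouts, 429s, 5xx responses."""

def call_with_retries(do_request, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return do_request()   # must be idempotent, e.g. carries an idempotency key
        except TransientError:
            if attempt == max_attempts - 1:
                raise             # out of attempts; surface the failure
            time.sleep(backoff_with_jitter(attempt))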
Sagas, 2PC, and 3PC
Sample question: "Why is 2PC not used in microservices? What replaced it?"
Two-phase commit coordinates a distributed transaction across resources. Phase one: coordinator asks all participants to prepare. Phase two: if all vote yes, coordinator sends commit; otherwise abort. It works but blocks on coordinator failure: if the coordinator crashes between phases, participants hold locks until it recovers. That is unacceptable for long-running microservice workflows.
Three-phase commit adds a pre-commit phase to avoid the blocking problem, but assumes synchronous networks and fails under partitions. It is rarely used in practice.
Sagas are the pattern that replaced 2PC for microservices. A saga is a sequence of local transactions with a compensating transaction for each step. If step N fails, execute compensations for steps 1 to N-1 in reverse order. Sagas accept that the system is eventually consistent and design compensations into the domain.
Booking saga:
1. reserve_hotel() compensation: release_hotel()
2. charge_credit_card() compensation: refund()
3. reserve_flight() compensation: release_flight()
4. send_confirmation()
If step 3 fails: call refund(), then release_hotel().
Implementing sagas requires a coordinator (choreography via events or orchestration via a central service). Orchestration is easier to reason about and debug; choreography decouples services but is harder to trace.
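An orchestration-style sketch; the step and compensation functions are stand-ins for real service calls:
# Sketch: saga orchestrator running steps and unwinding compensations on failure.
def run_saga(steps):
    """steps is a list of (action, compensation) pairs; compensation may be None."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Unwind in reverse order; compensations must themselves be idempotent,
            # because the orchestrator may retry them after a crash.
            for comp in reversed(completed):
                if comp:
                    comp()
            raise

# Example wiring for the booking saga above (the functions are hypothetical):
# run_saga([(reserve_hotel, release_hotel),
#           (charge_credit_card, refund),
#           (reserve_flight, release_flight),
#           (send_confirmation, None)])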
Event Sourcing and CQRS
Sample question: "Explain event sourcing. What is the benefit over storing current state?"
Event sourcing stores every state change as an immutable event. Current state is the fold of all past events. The benefits: complete audit trail, ability to replay events to rebuild projections, time travel debugging, and natural integration with other services via event streams.
# Event-sourced account
from collections import namedtuple

Event = namedtuple("Event", ["type", "amount"])

class Account:
    def __init__(self):
        self.balance = 0

    def apply(self, event):
        # State is derived from events, never mutated directly.
        if event.type == "Deposited":
            self.balance += event.amount
        elif event.type == "Withdrawn":
            self.balance -= event.amount

    @classmethod
    def from_events(cls, events):
        # Current state is the fold of the event history.
        a = cls()
        for e in events:
            a.apply(e)
        return a
CQRS (Command Query Responsibility Segregation) pairs well with event sourcing. Commands produce events; queries read from projections built from those events. The projections can be denormalized per query, optimized for reads, and rebuilt from events when requirements change.
The downsides are real. Schema evolution is hard: you cannot change old events, so you must version them and handle old versions forever. Eventually consistent projections confuse users who expect to read their writes. Long event streams need snapshots. Interviewers love to probe these trade-offs.
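For the snapshot point specifically, a rough sketch under obvious simplifications (the snapshot store is an in-memory dict and the event_store.read API is hypothetical):
# Sketch: periodic snapshots so rebuilds replay only the event tail.
SNAPSHOT_EVERY = 1000
snapshots = {}  # aggregate_id -> (next_event_index, balance)

def load_account(aggregate_id, event_store):
    start, balance = snapshots.get(aggregate_id, (0, 0))
    account = Account()
    account.balance = balance
    events = event_store.read(aggregate_id, from_index=start)  # hypothetical API
    for i, event in enumerate(events, start=start):
        account.apply(event)
        if (i + 1) % SNAPSHOT_EVERY == 0:
            snapshots[aggregate_id] = (i + 1, account.balance)
    return account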
Common Mistakes Candidates Make
Mistake one: treating CAP as prescriptive. It tells you what you give up during partitions, not that every system must pick a side forever.
Mistake two: saying "use Raft" without explaining the cost. Consensus has a round-trip per decision; putting it in the hot path kills performance. Real systems use consensus for metadata and leader election, not for every write.
Mistake three: assuming exactly-once is possible end-to-end. It is not, in general. What you can build is effectively-once via idempotent consumers plus at-least-once delivery. Saying "exactly-once delivery" to a careful interviewer is a yellow flag.
Mistake four: forgetting about the failure mode where nodes are slow but not dead. Most frightening real-world failures are gray failures: a node that is degraded or partially working rather than cleanly crashed. Heartbeats can be gamed, fast networks can be briefly slow, and aggressive timeouts cause false positives that trigger unnecessary leader elections. Healthy systems use jittered timeouts and careful failure detectors such as phi-accrual.
Mistake five: scaling write throughput with more replicas. Replicas add read capacity and durability, not write capacity. To scale writes, you partition.
FAQ
How deep should I go on Paxos in an interview?
Deep enough to explain why single-decree Paxos is hard and why Multi-Paxos adds a stable leader to make it practical. You rarely need to recite the full protocol. Raft is usually more than sufficient vocabulary.
Is "strong consistency" the same as linearizability?
They are often used interchangeably, but not quite. Strong consistency usually means linearizability of single objects. Strict serializability (linearizable transactions) is stronger and is what Spanner provides.
How do I get hands-on experience without working at a big company?
Implement Raft from the paper. Run it with Jepsen-style tests. Read the Cassandra or CockroachDB source to see how real systems handle the details the paper glosses over.
Do I need to know Kafka internals?
If you are interviewing for backend or data infrastructure roles, yes. Replication protocol (ISR), the role of the controller, how exactly-once semantics work via idempotent producers and transactional writes, and log compaction are all fair game.
Conclusion
Distributed systems interviews reward clarity about what your system is giving up. Every design choice is a trade-off, and the strongest candidates name the trade-off explicitly. If you pick synchronous replication, say so and explain why the added latency is acceptable. If you pick eventual consistency, explain the user-visible consequences and how you mitigate them with read-your-writes or monotonic reads.
Practice by designing a system you have actually used. How does your favorite database replicate? How does your payment provider handle idempotency? If you can trace a real request from client through coordination, replication, and acknowledgment, and explain where it could fail and what would happen, you will outperform most candidates.
The final tip is to slow down. Distributed systems questions are open-ended by design. Interviewers give you 45 minutes because they want to see how you reason, not how fast you can name-drop Paxos. Start by asking what consistency, availability, and durability the system needs, then derive the design from those requirements.