Database Internals Interview Questions for Software Engineers
Database internals is the interview topic most likely to punish hand-waving. An interviewer who works on storage or distributed systems can tell in one follow-up whether you have actually thought about fsync, page latches, or serializable snapshot isolation, or whether you are reciting summaries. This guide covers the questions that consistently come up at companies where durability and throughput are first-class concerns.
At phantomcode.co we see candidates collapse most often on the concurrency control and replication sections. The rest of this post walks through the topics from the storage engine up to sharding, with the mental models and quick examples you need to hold your ground under cross-examination.
Table of Contents
- B-trees vs LSM-trees
- Write-Ahead Logging and Durability
- MVCC and Snapshot Isolation
- Isolation Levels and Anomalies
- Two-Phase Locking (2PL)
- Optimistic Concurrency Control
- Indexing Strategies
- Query Planning and Execution
- Replication: Async, Semi-sync, Sync
- Partitioning and Sharding
- Common Mistakes Candidates Make
- FAQ
- Conclusion
1. B-trees vs LSM-trees
Sample question: "Compare B-tree and LSM-tree storage engines. When would you pick each?"
B-trees (really B+-trees in practice) organize data in fixed-size pages, each a sorted array of keys with pointers to children or values. Reads are O(log n) with small constants; a lookup in a table of a billion rows typically touches 4 to 5 pages. Writes update pages in place: you locate the page, dirty it, and eventually flush. This is great for read-heavy workloads but pays a write-amplification cost: changing one byte requires rewriting the whole page.
LSM-trees (Log-Structured Merge trees) batch writes in an in-memory table (the memtable), flush it to sorted immutable files on disk (SSTables), and merge these files in the background (compaction). Writes are effectively append-only and can sustain very high throughput. Reads are more expensive: a point lookup may have to check the memtable plus several SSTables; bloom filters and block-level indexes help. Range scans merge across levels.
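The read path described above can be sketched in a few lines. This is an illustrative toy, not any real engine's API: a dict stands in for the memtable, and each SSTable is an immutable sorted list searched with binary search (real engines add bloom filters and block indexes to skip SSTables entirely).

```python
import bisect

# Toy LSM point lookup: check the memtable first, then SSTables
# newest-to-oldest. Newer data shadows older data.
memtable = {"k3": "v3-new"}
# Each SSTable is an immutable, sorted list of (key, value) pairs.
sstables = [
    [("k1", "v1"), ("k3", "v3-old")],   # newer flush
    [("k2", "v2"), ("k4", "v4")],       # older flush
]

def lsm_get(key):
    if key in memtable:                  # most recent writes win
        return memtable[key]
    for table in sstables:               # newest SSTable first
        keys = [k for k, _ in table]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return table[i][1]
    return None                          # not found in any level

print(lsm_get("k3"))  # memtable shadows the older SSTable value
```

Note how a miss pays the worst case: every SSTable is probed, which is exactly the read amplification bloom filters exist to avoid.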
Pick a B-tree when read latency is the priority and the working set fits in memory (OLTP classics like PostgreSQL and MySQL/InnoDB). Pick an LSM-tree for write-heavy workloads, large key-value stores, and time-series data (RocksDB, Cassandra, ScyllaDB, LevelDB).
# Inspect RocksDB levels.
rocksdb_ldb --db=/var/lib/myapp/rocksdb dump_live_files
# PostgreSQL page statistics.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum
FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
Follow-up: "What is write amplification?" The ratio of bytes written to the device per byte written by the application. B-trees amplify by page size (a 1-byte update rewrites an entire 8 KB page). LSM-trees amplify through compaction: a value may be rewritten 10 or more times as it migrates down the levels. Tiered compaction trades lower write amplification for higher read amplification; leveled compaction does the opposite.
2. Write-Ahead Logging and Durability
Sample question: "Why does almost every relational database use a write-ahead log, and what does fsync guarantee?"
The durability contract is: once a transaction commits, the database must recover that commit even after a crash. You cannot write every update to its final location on commit because those locations are scattered and the IO cost is prohibitive. Instead, you write the change to an append-only log, fsync it, and tell the client "committed." The data pages themselves are updated later (lazily) and flushed on a checkpoint.
On recovery, you replay the log from the last checkpoint. ARIES is the canonical algorithm: write redo records before modifying the page (WAL rule), log compensation records during undo, and use LSNs (log sequence numbers) to track page state. PostgreSQL's WAL is ARIES-inspired; InnoDB's is too.
fsync asks the kernel to flush the file's dirty pages and the file's metadata to stable storage. Without it, the OS will happily report "written" while data is still in the page cache. fdatasync is a cheaper variant that skips metadata flushes when the file's size has not changed. On databases the cost of fsync on commit is the single biggest throughput lever; group commit batches many transactions into one fsync.
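Group commit is easy to see in miniature. The sketch below is a deliberate simplification (no real record format, no LSNs): many transactions append their redo records, and the batch shares a single fsync, which is how the per-commit fsync cost gets amortized.

```python
import os
import tempfile

# Toy group commit: append each transaction's redo record to a WAL file,
# but pay one fsync per batch instead of one per transaction.
fd, wal_path = tempfile.mkstemp(prefix="wal-")

def commit_batch(records):
    for rec in records:
        os.write(fd, rec + b"\n")   # WAL rule: log before touching data pages
    os.fsync(fd)                    # one flush makes the whole batch durable
    return len(records)             # all batched txns can now report "committed"

print(commit_batch([b"redo: INSERT", b"redo: UPDATE", b"redo: DELETE"]))  # 3
```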
# PostgreSQL synchronous commit tuning.
SHOW synchronous_commit; -- on | local | remote_write | remote_apply | off
SHOW wal_sync_method; -- fdatasync is common on Linux.
SHOW commit_delay; -- groups commits to amortize fsync.
# Benchmark fsync latency on your drive.
pg_test_fsync
Follow-up: "What was fsyncgate?" A 2018 incident in which PostgreSQL (and other databases) assumed a failed fsync could simply be retried. On Linux, some filesystems mark dirty pages clean after an IO error, so a retry could "succeed" without the data ever reaching disk. The fix was to PANIC on fsync failure. The lesson: durability depends on an end-to-end chain that runs all the way down to the drive's firmware.
3. MVCC and Snapshot Isolation
Sample question: "Explain how MVCC lets readers avoid blocking writers."
Multi-Version Concurrency Control stores multiple versions of each row. Each version is tagged with the transaction that created it and, eventually, the transaction that obsoleted it. A reader sees a snapshot: the set of committed versions as of some point in time (the snapshot timestamp or transaction ID). Writers create new versions without modifying the ones readers see.
PostgreSQL stores versions inline in the table heap; obsolete versions are cleaned by VACUUM. InnoDB stores undo records in a separate rollback segment; live rows reference old versions through a linked list. Both approaches support snapshot isolation: reads are consistent, repeatable within a transaction, and never block writers.
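The visibility rule can be sketched directly. The field names below borrow Postgres's xmin/xmax terminology, but the logic is a deliberate simplification (a snapshot is just a transaction ID plus a set of committed IDs; real engines also track in-progress transactions and subtransactions):

```python
# Toy MVCC visibility: a version is visible to a snapshot if its creating
# transaction committed before the snapshot, and it was not deleted by a
# transaction that is also visible to the snapshot.
def visible(version, snapshot_xid, committed):
    created_ok = version["xmin"] in committed and version["xmin"] < snapshot_xid
    deleted = (version["xmax"] is not None
               and version["xmax"] in committed
               and version["xmax"] < snapshot_xid)
    return created_ok and not deleted

committed = {100, 105}
versions = [
    {"value": "old", "xmin": 100, "xmax": 105},   # superseded at xid 105
    {"value": "new", "xmin": 105, "xmax": None},  # current version
]

# A snapshot taken between the two commits still sees the old version,
# even though a newer committed version exists: readers never block.
print([v["value"] for v in versions if visible(v, 103, committed)])
```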
-- Start a snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM orders WHERE customer_id = 42;
-- Even as writers commit new orders, this SELECT sees the same snapshot.
-- Inspect bloat in Postgres.
SELECT s.relname, s.n_dead_tup, pg_size_pretty(pg_table_size(c.oid))
FROM pg_class c JOIN pg_stat_user_tables s ON s.relid = c.oid
ORDER BY s.n_dead_tup DESC LIMIT 10;
Follow-up: "What is the write skew anomaly under snapshot isolation?" Two transactions read an overlapping set of rows, each decides to update a disjoint subset based on that read, and their combined effect violates an invariant neither transaction alone would violate. Snapshot isolation does not detect this. Serializable Snapshot Isolation (SSI) in PostgreSQL adds predicate dependency tracking to abort one of the transactions.
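Write skew is easiest to see with the classic on-call example. The sketch below is a simplified simulation, not any engine's implementation: both transactions read the same snapshot, write disjoint rows, and a first-committer-wins check (which only catches write-write conflicts) lets both commit.

```python
# Toy write skew under snapshot isolation: each doctor may go off call
# only if at least one other doctor remains on call.
db = {"alice": "on_call", "bob": "on_call"}

def end_shift(snapshot, me):
    # Decision is based on the snapshot, not the live database state.
    others = sum(1 for d, s in snapshot.items() if d != me and s == "on_call")
    return (me, "off_call") if others >= 1 else None

snap = dict(db)                      # both transactions share one snapshot
w1 = end_shift(snap, "alice")
w2 = end_shift(snap, "bob")
for w in (w1, w2):
    if w:                            # disjoint write sets: no WW conflict,
        db[w[0]] = w[1]              # so snapshot isolation lets both commit

print(db)  # nobody is on call: the invariant is violated
```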
4. Isolation Levels and Anomalies
Sample question: "Name the SQL isolation levels and the anomalies each prevents."
The standard levels and anomalies (simplified):
- READ UNCOMMITTED: allows dirty reads (seeing uncommitted data). Rare in modern engines.
- READ COMMITTED: prevents dirty reads. Each statement sees its own snapshot. Still allows non-repeatable reads and phantoms.
- REPEATABLE READ: prevents non-repeatable reads. In Postgres it also prevents phantoms via snapshots; in MySQL/InnoDB it uses gap locks to prevent phantoms but still allows write skew.
- SERIALIZABLE: prevents all standard anomalies; executions are equivalent to some serial order.
The real-world catch: "REPEATABLE READ" means different things in different engines. Postgres' repeatable read is snapshot isolation (which SQL-92 did not distinguish from true repeatable read). True serializability is expensive; most OLTP systems default to READ COMMITTED for throughput.
Follow-up: "How is Serializable Snapshot Isolation cheaper than 2PL?" SSI lets reads proceed under snapshot isolation and detects dangerous structures (rw-antidependencies forming a cycle) to abort offenders at commit time. No locks are held across the transaction lifetime. For workloads with few conflicts this is much cheaper than 2PL.
5. Two-Phase Locking (2PL)
Sample question: "Walk through strict 2PL and how it relates to deadlocks."
In 2PL a transaction acquires locks as it proceeds and releases them at commit or abort (strict 2PL). Shared locks for reads, exclusive for writes. The order in which transactions acquire locks determines the serial order they are equivalent to. 2PL produces serializable schedules.
Deadlocks are inherent: if T1 holds lock A and waits for B while T2 holds B and waits for A, neither proceeds. Databases detect deadlocks (PostgreSQL runs a periodic wait-for-graph scan; InnoDB does similar) and abort one transaction. Alternatively, timeouts are used (innodb_lock_wait_timeout).
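The wait-for-graph check can be sketched as a cycle search. Real detectors are incremental and lock-aware; this is a minimal DFS over an adjacency map for illustration:

```python
# Toy wait-for-graph deadlock detection: a deadlock exists iff the graph
# "transaction X waits for transaction Y" contains a cycle.
def has_deadlock(waits_for):
    def dfs(node, path):
        if node in path:
            return True                      # revisited a node: cycle
        for nxt in waits_for.get(node, []):
            if dfs(nxt, path | {node}):
                return True
        return False
    return any(dfs(txn, set()) for txn in waits_for)

# T1 waits for a lock T2 holds, and vice versa: the demo scenario below.
print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": []}))      # False
```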
-- Force a deadlock for demo purposes.
-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
-- Session 2:
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
-- Back in session 1:
UPDATE accounts SET balance = balance + 10 WHERE id = 2; -- waits
-- Session 2:
UPDATE accounts SET balance = balance - 10 WHERE id = 1; -- deadlock detected
Follow-up: "Why do long-running transactions cause pain under 2PL?" Locks are held until commit, so every conflicting transaction queues behind them. Under MVCC, long-running transactions also hold back garbage collection (VACUUM, InnoDB purge), bloating the database. Either way: keep transactions short.
6. Optimistic Concurrency Control
Sample question: "When is optimistic concurrency control (OCC) a better fit than pessimistic locking?"
OCC lets transactions run without locking, then validates at commit that no conflicting writes landed since they read. If validation fails, they abort and retry. OCC wins when conflicts are rare: reads vastly outnumber writes, or writes target disjoint keys. It loses when conflicts are common: transactions do work only to throw it away.
Postgres offers OCC-like semantics at SERIALIZABLE via SSI. FoundationDB uses pure OCC with resolver nodes that check read conflict ranges against committed writes. Spanner uses a mix: reads get a timestamp from TrueTime and writes acquire locks under a 2PC protocol.
Application-level OCC is common too: a version column or ETag. Read the row with version N, send the update with "WHERE version = N". If zero rows updated, someone raced you; re-read and retry.
UPDATE documents
SET body = $1, version = version + 1
WHERE id = $2 AND version = $3
RETURNING version;
-- If RETURNING produces no row, you lost the race: re-read and retry.
Follow-up: "How do you bound retry storms in OCC?" Exponential backoff with jitter, a cap on retry count, and a fallback to pessimistic locking or single-writer serialization for known hotspots.
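A bounded retry loop looks roughly like this. The sketch is illustrative: attempt_write stands in for the versioned UPDATE above, and here it is a stub that fails a fixed number of times so the demo runs instantly.

```python
import random

# Toy OCC retry: exponential backoff with full jitter and a hard retry cap.
def with_retries(attempt_write, max_retries=5, base_ms=10):
    for attempt in range(max_retries):
        if attempt_write():
            return attempt + 1                     # attempts it took
        sleep_ms = random.uniform(0, base_ms * 2 ** attempt)  # full jitter
        # time.sleep(sleep_ms / 1000)  # real code sleeps; omitted for the demo
    raise RuntimeError("giving up: fall back to pessimistic locking")

failures = iter([False, False, True])              # succeed on the third try
print(with_retries(lambda: next(failures)))        # 3
```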
7. Indexing Strategies
Sample question: "How does a composite index on (a, b) differ from two separate indexes on (a) and (b)?"
A composite index on (a, b) is sorted by a first, then by b within each a. It can serve queries on a alone, on (a, b) equality, and on a equality combined with a range or prefix of b. It cannot directly serve queries on b alone.
Two separate indexes on (a) and (b) let the planner do index intersection (Postgres bitmap AND). It is less efficient than a composite for WHERE a = ? AND b = ? because each index is scanned separately.
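The prefix rule falls out of the sort order, which a sorted list of tuples demonstrates. This is a toy (integer keys assumed so that (a + 1,) works as an upper bound), but the asymmetry is the real one:

```python
import bisect

# Toy composite index on (a, b): entries sorted by a, then b within a.
index = sorted([(1, "x"), (1, "y"), (2, "x"), (3, "z")])

def lookup_a(a):
    # A leading-a lookup is a contiguous, binary-searchable range.
    lo = bisect.bisect_left(index, (a,))
    hi = bisect.bisect_left(index, (a + 1,))
    return index[lo:hi]

def lookup_b(b):
    # No leading-a value to seek on: b-only must scan every entry.
    return [e for e in index if e[1] == b]

print(lookup_a(1))    # two binary searches, contiguous slice
print(lookup_b("x"))  # full scan of the index
```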
Other index types:
- Hash: O(1) point lookup, no range scans. Useful for equality-only.
- GIN / inverted: for arrays, JSONB, full-text.
- BRIN: block-range indexes; tiny but useful for sorted-on-disk data like time series.
- Partial indexes: a WHERE clause limits the indexed rows; the index is smaller and sometimes drastically faster.
- Covering (INCLUDE): stores extra columns in the index so the query can be served from the index alone.
-- Partial index example: only index active users.
CREATE INDEX idx_active_users_email
ON users(email) WHERE status = 'active';
-- Covering index example.
CREATE INDEX idx_orders_user_covered
ON orders(user_id) INCLUDE (status, total_cents);
Follow-up: "What is an index-only scan, and when does it not work?" Postgres serves the query entirely from the index, consulting the visibility map instead of the heap. It degrades when the visibility map cannot confirm that pages are all-visible, forcing heap fetches anyway. VACUUM maintains the visibility map, so skipping VACUUM quietly disables many index-only scans.
8. Query Planning and Execution
Sample question: "Walk through how a database plans and executes a join."
The planner converts SQL into a relational tree, normalizes predicates, and explores alternative plans. For each plan, it estimates cost using statistics (row counts, NDV, histograms, correlations). The cheapest plan wins.
Join algorithms:
- Nested loop: for each outer row, scan the inner. O(N*M) unless the inner is indexed, in which case O(N log M). Great for small outer sets.
- Hash join: build an in-memory hash table on the smaller side, probe with the larger. Great for medium sets, no indexes needed.
- Sort-merge: sort both inputs by the join key, then merge. Great for large inputs already sorted (e.g., by an index) or when ORDER BY matches.
Execution is usually pipelined (Volcano-style iterators) or push-based with vectorized batches (ClickHouse, DuckDB). Modern analytics databases compile the plan to machine code or use SIMD-friendly batch operations.
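The hash join's two phases are worth being able to write from memory. A minimal in-memory sketch (real engines partition both sides to disk when the build side exceeds work_mem):

```python
from collections import defaultdict

# Toy hash join: build a hash table on the smaller input, probe with the
# larger one. Output rows merge the matched build and probe rows.
def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = defaultdict(list)
    for row in build_rows:                        # build phase
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                        # probe phase
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

users = [{"id": 1, "email": "a@x"}, {"id": 2, "email": "b@x"}]
orders = [{"user_id": 1, "total": 5}, {"user_id": 1, "total": 7}]
print(hash_join(users, orders, "id", "user_id"))  # both orders join to user 1
```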
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, SUM(o.total_cents)
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.country = 'IN' AND o.created_at > now() - interval '7 days'
GROUP BY u.email;
Read the plan from the innermost operator outward. Look for:
- Estimated rows vs actual rows: large mismatch means statistics are stale (run ANALYZE).
- Sequential scans where you expect index scans: predicates not sargable, or index missing, or planner preferred a seq scan because the table is small.
- Hash spills to disk (Buckets and Batches vs work_mem).
Follow-up: "Why did adding an index make a query slower?" Because the planner picked the new index based on stale statistics or a misleading selectivity estimate. The index scan did many random reads where a sequential scan would have been cheaper. Fix: refresh statistics with ANALYZE, tune cost parameters such as random_page_cost, or as a last resort hint with pg_hint_plan.
9. Replication: Async, Semi-sync, Sync
Sample question: "Contrast async, semi-sync, and sync replication and the failure modes each has."
Async: the primary commits locally and ships WAL to replicas in the background. Low latency for writes but replicas lag. A primary crash can lose the last few seconds of committed data.
Semi-sync: the primary waits for at least one replica to acknowledge receipt (not necessarily apply) before returning commit success. Trades some latency for durability; loss requires both primary and all acknowledging replicas to fail.
Sync: the primary waits for one or more replicas to apply and confirm. Highest latency and the strongest durability. If all replicas are unreachable, writes stall.
Common nuances:
- "Synchronous" in Postgres is a spectrum: synchronous_commit = remote_apply means remote apply, the strongest. remote_write is weaker: the replica has received the WAL but may not have flushed it yet.
- MySQL's default semi-sync waits for one replica to ack receipt (not apply).
- Group replication and Raft-based systems (CockroachDB, YugabyteDB) replicate writes across a quorum; commit requires a majority of replicas to persist. This gives automatic failover without data loss.
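The quorum rule itself is one line of arithmetic, and it doubles as the answer to why a partitioned minority cannot accept writes. A minimal sketch (acknowledgement counting only; real systems also track log positions and terms):

```python
# Toy quorum commit: a write is durable once a strict majority of voting
# replicas acknowledge persisting it. A minority partition can never
# reach this threshold, which is what prevents split brain.
def quorum_commit(acks, voters):
    return acks >= voters // 2 + 1

print(quorum_commit(3, 5))  # True: majority of 5 reached
print(quorum_commit(2, 5))  # False: stall or fail the commit
```

With 5 voters the majority is 3, so any two partitions cannot both hold a quorum; this is also why clusters run an odd number of voting members.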
-- Postgres: observe replication lag.
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
Follow-up: "What is split brain and how do quorum systems avoid it?" Split brain is when two nodes both believe they are the primary and accept writes, creating irreconcilable divergence. Quorum systems elect a leader with a majority vote; a partitioned minority cannot elect a new leader and therefore cannot accept writes. This is why Raft-based databases run an odd number of voting members.
10. Partitioning and Sharding
Sample question: "Design a sharding strategy for a 50-terabyte orders table."
Partitioning (within one database) splits a table into smaller pieces, often by range (by date) or list (by region). It helps with maintenance (drop a partition to prune old data) and query performance (partition pruning). It does not give you horizontal scale beyond one node.
Sharding splits data across many nodes. Strategies:
- Range sharding: contiguous key ranges per shard. Good for range scans, bad for hotspots (one shard owns the most recent data, takes all writes).
- Hash sharding: hash the shard key, take modulo or use consistent hashing. Spreads load evenly. Range scans become scatter-gather.
- Directory sharding: a lookup service maps a key to its shard. Most flexible, adds a dependency.
Key choice dominates everything. A monotonically increasing primary key (created_at) under hash sharding is fine for writes but terrible for recent-data scans. A user_id key keeps a user's data colocated but can create whales: one user produces most of the traffic.
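Hash sharding is short enough to sketch end to end. This toy uses SHA-256 over the key and a simple modulo (a real deployment would use consistent hashing so that changing the shard count does not remap nearly every key):

```python
import hashlib

# Toy hash sharding: hash the shard key, mod the shard count.
# Deterministic placement and an even spread, but range scans over the
# key become scatter-gather across all shards.
def shard_for(key, num_shards):
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = [f"user:{i}" for i in range(1000)]
counts = [0] * 4
for k in keys:
    counts[shard_for(k, 4)] += 1
print(counts)  # roughly 250 per shard: the hash spreads hot prefixes too
```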
-- Native Postgres range partitioning.
CREATE TABLE orders (id bigint, created_at timestamptz NOT NULL, ...)
PARTITION BY RANGE (created_at);
CREATE TABLE orders_2026_04 PARTITION OF orders
FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
-- Cross-shard queries typically live at the application or proxy layer
-- (Citus, Vitess, ProxySQL, or a custom router).
Follow-up: "How do you rebalance shards without downtime?" Double-write to the old and new shards, backfill historical data, switch reads once the new shard is consistent, then stop writing to the old one. Vitess and Citus automate much of this. Consistent hashing reduces how much data has to move on each rebalance.
11. Common Mistakes Candidates Make
- Confusing durability with replication. fsync is about surviving a crash of one machine; replication is about surviving the loss of a machine.
- Saying "ACID is handled by the database, I don't need to think about it." Isolation defaults are usually READ COMMITTED; write skew and phantom reads bite applications that assume SERIALIZABLE.
- Treating secondary indexes as free. Every index adds write amplification and can slow down writes and VACUUM.
- Not understanding the difference between page cache and buffer pool. In Postgres, the shared buffer pool is in front of the kernel page cache; double buffering is real.
- Picking a sharding key that looks good for the initial workload and becomes a pain in 12 months. Think about hot-shard risk early.
- Underestimating replication lag. Read-your-writes semantics on an async replica is a footgun.
- Claiming that EXPLAIN plans are prescriptive. They are estimates; ANALYZE, and watch for estimate vs actual gaps.
- Believing that LSM-trees are universally "better for writes." They accept writes faster but pay compaction CPU and write-amp later. On flash with high write traffic, this matters.
12. FAQ
How deep do companies go into internals at a senior backend interview? Deep enough that you cannot bluff. Expect to describe MVCC in the database you claim to use daily, and to reason about at least one replication topology end to end. If you say you used Postgres for five years, expect a VACUUM question.
Do I need to know specific engines (MySQL, Postgres, Cassandra) or just the concepts? Concepts first, but interviewers respect concrete examples. "In Postgres, vacuum handles this; in InnoDB, purge threads do" shows range.
What about cloud-native databases like Aurora, Spanner, CockroachDB? Worth knowing at a high level. Aurora decouples compute from a replicated log-structured storage layer; Spanner uses TrueTime for global timestamps; CockroachDB is Spanner-inspired but clock-uncertainty tolerant. If you interview at a cloud-scale company, expect at least one of these.
Should I bother with NewSQL and distributed SQL? Yes if you interview at anyone doing horizontally scaled OLTP. Understand Raft, leader election, single-shard transactions vs 2PC across shards, and clock-based timestamp ordering.
What resources actually matter?
Alex Petrov's Database Internals, the Postgres internals documentation, the Jepsen analyses (especially Cassandra, MongoDB, CockroachDB), and CMU's database courses on YouTube. Reading the Postgres README files in src/backend/access/heap/, src/backend/storage/lmgr/, and src/backend/access/transam/ is surprisingly approachable.
How do I practice replication and sharding questions?
Spin up a three-node cluster (Patroni for Postgres, Vitess for MySQL, or CockroachDB single-binary). Break it: kill the leader, partition the network with iptables, induce lag with tc. Watch how the system behaves. The answers will then come from experience, not memorization.
13. Conclusion
Database internals questions reward candidates who have felt the sharp edges: watched VACUUM fall behind, debugged a replication lag cascade at 3 AM, or tracked a slow query to a missing stat. Nothing substitutes for running the systems you claim to understand.
The throughline across every topic in this post is the interplay between durability, concurrency, and latency. B-tree vs LSM, WAL group commit, MVCC, OCC, and sync vs async replication are all knobs on that triangle. When you can name the knob the interviewer is turning, you can answer any follow-up.
If you want structured mock interviews that drill exactly these scenarios with cross-examination from experienced backend and storage engineers, phantomcode.co runs targeted sessions on database internals, concurrency, and distributed systems.