Operating Systems Interview Questions for Software Engineers
Operating systems questions separate candidates who have only used their laptop from candidates who know what actually happens when their program runs. At phantomcode.co we consistently see OS fundamentals come up in FAANG, HFT, and infrastructure interviews, because any non-trivial backend eventually collides with scheduling, memory, or filesystem behavior. This guide covers the topics interviewers actually probe, with the mental models and code you need to answer confidently.
This is not a textbook summary. Each section mirrors the way a senior interviewer actually digs: a direct question, a precise answer, a follow-up that trips most candidates, and a snippet you can run locally on a Linux box to validate your understanding.
Table of Contents
- Processes vs Threads
- Context Switches and Their Real Cost
- CPU Scheduling: Round-Robin, CFS, and Beyond
- Virtual Memory and Paging
- Memory Management: Heap, Stack, and mmap
- File Systems: Inodes, Dentries, and Journaling
- Syscalls and the User/Kernel Boundary
- Signals and Signal Handling Pitfalls
- Zombie and Orphan Processes
- Common Mistakes Candidates Make
- FAQ
- Conclusion
1. Processes vs Threads
Sample question: "Explain the difference between a process and a thread, and when you would prefer one over the other."
A process is an isolated address space with its own page tables, file descriptor table, and kernel accounting structure. A thread is a schedulable execution context that shares the address space of its parent process. On Linux there is no true thread abstraction in the kernel: both are task_struct, and threads are simply processes created with CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND.
Prefer processes when you need fault isolation (a crash should not poison sibling work), strong security boundaries (seccomp, namespaces, per-process credentials), or when you have CPU-bound work on a GIL-constrained runtime like CPython. Prefer threads when you need low-latency shared state, fast communication via shared memory, and when the cost of serialization across IPC would dominate.
```c
#include <unistd.h>

static int global_counter = 0;

int main(void) {
    // fork() gives a new address space; the child has its own heap.
    pid_t pid = fork();
    if (pid == 0) {
        // child: writes to globals will NOT be seen by the parent (copy-on-write).
        global_counter++;
        _exit(0);
    }
    return 0;
}
```

Follow-up that trips candidates: "How much does fork actually copy?" Answer: on modern Linux it copies page tables, not pages. Physical pages are marked copy-on-write. The real cost of fork scales with the size of your page tables, which is why a 64 GB JVM can take hundreds of milliseconds to fork even when it writes nothing. This is also why fork-based BGSAVE is a notorious source of latency spikes on large Redis instances.
2. Context Switches and Their Real Cost
Sample question: "Walk me through what happens on a context switch and where the cost comes from."
A context switch saves the register state of the running thread into its task_struct, flips the kernel stack pointer, updates the scheduler run queue, and restores the register state of the chosen next thread. If the new thread lives in a different address space, the kernel also reloads CR3 (on x86) to switch page tables.
The direct cost is small, usually 1 to 5 microseconds. The hidden cost is cache pollution. The new thread touches different code and data, so L1/L2 caches miss, TLB entries evict, and branch predictor state is effectively reset. On a memory-bound workload the effective cost can exceed 30 microseconds.
```shell
# Measure context switch rate on Linux.
vmstat 1 5
# 'cs' column is context switches per second.
# Per-process:
pidstat -w -p <PID> 1
# cswch/s (voluntary) vs nvcswch/s (involuntary).
```

A high nvcswch/s means the scheduler is preempting your thread before it yields. That usually indicates you are CPU-bound and oversubscribed. A high cswch/s means the thread is blocking, often on I/O or a lock.
Follow-up: "Why can a context switch be more expensive on a virtualized system?" Because you may also pay for a VM exit, where the hypervisor intercepts the CR3 write or the IPI used to reschedule a vCPU.
3. CPU Scheduling: Round-Robin, CFS, and Beyond
Sample question: "How does the Linux Completely Fair Scheduler work, and how does it differ from round-robin?"
Round-robin assigns every runnable task a fixed-length time slice and cycles through them. It is simple, predictable, and the basis for SCHED_RR in Linux real-time policies. Its weakness is that it treats every task equally; a burst-heavy task and a CPU hog share the same slice.
CFS (the default SCHED_OTHER policy until EEVDF replaced it in newer kernels) models an idealized multitasking CPU where every task runs at an equal fraction of CPU. It tracks vruntime per task, a weighted cumulative runtime where higher nice values make vruntime advance faster. The scheduler always picks the task with the smallest vruntime, stored in a red-black tree ordered by vruntime. Nice values map to weights, and a task's slice length depends on how many peers are runnable.
```c
// Check and set scheduling policy.
#include <sched.h>
#include <stdio.h>

int main(void) {
    struct sched_param sp = { .sched_priority = 0 };
    if (sched_setscheduler(0, SCHED_BATCH, &sp) != 0)
        perror("sched_setscheduler");
    // SCHED_BATCH hints that this task is non-interactive.
    // SCHED_IDLE runs only when nothing else wants the CPU.
    // SCHED_FIFO / SCHED_RR are real-time and preempt SCHED_OTHER.
    printf("current policy: %d\n", sched_getscheduler(0));
    return 0;
}
```

Follow-up: "What is EEVDF and why did Linux move to it?" Earliest Eligible Virtual Deadline First replaces CFS as the default scheduler in kernels 6.6 and later. It adds a deadline per task, derived from requested latency, so interactive tasks explicitly get lower latency without hacks like sched_wakeup_granularity_ns. It is still fair in the long run but respects latency budgets in the short run.
4. Virtual Memory and Paging
Sample question: "Explain virtual memory end to end, from a pointer dereference to the data in DRAM."
Every process sees a private virtual address space. When a thread dereferences a pointer, the CPU splits the virtual address into a page number and an offset. The MMU consults the TLB (Translation Lookaside Buffer) first. On a hit, it forms the physical address in one cycle. On a miss, the page table walker reads up to four (on x86-64) or five (with LA57) levels of page tables from memory to produce a PTE. If the PTE is marked present and the access is allowed, the physical address is formed and the cache hierarchy supplies the data. If the PTE is not present, the CPU raises a page fault.
The kernel's page fault handler then does one of: allocate a zero page (anonymous first-touch), read a page from disk (file-backed or swap), fire a copy-on-write duplication, or deliver SIGSEGV.
```shell
# Observe page faults.
/usr/bin/time -v ./myprog
# Major (page faults): blocked on I/O to bring the page in.
# Minor (page faults): no I/O, just allocation or COW.
# Huge pages help reduce TLB pressure on large heaps.
cat /sys/kernel/mm/transparent_hugepage/enabled
```

Follow-up: "Why does a 100 GB malloc succeed instantly on Linux?" Because malloc calls mmap/brk, which only reserve virtual memory. Pages are not allocated until you touch them. The kernel uses demand paging and overcommits by default, controlled via /proc/sys/vm/overcommit_memory. This is also how the OOM killer can surprise you: your process passed malloc but dies minutes later on first write.
5. Memory Management: Heap, Stack, and mmap
Sample question: "What are the differences between heap and mmap allocations, and when does glibc pick one over the other?"
glibc's malloc uses brk for small allocations (grows the heap linearly) and mmap for large ones (a threshold around 128 KB by default, tunable via M_MMAP_THRESHOLD). Heap-style allocations are cheap to reuse but suffer from fragmentation: a single long-lived allocation can anchor the heap top and prevent shrinking. mmap allocations are independent regions, unmapped cleanly on free, and good for large buffers.
The stack is a special region. The main thread's stack grows on demand as the kernel handles faults just below it; each additional thread gets a fixed-size stack, 8 MB by default on Linux, set via pthread_attr_setstacksize. Blowing the stack triggers SIGSEGV because the guard page below it is inaccessible.
```c
#include <malloc.h>
#include <stdio.h>

int main(void) {
    // Force malloc to use mmap for anything above 64 KB.
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);
    // Dump the current arena state as XML.
    malloc_info(0, stdout);
    return 0;
}
```

Follow-up: "Why can RSS grow even though your program frees memory?" Because free returns memory to the allocator, not the kernel. Heap memory is only returned to the OS when the top of the heap is free. Use malloc_trim(0) or jemalloc with background_thread:true to encourage return.
6. File Systems: Inodes, Dentries, and Journaling
Sample question: "Describe how a filesystem resolves the path /var/log/syslog and what structures are involved."
The kernel starts at the root inode, which is pinned in memory. It looks up var in the root directory's data blocks, which hold (name, inode number) pairs. It then loads the var inode, repeats for log, then syslog. Each lookup hits the dentry cache first, a global VFS hash table that memoizes (parent directory, name) pairs to inodes. A complete path lookup with a cold cache involves multiple disk reads and inode table lookups.
An inode holds metadata (size, mode, uid, gid, timestamps) and block pointers. A filename is not part of the inode; it lives in the parent directory's entries. This is why hard links are cheap: they are just additional directory entries pointing to the same inode.
Journaling protects metadata (or data, depending on mode) against crashes. ext4's default ordered mode writes data blocks first, then journals the metadata, then commits the metadata to its final location. After a crash, the journal is replayed, so metadata never points into garbage. data=journal mode journals everything but halves write bandwidth. data=writeback gives up ordering for speed.
```shell
# Inspect an inode directly.
stat /etc/hostname
ls -li /etc/hostname   # first column is the inode number.
# Check journal mode.
mount | grep " on / "
# See filesystem debug info on ext4.
sudo debugfs -R "stat <130023>" /dev/nvme0n1p2
```

Follow-up: "Why does copying 1 million small files take so much longer than one 1 GB file even on SSD?" Because each file involves at least two synchronous metadata operations (create, close) and individual inode updates, plus fsync barriers the application may issue. You are limited by metadata IOPS and journal commits, not by sequential bandwidth.
7. Syscalls and the User/Kernel Boundary
Sample question: "What happens on a syscall like read, and why does it cost more than a function call?"
User code issues syscall on x86-64, which traps into the kernel at the MSR-configured entry point. The CPU switches to kernel mode (CPL 0), loads a kernel stack, saves user registers, and dispatches through the syscall table. The kernel validates arguments (because pointers from userspace can be malicious or invalid), does the work, copies results back with copy_to_user (which handles page faults safely), and returns with sysret.
The cost is not the mode switch alone. It is: stack switch, register save/restore, argument validation, and now also the KPTI page table switch introduced after Meltdown. A no-op syscall on modern x86-64 with mitigations is roughly 200 to 700 ns.
```c
// Measure syscall overhead: run this under `perf stat` or `time`.
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 1000000; i++)
        getppid(); // one of the cheapest real syscalls; never vDSO-accelerated.
    return 0;
}
```

Follow-up: "How do vDSO and io_uring reduce this cost?" vDSO maps a small piece of kernel code into every process so trivially safe calls like clock_gettime do not need a real trap. io_uring uses two shared ring buffers (submission and completion) so userspace can queue thousands of I/O operations with zero or at most one syscall, and the kernel can process them in batches.
8. Signals and Signal Handling Pitfalls
Sample question: "Can you write a correct handler for SIGINT that flushes a log buffer?"
Probably not, if you have not done it before. Signal handlers run in the context of the interrupted thread. Almost nothing is async-signal-safe. You cannot call malloc, printf, or any function that touches a lock that the main thread might hold. The conventional pattern is to set a sig_atomic_t flag and let the main loop observe it.
```c
#include <signal.h>

static volatile sig_atomic_t stop_requested = 0;

static void handle_sigint(int sig) {
    (void)sig;
    stop_requested = 1; // async-signal-safe: just set a flag.
}

int main(void) {
    struct sigaction sa = { .sa_handler = handle_sigint };
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; // restart interrupted syscalls where possible.
    sigaction(SIGINT, &sa, NULL);
    while (!stop_requested) {
        do_work();
    }
    flush_logs(); // safe here, on the main thread.
}
```

Follow-up: "What does SA_RESTART not cover?" read on a socket with a timeout set via SO_RCVTIMEO will still return EINTR. poll, select, and epoll_wait are never restarted. If you rely on SA_RESTART you must still check for EINTR and handle it.
A better modern pattern is signalfd, which converts signals into readable file descriptors and integrates cleanly with an event loop:
```c
#include <signal.h>
#include <sys/signalfd.h>

sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGINT);
sigprocmask(SIG_BLOCK, &mask, NULL); // block first, or the default handler fires.
int sfd = signalfd(-1, &mask, SFD_CLOEXEC);
// Now read struct signalfd_siginfo from sfd inside your epoll loop.
```

9. Zombie and Orphan Processes
Sample question: "A long-running daemon is accumulating zombie children. What is happening and how do you fix it?"
A zombie is a terminated process whose parent has not yet called wait or waitpid. The kernel retains the exit status and accounting so the parent can retrieve them. Zombies hold a process table entry, not memory. If the parent never reaps, the entries accumulate and eventually you run out of PIDs.
Fixes in order of preference:
- Have the parent call waitpid(-1, ..., WNOHANG) in a loop from its SIGCHLD handler.
- Set SIGCHLD's disposition with SA_NOCLDWAIT. The kernel will auto-reap with no zombie.
- Double-fork. The grandchild is re-parented to init (PID 1), which reaps it.

```c
struct sigaction sa = { .sa_handler = SIG_DFL, .sa_flags = SA_NOCLDWAIT };
sigaction(SIGCHLD, &sa, NULL);
// Any children that exit are auto-reaped by the kernel.
```

An orphan is a live process whose parent has died. The kernel re-parents it to PID 1 (or a subreaper, if one was set via PR_SET_CHILD_SUBREAPER). Orphans are normal and harmless. They only become a problem if they were holding a process group and now run unsupervised.
Follow-up: "Why do containers need an init process?" Because in a PID namespace, orphans are re-parented to the namespace's PID 1, which is responsible for reaping them. Running a single binary like python app.py as PID 1 means signals get special semantics (SIGTERM is ignored by default unless a handler is installed) and zombies accumulate because nothing reaps them. tini or Docker's --init flag inserts a small init that reaps children and forwards signals.
10. Common Mistakes Candidates Make
These show up repeatedly in phantomcode.co mock interviews.
- Saying processes "are slower" than threads without quantifying it. Fork is cheap on modern Linux. The real costs are IPC and duplicated working set.
- Confusing physical memory with RSS. RSS is the resident portion of virtual memory; a mapped-but-not-touched region counts as zero RSS.
- Believing volatile is sufficient for multithreaded synchronization. It is not; you need atomics or locks. volatile is useful for memory-mapped hardware and for signal handlers.
- Forgetting that signal (the old API) has unportable semantics for handler reinstallation and syscall restarting. Use sigaction.
- Explaining the page table walk without mentioning the TLB. Interviewers will stop you and ask where the TLB fits.
- Saying "thread context switches are free within the same process." They are cheaper (no CR3 reload, no TLB flush in most cases) but not free. Cache effects still apply.
- Describing CFS as round-robin. It is not; it is weighted fair queuing over vruntime.
- Calling printf from a signal handler in a code sample. Instant red flag.
11. FAQ
How deep do OS questions go at L5 and above? At senior level you should be able to reason about lock contention under the scheduler, cache coherence (MESI), and trade-offs between copy-on-write and pre-zeroed allocation. At staff level, expect questions on NUMA placement, kernel bypass (DPDK, io_uring, SPDK), and the interaction between cgroups v2 and the scheduler.
Do I need to know x86 specifics? Knowing enough to discuss page table levels, CR3, and the TLB is useful. You do not need to memorize opcodes. ARM specifics (e.g., ASID-tagged TLBs) come up only at hardware-adjacent teams.
How should I practice? Run strace, perf, ftrace, and bpftrace against real programs. Watch what syscalls Redis, Postgres, or nginx make under load. Read a small Linux subsystem end to end: the signal code in kernel/signal.c is a great one-week project.
What resources are worth the time? Operating Systems: Three Easy Pieces (free), Robert Love's Linux Kernel Development, and the kernel's own Documentation tree. For practical drills, work through Julia Evans' zines and recreate her experiments.
Will interviewers ask me to write a scheduler? Rarely on a whiteboard. Occasionally on a take-home. More commonly they will show you a scheduler skeleton and ask you to reason about starvation, priority inversion, or fairness.
12. Conclusion
Strong OS answers are grounded in specifics: the actual struct, the actual syscall, the actual cost in nanoseconds. Memorizing definitions will get you past a phone screen but will collapse under follow-ups. The fastest way to internalize this material is to instrument a real program, break it, and watch the kernel's reaction in strace, perf, and dmesg.
If you can explain what happens when your program calls read with the same confidence you explain what happens when it calls a function, you will be ahead of the vast majority of candidates. For structured practice with live feedback on exactly these topics, the mock interview platform at phantomcode.co drills OS fundamentals with realistic follow-ups.