By PhantomCode Team·Published April 22, 2026·Last reviewed April 29, 2026·16 min read
TL;DR

Data engineering interviews reward operability, not diagrams. Senior loops weight correctness under duplicates and late data, partition and shuffle reasoning, and the on-call story for every pipeline you propose. Drill window functions and sessionization until automatic, master Kimball plus event-centric modeling, articulate ELT tradeoffs with dbt and Spark, and ground every quality answer in contracts, freshness signals, and severity-based response plans.

Data Engineer Interview Guide: SQL, Modeling, Pipelines, and Streaming

Data engineering loops feel like backend interviews with the volume cranked up. You still have to design systems, but the systems must stay correct under duplicate events, late-arriving data, schema drift, and petabyte-scale scans. A hiring manager is trying to figure out whether you can make trustworthy data flow through the organization when the upstream producers are careless, the consumers are sensitive, and the audit team is paying attention. That is a very different optimization target than ordinary application development.

The craft has also become more contested. Lakehouse architectures have eaten into dedicated warehouses, streaming and batch have converged under incremental frameworks, and transformation has shifted almost entirely into SQL-first tools. Interviews have adjusted accordingly. Expect far more questions about modeling for change over time, correctness under retries, and the operational maturity of your pipelines. This guide walks through each round of a typical data engineering loop with concrete question shapes, answer scaffolds, and the traps that sink strong candidates.

Table of Contents

  • Loop Structure and Scoring Dimensions
  • SQL Depth: What Seniors Are Expected to Know
  • Data Modeling Beyond the Kimball Cliffs
  • ETL vs ELT Design Decisions
  • Streaming with Kafka and Flink
  • Warehousing Choices (Snowflake, BigQuery, Redshift)
  • dbt and the Transformation Layer
  • Data Quality, Contracts, and Observability
  • Common Mistakes That Sink Loops
  • Sample Questions with Answer Scaffolds
  • FAQ
  • Conclusion

Loop Structure and Scoring Dimensions

A typical data engineering loop has a phone screen, a SQL and coding round, a data modeling round, a system design round focused on pipelines, and a behavioral round. Many companies add a scenario round where you debug a broken pipeline or reason about a correctness failure in production.

The scoring dimensions usually break into four buckets. Correctness covers whether your answers handle duplicates, nulls, late data, and schema drift. Performance covers scan size, join strategy, partition pruning, and shuffle cost. Operability covers how easy your pipeline is to debug at 2 AM. Communication covers how you negotiate requirements and describe tradeoffs without hiding behind jargon.

Senior candidates distinguish themselves on the operability dimension. Junior candidates produce correct pipelines on paper that are impossible to operate. If your design does not answer questions like "how would an on-call engineer find out the pipeline is two hours late" or "how do we backfill without duplicating outputs," you are leaving points on the table.

SQL Depth: What Seniors Are Expected to Know

SQL rounds separate candidates fast. The bar for a senior role is comfort with window functions, recursive CTEs, lateral joins, pivot and unpivot idioms, and set-based reasoning on problems that naive solutions handle with loops. You should be able to switch between row-by-row narratives and set-based rewrites when the interviewer nudges.

Window functions are the single most-tested feature. Know the difference between ROW_NUMBER, RANK, and DENSE_RANK, when you need PARTITION BY versus a subquery, and how to compose multiple windows in one query. Running totals, gaps and islands, sessionization by inactivity, and top-N per group are patterns you should be able to write from memory. Expect a question that combines two of these, such as computing active sessions per user with both gap-based chunking and a filter on minimum session length.
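
The sessionization pattern is worth having at your fingertips. Here is a minimal sketch in standard SQL, assuming a hypothetical events table with user_id and event_at columns (interval syntax varies by dialect):

```sql
-- Sessionize events with a 30-minute inactivity gap, then filter on a
-- minimum session length. Step 1 flags session starts; step 2 turns the
-- flags into session ids with a running sum.
WITH flagged AS (
    SELECT
        user_id,
        event_at,
        CASE
            WHEN LAG(event_at) OVER w IS NULL
              OR event_at - LAG(event_at) OVER w > INTERVAL '30' MINUTE
            THEN 1 ELSE 0
        END AS is_new_session
    FROM events
    WINDOW w AS (PARTITION BY user_id ORDER BY event_at)
),
numbered AS (
    SELECT
        user_id,
        event_at,
        SUM(is_new_session) OVER (
            PARTITION BY user_id ORDER BY event_at
        ) AS session_id
    FROM flagged
)
SELECT
    user_id,
    session_id,
    MIN(event_at) AS session_start,
    MAX(event_at) AS session_end,
    COUNT(*)      AS events_in_session
FROM numbered
GROUP BY user_id, session_id
HAVING COUNT(*) >= 2;  -- minimum session length: at least two events
```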

Time-series reasoning is another frequent theme. Be comfortable generating date spines, joining event data against them, and explaining why LEFT JOIN against a spine handles gaps better than a raw aggregation. Understand how to express retention, cohort, and funnel analyses in SQL, including the pitfalls of timezone handling and the differences between event time and processing time.
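
A date spine sketch for reference. This uses PostgreSQL's generate_series; Snowflake, BigQuery, and dbt utility packages each have their own equivalent. Table and column names are hypothetical:

```sql
-- LEFT JOIN against a date spine so days with zero events appear as 0
-- rather than vanishing from the output, as a raw GROUP BY would.
WITH spine AS (
    SELECT d::date AS day
    FROM generate_series(
        DATE '2026-01-01', DATE '2026-03-31', INTERVAL '1 day'
    ) AS d
)
SELECT
    s.day,
    COUNT(e.event_id) AS events   -- counts non-null only, so gap days show 0
FROM spine s
LEFT JOIN events e
    ON e.event_at >= s.day
   AND e.event_at <  s.day + INTERVAL '1 day'
GROUP BY s.day
ORDER BY s.day;
```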

Join strategy is the second most-tested area. Know how to reason about hash joins, merge joins, and broadcast joins, why a small dimension table might be broadcast, why a skewed key is the single most common cause of a query stalling forever, and how to detect skew from the query profile. If you are interviewing at a company on Snowflake or BigQuery, learn their specific notions of clustering and partition pruning and be able to reason about micro-partitions or column stats.

Expect at least one question that checks whether you can reason about duplicates. A common prompt is: you have a table where a primary key should be unique but is not. Write a query that finds duplicates, decide which record to keep, and produce a deduplicated view without breaking downstream consumers. The subtle point is that the answer depends on an agreement with the source team about what a duplicate means; the candidate who asks that clarifying question scores above the candidate who guesses.
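
A sketch of that dedup flow, assuming a hypothetical raw_orders table and a keep-the-latest rule agreed with the source team:

```sql
-- Step 1: surface the duplicate keys.
SELECT order_id, COUNT(*) AS copies
FROM raw_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Step 2: expose a deduplicated view without touching the raw table,
-- keeping the most recently updated row per key.
CREATE VIEW orders_deduped AS
SELECT order_id, customer_id, amount_usd, updated_at
FROM (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC  -- the "which row wins" rule
        ) AS rn
    FROM raw_orders o
) ranked
WHERE rn = 1;
```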

Data Modeling Beyond the Kimball Cliffs

Kimball dimensional modeling is still the vocabulary most interviewers use. You should be fluent in facts, dimensions, grain, additive and semi-additive measures, slowly changing dimensions type 1 through type 6, conformed dimensions, and the difference between star and snowflake layouts. If a fact table has an ambiguous grain, you should catch it in the first two minutes of a modeling round.

Beyond Kimball, expect questions about event-centric and wide-table modeling. Many modern data platforms denormalize fact tables into wide tables that include dimensional attributes at the row level because storage is cheap and analysts prefer flat shapes. The tradeoff is that point-in-time accuracy requires careful reasoning about effective dates and historical snapshots. A senior candidate knows when to choose a classical dimensional model versus a one-big-table approach and can articulate the consequences for bi-temporal queries.

Data Vault shows up at larger enterprises. You do not need to implement a full Data Vault 2.0 pipeline in an interview, but know the roles of hubs, links, and satellites, why the pattern exists (auditable source-agnostic storage), and when its operational overhead is worth paying.

A particularly good signal is how you handle slowly changing dimensions. Candidates who default to type 2 for everything lose points to candidates who justify each choice. If a downstream analyst cares about the current state of a customer, type 1 may be the right call; if historical accuracy matters for revenue attribution, type 2; if both matter, an effective-dated slowly changing dimension or a separate history table. Name the consumer's question and the tradeoff.
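
To make that concrete, here is a minimal Type 2 shape with an effective-dated history plus a current-row view for analysts who only want present state. Names and types are hypothetical:

```sql
-- One row per version of a customer; exactly one current row per natural key.
CREATE TABLE dim_customer (
    customer_sk  BIGINT,      -- surrogate key, unique per version
    customer_id  BIGINT,      -- natural key from the source system
    segment      VARCHAR(50),
    valid_from   TIMESTAMP,
    valid_to     TIMESTAMP,   -- far-future sentinel for the open row
    is_current   BOOLEAN
);

-- Type 1-style convenience view: current state only.
CREATE VIEW dim_customer_current AS
SELECT customer_sk, customer_id, segment
FROM dim_customer
WHERE is_current;
```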

ETL vs ELT Design Decisions

The ETL versus ELT debate has mostly resolved in favor of ELT for analytical workloads because cheap warehouse compute has made pushing transformation to the sink economical. But the interview question is rarely "which is better?"; it is "how would you design this specific pipeline?" Good answers treat extraction, loading, and transformation as independent decisions.

For extraction, think about pull versus push, the cadence (event, micro-batch, batch), the idempotency story, and the backfill story. A common probe: how would you incrementally extract from a high-volume transactional source without missing rows when the clock skews. Strong answers invoke change data capture via the database's replication log, not application-level timestamps, because timestamps are vulnerable to out-of-order commits and clock drift.

For loading, consider the landing zone format, partitioning strategy, and schema evolution. Columnar formats such as Parquet with zstd compression are the default for analytic loads. Hive-style partitioning by ingestion date is standard; partitioning by a high-cardinality column is an anti-pattern that produces small files.
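
As a sketch of the landing shape (Spark SQL syntax; adjust for your engine, and the table name is hypothetical):

```sql
-- Hive-style layout: files land under .../ingest_date=2026-04-22/part-*.
-- Partitioning by ingestion date keeps file counts sane; partitioning by
-- a high-cardinality column like user_id would explode into small files.
CREATE TABLE raw.events_landing (
    event_id    STRING,
    payload     STRING,
    loaded_at   TIMESTAMP,
    ingest_date DATE
)
USING PARQUET
PARTITIONED BY (ingest_date);
```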

For transformation, choose between SQL-first tools such as dbt, a dataframe engine such as Spark, or a streaming engine such as Flink. Match the tool to the semantics: dbt for deterministic batch transforms modeled as SELECT statements, Spark for heavy joins and complex logic not expressible in SQL, Flink for event-time streaming joins and windowing. Candidates who default to Spark for everything look out of step with 2026 practice; dbt has become the standard for the middle layer of most organizations.

Idempotency is the hidden theme. Every stage must be safely re-runnable without producing duplicates or corrupting downstream state. Use overwrite-by-partition patterns, deterministic identifiers, and merge semantics rather than append-and-hope.
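
A merge-based load, sketched in ANSI-style SQL (exact MERGE syntax differs slightly across warehouses; names are hypothetical):

```sql
-- Re-runnable load: the deterministic key means a retry updates rows
-- instead of appending duplicates.
MERGE INTO fct_orders t
USING staged_orders s
    ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
    status     = s.status,
    amount_usd = s.amount_usd,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT
    (order_id, status, amount_usd, updated_at)
    VALUES (s.order_id, s.status, s.amount_usd, s.updated_at);
```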

Streaming with Kafka and Flink

Streaming rounds test whether you understand event time, windowing, state, and exactly-once semantics. The vocabulary is unforgiving. If you conflate event time and processing time in a window definition, you will produce incorrect results and the interviewer will press you.

Kafka-focused questions probe your understanding of partitioning, ordering guarantees, consumer groups, rebalance behavior, and the tradeoff between log compaction and time-based retention. Know why a partition key determines the unit of parallelism, why a skewed key is the silent killer of throughput, and how to design keys that preserve the ordering semantics consumers need. Be able to reason about the transactional producer, idempotent producer, and the cost of exactly-once guarantees end to end.

Flink-focused questions push into windowing. Tumbling, sliding, session, and global windows each have specific use cases. A strong answer includes the watermark strategy, the allowed lateness, the state backend, and how to handle late data without breaking downstream consumers. Savepoint management, state schema evolution, and checkpoint tuning show up in deeper rounds.
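
For reference, here is what an event-time tumbling window looks like in Flink SQL, as a sketch with connector options elided and hypothetical names:

```sql
-- The WATERMARK clause declares event time and tolerates 5 minutes of
-- out-of-order arrival; rows later than that are late data.
CREATE TABLE clicks (
    user_id    BIGINT,
    url        STRING,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH (
    'connector' = 'kafka'   -- plus topic, brokers, format, and so on
);

-- Hourly click counts via the windowing table-valued function.
SELECT window_start, window_end, COUNT(*) AS clicks
FROM TABLE(
    TUMBLE(TABLE clicks, DESCRIPTOR(event_time), INTERVAL '1' HOUR)
)
GROUP BY window_start, window_end;
```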

For system design prompts, expect variations on: design a fraud detection pipeline, design a real-time recommendation feature store, or design a clickstream aggregation system. The scaffold is consistent. Start from the SLA (latency, freshness, completeness), identify the event source and partitioning strategy, choose a stream processor, define the state, define the output sink, explain the exactly-once story, and discuss how you would replay historical data through the same pipeline. Mentioning the lambda versus kappa architecture debate is fine; arguing that a single engine with proper semantics can replace lambda scores higher in 2026.

Warehousing Choices (Snowflake, BigQuery, Redshift)

Interviewers want to know you can reason about the architecture of the warehouse you propose, not just name it. Snowflake's separation of storage and compute, its multi-cluster virtual warehouses, time travel, zero-copy clones, and micro-partitions are all fair game. BigQuery's serverless model, slot reservations, streaming ingestion, and the difference between INSERT-based and LOAD-based ingestion should be familiar. Redshift has modernized with RA3 nodes and separation of compute and storage; be ready to compare its distribution styles and sort keys.

Cost reasoning is the senior filter. A mid-level candidate says "we use BigQuery." A senior candidate says "we use BigQuery with reservations for the core pipelines and on-demand for ad hoc, we enforce slot quotas per team, and we have a weekly review of the top ten most expensive queries with recommendations." The operational maturity signal is obvious.

Know the failure modes of each platform. Snowflake's default settings suspend and resume warehouses liberally, which can cause cold starts in latency-sensitive paths. BigQuery's streaming inserts have a buffer that can delay deduplication and is a common source of "my row disappeared" tickets. Redshift's vacuum and analyze operations are no longer mandatory on RA3 but still affect performance in edge cases. Being able to name a specific operational foot gun per platform signals that you have run these systems, not just read the docs.

Iceberg, Delta Lake, and Hudi have moved the conversation toward open table formats on object storage. Be ready to explain why ACID transactions over Parquet on S3 or GCS are possible, how time travel works, and why interop between Spark, Trino, and warehouses has improved. If the role is a lakehouse role specifically, go deep on schema evolution, hidden partitioning in Iceberg, and compaction strategies.

dbt and the Transformation Layer

dbt has become the default medium for warehouse transformations, and interviewers expect fluency. The concepts are straightforward but the operational patterns are where candidates differentiate.

Know how to structure a project with staging, intermediate, and marts layers. Staging models should be thin renames and casts over raw sources with no joins. Intermediate models should encapsulate reusable business logic. Marts should be the final consumer-facing shapes with clear grain statements. Candidates who collapse the middle layer or put business logic in staging produce unmaintainable projects.
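
A staging model in that spirit, as a sketch with hypothetical source and column names:

```sql
-- models/staging/stg_orders.sql
-- Thin layer: rename and cast, nothing else. Joins and business logic
-- belong in intermediate models.
SELECT
    id                             AS order_id,
    user_id                        AS customer_id,
    CAST(amount AS DECIMAL(18, 2)) AS amount_usd,
    CAST(created_at AS TIMESTAMP)  AS ordered_at
FROM {{ source('app_db', 'orders') }}
```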

Testing is a strong signal. Generic tests such as not null, unique, accepted values, and relationships are baseline. Singular tests for business rules and dbt tests that check reconciliation against a source of truth separate senior from mid-level candidates. Exposing test results as a monitored metric and alerting when freshness degrades is an operability move interviewers notice.
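
A singular test is just a SELECT that returns violating rows; dbt fails the test if any come back. A reconciliation sketch, with hypothetical model names and a tolerance you would tune:

```sql
-- tests/assert_order_totals_reconcile.sql
-- Orders whose header amount disagrees with the sum of their lines.
SELECT
    o.order_id,
    o.amount_usd,
    SUM(l.line_amount_usd) AS line_total
FROM {{ ref('fct_orders') }} o
JOIN {{ ref('fct_order_lines') }} l USING (order_id)
GROUP BY o.order_id, o.amount_usd
HAVING ABS(o.amount_usd - SUM(l.line_amount_usd)) > 0.01
```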

Incremental models are a classic probe. Be able to explain when to use an incremental strategy versus a full refresh, how to handle late-arriving data, how merge versus append versus delete-insert strategies differ on different warehouses, and what the unique key guarantees mean. The most common bug is an incremental model that silently drops rows because the filter on event time is too strict; being able to name that bug preemptively signals experience.
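
A sketch of the defensive pattern, assuming hypothetical model names and a three-day lookback you would tune per source (interval syntax varies by warehouse):

```sql
-- models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

SELECT event_id, user_id, event_type, event_at
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- Lookback window: reprocess the trailing three days so late-arriving
-- events are merged rather than silently dropped by a strict filter.
WHERE event_at > (SELECT MAX(event_at) FROM {{ this }}) - INTERVAL '3 days'
{% endif %}
```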

Macros and packages are useful but often over-applied. Interviewers like candidates who resist writing a macro for everything and who can articulate when a macro helps versus hurts maintainability. Keep the bar high: if a macro hides an important transformation, a reviewer has to jump into its definition to understand the model, which is a net negative.

Data Quality, Contracts, and Observability

Data quality has moved from a nice-to-have to a first-class interview topic. Expect explicit questions about how you would define and enforce quality in a pipeline.

Contracts are the current industry direction. A data contract is an agreement between a producer and consumers about the schema, semantics, SLA, and evolution rules of a dataset. Be able to articulate the contents of a contract, how you would enforce it at ingestion (schema registry, CI checks on dbt source definitions, typed producers), and how you would handle contract violations (reject, quarantine, fallback).

Observability tooling has matured. Expect questions about freshness monitoring, volume anomaly detection, schema drift detection, and distribution monitoring for key columns. The answer pattern is: identify the signal, choose a reasonable detection approach, define the severity, and specify the response. For example, for freshness, alert on SLA breach with auto-page for pipelines classified as critical, auto-ticket for others.
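
A minimal freshness signal, as a sketch you would schedule and wire into alerting (hypothetical table; interval syntax varies by warehouse):

```sql
-- Staleness of the newest row versus a 2-hour SLA.
SELECT
    MAX(loaded_at)                                          AS last_load,
    CURRENT_TIMESTAMP - MAX(loaded_at)                      AS staleness,
    CURRENT_TIMESTAMP - MAX(loaded_at) > INTERVAL '2 hours' AS sla_breached
FROM raw_orders;
```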

Lineage is worth a dedicated answer. Column-level lineage is becoming standard in mature stacks because it supports impact analysis before schema changes and root cause analysis during incidents. Interviewers appreciate candidates who can articulate the value of lineage without turning the question into a vendor pitch.

A classic probe: a downstream dashboard is showing a drop in revenue. Walk through your investigation. Strong answers start from the semantic layer, check lineage, check freshness and volume across upstream models, compare recent schema changes, and only then dive into specific column distributions. Weak answers start by opening the raw data without a hypothesis.

Common Mistakes That Sink Loops

The most common failure is treating pipeline design as a drawing exercise rather than an operability exercise. Candidates produce elegant diagrams that cannot be debugged. Always describe the monitoring, alerting, backfill, and replay stories explicitly.

Another common mistake is underestimating scale. Interviewers often set up prompts with ambiguously large numbers to see if candidates make scale assumptions explicit. If the question says "billions of events per day," name the partitioning, the shuffle cost, the state size for any streaming components, and the storage cost per year. If the question says "thousands per day," resist the urge to over-engineer; a batch job on a modest warehouse is the right call.

Candidates often confuse streaming and real-time. Streaming is a processing paradigm; real-time is a latency SLA. You can build streaming pipelines that produce hourly outputs and batch pipelines that deliver within minutes. State that distinction clearly when asked.

Finally, candidates often skip the data governance angle. Privacy, retention, access control, and audit are increasingly part of senior loops. Be able to describe how you would classify data on ingestion, how you would apply row and column-level access controls, and how you would honor deletion requests across a pipeline that includes a warehouse, a lake, and derived marts.

Sample Questions with Answer Scaffolds

Sample one: design a pipeline that computes daily active users and sessionized engagement metrics from clickstream events. Scaffold: define the event contract, partition by event date, land as Parquet on object storage, define a session with a 30-minute inactivity gap using window functions or Flink session windows, materialize a wide session fact, build a user-day fact on top, expose marts in the warehouse with grain statements, and describe backfill behavior for late-arriving events.

Sample two: you are given a slowly changing dimension table that is tracking customer attributes with type 2. Analysts complain that joins to a fact table produce too many rows. Diagnose. Scaffold: check the join condition for a missing effective-dated range filter, check for overlapping rows in the dimension that signal a data bug, propose a view that exposes only the currently active row for each natural key, and articulate when the analyst should be joining against a snapshot view instead.
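
The corrected join, sketched against the Type 2 shape from earlier (names hypothetical):

```sql
-- Effective-dated range filter: each fact row matches exactly one
-- dimension version, eliminating the fan-out the analysts complained about.
SELECT f.order_id, f.amount_usd, d.segment
FROM fct_orders f
JOIN dim_customer d
    ON d.customer_id = f.customer_id
   AND f.ordered_at >= d.valid_from
   AND f.ordered_at <  d.valid_to;
```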

Sample three: design a Kafka-based pipeline that produces exactly-once aggregates to a warehouse. Scaffold: use an idempotent producer, partition by the aggregate key, use Flink with a transactional sink into a staging table, merge into the target with a deterministic key, document the failure modes, and explain how savepoints are used for schema evolution.

Sample four: write a SQL query that finds the top three products by revenue for each category, breaking ties by most recent sale. Scaffold: use a window function with ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC, last_sale_at DESC), filter to rank less than or equal to three, and discuss how you would index or cluster the underlying table to make this efficient.
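
Written out, with a portable subquery instead of QUALIFY and a hypothetical product_revenue table:

```sql
SELECT category, product_id, revenue, last_sale_at
FROM (
    SELECT
        category, product_id, revenue, last_sale_at,
        ROW_NUMBER() OVER (
            PARTITION BY category
            ORDER BY revenue DESC, last_sale_at DESC  -- tie-break on recency
        ) AS rn
    FROM product_revenue
) ranked
WHERE rn <= 3;
```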

Sample five: how would you model revenue in a company that offers subscriptions, one-time purchases, and refunds. Scaffold: identify the grain (transaction event), separate billable events from accounting recognition, use a financial calendar dimension, handle timezones explicitly, use a signed amount to represent refunds as negative revenue, expose a mart that supports both cash and accrual views, and layer retention and cohort views on top.
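
The signed-amount idea in miniature, with hypothetical names:

```sql
-- Refunds carry negative amounts, so SUM(signed_amount_usd) stays additive
-- across subscriptions, one-time purchases, and refunds alike.
SELECT
    transaction_id,
    customer_id,
    occurred_at,
    CASE WHEN transaction_type = 'refund' THEN -amount ELSE amount END
        AS signed_amount_usd
FROM billing_events;
```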

Sample six: a dashboard shows a 20 percent drop in signups today. Walk me through the investigation. Scaffold: confirm the metric definition, check data freshness for the signup source, compare row counts for the last seven days, check for schema changes in the upstream source, correlate with deploys, inspect a sample of today's events for drops in specific countries or platforms, and escalate to the product team with a hypothesis rather than a guess.

FAQ

Q1: Is Spark still an expected skill in 2026?

Yes, but the weight has shifted. Most shops run SQL-first transformation for analytics and reserve Spark for heavy custom logic, machine learning feature pipelines, and some streaming. Be comfortable reading and writing PySpark or Scala Spark; you do not need to be a performance tuning expert unless the role is explicitly a Spark role.

Q2: How much should I prepare on machine learning topics?

For a pure data engineering role, know how to build and operate feature stores, how to serve features online with low latency, and the governance concerns around training and inference data. You are not expected to design models, but feature engineering pipelines are fair game.

Q3: Do I need to know a specific cloud provider?

Target the cloud the company uses, but have baseline fluency in at least two. Most teams use managed warehouses and object storage; the differences at the pipeline layer are smaller than the differences at the analytical platform layer.

Q4: How do I prepare for take-home pipeline challenges?

Treat them as production code. Include tests, clear documentation on how to run, a description of assumptions you made about the input, and a postmortem of limitations. A clean, small submission beats a sprawling one every time.

Q5: What is the right way to talk about data quality during an interview?

Ground it in consumer expectations. Quality without a defined consumer is just opinion. For each key dataset, name the consumer, the contract, the observable signals, the severity classification, and the response plan. Interviewers love that kind of structure because it mirrors how mature data teams actually operate.

Q6: Are certifications useful?

They can be a tiebreaker for an initial screen but they rarely move a loop outcome. A portfolio of concrete pipelines, postmortems of incidents you handled, and well-reasoned opinions on tooling do more. If you do go for a certification, make it one that maps to the warehouse the target company uses.

Conclusion

Data engineering interviews reward the candidate who thinks like an owner of the pipeline rather than a producer of diagrams. Practice SQL until window functions and sessionization feel automatic. Build at least one end-to-end pipeline in a sandbox with real upstream messiness. Know the warehouse and streaming platforms well enough to reason about specific operational foot guns. Talk about data quality in terms of contracts, signals, and response plans rather than generic best practices.

Walk into every round ready to explain not just how you would build something, but how you would operate it at 2 AM when the upstream system starts sending malformed records. Candidates who pair that operational mindset with fluent SQL and clear modeling instincts close the loop comfortably.
