Top 25 Data Engineering Interview Questions in 2026
The 25 most asked data engineering interview questions in 2026, covering SQL, data pipelines, ETL/ELT, Spark, Kafka, data modeling, and system design with detailed answers.

Data engineering interview questions in 2026 span far beyond SQL basics. Hiring teams now probe system design, real-time architectures, cost optimization, and AI-readiness. This guide covers the 25 questions that appear most frequently in data engineer interviews at companies ranging from startups to FAANG, with answers written for practitioners.
Modern data engineering interviews prioritize problem-solving over tool memorization. Expect questions on trade-offs (batch vs streaming, star vs snowflake schema), not just syntax recall. Demonstrating clear reasoning matters more than a perfect answer.
SQL and Query Optimization Questions
SQL remains the foundation of every data engineering interview. Even senior candidates face SQL questions, typically focused on performance rather than syntax.
1. What is the difference between a window function and GROUP BY?
GROUP BY collapses rows into aggregated results, reducing the row count. Window functions compute values across a set of rows related to the current row without collapsing them. The output retains every original row.
-- GROUP BY: one row per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
-- Window function: every row preserved, avg added
SELECT
employee_id,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;The window function version is essential when the query needs both row-level detail and aggregate context simultaneously, a common requirement in data quality checks and reporting pipelines.
2. How would you optimize a slow query on a 500M-row fact table?
A structured approach works best:
- Check the execution plan (
EXPLAIN ANALYZE) to identify full table scans, hash joins on unindexed columns, or spills to disk. - Partition the table by date or another high-cardinality column that aligns with query filters.
- Add targeted indexes on frequently filtered columns, but avoid over-indexing (each index adds write overhead).
- Materialize intermediate results if the same subquery runs repeatedly across multiple dashboards.
- Push predicates down to filter early, reducing data shuffled between stages.
The key signal interviewers look for: understanding that optimization is iterative and context-dependent, not a checklist.
3. Explain CTEs vs subqueries vs temporary tables. When do you use each?
CTEs (Common Table Expressions) improve readability and allow recursive queries. Most query engines inline them, so performance matches subqueries. Temporary tables physically materialize data, which helps when the same intermediate result feeds multiple downstream queries in a session. Subqueries work well for simple, one-off transformations.
Rule of thumb: CTE for clarity, temp table for reuse, subquery for trivial inline filters.
For a deeper dive into advanced SQL patterns, see SQL for Data Analysts: Window Functions, CTEs and Advanced Queries.
ETL vs ELT and Data Pipeline Architecture
Pipeline design questions test architectural thinking. Interviewers want to see trade-off analysis, not tool advocacy.
4. ETL vs ELT: when would you choose one over the other?
| Criteria | ETL | ELT | |----------|-----|-----| | Transform location | Before loading (external compute) | After loading (warehouse compute) | | Best for | Legacy systems, strict schemas | Cloud warehouses (BigQuery, Snowflake) | | Schema flexibility | Low (transform locks schema early) | High (raw data available for re-transformation) | | Cost model | Compute outside warehouse | Warehouse compute costs scale with transforms | | Data freshness | Higher latency (transform step) | Lower latency (load first, transform on demand) |
ELT dominates in modern cloud-native stacks because storage is cheap, compute scales elastically, and keeping raw data enables schema evolution. ETL remains relevant when regulatory requirements mandate data transformation before it enters the warehouse (PII scrubbing, for example).
5. Design an idempotent data pipeline. Why does idempotency matter?
Idempotency guarantees that running a pipeline multiple times with the same input produces the same result without duplicating data. This matters because pipelines fail and get retried.
Implementation strategies:
- Upsert patterns (
MERGEorINSERT ... ON CONFLICT) keyed on natural or composite keys - Partition overwrite: replace entire date partitions instead of appending
- Deduplication at write: assign deterministic IDs (hash of business key + event timestamp)
# partition_overwrite.py
# Idempotent write: overwrite the entire partition for a given date
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("idempotent_load").getOrCreate()
df = spark.read.parquet("s3://raw-events/2026-04-11/")
# Transform: deduplicate by event_id, keep latest
df_deduped = (
df.orderBy("event_timestamp", ascending=False)
.dropDuplicates(["event_id"])
)
# Overwrite only the target partition
df_deduped.write \
.mode("overwrite") \
.partitionBy("event_date") \
.parquet("s3://warehouse/events/")This approach ensures a retry of the April 11 run replaces the April 11 partition rather than appending duplicate rows.
6. What is data lineage and why is it critical in production pipelines?
Data lineage tracks the origin, movement, and transformation of data across the pipeline. It answers: where did this number come from, what transformations were applied, and what downstream systems depend on it.
Practical value: when a dashboard shows unexpected values, lineage enables root-cause analysis in minutes instead of hours. Tools like OpenLineage, dbt docs, and cloud-native lineage (BigQuery column-level lineage) automate this.
Ready to ace your Data Engineering interviews?
Practice with our interactive simulators, flashcards, and technical tests.
Streaming and Real-Time Data Processing
Streaming questions have shifted from "what is Kafka" to architectural decisions about when and how to use streaming.
7. Batch vs streaming: how do you decide which to use?
The decision depends on latency requirements, not technology preference:
- Batch: acceptable when data consumers tolerate minutes-to-hours delay. Simpler to build, test, and debug. Lower operational cost for most workloads.
- Streaming: required when business logic depends on sub-minute freshness: fraud detection, real-time pricing, live dashboards.
- Micro-batch: a pragmatic middle ground (Spark Structured Streaming, for example) that provides near-real-time with batch-like simplicity.
The strongest answer acknowledges that most pipelines should start as batch and move to streaming only when a concrete latency SLA demands it.
8. Explain Kafka's architecture: topics, partitions, consumer groups
Kafka organizes data into topics (logical channels). Each topic splits into partitions (ordered, immutable logs) distributed across brokers. Producers write records to partitions (round-robin or key-based). Consumer groups parallelize reads: each partition is assigned to exactly one consumer within a group, enabling horizontal scaling.
Key trade-off: more partitions increase parallelism but add overhead to the broker (file handles, replication). A typical starting point is 6-12 partitions per topic, scaling based on throughput requirements.
9. How do you handle late-arriving data in a streaming pipeline?
Late data is inevitable in distributed systems. Three approaches:
- Watermarks: define a threshold (e.g., 10 minutes) beyond which late data is dropped. Apache Flink and Spark Structured Streaming support this natively.
- Reprocessing windows: periodically re-run aggregations for recent time windows to capture stragglers.
- Delta/upsert sinks: write to a mutable store (Delta Lake, Apache Iceberg) that supports updates, allowing late records to correct previous results.
The right approach depends on whether downstream consumers need exactly-once semantics or can tolerate eventual consistency.
Data Modeling and Schema Design
Data modeling questions reveal whether a candidate thinks about how data will be consumed, not just how it is stored.
10. Star schema vs snowflake schema: trade-offs?
Star schema denormalizes dimension tables into flat, wide tables joined to a central fact table. Snowflake schema normalizes dimensions into sub-dimensions.
| Aspect | Star | Snowflake | |--------|------|-----------| | Query performance | Faster (fewer joins) | Slower (more joins) | | Storage | Higher (redundant data) | Lower (normalized) | | Complexity | Simple to understand | More complex relationships | | Maintenance | Harder to update dimensions | Easier to update sub-dimensions | | Best for | BI/reporting (read-heavy) | Situations requiring strict normalization |
In practice, star schemas dominate in analytics warehouses (BigQuery, Snowflake, Redshift) because query performance and simplicity outweigh storage savings. Explore more in the data modeling interview module.
11. What is a slowly changing dimension (SCD)? Explain Type 1, 2, and 3.
SCDs track how dimension attributes change over time:
- Type 1: Overwrite the old value. No history. Simple but lossy.
- Type 2: Add a new row with version tracking (
valid_from,valid_to,is_current). Full history preserved. Most common in production warehouses. - Type 3: Add a column for the previous value (
current_city,previous_city). Limited history (only one change tracked).
Type 2 is the default choice for most business dimensions (customer address, product category, employee department) because audit trails and historical reporting require it.
12. How would you model event data for both real-time analytics and long-term storage?
A dual-layer approach:
- Hot layer: stream events into a real-time store (Apache Druid, ClickHouse, or Materialized Views in a warehouse) optimized for low-latency aggregations on recent data.
- Cold layer: land raw events in a lakehouse (Delta Lake, Iceberg) partitioned by date, retained indefinitely for ad-hoc analysis and ML feature engineering.
The hot layer uses pre-aggregated or denormalized schemas for speed. The cold layer retains full granularity. A reconciliation job periodically validates that both layers agree.
Apache Spark and Distributed Processing
Spark questions focus on understanding parallelism and performance, not just API calls.
13. What causes data skew in Spark, and how do you fix it?
Data skew occurs when one partition holds disproportionately more data than others, causing a single task to bottleneck the entire stage.
Common causes: joining on a key with a few dominant values (null keys, a single popular product ID), or partitioning by a low-cardinality column.
Fixes:
- Salting: append a random suffix to the skewed key, join on the salted key, then aggregate to remove the salt.
- Broadcast join: if one side is small enough (<200MB typically), broadcast it to avoid shuffle entirely.
- AQE (Adaptive Query Execution): Spark 3.x+ can auto-detect and split skewed partitions at runtime.
# salting_technique.py
# Fix skewed join by salting the large table's key
from pyspark.sql import functions as F
SALT_BUCKETS = 10
# Add salt to the large (skewed) table
large_df = large_df.withColumn(
"salted_key",
F.concat(F.col("join_key"), F.lit("_"), (F.rand() * SALT_BUCKETS).cast("int"))
)
# Explode the small table to match all salt values
small_df = small_df.crossJoin(
spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn(
"salted_key",
F.concat(F.col("join_key"), F.lit("_"), F.col("salt"))
).drop("salt")
# Join on salted key (evenly distributed)
result = large_df.join(small_df, "salted_key")This distributes the workload evenly across executors, eliminating the bottleneck.
14. Explain the difference between transformations and actions in Spark.
Transformations (map, filter, select, join) are lazy: they build a logical plan but execute nothing. Actions (count, collect, write) trigger the actual computation by submitting the plan to the cluster.
This lazy evaluation enables Spark's Catalyst optimizer to reorder operations, push predicates down, and eliminate unnecessary shuffles before any data moves. Understanding this distinction is essential for debugging: a transformation that seems slow is actually fine; the bottleneck is in the action that triggers it.
15. What is the difference between repartition() and coalesce()?
repartition(n) performs a full shuffle to create exactly n partitions, distributing data evenly. Use it to increase parallelism or rebalance skewed partitions.
coalesce(n) reduces partitions without a shuffle by merging adjacent partitions. Use it to decrease partition count before writing (avoiding many small files).
Rule: coalesce for reducing, repartition for increasing or rebalancing.
Orchestration and Pipeline Management
Orchestration questions test operational maturity: how pipelines run, fail, and recover in production.
16. Compare Airflow, Dagster, and Prefect. When would you choose each?
| Feature | Airflow | Dagster | Prefect | |---------|---------|---------|--------| | Maturity | Most mature, largest community | Growing, strong in ML/data teams | Cloud-first, strong DX | | Core abstraction | DAGs of tasks | Assets (data-centric) | Flows and tasks | | Testing | Harder (DAG-level) | Built-in asset testing | Good task-level testing | | Local dev | Needs Docker/containers | Native local execution | Native local execution | | Best for | Complex, long-running ETL | Data mesh / asset-oriented teams | Teams wanting minimal infra |
Airflow remains the industry standard for mature data platforms. Dagster fits teams that think in terms of data assets rather than task sequences. Prefect appeals to teams that want Airflow-like orchestration with less operational overhead. For Airflow-specific interview prep, see the Airflow fundamentals module.
17. How do you handle pipeline failures and alerting in production?
A production-grade failure strategy includes:
- Retry with backoff: transient failures (network timeouts, API rate limits) resolve with exponential retry.
- Dead-letter queues: poison messages route to a separate queue for manual inspection without blocking the pipeline.
- Circuit breakers: after N consecutive failures, pause the pipeline and alert rather than flooding downstream systems with bad data.
- Observability: structured logs, metrics (task duration, row counts, error rates), and data quality checks (Great Expectations, dbt tests) at each pipeline stage.
- Alerting tiers: P1 (pipeline down, data missing) pages on-call. P2 (data quality drift) sends Slack notification for next business day.
18. What makes a pipeline "production-ready" vs a prototype?
Production readiness means: idempotent runs, automated retries, monitoring and alerting, data validation at ingestion and transformation stages, documentation of data contracts, version-controlled pipeline definitions, and a tested rollback path. A prototype has none of these. The gap between the two is where most pipeline technical debt accumulates.
Cloud Data Platforms and Lakehouse Architecture
Cloud platform questions assess whether candidates understand cost, scale, and architecture trade-offs beyond vendor-specific syntax.
19. What is a lakehouse, and why has it become the dominant architecture?
A lakehouse combines the flexibility of a data lake (schema-on-read, raw data storage, multiple file formats) with the reliability of a data warehouse (ACID transactions, schema enforcement, query optimization). Technologies like Delta Lake, Apache Iceberg, and Apache Hudi enable this by adding a metadata/transaction layer on top of object storage (S3, GCS, ADLS).
The lakehouse wins because it eliminates the costly ETL layer between lake and warehouse, supports both BI queries and ML workloads on the same data, and leverages cheap object storage.
20. How do you optimize cloud warehouse costs (BigQuery, Snowflake, Redshift)?
Cost optimization strategies:
- Partition and cluster tables to reduce bytes scanned per query
- Set appropriate warehouse sizes (Snowflake) or slot reservations (BigQuery) based on workload patterns
- Auto-suspend idle resources: Snowflake warehouses should auto-suspend after 1-5 minutes of inactivity
- Monitor per-query costs: BigQuery's
INFORMATION_SCHEMA.JOBSview reveals expensive queries - Use materialized views for repeatedly computed aggregations
- Archive cold data to cheaper storage tiers with lifecycle policies
The interview signal: candidates who think about cost as a first-class engineering constraint, not an afterthought.
Data Quality and Governance
Data quality questions separate engineers who ship pipelines from those who maintain trusted data platforms.
21. How do you implement data quality checks in a pipeline?
Data quality checks belong at three stages:
- Ingestion: validate schema conformance, check for null primary keys, verify row count thresholds against source.
- Transformation: assert business rules (e.g., revenue > 0, dates in valid range), check referential integrity across tables.
- Serving: monitor metric drift (sudden 50% drop in daily active users likely indicates a pipeline issue, not a business change).
Tools: dbt tests (schema and custom), Great Expectations, Soda Core, Monte Carlo (data observability). The best approach integrates checks directly into the pipeline DAG so failures block downstream processing.
22. What is a data contract, and how does it prevent pipeline breakage?
A data contract is a formal agreement between a data producer and its consumers specifying: schema (column names, types, nullability), freshness SLA (data available by 6 AM UTC), volume expectations (between 1M and 10M rows daily), and semantic rules (status field contains only ACTIVE, INACTIVE, SUSPENDED).
When producers change their schema without updating the contract, automated validation catches the mismatch before it propagates downstream. This shifts pipeline failures from silent data corruption to explicit, actionable errors.
Ready to ace your Data Engineering interviews?
Practice with our interactive simulators, flashcards, and technical tests.
System Design and Architecture
System design rounds are standard for mid-to-senior data engineering roles. These test end-to-end thinking.
23. Design a real-time analytics pipeline for an e-commerce platform.
A solid design addresses:
Ingestion: clickstream events and transaction events flow into Kafka topics, partitioned by user_id for ordering guarantees within a user session.
Processing: a Flink or Spark Structured Streaming job consumes from Kafka, enriches events with product catalog data (broadcast join from a dimension table), computes session-level and 5-minute aggregations.
Serving: aggregated metrics land in a real-time OLAP store (ClickHouse or Apache Druid) for dashboard queries with sub-second latency. Raw events land in Delta Lake/Iceberg for historical analysis.
Data quality: schema registry (Confluent or AWS Glue) validates event schemas at ingestion. Streaming data quality checks flag anomalies (sudden drop in event volume) and route to a dead-letter topic.
Failure handling: Kafka's consumer offsets enable exactly-once processing with checkpointing. The streaming job auto-restarts from the last checkpoint on failure.
24. How would you migrate a legacy ETL pipeline to a modern stack?
Migration follows a strangler fig pattern:
- Audit: catalog all existing pipelines, their sources, transformations, and consumers. Identify dependencies and SLAs.
- Parallel run: build the new pipeline alongside the old one. Both consume the same sources and write to separate targets.
- Validation: compare outputs between old and new pipelines. Automated reconciliation queries flag discrepancies.
- Cutover: once outputs match within tolerance for 2+ weeks, switch consumers to the new target. Keep the old pipeline in read-only mode for rollback.
- Decommission: after a stability period, shut down the legacy pipeline.
The key insight: never do a big-bang migration. Run both systems in parallel until confidence is established.
25. A critical dashboard shows incorrect numbers. Walk through your debugging process.
A systematic approach:
- Scope the problem: which metrics are wrong? Since when? All dimensions or specific filters?
- Check the serving layer: query the warehouse directly to confirm whether the issue is in the dashboard tool (caching, filter logic) or the data itself.
- Trace lineage upstream: follow the data from the dashboard table back through each transformation step. Check row counts and key metrics at each stage.
- Identify the break point: the stage where expected values diverge from actual values is the root cause location.
- Check for common culprits: schema changes in source systems, failed pipeline runs that went unalerted, late-arriving data that missed the processing window, or a deployment that changed transformation logic.
- Fix and prevent: patch the immediate issue, add a data quality check that would have caught it, and update the runbook.
This question tests incident response maturity. Interviewers want to see structured thinking, not random guessing.
Preparing for the Data Engineering Interview
The data engineering interview preparation track on SharpSkill covers these topics with interactive practice across ETL/ELT patterns, data modeling, Airflow, BigQuery, and more.
Conclusion
- SQL window functions, CTEs, and query optimization remain foundational regardless of seniority level
- ETL vs ELT decisions should be driven by latency requirements, data governance needs, and infrastructure costs rather than trend-following
- Idempotency, data contracts, and data quality checks separate production-grade pipelines from prototypes
- Streaming architecture questions focus on trade-offs (batch vs streaming vs micro-batch) rather than tool-specific knowledge
- Data modeling choices (star vs snowflake, SCD types) should align with consumption patterns, not theoretical best practices
- System design answers should address failure handling, cost optimization, and observability alongside the happy path
- The strongest candidates demonstrate structured problem-solving and honest reasoning about trade-offs
Start practicing!
Test your knowledge with our interview simulators and technical tests.
Tags
Share
Related articles

dbt in 2026: Data Transformations, Testing and Interview Questions
Master dbt (data build tool) with this hands-on tutorial covering SQL transformations, layered modeling, testing strategies, and real interview questions for data engineering roles in 2026.

Apache Spark 4 in 2026: New Features, Structured Streaming and Interview Questions
A comprehensive guide to Apache Spark 4.x covering ANSI mode, VARIANT type, Real-Time Mode streaming, Spark Connect, and common data engineering interview questions with code examples.

Apache Kafka for Data Engineers: Streaming, Partitions and Interview Questions
Apache Kafka deep dive for data engineers covering streaming architecture, partition strategies, consumer groups, and common interview questions with practical examples using Kafka 4.x and KRaft.