Data Engineering

Data Lake - Architecture and Ingestion

Data Lake architecture, zones (raw/refined/curated), formats (Parquet, Avro, JSON), ingestion, partitioning

20 interview questions · Mid-Level
1

What is a Data Lake?

Answer

A Data Lake is a centralized storage system that holds raw data in its native format, whether structured, semi-structured, or unstructured. Unlike a Data Warehouse, which imposes a schema at write time (schema-on-write), a Data Lake applies the schema at read time (schema-on-read). Data can therefore be ingested before its structure is known, and different consumers can interpret the same files in different ways.

2

What is the main difference between schema-on-write and schema-on-read?

Answer

Schema-on-write validates and transforms data before storage, which guarantees a consistent structure but means new or changed fields must be modeled before they can be loaded. Schema-on-read stores data in its raw format and applies a schema only when the data is read, so ingestion accepts any input, but every read pays the cost of parsing and validation, and malformed records are discovered only at query time.
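
The contrast can be sketched in a few lines of Python. This is a minimal illustration, not how any particular warehouse or lake engine works: the `SCHEMA` dictionary, the field names, and all function names are hypothetical, and plain Python lists stand in for storage.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "amount": float}


def conform(record: dict) -> dict:
    """Cast a record to SCHEMA; raises if a field is missing or not castable."""
    return {name: typ(record[name]) for name, typ in SCHEMA.items()}


def write_schema_on_write(store: list, record: dict) -> None:
    # Validate and cast BEFORE storing: bad records are rejected at ingestion.
    store.append(conform(record))


def write_schema_on_read(store: list, raw_line: str) -> None:
    # Store the raw text untouched: nothing is checked at ingestion.
    store.append(raw_line)


def read_schema_on_read(store: list) -> list:
    # Parse and cast at query time: bad records surface only now.
    rows = []
    for line in store:
        try:
            rows.append(conform(json.loads(line)))
        except (KeyError, ValueError, TypeError):
            continue  # skip records that do not fit the schema
    return rows
```

With schema-on-write, a record such as `{"user_id": "x", "amount": "1"}` fails at write time. With schema-on-read, the same malformed input is stored without complaint and is only filtered out (or reported) when `read_schema_on_read` runs.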

3

What are the three classic zones of a Data Lake?

Answer

A common Data Lake architecture comprises three zones: Raw (Bronze) for data stored as ingested, without transformation; Refined (Silver) for cleaned and normalized data; and Curated (Gold) for aggregated data ready for consumption. This layered organization supports governance, traceability, and data quality, because each dataset can be traced back to the layer it was derived from.
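
The flow between the three zones can be sketched as two small transformations. This is an illustrative sketch under stated assumptions: the `country` and `amount` fields, the cleaning rules, and the function names are all hypothetical, and in-memory lists stand in for files in each zone.

```python
from collections import defaultdict


def to_refined(raw_rows):
    """Raw -> Refined: drop incomplete rows, normalize casing and types."""
    refined = []
    for row in raw_rows:
        if row.get("country") is None or row.get("amount") is None:
            continue  # incomplete record stays in Raw only
        refined.append({
            "country": row["country"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return refined


def to_curated(refined_rows):
    """Refined -> Curated: aggregate to a total amount per country."""
    totals = defaultdict(float)
    for row in refined_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


# Raw zone: records kept exactly as ingested, including messy ones.
raw = [
    {"country": " fr ", "amount": "10.5"},
    {"country": "FR", "amount": "4.5"},
    {"country": None, "amount": "3"},
    {"country": "de", "amount": "2"},
]

refined = to_refined(raw)
curated = to_curated(refined)
```

Because the Raw records are never modified, the Refined and Curated outputs can be rebuilt at any time if the cleaning or aggregation rules change.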

4

What is the role of the Raw (Bronze) zone in a Data Lake?

5

Which file format is best suited for storing large analytical data in a Data Lake?
