
Data Lake - Architecture and Ingestion
Data Lake architecture, zones (raw/refined/curated), formats (Parquet, Avro, JSON), ingestion, partitioning
1. What is a Data Lake?
Answer
A Data Lake is a centralized storage system that holds raw data in its native format, whether structured, semi-structured, or unstructured. Unlike a Data Warehouse, which imposes a schema at write time (schema-on-write), a Data Lake applies a schema at read time (schema-on-read), so the same data can be explored and analyzed in ways that were not planned at ingestion.
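As a rough illustration of schema-on-read, the sketch below keeps raw JSON lines untouched and lets each consumer apply its own schema when reading. The sample records, the `read_with_schema` helper, and the two schemas are hypothetical and exist only for this example.

```python
import json

# Raw events stored as-is (hypothetical sample data); fields vary between records.
RAW_LINES = [
    '{"user_id": "42", "event": "click", "ts": "2024-01-05T10:00:00"}',
    '{"user_id": "7", "event": "purchase", "amount": "19.90"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: project and cast only the requested fields."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields absent from a raw record become None instead of failing.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# Two consumers read the same raw data through different schemas.
activity_schema = {"user_id": int, "event": str}
revenue_schema = {"user_id": int, "amount": float}

activity = read_with_schema(RAW_LINES, activity_schema)
revenue = read_with_schema(RAW_LINES, revenue_schema)
```

The stored data never changes; only the schema passed at read time decides which fields are visible and how they are typed.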
2. What is the main difference between schema-on-write and schema-on-read?
Answer
Schema-on-write validates and transforms data before it is stored, which guarantees a consistent structure but limits flexibility. Schema-on-read stores data in its raw format and applies a schema only when the data is read, which makes ingestion flexible but shifts the parsing and validation cost to every read.
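The contrast can be sketched with two write paths that receive the same records. This is a minimal illustration, not a real warehouse or lake API: `SCHEMA`, `write_validated`, `write_raw`, and the sample records are all assumptions made for the example.

```python
SCHEMA = {"user_id": int, "amount": float}  # hypothetical target schema

def write_validated(table, record):
    """Schema-on-write: validate and cast before storing; reject bad records."""
    missing = set(SCHEMA) - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    table.append({field: cast(record[field]) for field, cast in SCHEMA.items()})

def write_raw(store, record):
    """Schema-on-read: store the record untouched; interpretation happens later."""
    store.append(record)

warehouse_table, lake_store = [], []
good = {"user_id": "42", "amount": "19.90"}
bad = {"user_id": "7"}  # the amount field is missing

for rec in (good, bad):
    write_raw(lake_store, rec)  # always succeeds
    try:
        write_validated(warehouse_table, rec)
    except ValueError as exc:
        print("rejected at write time:", exc)
```

Here the raw store accepts both records, while the validated table keeps only the one that matches the schema. The incomplete record still has to be handled by whoever reads the raw store later.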
3. What are the three classic zones of a Data Lake?
Answer
The standard Data Lake architecture comprises three zones: Raw (Bronze) for untransformed raw data, Refined (Silver) for cleaned and normalized data, and Curated (Gold) for aggregated data ready for consumption. This layered organization facilitates governance, traceability, and data quality.
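The flow through the three zones could look roughly like the sketch below, which uses a temporary local directory in place of real lake storage. The folder layout, file names, function names, and sample orders are hypothetical, and JSON files stand in for whatever format a real lake would use.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def ingest_raw(lake, lines):
    """Raw (Bronze): land the source payload unchanged."""
    path = lake / "raw" / "orders.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n")
    return path

def refine(lake, raw_path):
    """Refined (Silver): drop malformed rows, normalize types and casing."""
    rows = []
    for line in raw_path.read_text().splitlines():
        try:
            rec = json.loads(line)
            rows.append({"country": rec["country"].strip().upper(),
                         "amount": float(rec["amount"])})
        except (json.JSONDecodeError, KeyError, ValueError):
            continue  # a real pipeline would likely quarantine these rows
    path = lake / "refined" / "orders.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows))
    return path

def curate(lake, refined_path):
    """Curated (Gold): aggregate into a consumption-ready dataset."""
    totals = defaultdict(float)
    for row in json.loads(refined_path.read_text()):
        totals[row["country"]] += row["amount"]
    path = lake / "curated" / "revenue_by_country.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dict(totals), sort_keys=True))
    return path

source_lines = [
    '{"country": " fr", "amount": "10.5"}',
    '{"country": "FR", "amount": "4.5"}',
    '{"country": "de", "amount": "3"}',
    'not json',
]

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    curated_path = curate(lake, refine(lake, ingest_raw(lake, source_lines)))
    revenue = json.loads(curated_path.read_text())
    raw_kept = (lake / "raw" / "orders.jsonl").read_text().splitlines()
```

Because the raw zone keeps every source line, including the malformed one, the refined and curated outputs can be rebuilt later if the cleaning rules change. That property is what supports the traceability mentioned above.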