Data Engineering

Modern Data Architecture

Data Lake vs Data Warehouse vs Lakehouse, Data Mesh, Data Contracts, schema registry, ADR, governance, data catalog, lineage

20 interview questions · Senior
1

What is the fundamental difference between a Data Lake and a Data Warehouse?

Answer

A Data Lake stores data in its native (raw) format and applies a schema at read time (schema-on-read), which gives a lot of flexibility for exploration. A Data Warehouse enforces a structured schema at write time (schema-on-write) and holds transformed data optimized for analytics. A Data Lake prioritizes flexibility and low-cost, high-volume storage; a Data Warehouse prioritizes query performance and data quality.
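The difference between schema-on-read and schema-on-write can be sketched in a few lines of Python. This is only a toy illustration: `RawLake`, `Warehouse`, `SCHEMA`, and `cast_record` are hypothetical names invented for the example, and in-memory lists stand in for real storage.

```python
import json

# Hypothetical schema for the example: column name -> Python type used for casting.
SCHEMA = {"user_id": int, "amount": float}


def cast_record(record, schema):
    """Cast a dict to the schema; raises KeyError/ValueError/TypeError on bad input."""
    return {col: typ(record[col]) for col, typ in schema.items()}


class RawLake:
    """Schema-on-read: accept any payload, apply the schema only when reading."""

    def __init__(self):
        self.blobs = []

    def write(self, raw_line):
        self.blobs.append(raw_line)  # no validation at write time

    def read(self, schema):
        rows = []
        for line in self.blobs:
            try:
                rows.append(cast_record(json.loads(line), schema))
            except (KeyError, ValueError, TypeError):
                # Malformed records are only discovered (and here skipped) at read time.
                continue
        return rows


class Warehouse:
    """Schema-on-write: reject non-conforming rows before they are stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, raw_line):
        # Raises if the row does not fit the schema, so bad data never lands.
        self.rows.append(cast_record(json.loads(raw_line), self.schema))

    def read(self):
        return list(self.rows)
```

With this sketch, writing `'{"user_id": "x"}'` to `RawLake` succeeds and the problem only appears when reading, whereas `Warehouse.write` raises immediately. That is the trade-off described above: flexibility at ingestion versus quality guarantees at query time.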

2

What are the main advantages of a Lakehouse architecture compared with an architecture that keeps the Data Lake and the Data Warehouse separate?

Answer

A Lakehouse architecture combines the strengths of both: the flexible, inexpensive storage of a Data Lake with the ACID guarantees, query performance, and governance of a Data Warehouse. It removes data duplication between systems and reduces synchronization cost and complexity, while open formats such as Delta Lake, Iceberg, and Hudi let BI and ML workloads run on the same platform.

3

Which open table formats enable ACID transactions on a Data Lake?

Answer

Delta Lake, Apache Iceberg, and Apache Hudi are the three main open table formats that enable ACID transactions on a Data Lake. Delta Lake, developed by Databricks, uses a transaction log to guarantee atomicity and consistency. Iceberg, created at Netflix, offers advanced partition management and schema evolution. Hudi, developed at Uber, excels at upsert and CDC scenarios. These formats turn plain object storage into a Lakehouse with transactional guarantees.
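To make the transaction-log idea concrete, here is a minimal sketch of how an ordered commit log can give atomic, versioned table updates on top of plain files. It is not the actual implementation of Delta Lake, Iceberg, or Hudi: `TinyTableLog`, the `_log` directory, and the `add`/`remove` entry layout are assumptions made for illustration, and a local filesystem stands in for object storage.

```python
import json
import os


class TinyTableLog:
    """Toy commit log: each version is one JSON file listing added/removed data files."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")  # hypothetical layout
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, read_version, add=(), remove=()):
        """Try to publish version read_version + 1; return False if another writer won."""
        path = os.path.join(self.log_dir, f"{read_version + 1:08d}.json")
        try:
            # Mode "x" is exclusive creation: it fails if the file already exists,
            # so only one writer can claim a given version number.
            # (A real format must also make the file contents visible atomically.)
            with open(path, "x") as f:
                json.dump({"add": list(add), "remove": list(remove)}, f)
        except FileExistsError:
            return False
        return True

    def snapshot(self, version=None):
        """Replay commits 0..version to compute the set of live data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as f:
                entry = json.load(f)
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live
```

A writer reads the latest version, writes its data files, and then calls `commit`; if `commit` returns `False`, another writer got there first and the caller has to re-read and retry. Readers only see files referenced by committed log entries, and `snapshot(version)` shows how the same log can serve reads of older table versions.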

4

What are the fundamental principles of Data Mesh?

5

What is a Data Contract in the context of Data Mesh?


