Data Engineering

Apache Airflow - Basics

DAGs, operators (Bash, Python, SQL), scheduling, task dependencies, the Airflow UI, connections, variables, trigger rules

20 interview questions · Mid-Level
1

What is a DAG in Apache Airflow?

Answer

A DAG (Directed Acyclic Graph) is a collection of tasks organized by their dependencies and relationships, and it represents a complete workflow. Acyclic means the dependency graph cannot contain loops, so the tasks always have a valid execution order and each task runs once per DAG run (retries aside). A DAG defines when and in what order tasks run, but not what each task actually does.
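To make the "no loops" requirement concrete, here is a minimal plain-Python sketch that uses the standard library's graphlib module. It does not use Airflow itself, and the task names (extract, transform, load) are hypothetical. It only illustrates why an acyclic graph has a valid execution order while a graph with a loop does not.

```python
from graphlib import CycleError, TopologicalSorter

# Map each hypothetical task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# An acyclic graph always has at least one valid execution order.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "extract" depend on "load" closes a loop: no task can go first.
cyclic = {**dependencies, "extract": {"load"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Airflow performs a comparable check when it parses a DAG file and refuses to load a DAG whose dependencies contain a cycle.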

2

Which DAG parameter defines the date from which the Scheduler starts scheduling runs?

Answer

The start_date parameter defines the date from which Airflow starts scheduling DAG runs. Combined with schedule_interval (named schedule since Airflow 2.4), it determines each run's data interval. Important: if start_date is in the past, Airflow may trigger a backfill to catch up on the missed runs unless catchup=False is set.
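The interaction between start_date, the schedule, and catchup can be sketched with plain datetime arithmetic. This is an illustration under simplifying assumptions (a fixed daily interval, no time zones), not Airflow's scheduler logic. The helper due_intervals and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def due_intervals(start_date, interval, now):
    """Return (interval_start, interval_end) pairs whose end has already passed."""
    intervals = []
    start = start_date
    while start + interval <= now:
        intervals.append((start, start + interval))
        start += interval
    return intervals


start_date = datetime(2024, 1, 1)   # assumed start_date, in the past
now = datetime(2024, 1, 4, 12, 0)   # assumed "current" time
all_runs = due_intervals(start_date, timedelta(days=1), now)

# With catch-up enabled, one run is created for every elapsed interval.
print(len(all_runs))

# With catchup=False, only the most recent elapsed interval gets a run.
latest_only = all_runs[-1:]
print(latest_only)
```

A run covers an interval that has already ended, which is why the interval starting on January 4 is not yet due at noon that day.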

3

Which operator should you use to run a Python function in an Airflow DAG?

Answer

PythonOperator runs a Python callable in an Airflow DAG. The function is passed through the python_callable parameter and can receive arguments through op_args (a list) or op_kwargs (a dictionary). It is one of the most commonly used operators because it offers great flexibility for running custom Python code.
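The way op_args and op_kwargs reach the callable can be shown without Airflow installed. The sketch below is plain Python. The function greet and its arguments are hypothetical, and the final call only approximates what the operator does at execution time: it unpacks the list positionally and the dictionary as keyword arguments.

```python
def greet(name, punctuation="!"):
    """Hypothetical task callable."""
    return f"Hello, {name}{punctuation}"


# In a DAG file these would be passed to the operator, for example:
#   PythonOperator(task_id="greet", python_callable=greet,
#                  op_args=["Airflow"], op_kwargs={"punctuation": "?"})
op_args = ["Airflow"]
op_kwargs = {"punctuation": "?"}

# Approximately what happens when the task runs.
result = greet(*op_args, **op_kwargs)
print(result)
```

Keeping the callable a plain function like this also makes it easy to unit test outside of Airflow.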

4

How do you define a dependency between two tasks, task_a and task_b, so that task_b runs after task_a?

5

Which cron expression represents a daily run at midnight?
