Data Engineering

A comprehensive Data Engineering curriculum covering the entire data production chain: from environment setup with Docker and GCP, to pipeline orchestration with Airflow and dbt, to Data Warehouse design with BigQuery and PostgreSQL. Learn to handle streaming data with PySpark, Pub/Sub, and Apache Beam, and to deploy to production with Kubernetes and Terraform. Master CI/CD best practices, monitoring, and modern data architectures.

What you'll learn

Development environments: Linux, Git, GitHub, VS Code, advanced Python

CI/CD and code quality: Ruff, Pylint, Poetry, GitHub Actions

Containerization with Docker and Docker Compose

APIs with FastAPI: design, deployment, documentation

Data Lake: ingestion, storage, raw data organization

Data Warehouse with BigQuery: schemas, partitioning, optimization

PostgreSQL: setup, administration, comparison with managed solutions

Data ingestion with Fivetran and Airbyte

Transformation with dbt: models, tests, documentation, modularity

Orchestration with Apache Airflow: DAGs, scheduling, monitoring

Big Data with PySpark: large-scale transformations

Data streaming: Google Pub/Sub, Apache Beam, Dataflow

Kubernetes: container deployment, scaling, production clusters

Infrastructure as Code with Terraform

Advanced databases: GraphDB, Document DBs, Wide Column DBs

Logging, monitoring and pipeline observability
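
Several of the advanced Python features listed above (decorators, generators, context managers) recur throughout data pipeline code. A minimal stdlib-only sketch, with illustrative names of our own choosing, not from any specific course material:

```python
import time
from contextlib import contextmanager
from functools import wraps

def timed(func):
    """Decorator: report how long the wrapped call takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

def read_in_batches(rows, batch_size=2):
    """Generator: yield fixed-size batches lazily instead of loading everything."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch

@contextmanager
def job_context(name):
    """Context manager: guaranteed setup/teardown around a pipeline step."""
    print(f"starting {name}")
    try:
        yield
    finally:
        print(f"finished {name}")

@timed
def run():
    with job_context("toy-load"):
        print(list(read_in_batches(range(5), batch_size=2)))  # [[0, 1], [2, 3], [4]]

run()
```

The same three patterns appear in real tooling: Airflow task decorators, streaming reads as generators, and context managers around database connections.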

Key topics to master

The most important concepts for understanding these technologies and acing your interviews

1. Linux & Shell: essential commands, bash scripting, permissions, cron jobs

2. Git & GitHub: branching, merge, rebase, pull requests, CI/CD workflows

3. Advanced Python: OOP, decorators, generators, context managers, typing, async/await

4. CI/CD: linting (Ruff, Pylint), packaging (Poetry), tests, GitHub Actions, pipelines

5. Docker: Dockerfile, images, containers, volumes, networks, multi-stage builds

6. Docker Compose: multi-container services, dependencies, healthchecks, local orchestration

7. FastAPI: routes, Pydantic models, dependencies, middleware, deployment

8. Advanced SQL: window functions, CTEs, analytical queries, optimization, indexing

9. BigQuery: serverless architecture, partitioning, clustering, costs, UDFs, federated queries

10. PostgreSQL: configuration, replication, indexing (B-tree, GIN, GiST), VACUUM, EXPLAIN ANALYZE

11. Data Modeling: star schema, fact/dimension tables, normalization, SCD, data vault

12. ELT vs ETL vs ETLT: patterns, trade-offs, architecture choices

13. Fivetran & Airbyte: connectors, sync modes, CDC, schema evolution

14. dbt: models, sources, refs, tests, snapshots, incremental models, Jinja macros

15. Apache Airflow: DAGs, operators, sensors, XCom, connections, pools, task dependencies

16. PySpark: RDD vs DataFrame, transformations, actions, partitioning, broadcast variables

17. Streaming: Pub/Sub (topics, subscriptions), Apache Beam (PCollections, transforms, windowing), Dataflow

18. Kubernetes: pods, deployments, services, ingress, ConfigMaps, Secrets, Helm, scaling

19. Terraform: providers, resources, state, modules, plan/apply, infrastructure as code

20. IAM & security: least-privilege principles, service accounts, GCP roles

21. NoSQL databases: GraphDB (Neo4j), Document DBs (MongoDB, Firestore), Wide Column (Cassandra, Bigtable)

22. Data Architecture: Data Lake vs Data Warehouse vs Data Lakehouse, Data Mesh, Data Contracts

23. Monitoring & observability: logging, metrics, alerting, SLA/SLO/SLI, data quality checks
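
The "Advanced SQL" topic above (window functions, CTEs) can be demonstrated without any warehouse at all, using Python's built-in `sqlite3`. A minimal sketch with an invented `orders` table; it assumes a SQLite build recent enough (3.25+) to support window functions:

```python
import sqlite3

# Tiny in-memory table, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('alice', '2024-01-05', 120.0),
  ('alice', '2024-02-10',  80.0),
  ('bob',   '2024-01-20', 200.0),
  ('bob',   '2024-03-01',  50.0);
""")

query = """
WITH ranked AS (                           -- CTE: name an intermediate result
  SELECT customer,
         order_date,
         amount,
         ROW_NUMBER() OVER (               -- window function:
           PARTITION BY customer           --   restart numbering per customer
           ORDER BY order_date             --   in chronological order
         ) AS order_rank,
         SUM(amount) OVER (
           PARTITION BY customer
           ORDER BY order_date
         ) AS running_total                -- cumulative spend per customer
  FROM orders
)
SELECT customer, order_date, order_rank, running_total
FROM ranked
WHERE order_rank = 1                       -- first order of each customer
ORDER BY customer;
"""
for row in conn.execute(query):
    print(row)
# ('alice', '2024-01-05', 1, 120.0)
# ('bob', '2024-01-20', 1, 200.0)
```

The same query shape runs unchanged on BigQuery or PostgreSQL; window functions compute per-row analytics without collapsing rows the way `GROUP BY` does.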
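
The star-schema idea from the Data Modeling topic (a central fact table of measures, joined to descriptive dimension tables) can also be sketched in a few lines. A hypothetical two-dimension schema, again using stdlib `sqlite3`; table and column names are invented for the example:

```python
import sqlite3

# Hypothetical minimal star schema: one fact table, two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE fact_sales (
  customer_key INTEGER REFERENCES dim_customer(customer_key),
  date_key     INTEGER REFERENCES dim_date(date_key),
  amount       REAL                       -- additive measure
);
INSERT INTO dim_customer VALUES (1, 'alice', 'FR'), (2, 'bob', 'US');
INSERT INTO dim_date VALUES
  (20240105, '2024-01-05', '2024-01'),
  (20240210, '2024-02-10', '2024-02');
INSERT INTO fact_sales VALUES
  (1, 20240105, 120.0),
  (2, 20240105, 200.0),
  (1, 20240210,  80.0);
""")

# Typical star-schema query: aggregate the fact, slice by dimension attributes.
query = """
SELECT d.month, c.country, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date d     ON d.date_key     = f.date_key
GROUP BY d.month, c.country
ORDER BY d.month, c.country;
"""
for row in conn.execute(query):
    print(row)
# ('2024-01', 'FR', 120.0)
# ('2024-01', 'US', 200.0)
# ('2024-02', 'FR', 80.0)
```

The design choice is the one interviewers probe: measures live in the fact table, descriptive attributes in dimensions, so any report is one aggregation plus key-based joins.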