Apache Airflow in 2026: Pipeline Orchestration, DAGs and Interview Questions
Master Apache Airflow 3.2 with this hands-on tutorial covering DAG authoring with the Task SDK, pipeline orchestration patterns, asset partitions, and real interview questions for data engineering roles in 2026.

Apache Airflow 3.2 represents the most significant evolution of the open-source orchestration platform since its 3.0 architectural overhaul. With over 10 million monthly downloads on PyPI and adoption across companies ranging from Airbnb to Spotify, Airflow remains the dominant tool for building, scheduling, and monitoring data pipelines. This tutorial covers DAG authoring with the new Task SDK, pipeline orchestration patterns, and the interview questions that surface in data engineering roles.
Airflow 3.2, released in April 2026, introduces asset partitions for granular data-aware scheduling, native async support in the PythonOperator, and multi-team deployments. All DAG imports now use the stable airflow.sdk namespace introduced in Airflow 3.0.
DAG Authoring with the Airflow Task SDK
Airflow 3.0 introduced the Task SDK, a standalone package that decouples DAG definitions from Airflow internals. The goal: write portable, version-stable DAGs that survive Airflow upgrades without code changes. All core objects — DAG, dag, task, BaseOperator, Connection, Variable — now live under airflow.sdk.
The legacy import paths (airflow.decorators.task, airflow.models.dag.DAG) still work in 3.2 but are marked deprecated and will be removed in a future release.

    # etl_daily_revenue.py
    import pendulum
    from airflow.sdk import dag, task


    @dag(
        schedule="@daily",
        start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
        catchup=False,
        tags=["finance", "etl"],
    )
    def etl_daily_revenue():
        @task()
        def extract_transactions() -> list[dict]:
            # Pulls raw transactions from the payments API
            import requests

            response = requests.get("https://api.internal/payments/daily", timeout=30)
            response.raise_for_status()  # fail the task on HTTP errors
            return response.json()["transactions"]

        @task(multiple_outputs=True)
        def transform(transactions: list[dict]) -> dict:
            # Aggregates by currency and computes totals
            totals = {}
            for tx in transactions:
                currency = tx["currency"]
                totals[currency] = totals.get(currency, 0) + tx["amount"]
            return {"totals": totals, "count": len(transactions)}

        @task()
        def load(totals: dict, count: int):
            # Inserts aggregated row into the warehouse
            from warehouse import insert_revenue_summary

            insert_revenue_summary(totals=totals, transaction_count=count)

        raw = extract_transactions()
        result = transform(raw)
        load(result["totals"], result["count"])


    etl_daily_revenue()

The @dag decorator is the decorator-style alternative to the traditional with DAG(...) context manager, which remains supported. The @task decorator turns plain Python functions into Airflow tasks, with XCom serialization handled automatically. Calling transform(raw) declares a dependency: Airflow builds the DAG edges from the function call graph.
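To make the "edges from the function call graph" idea concrete, here is a toy model in plain Python. It is not Airflow's implementation, and every name in it (Node, toy_task, edges) is hypothetical. It only shows how calling a decorated function can record a dependency instead of running the function body.

```python
# Toy model only: NOT how Airflow is implemented. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Node:
    """Placeholder returned when a 'task' is called at definition time."""
    name: str
    upstream: list["Node"] = field(default_factory=list)


def toy_task(func):
    """Return a wrapper that records dependencies instead of running func."""
    def wrapper(*args):
        # Any argument that is itself a Node becomes an upstream dependency.
        deps = [a for a in args if isinstance(a, Node)]
        return Node(name=func.__name__, upstream=deps)
    return wrapper


@toy_task
def extract(): ...


@toy_task
def transform(rows): ...


@toy_task
def load(summary): ...


def edges(node):
    """Walk upstream and collect (upstream, downstream) name pairs."""
    found = []
    for up in node.upstream:
        found.extend(edges(up))
        found.append((up.name, node.name))
    return found


# Nothing is executed here; the calls only build the graph.
final = load(transform(extract()))
print(edges(final))
```

The nested call reads like ordinary Python, yet the result is a description of the pipeline rather than its output, which is the property that lets a scheduler decide when and where each step runs.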
Dynamic Task Mapping for Variable Workloads
Static DAGs break when the number of items to process changes between runs. Dynamic task mapping, introduced in Airflow 2.3 and refined in 3.x, solves this by expanding tasks at runtime using .expand().

    # etl_multi_region.py
    import pendulum
    from airflow.sdk import dag, task


    @dag(
        schedule="@hourly",
        start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
        catchup=False,
        tags=["regional", "etl"],
    )
    def etl_multi_region():
        @task()
        def get_regions() -> list[str]:
            # Returns the active regions from config
            return ["us-east-1", "eu-west-1", "ap-southeast-1"]

        @task()
        def process_region(region: str) -> dict:
            # Processes events for a single region
            from data_processor import aggregate_events

            return aggregate_events(region)

        @task()
        def merge_results(results: list[dict]):
            # Combines all regional outputs into a single report
            from reporter import publish_global_report

            publish_global_report(results)

        regions = get_regions()
        # .expand() creates one mapped task instance per region
        processed = process_region.expand(region=regions)
        merge_results(processed)


    etl_multi_region()

At execution time, Airflow creates three parallel instances of process_region, one per region. If a fourth region appears tomorrow, no DAG code changes are needed. The merge_results task waits for all mapped instances to complete before executing.
By default, Airflow caps mapped tasks at 1024 instances per mapping. Raise this limit with the max_map_length option in the [core] section of the Airflow configuration when processing larger datasets.
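The fan-out and the cap can be pictured with a small stand-in written in plain Python. This is a sketch of the behavior described above, not Airflow code: toy_expand, process_region, and the error message are hypothetical, and real mapped instances run as separate task instances rather than in a loop.

```python
# Illustrative stand-in for mapped-task fan-out; not Airflow's implementation.
MAX_MAP_LENGTH = 1024  # mirrors the default cap described above


def toy_expand(func, items, max_map_length=MAX_MAP_LENGTH):
    """Create one 'mapped instance' per input item, enforcing the cap."""
    items = list(items)
    if len(items) > max_map_length:
        raise ValueError(
            f"{len(items)} mapped instances exceed the limit of {max_map_length}"
        )
    # In Airflow each call would be a separate task instance run by a worker.
    return [func(item) for item in items]


def process_region(region):
    """Hypothetical per-region work; returns a small summary dict."""
    return {"region": region, "name_length": len(region)}


results = toy_expand(process_region, ["us-east-1", "eu-west-1", "ap-southeast-1"])
print(len(results))
```

The number of instances is decided by the data returned upstream at run time, which is why adding a region requires no change to the DAG file.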
Asset Partitions: Data-Aware Scheduling in Airflow 3.2
Before Airflow 3.2, data-aware scheduling operated at the asset level: when a producer DAG updated an asset, all consumer DAGs triggered regardless of which data slice changed. Asset partitions fix this by enabling partition-level granularity.
Consider a scenario where three upstream DAGs produce hourly player statistics for different sports leagues. A downstream analytics DAG should only trigger when all three leagues have published data for the same hour.

    # downstream_analytics.py
    import pendulum
    from airflow.sdk import Asset, dag, task
    from airflow.timetables.assets import CronPartitionTimetable

    # Define partitioned assets with hourly granularity
    nba_stats = Asset("s3://datalake/nba/hourly/", partitions=CronPartitionTimetable("0 * * * *"))
    epl_stats = Asset("s3://datalake/epl/hourly/", partitions=CronPartitionTimetable("0 * * * *"))
    nfl_stats = Asset("s3://datalake/nfl/hourly/", partitions=CronPartitionTimetable("0 * * * *"))


    @dag(
        # Triggers only when all three assets have a matching partition
        schedule=(nba_stats & epl_stats & nfl_stats),
        start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
        catchup=False,
    )
    def unified_sports_analytics():
        @task()
        def aggregate_all_leagues(**context):
            partition = context["partition"]
            # Process only the specific hourly slice across all leagues
            from analytics import build_cross_league_report

            build_cross_league_report(partition=partition)

        aggregate_all_leagues()


    unified_sports_analytics()

Partitioned scheduling eliminates redundant pipeline runs. The downstream DAG fires only when the exact partition it needs is available across all upstream sources, not before and not on partial data.
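The "all upstream sources have the same partition" condition amounts to a set intersection. The sketch below illustrates that logic in plain Python; it says nothing about how the Airflow scheduler stores or evaluates partitions, and the function name and sample keys are hypothetical.

```python
# Conceptual sketch of partition matching; not the scheduler's actual logic.
def ready_partitions(published: dict[str, set[str]]) -> set[str]:
    """Return partition keys that every upstream asset has published."""
    if not published:
        return set()
    return set.intersection(*published.values())


# Hypothetical state: hourly partition keys published so far per asset.
published = {
    "nba": {"2026-01-01T10", "2026-01-01T11"},
    "epl": {"2026-01-01T10", "2026-01-01T11"},
    "nfl": {"2026-01-01T10"},  # the 11:00 slice has not landed yet
}

print(sorted(ready_partitions(published)))
```

Only the 10:00 slice qualifies here; the 11:00 slice becomes eligible once the third producer publishes it.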
Native Async Tasks for I/O-Heavy Workloads
Airflow 3.2 adds native async support in the PythonOperator. Previously, running concurrent I/O operations (thousands of API calls, batch file downloads) required writing custom deferrable operators. Now, pass an async function directly.

    # async_file_download.py
    import asyncio

    import aiohttp
    import pendulum
    from airflow.sdk import dag, task


    async def fetch_and_save(session: aiohttp.ClientSession, url: str) -> str:
        # Downloads one file and writes it to local disk
        async with session.get(url) as resp:
            resp.raise_for_status()  # do not save error pages as data
            content = await resp.read()
        filename = url.split("/")[-1]
        with open(f"/data/downloads/{filename}", "wb") as f:
            f.write(content)
        return filename


    @dag(
        schedule="@daily",
        start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
        catchup=False,
        tags=["async", "download"],
    )
    def async_file_download():
        @task()
        async def download_files():
            urls = [f"https://data.source/files/{i}.csv" for i in range(500)]
            async with aiohttp.ClientSession() as session:
                # Schedules all 500 downloads at once instead of sequentially;
                # aiohttp's default connector keeps at most 100 connections open
                tasks = [fetch_and_save(session, url) for url in urls]
                results = await asyncio.gather(*tasks)
            return {"downloaded": len(results)}

        download_files()


    async_file_download()

The async approach downloads 500 files concurrently through a single worker slot, compared to 500 sequential HTTP calls in the synchronous version. For I/O-bound workloads, where most of the time is spent waiting on the network, this can translate to order-of-magnitude speedups without provisioning additional workers.
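Fanning out hundreds of requests at once can overwhelm the remote service, so a common refinement is to cap the number of requests in flight. The following is a minimal, stdlib-only sketch of that idea: asyncio.sleep stands in for the HTTP call, and fetch, fetch_all, stats, and the limit of 20 are all hypothetical choices rather than anything prescribed by Airflow.

```python
# Bounded-concurrency sketch; asyncio.sleep stands in for a real HTTP request.
import asyncio

stats = {"in_flight": 0, "peak": 0}  # tracks how many fetches overlap


async def fetch(i: int, sem: asyncio.Semaphore) -> int:
    async with sem:  # waits here when `limit` fetches are already running
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network wait
        stats["in_flight"] -= 1
        return i


async def fetch_all(n: int, limit: int = 20) -> list[int]:
    sem = asyncio.Semaphore(limit)
    # gather() returns results in the order the awaitables were passed in
    return await asyncio.gather(*(fetch(i, sem) for i in range(n)))


results = asyncio.run(fetch_all(100))
print(len(results), stats["peak"])
```

The semaphore keeps the single-worker-slot benefit while making the load on the remote endpoint predictable; the right limit depends on what the upstream service tolerates.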
Airflow Architecture: Components and Execution Flow
Understanding Airflow's architecture is essential for both operating it in production and answering interview questions. Five components interact during every DAG run:
- Scheduler: Resolves dependencies, creates DAG runs, and queues tasks for execution. In Airflow 3.x, DAG files are parsed by a separate DAG processor process rather than inside the scheduler, and scheduler state lives in the metadata database.
- Executor: Determines where tasks run. LocalExecutor handles single-machine setups, CeleryExecutor distributes across worker nodes, and KubernetesExecutor spins up a pod per task.
- Workers: Execute the actual task code. With the new Task Execution API in Airflow 3.0, workers communicate through a stable contract, enabling execution in containers, edge environments, or external runtimes.
- Metadata database: PostgreSQL (recommended) or MySQL. Stores serialized DAG definitions, task states, XCom values, connection credentials, and audit logs.
- API server (the web server in Airflow 2.x): Serves the Airflow UI for monitoring DAG runs, inspecting logs, triggering manual runs, and managing connections. In 3.x it also hosts the Task Execution API that workers call.
SequentialExecutor, which ran one task at a time and existed only for development, was removed in Airflow 3.0; on older versions, avoid it in production. For Kubernetes-native environments, KubernetesExecutor provides the strongest isolation since each task gets its own pod with independent resources and dependencies.
Airflow vs. Prefect vs. Dagster in 2026
Airflow competes with two significant alternatives. The right choice depends on team size, existing infrastructure, and the types of pipelines being built.
| Feature | Airflow 3.2 | Prefect 3.x | Dagster 1.9 |
|---|---|---|---|
| DAG definition | Python decorators (airflow.sdk) | Python decorators (@flow, @task) | Python decorators (@asset, @op) |
| Scheduling | Cron, asset-aware, partition-aware | Cron, event-driven | Cron, sensor-based, asset-aware |
| Execution model | Centralized scheduler + distributed workers | Hybrid (server + work pools) | Centralized dagster-daemon |
| Dynamic tasks | .expand() mapped tasks | Native Python loops | Dynamic partitions |
| Async support | Native in 3.2 | Native since 2.0 | Async I/O ops |
| Multi-team isolation | Built-in (3.2 experimental) | Workspace-based (Cloud) | Branch deployments |
| Community size | Largest (35k+ GitHub stars) | Growing (18k+ stars) | Growing (12k+ stars) |
| Best fit | Complex, multi-team pipelines with established infrastructure | Smaller teams wanting fast iteration | Data asset-centric organizations |
Airflow's advantage lies in its ecosystem: 80+ provider packages covering every major cloud service, database, and API. Prefect excels at developer experience with less boilerplate. Dagster's asset-centric model fits teams that think in terms of data products rather than task sequences.
Interview Questions: Apache Airflow for Data Engineers
The following questions reflect what hiring managers and senior engineers ask during data engineering interviews at companies running Airflow in production.
What is a DAG, and how does Airflow use it?
A DAG (Directed Acyclic Graph) defines a workflow as a collection of tasks with dependencies, guaranteed to have no circular references. Airflow parses Python files in the dags/ folder, builds the dependency graph, and the scheduler determines execution order. Each DAG run creates a DagRun object tied to a logical date. The "acyclic" constraint ensures the scheduler can always find a valid execution order: if task A must run before task B, no chain of dependencies can also require B to run before A.
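A quick way to demonstrate the acyclic property in an interview is a topological sort. The sketch below uses Python's standard graphlib module on hypothetical task names; it illustrates the graph concept only and is not how the Airflow scheduler orders tasks internally.

```python
# Topological ordering and cycle detection with the standard library.
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks it depends on (its upstreams).
deps = {"transform": {"extract"}, "load": {"transform"}}
order = list(TopologicalSorter(deps).static_order())
print(order)

# A circular dependency has no valid execution order.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The first graph yields a valid order, while the second raises CycleError, which is the same reason a workflow with circular dependencies cannot be scheduled.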
How does XCom work, and when should it be avoided?
XCom (cross-communication) passes small pieces of data between tasks. A task pushes a return value to XCom; downstream tasks pull it. The Task SDK handles this automatically when passing function return values between decorated tasks. XCom stores data in the metadata database by default, making it unsuitable for large datasets (anything over a few KB). For transferring large data, use external storage (S3, GCS) and pass only the reference path through XCom.
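The "pass a reference, not the payload" pattern can be sketched without Airflow. In this illustration a local temporary file stands in for S3 or GCS, and produce and consume are hypothetical stand-ins for an upstream and a downstream task; only the short path string would travel through XCom.

```python
# Reference-passing sketch; a temp file stands in for object storage.
import json
import os
import tempfile


def produce(rows: list[dict]) -> str:
    """Write the large payload to storage and return only its location."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    return path  # this short string is what would be stored as the XCom value


def consume(path: str) -> list[dict]:
    """Load the payload from the location received from upstream."""
    with open(path) as f:
        return json.load(f)


rows = [{"id": i, "amount": i * 1.5} for i in range(10_000)]
reference = produce(rows)
payload_bytes = len(json.dumps(rows))
print(len(reference), payload_bytes)

loaded = consume(reference)
os.remove(reference)  # clean up the stand-in storage
```

The reference is a few dozen characters, while the serialized payload is hundreds of kilobytes, which is the difference between a cheap metadata-database row and a bloated one.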
Explain the difference between schedule, start_date, and catchup.
The schedule parameter (introduced in Airflow 2.4; the older schedule_interval was removed in 3.0) defines how often the DAG runs: a cron string, timedelta, timetable object, or asset trigger. start_date sets the earliest logical date for which a DAG run can be created. With catchup=True, the scheduler creates DAG runs for all missed intervals between start_date and now; with catchup=False, it skips past intervals and schedules only from the current time forward. The default changed in Airflow 3.0: catchup is now False unless overridden, whereas it was True in 2.x. A common production pattern: set catchup=False for operational DAGs and catchup=True for historical backfills.
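The interaction between start_date and catchup is easier to explain with numbers. The sketch below is a simplified model using fixed-length intervals and plain datetime arithmetic; runs_to_create is a hypothetical helper, and real timetables (cron expressions, time zones, trigger-style schedules) have more nuance than this.

```python
# Simplified model of catchup with fixed-length intervals; not Airflow code.
from datetime import datetime, timedelta, timezone


def runs_to_create(start, now, interval, catchup):
    """Return the interval start times for which runs would be created."""
    dates = []
    current = start
    while current + interval <= now:  # an interval is runnable once it has ended
        dates.append(current)
        current += interval
    # Without catchup, only the most recent completed interval is scheduled.
    return dates if catchup else dates[-1:]


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
now = datetime(2026, 1, 4, 12, tzinfo=timezone.utc)
daily = timedelta(days=1)

with_catchup = runs_to_create(start, now, daily, catchup=True)
without_catchup = runs_to_create(start, now, daily, catchup=False)
print(len(with_catchup), len(without_catchup))
```

With catchup enabled, three daily runs are created for January 1 through 3; with it disabled, only the January 3 interval runs.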
How does the KubernetesExecutor differ from CeleryExecutor?
CeleryExecutor maintains a pool of long-running worker processes connected via a message broker (Redis or RabbitMQ). Tasks queue and execute on available workers. KubernetesExecutor creates a new Kubernetes pod for each task, using the task's Docker image and resource requirements. Celery offers lower latency (no pod startup overhead) and works well for homogeneous workloads. KubernetesExecutor provides stronger isolation, per-task resource control, and eliminates the need to manage a static worker pool — ideal for heterogeneous workloads where tasks have different dependency and memory requirements.
What strategies exist for handling task failures?
Airflow provides multiple failure-handling mechanisms. retries and retry_delay configure automatic retries with a fixed wait between attempts. For transient infrastructure issues, retry_exponential_backoff=True lengthens the wait after each failed attempt, and max_retry_delay caps it. on_failure_callback triggers custom logic (Slack alerts, PagerDuty incidents) when a task fails. trigger_rule controls how downstream tasks respond: all_success (default), one_success, all_failed, or none_failed_min_one_success. Deadline Alerts, the 3.x successor to SLAs, monitor execution duration and fire callbacks when tasks exceed expected runtime.
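To show what exponential backoff means for wait times, here is a plain doubling schedule with a cap. This is an illustration of the general technique, not Airflow's exact formula (which also adds jitter), and backoff_delays is a hypothetical helper.

```python
# Plain doubling backoff with a cap; Airflow's real calculation differs in detail.
from datetime import timedelta


def backoff_delays(retries, retry_delay, max_retry_delay=None):
    """Return the wait before each retry attempt, doubling every time."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)
        delays.append(delay)
    return delays


delays = backoff_delays(
    retries=4,
    retry_delay=timedelta(minutes=1),
    max_retry_delay=timedelta(minutes=5),
)
print([int(d.total_seconds() // 60) for d in delays])
```

The waits grow from one to two to four minutes and then stop at the five-minute cap, giving a flaky dependency progressively more time to recover without delaying the final failure indefinitely.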
Production Best Practices for Airflow Deployments
Running Airflow reliably at scale requires attention to several operational patterns that go beyond writing correct DAGs.
Idempotent tasks. Every task should produce the same result when executed multiple times with the same inputs. Use INSERT ... ON CONFLICT or MERGE instead of plain INSERT. Partition output data by logical date. This ensures safe retries and backfills without data duplication.
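A small demonstration of an idempotent load, using an in-memory SQLite database so it runs anywhere. The table, column names, and load_summary function are hypothetical, and the ON CONFLICT ... DO UPDATE syntax assumes SQLite 3.24 or newer (PostgreSQL accepts the same form).

```python
# Idempotent upsert keyed by logical date; in-memory SQLite for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_summary ("
    "logical_date TEXT PRIMARY KEY, total REAL NOT NULL)"
)


def load_summary(conn, logical_date, total):
    """Insert the row, or overwrite it if this logical date was already loaded."""
    conn.execute(
        "INSERT INTO revenue_summary (logical_date, total) VALUES (?, ?) "
        "ON CONFLICT(logical_date) DO UPDATE SET total = excluded.total",
        (logical_date, total),
    )
    conn.commit()


load_summary(conn, "2026-01-01", 1250.0)
load_summary(conn, "2026-01-01", 1250.0)  # simulated retry of the same run

rows = conn.execute("SELECT logical_date, total FROM revenue_summary").fetchall()
print(rows)
```

Running the load twice leaves exactly one row, whereas a plain INSERT would either fail on the primary key or, without one, duplicate the data.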
Small, focused DAGs. Resist the temptation to build monolithic DAGs with dozens of tasks. Split complex pipelines into multiple DAGs connected through assets (formerly datasets). Smaller DAGs parse faster, are easier to debug, and allow partial pipeline restarts.
Connection management. Store all credentials in Airflow's connection manager or an external secrets backend (AWS Secrets Manager, HashiCorp Vault). Never hardcode credentials in DAG files. Airflow encrypts connection passwords and extras at rest in the metadata database using a Fernet key, so keep that key out of version control as well.
Monitoring and alerting. Export Airflow metrics to Prometheus or StatsD. Track scheduler_heartbeat, dag_processing.total_parse_time, and executor.queued_tasks. Airflow 3.2 adds OpenTelemetry traces for end-to-end pipeline observability.
Conclusion
- Airflow 3.0 introduced the Task SDK (airflow.sdk) as the stable API for DAG authoring, and 3.2 builds on it; migrate imports now to avoid breaking changes in future releases
- Asset partitions enable partition-level data-aware scheduling, eliminating redundant pipeline triggers on partial data updates
- Native async support in the PythonOperator handles I/O-heavy workloads (batch downloads, API fan-outs) without custom deferrable operators
- Dynamic task mapping with .expand() adapts to variable workload sizes at runtime without DAG code changes
- Interview preparation should cover DAG mechanics, executor trade-offs (Celery vs. Kubernetes), XCom limitations, and idempotency patterns
- Production deployments benefit from small idempotent DAGs, external secrets management, and metric exports to observability platforms