Apache Airflow in 2026: Pipeline Orchestration, DAGs and Interview Questions

Master Apache Airflow 3.2 with this hands-on tutorial covering DAG authoring with the Task SDK, pipeline orchestration patterns, asset partitions, and real interview questions for data engineering roles in 2026.

Apache Airflow pipeline orchestration DAGs tutorial 2026

Apache Airflow 3.2 represents the most significant evolution of the open-source orchestration platform since its 3.0 architectural overhaul. With over 10 million monthly downloads on PyPI and adoption across companies ranging from Airbnb to Spotify, Airflow remains the dominant tool for building, scheduling, and monitoring data pipelines. This tutorial covers DAG authoring with the new Task SDK, pipeline orchestration patterns, and the interview questions that surface in data engineering roles.

Airflow 3.2 at a glance

Airflow 3.2, released in April 2026, introduces asset partitions for granular data-aware scheduling, native async support in the PythonOperator, and multi-team deployments. All DAG imports now use the stable airflow.sdk namespace introduced in Airflow 3.0.

DAG Authoring with the Airflow Task SDK

Airflow 3.0 introduced the Task SDK, a standalone package that decouples DAG definitions from Airflow internals. The goal: write portable, version-stable DAGs that survive Airflow upgrades without code changes. All core objects — DAG, dag, task, BaseOperator, Connection, Variable — now live under airflow.sdk.

The legacy import paths (airflow.decorators.task, airflow.models.dag.DAG) still work in 3.2 but are marked deprecated and will be removed in a future release.

python
# etl_daily_revenue.py
import pendulum
from airflow.sdk import dag, task

@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
    tags=["finance", "etl"],
)
def etl_daily_revenue():
    @task()
    def extract_transactions() -> list[dict]:
        # Pulls raw transactions from the payments API
        import requests
        response = requests.get("https://api.internal/payments/daily")
        return response.json()["transactions"]

    @task(multiple_outputs=True)
    def transform(transactions: list[dict]) -> dict:
        # Aggregates by currency and computes totals
        totals = {}
        for tx in transactions:
            currency = tx["currency"]
            totals[currency] = totals.get(currency, 0) + tx["amount"]
        return {"totals": totals, "count": len(transactions)}

    @task()
    def load(totals: dict, count: int):
        # Inserts aggregated row into the warehouse
        from warehouse import insert_revenue_summary
        insert_revenue_summary(totals=totals, transaction_count=count)

    raw = extract_transactions()
    result = transform(raw)
    load(result["totals"], result["count"])

etl_daily_revenue()

The @dag decorator replaces the traditional with DAG(...) context manager. The @task decorator turns plain Python functions into Airflow tasks, with XCom serialization handled automatically. Calling transform(raw) declares a dependency — Airflow builds the DAG edges from the function call graph.

Dynamic Task Mapping for Variable Workloads

Static DAGs break when the number of items to process changes between runs. Dynamic task mapping, stable since Airflow 2.4 and refined in 3.x, solves this by expanding tasks at runtime using .expand().

python
# etl_multi_region.py
import pendulum
from airflow.sdk import dag, task

@dag(
    schedule="@hourly",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
    tags=["regional", "etl"],
)
def etl_multi_region():
    @task()
    def get_regions() -> list[str]:
        # Returns the active regions from config
        return ["us-east-1", "eu-west-1", "ap-southeast-1"]

    @task()
    def process_region(region: str) -> dict:
        # Processes events for a single region
        from data_processor import aggregate_events
        return aggregate_events(region)

    @task()
    def merge_results(results: list[dict]):
        # Combines all regional outputs into a single report
        from reporter import publish_global_report
        publish_global_report(results)

    regions = get_regions()
    # .expand() creates one mapped task instance per region
    processed = process_region.expand(region=regions)
    merge_results(processed)

etl_multi_region()

At execution time, Airflow creates three parallel instances of process_region — one per region. If a fourth region appears tomorrow, no DAG code changes are needed. The merge_results task waits for all mapped instances to complete before executing.

Mapped task limits

By default, Airflow caps mapped tasks at 1024 instances per mapping. Override this with max_map_length in the DAG configuration when processing larger datasets.

Asset Partitions: Data-Aware Scheduling in Airflow 3.2

Before Airflow 3.2, data-aware scheduling operated at the asset level: when a producer DAG updated an asset, all consumer DAGs triggered regardless of which data slice changed. Asset partitions fix this by enabling partition-level granularity.

Consider a scenario where three upstream DAGs produce hourly player statistics for different sports leagues. A downstream analytics DAG should only trigger when all three leagues have published data for the same hour.

python
# downstream_analytics.py
import pendulum
from airflow.sdk import dag, task
from airflow.timetables.assets import CronPartitionTimetable

# Define partitioned assets with hourly granularity
nba_stats = Asset("s3://datalake/nba/hourly/", partitions=CronPartitionTimetable("0 * * * *"))
epl_stats = Asset("s3://datalake/epl/hourly/", partitions=CronPartitionTimetable("0 * * * *"))
nfl_stats = Asset("s3://datalake/nfl/hourly/", partitions=CronPartitionTimetable("0 * * * *"))

@dag(
    # Triggers only when all three assets have matching partition
    schedule=(nba_stats & epl_stats & nfl_stats),
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
)
def unified_sports_analytics():
    @task()
    def aggregate_all_leagues(**context):
        partition = context["partition"]
        # Process only the specific hourly slice across all leagues
        from analytics import build_cross_league_report
        build_cross_league_report(partition=partition)

    aggregate_all_leagues()

unified_sports_analytics()

Partitioned scheduling eliminates redundant pipeline runs. The downstream DAG only fires when the exact partition it needs is available across all upstream sources — not before, not on partial data.

Ready to ace your Data Engineering interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Native Async Tasks for I/O-Heavy Workloads

Airflow 3.2 adds native async support in the PythonOperator. Previously, running concurrent I/O operations (thousands of API calls, batch file downloads) required writing custom deferrable operators. Now, pass an async function directly.

python
# async_file_download.py
import asyncio
import aiohttp
from airflow.sdk import dag, task
import pendulum

@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
    tags=["async", "download"],
)
def async_file_download():
    @task()
    async def download_files():
        urls = [f"https://data.source/files/{i}.csv" for i in range(500)]
        async with aiohttp.ClientSession() as session:
            # Downloads 500 files concurrently instead of sequentially
            tasks = [fetch_and_save(session, url) for url in urls]
            results = await asyncio.gather(*tasks)
        return {"downloaded": len(results)}

    download_files()

async def fetch_and_save(session: aiohttp.ClientSession, url: str):
    async with session.get(url) as resp:
        content = await resp.read()
        filename = url.split("/")[-1]
        with open(f"/data/downloads/{filename}", "wb") as f:
            f.write(content)
        return filename

async_file_download()

The async approach downloads 500 files concurrently through a single worker slot, compared to 500 sequential HTTP calls in the synchronous version. For I/O-bound workloads, this translates to order-of-magnitude speedups without provisioning additional workers.

Airflow Architecture: Components and Execution Flow

Understanding Airflow's architecture is essential for both operating it in production and answering interview questions. Five components interact during every DAG run:

  • Scheduler — Parses DAG files, resolves dependencies, and queues tasks for execution. In Airflow 3.x, the scheduler operates statelessly against the metadata database.
  • Executor — Determines where tasks run. LocalExecutor handles single-machine setups. CeleryExecutor distributes across worker nodes. KubernetesExecutor spins up a pod per task.
  • Workers — Execute the actual task code. With the new Task Execution API in Airflow 3.0, workers communicate through a stable contract, enabling execution in containers, edge environments, or external runtimes.
  • Metadata database — PostgreSQL (recommended) or MySQL. Stores DAG definitions, task states, XCom values, connection credentials, and audit logs.
  • Web server — The Airflow UI for monitoring DAG runs, inspecting logs, triggering manual runs, and managing connections.
Production executor choice

Avoid SequentialExecutor in production — it runs one task at a time and exists only for development. For Kubernetes-native environments, KubernetesExecutor provides the strongest isolation since each task gets its own pod with independent resources and dependencies.

Airflow vs. Prefect vs. Dagster in 2026

Airflow competes with two significant alternatives. The right choice depends on team size, existing infrastructure, and the types of pipelines being built.

| Feature | Airflow 3.2 | Prefect 3.x | Dagster 1.9 | |---|---|---|---| | DAG definition | Python decorators (airflow.sdk) | Python decorators (@flow, @task) | Python decorators (@asset, @op) | | Scheduling | Cron, asset-aware, partition-aware | Cron, event-driven | Cron, sensor-based, asset-aware | | Execution model | Centralized scheduler + distributed workers | Hybrid (server + work pools) | Centralized dagster-daemon | | Dynamic tasks | .expand() mapped tasks | Native Python loops | Dynamic partitions | | Async support | Native in 3.2 | Native since 2.0 | Async I/O ops | | Multi-team isolation | Built-in (3.2 experimental) | Workspace-based (Cloud) | Branch deployments | | Community size | Largest (35k+ GitHub stars) | Growing (18k+ stars) | Growing (12k+ stars) | | Best fit | Complex, multi-team pipelines with established infrastructure | Smaller teams wanting fast iteration | Data asset-centric organizations |

Airflow's advantage lies in its ecosystem: 80+ provider packages covering every major cloud service, database, and API. Prefect excels at developer experience with less boilerplate. Dagster's asset-centric model fits teams that think in terms of data products rather than task sequences.

Interview Questions: Apache Airflow for Data Engineers

The following questions reflect what hiring managers and senior engineers ask during data engineering interviews at companies running Airflow in production.

What is a DAG, and how does Airflow use it?

A DAG (Directed Acyclic Graph) defines a workflow as a collection of tasks with dependencies, guaranteed to have no circular references. Airflow parses Python files in the dags/ folder, builds the dependency graph, and the scheduler determines execution order. Each DAG run creates a DagRun object tied to a logical date. The "acyclic" constraint ensures the scheduler can always find a valid execution order — task A runs before task B, never the reverse simultaneously.

How does XCom work, and when should it be avoided?

XCom (cross-communication) passes small pieces of data between tasks. A task pushes a return value to XCom; downstream tasks pull it. The Task SDK handles this automatically when passing function return values between decorated tasks. XCom stores data in the metadata database by default, making it unsuitable for large datasets (anything over a few KB). For transferring large data, use external storage (S3, GCS) and pass only the reference path through XCom.

Explain the difference between schedule, start_date, and catchup.

The schedule parameter (renamed from schedule_interval in Airflow 3.x) defines how often the DAG runs: a cron string, timedelta, timetable object, or asset trigger. start_date sets the earliest logical date for which a DAG run can be created. catchup=True (default) creates DAG runs for all missed intervals between start_date and now. Setting catchup=False tells the scheduler to skip past intervals and only schedule from the current time forward. A common production pattern: set catchup=False for operational DAGs and catchup=True for historical backfills.

How does the KubernetesExecutor differ from CeleryExecutor?

CeleryExecutor maintains a pool of long-running worker processes connected via a message broker (Redis or RabbitMQ). Tasks queue and execute on available workers. KubernetesExecutor creates a new Kubernetes pod for each task, using the task's Docker image and resource requirements. Celery offers lower latency (no pod startup overhead) and works well for homogeneous workloads. KubernetesExecutor provides stronger isolation, per-task resource control, and eliminates the need to manage a static worker pool — ideal for heterogeneous workloads where tasks have different dependency and memory requirements.

What strategies exist for handling task failures?

Airflow provides multiple failure-handling mechanisms. retries and retry_delay configure automatic retries with exponential backoff. on_failure_callback triggers custom logic (Slack alerts, PagerDuty incidents) when a task fails. trigger_rule controls how downstream tasks respond — all_success (default), one_success, all_failed, or none_failed_min_one_success. For transient infrastructure issues, the retry_exponential_backoff=True parameter increases wait time between retries. SLAs (now Deadline Alerts in 3.x) monitor execution duration and fire callbacks when tasks exceed expected runtime.

Production Best Practices for Airflow Deployments

Running Airflow reliably at scale requires attention to several operational patterns that go beyond writing correct DAGs.

Idempotent tasks. Every task should produce the same result when executed multiple times with the same inputs. Use INSERT ... ON CONFLICT or MERGE instead of plain INSERT. Partition output data by logical date. This ensures safe retries and backfills without data duplication.

Small, focused DAGs. Resist the temptation to build monolithic DAGs with dozens of tasks. Split complex pipelines into multiple DAGs connected through assets (formerly datasets). Smaller DAGs parse faster, are easier to debug, and allow partial pipeline restarts.

Connection management. Store all credentials in Airflow's connection manager or an external secrets backend (AWS Secrets Manager, HashiCorp Vault). Never hardcode credentials in DAG files. Airflow 3.x encrypts connection fields at rest in the metadata database.

Monitoring and alerting. Export Airflow metrics to Prometheus or StatsD. Track scheduler_heartbeat, dag_processing.total_parse_time, and executor.queued_tasks. Airflow 3.2 adds OpenTelemetry traces for end-to-end pipeline observability.

Preparing for a data engineering interview that covers pipeline orchestration? SharpSkill offers dedicated Airflow interview questions and practice modules that mirror real-world scenarios from production environments.

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Conclusion

  • Airflow 3.2 introduces the Task SDK (airflow.sdk) as the stable API for DAG authoring — migrate imports now to avoid breaking changes in future releases
  • Asset partitions enable partition-level data-aware scheduling, eliminating redundant pipeline triggers on partial data updates
  • Native async support in the PythonOperator handles I/O-heavy workloads (batch downloads, API fan-outs) without custom deferrable operators
  • Dynamic task mapping with .expand() adapts to variable workload sizes at runtime without DAG code changes
  • Interview preparation should cover DAG mechanics, executor trade-offs (Celery vs. Kubernetes), XCom limitations, and idempotency patterns
  • Production deployments benefit from small idempotent DAGs, external secrets management, and metric exports to observability platforms

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Tags

#apache-airflow
#data-engineering
#data-pipeline
#orchestration
#dag
#interview
#python

Share

Related articles