dbt in 2026: Data Transformations, Testing and Interview Questions

Master dbt (data build tool) with this hands-on tutorial covering SQL transformations, layered modeling, testing strategies, and real interview questions for data engineering roles in 2026.

dbt data build tool transformations and testing tutorial 2026

dbt (data build tool) has become the standard framework for transforming data inside modern warehouses, used by over 40,000 companies in production as of 2026. This tutorial covers the core mechanics of dbt transformations, testing strategies, and the questions that actually come up in data engineering interviews.

What dbt actually does

dbt handles the T in ELT. It compiles SQL SELECT statements into DDL (CREATE TABLE, CREATE VIEW, MERGE) and executes them against a warehouse — Snowflake, BigQuery, Redshift, or Databricks. Version control, dependency resolution, testing, and documentation come built in.

Layered Modeling: Staging, Intermediate, and Marts

A well-structured dbt project separates transformations into three layers. This approach, popularized by the dbt community, enforces single-responsibility models and makes debugging straightforward.

Staging models sit closest to the raw data. Their job is narrow: rename columns, cast types, and filter out junk rows. No joins, no aggregations.

sql
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('ecommerce', 'raw_orders') }}
)
SELECT
    id AS order_id,
    customer_id,
    CAST(order_date AS DATE) AS order_date,
    CAST(amount AS DECIMAL(10, 2)) AS order_amount,
    LOWER(status) AS order_status
FROM source
WHERE id IS NOT NULL

Intermediate models apply business logic — joins, aggregations, window functions. They reference staging models through ref(), which registers dependencies in the DAG.

sql
-- models/intermediate/int_customer_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
)
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS total_orders,
    SUM(o.order_amount) AS lifetime_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name

Marts are the final consumption layer — clean, documented tables that dashboards and analysts query directly.

This layered approach means a broken source only affects staging, not the entire pipeline. Each layer can be tested independently.

The ref() Function and DAG Resolution

ref() is not an alias. Calling {{ ref('stg_orders') }} does two things: it resolves the fully qualified table name at compile time, and it registers a dependency edge in dbt's directed acyclic graph (DAG). Without ref(), dbt has no way to determine execution order.

A common mistake is hardcoding table names instead of using ref(). This breaks dependency tracking and can lead to models running before their upstream dependencies are ready.

sql
-- Bad: hardcoded reference, no dependency tracking
SELECT * FROM analytics.stg_orders

-- Good: ref() registers the dependency
SELECT * FROM {{ ref('stg_orders') }}

The DAG also drives features like dbt run --select stg_orders+, which runs a model and everything downstream of it — useful for targeted rebuilds after a source schema change.

Materializations: Choosing the Right Storage Strategy

dbt offers four built-in materializations, each suited to different access patterns and data volumes.

| Materialization | Storage | Best for | Trade-off | |----------------|---------|----------|----------| | view | No storage (computed on read) | Lightweight transforms, small datasets | Slow reads on large data | | table | Full table rebuild each run | Mart-layer models, fast reads | Rebuilds everything, higher cost | | incremental | Appends/merges new rows only | Large fact tables, event streams | More complex logic, needs unique_key | | ephemeral | Inlined as CTE, never materialized | Reusable logic shared across models | Cannot be queried directly |

The default is view. Override it in the model config block or in dbt_project.yml:

sql
-- models/marts/fct_daily_revenue.sql
{{
    config(
        materialized='incremental',
        unique_key='revenue_date',
        incremental_strategy='merge'
    )
}}
SELECT
    DATE(order_date) AS revenue_date,
    SUM(order_amount) AS daily_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_orders') }}
WHERE order_status = 'completed'
{% if is_incremental() %}
    AND order_date > (SELECT MAX(revenue_date) FROM {{ this }})
{% endif %}
GROUP BY DATE(order_date)

The is_incremental() block only runs during incremental builds — on a full refresh (dbt run --full-refresh), it is skipped and the entire table is rebuilt.

Ready to ace your Data Engineering interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Testing Data Quality at Every Layer

dbt provides two categories of tests: generic tests declared in YAML, and singular tests written as standalone SQL files that return rows when they fail.

Generic tests cover structural constraints:

yaml
# models/staging/_stg_models.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique           # No duplicate order IDs
          - not_null         # Every row has an order ID
      - name: order_status
        tests:
          - accepted_values:
              values: ['completed', 'pending', 'cancelled', 'refunded']
      - name: customer_id
        tests:
          - not_null
          - relationships:   # Referential integrity check
              to: ref('stg_customers')
              field: customer_id

Singular tests validate business rules that generic tests cannot express:

sql
-- tests/assert_revenue_never_negative.sql
-- Returns rows where daily revenue is negative (should return 0 rows)
SELECT
    revenue_date,
    daily_revenue
FROM {{ ref('fct_daily_revenue') }}
WHERE daily_revenue < 0

Unit tests, introduced in dbt 1.8, validate transformation logic with controlled inputs and expected outputs — no warehouse required. They run directly in the dbt test phase.

The testing strategy should follow the data flow: source-level freshness checks and row count validations, staging-level schema tests (not_null, unique), intermediate-level referential integrity, and mart-level business logic assertions.

dbt Core v1.10 and the Fusion Engine

dbt Core v1.10 introduced the --sample flag for the run and build commands. This flag applies time-based sampling to ref() and source(), letting developers validate transformations on a subset of data without the cost of a full build. Helpful for iterating on models backed by multi-billion-row fact tables.

The dbt Fusion engine, a ground-up Rust rebuild, is now the default for new projects in dbt Cloud on Snowflake, BigQuery, Redshift, and Databricks. Fusion introduces semantic versioning starting at 2.0 and brings significant performance improvements to compilation and execution.

Other notable 2026 additions: the Semantic Layer YAML spec for centralized metric definitions, Cost Insights (beta) for estimating warehouse compute per model, and native private packages now generally available.

Macros and Jinja for Reusable Logic

When the same SQL pattern appears in multiple models, extract it into a macro. Macros are Jinja functions stored in the macros/ directory.

sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ROUND(CAST({{ column_name }} AS DECIMAL(10, 4)) / 100, 2)
{% endmacro %}

Called in any model:

sql
-- models/staging/stg_payments.sql
SELECT
    payment_id,
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars,
    payment_method
FROM {{ source('stripe', 'payments') }}

This keeps the codebase DRY. The alternative — copying the conversion formula into every model that needs it — creates maintenance overhead and inconsistency.

Interview Questions That Actually Get Asked

Data engineering interviews in 2026 increasingly test dbt knowledge beyond the basics. Here are the questions that differentiate candidates, drawn from patterns seen across roles at companies using modern data stacks.

Q: An incremental model reprocessed the entire table overnight. What happened?

The most common cause: the unique_key column contained nulls. When dbt tries to MERGE on a null key, the match fails for every row, so it inserts duplicates. The second cause: someone ran dbt run --full-refresh without realizing it rebuilds the table from scratch. The third: the is_incremental() filter referenced a column that does not exist in the target table yet (first run vs. subsequent runs). Debugging starts with checking the compiled SQL in target/compiled/.

Q: How should tests be layered in a production dbt project?

Source freshness checks run first — if the source has not updated, downstream models should not run on stale data. Staging tests validate schema constraints: primary keys (unique + not_null), accepted values, and type casts. Intermediate tests check referential integrity across joins. Mart tests assert business invariants: revenue >= 0, active users >= paying users, and sum of parts equals total. All tests run in CI on pull requests against a dev schema, then again post-deployment in production with alerting.

Q: When should a model be ephemeral vs. a view?

Ephemeral models inline their SQL as a CTE into consuming models — useful for reusable logic that does not need to be queried independently. Views compute on read and exist as queryable objects. The trade-off: ephemeral models cannot be tested directly (they have no table to run assertions against) and cannot appear in data lineage tools. Use ephemeral for internal helper logic; use views for anything that needs independent testing or monitoring.

Q: Explain the difference between source() and ref().

source() points to raw tables defined in a sources.yml file — these are tables that dbt does not manage. ref() points to other dbt models. Both register DAG dependencies, but source() also enables freshness checks (dbt source freshness), which ref() does not. Using source() for raw tables and ref() for everything else is not optional — it is how dbt knows what it controls and what it does not.

For a broader set of data engineering interview questions, including topics like ETL vs ELT pipeline architecture, the SharpSkill practice modules on dbt fundamentals and advanced dbt patterns cover these concepts with interactive exercises.

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Conclusion

  • dbt transforms raw warehouse data through SQL SELECT statements, compiled into DDL automatically — the layered staging/intermediate/mart structure keeps each model focused on a single responsibility
  • The ref() function is the backbone of dependency management: it resolves table names and builds the DAG that controls execution order
  • Incremental materializations reduce cost on large tables, but require careful handling of unique_key, null values, and the is_incremental() filter
  • Testing should follow the data flow — source freshness, schema constraints at staging, referential integrity at intermediate, business logic assertions at mart level
  • dbt Core v1.10 brings the --sample flag for faster iteration, while the Fusion engine (Rust-based, version 2.0) is now the default in dbt Cloud
  • Interview questions on dbt focus on debugging real failures (incremental reprocessing, null keys, stale sources) rather than definitions — hands-on experience with compiled SQL and CI pipelines matters more than memorized answers

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Tags

#dbt
#data-engineering
#data-transformation
#testing
#interview
#analytics-engineering
#sql

Share

Related articles