dbt for Data Analysts in 2026: Modeling, Testing and Interview Questions

Master dbt (data build tool) for data analytics — project structure, SQL modeling, testing strategies, and common interview questions with practical examples.

dbt data build tool modeling and testing for data analysts

dbt (data build tool) has become the standard transformation layer in the modern data stack, and data analyst roles in 2026 increasingly require hands-on dbt skills. With dbt Core v2.0 in alpha and v1.12 stable, the framework now covers everything from SQL modeling to automated testing and documentation — all version-controlled in Git.

This guide covers dbt project structure, modeling patterns, testing strategies, and the interview questions most frequently asked in data analyst and data engineering interviews.

dbt in the ELT Pipeline

dbt handles the T (Transform) in ELT. Raw data lands in the warehouse via ingestion tools (Fivetran, Airbyte), then dbt transforms it into clean, tested, documented models ready for analysis. Unlike traditional ETL, transformations run directly in the warehouse using SQL.

dbt Project Structure: Staging, Intermediate and Marts

A well-organized dbt project follows a layered architecture. Each layer has a specific responsibility, and this separation prevents spaghetti SQL that becomes impossible to maintain.

The standard convention uses three layers:

  • Staging — thin models that rename, cast, and clean raw source data. One staging model per source table.
  • Intermediate — business logic that joins and aggregates staging models. These are internal building blocks.
  • Marts — final models consumed by dashboards, reports, and analysts. Each mart represents a business entity or metric.
sql
-- models/staging/stripe/stg_stripe__payments.sql
with source as (
    select * from {{ source('stripe', 'payments') }}
),

renamed as (
    select
        id as payment_id,
        amount / 100.0 as amount_usd,  -- Stripe stores cents
        status as payment_status,
        created::timestamp as created_at,
        customer_id
    from source
    where status != 'failed'  -- Filter invalid records early
)

select * from renamed
sql
-- models/marts/finance/fct_monthly_revenue.sql
with payments as (
    select * from {{ ref('stg_stripe__payments') }}
),

monthly as (
    select
        date_trunc('month', created_at) as revenue_month,
        count(*) as total_transactions,
        sum(amount_usd) as gross_revenue,
        sum(case when payment_status = 'refunded' then amount_usd else 0 end) as refunds
    from payments
    group by 1
)

select
    revenue_month,
    total_transactions,
    gross_revenue,
    refunds,
    gross_revenue - refunds as net_revenue  -- Key business metric
from monthly

The staging model handles type casting and renaming. The mart model contains the business logic. This separation means a change in the source schema only requires updating one staging model, not every downstream query.

Materializations: Choosing the Right Strategy

dbt supports four core materializations that control how models are persisted in the warehouse. Picking the wrong one leads to either slow queries or unnecessary compute costs.

| Materialization | Use Case | Rebuild Behavior | |----------------|----------|-------------------| | view | Lightweight staging models, low query frequency | Recreated as SQL view on each run | | table | Mart models queried often by dashboards | Full table rebuild on each run | | incremental | Large fact tables (events, logs) | Appends/merges new rows only | | ephemeral | Reusable CTEs, never queried directly | Compiled inline as subquery |

Incremental models deserve special attention because they handle the most common performance bottleneck — processing billions of rows on every run.

sql
-- models/marts/product/fct_page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id',
    incremental_strategy='merge'
) }}

with events as (
    select
        event_id as page_view_id,
        user_id,
        page_url,
        session_id,
        event_timestamp
    from {{ ref('stg_snowplow__events') }}
    where event_type = 'page_view'

    {% if is_incremental() %}
        -- Only process new events since last run
        and event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}
)

select * from events

The is_incremental() macro checks whether the model already exists. On the first run, dbt processes all data. On subsequent runs, only new rows are processed — reducing a 2-hour job to minutes.

Testing Data Quality with dbt

Data testing in dbt works at two levels: schema tests defined in YAML and custom tests written in SQL. Both run with dbt test and fail the pipeline if violations are found.

Schema tests cover the most common data quality checks without writing SQL:

yaml
# models/staging/stripe/_stripe__models.yml
version: 2

models:
  - name: stg_stripe__payments
    description: "Cleaned payment records from Stripe"
    columns:
      - name: payment_id
        description: "Unique payment identifier"
        tests:
          - unique
          - not_null
      - name: payment_status
        tests:
          - accepted_values:
              values: ['succeeded', 'pending', 'refunded']
      - name: amount_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"  # No negative payments

For complex validation logic that YAML cannot express, custom singular tests fill the gap:

sql
-- tests/assert_revenue_not_negative.sql
-- This test fails if any month has negative net revenue
select
    revenue_month,
    net_revenue
from {{ ref('fct_monthly_revenue') }}
where net_revenue < 0  -- Should never happen

Any rows returned by a singular test indicate a failure. This pattern catches data corruption, upstream bugs, and logic errors before they reach dashboards.

Ready to ace your Data Analytics interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Jinja and Macros: Reusable SQL Logic

dbt uses Jinja templating to make SQL dynamic. The two most important functions are ref() for model references and source() for raw data references. Beyond these, macros eliminate repetitive SQL patterns across the project.

sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(12, 2)
{% endmacro %}
sql
-- Usage in any model
select
    payment_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd,
    {{ cents_to_dollars('tax_cents') }} as tax_usd
from {{ source('stripe', 'payments') }}

The macro compiles to standard SQL at build time. This approach guarantees consistent currency conversion logic across every model that handles monetary values — a single change in the macro propagates everywhere.

dbt Interview Questions for Data Analysts

Interviewers testing dbt knowledge focus on practical understanding of the tool rather than memorized definitions. The following questions appear consistently in data analytics interviews.

Q1: What is the difference between ref() and source()?

source() points to raw tables loaded by ingestion tools and defined in a sources.yml file. ref() points to other dbt models. Using ref() builds the DAG (directed acyclic graph) that dbt uses to determine execution order and track lineage. Hardcoding table names instead of using ref() breaks lineage tracking and prevents dbt from running models in the correct order.

Q2: When should an incremental model be used instead of a table?

Incremental models are appropriate when the source data is append-only or has a reliable timestamp column, and the table is large enough that full rebuilds are too slow or expensive. Event tables, log data, and time-series data are typical candidates. Small dimension tables should remain as table materializations because the complexity of incremental logic is not justified.

Q3: How does dbt handle slowly changing dimensions (SCD Type 2)?

dbt snapshots implement SCD Type 2 tracking. A snapshot monitors a source table and records changes over time by adding dbt_valid_from and dbt_valid_to columns. Two strategies exist: timestamp (uses an updated_at column) and check (compares column values directly). Snapshots run with dbt snapshot, separately from dbt run.

Q4: Explain the staging/intermediate/marts pattern.

Staging models clean and rename raw source data — one model per source table, no joins, no aggregations. Intermediate models contain business logic: joins, filters, and calculations that serve as building blocks. Mart models are the final outputs consumed by analysts and dashboards. This layering ensures each model has a single responsibility and changes to source schemas only impact staging models.

Q5: How do custom tests differ from schema tests?

Schema tests are declared in YAML and cover standard checks: unique, not_null, accepted_values, and relationships. Custom singular tests are SQL queries stored in the tests/ directory — any rows returned indicate failure. Custom generic tests are parameterized macros that can be reused across models like schema tests. Use schema tests for common constraints and singular tests for complex business rules like "revenue should never be negative."

dbt Best Practices for Production Projects

Production dbt projects that scale beyond a handful of models require consistent conventions. These practices prevent the technical debt that accumulates when multiple analysts contribute to the same project.

Naming conventions keep the project navigable. Staging models use the prefix stg_, intermediate models use int_, and fact/dimension tables use fct_ or dim_. Source-specific subdirectories (models/staging/stripe/, models/staging/salesforce/) group related models.

Documentation lives next to the code. Every model and column should have a description in the corresponding YAML file. Running dbt docs generate produces a searchable documentation site with a visual DAG — essential for onboarding new team members and auditing data lineage.

Source freshness checks catch upstream failures before they cascade:

yaml
# models/staging/stripe/_stripe__sources.yml
version: 2

sources:
  - name: stripe
    database: raw
    schema: stripe
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _fivetran_synced
    tables:
      - name: payments
      - name: customers

Running dbt source freshness checks whether the source data has been updated within the expected window. A 24-hour gap in payment data signals an ingestion failure that needs investigation before running transformations.

For deeper SQL skills applied in dbt models, the guides on SQL window functions and CTEs and advanced SQL for interviews cover the query patterns used most frequently in staging and mart models.

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Conclusion

  • dbt handles the transformation layer in ELT pipelines, running SQL directly in the data warehouse without requiring a separate compute engine
  • The staging/intermediate/marts pattern separates source cleaning from business logic, making changes predictable and isolated
  • Incremental materializations reduce processing time for large tables by only handling new or updated rows
  • Schema tests (YAML) cover standard constraints; singular tests (SQL) validate complex business rules
  • Jinja macros eliminate repeated SQL logic and ensure consistency across models
  • Source freshness checks catch upstream ingestion failures before they corrupt downstream models
  • Interview preparation should focus on practical scenarios — DAG lineage, materialization trade-offs, and testing strategies — rather than memorized definitions

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Tags

#dbt
#data-analytics
#sql
#data-modeling
#interview

Share

Related articles