
In just a few years, dbt (data build tool) has gone from a scrappy open-source project to the de facto standard for data transformation in modern data platforms. Understanding what it does, why it works so well, and how to use it effectively is now essential knowledge for anyone building a data platform.

What Is dbt?

dbt is a transformation tool that lets data teams write modular, testable SQL transformations and compile them into a directed acyclic graph (DAG) of materialised views, tables, and incremental models.

The core premise: transformation logic belongs in the warehouse, not in your ETL tool.

With dbt, you write SELECT statements. dbt handles the CREATE TABLE AS, manages dependencies, runs tests, generates documentation, and deploys everything in the right order.
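To make that concrete, here is a minimal sketch of a dbt model (the model name and columns are illustrative). You write only the SELECT; dbt wraps it in the DDL for the configured materialisation:

-- models/customers_by_region.sql  (hypothetical model)
-- dbt compiles this to something like:
--   create table analytics.customers_by_region as ( ... )
select
    region,
    count(*) as customer_count
from {{ ref('stg_customers') }}
group by region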

dbt comes in two flavours:

  • dbt Core — open-source CLI tool
  • dbt Cloud — managed service with a web IDE, scheduling, CI, and observability

Why dbt Has Won

Before dbt, SQL transformations were typically managed through:

  • Stored procedures with no version control
  • Hand-rolled ETL scripts in Python
  • Drag-and-drop ETL tools with opaque internals

dbt brought software engineering practices to SQL:

  • Version control via Git
  • Modular, composable models via ref()
  • Automated testing
  • Auto-generated documentation
  • Clear ownership and lineage

It also embraces SQL — the language that data teams already know — rather than requiring engineers to learn a new paradigm.

Project Structure

A well-structured dbt project follows a layered pattern aligned with the Medallion Architecture:

my_project/
├── models/
│   ├── staging/       # One-to-one with source tables; light cleaning only
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   ├── intermediate/  # Business logic; joins and transformations
│   │   └── int_orders_enriched.sql
│   └── marts/         # Business-facing; one folder per domain
│       ├── finance/
│       │   └── fct_revenue.sql
│       └── marketing/
│           └── dim_customers.sql
├── tests/             # Custom data tests
├── seeds/             # Static CSV data
├── macros/            # Reusable Jinja macros
├── snapshots/         # SCD Type 2 history tables
└── dbt_project.yml

Staging Models

Staging models are a thin wrapper over source tables. They:

  • Rename columns to a consistent naming convention
  • Cast data types
  • Apply minimal business logic (e.g., parse timestamps)

They should never contain joins — joining is the job of intermediate models.
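A staging model might look like the following sketch (the raw column names are assumptions; adapt them to your actual source schema):

-- models/staging/stg_orders.sql
-- Thin wrapper over the raw source: rename, cast, light parsing only.
select
    id                             as order_id,
    user_id                        as customer_id,
    cast(amount as numeric(12, 2)) as revenue,
    cast(created_at as timestamp)  as order_date
from {{ source('raw', 'orders') }}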

Intermediate Models

Intermediate models contain the heavy transformation logic:

  • Joining staging models together
  • Applying business rules
  • Aggregating or pivoting

They are not typically exposed to end users.
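A sketch of an intermediate model, assuming the staging models above and a hypothetical lifetime_orders column on stg_customers:

-- models/intermediate/int_orders_enriched.sql
-- Joins staging models and applies business rules; not exposed to end users.
select
    o.order_id,
    o.order_date,
    o.revenue,
    c.customer_id,
    case
        when c.lifetime_orders >= 10 then 'loyal'   -- illustrative rule
        else 'standard'
    end as customer_tier
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c
    on o.customer_id = c.customer_id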

Mart Models

Mart models are built for specific business domains and are the primary interface for analytics consumers:

  • Fact tables (fct_) — event-grain, transactional
  • Dimension tables (dim_) — entity-grain, descriptive

Key dbt Features

ref() — Dependency Management

The ref() function is dbt’s superpower. Instead of hardcoding table names, you reference other models:

-- models/marts/finance/fct_revenue.sql
select
    o.order_id,
    o.order_date,
    c.customer_segment,
    o.revenue
from {{ ref('int_orders_enriched') }} o
left join {{ ref('dim_customers') }} c
    on o.customer_id = c.customer_id

dbt automatically infers the dependency graph and executes models in the correct order.

source() — Source Declarations

Raw source tables are declared in YAML, enabling freshness checks and lineage tracking:

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at   # column dbt compares against; name is illustrative
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

Materialisation Strategies

dbt supports multiple materialisation strategies:

Strategy      Description                            Use case
view          Creates a SQL view                     Fast, no storage, always fresh
table         Creates a full table                   Better query performance
incremental   Appends/updates rows since last run    Large tables, daily updates
snapshot      SCD Type 2 history tracking            Slowly changing dimensions
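An incremental model declares its strategy in a config block and filters new rows on incremental runs. A minimal sketch, assuming order_date is a reliable high-water mark:

-- models/marts/finance/fct_revenue.sql (incremental variant; illustrative)
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    order_date,
    revenue
from {{ ref('int_orders_enriched') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already loaded
  where order_date > (select max(order_date) from {{ this }})
{% endif %}

On the first run dbt builds the full table; on subsequent runs only the filtered rows are inserted or merged on the unique key.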

Testing

dbt tests are declared in YAML and run automatically:

models:
  - name: fct_revenue
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0

Custom SQL tests can also be added in the tests/ directory.
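A custom ("singular") test is just a SELECT that returns failing rows — the test passes when the query returns zero rows. A sketch (the test name is hypothetical):

-- tests/assert_no_orphan_orders.sql
-- Fails if any enriched order references a customer
-- that does not exist in dim_customers.
select o.order_id
from {{ ref('int_orders_enriched') }} o
left join {{ ref('dim_customers') }} c
    on o.customer_id = c.customer_id
where c.customer_id is null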

dbt in the Platform Stack

A typical modern data platform stack looks like:

  1. Ingestion — Fivetran / Airbyte / custom pipelines → raw warehouse schema
  2. Transformation — dbt → staging, intermediate, and mart models
  3. Serving — BI tools (Looker, Metabase, Tableau) query mart models
  4. Orchestration — dbt runs triggered by Airflow, Dagster, or dbt Cloud
  5. Quality — dbt tests + Elementary / Monte Carlo for observability

Best Practices

  • Keep staging models thin — resist the urge to do too much in staging
  • One model, one purpose — avoid catch-all transformation models
  • Document everything — use description: in YAML files; it pays off when onboarding new team members
  • Use dbt packages — dbt-utils, dbt-expectations, and dbt-date save significant effort
  • Slim CI — on pull requests, only run models modified in the branch (dbt build --select state:modified+)
  • Treat dbt_project.yml as your architecture document — use it to enforce naming conventions and materialisation defaults by layer
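The last practice can be sketched as per-layer materialisation defaults in dbt_project.yml (the ephemeral choice for intermediate models is one common convention, not a requirement):

# dbt_project.yml (excerpt; illustrative)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table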

Conclusion

dbt has fundamentally changed how data transformation is done. By treating SQL as a first-class engineering artefact — versioned, tested, documented, and composable — it has enabled data teams to build data platforms with the same rigour as any software engineering team. If your data platform does not yet use dbt, the question is not whether to adopt it, but how soon.
