
In just a few years, dbt (data build tool) has gone from a scrappy open-source project to the de facto standard for data transformation in modern data platforms. Understanding what it does, why it works so well, and how to use it effectively is now essential knowledge for anyone building a data platform.

What Is dbt?

dbt is a transformation tool that lets data teams write modular, testable SQL transformations and compile them into a directed acyclic graph (DAG) of materialised views, tables, and incremental models.

The core premise: transformation logic belongs in the warehouse, not in your ETL tool.

With dbt, you write SELECT statements. dbt handles the CREATE TABLE AS, manages dependencies, runs tests, generates documentation, and deploys everything in the right order.
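To make that concrete, here is a minimal sketch of a dbt model (the model name and columns are illustrative). You write only the SELECT; dbt wraps it in the DDL for the configured materialisation:

-- models/customers_by_region.sql  (hypothetical model)
-- dbt compiles this to something like:
--   create table analytics.customers_by_region as ( ... )
select
    region,
    count(*) as customer_count
from {{ ref('stg_customers') }}
group by region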

dbt comes in two flavours:

  • dbt Core — open-source CLI tool
  • dbt Cloud — managed service with a web IDE, scheduling, CI, and observability

Why dbt Has Won

Before dbt, SQL transformations were typically managed through:

  • Stored procedures with no version control
  • Hand-rolled ETL scripts in Python
  • Drag-and-drop ETL tools with opaque internals

dbt brought software engineering practices to SQL:

  • Version control via Git
  • Modular, composable models via ref()
  • Automated testing
  • Auto-generated documentation
  • Clear ownership and lineage

It also embraces SQL — the language that data teams already know — rather than requiring engineers to learn a new paradigm.

Project Structure

A well-structured dbt project follows a layered pattern aligned with the Medallion Architecture:

my_project/
├── models/
│   ├── staging/       # One-to-one with source tables; light cleaning only
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   ├── intermediate/  # Business logic; joins and transformations
│   │   └── int_orders_enriched.sql
│   └── marts/         # Business-facing; one folder per domain
│       ├── finance/
│       │   └── fct_revenue.sql
│       └── marketing/
│           └── dim_customers.sql
├── tests/             # Custom data tests
├── seeds/             # Static CSV data
├── macros/            # Reusable Jinja macros
├── snapshots/         # SCD Type 2 history tables
└── dbt_project.yml

Staging Models

Staging models are a thin wrapper over source tables. They:

  • Rename columns to a consistent naming convention
  • Cast data types
  • Apply minimal business logic (e.g., parse timestamps)

They should never contain joins — joining is the job of intermediate models.
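A staging model might look like the following sketch (the raw column names are assumptions; adapt them to your actual source schema):

-- models/staging/stg_orders.sql
-- Thin wrapper over the raw source: rename, cast, light parsing only.
select
    id                             as order_id,
    user_id                        as customer_id,
    cast(amount as numeric(12, 2)) as revenue,
    cast(created_at as timestamp)  as order_date
from {{ source('raw', 'orders') }}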

Intermediate Models

Intermediate models contain the heavy transformation logic:

  • Joining staging models together
  • Applying business rules
  • Aggregating or pivoting

They are not typically exposed to end users.
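A sketch of an intermediate model, assuming the staging models above and a hypothetical lifetime_orders column on stg_customers:

-- models/intermediate/int_orders_enriched.sql
-- Joins staging models and applies business rules; not exposed to end users.
select
    o.order_id,
    o.order_date,
    o.revenue,
    c.customer_id,
    case
        when c.lifetime_orders >= 10 then 'loyal'   -- illustrative rule
        else 'standard'
    end as customer_tier
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c
    on o.customer_id = c.customer_id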

Mart Models

Mart models are built for specific business domains and are the primary interface for analytics consumers:

  • Fact tables (fct_) — event-grain, transactional
  • Dimension tables (dim_) — entity-grain, descriptive

Key dbt Features

ref() — Dependency Management

The ref() function is dbt’s superpower. Instead of hardcoding table names, you reference other models:

-- models/marts/finance/fct_revenue.sql
select
    o.order_id,
    o.order_date,
    c.customer_segment,
    o.revenue
from {{ ref('int_orders_enriched') }} o
left join {{ ref('dim_customers') }} c
    on o.customer_id = c.customer_id

dbt automatically infers the dependency graph and executes models in the correct order.

source() — Source Declarations

Raw source tables are declared in YAML, enabling freshness checks and lineage tracking:

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at   # column dbt compares against; name is illustrative
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

Materialisation Strategies

dbt supports multiple materialisation strategies:

Strategy      Description                            Use case
view          Creates a SQL view                     Fast, no storage, always fresh
table         Creates a full table                   Better query performance
incremental   Appends/updates rows since last run    Large tables, daily updates
snapshot      SCD Type 2 history tracking            Slowly changing dimensions
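An incremental model declares its strategy in a config block and filters new rows on incremental runs. A minimal sketch, assuming order_date is a reliable high-water mark:

-- models/marts/finance/fct_revenue.sql (incremental variant; illustrative)
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    order_date,
    revenue
from {{ ref('int_orders_enriched') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already loaded
  where order_date > (select max(order_date) from {{ this }})
{% endif %}

On the first run dbt builds the full table; on subsequent runs only the filtered rows are inserted or merged on the unique key.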

Testing

dbt tests are declared in YAML and run automatically:

models:
  - name: fct_revenue
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0

Custom SQL tests can also be added in the tests/ directory.
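A custom ("singular") test is just a SELECT that returns failing rows — the test passes when the query returns zero rows. A sketch (the test name is hypothetical):

-- tests/assert_no_orphan_orders.sql
-- Fails if any enriched order references a customer
-- that does not exist in dim_customers.
select o.order_id
from {{ ref('int_orders_enriched') }} o
left join {{ ref('dim_customers') }} c
    on o.customer_id = c.customer_id
where c.customer_id is null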

dbt in the Platform Stack

A typical modern data platform stack looks like:

  1. Ingestion — Fivetran / Airbyte / custom pipelines → raw warehouse schema
  2. Transformation — dbt → staging, intermediate, and mart models
  3. Serving — BI tools (Looker, Metabase, Tableau) query mart models
  4. Orchestration — dbt runs triggered by Airflow, Dagster, or dbt Cloud
  5. Quality — dbt tests + Elementary / Monte Carlo for observability

Best Practices

  • Keep staging models thin — resist the urge to do too much in staging
  • One model, one purpose — avoid catch-all transformation models
  • Document everything — use description: in YAML files; it pays off when onboarding new team members
  • Use dbt packages — dbt-utils, dbt-expectations, and dbt-date save significant effort
  • Slim CI — on pull requests, only run models modified in the branch (dbt build --select state:modified+)
  • Treat dbt_project.yml as your architecture document — use it to enforce naming conventions and materialisation defaults by layer
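The last practice can be sketched as per-layer materialisation defaults in dbt_project.yml (the ephemeral choice for intermediate models is one common convention, not a requirement):

# dbt_project.yml (excerpt; illustrative)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table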

Conclusion

dbt has fundamentally changed how data transformation is done. By treating SQL as a first-class engineering artefact — versioned, tested, documented, and composable — it has enabled data teams to build data platforms with the same rigour as any software engineering team. If your data platform does not yet use dbt, the question is not whether to adopt it, but how soon.
