The Role of dbt in Modern Data Transformation
In just a few years, dbt (data build tool) has gone from a scrappy open-source project to the de facto standard for data transformation in modern data platforms. Understanding what it does, why it works so well, and how to use it effectively is now essential knowledge for anyone building a data platform.
What Is dbt?
dbt is a transformation tool that lets data teams write modular, testable SQL transformations and compile them into a directed acyclic graph (DAG) of materialised views, tables, and incremental models.
The core premise: transformation logic belongs in the warehouse, not in your ETL tool.
With dbt, you write SELECT statements. dbt handles the CREATE TABLE AS, manages dependencies, runs tests, generates documentation, and deploys everything in the right order.
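As a sketch of that premise (table and column names here are illustrative, and the exact compiled DDL varies by warehouse adapter):

```sql
-- models/stg_payments.sql: what you write
select
    id as payment_id,
    amount / 100.0 as amount_usd
from raw.payments

-- What dbt roughly compiles and executes for a table materialisation:
-- create table analytics.stg_payments as (
--     select id as payment_id, amount / 100.0 as amount_usd
--     from raw.payments
-- );
```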
dbt comes in two flavours:
- dbt Core — open-source CLI tool
- dbt Cloud — managed service with a web IDE, scheduling, CI, and observability
Why dbt Has Won
Before dbt, SQL transformations were typically managed through:
- Stored procedures with no version control
- Hand-rolled ETL scripts in Python
- Drag-and-drop ETL tools with opaque internals
dbt brought software engineering practices to SQL:
- Version control via Git
- Modular, composable models via `ref()`
- Automated testing
- Auto-generated documentation
- Clear ownership and lineage
It also embraces SQL — the language that data teams already know — rather than requiring engineers to learn a new paradigm.
Project Structure
A well-structured dbt project follows a layered pattern aligned with the Medallion Architecture:
```
my_project/
├── models/
│   ├── staging/            # One-to-one with source tables; light cleaning only
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   ├── intermediate/       # Business logic; joins and transformations
│   │   └── int_orders_enriched.sql
│   └── marts/              # Business-facing; one folder per domain
│       ├── finance/
│       │   └── fct_revenue.sql
│       └── marketing/
│           └── dim_customers.sql
├── tests/                  # Custom data tests
├── seeds/                  # Static CSV data
├── macros/                 # Reusable Jinja macros
├── snapshots/              # SCD Type 2 history tables
└── dbt_project.yml
```
Staging Models
Staging models are a thin wrapper over source tables. They:
- Rename columns to a consistent naming convention
- Cast data types
- Apply minimal business logic (e.g., parse timestamps)
They should never be joined — that is the job of intermediate models.
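A minimal staging model following these rules might look like this (column names and the cents-to-dollars conversion are assumptions for illustration):

```sql
-- models/staging/stg_orders.sql
select
    id                            as order_id,
    customer                      as customer_id,
    cast(created_at as timestamp) as ordered_at,
    amount_cents / 100.0          as revenue
from {{ source('raw', 'orders') }}
```

Note there are no joins: just renaming, casting, and unit conversion.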
Intermediate Models
Intermediate models contain the heavy transformation logic:
- Joining staging models together
- Applying business rules
- Aggregating or pivoting
They are not typically exposed to end users.
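A sketch of what `int_orders_enriched.sql` from the project tree above might contain (join keys and columns are assumptions):

```sql
-- models/intermediate/int_orders_enriched.sql
select
    o.order_id,
    o.ordered_at as order_date,
    o.customer_id,
    o.revenue,
    c.customer_segment
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c
    on o.customer_id = c.customer_id
```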
Mart Models
Mart models are built for specific business domains and are the primary interface for analytics consumers:
- Fact tables (`fct_`) — event-grain, transactional
- Dimension tables (`dim_`) — entity-grain, descriptive
Key dbt Features
ref() — Dependency Management
The ref() function is dbt’s superpower. Instead of hardcoding table names, you reference other models:
```sql
-- models/marts/finance/fct_revenue.sql
select
    o.order_id,
    o.order_date,
    c.customer_segment,
    o.revenue
from {{ ref('int_orders_enriched') }} o
left join {{ ref('dim_customers') }} c
    on o.customer_id = c.customer_id
```
dbt automatically infers the dependency graph and executes models in the correct order.
source() — Source Declarations
Raw source tables are declared in YAML, enabling freshness checks and lineage tracking:
```yaml
# models/staging/sources.yml
sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```
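Models then reference a declared table with `source()` rather than a hardcoded name. Given the declaration above, this compiles to a query against `raw.orders`:

```sql
-- e.g. in models/staging/stg_orders.sql
select * from {{ source('raw', 'orders') }}
```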
Materialisation Strategies
dbt supports multiple materialisation strategies:
| Strategy | Description | Use case |
|---|---|---|
| `view` | Creates a SQL view | Fast, no storage, always fresh |
| `table` | Creates a full table | Better query performance |
| `incremental` | Appends/updates rows since last run | Large tables, daily updates |
| `snapshot` | SCD Type 2 history tracking | Slowly changing dimensions |
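As an illustration of the incremental strategy, the materialisation is configured in the model file itself, and the `is_incremental()` check limits each run to new rows (the column names follow the earlier examples):

```sql
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    order_date,
    revenue
from {{ ref('int_orders_enriched') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what this table already holds
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```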
Testing
dbt tests are declared in YAML and run automatically:
```yaml
models:
  - name: fct_revenue
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
```
Custom SQL tests can also be added in the tests/ directory.
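A custom (singular) test is simply a SELECT that returns failing rows; the test passes when the query returns nothing. For example, to assert the same non-negative-revenue rule as above:

```sql
-- tests/assert_revenue_not_negative.sql
select order_id, revenue
from {{ ref('fct_revenue') }}
where revenue < 0
```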
dbt in the Platform Stack
A typical modern data platform stack looks like:
- Ingestion — Fivetran / Airbyte / custom pipelines → raw warehouse schema
- Transformation — dbt → staging, intermediate, and mart models
- Serving — BI tools (Looker, Metabase, Tableau) query mart models
- Orchestration — dbt runs triggered by Airflow, Dagster, or dbt Cloud
- Quality — dbt tests + Elementary / Monte Carlo for observability
Best Practices
- Keep staging models thin — resist the urge to do too much in staging
- One model, one purpose — avoid catch-all transformation models
- Document everything — use `description:` in YAML files; it pays off when onboarding new team members
- Use dbt packages — `dbt-utils`, `dbt-expectations`, and `dbt-date` save significant effort
- Slim CI — on pull requests, only run models modified in the branch (`dbt build --select state:modified+`)
- Treat `dbt_project.yml` as your architecture document — use it to enforce naming conventions and materialisation defaults by layer
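That last point can be sketched in `dbt_project.yml` like this, with per-layer materialisation defaults (the project and folder names follow the tree shown earlier):

```yaml
# dbt_project.yml: materialisation defaults by layer
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table
```

Individual models can still override these defaults with an inline `config()` block.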
Conclusion
dbt has fundamentally changed how data transformation is done. By treating SQL as a first-class engineering artefact — versioned, tested, documented, and composable — it has enabled data teams to build data platforms with the same rigour as any software engineering team. If your data platform does not yet use dbt, the question is not whether to adopt it, but how soon.