Implementing Data Quality Frameworks in Modern Data Platforms
Poor data quality costs organisations millions annually — in bad decisions, failed regulatory audits, and eroded trust in data products. Yet data quality is frequently treated as an afterthought. Building a robust data quality framework from the start is one of the highest-ROI investments a data platform team can make.
What Is Data Quality?
Data quality is multidimensional. The six commonly cited dimensions are:
| Dimension | Definition | Example |
|---|---|---|
| Completeness | Is all required data present? | No null values in required fields |
| Accuracy | Does the data reflect reality? | Revenue figures match source systems |
| Consistency | Is data consistent across sources? | Customer counts match between CRM and data warehouse |
| Timeliness | Is data available when needed? | Daily tables updated by 08:00 |
| Validity | Does data conform to expected formats? | Dates in ISO 8601, postcodes in correct format |
| Uniqueness | Are there unexpected duplicates? | Each order ID appears exactly once |
A good data quality framework addresses all six dimensions across the platform.
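To make the dimensions concrete, two of them can be computed directly from a batch of records. The sketch below (not tied to any particular tool; field names are illustrative) measures completeness and uniqueness:

```python
# Minimal sketch: computing two of the six dimensions (completeness,
# uniqueness) for a batch of order records. Field names are illustrative.
from typing import Any


def completeness(rows: list[dict[str, Any]], field: str) -> float:
    """Fraction of rows where `field` is present and non-null."""
    if not rows:
        return 1.0
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)


def uniqueness(rows: list[dict[str, Any]], field: str) -> float:
    """Fraction of non-null values of `field` that are distinct."""
    values = [r.get(field) for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": None},
    {"order_id": 2, "customer_id": "b"},  # duplicate order_id
]
print(completeness(orders, "customer_id"))  # 2/3
print(uniqueness(orders, "order_id"))       # 2/3
```

The same pattern extends to validity (regex match rate) and timeliness (lag against an SLA), while accuracy and consistency require a second system to compare against.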
Layers of Data Quality Testing
Layer 1: Source System Validation
Before data enters your platform, validate it at ingestion:
- Schema validation — does the incoming data match the expected schema?
- Row count checks — does the volume look reasonable?
- Freshness checks — is the source system delivering data on schedule?
Tools like Airbyte and Fivetran provide some built-in source monitoring. Custom ingestion pipelines should include validation before writing to the Bronze layer.
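The three ingestion checks above can be sketched as plain functions run before the write to Bronze. Everything here — the expected schema, the tolerance, the freshness window — is an illustrative assumption, not the API of any specific tool:

```python
# Sketch of ingestion-time validation: schema, row count, and freshness.
# Schema, tolerance, and lag values are illustrative assumptions.
from datetime import datetime, timedelta, timezone

EXPECTED_SCHEMA = {"order_id": int, "customer_id": str, "amount": float}


def check_schema(row: dict) -> bool:
    """Every expected field is present with the expected type."""
    return all(isinstance(row.get(k), t) for k, t in EXPECTED_SCHEMA.items())


def check_row_count(n: int, expected: int, tolerance: float = 0.5) -> bool:
    """Volume is within ±tolerance of the expected batch size."""
    return expected * (1 - tolerance) <= n <= expected * (1 + tolerance)


def check_freshness(last_delivery: datetime, max_lag: timedelta) -> bool:
    """The source delivered within the agreed window."""
    return datetime.now(timezone.utc) - last_delivery <= max_lag


batch = [{"order_id": 1, "customer_id": "c-1", "amount": 9.99}]
if not all(check_schema(r) for r in batch):
    raise ValueError("schema mismatch — rejecting batch before Bronze write")
```

Failing batches should be quarantined (written to a dead-letter location) rather than silently dropped, so they can be inspected and replayed.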
Layer 2: Transform-Time Tests
During transformation (Silver and Gold layers), apply rule-based quality tests.
With dbt:
```yaml
# models/silver/schema.yml
version: 2

models:
  - name: silver_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'cancelled']
```
These tests run as part of the CI/CD pipeline and catch issues before they reach Gold.
Layer 3: Statistical / Anomaly Detection
Rule-based tests catch known failure modes. Anomaly detection catches the unknown unknowns:
- Row count deviations beyond expected ranges
- Sudden shifts in column distribution
- Unexpected nullity rate changes
Tools like Monte Carlo, Anomalo, and Elementary (open-source, built on dbt) provide automated anomaly detection with minimal configuration.
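As a sketch of the underlying idea (the tools above do far more, including seasonality handling and automatic baselining), a simple z-score over recent daily row counts flags volume deviations; the threshold and history are illustrative:

```python
# Minimal z-score anomaly check on daily row counts.
# The 3-sigma threshold and the history window are illustrative choices.
import statistics


def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's count if it deviates more than `threshold` standard
    deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold


history = [1000, 1020, 980, 1010, 990, 1005, 995]
print(is_anomalous(history, 1008))  # typical day -> False
print(is_anomalous(history, 120))   # sudden drop -> True
```

Real datasets usually need seasonality-aware baselines (weekday vs weekend, month-end spikes), which is precisely what the managed tools automate.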
Layer 4: Business Logic Validation
Some quality checks require deep business domain knowledge:
- Revenue totals should reconcile with finance systems
- Customer counts should follow expected growth trends
- Refund rates should stay within historical bounds
These are typically implemented as SQL assertions or as scheduled dbt models that alert when thresholds are breached.
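One such reconciliation check, sketched in Python rather than as a dbt model (the totals and the 0.5% tolerance are illustrative):

```python
# Sketch: reconcile a warehouse revenue total against the finance system
# and alert when the discrepancy exceeds a tolerance. Values are illustrative.
def reconcile_revenue(warehouse_total: float, finance_total: float,
                      tolerance_pct: float = 0.5) -> bool:
    """Return True when the two totals agree within tolerance_pct percent."""
    if finance_total == 0:
        return warehouse_total == 0
    diff_pct = abs(warehouse_total - finance_total) / abs(finance_total) * 100
    return diff_pct <= tolerance_pct


if not reconcile_revenue(1_003_200.0, 1_000_000.0):
    print("ALERT: warehouse revenue does not reconcile with finance")
```

The dbt equivalent is a model (or singular test) that selects the rows violating the threshold and fails when any are returned.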
Choosing Your Tooling
| Tool | Strengths | Best for |
|---|---|---|
| dbt tests | Native to dbt, versioned, CI-integrated | dbt-centric stacks |
| Great Expectations | Highly configurable, rich UI | Python-heavy stacks |
| Soda Core | YAML-based, cloud-native | Multi-warehouse environments |
| Monte Carlo | AI-powered anomaly detection | Enterprise, less manual config |
| Elementary | Open-source, dbt-native observability | dbt shops wanting observability |
For most mid-sized data teams, dbt tests for rule-based checks, combined with Elementary or Monte Carlo for anomaly detection, cover the majority of use cases.
Building a Data Quality Culture
Tools alone do not create data quality. The cultural elements are equally important.
Data Contracts
Define explicit agreements between data producers and consumers:
- Expected schema
- Freshness SLA
- Quality metrics (e.g., null rate < 1%)
- Owner and escalation path
dbt 1.5+ supports first-class data contracts: set `contract: enforced: true` under a model's `config:` in schema.yml, alongside declared column names and data types, and dbt will fail the build if the model's output drifts from the declaration.
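Outside dbt, the same idea can be enforced programmatically at publish time. A minimal producer-side sketch, where the contract structure, column names, and the 1% null-rate SLO are all illustrative assumptions:

```python
# Sketch: enforcing a simple data contract (schema + null-rate SLO)
# before publishing a batch. The contract structure is illustrative.
CONTRACT = {
    "columns": {"order_id": int, "customer_id": str},
    "max_null_rate": {"customer_id": 0.01},  # null rate must stay below 1%
}


def meets_contract(rows: list[dict]) -> bool:
    """Check declared column types and null-rate limits for one batch."""
    for col, typ in CONTRACT["columns"].items():
        if any(col not in r or not isinstance(r[col], (typ, type(None)))
               for r in rows):
            return False
    for col, max_rate in CONTRACT["max_null_rate"].items():
        null_rate = sum(1 for r in rows if r.get(col) is None) / max(len(rows), 1)
        if null_rate > max_rate:
            return False
    return True


good_batch = [{"order_id": 1, "customer_id": "c-1"}]
print(meets_contract(good_batch))  # True
```

Publishing only batches that pass the check turns the contract from documentation into an enforced gate.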
Ownership and Accountability
Every data product should have a named owner who is responsible for quality. Quality failures should trigger alerts to the owner, not just the platform team.
Quality Metrics in Data Catalogues
Expose data quality scores in your data catalogue (DataHub, Atlan, Collibra) so consumers can self-assess trustworthiness before building on a dataset.
Incident Response for Data Quality Issues
When quality issues reach production, you need a clear process:
- Detect — automated monitoring raises an alert
- Triage — determine impact and affected consumers
- Communicate — notify downstream consumers immediately
- Remediate — fix the root cause in the pipeline
- Backfill — reprocess affected data
- Post-mortem — document what went wrong and add a test to prevent recurrence
Conclusion
Data quality is not a project — it is an ongoing practice. The best data platforms embed quality checks throughout the pipeline, from ingestion to serving, and create a culture where data producers take ownership of the quality of the data products they publish. Investing in this foundation pays dividends in data trust, faster debugging, and fewer production incidents.