
Poor data quality costs organisations millions annually — in bad decisions, failed regulatory audits, and eroded trust in data products. Yet data quality is frequently treated as an afterthought. Building a robust data quality framework from the start is one of the highest-ROI investments a data platform team can make.

What Is Data Quality?

Data quality is multidimensional. The six commonly cited dimensions are:

| Dimension | Definition | Example |
| --- | --- | --- |
| Completeness | Is all required data present? | No null values in required fields |
| Accuracy | Does the data reflect reality? | Revenue figures match source systems |
| Consistency | Is data consistent across sources? | Customer counts match between CRM and data warehouse |
| Timeliness | Is data available when needed? | Daily tables updated by 08:00 |
| Validity | Does data conform to expected formats? | Dates in ISO 8601, postcodes in correct format |
| Uniqueness | Are there unexpected duplicates? | Each order ID appears exactly once |

A good data quality framework addresses all six dimensions across the platform.

Layers of Data Quality Testing

Layer 1: Source System Validation

Before data enters your platform, validate it at ingestion:

  • Schema validation — does the incoming data match the expected schema?
  • Row count checks — does the volume look reasonable?
  • Freshness checks — is the source system delivering data on schedule?

Tools like Airbyte and Fivetran provide some built-in source monitoring. Custom ingestion pipelines should include validation before writing to the Bronze layer.
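For a custom pipeline, the three ingestion checks above can be sketched as a single validation gate that runs before the Bronze write. This is a minimal illustration, not a production implementation: the column list, volume bounds, and freshness SLA (`EXPECTED_COLUMNS`, `MIN_ROWS`, `MAX_ROWS`, `MAX_STALENESS`) are hypothetical values you would tune per source.

```python
# Ingestion-time validation sketch: schema, row count, and freshness checks
# on a pandas DataFrame batch before it is written to the Bronze layer.
from datetime import datetime, timedelta, timezone

import pandas as pd

# Illustrative expectations -- in practice these come from your source config
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}
MIN_ROWS, MAX_ROWS = 1_000, 100_000       # plausible daily volume range
MAX_STALENESS = timedelta(hours=24)       # freshness SLA for the source


def validate_batch(df: pd.DataFrame, delivered_at: datetime) -> list[str]:
    """Return a list of validation failures; an empty list means the batch passes."""
    failures = []

    # Schema validation: column names and dtypes must match expectations
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != EXPECTED_COLUMNS:
        failures.append(f"schema mismatch: got {actual}")

    # Row count check: does the volume look reasonable?
    if not MIN_ROWS <= len(df) <= MAX_ROWS:
        failures.append(f"row count {len(df)} outside [{MIN_ROWS}, {MAX_ROWS}]")

    # Freshness check: is the source delivering on schedule?
    if datetime.now(timezone.utc) - delivered_at > MAX_STALENESS:
        failures.append(f"batch delivered at {delivered_at.isoformat()} is stale")

    return failures
```

A batch that fails any check can be quarantined for inspection rather than written to Bronze, keeping bad data out of the platform entirely.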

Layer 2: Transform-Time Tests

During transformation (Silver and Gold layers), apply rule-based quality tests.

With dbt:

```yaml
# models/silver/schema.yml
version: 2

models:
  - name: silver_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'cancelled']
```

These tests run as part of the CI/CD pipeline and catch issues before they reach Gold.

Layer 3: Statistical / Anomaly Detection

Rule-based tests catch known failure modes. Anomaly detection catches the unknown unknowns:

  • Row count deviations beyond expected ranges
  • Sudden shifts in column distribution
  • Unexpected nullity rate changes

Tools like Monte Carlo, Anomalo, and Elementary (open-source, built on dbt) provide automated anomaly detection with minimal configuration.
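The first check above, row count deviations, can be sketched as a simple statistical rule: compare today's count against recent history and flag values more than a few standard deviations out. The three-sigma threshold here is an illustrative default; the commercial tools fit far richer models (seasonality, trend) with no manual configuration.

```python
# Row-count anomaly detection sketch: z-score test against recent history.
from statistics import mean, stdev


def is_row_count_anomaly(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates from the historical mean
    by more than `threshold` standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:  # constant history: any change at all is anomalous
        return today != mu
    return abs(today - mu) / sigma > threshold
```

For example, with a history of roughly 10,000 rows per day, a batch of 4,000 rows is flagged while normal day-to-day variation passes. The same z-score pattern extends to nullity rates and other column-level metrics.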

Layer 4: Business Logic Validation

Some quality checks require deep business domain knowledge:

  • Revenue totals should reconcile with finance systems
  • Customer counts should follow expected growth trends
  • Refund rates should stay within historical bounds

These are typically implemented as SQL assertions or as scheduled dbt models that alert when thresholds are breached.
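As one concrete sketch of such a threshold assertion, a refund-rate check might look like the following. The 5% upper bound and the function name are illustrative assumptions; real bounds would be derived from historical data and agreed with the business owner.

```python
# Business-logic validation sketch: assert the refund rate stays within
# historically observed bounds. Thresholds here are illustrative only.


def refund_rate_within_bounds(refunds: int, orders: int,
                              lower: float = 0.0, upper: float = 0.05) -> bool:
    """Return True if refunds / orders falls within [lower, upper]."""
    if orders == 0:
        return False  # zero orders is itself worth an alert
    rate = refunds / orders
    return lower <= rate <= upper
```

Scheduled as a dbt model or a standalone job, a failing assertion like this would page the data product owner with the offending metric value.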

Choosing Your Tooling

| Tool | Strengths | Best for |
| --- | --- | --- |
| dbt tests | Native to dbt, versioned, CI-integrated | dbt-centric stacks |
| Great Expectations | Highly configurable, rich UI | Python-heavy stacks |
| Soda Core | YAML-based, cloud-native | Multi-warehouse environments |
| Monte Carlo | AI-powered anomaly detection | Enterprise, less manual config |
| Elementary | Open-source, dbt-native observability | dbt shops wanting observability |

For most mid-sized data teams, combining dbt tests for rule-based checks with Elementary or Monte Carlo for anomaly detection covers the majority of use cases.

Building a Data Quality Culture

Tools alone do not create data quality. The cultural elements are equally important.

Data Contracts

Define explicit agreements between data producers and consumers:

  • Expected schema
  • Freshness SLA
  • Quality metrics (e.g., null rate < 1%)
  • Owner and escalation path

dbt 1.5+ supports first-class data contracts via the contract: config in schema.yml.
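A minimal sketch of an enforced model contract in dbt 1.5+ (the model and column names are illustrative):

```yaml
# models/silver/schema.yml -- enforced dbt model contract
version: 2

models:
  - name: silver_orders
    config:
      contract:
        enforced: true   # the build fails if the model's schema drifts
    columns:
      - name: order_id
        data_type: bigint
        constraints:
          - type: not_null
      - name: order_status
        data_type: varchar
```

With the contract enforced, any change to column names, types, or constraints breaks the build instead of silently breaking downstream consumers.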

Ownership and Accountability

Every data product should have a named owner who is responsible for quality. Quality failures should trigger alerts to the owner, not just the platform team.

Quality Metrics in Data Catalogues

Expose data quality scores in your data catalogue (DataHub, Atlan, Collibra) so consumers can self-assess trustworthiness before building on a dataset.

Incident Response for Data Quality Issues

When quality issues reach production, you need a clear process:

  1. Detect — automated monitoring raises an alert
  2. Triage — determine impact and affected consumers
  3. Communicate — notify downstream consumers immediately
  4. Remediate — fix the root cause in the pipeline
  5. Backfill — reprocess affected data
  6. Post-mortem — document what went wrong and add a test to prevent recurrence

Conclusion

Data quality is not a project — it is an ongoing practice. The best data platforms embed quality checks throughout the pipeline, from ingestion to serving, and create a culture where data producers take ownership of the quality of the data products they publish. Investing in this foundation pays dividends in data trust, faster debugging, and fewer production incidents.
