Surfacing Data Quality Issues

kmq · March 3, 2023, 11:59pm

Right now we have several very complex transformations, and we continually get questions from end users about why a particular row of source data didn’t make it through all the transformation steps into the final database tables. This is sometimes easy to answer (“some source data field was invalid”) and sometimes difficult (“at some intermediate transformation step a particular join condition failed”).

Our goal is to empower users to be able to see these problems themselves and not have to ask the engineering team.

We want to move our pipelines to dbt, but I am struggling to see how to implement this. dbt tests came to mind first, since you can save test failures in the database. We could build a UI on top of these tables so users could view problems. However, the downside is that rows that fail tests still seem to make it through the transformations. Ideally we’d be able to write validation logic that both records the problem in a database table for the UI and prevents the row from being included in the transformation.
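
To make the ask concrete, here is a minimal sketch of the behaviour described above, written in plain Python with an in-memory SQLite database rather than dbt. Every name in it (raw_orders, rejected_orders, clean_orders, and the two rules) is hypothetical and only there for illustration. It shows the shape of the desired outcome, not a recommended dbt implementation.

```python
import sqlite3

# Hypothetical rules: reason -> SQL predicate that is TRUE when a row is bad.
RULES = {
    "missing_customer_id": "customer_id IS NULL",
    "negative_amount": "amount < 0",
}


def split_valid_and_rejected(conn):
    """Record every rule violation, then load only rows with no violations."""
    conn.execute("CREATE TABLE rejected_orders (order_id INTEGER, reason TEXT)")
    for reason, predicate in RULES.items():
        # One row per (order, failed rule), so a UI can show why it was dropped.
        conn.execute(
            "INSERT INTO rejected_orders "
            f"SELECT order_id, ? FROM raw_orders WHERE {predicate}",
            (reason,),
        )
    # Downstream table excludes anything that was recorded as rejected.
    conn.execute(
        "CREATE TABLE clean_orders AS "
        "SELECT * FROM raw_orders "
        "WHERE order_id NOT IN (SELECT order_id FROM rejected_orders)"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, None, 40.0), (3, 11, -5.0)],
)
split_valid_and_rejected(conn)

clean = conn.execute("SELECT order_id FROM clean_orders").fetchall()
rejected = conn.execute(
    "SELECT order_id, reason FROM rejected_orders ORDER BY order_id"
).fetchall()
```

The point is that the same predicate drives both outcomes: the rejected table that users can browse, and the filter that keeps the row out of the downstream table.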

I did see this post that seems to discuss the same issue, but the approach feels heavyweight for what looks like a common problem. I can’t really imagine doing this for dozens of transformations.

Has anyone faced this issue and found a simpler solution?
