Surfacing Data Quality Issues

Right now we have several very complex transformations and we continually get questions from end users on why a particular row of source data didn’t make it through all the transformation steps into the final database tables. This is sometimes easy to answer (“some source data field was invalid”) and sometimes difficult (“at some intermediate transformation step a particular join condition failed”).

Our goal is to empower users to be able to see these problems themselves and not have to ask the engineering team.

We want to move our pipelines to dbt, but I am struggling to see how to implement this. tests came to mind first as you can save test failures in the database. We could build a UI on top of these tables so users could view problems. However, the downside is it seems rows that fail tests still make it through transformations. Ideally we’d be able to write some validation logic that would both record the problem in a db table for the UI as well as prevent the row from being included in the transformation.

I did see this post that seems to discuss the same issue, but it seems heavy weight for a seemingly common issue. I can’t really imagine doing this for dozens of transformations.

Has anyone faced this issue and found a simpler solution?