Select rows based on test pass/fail status

The problem I’m having

Hi all! Looking for some advice on dbt best practices. I have a db which is the result of a manual data input process for which I would like to use dbt to split into high-quality data (passing tests) and data for review (failing tests).

The context of why I’m trying to do this

When dbt test fails, it’s likely not because anything is wrong with my data – it’s just that manual curation is imperfect / incomplete. I’d like to be able to push through the rows that do pass to downstream models while holding back those that don’t.

What I’ve already tried

I’ve tried using store-failures but this doesn’t necessarily give me a clear path to finding which rows are the source of the failures (without writing lots of manual code to reverse engineer the tests). I’ve also tried using an incremental model and selecting into the for-review model based on what is not in the downstream model, but as soon as one row is failing, this approach holds back everything.

Hello! We had a similar case with my current client. The way we ended up working around this was to build our own ‘Data Quality’ model that sat between the intermediate model and our final mart. The DQ model was then just a collection of CTEs, where each one was its own test and the select included the unique key, which we then unioned together at the end.

In our final mart, we then basically did select * from int_model where id not in (select distinct id from dq_model).

This has the downside of not being able to use any of the dbt tests, but we created our own macros for common tests and based most of it on dbt tests!

We also selected out a test name/description as part of the DQ model, which meant we could give good summaries to our client around where the biggest quality issues were.

A manual approach but it’s working for us! :slight_smile:

1 Like