I am trying to implement a test that can check if there is missing values in the source table.
The context of why I’m trying to do this
I am currently implementing new test to my dbt models that going to feed my dashboards. My users need kpis in order to get an overview of the company performance.
Last week, I got a missing value issue. All the kpis from the dashboards did not break but the kpis were wrong (because there was values in the model). Hopefully, I manage to see the error, but I want to avoid this missing value error without having to check the data everyday.
What I’ve already tried
I tried to build custom singular test that checks if all the ids from my stg table are in my source. Unfortunately, dbt run the test after running the stg model.
Here is the test I was trying to do :
select id from {{ ref('stg_table') }}
where id not in (select id from {{ source('source_table') }})
Long story short, I want to check if all my id from the stg table are in my source table before running the stg model. So, if my ids are not in the source table, I know that I have missing values in the source table.
How can I check that there is no missing values in my sources tables ?
1 - You can simply run dbt test before dbt run, so you test before running
2 - You can create a snapshot for the source table with the invalidate_hard_deletes config enabled (invalidate_hard_deletes | dbt Developer Hub). Then your stg model can reference the snapshot
Does any of these options make sense for your problem?
I think you should consider the audit_helper package. It has a compare_queries test that can show you if you are missing records either on query A or query B. You can set a ‘error’ on the test so it will not continue when the test fails. You can even store the missing values in a table if needed.
Thank you so much for your different options !
I used the simple trick of running the dbt test before dbt run for my specific generic test (using tags). It was easy to make an it works well !
And I will probably try to implement audit helper later in my dbt project because it seems to be a great option.
I think if you check all those id’s before data load trigger using run_query() and create if condition to check all id present. If so then start data load trigger else raise error