Hi eb
I would like to test whether some data exists in some tables - which are NOT sources to models in my dbt job -
before starting job execution (if the data doesn’t exist, the job should exit).
I guess I could set a project variable with some macro in the project file’s vars section,
then test this variable before each model’s execution…
but I would much prefer to have this check done once, not n times (once per each of n models)…
Another, not so elegant, solution is to raise an error, which, as stated before,
creates noise in monitoring systems.
Hey,
since we are not using any orchestration tool (like Airflow etc.),
we’d need dbt to check that a number of fact tables, from which it draws data for statistics and KPI calculations,
are complete with yesterday’s data.
I would like to run a macro on job start which asserts these conditions,
and only if they pass would the dbt job start to roll out - roughly like the sketch below.
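A minimal sketch of the idea, assuming an on-run-start hook and a hypothetical table analytics.fact_orders (note this still exits by raising an error, which is exactly the noisy option I mentioned above):

-- macros/assert_facts_complete.sql (hypothetical), wired up in dbt_project.yml via:
--   on-run-start: "{{ assert_facts_complete() }}"
{% macro assert_facts_complete() %}
    {% if execute %}
        {# date arithmetic is warehouse-specific; adjust for your adapter #}
        {% set results = run_query("select count(*) from analytics.fact_orders where event_date = current_date - 1") %}
        {% if results.columns[0].values()[0] == 0 %}
            {{ exceptions.raise_compiler_error("fact_orders is missing yesterday's data - aborting the run.") }}
        {% endif %}
    {% endif %}
{% endmacro %}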
The dbt Cloud runner doesn’t have triggers like this. If you don’t want to trigger your monitoring systems, you could perhaps use webhooks to check the reason for failure and only send a failure notification if it’s an unexpected reason (examples in this guide).
Otherwise, this is still technically a table that needs to be fresh in order for your job to run. You could define it as a source with freshness rules, and then use the source_status:fresher selection method to only run models if that source table is fresh. You would need to force the dependency.
Finally, is it possible to use freshness properties on the actual sources you’re querying? This would be the ideal outcome, because you wouldn’t be fighting the system.
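For reference, a minimal sketch of that source-freshness approach, with hypothetical names throughout (a readiness source, a daily_check table, an updated_at column):

# models/sources.yml
version: 2
sources:
  - name: readiness
    tables:
      - name: daily_check
        loaded_at_field: updated_at
        freshness:
          error_after: {count: 24, period: hour}

-- models/first_model.sql: force the dependency with a comment
-- depends_on: {{ source('readiness', 'daily_check') }}
select ...

# then, roughly (the source_status selector needs artifacts from a previous run):
# dbt source freshness
# dbt build --select source_status:fresher+ --state path/to/previous/artifacts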
Thank you very much!
I am using the CLI, not the Cloud runner (running the dbt project on K8s).
Freshness rules are tricky, but may work?
It’s not enough to know there’s some fresh data in a source…
I need to know it is fresh and complete.
This I can only know by running a query over it
that computes statistics and returns a boolean result (true/false).
So the problem is how I would use such a query result as a freshness hook…
If I make the first model in the project the check-results table (populating it only with true results),
would I still be able to attach a freshness rule to this table,
while it’s itself a model in the project (not just a source)…?
Kinda feeling a vicious cycle here…
Oh I didn’t realise the computed statistics are themselves a model in the project!
It sounds like you might want to make an incremental model. Incremental models only insert records that have changed since the last run, and you can implement custom logic to express what that means for you.
For example, it might be:
-- materialize incrementally so only new, validated records are inserted
{{ config(materialized='incremental') }}

with upstream_table as (
    select * from {{ ref('upstream_table') }}
),

summary_stats_table as (
    select * from {{ ref('summary_stats_table') }}
),

final as (
    select *
    from upstream_table
    {% if is_incremental() %}
    where
        -- find new records not already in the table
        upstream_table.modified_at > (select max(this.modified_at) from {{ this }} as this)
        and exists (
            -- check that the summary stats table is complete for the day(s) being processed
            -- (the cast assumes modified_at is a timestamp and summary_date is a date)
            select null
            from summary_stats_table
            where date(upstream_table.modified_at) = summary_stats_table.summary_date
              and summary_stats_table.statistics_are_complete = 1
        )
    {% endif %}
)

select * from final
I think you’ve gotten bogged down in thinking that you need to be running preflight hooks to validate your data, as opposed to just writing a query that describes valid data to load and returns no records if there’s nothing ready to process.
Indeed… very true!
So I’m now adding the statistics check as the job’s first model.
The next model (the job’s original root model) will check whether the statistics meet
its ‘good to go’ definition - probably with a macro of its own that sets a boolean result, like the sketch below.
Then the actual root model will execute, or not, accordingly…
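A rough sketch of what I have in mind, with hypothetical names (a statistics_check model exposing an is_complete flag, and a good_to_go() macro):

-- macros/good_to_go.sql (hypothetical)
{% macro good_to_go() %}
    {% if execute %}
        {# the ref() here also forces statistics_check to build before the caller #}
        {% set results = run_query("select is_complete from " ~ ref('statistics_check')) %}
        {{ return(results.columns[0].values()[0]) }}
    {% else %}
        {{ return(false) }}
    {% endif %}
{% endmacro %}

-- models/root_model.sql (hypothetical): with an incremental materialization,
-- this inserts nothing on runs where the statistics aren't ready yet
select *
from {{ ref('upstream_table') }}
{% if not good_to_go() %}
where false
{% endif %}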
Thank you very much!
I apologise for barging in, but I have a similar problem and have been struggling for a week to find a solution to it - I really hope you can give me a hint.
I pull data from GA4 to BQ, and it can be transferred in the early morning, afternoon or evening.
I thought to create a small script that would check the number of days from the current date to the maximum date in BQ, and only if the result equals 1 (for example) would I run the big query - roughly like the sketch below.
But the problem is not how to configure it to run every hour, but how to stop it after the first positive run.
I.e. right now I have a schedule for every morning and it isn’t triggered if there is no data - but if I check the result every hour, I get a lot of extra requests (after the first positive one).
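The check itself would be something like this (hypothetical project/table names, BigQuery syntax, assuming a DATE column event_date):

-- returns true only once yesterday's data has landed
select
    date_diff(current_date(), max(event_date), day) = 1 as data_is_ready
from `my-project.ga4_export.events`  -- hypothetical GA4 export table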