Hi dbt helpers,
I have a 30-million-row snapshot using the timestamp strategy. I was expecting the dbt logic under the hood to be similar to having:
…
where updated_at > (select max(updated_at) from {{ this }})
…
which executes within 2 seconds.
But it clearly doesn't, because it takes 20+ minutes, so I assume it's still looking up every PK or doing something else?
Is there a way to configure the snapshot to evaluate the latest updated_at ONLY, as per the above? Or what is going on here?
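For reference, my rough understanding of the staging query dbt builds is sketched below: it seems to join every source row to the current snapshot rows on the unique key and compare timestamps row by row, rather than filtering on max(updated_at). (Simplified sketch only; the real macro output is more involved, and id is a placeholder for my actual key.)

select source_data.*
from source_data                  -- the full 30M-row source query
left join snapshotted_data        -- current (dbt_valid_to is null) rows from {{ this }}
    on snapshotted_data.dbt_unique_key = source_data.id
where snapshotted_data.dbt_unique_key is null                    -- brand new PKs
   or snapshotted_data.dbt_valid_from < source_data.updated_at   -- changed rows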
Thanks for any help
We had a similar issue at my company where we had some tables as large as 16B rows.
Ultimately our solution was to add
where updated_at > (select date_add(max(updated_at), -7) from {{ this }})
to the snapshot definition to get a rolling 7-day window. While we may miss some late-arriving facts, this was a reasonable compromise, since previously the query would run for over two hours or time out.
You can see the relevant code here
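Applied to a snapshot block, it looks roughly like this (the schema, source, and key names are placeholders, and the date function will depend on your warehouse):

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select *
from {{ source('app', 'orders') }}
-- rolling 7-day window relative to the latest row already in the snapshot
where updated_at > (select date_add(max(updated_at), -7) from {{ this }})

{% endsnapshot %}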
Yes great - thanks smartinez.
The only problem with that is that if the snapshot table does not already exist, the reference to {{ this }} causes an error, so this breaks brand-new deployments.
I don't suppose you know if there is a way to detect whether the snapshot table already exists? I've tried building that condition into the where clause as well, interrogating information_schema for the table, but all conditions get evaluated, so it still generates an error when the snapshot table is not present.
IIRC, the first time a snapshot runs, it generates the initial build code. You could try a ref() in the subselect instead and see if that works.
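Another option might be to guard the filter with an explicit existence check via adapter.get_relation, something along these lines (untested sketch; the source and column names are placeholders):

{% set existing_snapshot = adapter.get_relation(
      database=this.database,
      schema=this.schema,
      identifier=this.identifier) %}

select *
from {{ source('app', 'orders') }}
{% if existing_snapshot is not none %}
-- only apply the window once the snapshot table actually exists
where updated_at > (select date_add(max(updated_at), -7) from {{ this }})
{% endif %}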
I want to amend my statement here and open this up for a larger discussion.
Adding a time window to the snapshot definition injects it into the first query in the set of CTEs that dbt builds for the snapshot MERGE:
with snapshot_query as (

    {{ source_sql }}

), ...
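So with the window from above, that first CTE effectively becomes something like this (the table name is a placeholder):

with snapshot_query as (

    select *
    from raw_db.app.orders
    where updated_at > (select date_add(max(updated_at), -7) from {{ this }})

), ...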
If you also set the invalidate_hard_deletes flag, then your existing snapshotted data will include records that are no longer present in the (now time-windowed) source query, and it looks like dbt will expire those perfectly valid records:
select
    'delete' as dbt_change_type,
    source_data.*,
    {{ snapshot_get_time() }} as dbt_valid_from,
    {{ snapshot_get_time() }} as dbt_updated_at,
    {{ snapshot_get_time() }} as dbt_valid_to,
    snapshotted_data.dbt_scd_id

from snapshotted_data
left join deletes_source_data as source_data
    on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
where source_data.dbt_unique_key is null
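For context, the combination that triggers this is just a time-windowed select plus the flag in the snapshot config, roughly (placeholder names):

-- with a rolling window in this snapshot's select, anything older than the
-- window drops out of the source query, so invalidate_hard_deletes marks it
-- as a hard delete via the branch above and expires it
{{
    config(
      strategy='timestamp',
      updated_at='updated_at',
      unique_key='transaction_id',
      invalidate_hard_deletes=True
    )
}}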
Right now some of our larger tables (e.g. transactions) take in excess of 30 minutes to run, and even larger ones have timed out.