Snapshot timestamp strategy clarification

Hi dbt helpers,
I have a 30-million-row snapshot using the timestamp strategy. I was expecting the dbt logic under the hood to be something like:

    where updated_at > (select max(updated_at) from {{ this }})

which executes within 2 seconds.
But it clearly doesn’t, as it takes 20+ minutes, so I assume it’s still looking up every PK, or doing something else?

Is there a way to configure the snapshot to evaluate the latest updated_at ONLY, as per the above? Or what is going on here?
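
For context, the snapshot is defined along these lines (table, schema, and column names here are just placeholders, not my real config):

    {% snapshot orders_snapshot %}

    {{
        config(
            target_schema='snapshots',
            unique_key='id',
            strategy='timestamp',
            updated_at='updated_at'
        )
    }}

    select * from {{ source('app', 'orders') }}

    {% endsnapshot %}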

Thanks for any help

We had a similar issue at my company where we had some tables as large as 16B rows.

Ultimately our solution was to add

    where updated_at > (select date_add(max(updated_at), -7) from {{ this }})

to the snapshot definition to get a rolling 7-day window. While we may miss some late-arriving facts, this was a reasonable compromise, since before this change the query would run for over two hours or time out.

You can see the relevant code here
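
Roughly, the filtered snapshot body ends up looking like this (placeholder names; date_add here assumes a Spark-style signature):

    {% snapshot transactions_snapshot %}

    {{ config(target_schema='snapshots', unique_key='id', strategy='timestamp', updated_at='updated_at') }}

    select *
    from {{ source('app', 'transactions') }}
    -- rolling 7-day window relative to the newest row already snapshotted
    where updated_at > (select date_add(max(updated_at), -7) from {{ this }})

    {% endsnapshot %}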


Yes great - thanks smartinez.

The only problem with that is that if the snapshot table does not already exist, the reference to {{ this }} causes an error, so this breaks new deployments.
I don’t suppose you know of a way to detect whether the snapshot table already exists? I’ve tried to build that check into the where clause as well, interrogating information_schema for the table, but the warehouse evaluates all conditions, so it still errors when the snapshot table is not present.
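
For concreteness, the kind of guard I'm after would look something like this in Jinja, using dbt's adapter.get_relation to check for the table before referencing it (untested sketch, placeholder names):

    {% snapshot transactions_snapshot %}

    {{ config(target_schema='snapshots', unique_key='id', strategy='timestamp', updated_at='updated_at') }}

    {# only apply the window filter once the snapshot table already exists #}
    {% set existing = adapter.get_relation(database=this.database, schema=this.schema, identifier=this.identifier) %}

    select *
    from {{ source('app', 'transactions') }}
    {% if existing is not none %}
    where updated_at > (select date_add(max(updated_at), -7) from {{ this }})
    {% endif %}

    {% endsnapshot %}

(I believe dbt’s load_relation macro wraps the same lookup, if that reads more cleanly.)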

IIRC, the first time a snapshot runs, it generates the initial build code. You can probably try a ref() in the subselect instead and see if that works.

I want to amend my statement here and open this up for a larger discussion.

Adding a time window to the snapshot definition adds it to the first query in the set of CTEs behind the MERGE statement that dbt builds for a snapshot:

    with snapshot_query as (

        {{ source_sql }}

    ), ...

If you also include the invalidate_hard_deletes flag, then your snapshotted data will contain records that are no longer returned by your (now time-windowed) source query, and it looks like dbt will expire those perfectly valid records:

        select
            'delete' as dbt_change_type,
            source_data.*,
            {{ snapshot_get_time() }} as dbt_valid_from,
            {{ snapshot_get_time() }} as dbt_updated_at,
            {{ snapshot_get_time() }} as dbt_valid_to,
            snapshotted_data.dbt_scd_id

        from snapshotted_data
        left join deletes_source_data as source_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where source_data.dbt_unique_key is null
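
For concreteness, the combination I'm describing is a windowed source query plus the hard-delete flag in the snapshot config (sketch only, placeholder names):

    {{
        config(
            target_schema='snapshots',
            unique_key='id',
            strategy='timestamp',
            updated_at='updated_at',
            invalidate_hard_deletes=True
        )
    }}

    select *
    from {{ source('app', 'transactions') }}
    -- anything older than the window is absent from snapshot_query,
    -- so the deletes CTE above flags it as a hard delete
    where updated_at > (select date_add(max(updated_at), -7) from {{ this }})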

Right now some of our larger tables (e.g. transactions) take in excess of 30 minutes to run. Even larger ones have timed out.