Setting dbt_valid_from to NULL

One of the primary features of snapshots is to create a timeline of the underlying data. This is cool, since you can then find rows which were active at a certain point in time.

A common query pattern we use is:

select

    *

from

    table as t

join

    snapshot as s on s.some_key = t.some_key

where

    (s.dbt_valid_from is null or s.dbt_valid_from <= t.resource_timestamp)

and

    (s.dbt_valid_to is null or t.resource_timestamp < s.dbt_valid_to)

With dbt 1.9 (at least) it is documented that indeed dbt_valid_to=NULL means that the last record is currently active, or the most recent record in the timeline. So the above query would select that row if the joined resource is newer.

However, the same logic does not appear to be carried over to dbt_valid_from=NULL. My question is twofold:

  1. When making an initial snapshot, dbt_valid_from is set to the dbt execution time. Why is this, and would it make sense to add an option to instead set it to NULL in the first record of the timeline?
  2. If you then choose to manually set dbt_valid_from to NULL (e.g. with a post-hook), then subsequent snapshots will NOT update that record any longer. Is this by design, or possibly a bug?

> When making an initial snapshot, dbt_valid_from is set to the dbt execution time. Why is this, and would it make sense to add an option to instead set it to NULL in the first record of the timeline?
This behaviour makes sense to me as it captures the earliest known timestamp for when that record was known to be accurate. That is. if I create a snapshot of a table today, I know the values were correct at the time captured in dbt_valid_from. They might also have been valid last week, or last year, but I don’t know that.

Why not just change this condition:

> (s.dbt_valid_from is null or s.dbt_valid_from <= t.resource_timestamp)
To s.dbt_valid_from &lt;= t.resource_timestamp? (I’d probably use <= btw)

edit: I copied the wrong snippet originally

If you want to assume that for any records in t with a resource_timestmp that predates the creation of the snapshot should use the first recorded value, I’d suggest doing that explicitly in your model, not undermining the integrity of the snapshot itself

Note: @sutcliffe.alex originally posted this reply in Slack. It might not have transferred perfectly.

Thanks! Indeed, we resorted to using the following macro to add a version column to downstream models:

{% macro snapshot_version_column(unique_key) %}
(row_number() over (
  partition by {{ unique_key }}
  order by dbt_valid_from nulls first
))
{% endmacro %}

This can then be used as you like to further implement the desirable logic. One being to add a null real_valid_from column for the first records in the timeline:

{% macro snapshot_add_real_valid_from(ref, unique_key, initial_valid_from) %}

with versioned as (
  select
    {{ snapshot_version_column(unique_key) }} as version,
    *
  from
    {{ ref }}
)
select
    *,
    case when version = 1 then {{ initial_valid_from }}
    else dbt_valid_from end as real_valid_from
from versioned

{% endmacro %}

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.