Building models on top of snapshots

In our recent guide to snapshots we wrote the following:

Snapshots should almost always be run against source tables. Your models should then select from these snapshots, treating them like regular data sources. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data.

So if you’ve already got some snapshots in your project, here are some patterns we find useful when writing models that are built on top of a snapshot:

ref your snapshot

Just like models and seeds, you can use the ref function in place of a hardcoded reference to a table or view. It’s a good idea to use ref in any models that are built on top of a snapshot so you can understand the dependencies in your DAG.

Use dbt_valid_to to identify current versions

It might be useful for downstream models to only select the current version of a record – use the dbt_valid_to column to identify these rows.

select
  ...,
  dbt_valid_to is null as is_current_version

from {{ ref('snapshot_orders') }}

Add a version number for each record

Use a window function to make it easier for anyone querying this model to understand the versions for a given record.

select
  ...,
  row_number() over (
    partition by order_id -- this is the unique_key of the source table
    order by dbt_valid_from
  ) as version

from {{ ref('snapshot_orders') }}

Coalesce dbt_valid_to with a future date

Coalescing this field replaces NULLs with a date, making it easy to join to the snapshot in any downstream models (join conditions do not like NULLs). Danielle Dalton, from Rent the Runway, recently shared that RTR uses a variable in their dbt project, the_distant_future, to make their future date consistent across models, like so:

select
  ...,
  dbt_valid_from as valid_from,
  coalesce(
      dbt_valid_to,
      {{ var('the_distant_future') }}
  ) as valid_to

from {{ ref('snapshot_orders') }}

:point_up: I like that a lot.

Union your snapshot with pre-snapshot “best guess”

It’s often the case that you start snapshotting a data source after it has had historical changes. In this case, it might be worth writing a query to construct your “best guess” of the historic values, and union it together with your snapshot for a complete history. The team at RTR add a column, historical_accuracy, to let their end users know whether the record is inferred or actual.

Date-spine your snapshot

Sometimes it makes more sense to have a record per day, rather than a record per changed record. We use a pattern we refer to as “date spining” to achieve this – in short we join snapshot to a table of all days to fan it out. We’ve written more about this pattern in this article.

Make your model resilient to the timing of the snapshot

Even if you plan to run a snapshot daily, the nature of ETL jobs is that they’ll fail from time to time, resulting in missed days. Or, you may accidentally run a snapshot twice in a day. As a result, write the SQL in your model to be able to handle any missed days, or days with multiple snapshots.

1 Like