Building models on top of snapshots

In our recent guide to snapshots we wrote the following:

Snapshots should almost always be run against source tables. Your models should then select from these snapshots, using the ref function. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data.

So if you’ve already got some snapshots in your project, here are some patterns we find useful when writing models that are built on top of a snapshot:

ref your snapshot

Just like models and seeds, you can use the ref function in place of a hardcoded reference to a table or view. It’s a good idea to use ref in any models that are built on top of a snapshot so you can understand the dependencies in your DAG.
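For example, a model built on top of a snapshot might look something like this (the model path and the status column are just illustrative guesses about your project):

-- models/staging/stg_orders.sql

select
  order_id,
  status,
  dbt_valid_from,
  dbt_valid_to

from {{ ref('snapshot_orders') }}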

Use dbt_valid_to to identify current versions

It might be useful for downstream models to only select the current version of a record – use the dbt_valid_to column to identify these rows.

select
  ...,
  dbt_valid_to is null as is_current_version

from {{ ref('snapshot_orders') }}

Add a version number for each record

Use a window function to make it easier for anyone querying this model to understand the versions for a given record.

select
  ...,
  row_number() over (
    partition by order_id -- this is the unique_key of the source table
    order by dbt_valid_from
  ) as version

from {{ ref('snapshot_orders') }}

Coalesce dbt_valid_to with a future date

Coalescing this field replaces NULLs with a date, making it easy to join to the snapshot in any downstream models (join conditions do not like NULLs). Danielle Dalton, from Rent the Runway, recently shared that RTR uses a variable in their dbt project, the_distant_future, to make their future date consistent across models, like so:

select
  ...,
  dbt_valid_from as valid_from,
  coalesce(
      dbt_valid_to,
      {{ var('the_distant_future') }}
  ) as valid_to

from {{ ref('snapshot_orders') }}

👆 I like that a lot.
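If you'd rather not rely on the variable being defined in dbt_project.yml, the var function also accepts a default value as a second argument – the date literal below is just a placeholder:

  coalesce(
      dbt_valid_to,
      {{ var('the_distant_future', "cast('9999-12-31' as timestamp)") }}
  ) as valid_to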

Union your snapshot with pre-snapshot “best guess”

It’s often the case that you only start snapshotting a data source after it has already been changing for a while, so the snapshot doesn’t capture the full history. In this case, it might be worth writing a query to construct your “best guess” of the historic values and unioning it together with your snapshot for a complete history. The team at RTR adds a column, historical_accuracy, to let their end users know whether a record is inferred or actual.
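A minimal sketch of that union – the inferred_order_history model and the status column are hypothetical, standing in for whatever best-guess reconstruction you can build from your source data:

with snapshotted as (

  select
    order_id,
    status,
    dbt_valid_from as valid_from,
    dbt_valid_to as valid_to,
    'actual' as historical_accuracy

  from {{ ref('snapshot_orders') }}

),

inferred as (

  -- best guess at pre-snapshot history, e.g. reconstructed from
  -- created_at / updated_at timestamps in the raw source
  select
    order_id,
    status,
    valid_from,
    valid_to,
    'inferred' as historical_accuracy

  from {{ ref('inferred_order_history') }}

)

select * from snapshotted
union all
select * from inferred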

Date-spine your snapshot

Sometimes it makes more sense to have a record per day, rather than a record per change. We use a pattern we refer to as “date spining” to achieve this – in short, we join the snapshot to a table of all days to fan it out. We’ve written more about this pattern in this article.
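A rough sketch of the idea, assuming a hypothetical dim_dates model (the dbt_utils date_spine macro is one way to build it) with one row per calendar day in a date_day column – exact boundary handling will depend on your warehouse and how precise you need to be:

with dates as (

  select date_day from {{ ref('dim_dates') }}

),

orders_snapshot as (

  select
    order_id,
    status,
    dbt_valid_from,
    -- treat the current version as valid through today
    coalesce(dbt_valid_to, current_timestamp) as dbt_valid_to

  from {{ ref('snapshot_orders') }}

)

select
  dates.date_day,
  orders_snapshot.order_id,
  orders_snapshot.status

from dates
inner join orders_snapshot
  on dates.date_day >= cast(orders_snapshot.dbt_valid_from as date)
  and dates.date_day < cast(orders_snapshot.dbt_valid_to as date)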

Make your model resilient to the timing of the snapshot

Even if you plan to run a snapshot daily, the nature of ETL jobs is that they’ll fail from time to time, resulting in missed days. Or, you may accidentally run a snapshot twice in a day. As a result, write the SQL in your model to be able to handle any missed days, or days with multiple snapshots.
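For example, to guard against a snapshot accidentally running twice in one day, something like this keeps only the last version captured on each day (missed days are handled naturally if downstream joins use the valid_from/valid_to ranges, as in the date-spine sketch above):

with ranked as (

  select
    *,
    row_number() over (
      partition by order_id, cast(dbt_valid_from as date)
      order by dbt_valid_from desc
    ) as intraday_index

  from {{ ref('snapshot_orders') }}

)

select * from ranked
where intraday_index = 1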


Hey Claire, thank you so much for this write-up.
I’m interested in disambiguating operational concerns from modeling, i.e. I want to use snapshots as sources for models without having to physically instantiate them as part of a dev workflow (for example, I may not have schema create permissions in an environment, and namespace collisions make working within a single schema difficult). I was thinking of implementing a custom materialization based off of ‘incremental’, or maybe a riff on your ‘increment_by_period’ (‘incremental_snapshot’?), that just adds a ‘where eff_end_dt is null’ to the end of a model’s select statement. Then I’d have snapshot-specific environments configured for that materialization, while other environments point at raw tables/seed data (that may not have dbt SCD columns). I’ve also thought that implementing a custom strategy on incremental might work. Do you have any thoughts on these approaches?


I think I answered my own question by creating a sql_footer config.


@krudflinger: I would love to learn more about how you tackled this. We are building up our models now, and we are struggling with some of the key realities that we think snapshots miss:

  1. What happens when there is an ETL failure? (@claire: I would like to see some examples of this from your ‘resilient’ section)

  2. What happens when we learn that a field was incorrect, and we need to ‘correct’ history?

  3. How do you tackle this idea of pre-snapshot history? (@claire: I would love to see some ideas/examples on this – I really like the idea of a quality column)

  4. What happens when you want (or need) to rebuild an environment?

We are exploring the idea of leveraging incrementals, and custom incrementals, to create ‘archives’ of the source data so we can play it back, but I’m not seeing a way to set a loop for operations (while not_finished do dbt run).

We’re just a month into this and still have our eyes wide open, so any guidance from your experience would be great!

Thanks,
Ron