Using dynamic schemas for snapshots

Unlike models, snapshots are built into the same target_schema, no matter who is running them. This helps ensure that whenever you ref a snapshot, you’re using the production version of the snapshot, so that your dev transformations reflect reality.

However, this can cause problems if you want to test the snapshot command in dev or during a CI run, since the database user you run this command as in dev/CI may not have the privileges required to mutate the snapshot table.

If this is causing headaches for you, there is a workaround! Here’s how I recently did it:

1. Use the generate_schema_name macro to make the target_schema environment-aware

The generate_schema_name macro is how dbt constructs an environment-aware schema name for a model (docs), and we can leverage it here to do the same thing.
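For reference, here’s roughly what dbt’s built-in generate_schema_name macro does (simplified from dbt’s global project):

```sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

Note that the built-in version always prefixes the custom schema with target.schema. The prod/dev names shown further down (snapshots in prod, dbt_claire_snapshots in dev) imply a customized override of this macro that returns just the custom schema name when running in prod, which is a common pattern.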

Normally, I’d configure this directly from the dbt_project.yml file for all my snapshots so they behave consistently. But for reasons that I’d prefer not to get into (read: I don’t fully understand), we have to do this directly from the snapshot itself.

snapshots/orders_snapshot.sql:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema=generate_schema_name('snapshots'),
      unique_key='id',

      strategy='timestamp',
      updated_at='order_date',
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

If you’re using custom databases, it might also be a good idea to do the same for your target_database configuration (i.e. target_database=generate_database_name('analytics')).
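As a sketch, a config block that makes both the database and schema environment-aware might look like this (generate_database_name is dbt’s built-in counterpart to generate_schema_name; adjust the names to your project):

```sql
{{
    config(
      target_database=generate_database_name('analytics'),
      target_schema=generate_schema_name('snapshots'),
      unique_key='id',

      strategy='timestamp',
      updated_at='order_date',
    )
}}
```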

Now when you run $ dbt snapshot, the snapshot will be built in a different schema depending on who is running it.
For my setup:

-- in prod:
analytics.snapshots.orders_snapshot

-- in dev:
analytics.dbt_claire_snapshots.orders_snapshot

2. Don’t ref your snapshots

The docs say you should select from snapshots using ref:

select * from {{ ref('orders_snapshot') }}

But, if you’ve followed step one, you really shouldn’t use ref. This is because we want to make sure we are always selecting from the production version of the snapshot, not your dev or CI version. If we use ref, that won’t be the case.

So, what’s an analytics engineer to do?

Option 1: Use sources

Some folks have worked around this by choosing to define their snapshots as sources, and instead using the {{ source(...) }} function to select from their snapshot. That’s a reasonable approach, but the one downside is that you’ll end up with two nodes representing the same thing in the docs: one for the snapshot, and one for the source that represents the snapshot. Not too bad of a trade-off in the schema of things!
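A sketch of what that source definition might look like, using the database and schema names from the example above (adjust to your project):

```yaml
# models/snapshot_sources.yml -- hypothetical file name
version: 2

sources:
  - name: snapshots
    database: analytics
    schema: snapshots
    tables:
      - name: orders_snapshot
```

Models would then select from it with {{ source('snapshots', 'orders_snapshot') }}.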

Option 2: Create a macro to ref for you

What if, instead of {{ ref('orders_snapshot') }}, we could do something like:

select * from {{ snapshot('orders_snapshot') }}

And have it always compile to the production version of the snapshot, while maintaining the edge in the DAG between the snapshot and the model?

We can! (Quelle surprise!)

Here’s a macro that I recently used:

macros/snapshot.sql

{% macro snapshot(snapshot_name) %}

{% set snapshot_relation = ref(snapshot_name) %}

{# Make sure this is the actual name of your target in prod #}
{% if target.name != 'prod' %}

{# Change this part based on how you name things in prod #}
{% set prod_snapshot_relation = adapter.get_relation(
      database=snapshot_relation.database,
      schema='snapshots',
      identifier=snapshot_relation.identifier
) %}

{% endif %}

{#
Use an or operator to handle:
1. the case where the prod version has not yet been created
2. the case where we are in prod (prod_snapshot_relation is never set)
#}

{{ return(prod_snapshot_relation or snapshot_relation) }}

{% endmacro %}

Et voilà! In both prod and dev, my model will select from:

select * from analytics.snapshots.orders_snapshot

And my documentation looks right!

Here’s a Loom video where I run through this (it’s a little long, but I go into much greater depth):


Hi Claire, thanks for the workaround on this case.
I’m wondering if I can add a tag in the snapshot config and then exclude that tag (--exclude tag:snapshot) when running dbt snapshot in production. Can I do that as a workaround?

Hi @teguhn — the best way to find the answer to these kinds of questions is to try it out yourself! Let me know if it works :wink:

Hi @claire ! Thanks for sharing this and going into so much detail :raised_hands:

I found this while looking for ideas to improve how we are operating (in Snowflake): we currently clone the prod snapshot into test schemas on each run to keep the data fresh without potentially impacting any of the prod snapshots. Do you have any insight into whether that is a valid tactic? Or is there no real risk in having the production snapshot used across test and CI schemas instead of a clone?

Thank you for any insight!

I just wanted to put this note here for anyone who uses this technique:

If you are attempting to run Slim CI, as of writing, using generate_database_name and generate_schema_name within the snapshot configuration will mark these files as state:modified.

Environment awareness logic for snapshots (and perhaps other models, though I’ve not tested that) must live within dbt_project.yml in order to not trigger files as modified when comparing across targets (i.e. target == prod and target == ci)

While I did use the snapshot macro listed here to have dev point to prod snapshots, this did not solve the issue I was hoping to solve (Slim CI running all snapshots in their own temporary schema, which then runs every child model when using state:modified+).

The magic sauce is explained here: Caveats to state comparison | dbt Docs
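A sketch of what that dbt_project.yml configuration might look like, assuming a prod target named prod and a project named my_project (dbt supports Jinja in these config values, so the environment check can live here instead of in each snapshot):

```yaml
# dbt_project.yml -- hypothetical project and schema names
snapshots:
  my_project:
    +target_database: analytics
    +target_schema: "{{ 'snapshots' if target.name == 'prod' else target.schema ~ '_snapshots' }}"
```

Because this logic sits in dbt_project.yml rather than in the snapshot files themselves, state comparison no longer flags every snapshot as modified across targets.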


Proposing another workaround (when using Snowflake):

I wanted to avoid snapshots being deployed into the default schema (target.schema), which in my case would be MRT_schema in deployment and USER_schema in development. Instead, I wanted them deployed into INT_schema, but still built in USER_schema during development.

And without Jinja being necessary (!) this was done as follows:

SETUP

  1. First make two different folders inside snapshots, to group snapshots by whether they are publicly available in the organization (hence deployed) or still only in development (hence dev_only).

  2. Go into dbt_project.yml and edit the code to the following:

snapshots:
  <project_name>:
    deployed: 
      +target_schema: INT_schema
    dev_only:
      +target_schema: target.schema  # user schema 

USE:

  • When developing the snapshot and wanting it to be for development uses only, leave the snapshot .sql file in dev_only.
  • When wanting to publish the dataset, so others in the organization may use it (but still in a custom schema), move the file into deployed.

This gives flexibility and control in both dev and production.


Hey, I don’t understand why this works: generate_schema_name takes two parameters, custom_schema_name and node, so calling it without the second parameter just returns an error. Do you use a customized version of generate_schema_name?

@claire - Thanks for sharing this. I just followed the steps outlined and can now get my snapshot to materialize within my dev schema, but upon subsequent runs in dev it just rebuilds the existing table rather than inserting into/updating it. If I explicitly point to the dev database/schema in the snapshot config block, the table updates/inserts as expected after the initial run. Curious if you’ve seen this before or what could be causing the behavior.

Just wanted to echo that I’m seeing the same thing. I’ve used this technique before and it appears that something has changed with the generate_schema_name macro. (generate_database_name still works fine). dbt is destroying and rebuilding the snapshots on each run as it would with a normal dbt model. It has the same behavior in both dev and prod.

So are you suggesting you move the files to a different folder before you create the PR?

Hi Claire,

Thank you for sharing the solution. When I attempted to add generate_schema_name('snapshots') to the target_schema, dbt returned an error indicating that 'id' (the unique key) doesn’t exist. However, when I removed the generate_schema_name function, dbt ran perfectly. I’m puzzled as to why this problem occurred, because the 'id' column does exist in my source data.

Do you have any idea ?

Thanks in advance.