This post is heavily inspired by work done by Alan from tails.com — thanks for sharing your approach on a dbt office hours!
## What’s a blue/green deployment?
Blue-green deployment is a technique that reduces downtime and risk by running two identical production environments. Let’s take a website as an example — when you need to deploy a new version of your website, the new version is created in a separate environment. After the build is successful, traffic is routed to the new build. If anything goes wrong, you just don’t switch over the traffic!
In analytics, we can take a similar approach:
- Maintain two separate databases (or schemas!) — for our example we’re going to call them `stage` and `prod`.
- Whenever you perform the production run of your dbt project, run it in the `stage` database.
- Once all your tests have passed, switch out the `stage` database for the `prod` database.

Hey presto! No bad data creeping into `prod`, ever!
## Doing this on Snowflake
- Set up two new databases, and grant ownership of these databases to the role that runs your dbt project:

```sql
create database stage;
grant ownership on database stage to role transformer;
create database prod;
grant ownership on database prod to role transformer;
```
- Update your dbt project to write to the `stage` database (this will mean updating either your `profiles.yml` or your connection in dbt Cloud).
- Add an operation to your project to swap the databases (more on this below):

```sql
-- macros/swap_database.sql
{% macro swap_database() %}
  {% set sql = 'alter database prod swap with stage' %}
  {% do run_query(sql) %}
  {{ log("database swapped", info=True) }}
{% endmacro %}
```
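For the `profiles.yml` route, the target might look roughly like this. This is a sketch only: the profile name, account, user, and warehouse values are placeholders, while the `transformer` role and `stage` database come from the steps above.

```yaml
# profiles.yml (sketch only; account, user, and warehouse are placeholders)
my_dbt_project:
  target: default
  outputs:
    default:
      type: snowflake
      account: abc12345        # placeholder
      user: dbt_runner         # placeholder
      role: transformer        # the role that owns both databases
      database: stage          # build into stage; the swap promotes it to prod
      warehouse: transforming  # placeholder
      schema: core
      threads: 4
```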
- Add a macro to your project to override the `ref` function (more on this below):

```sql
-- Macro to override ref and to render identifiers without a database.
{% macro ref(model_name) %}
  {% do return(builtins.ref(model_name).include(database=false)) %}
{% endmacro %}
```
- Add the step to swap the databases to your deployment process — you might end up with steps like this:

```shell
$ dbt run
$ dbt test
$ dbt run-operation swap_database
```

That’s it! Each dbt command exits with a nonzero status code when it fails, so most schedulers will stop the job before the swap step runs. If your run fails at any step, bad data won’t be promoted to `prod`!
## What’s this `swap` syntax?

Snowflake has a handy tool to help us out here — the `swap with` syntax. It essentially performs two renames at once — you can find more info in the Snowflake docs for `alter database` (`cmd + f` for `swap with`).
So by executing:

```sql
alter database prod swap with stage;
```

We’re essentially executing these two things at the exact same time, with zero downtime:

```sql
alter database prod rename to stage;
alter database stage rename to prod;
```
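To see why one atomic swap is safer than two renames run one after the other, here’s a toy sketch in Python (purely illustrative; nothing here is dbt- or Snowflake-specific). Think of each database name as a key pointing at a set of tables: `swap` exchanges the two pointers in a single step, so there is never a moment where `prod` fails to resolve.

```python
# Toy model: database names are keys pointing at their contents.
databases = {"prod": "old tables", "stage": "new tables"}

def swap(dbs, a, b):
    """Exchange what two names point at, in one atomic step."""
    dbs[a], dbs[b] = dbs[b], dbs[a]

swap(databases, "prod", "stage")
assert databases["prod"] == "new tables"   # the freshly built project
assert databases["stage"] == "old tables"  # yesterday's prod, kept around
```

With two sequential renames instead, a query arriving between them would find no `prod` at all — the swap closes that gap.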
## Why do I need to override `ref`?

By default, dbt resolves any `ref` function to a fully qualified object reference. This means that this:

```sql
select * from {{ ref('dim_customers') }}
```

Gets resolved to:

```sql
select * from stage.core.dim_customers
```

In Snowflake (and other databases), views are saved as queries. If the query has a reference to `stage.core.dim_customers`, we could end up with errors like this after we’ve done the swap:

```
SQL compilation error: Schema 'STAGE.CORE' does not exist or not authorized.
```
Or worse, we could be selecting from an out-of-date view (i.e. the one that got “demoted” to the `stage` database).

To get around this, we need to tell dbt: “well ackshullllly, don’t put the database name in there”. So this:

```sql
select * from {{ ref('dim_customers') }}
```

Should get resolved to this:

```sql
select * from core.dim_customers
```

Since this is a relative reference, as long as `core.dim_customers` exists in the `prod` database (which it should!), the view will work as expected.
To change the behavior of dbt, we can add our own `ref` macro. Whenever you add a macro with the same name as a built-in dbt macro, dbt will honor your version (related: dbt itself ships with a project of macros). There’s a bit of magic to do with `builtins` here, but the important thing is this — use the code above to get dbt to leave the database name out when resolving `ref` functions.
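If it helps, dbt’s `builtins` namespace plays much the same role as Python’s `builtins` module: you shadow a name with your own version, but keep a handle on the original so you can delegate to it. A rough Python analogy (illustrative only, not dbt code):

```python
import builtins

# Shadow the built-in `print`, but delegate to the original via the
# `builtins` module -- just as the custom `ref` macro calls `builtins.ref`
# and then tweaks the result.
def print(*args, **kwargs):
    builtins.print("[dbt-style override]", *args, **kwargs)

print("hello")  # routed through our version, then the real print
```

The override wraps the original behavior rather than reimplementing it, which is exactly what `builtins.ref(model_name).include(database=false)` does.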
## Loom

Here’s a video of me walking through this code in a sample dbt project.

Happy `swap`-ing!