Goal
The goal is to build slim Development (dev) and Continuous Integration (CI) environments that run in a database separate from production.
The challenge
Currently, dbt allows us to configure a sandbox environment using the --defer and --state arguments when compiling, running or testing models. The article Defer | dbt Developer Hub explains what Defer is, its benefits and its challenges.
So, let’s dive into a scenario:
The production database has one schema called transform, one source table source_a, and three models, model_a, model_b and model_c, as demonstrated below.
Their SQL definitions are:
-- Model A -- model_a.sql
Select * from {{ source('transform', 'source_a') }}
-- Model B -- model_b.sql
Select * from {{ ref('model_a') }}
-- Model C -- model_c.sql
Select * from {{ ref('model_b') }}
-- Compiled results
-- Model A -- model_a.sql
Select * from prod.transform.source_a
-- Model B -- model_b.sql
Select * from prod.transform.model_a
-- Model C -- model_c.sql
Select * from prod.transform.model_b
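For the source('transform', 'source_a') call to compile to prod.transform.source_a, a source definition along these lines would exist in the project (a minimal sketch; the source name and file name are assumptions):

# models/sources.yml
version: 2

sources:
  - name: transform
    database: prod        # keeps the source pointing at production, even from dev
    schema: transform
    tables:
      - name: source_a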
Let’s suppose we want to make a change to model_b. However, model_a is a huge table that takes a long time to rebuild and carries compute and storage costs. So, we want to consume model_a from production and make our changes only to model_b.
dbt allows us to do that using --defer and --state. So, I make my changes to model_b and run a command that looks like this:
dbt run --select model_b --defer --state ./path_to_last_prod_manifest_artifact
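For context, --state expects a directory containing the artifacts of the last production run, so the workflow looks roughly like this (a sketch; the prod_artifacts directory name and copy source are assumptions, and in practice the manifest usually comes from a CI artifact store):

# make the last production manifest available locally
mkdir -p prod_artifacts
cp /path/to/last_prod_run/target/manifest.json prod_artifacts/

# build only the changed model, deferring unchanged parents to production
dbt run --select model_b --defer --state ./prod_artifacts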
The compiled code of model_b would be something like this:
-- Modified and compiled Model B -- model_b.sql
create view prod.dbt_rafael_transform.model_b as
Select * from prod.transform.model_a
So, model_b would be created in the production database, but in the dbt_rafael_transform schema. Now, we have something like this:
The main benefit is that we can run tests and execute our code in a slim way, saving costs and speeding up development. This process also powers CI, which means CI only needs to build the changed models, as sketched below.
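For example, a CI job can combine defer with the state:modified+ selector so that only the changed models and their downstream dependencies are built and tested (a sketch; the artifact directory name is an assumption):

# build and test only the models changed in this branch, plus their children
dbt build --select state:modified+ --defer --state ./prod_artifacts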
However, I see one issue with this solution: imagine that three or four people are working on changes at the same time. That means three or four new schemas in the production database, each being processed and modified alongside the production schema. This is not safe, because everyone should have only read access to the production database.
So, the behaviour I would expect when using the --defer and --state arguments is as below:
When I run dbt run --select model_b --defer --state ./path_to_last_prod_manifest_artifact, the compiled code would be something like this:
-- Modified and compiled Model B -- model_b.sql
create view dev.dbt_rafael_transform.model_b as
Select * from prod.transform.model_a
This way, development builds land in the dev database, keeping the production database clean and secure.
The question is: how can I achieve it?
Do I need to override the ref or generate_database_name macros? Should this be handled by --defer and --state? Is this a bug in dbt?
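For reference, this is the kind of generate_database_name override I have in mind, if that macro is the right lever (a sketch; the target name prod and the database name dev are assumptions about my profiles.yml):

-- macros/generate_database_name.sql
{% macro generate_database_name(custom_database_name=none, node=none) -%}
    {%- if target.name != 'prod' -%}
        {# every non-production run builds into the dev database #}
        dev
    {%- elif custom_database_name is none -%}
        {{ target.database }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}
{%- endmacro %}

As far as I understand, this would only affect the models built in the current run; deferred references are resolved from the production manifest, so they would still point at prod.transform.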