I want to use different databases in my model creation. I want my model to be created in my dev database but use the data from the production environment. I can produce this behaviour manually (I'm on BigQuery) but I fail at automating it.
To do that, I tried to leverage the {{ this }} and {{ selected_resources }} Jinja variables, but it fails because one is empty at compilation and the other at execution.
On the other hand, if you need to create upstream models in dev, you could avoid it using the defer flag (Defer | dbt Developer Hub). Missing upstream models will be resolved from the prod environment.
In this case, you should specify a database in the configuration. You would specify that modelA should be created in project1 and modelB in project2, and dbt will handle everything else.
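For example, something like this in dbt_project.yml should do it (my_project and the project IDs are placeholders, and I'm assuming both models sit at the root of the models/ folder; on BigQuery, the database config maps to the GCP project):

models:
  my_project:
    modelA:
      +database: project1
    modelB:
      +database: project2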
I think it’s a bit more complicated than that. @wefo
I want some logic that chooses the database based on whether a model is selected in the dbt run.
So if I do:
dbt run -s modelA => modelA is created in project1 (database1):
CREATE TABLE database1.table_A
AS
SELECT * FROM database2.table_B
dbt run -s modelB => modelB is created in project1 (database1):
CREATE TABLE database1.table_B
AS
SELECT * FROM database2.table_A
As we can see, the “database” in the configuration is not fixed; it depends on whether the model is part of {{ selected_resources }}.
Based on your initial post, you want to run some models in the development environment but avoid recreating required models and instead use production data.
In this case, use the defer flag (Defer | dbt Developer Hub) and the state flag pointing to the production manifest.json.
When creating model_A, dbt will check whether model_B exists in the development environment and refer to it; if not, dbt will take model_B from production.
P.S. model_A and model_B cannot refer to each other because you will get a loop in your DAG and the dbt run will fail.
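For example (the state path here is just illustrative):

dbt run -s model_A --defer --state prod_run_artifacts

With this, the {{ ref('model_B') }} inside model_A resolves to the production relation recorded in the manifest, as long as model_B is neither selected nor already built in dev.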
Lol, I can see that, because it's still not what I want. I will share the code that doesn't work properly; maybe it will help clarify.
This is my model configuration:
{{ config(
    materialized = 'table',
    schema = 'mart',
    database = get_mart_database(this)
) }}
and this is my macro:
{% macro get_mart_database(model) %}
    {# debug output: check what is available at parse time vs execution #}
    {{ print(model) }}
    {{ print(selected_resources) }}
    {%- if model.identifier in selected_resources -%}
        {%- do log(model.identifier ~ " is included based on the current selection", info=true) -%}
        databaseA
    {%- else -%}
        {%- do log(model.identifier ~ " is not included based on the current selection", info=true) -%}
        databaseB
    {%- endif -%}
{% endmacro %}
This code doesn't work because one of the two variables is empty at compilation and the other at execution.
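For what it's worth, my reading of the dbt docs is that selected_resources is only populated during execution (when execute is true); at parse time, when configs are rendered, it is an empty list, which would explain the behaviour. Also, as far as I can tell, it contains unique IDs like model.my_project.model_A rather than bare identifiers. Here is a guarded sketch (still not a fix, since, as far as I understand, the database config is taken from the parse-time render):

{% macro get_mart_database(model) %}
    {#- selected_resources holds unique IDs such as 'model.my_project.model_A',
       so we compare against the last segment of each ID -#}
    {%- set selected = [] -%}
    {%- if execute -%}
        {%- for res in selected_resources -%}
            {%- do selected.append(res.split('.')[-1]) -%}
        {%- endfor -%}
    {%- endif -%}
    {%- if model.identifier in selected -%}
        databaseA
    {%- else -%}
        {# parse time, or model not selected: fall back to the default #}
        databaseB
    {%- endif -%}
{% endmacro %}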
Any idea how I can make it work? @wefo
Ok, now I see what kind of technical problem you are facing. Could you also provide more context on why you are doing it and what result you want to achieve?
Why do I want to do it?
Some team members need to create datasets; to do so, we allow them to contribute to our dbt project.
They perform all of their tests in dev, and once they're happy with their model they open a Pull Request to our repo. They have read-only rights on my production environment and create rights on my development environment.
So they build models in development, but data consistency in dev is not always perfect. And since they build complex models, we allow them to build dev tables based on production data for an easier feedback loop.
That's why I need a way to have my model created in dev while all the {{ ref }} calls inside my model refer to my production environment. I thought the way to do it was the macro I wrote, but as you can see it's not working.
I don't want to write an entire article about dev and prod, but all my posts and hints were guiding towards the solution. Let's go step by step, because I have implemented it.
Here is an example of lineage: model_A → model_B → model_C.
You should have a compiled manifest.json file. You could place it somewhere accessible for your developers.
You give your developers the BigQuery Data Editor and BigQuery Job User permissions in the development project, and the BigQuery Data Viewer permission in the production project.
The developer writes or modifies a model and runs the following command: dbt run -s model_B --target dev --defer --state folder_path_to_production_manifest
Read this article carefully to understand what the defer and state flags do (Defer | dbt Developer Hub).
Simply put, it will create model_B in the development project under the dataset dev_alice, using data from model_A in the production project.
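Roughly, the compiled SQL would look like this (dev_project and prod_project are illustrative names for the two GCP projects):

CREATE TABLE dev_project.dev_alice.model_B
AS
SELECT * FROM prod_project.prod_dataset.model_A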
Furthermore, if the developer runs another command: dbt run -s model_C --target dev --defer --state folder_path_to_production_manifest
Given that model_B already exists in the dev_alice dataset, it will refer to this modified model in the dev environment.
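And here the compiled SQL would point at dev instead (again, names are illustrative):

CREATE TABLE dev_project.dev_alice.model_C
AS
SELECT * FROM dev_project.dev_alice.model_B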
This workflow allows us to go further through the pipeline and implement the required code changes without modifying production data, and to test whether the changes work with other models.
When the development part is over, the developer pushes code and creates a pull request.
Summary
The main point is that in most cases you should handle the environment using the target flag and define the target project and dataset in the profiles.yml file.
Regarding production data, you should use the defer and state flags to dynamically redirect references to production models when they do not exist in the development dataset.
NOTES:
You should regularly update the compiled manifest.json file so that developers always point to the newest version (see the sketch after this list).
Each developer defines their development dataset in the profiles.yml file. It is possible to write a more advanced schema name macro if there is a need to change the dataset based on a Jira task, for example (Custom schemas | dbt Developer Hub).
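A sketch of how that manifest refresh could be automated (the bucket path is a placeholder; any shared location your developers can read works):

# scheduled CI step after each production deployment
dbt compile --target prod
gsutil cp target/manifest.json gs://my-company-dbt/prod-artifacts/manifest.json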
Thanks for the time you took to give a well-structured answer!
I have been working with --defer and --state previously, and the reason why I think this is not a solution in my case is that I don't want this behaviour: “Given that model_B already exists in the dev_alice dataset, it will refer to this modified model in the dev environment.”
Let's say I want to build model_C in dev. I want that, whether or not model_A and model_B exist in my current environment (dev), I will still be using production data (because we don't know if model_A has been built with production data or dev data, or if the data in dev is up to date).
And if I want to build model_B, then model_A should point to the prod environment. So there has to be a macro deciding which environment to use, based on whether the model is part of {{ selected_resources }}, no?
To reinforce the fact that the --defer flag is not my solution: this mention in the --defer documentation is incompatible with what I want to produce =>
When the --defer flag is provided, dbt will resolve ref calls differently depending on two criteria:
Is the referenced node included in the model selection criteria of the current run?
Does the referenced node exist as a database object in the current environment?
If the answer to both is no—a node is not included and it does not exist as a database object in the current environment—references to it will use the other namespace instead, provided by the state manifest.
Do you see why I think the --defer flag is not a solution in my particular case?
Yes, I get your situation, but it’s very strange anyway.
Have you thought about misalignment? If I add a column to model_B and test it in dev, and after that I want to adjust model_C to include this column, I have to take into account that the column exists, so I cannot take production data; I should use the model_B that I already created. This workflow ensures that there will be no discrepancies between related models when I push all changes to production.
On the other hand, maybe you work on entirely different tasks and want to make independent adjustments to model_B and model_C. And as you mention, in this case model_B should take model_A from prod, and model_C should take model_B from prod.
I have solved this using a custom schema name: my developers do not work on a dedicated dataset (which we could set in the profiles.yml file); instead, they pass an additional argument (a Jira task ID) to the dbt run command. Then we have a defined custom schema generator (Custom schemas | dbt Developer Hub) which takes that argument, e.g.
dbt run --select model_B --target dev --vars 'task: JIRA_ID'
this compiles to dev_project.JIRA_ID.model_B
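A minimal sketch of such a generator, done by overriding dbt's built-in generate_schema_name macro (the task var name mirrors the command above; the fallback branches reproduce dbt's default behaviour):

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set task_id = var('task', none) -%}
    {%- if target.name == 'dev' and task_id is not none -%}
        {# route the whole run into a per-task dataset, e.g. JIRA_ID #}
        {{ task_id | trim }}
    {%- elif custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ target.schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}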
Obviously, you should remember not to create model_A in your dev environment so that defer can do its job; otherwise it will not work.
What I like about it is that during a pull request we can check not only the code, but also how the data looks in the dev environment. Furthermore, a developer can work on different tasks simultaneously without mixing different code changes in the same dataset. After the pull request passes all tests, we delete the associated dataset.