Hello all dbt-ers!
There has been a lot of discussion in various places about how folks set up their dbt environments, databases, and schemas/datasets. I'd like to share the specific changes we have made to generate_schema_name, along with the basics of how we set up our build environment, in the hope that it will be useful to others and that they will share theirs in turn. Here is our code:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {# If in dbt Cloud, we always want models marked to generate in custom schemas
       to generate in just that specific schema, not with the prefixed default schema #}
    {%- elif target.name in ['dev-cloud', 'prod-cloud'] -%}
        {{ custom_schema_name | trim }}
    {# If running in a local non-dbt-Cloud environment then use the local schema
       name no matter what #}
    {%- elif target.name not in ['dev-cloud', 'prod-cloud'] -%}
        {{ default_schema }}
    {# Should never arrive here #}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
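For context, the custom_schema_name argument is whatever +schema config a model (or folder of models) carries. A minimal dbt_project.yml sketch of how user-facing models might be marked (the project and folder names here are hypothetical, not our actual ones):

# dbt_project.yml (sketch; "our_project" and "marts" are hypothetical names)
models:
  our_project:
    marts:
      +schema: analytics   # these models hit the custom-schema branches above
    # models with no +schema config fall into the "custom_schema_name is none" branch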
So the basics of it are this:
- We use BigQuery and have two different projects, one project for all dev datasets and data, and one for all prod datasets and data.
- Our developers develop locally with dbt Core in feature branches. They submit PRs for the feature branches; if approved, the PRs get merged into our dev branch. Periodically we then merge from dev to master.
- We want our scheduled dbt Cloud runs in dev and prod to be as close as possible, to catch any issues, and we also want our devs to be able to develop in their own “sandboxes” without stepping on (or being stepped on by) the scheduled dbt Cloud jobs in Dev or by each other.
- Here is the list of datasets in our Dev environment and their purposes:
  a) “analytics” - where end-user-facing published models are made available
  b) “sandbox” - where all intermediate, non-user-facing temporary processing and models live
  c) “snapshot” - where our snapshots live
  d) a bunch of other datasets for raw sources as they are loaded
  e) “dev_name_sandbox” - variants of “sandbox” where individual devs do their development and testing when running with dbt Core (see the profiles.yml sketch below)
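For (e), the personal sandbox is just the dataset on the developer's local target. A minimal profiles.yml sketch, assuming OAuth auth; the profile name, project ID, and dataset name are all hypothetical:

# profiles.yml (sketch; all names and project IDs are hypothetical)
our_profile:
  target: dev-local
  outputs:
    dev-local:
      type: bigquery
      method: oauth
      project: our-dev-project       # the Dev BigQuery project
      dataset: dev_alice_sandbox     # this developer's personal sandbox
      threads: 4

Because "dev-local" is not in ['dev-cloud', 'prod-cloud'], the macro above sends every model, custom schema or not, into that one dataset.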
What we want from our generate_schema_name macro is this: in our dbt Cloud runs for both Dev and Prod, publish all user-facing models into the analytics schema and all temporary/intermediate models into sandbox; for developers working locally, publish all models, user-facing or not, into the developer's own sandbox, as specified in their profiles.yml. This also covers any other custom schema overrides we may add for other models in the future. For now we use target names in dbt Cloud to differentiate the two cases, though we are definitely hoping to eventually use an environment variable or something similar to distinguish dbt Cloud jobs from locally run jobs.
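As a sketch of that env-var direction (untested; DBT_CLOUD_RUN is a hypothetical variable name we'd set only in the dbt Cloud environments), dbt's env_var() function with a fallback default could replace the target-name checks:

{# Hypothetical variant: relies on DBT_CLOUD_RUN=true being set in the
   dbt Cloud environments; env_var() falls back to 'false' for local runs #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif env_var('DBT_CLOUD_RUN', 'false') == 'true' -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}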
Interested to see what others have done!