Extracting schema and model names from the filename

Default Schema and Model Names in dbt

By default, dbt builds materialized models in the schema dictated by the user’s profile target. Furthermore, dbt names the model accordingly using the filename of the model’s SQL file. This is refered to as the model’s “alias”.

For example, let’s suppose I’m doing some analysis on some Stripe payment data. If my target schema is dbt_brandyn and I have a model called payments.sql, the defaults described above would build my model in dbt_brandyn.payments. This is a great default for quick prototypes and small transformation pipelines. However, once you start implementing large dbt projects into production, you can imagine the need for more flexibility in where tables and views are materialized.

A while ago, dbt added the schema and alias parameters for models. Suppose we edit payments.sql to include the following configuration:

-- models/stripe_payments.sql
{{ config(schema='stripe', alias='payments', ...) }}
select ...

Then, in this case, the model is materialized as dbt_brandyn_stripe.payments. This might not be the schema that you’d prefer, but this points to the flexibility of dbt’s architecture. Somewhere in dbt’s code base, dynamic schema names and aliases are being generated. In fact, they are being generated by the generate_schema_name macro (code here) and the generate_alias_name macro (code here). It’s worth reviewing those snippets and convincing yourself that they do, in fact, produce the results above.

Our First Override

What’s more, you can override these macros in your dbt project. Using an override, you can effectively define a function that specifies the schema and alias for each model based on parameters in your profile (e.g., target, if you want to distinguish between dev and prod runs) and many of the properties intrinsic to each node (e.g., it’s name, config parameters like schema and alias, tags, materialization strategy, etc). It’s extremely flexible and details for overriding these macros can be found in dbt’s documentation.

We would probably prefer to materialize payments.sql as stripe.payments in our warehouse. In order to do so, we can override the generate_schema_name macro as follows:

-- macros/custom_naming_macros.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
  {%- set default_schema = target.schema -%}
  {%- if custom_schema_name is none -%}
    {{ default_schema }}
  {%- else -%}
    {{ custom_schema_name | trim }}
  {%- endif -%}
{%- endmacro %}

Furthermore, let’s say you have payments from two different sources: stripe and braintree. You might want to materialize one model as stripe.payments in your warehouse, and the other as braintree.payments. In dbt, you can’t have two files with the same name, so instead of calling them stripe/payments.sql and braintree/payments.sql, you need to have distinct names (like stripe_payments.sql and braintree_payments.sql). Use the schema and alias parameters as above. For example:

-- models/stripe_payments.sql
{{ config(schema='stripe', alias='payments', ...) }}
select ...

-- models/braintree_payments.sql
{{ config(schema='braintree', alias='payments', ...) }}
select ...

A More Advanced Example: Using the Filename

I’d like to provide an example of how we use these macros at my company. In particular, we have a number of contributors working in a Redshift warehouse. Our contributors are not super users and each has a staging schema like described earlier; mine is dbt_brandyn. When a user sets up their dev target in their profile, this will be the schema they use.

We settled on a naming convention for models that are materialized as tables and views. They are always named schemaname__modelname.sql, with two underscores separating schemaname and modelname. In production, we want these models to be materialized at schemaname.modelname. This naming convention also makes it obvious what table your referencing when you use ref('schemaname__modelname').

However, when a user is developing, we need to build these models in the user’s staging schema. We need to be careful not to materialize at dbt_brandyn.modelname because that would result in potential name collisions (e.g., in the payments example above, we can’t materialize the Stripe payments and Braintree payments both at dbt_brandyn.payments). We settled on the following format: if I’m building a model in my staging schema using the dev target, it should be built to dbt_brandyn.schemaname__modelname.

Once we agreed on the above convention, we set out to make dbt do that. We override generate_schema_name and generate_alias_name as follows:

-- macros/custom_naming_macros.sql

{% macro generate_schema_name(custom_schema_name=none, node=none) -%}
  {%- if target.name == 'dev' -%}
    {{ target.schema }}
  {%- elif target.name == 'prod' and custom_schema_name is not none -%}
    {{ custom_schema_name | trim }}
  {%- elif target.name == 'prod' and custom_schema_name is none -%}
    {% set node_name = node.name %}
    {% set split_name = node_name.split('__') %}
    {{ split_name[0] | trim }}
  {%- else -%}
    {{ target.schema }}
  {%- endif -%}
{%- endmacro %}

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
  {%- if target.name == 'dev' and custom_alias_name is none -%}
    {{ node.name }}
  {%- elif target.name == 'dev' and custom_alias_name is not none -%}
    {{ custom_alias_name | trim }}
  {%- elif target.name == 'prod' and custom_alias_name is not none -%}
    {{ custom_alias_name | trim }}
  {%- elif target.name == 'prod' and custom_alias_name is none -%}
    {% set node_name = node.name %}
    {% set split_name = node_name.split('__') %}
    {{ split_name[1] | trim }}
  {%- else -%}
    {{ node.name }}
  {%- endif -%}
{%- endmacro %}

Note that via this implementation, contributors do not need to specify schema and alias parameters; instead, this macro extracts them from the filename of the SQL file, which is stored in node.name. However, if schema and alias parameters are specified, they are used accordingly. If you want to disable the schema and alias parameters completely, you can remove that logic the from macro outright. For example, generate_schema_name would become:

-- macros/custom_naming_macros.sql
{% macro generate_schema_name(custom_schema_name=none, node=none) -%}
  {%- if target.name == 'dev' -%}
    {{ target.schema }}
  {%- elif target.name == 'prod' -%}
    {% set node_name = node.name %}
    {% set split_name = node_name.split('__') %}
    {{ split_name[0] | trim }}
  {%- else -%}
    {{ target.schema }}
  {%- endif -%}
{%- endmacro %}

In this formulation, even if the contributor adds a schema parameter, it is ignored by the macro. This could lead to confusion among contributors who are familiar with how dbt “should work”. Thusly, I like to add a compilation error if a schema parameter is defined for the model.

-- macros/custom_naming_macros.sql
{% macro generate_schema_name(custom_schema_name=none, node=none) -%}

  {%- if custom_schema_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a schema configured. This is not allowed.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}

  {%- if target.name == 'dev' -%}
    {{ target.schema }}
  {%- elif target.name == 'prod' -%}
    {% set node_name = node.name %}
    {% set split_name = node_name.split('__') %}
    {{ split_name[0] | trim }}
  {%- else -%}
    {{ target.schema }}
  {%- endif -%}

{%- endmacro %}

Hopefully, this post illustrates the awesome flexibility of overriding these macros.

5 Likes

Super cool writeup @brandyn! Thanks for sharing :smiley:

This is so so great! I’ve thought about doing this before, but have never done it myself – really glad that this exists, thank you!