We started a POC with dbt in my company, to help us organize our BigQuery jobs.
In a past live, I developed a similar tool for Hive so I’m naturally quite enthusiastic about dbt.
There is one thing that keeps bugging me though: One of the best practice I developed across the years and that we follow in our company, is that the SQL jobs we write must have the very same name as the output SQL tables they generate. This allow us to avoid unnecessary mapping to remember what job generates what tables, and overall it really makes things simpler (including copy-pasting names when debugging).
I have tried several times applying this concept with dbt, but alas I keep running into issues that prevents me to do so in a clean way.
For instance, I tried to make two tables like this:
file models/test1.t1.sql
{{ config(schema='test', alias='t1', materialized='view')}}
SELECT 1 as a
file models/test2.t1.sql
{{ config(schema='test', alias='t2', materialized='view')}}
SELECT * FROM {{ ref('test.t1') }}
But when I do that, and run the command dbt docs generate, I get the following error message: model names cannot contain '.' characters
Is there a good reason to prevent model names from containing dots? I tried commenting out the code block that performs the check in utils.py and it worked perfectly fine.
So my question is this: is this check really necessary and I missed something, or would it be feasible to allow model names to contain dots? If yes, I would be happy to contribute with a pull request.
This requirement is a function of how dbt generates unique_ids for each node in the graph. Each node generated by dbt has a node constructed from the:
resource type (eg. “model”, “test”, “seed”, etc)
package name
resource name
These components are delimited with dots, resulting in unique_ids like:
model.my_project.my_model
I believe we disallowed dots in model names to ensure that unique_ids always have three dots, but I’m not certain that that’s a hard requirement anymore. If you’d like, you can feel free to create an issue and we can definitely take a deeper look!
Cool! TIL why dbt doesn’t support periods in file names!
I thought it might be useful to share that we often use double underscores in the similar way to how you want to use periods, e.g. stg_jaffle_shop__customers (see here).
With respect to this comment:
The SQL jobs we write must have the very same name as the output SQL tables they generate. This allow us to avoid unnecessary mapping to remember what job generates what tables, and overall it really makes things simpler (including copy-pasting names when debugging).
As you’ve noticed, to achieve this in dbt using the default behavior, you’d have to set the schema and alias config manually (which I think is probably very much the definition of “unnecessary mapping”). Fortunately, in dbt, you can change the way that dbt generates a schema name (docs here), and in v0.14.0, you can actually use the node argument to grab the name of your model. This opens up a world where you can use Jinja expressions to parse the '__' in the model name (or the '.' if there’s a PR to this effect), and then using the model prefix as your custom schema name!
This would look something like:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{#- Check if we can parse the model name for the schema prefix -#}
{%- if '__' in node.name -%}
{%- set model_prefix = node.name.split ('__') [0] -%}
{%- else -%}
{%- set model_prefix = None -%}
{%- endif -%}
{#- If we have neither a custom schema, nor a model prefix, use the default schema -#}
{%- if custom_schema_name is none and model_prefix is none -%}
{{ default_schema }}
{#- Otherwise concat the custom schema or model prefix to the default schema -#}
{%- else -%}
{#- Preference the custom_schema_name over the model_prefix -#}
{{ default_schema }}_{{ (custom_schema_name or model_prefix) | trim }}
{%- endif -%}
{%- endmacro %}
You can also do the same to generate an alias name (check out the docs here). This might save you a lot of repetitive model configuration!
Thank you very much @drew and @claire for your swift responses!
Indeed, I was thinking that the use of dots in model names was prevented because of the unique_id convention that already uses dots. Perhaps there is a way to allow them despite that.
I will create an issue on the github repository to further discuss this, then.
Indeed, this is exactly what we currently do!
I have been trying to tinker with custom schema names like you suggested but the only solution I could came up with required to change the unique_id scheme of models as well.
The problem I had was that I tried having two files /models/schema1/table1.sql and /models/schema2/table1.sql, but I got errors because they both ended up with the same unique_id which is derived from the file name. So I had to change that and base the unique_id on the whole relative path instead.
I can provide more details about that on the github issue I will create if you want, or maybe I will create a separate issue to keep things clean.
I’m looking forward for the v0.14.0 release to try out what you suggest.
P.S.
I haven’t maintained flamy recently, because my new company mostly uses pyspark instead of Hive. But we developed a similar tool for pyspark. Maybe we will open source it one day, once the dust settles. We currently use it for the pyspark workflows, alongside dbt for the BigQuery workflows.
Unless this uses some recently released feature, I think I already tried doing this.
It might solve parts of my use case but the other problem I was stuck with what that the model ids are based on the file name, forcing us to use a different file name for every file, even if they are in different subfolders. What would help me would a macro that let us define the model id using the file path (or maybe even just enable such feature by default, as I don’t think it would break currently existing use cases)
Sadly I didn’t have time to work on it much recently, but hopefully I’ll find time to propose a change with a pull request to engage a discussion about this. I will also start by trying the suggested solution again and see if it is sufficient or not.
Yep, even with this macro, you still need unique filenames that don’t contain dots. This was more to show how you could use a different separator (__) to ensure you have distinct file names, and use the first part as your schema name, and the second part as your relation name.