Best practice splitting SQL into staging, intermediate and marts layers and naming conventions

I currently have a DAG running in Airflow that consists of 22 steps and next to this uses some sources that are created in this scope as well. This is all SQL so very much suited to bring over to dbt. However, I’m wondering what the best practice is to separate and split up SQL operations into the different layers and how to handle intermediate steps properly and how to organize this in proper folder structure.
I want to keep/create clarity without making it to complex by breaking it up into to many pieces.

  • do you always use a separate staging model if you do anything more than a SELECT * on a source? (ie. make some field lower case and remove spaces)
  • when you build a certain collection of data to be used as a source in 1 process only, would you consider this to be a mart layer and a FACT table or do you keep it in staging layer (ie. a combined collection of products that are located in different sources, to further use to map against in downstream steps)
  • when simply combining two intermediate models with a UNION for further use in downstream intermediate models, would you consider this to be a model in the staging layer since it is such a simple operation, or do you ‘never go back’ and is this also considered to be an intermediate model?
  • From the documentation it seems to be good practice to have the layers in the top level with specific subfolders. Do I understand this correctly? I could imaging building the structure the other way around since you then have everything related to 1 topic in one place. ie
models/staging/topic
models/intermediate/topic
models/marts/topic

versus

models/topic/staging
models/topic/intermediate
models/topic/marts
  • how do you handle/name different levels of models that can be considered the ‘output’ of the data pipeline? I guess these are considered to be marts, but in this case I would have one I currently call source__* as it contains all the data from the previous steps and this is sometimes used by myself and a few other people for analyses. Then there is a more commonly used model that applies some filtering and the last interpretation logic on the source. This table is used more by other people.
  • then lastly there will be yet more aggregated tables build on top of the last model from the pipeline, these will be used for dashboarding mostly. This is what might be considered a presentation layer, but I seem to be missing this concept within dbt?

My apologies if I’m mixing different data modelling terms, trying to understand how this would all work in an ‘ideal dbt’ world to prevent more complexity by mixing different modelling and naming conventions approaches.
This data pipeline is quite complex and even I (who build it in the first place) find it quite challenging to debug different parts, so I want to improve this while porting it all over to dbt. That’s why I appreciate all input on how you handle these kind of things

Staging layers are good practice over directly using the source because it allows you to handle changes in your source. For example, imagine a column in the source is renamed. If you used the source directly, you would have to update everything using that source for the column rename. If you use a staging layer, then you would only have to do the column rename in one place.

I would suggest looking into the medallion architecture as an alternative to just staging/intermediates/marts. The main difference is that there’s a “silver” layer that’s meant for others to build data models off of, which you would do joins and stuff off of. This is different from the “gold” layer that’s aggregated or whatever for a specific reporting need.

The problem with trying to use multiple source tables in a staging model is handling changing needs. If you ever need to use any of those source tables individually, then you can run into a problem of not having its own staging table. Having an intermediate or silver model is always an option instead.

Organizing your subfolders by type/topic works well for handling default configurations in your dbt_project.yml file. It’s easy to configure the entire marts folder to materialize as tables for example. In contrast, I usually don’t find myself needing to setup configurations per topic.

We handle them by organizing it on the database side using the custom database/schema/alias macros. At the overarching database level, we split the models into internal/staging+intermediate, data warehouse/catalog, and marts/reports/analytics databases. That way it’s clear when a dataset is built for a specific report and shouldn’t be used or if a dataset is meant to be used by everyone.

Note: @Jack Nguyen originally posted this reply in Slack. It might not have transferred perfectly.

2 Likes