Using folders inside models


#1

I’ve been using dbt for a while now. :heart: One thing I’d love to figure out is how to best approach using folders in models.

I’ve seen some projects using folders to group steps of a pipeline (similar to what’s in the Stripe Model). Other projects models are splitted in folders representing “submodels”, each folder contain similar SQL queries (by business, by stage, …).

Also, they generally use a base folder which I guess is for the sources. I’m not sure of the best use of base though!

We’re currently organizing our project with folders containing entities like mrr or trials but I’d love to hear if there are any best practices or suggestions around this.

cc @tristan


#2

There are two structures inherent in a dbt project: the dependency graph (as defined by your ref() calls) and the hierarchical folder structure. The graph gets build semi-implicitly as you go through the process of building out your transformations, but the folder structure gives you a whole other dimension to play with. That freedom makes the folder structure (and the hierarchical configuration in dbt_project.yml) really powerful, but it’s hard to know what to do with all that power! At some level your question is one of data warehouse design and modeling practices, and that’s what I’ll try to hit in my answer.

We’ve started to standardize the way we structure our folders on all of our client work, although for a long time it was pretty freeform. Our ideas on this are loosely based around Kimball data modeling (emphasis on the word loose!).

  • First, we have staging models. We put all of these within a folder called staging. The subfolders within staging are all directly associated with a single source data system. They’re called things like mongo_prod or salesforce or adwords. The models inside those folders have a single job: produce a model that acts as a clean interface to each source data table. So, if there are 12 tables in mongo that need to be used in analysis, there should be one model for each of them, and we name each of those models stg_[entity_name]. Note that it may take multiple models to produce a clean output model for a given source data table if there is a lot of data cleansing required! But there should only be a single model called stg_[entity_name].
  • Next we have business data “marts”. These business marts are frequently organized by function (marketing, sales, finance). Within each business mart we have dimension and fact models, named dim_[entity_name] and fct_[fact_name]. Organizing the folders within these business data marts is up to the designer for that mart; the folder structure should be useful in configuring models as cleanly and efficiently as possible.

So, in practice, the folders for our projects look like this:

staging
  > mongo_prod
  > adwords
  > stripe
finance
  > mrr
  > revrec
product
  > users
  > accounts
...

We’ve found that this works pretty well for us.


#3

I anxiously await the addition of tags to dbt. Right now our models are broken up by where the are in the pipeline. I have after_stage for models that depend only on the stage loaders being complete, after_ga for those things that require our post-processing a GA data to be complete, and after_events for models that require the events dataset to have been populated.

In after_stage I have base, work and public subfolders, which control into which schema the models are placed. In the others, I have only work and public.

I’d like to get to a point where I can structure the folders by subject area, as in @tristan’s example, and use tags to indicate into which “execution group” a model should fall.