How we (used to) structure our dbt projects

:new: There’s now an updated, more in-depth guide to structuring dbt projects over on the dbt Developer Hub, written by Gwen Windflower.

I wrote this article all the way back in May 2019, pretty soon after starting at dbt Labs (named Fishtown Analytics at the time), with the help of JerCo and Tristan. I’m both incredibly proud of the impact that it has had, and elated to see it get a refresh for the newest generation of analytics engineers. I guess this version can live on, but you should really go read the updated version instead :slight_smile:
Claire, June 2022

As the maintainers of dbt, and analytics consultants, at Fishtown Analytics we build a lot of dbt projects. Over time, we’ve developed internal conventions on how we structure them.

This article does not seek to instruct you on how to design a final model for your stakeholders — it won’t cover whether you should denormalize everything into one wide master table, or have many tables that need to be joined together in the BI layer. There are entire books on this topic. Instead, once you already have an idea of what you’re building, use this as a guide for how to break the transformations up into separate dbt models.

It’s important to note that this is not the only, or the objectively best, way to structure a dbt project. Rather, this document reflects our current opinions. These opinions are strongly influenced by:

  • our views on data model design, which in turn are influenced by:
    • the kinds of analytics problems we are solving for clients
    • the data stack we typically work within, in which multiple data sources are loaded by third party tools, and the data warehouse is optimized for analytical queries (so we aren’t tightly bound by performance optimization considerations).

Our opinions are almost guaranteed to change over time as we update our views on modeling, are exposed to more analytics problems, and data stacks evolve. It’s also worth clearly stating here: the way we structure dbt projects makes sense for our projects, but may not be the best fit for yours! This article exists on Discourse so that we can have a conversation – I would love to know how others in the community are structuring their projects.

In comparison, the (recently updated) best practices reflect principles that we believe to be true for any dbt project. Of course, these two documents go hand in hand – our projects are structured in such a way that makes those principles easy to observe, in particular:

  • Limit references to raw data
  • Rename and recast fields once
  • Group your models in directories
  • Add tests to your models
  • Consider the information architecture of your data warehouse
  • Separate source-centric and business-centric transformations

If you want to see what the code for one of our projects looks like, check out this demonstration dbt project.

We also recently held (and recorded) an office hours on this topic – this article provides a high level outline, but there’s a lot more detail and discussion in the video.

Lastly, before I dive in, a huge thank you to @jerco for not only teaching me a lot of the material in this article, but also for doing a lot of the groundwork that went into this article – entire sections of this article are in fact lifted from his work.

Data transformation 101

The data in any of our projects has three distinct checkpoints:

  1. Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.
  2. Staging models: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.
  3. Marts models: Models that represent business processes and entities, abstracted from the data sources that they are based on.

In a simple project, these may be the only models you build; more complex projects may have a number of intermediate models that help along this journey, as well as accessories to these models (see below).

Still confused? An example might help! Let’s think about a software business that uses both Stripe and Braintree to collect subscription payments. Their three stages of modeling might look like:

  1. Sources: Payment records from the Stripe API and payment records from the Braintree API, loaded into their data warehouse by a third party tool.
  2. Staging models: Both the Stripe and Braintree payments are recast into a consistent shape, with consistent column names.
  3. Marts models: A monthly recurring revenue (MRR) model that classifies revenue per customer per month as new revenue, upgrades, downgrades, and churn, to understand how a business is performing over time. It may be useful to note whether the revenue was collected via Stripe or Braintree, but they are not fundamentally separate models.

Of note here is that there is a distinct change that occurs between the staging and marts checkpoints – sources and staging models are source-centric, whereas marts models are business-centric.

In our dbt projects, this leads us to our first split in our models/ directory which helps us make this distinction:

├── dbt_project.yml
└── models
    ├── marts
    └── staging

Staging raw data

The goal of the staging layer is to create staging models. Staging models take raw data and clean and prepare it for further analysis. For a user querying the data warehouse, a relation with a stg_ prefix indicates that:

  • Fields have been renamed and recast in a consistent way.¹
  • Datatypes and timezones are consistent.
  • Light cleansing, such as replacing empty strings with NULL values, has occurred.
  • If useful, flattening of objects might have occurred.
  • There is a primary key that is both unique and not null (and tested).

Staging models can have joins in them to field additional columns for context or enrichment; add rows through unions and remove them through filters; deduplicate a natural key or hash together a surrogate one.
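
As a minimal sketch of what that can look like (the source, the columns, and the deduplication key here are illustrative rather than taken from a real project), a staging model that renames, lightly cleans, and deduplicates might be structured like this:

with source as (

    select * from {{ source('braintree', 'customers') }}

),

cleaned as (

    select
        id as customer_id,
        nullif(email, '') as email,
        convert_timezone('America/New_York', 'UTC', createdat) as created_at,

        -- keep only the most recent record per natural key
        row_number() over (partition by id order by updatedat desc) as row_num

    from source

)

select
    customer_id,
    email,
    created_at

from cleaned
where row_num = 1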

Because we often work with multiple data sources, in our staging directory, we create one directory per source.

├── dbt_project.yml
└── models
    ├── marts
    └── staging
        ├── braintree
        └── stripe

Each staging directory contains at a minimum:

  • One staging model for each object that is useful for analytics:
    • Named stg_<source>__<object>.
    • Generally materialized as a view (unless performance requires it as a table).
  • A src_<source>.yml file which contains:
    • Source definitions, tests, and documentation
  • A stg_<source>.yml file which contains:
    • Tests and documentation for the models in this directory

A staging directory for one of our sources might therefore end up looking like:

├── dbt_project.yml
└── models
    ├── marts
    └── staging
        └── braintree
            ├── src_braintree.yml
            ├── stg_braintree.yml
            ├── stg_braintree__customers.sql
            └── stg_braintree__payments.sql
    

Some dbt users prefer to have one .yml file per model (e.g. stg_braintree__customers.yml). This is a completely reasonable choice, and we recommend implementing it if your .yml files start to become unwieldy.
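
For reference, here’s a hedged sketch of what a src_braintree.yml might contain (the descriptions, columns, and tests are illustrative):

version: 2

sources:
  - name: braintree
    tables:
      - name: payments
        description: One record per payment, as loaded from the Braintree API.
        columns:
          - name: id
            tests:
              - unique
              - not_null

The stg_braintree.yml file follows the same shape, with a models: key listing stg_braintree__customers and stg_braintree__payments, their descriptions, and unique and not_null tests on each model’s primary key.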

But what about base models?

Earlier versions of the dbt documentation recommended implementing “base models” as the first layer of transformation – and we used to organize and name our models in this way, for example models/braintree/base/base_payments.sql.

We realized that while the reasons behind this convention were valid, the naming was an opinion, so in our recent update to the best practices, we took the mention of base models out. Instead, we replaced it with the principles of “renaming and recasting once” and “limiting the dependencies on raw data”.

That being said, in our dbt projects every source flows through exactly one model of the following form:

with source as (
    
    select * from {{ source('braintree', 'payments') }}
    
),

renamed as (
    
    select
        id as payment_id,
        order_id,
        convert_timezone('America/New_York', 'UTC', createdat) as created_at,
        ...
    
    from source

)

select * from renamed

We still refer to this as a base transformation. If your source data is in good shape, this transformation may be all that’s required to build a staging model; in that case, this SQL is the staging model.

However, building a staging model may warrant several models’ worth of cleaning, correcting, and categorizing, or may require a join or union to another source. To ensure our data source flows through a base transformation, we extend our DAG upstream of the staging model by creating a separate base model that we then select from.

In our dbt projects, we place these base models in a nested base subdirectory.

├── dbt_project.yml
└── models
    ├── marts
    └── staging
        └── braintree
            ├── base
            |   ├── base.yml
            |   ├── base_braintree__failed_payments.sql
            |   └── base_braintree__successful_payments.sql
            ├── src_braintree.yml
            ├── stg_braintree.yml
            ├── stg_braintree__customers.sql
            └── stg_braintree__payments.sql

In our projects, base models:

  • Often use the ephemeral materialization, so they are not exposed to end users querying our warehouse.
  • Are tested in a base.yml file within the same directory as the base models.

If we need additional transformations between base and staging models, we create a nested staging/<source>/intermediate directory and place these transformations in there.
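
To make the base-to-staging step concrete, here is a minimal sketch of what stg_braintree__payments.sql might look like when it unions the two base models shown above (this assumes both base models have already been renamed into the same column layout):

with successful_payments as (

    select * from {{ ref('base_braintree__successful_payments') }}

),

failed_payments as (

    select * from {{ ref('base_braintree__failed_payments') }}

),

unioned as (

    select * from successful_payments

    union all

    select * from failed_payments

)

select * from unioned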

Describing a business through marts

Marts are stores of models that describe business entities and processes. They are often grouped by business unit: marketing, finance, product. Models that are shared across an entire business are grouped in a core directory.

├── dbt_project.yml
└── models
    ├── marts
    |   ├── core
    |   ├── finance
    |   ├── marketing
    |   └── product
    └── staging

There are entire books written on how to design models, which is beyond the scope of this article. In our view of the world, our goal is to build fact and dimension models that are abstracted from the source data they rely on:

  • fct_<verb>: A tall, narrow table representing real-world processes that have occurred or are occurring. The heart of these models is usually an immutable event stream: sessions, transactions, orders, stories, votes.
  • dim_<noun>: A wide, short table where each row is a person, place, or thing; the ultimate source of truth when identifying and describing entities of the organization. They are mutable, though slowly changing: customers, products, candidates, buildings, employees.

Where the work of staging models is limited to cleaning and preparing, fact tables are the product of substantive data transformation: choosing (and reducing) dimensions, date-spining, executing business logic, and making informed, confident decisions.

This layer of modeling is considerably more complex than creating staging models, and the models we design are highly tailored to the analytical needs of an organization. As such, we have far less convention when it comes to these models. Some patterns we’ve found to be useful are:

  • fct_ and dim_ models should be materialized as tables within a warehouse to improve query performance. As a default, we use the table materialization, and where performance requires it, we use the incremental materialization.
  • Intermediate transformations required to get to a fact or dimension model are placed in a nested marts/<mart>/intermediate directory. They are named <useful_name>__<transformation_in_past_tense>.sql. The lack of a prefix and the use of double underscores indicate that these are intermediate models, not to be trusted by end users; it may also be worth hiding these models in a different schema.
  • Models are tested and documented in a <dir_name>.yml file in the same directory as the models.
  • Any extra documentation in a docs block is placed in a <dir_name>.md file in the same directory.
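
As an illustrative sketch (the column names here are hypothetical, though the model names match the directory tree below), a dim_customers.sql in marts/core might pull together staging and intermediate models like this:

{{ config(materialized='table') }}

with customers as (

    select * from {{ ref('stg_braintree__customers') }}

),

orders as (

    select * from {{ ref('customer_orders__grouped') }}

),

payments as (

    select * from {{ ref('customer_payments__grouped') }}

),

final as (

    select
        customers.customer_id,
        customers.created_at,
        coalesce(orders.number_of_orders, 0) as number_of_orders,
        coalesce(payments.total_amount_paid, 0) as total_amount_paid

    from customers
    left join orders
        on customers.customer_id = orders.customer_id
    left join payments
        on customers.customer_id = payments.customer_id

)

select * from final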

A marts directory may therefore end up looking like:

├── dbt_project.yml
└── models
    ├── marts
    │   ├── core
    │   │   ├── core.md
    │   │   ├── core.yml
    │   │   ├── dim_customers.sql
    │   │   ├── fct_orders.sql
    │   │   └── intermediate
    │   │       ├── customer_orders__grouped.sql
    │   │       ├── customer_payments__grouped.sql
    │   │       ├── intermediate.yml
    │   │       └── order_payments__joined.sql
    │   ├── finance
    │   ├── marketing
    │   └── product
    └── staging

This entire project results in the following DAG:

Accessories to data

There are other kinds of SQL files that find their way into robust dbt projects. In addition to staging and marts, we find ourselves with model directories such as:

  • utils: An all_days table (see the sketch after this list). This is useful everywhere, though it never forms the basis for analysis/reporting.
  • lookups: A user-mapping table, a zipcode-country table, etc. These are as likely to be CSV seeds as tables in a production database. You may reference them at several unpredictable points throughout modeling, and maybe even in a BI tool.
  • admin: Audit logs, warehouse operations, Redshift maintenance, and incremental records of the miscellaneous DDL you run to make your project run smoothly.
  • metrics: Precisely defined measurements taken from fact tables, directly conducive to time-series reporting, and tightly structured so as to allow one-to-one comparison with goals and forecasting. A metrics table lives downstream of dimension and fact tables in your DAG, and it deserves special status.
  • Packages: While not a model folder within your main project, packages that include models (like our snowplow package) can be configured into custom schema and materialization patterns from dbt_project.yml.
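
Returning to the utils example above: one way to build an all_days table is with the date_spine macro from the dbt_utils package. This is just a sketch, assuming dbt_utils is installed, your warehouse supports dateadd, and an arbitrary start date of 2018-01-01:

-- models/utils/all_days.sql
-- generates one row per day, from the start date through tomorrow

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2018-01-01' as date)",
    end_date="dateadd('day', 1, current_date)"
) }}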

In projects where we find ourselves with these additional models, we often leverage custom schemas to logically group the models in our warehouse, choosing a schema name that matches the directory name in our dbt project.
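
As a sketch of that configuration (the project name and schema names here are illustrative, using current dbt config syntax), the relevant section of dbt_project.yml might look like:

models:
  my_project:          # your project name, as set at the top of dbt_project.yml
    staging:
      +materialized: view
      +schema: staging
    marts:
      core:
        +schema: core
      finance:
        +schema: finance
      marketing:
        +schema: marketing
      product:
        +schema: product

Note that dbt’s default generate_schema_name macro concatenates the custom schema onto the target schema (e.g. analytics_staging); you can override that macro if you want the warehouse schema names to match the directory names exactly.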

Final thoughts

In this article, building the DAG for a dbt project has been described left to right, starting at sources, and ending with marts models.

However, it’s worth noting that in reality we often first think through a modeling problem from right to left — we start with an idea of the dashboard or report we want to build, then whiteboard the structure of the marts model we need in our warehouse to power this dashboard. On the same whiteboard, we’ll often then work backwards until we reach our source, before we start writing any actual SQL. I’ve found that it’s only once I’ve solved a modeling problem a few times that I get an intuition for how to build a DAG from left to right. In other words: we tend to think about our destination before we start our modeling journey.


¹We’ve standardized our naming and type conventions in our dbt coding conventions.


Hi Claire,

Congratulations and thanks for sharing the knowledge! The post is pretty interesting and helpful!

I was curious about the execution process that you do on production considering the “tests” that you create for each “layer”. How do you do the execution and what is the sequence of execution that you do?

Could you share these details, please?

Thanks!

For posterity, this question was answered over here :point_down:
What are the execution steps in your development and production environments?


Question: What was the rationale for using a Common Table Expression (CTE) in building the staging model (the SQL queries in the section titled “But what about base models?”). Is there a particular benefit to using CTEs in this model, or is this a matter of style preference? Thank you!

It’s totally a matter of preference! I thought it was overkill when I started at Fishtown, but now I like it!

As rules of thumb:

  • We always like to have ref or source functions at the top of a file, as we feel that it makes it easy for us to understand the dependencies when first looking at a model. “Importing” them as CTEs helps us with this.
  • We pretty much always finish off our queries with a select * from my_final_cte. I find that this pattern helps me when writing SQL for a model, as it means I can keep chaining together CTEs easily, without having to go back and add a CTE name, parens, and indents to the part of the query I was just working on! I also find it easier for debugging during development :slight_smile:
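
Put together, the skeleton of most of our models ends up looking something like this (the model and column names here are just placeholders):

with payments as (

    -- "import" ref/source dependencies at the top of the file
    select * from {{ ref('stg_braintree__payments') }}

),

monthly_totals as (

    -- transformation logic lives in the middle CTEs
    select
        date_trunc('month', created_at) as payment_month,
        count(*) as number_of_payments

    from payments
    group by 1

),

final as (

    select * from monthly_totals

)

-- finishing with a select * makes it easy to keep chaining CTEs
select * from final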

Overall these conventions help us ensure every analyst that works on our project writes SQL that looks consistent, which improves readability in the long run!

We’ve also done some quick investigation and found that on modern data warehouses it doesn’t impact the performance of your queries, more on that over here :point_right: CTEs are Passthroughs--Some research!


@claire, you mentioned that you take a Kimball-lite approach to modelling. Can you elaborate on this? I am currently trying to apply a Kimball model to my data, and it has proven to be difficult.

Hi @hawkar. The answer to this question is probably an entire series of blog posts, so unfortunately it’s not one I’m able to go into at this time! Broadly speaking, we like the design aspect of Kimball around building models that represent a business and we’ve kept some of the naming conventions, but we’ve done away with lots of the implementation details that were tied to the data warehouse technology of the time.

Also check out this related article on the relevance of Kimball in a modern data warehouse.


Thanks for the great article. I’m curious… you laid out how you structure dbt projects – how do you map these to database schemas? For example, do you have separate staging schemas?

That’s a really good question @MisterFine!

Typically, when we start a project, we put everything in one schema (in prod, we name the schema analytics, and in dev we name it dbt_claire or dbt_drew).

As the project grows, we start to leverage custom schemas to logically group our models. We tend to use schemas that match our folder hierarchy, with schemas like core, marketing, finance. Our staging models often end up in a single schema, staging.

I’d love to hear what works for you!

@claire Do you follow similar principles for seed data tables and create staging tables for them? I found that if I want to document them and test them I have to (at least with version 0.15.0), so I treat them pretty much as raw data - but it increases DAG complexity and number of objects significantly …

Hey @maria! Great question!

Since we have control over seeds, we’ll make sure that the data is already in a “staging” format, and name it with the stg_ prefix. Since there isn’t usually a data source for these models (typically they are codified business logic), we’ll often end up with names like stg_country_codes or stg_email_exclusion_list, rather than following a stg_<source>_<object> format. We aren’t super strict on this convention though, so we’re open to feedback here!

This is kind of hacky, but you can actually document and test seeds in a .yml file in the models/ directory. We’ll improve this in a future version of dbt, but for now you can do:

version: 2
models:
  - name: stg_country_codes
    columns:
      - name: country_code
        tests:
          - unique
          - not_null

Hi Claire, thanks for this! You have not mentioned rollups - things like monthly sales per department. Do you expect rollups to be handled at a different level - maybe the visualization tool? If not, do you think it would make sense to add them under the mart, e.g. mart/core/rollups/roll_monthly_sales_by_dept.sql?

We tend not to do too many rollups, and leave that to our BI tool — that leaves us with the ability to explore a metric to understand things like “ok, if my revenue is $Y, what contributed to that number?”.

If we do have rollups, they tend to go in our metrics directory (which we very briefly mentioned here :slight_smile: )

Thanks. I am new to the area, and I assume that a BI tool can be used to select a rollup from a list of available rollups, but that it does not in itself calculate the rollup. Sounds like I am wrong about that. Can you provide an example, i.e. a BI tool that is smart enough to create a rollup on its own? How does it do it? Does it create the SQL query to generate it?

It depends on the tool! Looker and Tableau will generate the SQL, whereas with Mode you traditionally have had to write the SQL yourself (I haven’t been keeping up to date with their product updates). What BI tool are you using? Typically folks that get started with dbt already have a BI tool in place :slight_smile:

We use Periscope (now Sisense). Have not played with it yet.

Claire, thanks for sharing the knowledge!

Can you please recommend a few?

Check out this post:


Hi Claire,

Thanks for this great post.

I appreciate I’m a bit late to the party, but I’m really interested in how you model the downstream metrics, and specifically ensuring that time-series metric facts can easily be filtered by the relevant dimensions as well?

So let’s say you have a company result for ‘Quality’ that you’ve defined as a metric, but you want to be able to cut it by team, individual, product, all these things - resulting in a very large table potentially.

How do you model this downstream of your dimensions and facts? Is there a special ‘best way’ you’ve built to make it painless and still be able to cut it by day/week/month/year etc.?

Thanks
Pete


The answer to that question is a whole post in itself! And one we’re actually working on right now. In the meantime, you can check out the talk from my coworkers on this topic:
