How we structure our dbt projects

As the maintainers of dbt, and analytics consultants, at Fishtown Analytics we build a lot of dbt projects. Over time, we've developed internal conventions on how we structure them.

It's important to note that this is not the only, or the objectively best, way to structure a dbt project. Rather, this document reflects our current opinions. These opinions are strongly influenced by:

  • our views on data model design, which in turn are influenced by:
    • the kinds of analytics problems we are solving for clients
    • the data stack we typically work within, in which multiple data sources are loaded by third party tools, and the data warehouse is optimized for analytical queries (therefore we aren't tightly bound by performance considerations).

Our opinions are almost guaranteed to change over time as we update our views on modeling, are exposed to more analytics problems, and data stacks evolve. It's also worth clearly stating here: the way we structure dbt projects makes sense for our projects, but may not be the best fit for yours! This article exists on Discourse so that we can have a conversation – I would love to know how others in the community are structuring their projects.

In comparison, the (recently updated) best practices reflect principles that we believe to be true for any dbt project. Of course, these two documents go hand in hand – our projects are structured in a way that makes those principles easy to observe, in particular:

  • Limit references to raw data
  • Rename and recast fields once
  • Group your models in directories
  • Add tests to your models
  • Consider the information architecture of your data warehouse
  • Separate source-centric and business-centric transformations

If you want to see what the code for one of our projects looks like, check out this demonstration dbt project.

We also recently held (and recorded) an office hours on this topic – this article provides a high level outline, but there's a lot more detail and discussion in the video.

Lastly, before I dive in, a huge thank you to @jerco for not only teaching me a lot of the material in this article, but also for doing a lot of the groundwork that went into it – entire sections of this article are in fact lifted from his work.

Data modeling 101

The data in any of our projects has three distinct checkpoints:

  1. Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.
  2. Staging models: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.
  3. Marts models: Models that represent business processes and entities, abstracted from the data sources that they are based on.

In a simple project, these may be the only models you build; more complex projects may have a number of intermediate models that help along this journey, as well as accessories to these models (see below).

Still confused? An example might help! Let's think about a software business that uses both Stripe and Braintree to collect subscription payments. Their three stages of modeling might look like:

  1. Sources: Payment records from the Stripe API and payment records from the Braintree API, loaded into their data warehouse by a third party tool.
  2. Staging models: Both the Stripe and Braintree payments are recast into a consistent shape, with consistent column names.
  3. Marts models: A monthly recurring revenue (MRR) model that classifies revenue per customer per month as new revenue, upgrades, downgrades, and churn, to understand how a business is performing over time. It may be useful to note whether the revenue was collected via Stripe or Braintree, but they are not fundamentally separate models.

Of note here is that there is a distinct change that occurs between the staging and marts checkpoints – sources and staging models are source-centric, whereas marts models are business-centric.

In our dbt projects, this leads us to our first split in our models/ directory which helps us make this distinction:

├── dbt_project.yml
└── models
    ├── marts
    └── staging

Staging raw data

The goal of the staging layer is to create staging models. Staging models take raw data and clean and prepare it for further analysis. For a user querying the data warehouse, a relation with a stg_ prefix indicates that:

  • Fields have been renamed and recast in a consistent way.¹
  • Datatypes and timezones are consistent.
  • Light cleansing, such as replacing empty strings with NULL values, has occurred.
  • If useful, flattening of objects might have occurred.
  • There is a primary key that is both unique and not null (and tested).

Staging models can include joins to bring in additional columns for context or enrichment; they can add rows through unions and remove them through filters; and they can deduplicate a natural key or hash together a surrogate one.
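As a sketch, the light cleansing described above might look like the following (the column names here are hypothetical, and convert_timezone follows the Snowflake signature):

```sql
with source as (

    select * from {{ source('braintree', 'customers') }}

),

renamed as (

    select
        id as customer_id,

        -- light cleansing: replace empty strings with NULL values
        nullif(email, '') as email,

        -- recast to a consistent datatype
        cast(account_balance as numeric(16, 2)) as account_balance,

        -- standardize timezones
        convert_timezone('America/New_York', 'UTC', createdat) as created_at

    from source

)

select * from renamed
```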

Because we often work with multiple data sources, in our staging directory, we create one directory per source.

├── dbt_project.yml
└── models
    ├── marts
    └── staging
        ├── braintree
        └── stripe

Each staging directory contains at a minimum:

  • One staging model for each object that is useful for analytics:
    • Named stg_<source>__<object>.
    • Generally materialized as a view (unless performance requires it as a table).
  • A <source>.yml file which contains:
    • Source definitions, tests, and documentation
    • Tests and documentation for the models in this directory

├── dbt_project.yml
└── models
    ├── marts
    └── staging
        └── braintree
            ├── braintree.yml
            ├── stg_braintree__customers.sql
            └── stg_braintree__payments.sql
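A minimal braintree.yml might look like this sketch (the table and column names are illustrative):

```yaml
version: 2

sources:
  - name: braintree
    tables:
      - name: customers
      - name: payments

models:
  - name: stg_braintree__customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
```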

But what about base models?

Earlier versions of the dbt documentation recommended implementing "base models" as the first layer of transformation – and we used to organize and name our models in this way, for example models/braintree/base/base_payments.sql.

We realized that while the reasons behind this convention were valid, the naming was an opinion, so in our recent update to the best practices, we took the mention of base models out. Instead, we replaced it with the principles of "renaming and recasting once" and "limiting the dependencies on raw data".

That being said, in our dbt projects every source flows through exactly one model of the following form:

with source as (
    
    select * from {{ source('braintree', 'payments') }}
    
),

renamed as (
    
    select
        id as payment_id,
        order_id,
        convert_timezone('America/New_York', 'UTC', createdat) as created_at,
        ...
    
    from source

)

select * from renamed

We still refer to this as a base transformation. If your source data is in good shape, this transformation may be all that's required to build a staging model, in which case our staging model is simply this SQL.

However, building a staging model may warrant several models' worth of cleaning, correcting, and categorizing, or may require a join or union to another source. To ensure each data source still flows through a base transformation, we extend our DAG upstream of the staging model by creating a separate base model that we then select from.
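For example, a staging model that unions two base models might look like this sketch:

```sql
-- stg_braintree__payments.sql: union the two base models back together
with successful_payments as (

    select * from {{ ref('base_braintree__successful_payments') }}

),

failed_payments as (

    select * from {{ ref('base_braintree__failed_payments') }}

),

unioned as (

    select * from successful_payments

    union all

    select * from failed_payments

)

select * from unioned
```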

In our dbt projects, we place these base models in a nested base subdirectory.

├── dbt_project.yml
└── models
    ├── marts
    └── staging
        └── braintree
            ├── base
            │   ├── base.yml
            │   ├── base_braintree__failed_payments.sql
            │   └── base_braintree__successful_payments.sql
            ├── braintree.yml
            ├── stg_braintree__customers.sql
            └── stg_braintree__payments.sql

In our projects, base models:

  • Often use the ephemeral materialization, so they are not exposed to end users querying our warehouse.
  • Are tested in a base.yml file within the same directory as the base models.
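These materialization conventions can be set once for a whole directory in dbt_project.yml rather than per model – a sketch, where "my_project" is a placeholder for your project name:

```yaml
# dbt_project.yml (excerpt)
models:
  my_project:
    staging:
      +materialized: view
      braintree:
        base:
          # base models are ephemeral, so they are not exposed to end users
          +materialized: ephemeral
```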

If we need additional transformations between base and staging models, we create a nested staging/<source>/intermediate directory and place these transformations in there.

Describing a business through marts

Marts are stores of models that describe business entities and processes. They are often grouped by business unit: marketing, finance, product. Models that are shared across an entire business are grouped in a core directory.

├── dbt_project.yml
└── models
    ├── marts
    │   ├── core
    │   ├── finance
    │   ├── marketing
    │   └── product
    └── staging

There are entire books written on how to design models, which is beyond the scope of this article. In our Kimball-lite view of the world, our goal is to build fact and dimension models that are abstracted from the source data they rely upon:

  • fct_<verb>: A tall, narrow table representing real-world processes that have occurred or are occurring. The heart of these models is usually an immutable event stream: sessions, transactions, orders, stories, votes.
  • dim_<noun>: A wide, short table where each row is a person, place, or thing; the ultimate source of truth when identifying and describing entities of the organization. They are mutable, though slowly changing: customers, products, candidates, buildings, employees.

Where the work of staging models is limited to cleaning and preparing, fact tables are the product of substantive data transformation: choosing (and reducing) dimensions, date-spining, executing business logic, and making informed, confident decisions.
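As an illustrative sketch (not a real project model – the staging model and column names are hypothetical, while order_payments is the intermediate model shown in the directory tree below), a fact model might assemble its inputs like so:

```sql
-- fct_orders.sql
with orders as (

    select * from {{ ref('stg_jaffle_shop__orders') }}

),

order_payments as (

    select * from {{ ref('order_payments') }}

),

final as (

    select
        orders.order_id,
        orders.customer_id,
        orders.ordered_at,

        -- business logic lives here, not in staging
        coalesce(order_payments.amount, 0) as order_total

    from orders
    left join order_payments using (order_id)

)

select * from final
```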

This layer of modeling is considerably more complex than creating staging models, and the models we design are highly tailored to the analytical needs of an organization. As such, we have far less convention when it comes to these models. Some patterns we've found to be useful are:

  • fct_ and dim_ models should be materialized as tables within a warehouse to improve query performance. As a default, we use the table materialization, and where performance requires it, we use the incremental materialization.
  • Intermediate transformations required to get to a fact or dimension model are placed in a nested marts/<mart>/intermediate directory.
  • Models are tested and documented in a <dir_name>.yml file in the same directory as the models.
  • Any extra documentation in a docs block is placed in a <dir_name>.md file in the same directory.
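A sketch of how the yml and md files tie together (the status column and its docs block are illustrative):

```yaml
# core.md would contain a docs block, e.g.:
#   {% docs orders_status %}
#   One of: placed, shipped, completed, returned.
#   {% enddocs %}

# core.yml then tests and documents the models, referencing that block:
version: 2

models:
  - name: fct_orders
    description: One row per order.
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        description: '{{ doc("orders_status") }}'
```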

A marts directory may therefore end up looking like:

├── dbt_project.yml
└── models
    ├── marts
    │   ├── core
    │   │   ├── core.md
    │   │   ├── core.yml
    │   │   ├── dim_customers.sql
    │   │   ├── fct_orders.sql
    │   │   └── intermediate
    │   │       ├── customer_orders.sql
    │   │       ├── customer_payments.sql
    │   │       ├── intermediate.yml
    │   │       └── order_payments.sql
    │   ├── finance
    │   ├── marketing
    │   └── product
    └── staging

This entire project results in the following DAG:

Accessories to data

There are other kinds of SQL files that find their way into robust dbt projects. In addition to staging and marts, we find ourselves with model directories such as:

  • utils: An all_days table. This is useful everywhere, though it never forms the basis for analysis/reporting.
  • lookups: A user-mapping table, a zipcode-country table, etc. These are as likely to be CSV seeds as tables in a production database. You may reference them at several unpredictable points throughout modeling, and maybe even in a BI tool.
  • admin: Audit logs, warehouse operations, Redshift maintenance, and incremental records of the miscellaneous DDL you run to make your project run smoothly.
  • metrics: Precisely defined measurements taken from fact tables, directly conducive to time-series reporting, and tightly structured so as to allow one-to-one comparison with goals and forecasting. A metrics table lives downstream of dimension and fact tables in your DAG, and it deserves special status.
  • Packages: While not a model folder within your main project, packages that include models (like our snowplow package) can be configured into custom schema and materialization patterns from dbt_project.yml.
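For example, an all_days model in utils can be built with the date_spine macro from the dbt-utils package – a sketch (the date range is illustrative):

```sql
-- models/utils/all_days.sql
-- generates one row per day over the chosen range
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2016-01-01' as date)",
    end_date="dateadd(day, 1, current_date)"
) }}
```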

In projects where we find ourselves with these additional models, we often leverage custom schemas as directories in our warehouse, to logically group the models, choosing a schema name that matches the directory name in our dbt project.
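In dbt_project.yml, this looks something like the following sketch ("my_project" is a placeholder):

```yaml
models:
  my_project:
    marts:
      marketing:
        +schema: marketing
    utils:
      +schema: utils
```

By default, dbt concatenates a custom schema with the target schema name (e.g. analytics_marketing), which can be overridden with a generate_schema_name macro.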

Final thoughts

In this article, building the DAG for a dbt project has been described left to right, starting at sources, and ending with marts models.

However, it's worth noting that in reality we often first think through a modeling problem from right to left – we start with an idea of the dashboard or report we want to build, then whiteboard the structure of the marts model we need in our warehouse to power this dashboard. On the same whiteboard, we'll often then work backwards until we reach our source, before we start writing any actual SQL. I've found that it's only once I've solved a modeling problem a few times that I get an intuition for how to build a DAG from left to right. In other words: we tend to think about our destination before we start our modeling journey.


¹We've standardized our naming and type conventions in our dbt coding conventions.


Hi Claire,

Congratulations and thanks for sharing the knowledge! The post is pretty interesting and helpful!

I was curious about the execution process that you run in production, considering the tests that you create for each layer. How do you do the execution, and what is the sequence of execution that you use?

Could you share these details, please?

Thanks!

For posterity, this question was answered over here:
What are the execution steps in your development and production environments?


Question: What was the rationale for using a Common Table Expression (CTE) in building the staging model (the SQL queries in the section titled "But what about base models?")? Is there a particular benefit to using CTEs in this model, or is this a matter of style preference? Thank you!

It's totally a matter of preference! I thought it was overkill when I started at Fishtown, but now I like it!

As rules of thumb:

  • We always like to have ref or source functions at the top of a file, as we feel that it makes it easy for us to understand the dependencies when first looking at a model. "Importing" them as CTEs helps us with this.
  • We pretty much always finish off our queries with a select * from my_final_cte. I find that this pattern helps me when writing SQL for a model, as it means I can keep chaining together CTEs easily, without having to go back and add a CTE name, parens, and indents to the part of the query I was just working on! I also find it easier for debugging during development.
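The pattern described above, as a skeleton (the column names and the monthly aggregation are illustrative):

```sql
-- "import" dependencies at the top of the file
with payments as (

    select * from {{ ref('stg_braintree__payments') }}

),

-- each logical step gets its own CTE
monthly_totals as (

    select
        date_trunc('month', created_at) as payment_month,
        sum(amount) as total_amount

    from payments
    group by 1

)

-- always finish with a select * from the final CTE
select * from monthly_totals
```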

Overall these conventions help us ensure every analyst that works on our project writes SQL that looks consistent, which improves readability in the long run!

We've also done some quick investigation and found that on modern data warehouses it doesn't impact the performance of your queries – more on that over here: CTEs are Passthroughs--Some research!


@claire, you mentioned that you take a Kimball-lite approach to modelling. Can you elaborate on this? I am currently trying to apply a Kimball model to my data, and it has proven to be difficult.

Hi @hawkar. The answer to this question is probably an entire series of blog posts, so unfortunately it's not one I'm able to go into at this time! Broadly speaking, we like the design aspect of Kimball around building models that represent a business, and we've kept some of the naming conventions, but we've done away with lots of the implementation details that were tied to the data warehouse technology of the time.

Also check out this related article on the relevance of Kimball in a modern data warehouse.
