How we structure our dbt projects

Hi Claire,

Congratulations and thanks for sharing the knowledge! The post is pretty interesting and helpful!

I was curious about your execution process in production, considering the “tests” you create for each “layer”. What do you run, and in what sequence?

Could you share these details, please?

Thanks!

For posterity, this question was answered over here :point_down:
What are the execution steps in your development and production environments?

Question: What was the rationale for using a Common Table Expression (CTE) in building the staging model (the SQL queries in the section titled “But what about base models?”). Is there a particular benefit to using CTEs in this model, or is this a matter of style preference? Thank you!

It’s totally a matter of preference! I thought it was overkill when I started at Fishtown, but now I like it!

As rules of thumb:

  • We always like to have ref or source functions at the top of a file, as we feel that it makes it easy for us to understand the dependencies when first looking at a model. “Importing” them as CTEs helps us with this.
  • We pretty much always finish off our queries with a select * from my_final_cte. I find this pattern helps when writing SQL for a model, as it means I can keep chaining CTEs together easily without having to go back and add a CTE name, parens, and indents to the part of the query I was just working on! I also find it easier for debugging during development :slight_smile: (see the sketch just below this list)
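
To make that concrete, a model following this pattern might look something like the sketch below (stg_customers and stg_orders are made-up model names, not from a real project):

with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

-- "imports" above, transformation logic below
customer_orders as (
    select
        customer_id,
        count(*) as order_count
    from orders
    group by 1
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        coalesce(customer_orders.order_count, 0) as order_count
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
)

select * from final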

Overall these conventions help us ensure every analyst that works on our project writes SQL that looks consistent, which improves readability in the long run!

We’ve also done some quick investigation and found that on modern data warehouses this doesn’t impact the performance of your queries; more on that over here :point_right: CTEs are Passthroughs--Some research!

@claire, you mentioned that you take a Kimball-lite approach to modelling. Can you elaborate on this? I am currently trying to apply a Kimball model to my data, and it has proven to be difficult.

Hi @hawkar. The answer to this question is probably an entire series of blog posts, so unfortunately it’s not one I’m able to go into at this time! Broadly speaking, we like the design aspect of Kimball around building models that represent a business, and we’ve kept some of the naming conventions, but we’ve done away with many of the implementation details that were tied to the data warehouse technology of the time.

Also check out this related article on the relevance of Kimball in a modern data warehouse.

Thanks for the great article. I’m curious… you laid out how you structure dbt projects, but how do you map these to database schemas? For example, do you have separate staging schemas?

That’s a really good question @MisterFine!

Typically, when we start a project, we put everything in one schema (in prod, we name the schema analytics, and in dev we name it dbt_claire or dbt_drew).
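
For context, that schema is just the schema set on each target in the dbt profile. A rough sketch of what that looks like, assuming Snowflake (every value below is a placeholder):

# profiles.yml (sketch only: warehouse type, account, and credentials are placeholders)
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345
      user: claire
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: transformer
      database: analytics
      warehouse: transforming
      schema: dbt_claire   # development models build here
      threads: 4
    prod:
      type: snowflake
      account: xy12345
      user: dbt_cloud
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: transformer
      database: analytics
      warehouse: transforming
      schema: analytics    # production models build here
      threads: 8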

As the project grows, we start to leverage custom schemas to logically group our models. We tend to use schemas that match our folder hierarchy, with schemas like core, marketing, finance. Our staging models often end up in a single schema, staging.
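
A rough sketch of the corresponding configs in dbt_project.yml (folder names are illustrative; note that dbt’s default generate_schema_name macro prefixes these with the target schema, e.g. analytics_core, so getting schemas named exactly core or marketing means overriding that macro):

# models: section of dbt_project.yml (sketch; folder names are illustrative)
models:
  my_project:
    staging:
      schema: staging
    marts:
      core:
        schema: core
      marketing:
        schema: marketing
      finance:
        schema: finance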

I’d love to hear what works for you!

@claire Do you follow similar principles for seed data tables and create staging tables for them? I found that if I want to document and test them, I have to create staging models for them (at least with version 0.15.0), so I treat them pretty much as raw data - but it increases DAG complexity and the number of objects significantly …

Hey @maria! Great question!

Since we have control over seeds, we’ll make sure that the data is already in a “staging” format, and name it with the stg_ prefix. Since there isn’t usually a data source for these models (typically they are codified business logic), we’ll often end up with names like stg_country_codes or stg_email_exclusion_list, rather than following a stg_<source>_<object> format. We aren’t super strict on this convention though, so we’re open to feedback here!

This is kind of hacky, but you can actually document and test seeds in a .yml file in the models/ directory. We’ll improve this in a future version of dbt, but for now you can do:

version: 2
models:
  - name: stg_country_codes
    columns:
      - name: country_code
        tests:
          - unique
          - not_null

Thanks @claire! On a similar note, what do you usually do for snapshots?

Let’s suppose we have a source table and we build a snapshot on top of it. Do you usually build a staging model from the snapshot, the source, or both?

I think it’s probably best to expose only one table out of the staging layer, which should be the snapshot, so that we don’t duplicate code. That depends on the snapshot including all the columns that might be needed downstream.
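
For example (all the names here are hypothetical), the staging model could just select and rename from the snapshot:

-- models/staging/stg_orders.sql (hypothetical)
with source_data as (
    select * from {{ ref('orders_snapshot') }}
),

renamed as (
    select
        id as order_id,
        status as order_status,
        dbt_valid_from,
        dbt_valid_to
    from source_data
)

select * from renamed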

What would be your opinion on this?

Hi Claire, thanks for this! You have not mentioned rollups - things like monthly sales per department. Do you expect rollups to be handled at a different level - maybe the visualization tool? If not, do you think it would make sense to add them under the mart, e.g. mart/core/rollups/roll_monthly_sales_by_dept.sql?

We tend to not do too many rollups, and leave that to our BI tool — that leaves us the ability to explore a metric to understand things like “ok if my revenue is $Y, what contributed to that number?”.

If we do have rollups, they tend to go in our metrics directory (which we very briefly mentioned here :slight_smile: )
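
For what it’s worth, a rollup model along those lines might look something like this sketch (fct_sales and its columns are made up):

-- models/metrics/monthly_sales_by_department.sql (made-up example)
with sales as (
    select * from {{ ref('fct_sales') }}
)

select
    date_trunc('month', sale_date) as sale_month,
    department,
    sum(amount) as total_sales
from sales
group by 1, 2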

Thanks. I am new to the area, and I assumed that a BI tool can be used to select a rollup from a list of available rollups, but that it does not itself calculate the rollup. Sounds like I am wrong about that. Can you provide an example, i.e. a BI tool that is smart enough to create a rollup on its own? How does it do it? Does it create the SQL query to generate it?

It depends on the tool! Looker and Tableau will generate the SQL, whereas with Mode you traditionally have had to write the SQL yourself (I haven’t been keeping up to date with their product updates). What BI tool are you using? Typically folks that get started with dbt already have a BI tool in place :slight_smile:

We use Periscope (now Sisense). I have not played with it yet.

Claire, thanks for sharing the knowledge!

Can you please recommend a few BI tools?

Check out this post:

Hi Claire,

Thanks for this great post.

I appreciate I’m a bit late to the party, but I’m really interested in how you model downstream metrics, and specifically in how you ensure that time-series metric facts can easily be filtered by the relevant dimensions.

So let’s say you have a company result for ‘Quality’ that you’ve defined as a metric, but you want to be able to cut it by team, individual, product, all these things - potentially resulting in a very large table.

How do you model this downstream of your dimensions and facts? Is there a special ‘best way’ you’ve built to make it painless and still be able to cut it by day/week/month/year etc.?

Thanks
Pete

The answer to that question is a whole post in itself! And one we’re actually working on right now. In the meantime, you can check out the talk from my coworkers on this topic: