Repeated CTEs, DRY and intermediate models good practice

Thmsrey · August 18, 2022, 8:55am

Hi everyone,

I am refactoring some legacy SQL codebase at my company and after reading “How we structure our dbt projects”, something does not click for me regarding intermediate models.

I have a case where 2 marts models have a big portion of repeated code in a CTE. Actually, the person who wrote the code even created a temporary table (we’re using BQ) for this CTE. So this same TEMP table is created 2 times in 2 different models.

To me, the logical thing to do was to move this repeated piece of code in an intermediate model that would be ref’ed by the 2 marts models. However, the chapter of the guide regarding Intermediate Models advocates for a 1 to 1 relationship between intermediate and mart models.

I understand that if I would move my CTE into an “ephemeral” intermediate model (as it is recommended), we would compute the same thing twice. However, according to the guidelines, I don’t see where else this code should live.

Of course guidelines are just guidelines but I’m curious as to what is the dbt way to handle this. Moving it to a mart model even if it’s not really aimed to end users? And what about CTEs that are reused in many different models?

Thanks!

obar1 · December 10, 2022, 9:00am

Create a new folder and call it
Intermediate
Put there what is shared on the downstream and use table as materialization
Just avoid cycle between intermediate and final models

cstork · February 1, 2023, 11:53pm

I’m in a similar situation as the OP and this does not seem to solve the problem as I understand it: two pipelines share code (e.g. a big CTE), how do you adhere to the DRY principle in this case? Do you have to resort to a macro? That seems a bit unwieldy.

gwenwindflower · February 10, 2023, 1:16pm

Hi @Thmsrey! Winnie here, writer of the “How we structure” Guide. First off, thank you for reading it thoughtfully! Secondly, this is a really good question.

This is the squishiest part of the guide, and the set of recommendations I’m the least satisfied with how I communicated. It needs more examples.

My biggest overarching response is: as you said, guidelines are just guidelines, and everybody breaks this one when needed. Almost every dbt installation will have some instance where they need to reuse intermediate models in multiple marts, so don’t feel like you’re necessarily breaking the rules if this particular group of models is that area for y’all.

All that said, let me elaborate a bit on the goal of this recommendation and how I think about it:

I’ve seen a lot of people re-use all of their intermediate models in like 10 models each, this creates this really gnarly set of dependencies that makes your DAG difficult to reason about and duplicates tons of data in your warehouse. Storage is cheap so that’s not the end of the world, but if it’s expensive compute-wise you could be needlessly computing transformations over and over again.
What I prefer to do is layer my marts. So you mention these tables reuse this big complex CTE for example. I would consider importing it into one of the marts. Then importing that mart into the other mart that needs those transformations, and just select the columns that overlap. This keeps the dependency chain in the DAG really clear, and means I’m only processing that set of transformations one time.

Let me know if that makes sense! Again, there are always scenarios where this isn’t ideal, but I push people to think through this option first before they start plugging intermediate models all over the place and tangling up their DAG.

Happy modeling!

maxim.kupfer · April 27, 2023, 4:54pm

What do you mean by “importing” the model? I’m also very new to dbt, but I’m trying to do something simmilar where I take an already used model but refine it a bit and it sounds like this would be the clearest approach in terms of others understanding the dependency graph and all.

gwenwindflower · April 27, 2023, 5:34pm

Hey hey – good question! I’m referring to {{ ref }}'ing the model when i say import above. Sometimes we’ll call these particular CTEs imports because of the way we set up the models – typically like this:

with
--- i'm often referred to as an 'import CTE' because like package imports in Python or JS i'm 
--- a 'dependency' declared at the top of the file
model_1 as (
   select stuff
  from {{ ref('model_1') }}
}

select stuff that uses model_1 and does other stuff

does that make sense?

maxim.kupfer · April 27, 2023, 5:50pm

ah yes, I understand. I guess what I was thinking was that you could import a model without it being materialized as a table. This could be useful if you want to modularize by separating things out that don’t necessarily have to be their own table as they have a very specific use case. I take it that the intermediate models are the next best thing?

gwenwindflower · April 27, 2023, 10:50pm

You can do this! Materializations | dbt Developer Hub

This ‘materialization’ will just interpolate the logic into a CTE in the models that ref it.

tnetha · May 2, 2023, 6:51pm

Thanks for writing this up! I found this super helpful in understanding intermediate models. It always confused me why the best practice was to organize it by mart, but this cleared it up.

How would you handle having data from an ERP that fed multiple marts? For example, I have an Invoices staging table and an Invoice_line_items staging table that has the products sold on each invoice. I’ve built an intermediate model that combines these together and brings in some relevant dimensions from other tables. The thought was to use this intermediate model every time I needed invoice data for a mart and filter to just that business units’ invoices.

With the approach you outline, I would almost want to create a “company-wide” mart to put these models and then point other models there from different marts. Does this make sense with what you intended? Again, thanks for the write-up!

gwenwindflower · May 4, 2023, 4:44pm

welcome! great question –

I think you’re doing this great! The one difference is I would treat this:
“invoices and invoice line items enriched with other tables” as a mart not an intermediate table. and that’s why you’re finding yourself plugging it in lots of places, it’s your invoices mart! hard to fully say without looking at a DAG what i would do, but my instinct is you’re doing this properly and it’s just kind of shift in terminology.

Topic		Replies	Views
Why we make the intermediate model? Archive	1	4374	March 19, 2022
Naming conflicts of similar intermediate models for different jobs Archive	0	1652	June 17, 2022
Best practice splitting SQL into staging, intermediate and marts layers and naming conventions Help	1	5525	January 3, 2024
Should I move my CTEs into separate models? In-Depth Discussions	1	8313	June 21, 2023
What is the best way to break up a model for easier unit test? In-Depth Discussions	0	101	September 10, 2024

Repeated CTEs, DRY and intermediate models good practice

Related topics