Every so often, a question like this comes up on dbt slack:
“If I write a query using CTEs, would you recommend having the CTEs as a separate sql files as
ephemeral
and then reference those or keep the CTE with a with statement in the same file? the compile result will be the same - but what are the pros / cons of the two approaches?”
Pros of splitting CTEs into separate models:
- Your CTE becomes a separate model, which you can then test (including ephemeral models!). This is often useful for testing uniqueness.
- If your models are materialized as tables or views, you can more easily debug since you can select directly from them in your SQL client.
- You can reuse logic in multiple downstream models (in which case it may be a good idea to materialize your model as a table)
Cons of splitting CTEs into separate models:
- More files!
- If your models are materialized as tables or views, you’ll also have more relations in your data warehouse, which can make your warehouse seem cluttered.
My recommendation:
Like a lot of things in programming, it’s about finding a balance! And normally I find that balance by going too far in one direction (e.g. no CTEs as separate models), and then overcorrecting in the other (e.g. all CTEs as separate models), and then continuing to correct until I find a happy medium.
My rules of thumb are:
- If your model has multiple CTEs (and hundreds of lines of code) it may be worth breaking it into separate models
- If you are aggregating in a CTE, split it into a separate model. For example, in this query, the
customer_orders
CTE aggregates, and I would normally split this into a separate model.
with customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
-- I'd pull the customer_orders cte into a separate model
customer_orders as (
select
customer_id,
-- aggregates
min(order_date) as first_order,
max(order_date) as most_recent_order,
sum(amount) as lifetime_value,
count(*) as number_of_orders
from orders
group by 1
),
final as (
select
*
from customers
left join customer_orders using (customer_id)
)
select * from final
- If it’s a CTE that will not get re-used, it may be a good idea to keep it in the same model.
- Use the ephemeral materialization for lightweight transformations, e.g. renaming/casting to the right data type.
- If you are concerned about the number of relations that then get materialized in your warehouse, consider using custom schemas to put them in a separate “staging” schema.
- Use subdirectories within your
models/
directory to group together intermediate models to reduce the clutter within your dbt project. This also makes it easier to configure custom schemas from youdbt_project.yml
file. For example a nested structure might look like:
models/
└── marts/
└── core/
├── fct_customers.sql
├── fct_orders.sql
└── intermediate/
├── customer_orders.sql
├── customer_payments.sql
└── order_payments.sql