Should I move my CTEs into separate models?

Every so often, a question like this comes up on dbt slack:

“If I write a query using CTEs, would you recommend having the CTEs as a separate sql files as ephemeral and then reference those or keep the CTE with a with statement in the same file? the compile result will be the same - but what are the pros / cons of the two approaches?”

Pros of splitting CTEs into separate models:

  • Your CTE becomes a separate model, which you can then test (including ephemeral models!). This is often useful for testing uniqueness.
  • If your models are materialized as tables or views, you can more easily debug since you can select directly from them in your SQL client.
  • You can reuse logic in multiple downstream models (in which case it may be a good idea to materialize your model as a table)

Cons of splitting CTEs into separate models:

  • More files!
  • If your models are materialized as tables or views, you’ll also have more relations in your data warehouse, which can make your warehouse seem cluttered.

My recommendation:
Like a lot of things in programming, it’s about finding a balance! And normally I find that balance by going too far in one direction (e.g. no CTEs as separate models), and then overcorrecting in the other (e.g. all CTEs as separate models), and then continuing to correct until I find a happy medium.

My rules of thumb are:

  • If your model has multiple CTEs (and hundreds of lines of code) it may be worth breaking it into separate models
  • If you are aggregating in a CTE, split it into a separate model. For example, in this query, the customer_orders CTE aggregates, and I would normally split this into a separate model.
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

-- I'd pull the customer_orders cte into a separate model
customer_orders as (
    select
       customer_id,

        -- aggregates
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        sum(amount) as lifetime_value,
        count(*) as number_of_orders
    from orders
    group by 1
),

final as (
    select
        *
    from customers
    left join customer_orders using (customer_id)
)

select * from final
  • If it’s a CTE that will not get re-used, it may be a good idea to keep it in the same model.
  • Use the ephemeral materialization for lightweight transformations, e.g. renaming/casting to the right data type.
  • If you are concerned about the number of relations that then get materialized in your warehouse, consider using custom schemas to put them in a separate “staging” schema.
  • Use subdirectories within your models/ directory to group together intermediate models to reduce the clutter within your dbt project. This also makes it easier to configure custom schemas from you dbt_project.yml file. For example a nested structure might look like:
models/
└── marts/
    └── core/
        ├── fct_customers.sql
        ├── fct_orders.sql
        └── intermediate/
            ├── customer_orders.sql
            ├── customer_payments.sql
            └── order_payments.sql

6 Likes

For models with many lines of code, many intermediate updates, I think the only solution is to put the code in a pre- or post-hooks. Since it is very difficult to put all that in a CTE, and a very complex CTE has a lot of performance problems in the database.