Why the Fishtown SQL style guide uses so many CTEs

If you’ve taken a look at Fishtown’s style guide, or seen some of the SQL throughout the docs / example projects (which mostly aligns with the style guide) you might have noticed a proliferation of CTEs (common table expressions). If you haven’t noticed this, check out the example below.

So, what’s the deal?

Shouldn’t those CTEs be subqueries?

Many people mistakenly believe that CTEs are “optimization fences” and are less performant than subqueries — this isn’t true on modern data warehouses (or, at least not true enough to be worth it). Compared to subqueries, CTEs are way easier to read, and debug, than subqueries. If you’re still using subqueries and you’re on Snowflake, Redshift, BigQuery or Postgres 12, then please take a long hard look at yourself in the mirror. And then go and refactor your code :wink:

What’s the deal with the CTEs at the top?

We start each model by “importing” each upstream relation that’s going to be used. This is similar to python, where you import packages used. I’ll let @tristan explain (from here).

This achieves a couple of nice things. First, it makes it very easy to see what the dependencies are for a given model when you first open up its file. Second, it gives you an easy mechanism to consistently alias all input tables—this mechanism ends up being much cleaner than typing out and then aliasing tables individually within each individual CTE. Overall, we’ve found that this pattern leads to significantly more readable code.

And, we’ve done the research — these CTEs don’t significantly affect query performance:

Caveat: at some point, if you keep stacking CTEs on top of each other, your query optimizer is going to have a bad time. Given we keep our models to less than ~100 lines of SQL, we don’t really see this happening.

What’s the deal with the final CTE, isn’t that redundant?

By now, hopefully you’ve bought into the fact that CTEs don’t affect query performance. But what’s with the last CTE, it is just so uneccesary!

I thought so too when I started at Fishtown and had to adopt this style guide. Having this last CTE makes debugging so much easier. If something weird is happening in your query, it’s much easier to replace:

select * from final

with

select * from customer_orders

rather than having to nest that last query in parentheses or comment it out. Give it a go, and you’ll see what we mean!


Example SQL:

with customers as (

    select * from {{ ref('stg_jaffle_shop__customers') }}

),

orders as (

    select * from {{ ref('fct_orders') }}

),

customer_orders as (

    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders,
        sum(amount) as lifetime_value

    from orders

    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders,
        coalesce(customer_orders.lifetime_value, 0) as lifetime_value

    from customers

    left join customer_orders using (customer_id)

)

select * from final
6 Likes

Coming from a largely software development background, this was one of the first elements of the dbt style that resonated with me and made the overall framework more approachable. Tristan’s proof that the form (style) did not necessarily come at the sacrifice of function (performance) similarly resonated with me as an evaluator of this tool. Thanks for this/these posts Claire, et al!

3 Likes

I’ve been wondering this for a while and appreciate the writeup!! I love CTE’s but it does seem to be a bit excessive in the DBT examples and I do wonder if they takeaway from some readability.

A couple of additional questions here:

  1. “Second, it gives you an easy mechanism to consistently alias all input tables” - would you not be better served by using that consistent alias in your table names? Does orders really need “fct_” in-front of it and does customers really need “stg_jaffle_shop__”? If they didn’t then you wouldn’t have to remember and maintain the loosely held common alias mechanism over all your files.

  2. Is the last example of switching which table you’re returning a really a very common one in your workflows? If not - it’s also not too hard to comment out a section (easy with most code editors and a shortcut) and type “select * from customers_orders” at the end. That’s not really too many additional key presses, but I may be missing something.

  1. Model names need to be unique in dbt (even if the files live in different directories). In a large project, you’ll end up hitting naming collisions as your project grows, so we choose to be overly verbose in our file names

  2. Yep, we end up switching is last CTE out all the time when writing complex models. It’s rare that I’ll get a complex transformation right the first go, and using this workflow when debugging ends up being really useful. Of course, YMMV :slight_smile: (especially if you manage to write SQL correctly the first go :raised_hands:)

3 Likes

The final CTE (which I had not yet read about didn’t know it was your best practice I thought I’d read everything haha) just blew my MIND :exploding_head:. How many times have I found myself debugging and running into annoying commenting outs and what not. I’m sold!

1 Like