If you’ve taken a look at Fishtown’s style guide, or seen some of the SQL throughout the docs / example projects (which mostly aligns with the style guide) you might have noticed a proliferation of CTEs (common table expressions). If you haven’t noticed this, check out the example below.
So, what’s the deal?
Shouldn’t those CTEs be subqueries?
Many people mistakenly believe that CTEs are “optimization fences” and are less performant than subqueries — this isn’t true on modern data warehouses (or, at least not true enough to be worth it). Compared to subqueries, CTEs are way easier to read, and debug, than subqueries. If you’re still using subqueries and you’re on Snowflake, Redshift, BigQuery or Postgres 12, then you might be due for a refactor!
What’s the deal with the CTEs at the top?
We start each model by “importing” each upstream relation that’s going to be used. This is similar to python, where you import packages used. I’ll let @tristan explain (from here).
This achieves a couple of nice things. First, it makes it very easy to see what the dependencies are for a given model when you first open up its file. Second, it gives you an easy mechanism to consistently alias all input tables—this mechanism ends up being much cleaner than typing out and then aliasing tables individually within each individual CTE. Overall, we’ve found that this pattern leads to significantly more readable code.
And, we’ve done the research — these CTEs don’t significantly affect query performance:
Caveat: at some point, if you keep stacking CTEs on top of each other, your query optimizer is going to have a bad time. Given we keep our models to less than ~100 lines of SQL, we don’t really see this happening.
What’s the deal with the final
CTE, isn’t that redundant?
By now, hopefully you’ve bought into the fact that CTEs don’t affect query performance. But what’s with the last CTE, it is just so uneccesary!
I thought so too when I started at Fishtown and had to adopt this style guide. Having this last CTE makes debugging so much easier. If something weird is happening in your query, it’s much easier to replace:
select * from final
with
select * from customer_orders
rather than having to nest that last query in parentheses or comment it out. Give it a go, and you’ll see what we mean!
Example SQL:
with customers as (
select * from {{ ref('stg_jaffle_shop__customers') }}
),
orders as (
select * from {{ ref('fct_orders') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders,
sum(amount) as lifetime_value
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders,
coalesce(customer_orders.lifetime_value, 0) as lifetime_value
from customers
left join customer_orders using (customer_id)
)
select * from final