Hi all! I recently did some research on a question that comes up when I talk to people about SQL and optimizers.
At Fishtown Analytics, we have a style guide for how we write SQL. Part of that style guide is using CTEs as though they’re import
statements. We start each model by “importing” each upstream relation that’s going to be used. In practice, that looks like this:
with customers as (
    select * from {{ ref('fct_customers') }}
),
orders as (
    select * from {{ ref('fct_orders') }}
)
...
This functions a lot like your imports at the top of a Python script (or any programming language):
import pandas as pd
import numpy as np
...
This achieves a couple of nice things. First, it makes it very easy to see a model's dependencies the moment you open its file. Second, it gives you a consistent way to alias every input table up front, which is much cleaner than typing out and aliasing full relation names inside each downstream CTE. Overall, we've found that this pattern leads to significantly more readable code.
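To make that concrete, here is a hypothetical downstream model (the join and column names are illustrative, not from a real project) showing how later CTEs reference the short "import" aliases rather than re-aliasing full relation names:

with customers as (
    select * from {{ ref('fct_customers') }}
),
orders as (
    select * from {{ ref('fct_orders') }}
),
customer_orders as (
    -- downstream logic references the short aliases, not the full upstream names
    -- (column names here are illustrative)
    select
        customers.customer_id,
        count(orders.order_id) as order_count
    from customers
    left join orders
        on orders.customer_id = customers.customer_id
    group by 1
)
select * from customer_orders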
This pattern causes consternation among folks familiar with the optimizers on older databases: lots of CTEs and lots of select *, neither of which is a good idea according to conventional wisdom. I've always believed this is no longer true on modern analytic databases (BigQuery, Snowflake, Redshift), but I hadn't done the detailed work to validate that. Until now.
I wrote two versions of a query. The first selects directly from a raw order_items
table and does a simple aggregation on it. Here’s what that looks like with the associated Redshift explain plan:
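In rough form (the aggregated column is a placeholder here, not necessarily the one I actually used), the direct version is just:

-- direct query: a simple aggregation straight off the raw table
select
    count(distinct customer_id)
from order_items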
The second passes the source data through ten CTEs before finally doing an aggregation on it. Here's what that looks like (along with the associated Redshift explain plan):
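Again in rough form, with the same placeholder column, the CTE version chains ten passthroughs before running the identical aggregation on the last one:

-- ten "do-nothing" passthrough CTEs, then the same aggregation
with cte_01 as (
    select * from order_items
),
cte_02 as (
    select * from cte_01
),
cte_03 as (
    select * from cte_02
),
cte_04 as (
    select * from cte_03
),
cte_05 as (
    select * from cte_04
),
cte_06 as (
    select * from cte_05
),
cte_07 as (
    select * from cte_06
),
cte_08 as (
    select * from cte_07
),
cte_09 as (
    select * from cte_08
),
cte_10 as (
    select * from cte_09
)
select
    count(distinct customer_id)
from cte_10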
The thing you'll note about these two explain plans is that they are identical. Redshift doesn't care if you pass the source data through ten "obfuscating" CTEs, each of which selects every single column in the underlying table. It "optimizes through" the CTEs and understands what the user is actually asking for: a simple count distinct on top of a single column. The other columns in the table don't get used in the plan at all.
OK, great. Redshift treats CTEs as pass-throughs. Let's prove that the same is true of BigQuery and Snowflake. BigQuery first. Here are the two explain plans (first the direct query, second the one with ten passthrough CTEs):
Because BigQuery is a shared resource, no two executions will ever be exactly identical. These are as close to identical execution plans as you can really get. On to Snowflake:
This one is slightly more perplexing: the second query (with the CTEs) is actually faster! That doesn't make a ton of sense to me, but I also ran this query second, and my guess is that I wasn't able to fully invalidate the cache before the second run. More importantly, though, the explain plan is identical across both runs: Snowflake treats the two queries as identical from an optimizer's perspective.
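(For anyone reproducing this: Snowflake's result cache can be disabled per session, which helps with this kind of timing comparison, though it doesn't clear the warehouse's local disk cache, so it may not fully eliminate the effect.)

-- turn off Snowflake's query result cache for the current session
alter session set use_cached_result = false;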
Summary
All modern analytical database optimizers appear to treat our "import statement" CTEs as pass-throughs. The CTEs have no impact on performance whatsoever; they act exactly the way we want them to: as a great tool for cleaning up our code, not one that changes the explain plan.