CTEs are Passthroughs--Some research!

Hi all! I recently did some research on a question that comes up when I talk to people about SQL and optimizers.

At Fishtown Analytics, we have a style guide for how we write SQL. Part of that style guide is using CTEs as though they’re import statements. We start each model by “importing” each upstream relation that’s going to be used. In practice, that looks like this:

with customers as (
  select * from {{ref('fct_customers')}}
),
orders as (
  select * from {{ref('fct_orders')}}
)
...

This functions a lot like your imports at the top of a Python script (or any programming language):

import pandas as pd
import numpy as np
...

This achieves a couple of nice things. First, it makes it very easy to see a model's dependencies the moment you open its file. Second, it gives you a single place to consistently alias all input tables, which ends up being much cleaner than typing out and aliasing tables inside each downstream CTE. Overall, we've found that this pattern leads to significantly more readable code.

The pattern causes consternation for folks familiar with database optimizers on older databases: lots of CTEs and lots of select *, neither of which is a good idea according to conventional wisdom. I've always believed that this is no longer true with modern analytic databases (BigQuery, Snowflake, Redshift), but I hadn't done the detailed work to validate that. Until now.

I wrote two versions of a query. The first selects directly from a raw order_items table and does a simple aggregation on it. Here’s what that looks like with the associated Redshift explain plan:
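As a rough sketch (order_items and customer_id are hypothetical stand-ins, not the exact names in the screenshots), the direct version is just an aggregation prefixed with explain:

-- Direct version: aggregate straight off the raw table and ask Redshift for the plan.
explain
select count(distinct customer_id)
from order_items;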

The second passes the source data through ten CTEs before finally doing an aggregation on it. Here's what that looks like, along with the associated Redshift explain plan:
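And the passthrough version, again as a sketch with the same hypothetical names (only three of the ten chained CTEs shown to keep it short):

-- Passthrough version: the same aggregation, but the source table is piped
-- through a chain of "select *" CTEs first (the real test chained ten).
explain
with cte_1 as (select * from order_items),
     cte_2 as (select * from cte_1),
     cte_3 as (select * from cte_2)
select count(distinct customer_id)
from cte_3;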

The thing you’ll note about these two explain plans is that they are identical. Redshift doesn’t care if you pass the source data through 10 “obfuscating” CTEs, each of which selects every single column in the underlying table. It “optimizes through” the CTEs and understands what the user is actually asking for: a simple count distinct on a single column. The other columns in the table don’t get used in the plan at all.

OK, great. Redshift treats CTEs as pass-throughs. Let’s prove that the same is true of BigQuery and Snowflake. BigQuery first. Here are the two explain plans (first the direct query, second the version with 10 passthrough CTEs):

Because BigQuery is a shared resource, no two executions will ever look exactly identical. These are as close to identical execution plans as you can really get. On to Snowflake:

This one is slightly more perplexing–the second query (with the CTEs) is actually faster! That doesn’t make a ton of sense to me, but I also ran this query second and my guess is that I wasn’t able to fully invalidate the cache for the second run. More importantly though, the explain plan is identical across both runs–Snowflake treats the queries as identical from an optimizer perspective.
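If you want to reproduce this and rule out the result cache, Snowflake lets you disable it per session and inspect the optimizer's plan with explain. A sketch, reusing the hypothetical names from above and assuming an account where the use_cached_result parameter and explain are available:

-- Make sure the second run can't be served from Snowflake's result cache.
alter session set use_cached_result = false;

-- Compare the plan the optimizer produces for the passthrough-CTE version.
explain using text
with cte_1 as (select * from order_items),
     cte_2 as (select * from cte_1)
select count(distinct customer_id)
from cte_2;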

Summary

All modern analytical database optimizers appear to treat our “import statement” CTEs as pass-throughs. These CTEs have no measurable impact on performance and act exactly the way we want: as a tool to clean up our code without changing the explain plan.


While CTEs function as passthroughs in these examples, on Redshift, at least, there is a limit at which too many CTEs cause a degradation in performance. See this article.


Nice link, thanks for sharing! I ultimately don’t think that this plays into the overall “CTEs as imports” pattern I describe above, as that article was mostly warning against excessively complicated code primarily written by automated tools. Very good to know though.

I’m just leaving this waypoint here for those on Postgres - there is no pass-through of CTE logic on that platform.

That’s absolutely correct–Postgres treats CTEs as optimization boundaries at the moment.

Hi Tristan, thanks for the research on CTEs. I was wondering if CTEs are still optimization boundaries in newer versions of Postgres?

I don’t actually know!! If you look into it I’d love to hear what you find…

Not for a while now, see: PostgreSQL: Documentation: 13: 7.8. WITH Queries (Common Table Expressions)


As @wrb points out, it looks like Postgres 12+ has resolved the issue with

Automatic (but overridable) inlining of common table expressions (CTEs)

PostgreSQL: Release Notes
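Concretely, since Postgres 12 a non-recursive CTE that is referenced only once (and has no side effects) gets inlined by default, and you can override that per CTE with materialized / not materialized. A quick sketch with made-up table and column names:

-- Postgres 12+: referenced once, so the planner inlines this CTE and the
-- filter on customer_id can still use an index on orders.
with recent_orders as (
    select * from orders
)
select count(*) from recent_orders where customer_id = 42;

-- Force the old pre-12 behaviour (CTE as an optimization fence):
with recent_orders as materialized (
    select * from orders
)
select count(*) from recent_orders where customer_id = 42;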

While this may be true, for longer graphs/chains of CTEs there comes a point at which the BigQuery planner throws up its hands, says “enough is too much”, and fails with an error complaining that the query is too complex.

While I have been able to clear this through some optimization and simplification of queries, most often, the only way to get the pipeline to run is to materialize one or more of the intermediate views as tables.

Has anyone else run into this? Or is there something peculiar about my environment/data?

I’m not sure if this is compatible with dbt models (I haven’t tried), but a solution I use elsewhere is to replace CTEs with temporary tables. I don’t think it’s anything peculiar about your environment; I’ve run into similar issues multiple times with a not-so-complex database.
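For illustration, a minimal sketch of that approach using BigQuery’s multi-statement (scripting) syntax, with made-up project/table names:

-- Materialize one stage as a temporary table instead of carrying it through
-- a long CTE chain, then aggregate off the temp table.
create temp table staged_orders as
select *
from `my_project.my_dataset.order_items`;

select count(distinct customer_id)
from staged_orders;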

Yeah, we had the same problem and had to replace some ephemeral models with tables too :confused:
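In dbt that switch is just a materialization change on the affected model; roughly (the model and ref names here are hypothetical):

-- models/staging/stg_order_items.sql
-- Build a real table instead of an ephemeral model, so downstream models
-- select from a table rather than inheriting yet another CTE.
{{ config(materialized='table') }}

select * from {{ ref('raw_order_items') }}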

Just posting this article below by database developer Dominik Golebiewski at Atheon Analytics, which confirms with Snowflake support that - at least in some circumstances - CTEs are optimisation fences in Snowflake.

I guess some questions are:

  • Whether this is a performance regression since the 2018 analysis by OP
  • What the circumstances are (e.g. table size, number of refs to the imported model) under which it acts as an optimisation fence
  • At what table size it’s worth refactoring CTE code for a typical non-latency-sensitive incremental model (e.g. 1 billion rows per dbt run)

I have faced the same issue with some SaaS ETL tools that generate code. I guess it doesn’t matter whether a human writes the code (CTE-based, or a long chain of CTEs) or an ETL tool generates it: it is always good practice to materialize at some point if you see such error messages.


I don’t use automatic SQL code generation tools, and I have had serious performance problems when converting some models to the CTE standard.

For models with many lines of code and many intermediate updates, I think the only solution is to put the code in pre- or post-hooks, since it is very difficult to fit all of that into CTEs, and a very complex CTE causes a lot of performance problems in the database.
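For what it’s worth, a rough sketch of what that looks like as dbt hooks (the hook SQL, schema, and model names below are placeholders, not a recommendation for any specific statement):

-- Hypothetical model that runs its intermediate updates outside the CTE chain.
{{ config(
    materialized='table',
    pre_hook="delete from analytics.scratch_orders where run_date = current_date",
    post_hook="analyze analytics.scratch_orders"
) }}

select * from {{ ref('stg_orders') }}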

Is this the same for Starburst/Trino?