Why the Fishtown SQL style guide uses so many CTEs

If you’ve taken a look at Fishtown’s style guide, or seen some of the SQL throughout the docs and example projects (which mostly aligns with the style guide), you might have noticed a proliferation of CTEs (common table expressions). If you haven’t noticed this, check out the example below.

So, what’s the deal?

Shouldn’t those CTEs be subqueries?

Many people mistakenly believe that CTEs are “optimization fences” that perform worse than subqueries. On modern data warehouses this isn’t true (or at least, the difference is rarely big enough to matter). CTEs are way easier to read and debug than subqueries. If you’re still using subqueries and you’re on Snowflake, Redshift, BigQuery, or Postgres 12 or later, you might be due for a refactor!
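For example, here’s the kind of refactor we mean, sketched using the orders aggregation from the example at the bottom of this post. A nested subquery like this:

select
    customers.customer_id,
    customer_orders.number_of_orders

from customers

left join (

    select
        customer_id,
        count(order_id) as number_of_orders

    from orders

    group by 1

) as customer_orders using (customer_id)

reads much more cleanly as a CTE:

with customer_orders as (

    select
        customer_id,
        count(order_id) as number_of_orders

    from orders

    group by 1

)

select
    customers.customer_id,
    customer_orders.number_of_orders

from customers

left join customer_orders using (customer_id)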

What’s the deal with the CTEs at the top?

We start each model by “importing” each upstream relation that’s going to be used, similar to how you import the packages you use at the top of a Python file. I’ll let @tristan explain (from here):

This achieves a couple of nice things. First, it makes it very easy to see what the dependencies are for a given model when you first open up its file. Second, it gives you an easy mechanism to consistently alias all input tables—this mechanism ends up being much cleaner than typing out and then aliasing tables individually within each individual CTE. Overall, we’ve found that this pattern leads to significantly more readable code.

And we’ve done the research: these CTEs don’t significantly affect query performance.

Caveat: at some point, if you keep stacking CTEs on top of each other, your query optimizer is going to have a bad time. Given that we keep our models to fewer than ~100 lines of SQL, we don’t really see this happening.

What’s the deal with the final CTE? Isn’t that redundant?

By now, hopefully you’ve bought into the fact that CTEs don’t affect query performance. But what’s with the last CTE? It seems so unnecessary!

I thought so too when I started at Fishtown and had to adopt this style guide. But having this last CTE makes debugging so much easier. If something weird is happening in your query, it’s much easier to replace:

select * from final

with

select * from customer_orders

rather than having to nest that last query in parentheses or comment it out. Give it a go, and you’ll see what we mean!


Example SQL:

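-- import the upstream models this query depends on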
with customers as (

    select * from {{ ref('stg_jaffle_shop__customers') }}

),

orders as (

    select * from {{ ref('fct_orders') }}

),

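-- aggregate orders to one row per customer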
customer_orders as (

    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders,
        sum(amount) as lifetime_value

    from orders

    group by 1

),

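-- join the order aggregates onto customers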
final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders,
        coalesce(customer_orders.lifetime_value, 0) as lifetime_value

    from customers

    left join customer_orders using (customer_id)

)

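-- select from the last CTE; swap final for another CTE name when debugging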
select * from final

Coming from a largely software development background, this was one of the first elements of the dbt style that resonated with me and made the overall framework more approachable. Tristan’s proof that the form (style) did not necessarily come at the sacrifice of function (performance) similarly resonated with me as an evaluator of this tool. Thanks for this/these posts Claire, et al!


I’ve been wondering this for a while and appreciate the writeup!! I love CTEs, but they do seem a bit excessive in the dbt examples, and I wonder if they take away from readability.

A couple of additional questions here:

  1. “Second, it gives you an easy mechanism to consistently alias all input tables” - would you not be better served by using that consistent alias in your table names? Does orders really need “fct_” in front of it, and does customers really need “stg_jaffle_shop__”? If they didn’t, you wouldn’t have to remember and maintain the loosely held common alias mechanism across all your files.

  2. Is the last example of switching which table you’re returning really a common one in your workflows? If not, it’s also not too hard to comment out a section (easy with most code editors and a shortcut) and type “select * from customer_orders” at the end. That’s not really too many additional key presses, but I may be missing something.

  1. Model names need to be unique in dbt (even if the files live in different directories). You’ll end up hitting naming collisions as your project grows, so we choose to be overly verbose in our file names.

  2. Yep, we end up switching this last CTE out all the time when writing complex models. It’s rare that I’ll get a complex transformation right the first go, and using this workflow when debugging ends up being really useful. Of course, YMMV :slight_smile: (especially if you manage to write SQL correctly the first go :raised_hands:)


The final CTE (which I had not yet read about; I didn’t know it was your best practice, and I thought I’d read everything haha) just blew my MIND :exploding_head:. How many times have I found myself debugging and running into annoying commenting-out and whatnot. I’m sold!


I loved the select * from final too.


This is no doubt also true of nested subqueries. Query optimizers work best when they have up-to-date statistics and other table metadata, and a small search space for alternative query execution plans.

Hi,

I want to ask: why does dbt promote the use of

select * from table

syntax? Any particular reason?

Hi @lewisthewonder,
my (very personal) take is that these select *s are used in CTEs whose only function is to resolve the abstract reference created by the ref() function.

So you use the first CTE to resolve whatever ref('customers') points to into a local name like customers, and then you select what you need from that local name when you use it.

I would also say that in the ingestion part of a “SQL transformation pipeline” it is often useful to carry whatever comes from the source through the first few models up to its “permanent resting place”, while when picking the data to be served out in marts it just makes sense to select only the columns you really want to serve out. See the sketch below.
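For instance, a mart model might trim its import CTE down to just the columns it needs (a sketch reusing the staging model from the example above):

with customers as (

    select
        customer_id,
        first_name,
        last_name

    from {{ ref('stg_jaffle_shop__customers') }}

)

select * from customers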


FYI, I created a macro that generates import statements automatically.

You can call it like this:

{{ generate_imports(['model1', 'model2']) }}

Which will result in:

WITH model1 AS (

  SELECT *
  FROM {{ ref('model1') }}

), model2 AS (

  SELECT *
  FROM {{ ref('model2') }}

)

Here’s the macro:

{% macro generate_imports(refs) %}

{% set import_sql %}
WITH
{% for model in refs %}

{# use the model name itself as the CTE name #}
{{ model }} AS (

  SELECT *
  FROM {{ ref(model) }}

){% if not loop.last %},{% endif %}
{% endfor %}
{% endset %}

{% if execute %}

{{ log(import_sql, info=True) }}
{% do return(import_sql) %}

{% endif %}

{% endmacro %}

Hey JT,
Nice idea, but I think one key step is renaming the model to something that makes “local sense”.

So a more useful signature could take a tuple:
{{ generate_imports([('model1', 'CTE1_name'), ('model2', 'CTE2_name')]) }}
and of course, when a CTE name is not given, use the model name.
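Something like this could work; a rough, untested sketch extending JT’s macro above, falling back to the model name when no CTE name is given:

{% macro generate_imports(refs) %}

{% set import_sql %}
WITH
{% for item in refs %}

{# accept either 'model_name' or ('model_name', 'cte_name') #}
{% if item is string %}
{% set model = item %}
{% set cte_name = item %}
{% else %}
{% set model = item[0] %}
{% set cte_name = item[1] %}
{% endif %}
{{ cte_name }} AS (

  SELECT *
  FROM {{ ref(model) }}

){% if not loop.last %},{% endif %}
{% endfor %}
{% endset %}

{% do return(import_sql) %}

{% endmacro %}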

Great idea, and I noted that @tmurphy built something to do that at GitLab.

Late to the party here, but I’ve been thinking a lot about DAG lineage, and this guide is helpful. Treating refs as imports makes a lot of sense.

One thing that feels like it’s missing is a way to define all the relationships that come together to build a model like dim_store (20+ joins in my case) without creating a DAG node, and be able to reference that abstraction, and have dbt compile the necessary select statement + joins.

My observation is that deriving the 100th feature about a store is easy if you start with dim_store, but then you can’t get the feature back onto dim_store, or access it in the int layer. So you make a new model: int_store_feature_100, and copy-paste some of the logic that builds dim_store… But that copying step is pretty clearly defined: you just need stg_store, int_store_feature_98 and int_store_feature_99, and they all join 1:1 on store_id as dim_store already knows.

If the defining SQL of dim_store was an abstract non-DAG-node thing that compiled to the minimal SELECT, FROM, JOIN then it could make DAG development feel more like end-node querying and automate adherence to some of these best practices.

This should be flagged for members going through the fundamentals dbt training, along with the linked style guide. :grinning: :+1:

A word of caution when used in combination with PostgreSQL FDW (Foreign Data Wrapper).

In my setup, I connected to another Postgres DB and all I did was a simple join following the CTE pattern (which I like in general).

The query optimizer at that point decides what query it will send to the remote server, i.e. whether to send two simple queries or a single join.

With the CTEs in place, the query optimizer asked the remote server for the (full) data of the two tables in order to join them locally. This blew up the pgsql_tmp file so much that we noticed it.

Refactoring the same logic into a single query was not hard and actually led to the expected optimization: the join happened on the remote server and only the result was transferred across the wire. It was also a lot faster.

EXPLAIN VERBOSE sheds some light on what the query optimizer does with the CTEs, especially in the FDW scenario.
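If you want to check what your own setup does, running EXPLAIN VERBOSE on a query shaped like the import-CTE pattern shows whether the join is pushed down (the schema and table names here are hypothetical):

explain verbose
with customers as (
    select * from remote_schema.customers
),
orders as (
    select * from remote_schema.orders
)
select *
from customers
join orders using (customer_id);

-- a single Foreign Scan whose "Remote SQL" contains the join means it was
-- pushed down to the remote server; one Foreign Scan per table means the
-- full tables are fetched and joined locally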

EDIT: As usual with Postgres, this is documented (PostgreSQL: Documentation: 13: F.33. postgres_fdw):

When postgres_fdw encounters a join between foreign tables on the same foreign server, it sends the entire join to the foreign server, unless for some reason it believes that it will be more efficient to fetch rows from each table individually, […]