Myself and a colleague are at loggerheads over whether this implementation of the is_incremental() macro is valid. Please help us settle a very heated debate!
The context of why I’m trying to do this
We are using dbt-postgres. We would like to detect the changes in the raw table (i.e. inserts or updates) and append or update our int_purchased_item model accordingly.
-- Staging model stg_purchased_item
with source as (
select * from {{ source('item', 'purchased_item') }}
),
purchased_item as (
select ...
from source
)
select * from purchased_item
-- Intermediate model int_purchased_item - OPTION 1
{{ config(
materialized='incremental',
unique_key='purchased_item_id'
)
}}
with purchased_item as (
select * from {{ ref('stg_purchased_item') }}
{% if is_incremental() %}
where updated > (select max(updated) from {{ this }})
{% endif %}
),
category as (
select * from {{ ref('stg_category') }}
),
final as (
select
from purchased_item
left join category
)
select * from final
-- Intermediate model int_purchased_item - OPTION 2
{{ config(
materialized='incremental',
unique_key='purchased_item_id'
)
}}
with purchased_item as (
select * from {{ ref('stg_purchased_item') }}
),
category as (
select * from {{ ref('stg_category') }}
),
final as (
select
from purchased_item
left join category
)
select * from final
{% if is_incremental() %}
where updated > (select max(updated) from {{ this }})
{% endif %}
Our concern is whether we have placed the {% if is_incremental() %} logic in the correct place, and whether it is valid to apply it within a CTE as we have done in Option 1, versus placing it at the very end of the model as in Option 2.
If both are valid, which one is likely to be more performant?
The question of whether there is a “correct” answer is a little vague without having some very specific criteria for correctness. SQL is a declarative language, and both examples declare the same thing, so in principle they are identical.
In practice this is an example of what’s called “predicate pushdown”. Predicate means “where clause” and pushing it down means, performing the filtering operation as early as possible, because performing the join step of this query is a much more expensive operation than filtering, and doing the join only to throw away a huge portion of the results would be very stupid.
So, most modern database engines are not this stupid and they will evaluate whether it is possible to push down the predicate into the CTE without you telling it to do so. If it is, they will silently perform that optimisation under the hood for you. This gets a bit more obvious if in Option 2, you put the is_incremental block into the CTE called “final”, so the join and the WHERE are in the same step. How will the database choose which step to do first?
This depends on the database’s SQL interpreter and what it chooses to do, and it also depends on other optimisations to the table such as keys and indices defined on the tables - you didn’t explicitly specify a join condition in your example but the join performance will be impacted by these factors. It can sometimes be very difficult to predict what a particular database will make of any given query and performance can be bad or good for poorly-understood reasons. You can explore more of this with the EXPLAIN statement and similar tools in different databases. One example of a case that can confuse predicate pushdown is where you’re querying a view and the predicate could be pushed down into a CTE in the view, but the database may not always be able to figure this out and may process more data than it needs to.
The final factor is readability. dbt has a convention of having small CTEs doing SELECT * FROM {{ ref('my_dependency') }} but other than that, these examples are very simple one-statement queries. In a real-world model of a few hundred lines, readability is going to become a concern, and if your database engine is smart enough to push down this kind of predicate properly, readability is a much bigger concern. So someone arguing for the predicate to come later in the query on the basis that is more readable with no performance impact, would have a pretty good case, I think.
So to summarise:
It may make a difference to performance but if your database is modern then it usually won’t… but it might
You can test the performance and see for yourself
If there is no performance difference, the argument is academic and you’re free to structure the query whatever way makes the most sense to you
Using CTEs extensively in SQL code can often feel like you’re writing code procedurally, but this can also be a trap - you aren’t actually doing so, you’re still writing declarative code, and the database will make up its own mind about the right order of operations.
As best practice I would still put the incremental filter as close to the original ref as possible. The only reason for putting it in the final CTE is if it also depended on the join.
Note: @Eugene Kim (Orrum) originally posted this reply in Slack. It might not have transferred perfectly.