Couple of thoughts to add to this great post!
Even the Big Easy can get Pretty Hairy if you’re combining multiple staged models into a fact table where each staged model could be quite large and potentially benefit from incrementality.
For example, your fact represents shipped orders (fct_shipments) based on a staged or base model of shipment data and you’d like to include several fields from a separate staged or base model for orders.
Your first instinct might be to select from both staged model incrementally.
However, this will quickly lead to out of sync fact tables if you’re not careful.
Let’s say orders can ship 1 or more, or several days, after the order has been placed. Let’s further say that to keep things simple and performant, we want build incremental logic based on dates (to support partition pruning etc) and not orders ids.
So, you might start with something like this:
{{
config(
materialized = 'incremental',
unique_key = 'ship_date'
)
}}
with shipments as (
select * from {{ ref('shipments') }}
{% if is_incremental() %}
where ship_date >= (
select dateadd(day, -3, current_date)
)
{% endif %}
),
orders as (
select * from {{ ref('orders') }}
{% if is_incremental() %}
where order_date >= (
select dateadd(day, -3, current_date)
)
{% endif %}
),
shipped_orders as (
-- inner join the two models
...
)
In this case, you’d only be selecting orders that were placed and shipped in the last 3 days.
Also, because we’re going to be deleting target data based on ship date, we’d be wiping out all shipments for a particular ship date, only to be replacing it with potentially partial shipment data for that date.
So, depending on your business model, you may have to either make window for your incremental date logic much wider, or forgo that part altogether and get all orders regardless of date, and then hope the inner join to shipments for the last 3 days will get you some benefits from the query optimizer.
Hope that made sense…!