How much incrementalism should I use in my project?

# aug_customers.sql
select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders

from jaffle_shop.customers

left join customer_orders using (customer_id)

If we were building a normal materialized table, it wouldn't matter that both sources get updated, since we recreate the entire table on each run.

But can we convert the above transform to incremental if both customer_orders and jaffle_shop.customers are incremental? That is, changes in both customer_orders and jaffle_shop.customers should be reflected in the final table.

The issue is that the above query converted to incremental will only cover changes from jaffle_shop.customers.

The only way I can see to handle both sources incrementally is to write two queries and then do a union. Is there a better way to implement this scenario with dbt? Even with the union it's not simple, as we will need to clear duplicates before dbt can execute incremental updates, and this becomes really complex as the number of sources increases.
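To make the union idea concrete, here's roughly what I mean (a sketch only; the updated_at column on customers is assumed, not shown in the example tables, and {{ this }} refers to the incremental model itself):

# aug_customers_union_sketch.sql (hypothetical)

with changed_customers as (
    select customer_id
    from jaffle_shop.customers
    {% if is_incremental() %}
    -- assumed updated_at column on the customers table
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}
),

changed_orders as (
    select customer_id
    from customer_orders
    {% if is_incremental() %}
    where most_recent_order_date > (select max(most_recent_order_date) from {{ this }})
    {% endif %}
),

-- union (not union all) clears duplicate keys that changed in both sources
all_changed as (
    select customer_id from changed_customers
    union
    select customer_id from changed_orders
)

select
    customers.customer_id,
    customer_orders.most_recent_order_date

from jaffle_shop.customers
inner join all_changed using (customer_id)
left join customer_orders using (customer_id)

Even this two-source version needs the dedupe step, and every extra source adds another CTE and another branch of the union.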

This is a good question, and it comes up frequently!

It doesn't matter whether the upstream tables are incremental models, standard tables, or views - dbt will build them in the right order and won't start on your aug_customers model until customers and customer_orders are complete.
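One thing worth noting: for dbt to infer that build order, aug_customers should reference its parents with ref() rather than hard-coded relation names. A sketch, assuming customers and customer_orders are both dbt models (if customers is a raw source you'd use source() instead):

# aug_customers.sql (ref() sketch)
-- ref() is what lets dbt build the dependency graph
select
    customers.customer_id,
    customer_orders.most_recent_order_date

from {{ ref('customers') }} as customers

left join {{ ref('customer_orders') }} as customer_orders using (customer_id)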

In your case, I would make it incremental like this:

# aug_customers.sql
-- materialize incrementally; unique_key lets dbt merge updated rows
-- for returning customers instead of appending duplicates
{{ config(materialized='incremental', unique_key='customer_id') }}

select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders

from jaffle_shop.customers

left join customer_orders using (customer_id)

-- below this line is the new stuff 

{% if is_incremental() %}
  where customers.customer_id in (
    select customer_id
    from customer_orders
    where most_recent_order_date >= (
      select max(most_recent_order_date) 
      from {{ this }}
    )
  )
{% endif %}

This would capture any customers who have ordered since the model was last built.
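If you also need the model to pick up changes to the customers records themselves (the other half of the original question), one possible extension - not part of the answer above, and assuming jaffle_shop.customers carries an updated_at column which you also select into the model so {{ this }} has it - is to union the two sets of changed keys inside the filter:

{% if is_incremental() %}
  where customers.customer_id in (
    -- customers with new orders since the last run
    select customer_id
    from customer_orders
    where most_recent_order_date >= (
      select max(most_recent_order_date) from {{ this }}
    )

    union

    -- customers whose own record changed since the last run
    -- (assumes an updated_at column, selected into this model too)
    select customer_id
    from jaffle_shop.customers
    where updated_at >= (
      select max(updated_at) from {{ this }}
    )
  )
{% endif %}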


Hi @joellabes ,

Thanks a lot for the reply.

I understand now why the materialization type won't matter with dbt, as it's doing the transformations with all the data, applying a filter on top of the result, and then pushing to the sink (at least in this case).

Is there a good way to move the incremental filter to the sources and not the transformed data?

When you say sources here, I'm not sure if you mean source in the way dbt refers to it (a raw, totally untransformed table which isn't controlled by dbt). If you do, then no, there's no way to move incremental filters into the source table, because dbt doesn't control that table; it just reads from it.

You can make your initial staging (bronze or silver) models which read from and tidy those sources incremental if you want to, but keep in mind that each model is standalone; there's no way to push logic into one model from another.
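For illustration, a minimal incremental staging model over a raw source might look like this (the source, table, and column names are all made up):

# stg_orders.sql (hypothetical)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_date,
    _loaded_at  -- assumed ingestion timestamp on the raw table

from {{ source('jaffle_shop', 'raw_orders') }}

{% if is_incremental() %}
-- only pick up rows loaded since this model was last built
where _loaded_at > (select max(_loaded_at) from {{ this }})
{% endif %}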

As a heads up, we strongly recommend holding off from making models incremental until the performance demands it. Although incrementals can improve warehouse performance, that comes at the cost of more complex model logic and a more brittle project whose tables are harder to modify as you develop.

In general, we prefer to optimise for humans’ brains!


Hi @joellabes ,

Thanks for the reply and the recommendation.

We were trying to build a fully incremental pipeline which works only on change data, but the pipeline is becoming really complex and hard to manage.

We were wondering if dbt could help with reducing the complexity, but it seems that it's more of a problem with the requirements we have set for ourselves.

Yep! Reliably getting just the modified data, across all your models, is very fiddly.

A way that dbt can help to reduce the complexity is to handle all of your dependencies automatically, so you can break your code up into small modular pieces but be confident that they'll be built in the right order (to say nothing of automatic quality testing, documentation, etc.).

The good news is that if you build everything as views at first, you can change some of the slow ones to tables with a single line of config later, instead of having to rewrite lots of DDL. And then if you do decide to turn some of those tables into incremental tables, you're taking on the complexity in the right places at the right time.
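Concretely, that "single line of config" is the materialized setting at the top of the model file (the unique_key here is illustrative):

-- start as a view (dbt's default materialization):
{{ config(materialized='view') }}

-- later, promote a slow model to a table by changing that one line:
{{ config(materialized='table') }}

-- and only when performance truly demands it, go incremental:
{{ config(materialized='incremental', unique_key='customer_id') }}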


We already have dependency and ordering working properly in our pipeline; we are using Databricks jobs/tasks and Airflow to manage orchestration.

But quality control, testing and documentation are something we really like in dbt.

Anyway, thanks again for the replies!


Also on the same thread: there's a good chance an incremental model works fine with two source tables, but what about the requirement where there are more than two source tables that each need to be incremented? It's very hard to track all the changes in each source separately!
Just wondering whether this same topic/thread can point to an appropriate approach, if any.

This is another great question - it's hard to answer in the abstract. Can you share an example of the data you're working with, and the shape you want it to be in when you're finished? As long as you have a consistent set of timestamps (or similar) accessible on all the source tables, it can be achievable, but it really depends.
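To sketch what "a consistent set of timestamps" buys you with more than two sources (every name below is hypothetical), a Jinja loop can generate one changed-key filter per source instead of hand-writing each branch of the union:

# aug_entity.sql (hypothetical)
{% set change_sources = ['stg_source_a', 'stg_source_b', 'stg_source_c'] %}

select
    entity_id,
    updated_at

from {{ ref('stg_source_a') }}
-- joins to the other sources would go here

{% if is_incremental() %}
where entity_id in (
    {% for src in change_sources %}
    -- assumes every source exposes the same updated_at column
    select entity_id
    from {{ ref(src) }}
    where updated_at > (select max(updated_at) from {{ this }})
    {% if not loop.last %}union{% endif %}
    {% endfor %}
)
{% endif %}

Adding a fourth source is then just one more entry in the change_sources list, rather than another hand-maintained subquery.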