Hey community, I’ve been using dbt as part of my job for a while now, and have mostly used it in fairly “simple” pipelines that calculate different things.
Wherever I could, I created incremental models based on event timestamp, update timestamp, or whatever fits the scenario.
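For reference, this is roughly the pattern I’ve been using so far (model and column names here are just placeholders, not our real ones):

```sql
-- simplified sketch of a typical incremental model in our project
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    entity_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```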
I am currently facing a more complex situation where I need to transform data from the “intermediate” tier (“silver”, or whatever you like to call the middle layer) to the final tier, which will contain both clean data at its most granular form and aggregated tables.
The process of cleaning the data involves multiple types of filters, but the most “problematic” one is keeping only entities related to specific Salesforce objects we have.
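Conceptually, that filter looks something like this (simplified, with a placeholder Salesforce model name):

```sql
-- keep only entities that exist in the relevant Salesforce objects
select e.*
from {{ ref('int_events') }} as e
inner join {{ ref('stg_salesforce_accounts') }} as sf
    on e.entity_id = sf.entity_id
```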
This filter kind of breaks the whole incremental approach: if I only load data from the past 3 days, for example, and an entity suddenly becomes relevant in Salesforce today, I will only pick up its data from the last 3 days and miss all of its earlier history.
The reason we want to filter that data out is that we have a lot of noisy data, and we want a clean dataset scoped to Salesforce entities so that the table is significantly smaller and queries over it are faster.
I’ve searched the community and tried many different approaches, but it seems that no matter which one we go for, it can’t take full advantage of incremental materialization.
We tried a more complex WHERE clause inside the {% if is_incremental() %} block to account for “new” entities that suddenly appear, but it forces the query to scan the entire source table anyway, which kind of defeats the purpose of the incremental materialization.
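To make it concrete, the attempt looked roughly like this (simplified, placeholder names; the OR branch is what ends up forcing the full scan, since those entities need their whole history loaded):

```sql
{% if is_incremental() %}
  where event_timestamp > (select max(event_timestamp) from {{ this }})
     -- also pick up entities that only recently became relevant in Salesforce,
     -- which means loading their entire history, not just the last few days
     or entity_id in (
        select entity_id
        from {{ ref('stg_salesforce_accounts') }}
        where became_relevant_at >= current_date - 3
     )
{% endif %}
```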
Does anyone have an idea how we could keep this incremental in such a scenario? I’d appreciate any discussion so we can try new approaches.
Bonus question - how do you go about incrementally loading and calculating LIFETIME aggregations?
For example, the total number of logins a user had over their entire lifetime as a user.
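For context, the direction I was considering is an incremental model keyed on user_id that adds new counts onto the totals already stored in the target table (purely a sketch with placeholder names, assuming a merge-style incremental strategy):

```sql
-- hypothetical sketch: lifetime login count per user
{{ config(materialized='incremental', unique_key='user_id', incremental_strategy='merge') }}

with new_logins as (
    select
        user_id,
        count(*) as new_login_count,
        max(event_timestamp) as last_login_at
    from {{ ref('int_events') }}
    where event_type = 'login'
    {% if is_incremental() %}
      -- only count logins that arrived since the last run
      and event_timestamp > (select max(last_login_at) from {{ this }})
    {% endif %}
    group by user_id
)

select
    n.user_id,
    {% if is_incremental() %}
    -- add the new logins onto the lifetime total already in the target table
    coalesce(t.total_logins, 0) + n.new_login_count as total_logins,
    {% else %}
    n.new_login_count as total_logins,
    {% endif %}
    n.last_login_at
from new_logins as n
{% if is_incremental() %}
left join {{ this }} as t
    on n.user_id = t.user_id
{% endif %}
```

But I’m not sure how well this holds up with late-arriving events, so I’m curious how others handle it.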