I’m being challenged on some very basic dbt concepts and this is my last resort. If anyone can please provide any insight into this, I would appreciate it.
I’m adding 6 test records into my raw table with a valuation_date of 12/31/2024. I run the model for the first time and it inserts the 6 records into my dbt schema table with a unique batch_id. As a test, I then insert one new record with a valuation_date of 1/31/2025 into my raw data. Per the rules of incremental loads, it should only pick up this one record to process and I should see a completely different batch_id for this one record.
It is not doing this. Instead, the incremental model below inserts 7 new records under the new run's batch_id on top of the 6 already in my dbt schema table, leaving 13 rows, 6 of which are exact duplicates apart from the batch_id.
In another test, I included a unique_key in my config and ran the exact same scenario described above. After the second run there are only 7 records in my dbt schema table, which appears correct, but all batch_ids are the same. That is not correct, since the one new record I added should have a different batch_id.
Again, any help on this is appreciated. I’m doing this on my own, at home, so if I can’t figure this out, I’m SOL. Thank you.
{{
    config(
        materialized='incremental'
    )
}}

with final_hrt as (
    select
        ...redacting listed fields here...
        valuation_date,
        '{{ invocation_id }}' as batch_id
    from {{ source('jaffle_shop', 'cumulative_hrt') }}
    {% if is_incremental() %}
    where valuation_date >= (select max(valuation_date) from {{ this }})
    {% endif %}
)

select * from final_hrt
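For context, here is a sketch of the dbt behavior that would explain both test results (assuming the default incremental strategy for the adapter): without a unique_key, dbt simply appends every row the model query returns; with a unique_key, it merges instead, updating every column, including batch_id, on rows whose key already exists in the target. The only config difference is:

```sql
{{
    config(
        materialized='incremental',
        unique_key='hrt_id'  -- hypothetical key column, not from the original model
    )
}}
```

Because the >= filter re-selects the 12/31/2024 rows on the second run, append mode duplicates them under the new batch_id (13 rows total), while merge mode overwrites their batch_id with the new invocation_id (7 rows, all one batch_id), which matches both tests.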
I would suggest reviewing the compiled code (target folder → compiled) and checking that all the source tables, variables, and input values are those you’re expecting.
Note: @oferbrmn originally posted this reply in Slack. It might not have transferred perfectly.
Thank you for your response. My compiled code for this model looks ok to me. I don’t see anything out of the ordinary unless I’m missing something.
with final_hrt as (
    select
        ...field list redacted...
        valuation_date,
        'bd2d0c2e-717b-40eb-bfce-c3dc3807fd4a' as batch_id
    from raw.jaffle_shop.cumulative_hrt
    where valuation_date >= (select max(valuation_date) from analytics.dbt_redacted.load_hrttest)
)

select * from final_hrt
Ok, I figured out what my issue is.
My where clause uses >= when it should just use >. >= does not work for my date column, where valuation_date represents an end of month: the "equal to" part always re-selects the records that were loaded last time. In fact, I don’t see why equal-to should ever be used in an incremental filter, since it can pick up records that already exist in the target. But if you only use >, you may miss records in the new load when their date happens to equal the max from the last load.
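One common compromise for exactly that trade-off (a sketch, and it may not fit this model) is to keep the safety of >= but pair it with a unique_key, so the overlap rows at the boundary date are merged rather than appended as duplicates:

```sql
{{
    config(
        materialized='incremental',
        unique_key='hrt_id'  -- hypothetical key column, not from the original model
    )
}}

select
    ...redacting listed fields here...
    valuation_date,
    '{{ invocation_id }}' as batch_id
from {{ source('jaffle_shop', 'cumulative_hrt') }}
{% if is_incremental() %}
where valuation_date >= (select max(valuation_date) from {{ this }})
{% endif %}
```

The cost is the one seen in the earlier unique_key test: merged rows take on the latest batch_id, so batch_id stops telling you when a row first arrived.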
I’m going to play around with this a little more because I’m not fully convinced of the strategies I’ve seen so far. Should I close this since I’ve discovered my initial issue?
I’m going to close this out. I have basically answered my own question. In the messing around that I’m doing in dbt, what I’ve settled on is this:
{% if is_incremental() %}
where etl_create_date > (select max(etl_create_date) from {{ this }})
   or etl_update_date > (select coalesce(max(etl_update_date), '1900-01-01') from {{ this }})
{% endif %}
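A variation on the same idea (just a sketch, and assuming your warehouse supports greatest(), as Snowflake, BigQuery, and Postgres do): compute one effective timestamp per row, falling back to the create date when the update date is null, and compare it against a single high-water mark:

```sql
{% if is_incremental() %}
where greatest(etl_create_date, coalesce(etl_update_date, etl_create_date))
    > (
        select max(greatest(etl_create_date, coalesce(etl_update_date, etl_create_date)))
        from {{ this }}
      )
{% endif %}
```

The semantics differ slightly from the two-condition version: the single watermark is at least as high as both per-column maxes, so it is marginally stricter about which rows qualify.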
There is a bit of nuance that goes into an incremental strategy, and the answer is not always clear-cut. I don’t think we talk often enough about the different ways data is presented to the user or engineer. For an incremental strategy to work, the source has to be some kind of real-time insert/update table where a date column is set with current_timestamp(), indicating when each record was created or updated.
How the data arrives should always be examined before building a strategy. For example, where I work, we get full or partial refreshes of the data on an iterative basis, and an incremental model in dbt wouldn’t really work with that kind of delivery.