When running backfills in incremental models, “obsolete records” may persist

The problem I’m having

We’ve seen that in incremental models, when the date field’s granularity drops to finer levels such as minutes or seconds, a data problem can occur: incorrect data persists in production on subsequent reloads/backfills of the same time period. To make what I am observing simpler to describe, I will use the following example at the day level.
We have the current model:

{{config(
  materialized='incremental',
  incremental_strategy='delete+insert',
  unique_key='date_day',
)}}

select date('2023-01-01') as date_day
UNION 
select date('2023-01-02') as date_day
UNION 
select date('2023-01-03') as date_day
UNION 
select date('2023-01-04') as date_day
UNION 
select date('2023-01-05') as date_day
UNION 
select date('2023-01-06') as date_day

I executed it the first time, and the table was created. If I execute it again, what the model does is:

CREATE TEMPORARY TABLE TEMP_TABLE AS (/* the model's SELECT above */);

DELETE FROM FINAL_TABLE WHERE date_day in (SELECT date_day from TEMP_TABLE);

INSERT INTO FINAL_TABLE SELECT * FROM TEMP_TABLE;
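To see why a missing day survives this sequence, here is a minimal Python sketch of the delete+insert semantics (my own simulation, not dbt’s actual code):

```python
# Simulate delete+insert: final_table and temp_table map unique_key -> row.
def delete_insert(final_table, temp_table):
    # DELETE FROM FINAL_TABLE WHERE date_day IN (SELECT date_day FROM TEMP_TABLE)
    for key in temp_table:
        final_table.pop(key, None)
    # INSERT INTO FINAL_TABLE SELECT * FROM TEMP_TABLE
    final_table.update(temp_table)
    return final_table

# Initial build: six days of data.
final = {f"2023-01-0{d}": f"row {d}" for d in range(1, 7)}
# Backfill query that no longer returns 2023-01-03.
backfill = {f"2023-01-0{d}": f"new row {d}" for d in (1, 2, 4, 5, 6)}

delete_insert(final, backfill)
print("2023-01-03" in final)  # True: the obsolete row is never touched
```

Because the DELETE only targets keys that appear in the new batch, a key that is absent from the batch is never deleted.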

Let’s imagine that we are doing a backfill of the data and that for date_day = 2023-01-03 we no longer have data. We then execute the following code:

{{config(
  materialized='incremental',
  incremental_strategy='delete+insert',
  unique_key='date_day'
)}}

select date('2023-01-01') as date_day
UNION
select date('2023-01-02') as date_day
UNION
select date('2023-01-04') as date_day
UNION
select date('2023-01-05') as date_day
UNION
select date('2023-01-06') as date_day

We will find that 2023-01-03 persists in the final table.
The problem comes from the DELETE command, which is not built as a range of dates like this:

DELETE FROM FINAL_TABLE
WHERE date_day >= (SELECT min(date_day) FROM TEMP_TABLE)
  and date_day <= (SELECT max(date_day) FROM TEMP_TABLE)
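Continuing the same sketch, a range-based delete like the one above would remove the orphaned row (again my own simulation, assuming the unique key sorts chronologically, as ISO dates do):

```python
# Simulate a range-based delete+insert: delete everything between the
# min and max key of the incoming batch, then insert the batch.
def range_delete_insert(final_table, temp_table):
    lo, hi = min(temp_table), max(temp_table)
    # DELETE FROM FINAL_TABLE WHERE date_day BETWEEN lo AND hi
    for key in [k for k in final_table if lo <= k <= hi]:
        final_table.pop(key)
    # INSERT INTO FINAL_TABLE SELECT * FROM TEMP_TABLE
    final_table.update(temp_table)
    return final_table

final = {f"2023-01-0{d}": f"row {d}" for d in range(1, 7)}
backfill = {f"2023-01-0{d}": f"new row {d}" for d in (1, 2, 4, 5, 6)}

range_delete_insert(final, backfill)
print("2023-01-03" in final)  # False: the stale row falls inside the range
```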

The context of why I’m trying to do this

We have some cases where the date key is at hour or minute granularity, and this is a situation that can happen with some of our data.

What I’ve already tried

I looked for a solution for incremental materialization with incremental_strategy equal to delete+insert, but I did not find anything. The only solution was to add a pre_hook, but that means we need to know beforehand the min and max dates that are going to be loaded into the final table.

{{config(
  materialized='incremental',
  incremental_strategy='delete+insert',
  unique_key='date_day',
  pre_hook="
  {% if is_incremental() %}
    delete from {{ this }} where date_day between date('2023-01-02') and date('2023-01-06')
  {% endif %}"
)}}
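One way to avoid hardcoding the dates is to pass them in as dbt vars (a sketch; the var names backfill_start and backfill_end are my own invention):

```sql
{{ config(
  materialized='incremental',
  incremental_strategy='delete+insert',
  unique_key='date_day',
  pre_hook="
  {% if is_incremental() %}
    delete from {{ this }}
    where date_day between date('{{ var('backfill_start') }}')
                       and date('{{ var('backfill_end') }}')
  {% endif %}"
) }}
```

Then run, for example: dbt run --select my_model --vars '{backfill_start: 2023-01-02, backfill_end: 2023-01-06}'. This still requires knowing the window beforehand, but keeps it out of the model code.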

The same happens with incremental_strategy='merge'.

Some example code or error messages

NA

Hi @miandreu, it is correct for dbt to not process records that do not exist in the query’s results.

The way incremental models work is that after the initial build, they only delete+insert/update records present in the later queries. (Check out Incremental models in-depth | dbt Developer Hub for a deeper dive)

Keep in mind that from dbt’s perspective, unique_keys have no relationship to one another (e.g. they are often md5 hashes), which is why the delete statement is not a range. You control the behaviour of your unique keys in your model, so I would recommend explicitly casting your keys to a date or truncating them to the day level to ensure that no hour/minute data sneaks in.
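For example, a model with a timestamp key could be normalised to day grain like this (a sketch; event_ts and stg_events are hypothetical names):

```sql
select
  -- truncate the timestamp so no hour/minute detail leaks into the key
  cast(date_trunc('day', event_ts) as date) as date_day,
  count(*) as row_count
from {{ ref('stg_events') }}
group by 1
```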

If you need to re-process the table after some source data has been deleted, you could run dbt run --select some_model --full-refresh, which will force the table to be dropped and rebuilt from scratch.