The problem I’m having
We’ve seen in the “incremental models” where the date field’s degree of detail drops to lower levels like minutes or seconds, a data problem may occur where incorrect data persists in production on future reloads/backfills from the same period of time. To make it simpler to describe what I am observing, I will do so using the following example at the day level.
We have the current model:
{{config(
materialized='incremental',
incremental_strategy='delete+insert',
unique_key='date_day',
)}}
select date('2023-01-01') as date_day
UNION
select date('2023-01-02') as date_day
UNION
select date('2023-01-03') as date_day
UNION
select date('2023-01-04') as date_day
UNION
select date('2023-01-05') as date_day
UNION
select date('2023-01-06') as date_day
I executed it the first time, and the table was created. If I execute it more times, what the model is doing is:
CREATE TEMP_TABLE AS CTE;
DELETE FROM FINAL_TABLE WHERE date_day in (SELECT date_day from TEMP_TABLE);
INSERT INTO FINAL_TABLE SELECT * FROM TEMP_TABLE;
Let’s imagine that we are doing a backfill of the data and this data_day = 2023-01-03
we don’t have data. Then after executing the following code:
{{config(
materialized='incremental',
incremental_strategy='delete+insert',
unique_key='date_day'
)}}
select date('2023-01-01') as date_day
UNION
select date('2023-01-02') as date_day
UNION
select date('2023-01-04') as date_day
UNION
select date('2023-01-05') as date_day
UNION
select date('2023-01-06') as date_day
We will find that 2023-01-03
persist in the final table.
The problem comes from the DELETE
command that is not built as a range of dates.
DELETE FROM FINAL_TABLE
WHERE date_day >= (SELECT min(date_date) FROM TEMP_TABLE)
and date_day <= (SELECT max(date_date) FROM TEMP_TABLE)
The context of why I’m trying to do this
We have some cases where date_key is hours or minutes and this is a case that can happen with some data
What I’ve already tried
I look for a solution of incremental materialization and delete_strategy equal delete+insert, but I did not find anything.The only solution was to aggregate a pre_hook, but it means that we need to know before hand the min
an max
date that is going to be load in the final table.
{{config(
materialized='incremental',
incremental_strategy='delete+insert',
unique_key='date_day',
pre_hook="
{% if is_incremental() %}
delete from {{ this }} where date_day between date('2023-01-02') and date('2023-01-06')
{% endif %}"
)}}
The same happens with incremental_strategy=‘merge’
Some example code or error messages
NA