Incremental model backfill missing data

masek94 · October 24, 2024, 11:52am

Hey, I have a use case when using dbt with Redshift where I want to aggregate data daily (the entire history from various tables) and save it with the current date as new records in my table.

So, on October 24th, my dbt model will calculate the aggregations (various dimensions and metrics), assign the date as October 24th, and save the data to the table. The next day, it will do the same but with October 25th. I want to build a historical table of my statistics this way.

I wanted to use the Incremental model in dbt for this, and everything seems pretty straightforward to implement. However, I have an issue where, from time to time, my process might fail. In such cases, when there is a gap in the data for a given day, I should take the data from the previous day and insert it while changing the date to the missing one. This seems simple enough to do manually, but I need to automate it in dbt – I’m just not quite sure how. The only solution I’ve come up with so far is to create a macro that triggers a procedure, and that procedure would contain the logic to fill in the missing data. It seems impossible to achieve this using only dbt models.

I should mention that I’m using dbt Cloud, and in the end, I’d like to have a process like:

Daily aggregation into my table
Tests to check if there are any missing dates, and if so, trigger a process to fill in the missing data.

Do you have any ideas or suggestions on how to approach this? Or there is some way to achieve what I need in simpler way ?

a_slack_user · October 24, 2024, 12:50pm

Is it possible to provide an variable instead of as-of-current?
i.e. by default, pull and aggregate data where (timestamp in column) <= {{ aggregation_date }}

(and the date on the aggregation rows that you add you use that same {{ aggregation_date }})

where aggregation_date, by default is current date, but where you can pass in as a --var an earlier date.

_{Note: @Renee originally posted this reply in Slack. It might not have transferred perfectly.}

masek94 · October 24, 2024, 1:01pm

No it’s not possible to provide any date fields or so.

My query is very simple for that model.

Something like this

SELECT dim1, dim2, getdate() as date, SUM(metric1), SUM(metric2) 
FROM table
GROUP BY 1,2,3

And I just run scheduled job every day that build this model as incremental. So every day I have new records.

So in case of failure I want to fill missing date by using previous available day with data.

a_slack_user · October 24, 2024, 3:21pm

If it were BigQuery I’d suggest using the time travel feature FOR SYSTEM_TIME AS OF if you don’t have any date or timestamp fields in the table to use for selection,
but I don’t think Redshift has that functionality.

Without one or the other, I’m not sure how you would be able to backfill the aggregations for a missing date.

Curious to see what others might suggest here though.

_{Note: @Renee originally posted this reply in Slack. It might not have transferred perfectly.}

Topic		Replies	Views
Running backfills in incremental models, “obsolete records” may persist Help incremental , snowflake	1	3910	April 5, 2023
Is there a way to specify backfill dates for Snowflake? Help	4	1444	May 30, 2023
Best way to backfill table with range of dates? Help	4	2946	February 28, 2024
Insert overwrite static partitions bug -- missing partitions Help incremental , bigquery , dbt-core	0	841	September 17, 2023
Clarification required in dbt docs Help incremental	0	727	October 31, 2023

Incremental model backfill missing data

Related topics