Incremental model backfill missing data

Hey, I have a use case when using dbt with Redshift where I want to aggregate data daily (the entire history from various tables) and save it with the current date as new records in my table.

So, on October 24th, my dbt model will calculate the aggregations (various dimensions and metrics), assign the date as October 24th, and save the data to the table. The next day, it will do the same but with October 25th. I want to build a historical table of my statistics this way.

I wanted to use the Incremental model in dbt for this, and everything seems pretty straightforward to implement. However, I have an issue where, from time to time, my process might fail. In such cases, when there is a gap in the data for a given day, I should take the data from the previous day and insert it while changing the date to the missing one. This seems simple enough to do manually, but I need to automate it in dbt – I’m just not quite sure how. The only solution I’ve come up with so far is to create a macro that triggers a procedure, and that procedure would contain the logic to fill in the missing data. It seems impossible to achieve this using only dbt models.

I should mention that I’m using dbt Cloud, and in the end, I’d like to have a process like:

  • Daily aggregation into my table
  • Tests to check if there are any missing dates, and if so, trigger a process to fill in the missing data.

Do you have any ideas or suggestions on how to approach this? Or there is some way to achieve what I need in simpler way ?

Is it possible to provide an variable instead of as-of-current?
i.e. by default, pull and aggregate data where (timestamp in column) <= {{ aggregation_date }}

(and the date on the aggregation rows that you add you use that same {{ aggregation_date }})

where aggregation_date, by default is current date, but where you can pass in as a --var an earlier date.

Note: @Renee originally posted this reply in Slack. It might not have transferred perfectly.

No it’s not possible to provide any date fields or so.

My query is very simple for that model.

Something like this

SELECT dim1, dim2, getdate() as date, SUM(metric1), SUM(metric2) 
FROM table
GROUP BY 1,2,3

And I just run scheduled job every day that build this model as incremental. So every day I have new records.

So in case of failure I want to fill missing date by using previous available day with data.

If it were BigQuery I’d suggest using the time travel feature FOR SYSTEM_TIME AS OF if you don’t have any date or timestamp fields in the table to use for selection,
but I don’t think Redshift has that functionality.

Without one or the other, I’m not sure how you would be able to backfill the aggregations for a missing date.

Curious to see what others might suggest here though.

Note: @Renee originally posted this reply in Slack. It might not have transferred perfectly.