The data source from which I’m getting raw data is updated by an outside provider every two weeks and, it appears, drops off data older than 5 years. For example, the update they made Aug 15, 2023 included data from August of 2018 but the update on Sept 1 dropped that August 2018 data.
My goal is to not loose data from my models when the source system drops those records off the published file. So, I started looking at incremental materializations, but the data source does not include any timestamp indicating when a record was created or updated. Is there a way, then, to set up an incremental materialization that doesn’t require a date/time value in the input data?
The other approach I’ve thought of, assuming an incremental model wil not work without a date/time, is to alter the ingestion approach to read out of the source file and write all records to a new table where I append a date stamp. This would result in potentially multiple records for the same record key (station_number, year, month) but I could filter from there into a derived source table by taking just the most recent record. That seems like a long way to go, though, so I’m hoping there’s something easier.*
- Unfortunately, it appears getting the data provider to do sane things is not an option.