Is there a macro, or a best practice, to get data either from a `ref` table if that table has data, or from an external table whose data was previously unloaded to S3?
We have models that create more than 100 million rows per day in a table, and we plan not to keep this data in our Redshift database longer than necessary (a few days/weeks/months).
The idea is to unload the daily data to S3 and re-use it at a later stage without having to re-run this complicated model.
Given the volume of this table, downstream models need to run quickly against the data already in the database during normal runs, without waiting for that data to be unloaded and then loaded back.
For a later rerun of one of the downstream models, it would be nice to have a way to load the data from S3 without re-running the upstream model.
The normal use case would be the daily load where we:
- Run model A incrementally
- Unload data from A (for this date)
Once its dependencies have run, we know we can run model B:
- Run model B
  - First check whether data is available in model A
  - Use the data already in model A from the previous run
After days/weeks/months, we drop data of a certain age from the database.
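The daily unload step could be wired up as a dbt macro around Redshift's `UNLOAD` command. A minimal sketch, assuming a `load_date` column on model A; the bucket name and IAM role ARN are placeholders, not anything from our actual setup:

```sql
-- macros/unload_model_a.sql — sketch; bucket, IAM role, and load_date column are assumptions
{% macro unload_model_a(load_date) %}
  {% set unload_sql %}
    unload ('select * from {{ ref('model_a') }} where load_date = ''{{ load_date }}''')
    to 's3://my-bucket/model_a/{{ load_date }}/'
    iam_role 'arn:aws:iam::123456789012:role/redshift-unload'
    format as parquet
    allowoverwrite;
  {% endset %}
  {% do run_query(unload_sql) %}
{% endmacro %}
```

This could run right after the daily build, e.g. `dbt run-operation unload_model_a --args '{load_date: 2024-01-01}'`, or be attached as a `post_hook` on model A.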
For historical updates:
- Run model B
  - Check whether data is available in model A
  - If not, load model A with data from S3
  - Use the data in model A
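The "check and load" step above could look something like the macro below: it counts rows in A for the date and, if none are found, `COPY`s the previously unloaded files into a side table. A sketch only; the side-table model `model_a_from_s3`, the bucket, and the IAM role are assumed names:

```sql
-- macros/ensure_model_a_data.sql — sketch; table names, bucket, and IAM role are assumptions
{% macro ensure_model_a_data(load_date) %}
  {% if execute %}
    {% set result = run_query(
        "select count(*) from " ~ ref('model_a')
        ~ " where load_date = '" ~ load_date ~ "'") %}
    {% if result.columns[0].values()[0] == 0 %}
      {# No data in A for this date: restore it from S3 into the side table #}
      {% do run_query("truncate table " ~ ref('model_a_from_s3')) %}
      {% do run_query(
          "copy " ~ ref('model_a_from_s3')
          ~ " from 's3://my-bucket/model_a/" ~ load_date ~ "/'"
          ~ " iam_role 'arn:aws:iam::123456789012:role/redshift-unload'"
          ~ " format as parquet") %}
    {% else %}
      {# Data is present in A: make sure the side table is empty #}
      {% do run_query("delete from " ~ ref('model_a_from_s3')) %}
    {% endif %}
  {% endif %}
{% endmacro %}
```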
Looking for some documentation on how to do this.
The next thing to try: at the start of model B, run a macro that checks whether there is data in A for this date.
If not, copy all data for this date into another table, A-from-S3. If the data already exists in A, delete all data in A-from-S3.
In model B, read from a view that UNIONs the data in A with the data in A-from-S3.
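The union view itself could be a plain dbt model. Since the check-and-load logic keeps at most one of the two tables populated for a given date, `union all` (which skips the deduplication pass of a plain `union`) should be safe and cheaper. The model name `model_a_combined` is an assumption:

```sql
-- models/model_a_combined.sql — sketch of the union view over A and the S3-restored table
{{ config(materialized='view') }}

select * from {{ ref('model_a') }}
union all
select * from {{ ref('model_a_from_s3') }}
```

Downstream model B would then select from `ref('model_a_combined')` instead of referencing A directly.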