Is there a macro, or a best practice, to get data either from a `ref` table if that table has data, or from an external table whose data was previously unloaded to S3?
We have models that create more than 100 million rows per day in a table, and we plan not to keep this data in our Redshift database longer than necessary (a few days/weeks/months).
The idea is to unload the daily data to S3 and re-use it at a later stage without having to re-run this complicated model.
Given the volume of this table, downstream models need to run quickly against the data already in the database during normal runs, without waiting for that data to be unloaded and then loaded back.
For a later rerun of one of the downstream models, it would be nice to have a way to load the data from S3 without re-running the upstream model.
The normal use case would be the daily load where we:
- Run model A incrementally
- Unload data from A (for this date)
Once its dependencies have run, we know we can run model B:
- Run model B
  - First check whether data is available in model A
  - Use the data already in model A from the previous run
After days/weeks/months, we drop data of a certain age from the database.
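The daily unload step could be wired up as a dbt macro around Redshift's `UNLOAD` command. A minimal sketch, assuming a `load_date` column on model A; the bucket name and IAM role ARN are placeholders, not anything from our actual setup:

```sql
-- macros/unload_model_a.sql — sketch; bucket, IAM role, and load_date column are assumptions
{% macro unload_model_a(load_date) %}
  {% set unload_sql %}
    unload ('select * from {{ ref('model_a') }} where load_date = ''{{ load_date }}''')
    to 's3://my-bucket/model_a/{{ load_date }}/'
    iam_role 'arn:aws:iam::123456789012:role/redshift-unload'
    format as parquet
    allowoverwrite;
  {% endset %}
  {% do run_query(unload_sql) %}
{% endmacro %}
```

This could run right after the daily build, e.g. `dbt run-operation unload_model_a --args '{load_date: 2024-01-01}'`, or be attached as a `post_hook` on model A.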
For historical updates:
- Run model B
  - Check whether data is available in model A
  - If not, load model A with data from S3
  - Use the data in model A
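The "check and load" step above could look something like the macro below: it counts rows in A for the date and, if none are found, `COPY`s the previously unloaded files into a side table. A sketch only; the side-table model `model_a_from_s3`, the bucket, and the IAM role are assumed names:

```sql
-- macros/ensure_model_a_data.sql — sketch; table names, bucket, and IAM role are assumptions
{% macro ensure_model_a_data(load_date) %}
  {% if execute %}
    {% set result = run_query(
        "select count(*) from " ~ ref('model_a')
        ~ " where load_date = '" ~ load_date ~ "'") %}
    {% if result.columns[0].values()[0] == 0 %}
      {# No data in A for this date: restore it from S3 into the side table #}
      {% do run_query("truncate table " ~ ref('model_a_from_s3')) %}
      {% do run_query(
          "copy " ~ ref('model_a_from_s3')
          ~ " from 's3://my-bucket/model_a/" ~ load_date ~ "/'"
          ~ " iam_role 'arn:aws:iam::123456789012:role/redshift-unload'"
          ~ " format as parquet") %}
    {% else %}
      {# Data is present in A: make sure the side table is empty #}
      {% do run_query("delete from " ~ ref('model_a_from_s3')) %}
    {% endif %}
  {% endif %}
{% endmacro %}
```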
Looking for some documentation on how to do this.
The next thing to try: at the start of model B, run a macro that checks whether there is data in A for this date.
If not, copy all data for this date into another table, A-from-S3. If the data already exists in A, delete all data in A-from-S3.
In model B, read from a view that UNIONs the data in A with the data in A-from-S3.
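The union view itself could be a plain dbt model. Since the check-and-load logic keeps at most one of the two tables populated for a given date, `union all` (which skips the deduplication pass of a plain `union`) should be safe and cheaper. The model name `model_a_combined` is an assumption:

```sql
-- models/model_a_combined.sql — sketch of the union view over A and the S3-restored table
{{ config(materialized='view') }}

select * from {{ ref('model_a') }}
union all
select * from {{ ref('model_a_from_s3') }}
```

Downstream model B would then select from `ref('model_a_combined')` instead of referencing A directly.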