Hello!
I want to partition time series data in Redshift and load it into separate (“partitioned”) tables to leverage the benefits described in this beautiful post: What are the best practices to partition big tables in Redshift - #2 by drop_database_public.
And I can’t wrap my head around how the entire data flow might look like.
Layout
Imagine we have three canonical schemas:
- sources
- staging
- marts
- We load data into materialized tables in
sourcesschema - We put views on top of these tables in
staginglayer - We create purpose-built marts from staging views in
martslayer.
All works fine until we have multiple tables that play the role of partitions of a single logical table (e.g., sources.events_2022, sources.events_2023 and so on).
Solution 1
Load partitions into
sourceslayer and union them instaginglayer.
Yet there is a problem.
If I create “partitions” in sources schema dynamically I can’t dynamically reference them in dbt schema yaml files. As far as I understand there is no way to dynamically generate sources for each partition or to make a single source reference multiple partitioned tables.
Having to manually redeploy dbt project every day/month/year with new manually added partition sources is not a viable approach to say the least.
Solution 2
Load partitions into
sourceslayer and union them in a single view in the samesourcelayer.
Now it is possible to always reference a single source “table” in my schema yml file BUT I am no longer able to generate this view in dbt (because populating sources schema is the responsibility of a loader, not a transformer). I need loader to maintain this view which is a violation of dbt best practices.
Also I don’t really know if I can use a view as a source and what are the repercussions.
Solution 3
Load partitions into
sourceslayer without registering them as sources in dbt and union them instaginglayer
This way I need to hardcode tables names in model code in staging layer without using ref() macro. I lose all the benefits of dbt (automatic documentation, the ability to select everything that is downstream a model with source+ syntax, etc). It’s super dirty.
So…
Please help me to choose the most viable approach! Maybe there are other solutions that I didn’t think of.
Thank you!