Hello!
I want to partition time series data in Redshift and load it into separate (“partitioned”) tables to leverage the benefits described in this beautiful post: What are the best practices to partition big tables in Redshift - #2 by drop_database_public.
And I can’t wrap my head around how the entire data flow might look like.
Layout
Imagine we have three canonical schemas:
- sources
- staging
- marts
- We load data into materialized tables in
sources
schema - We put views on top of these tables in
staging
layer - We create purpose-built marts from staging views in
marts
layer.
All works fine until we have multiple tables that play the role of partitions of a single logical table (e.g., sources.events_2022
, sources.events_2023
and so on).
Solution 1
Load partitions into
sources
layer and union them instaging
layer.
Yet there is a problem.
If I create “partitions” in sources
schema dynamically I can’t dynamically reference them in dbt schema yaml files. As far as I understand there is no way to dynamically generate sources for each partition or to make a single source reference multiple partitioned tables.
Having to manually redeploy dbt project every day/month/year with new manually added partition sources is not a viable approach to say the least.
Solution 2
Load partitions into
sources
layer and union them in a single view in the samesource
layer.
Now it is possible to always reference a single source “table” in my schema yml file BUT I am no longer able to generate this view in dbt (because populating sources schema is the responsibility of a loader, not a transformer). I need loader to maintain this view which is a violation of dbt best practices.
Also I don’t really know if I can use a view as a source and what are the repercussions.
Solution 3
Load partitions into
sources
layer without registering them as sources in dbt and union them instaging
layer
This way I need to hardcode tables names in model code in staging layer without using ref()
macro. I lose all the benefits of dbt (automatic documentation, the ability to select everything that is downstream a model with source+
syntax, etc). It’s super dirty.
So…
Please help me to choose the most viable approach! Maybe there are other solutions that I didn’t think of.
Thank you!