End-to-end flow for using dbt with partitioned tables in Redshift.

evgeniy · November 21, 2023, 11:25am

Hello!

I want to partition time series data in Redshift and load it into separate (“partitioned”) tables to leverage the benefits described in this beautiful post: What are the best practices to partition big tables in Redshift - #2 by drop_database_public.

And I can’t wrap my head around how the entire data flow might look like.

Layout

Imagine we have three canonical schemas:

sources
staging
marts

We load data into materialized tables in sources schema
We put views on top of these tables in staging layer
We create purpose-built marts from staging views in marts layer.

All works fine until we have multiple tables that play the role of partitions of a single logical table (e.g., sources.events_2022, sources.events_2023 and so on).

Solution 1

Load partitions into sources layer and union them in staging layer.

Yet there is a problem.

If I create “partitions” in sources schema dynamically I can’t dynamically reference them in dbt schema yaml files. As far as I understand there is no way to dynamically generate sources for each partition or to make a single source reference multiple partitioned tables.

Having to manually redeploy dbt project every day/month/year with new manually added partition sources is not a viable approach to say the least.

Solution 2

Load partitions into sources layer and union them in a single view in the same source layer.

Now it is possible to always reference a single source “table” in my schema yml file BUT I am no longer able to generate this view in dbt (because populating sources schema is the responsibility of a loader, not a transformer). I need loader to maintain this view which is a violation of dbt best practices.

Also I don’t really know if I can use a view as a source and what are the repercussions.

Solution 3

Load partitions into sources layer without registering them as sources in dbt and union them in staging layer

This way I need to hardcode tables names in model code in staging layer without using ref() macro. I lose all the benefits of dbt (automatic documentation, the ability to select everything that is downstream a model with source+ syntax, etc). It’s super dirty.

So…

Please help me to choose the most viable approach! Maybe there are other solutions that I didn’t think of.

Thank you!

drop_database_public · December 3, 2023, 6:13am

possibly you can leverage dynamism in partition maintenance rather than partition design. eg you could use external tables over multiple files, and use materialized views to accelerate the hot portion of the data. the external table remains a single source, and you can always manually do CREATE EXTERNAL TABLE AS SELECT to rebuild/repartition it on the side. Def play around a bit. parquet is slower than internal tables but not bad. this approach gives you a single source. tradeoff is the need to maintain files, but c’est la vie.

Topic		Replies	Views
Postgres partitions Archive	0	3398	October 11, 2021
What are the best practices to partition big tables in Redshift In-Depth Discussions redshift , partitioning	2	31447	April 28, 2020
Load data from a non partition table to partition table Help	0	694	October 20, 2023
How to dynamically generate input to `partitions` config in BigQuery Archive	1	5542	September 18, 2019
Is dbt an ETL tool? Help	7	5501	August 24, 2018

End-to-end flow for using dbt with partitioned tables in Redshift.

Layout

Solution 1

Solution 2

Solution 3

So…

Related topics