End-to-end flow for using dbt with partitioned tables in Redshift.

Hello!

I want to partition time series data in Redshift and load it into separate (“partitioned”) tables to leverage the benefits described in this beautiful post: What are the best practices to partition big tables in Redshift - #2 by drop_database_public.

And I can’t wrap my head around what the entire data flow should look like.

Layout

Imagine we have three canonical schemas:

  • sources
  • staging
  • marts

  1. We load data into materialized tables in the sources schema.
  2. We put views on top of these tables in the staging layer.
  3. We create purpose-built marts from the staging views in the marts layer (a minimal sketch of this flow follows below).
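
For concreteness, here is what step 2 of that flow would normally look like, assuming a source named sources with a table events (the source name and column names are made up):

```sql
-- models/staging/stg_events.sql
-- A plain staging view on top of a table in the sources schema.
-- 'sources', 'events', and the columns are hypothetical.
select
    event_id,
    event_type,
    occurred_at
from {{ source('sources', 'events') }}
```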

This all works fine until we have multiple tables that play the role of partitions of a single logical table (e.g., sources.events_2022, sources.events_2023, and so on).

Solution 1

Load partitions into the sources layer and union them in the staging layer.

Yet there is a problem.

If I create “partitions” in the sources schema dynamically, I can’t reference them dynamically in dbt schema YAML files. As far as I understand, there is no way to generate a source for each partition on the fly, or to make a single source reference multiple partitioned tables.

Having to manually redeploy the dbt project every day/month/year with manually added partition sources is not a viable approach, to say the least.
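
To make the sticking point concrete: the union itself is easy to generate with a Jinja loop, but every partition the loop touches must still be declared statically in a schema YAML file. A sketch, with hypothetical names:

```sql
-- models/staging/stg_events.sql
-- Unions yearly partition tables from the sources schema.
-- Each sources.events_<year> must already be listed as a source table
-- in the schema YAML, which is exactly the manual step I want to avoid.
{% set years = [2022, 2023] %}
{% for year in years %}
select * from {{ source('sources', 'events_' ~ year) }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```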

Solution 2

Load partitions into the sources layer and union them in a single view in the same sources layer.

Now it is always possible to reference a single source “table” in my schema YAML file, BUT I am no longer able to generate this view with dbt (because populating the sources schema is the responsibility of the loader, not the transformer). The loader would need to maintain this view, which is a violation of dbt best practices.

Also, I don’t really know whether I can use a view as a source at all, and what the repercussions would be.
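
For clarity, the loader would have to own something like this (a sketch; table names are illustrative) and recreate it every time a new partition lands:

```sql
-- Maintained by the loader, outside of dbt.
create or replace view sources.events as
select * from sources.events_2022
union all
select * from sources.events_2023;
```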

Solution 3

Load partitions into the sources layer without registering them as sources in dbt, and union them in the staging layer.

This way I need to hardcode table names in model code in the staging layer without using the source() macro. I lose all the benefits of dbt (automatic documentation, the ability to select everything downstream of a source with the source+ syntax, etc.). It’s super dirty.
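
Roughly, the staging model would degenerate to something like this (names are made up):

```sql
-- models/staging/stg_events.sql
-- Hardcoded relations: dbt can't see these dependencies, so lineage,
-- documentation, and source+ selection are all lost.
select * from sources.events_2022
union all
select * from sources.events_2023
```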

So…

Please help me choose the most viable approach! Maybe there are other solutions that I didn’t think of.

Thank you!

Possibly you can leverage dynamism in partition maintenance rather than partition design. E.g., you could use external tables over multiple files, and use materialized views to accelerate the hot portion of the data. The external table remains a single source, and you can always manually do CREATE EXTERNAL TABLE AS SELECT to rebuild/repartition it on the side. Definitely play around a bit; Parquet is slower than internal tables, but not bad. This approach gives you a single source. The tradeoff is the need to maintain files, but c’est la vie.
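
A rough sketch of that setup, in case it helps; every name, bucket, and column here is illustrative, so check the Redshift Spectrum docs before copying anything:

```sql
-- One external table over partitioned Parquet files in S3; new data
-- arrives by adding files and partition metadata, not new tables.
create external table spectrum.events (
    event_id    bigint,
    event_type  varchar(64),
    occurred_at timestamp
)
partitioned by (event_year int)
stored as parquet
location 's3://my-bucket/events/';

-- Register a new partition as files land (no dbt redeploy needed).
alter table spectrum.events
add partition (event_year = 2024)
location 's3://my-bucket/events/event_year=2024/';

-- Accelerate the hot slice with a materialized view inside Redshift.
create materialized view events_hot as
select * from spectrum.events
where event_year >= 2023;
```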