I’ve always thought that partitioning should be a feature of dbt
. The nice thing about it is that you can compose models that are idempotent, which is preferable to an incremental strategy imho (for reasons explained below).
We use the following pattern with a lot of models at my job (it’s the main reason why I put some time into adding the alias
feature). It directly violates the dbt
maxim of “one model, one table” but I’ll try and explain how it addresses the issues you outline in your post.
events_monthly.sql
-- set runtime variables
{% set year = var('year', run_started_at.strftime('%Y')) %}
{% set month = var('month', run_started_at.strftime('%m')) %}
{{
config({
'alias': '_tracking_' + year|string + '_' + month|string,
'schema': 'events'
})
}}
SELECT
*
FROM events
WHERE
date_part('month', "time") = {{ month }}
AND date_part('year', "time") = {{ year }}
events.sql
{{
config({
'alias': 'tracking',
'schema':'events',
'materialized':'view'
})
}}
-- set upstream dependency on monthly model
-- {{ ref('events_monthly') }}
{{ union_by_prefix('events', '_tracking_') }}
Here, we have two models: events_monthly.sql
which builds a table for a discrete monthly partition and events.sql
, which builds a view of all tables that match a common prefix. The heavy lifting is done by a macro, union_by_prefix
, which queries the database to get a list of all tables with a common pattern (in this case, _tracking_
) and UNIONS
them together.
By default, the monthly model will build the current month’s table, but you can also pass in vars
at runtime to build a previous month’s table. This is nice because you can address the challenge of “Late-Arriving Facts” by also regularly building models from previous months. This is an “eventually consistent” approach. Performance of the monthly models can be tuned by increasing / decreasing the size of the partitions (years, quarters, weeks, days, hours, minutes, etc). Previous partitions can be programmatically generated using this script I wrote: Script for backfilling DBT models which use partitions · GitHub
Downstream, other models only reference events.sql
such that, to the analysts that consume these models, there’s really only one events
model. However, the challenge with this approach is isolating production and development environments. Ideally, an analyst shouldn’t need to build all monthly partitions in order to work with the events
model. For that reason, we hardcode the production schema in events.sql
. This means that when you call {{ ref('events') }}
it will only build a view of the monthly partitions in your development schema. Obviously, this is not ideal, but we’ve thoroughly explained this behavior to all dbt
users, so they’re aware of what’s going on.
How could we bring this strategy into the dbt
paradigm without breaking some of its core tenets?