On the limits of incrementality

I’ve always thought that partitioning should be a first-class feature of dbt. The nice thing about it is that you can compose models that are idempotent, which imho is preferable to an incremental strategy (for reasons explained below).

We use the following pattern with a lot of models at my job (it’s the main reason I put some time into adding the alias feature). It directly violates the dbt maxim of “one model, one table,” but I’ll try to explain how it addresses the issues you outline in your post.

events_monthly.sql

-- set runtime variables
{% set year = var('year', run_started_at.strftime('%Y')) %}
{% set month = var('month', run_started_at.strftime('%m')) %}

{{
  config({
    'alias': '_tracking_' + year|string + '_' + month|string,
    'schema': 'events'
  })
}}

SELECT 
  * 
FROM events
WHERE
    date_part('month', "time") = {{ month }}
    AND date_part('year', "time") = {{ year }} 

events.sql

{{
  config({
    'alias': 'tracking',
    'schema':'events',
    'materialized':'view'
  })
}}

-- set upstream dependency on monthly model
--  {{ ref('events_monthly') }}
{{ union_by_prefix('events', '_tracking_') }}

Here, we have two models: events_monthly.sql, which builds a table for a discrete monthly partition, and events.sql, which builds a view over all tables that match a common prefix. The heavy lifting is done by a macro, union_by_prefix, which queries the database for a list of all tables whose names share a common pattern (in this case, _tracking_) and UNIONs them together.
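For reference, a union_by_prefix macro along these lines might look roughly like this (a hypothetical sketch, not our production macro; it assumes a Postgres-style information_schema and dbt’s run_query, which on older dbt versions would be a statement block instead):

```sql
{% macro union_by_prefix(schema, prefix) %}
  {#- Only hit the database during the execution phase -#}
  {%- if execute -%}
    {%- set get_tables -%}
      select table_name
      from information_schema.tables
      where table_schema = '{{ schema }}'
        and table_name like '{{ prefix }}%'
    {%- endset -%}
    {%- set tables = run_query(get_tables).columns[0].values() -%}
    {%- for table in tables %}
      select * from {{ schema }}.{{ table }}
      {%- if not loop.last %} union all {%- endif %}
    {%- endfor %}
  {%- endif -%}
{% endmacro %}
```

Because the macro reads the list of existing partition tables at run time, adding a new monthly table automatically widens the view on the next run of events.sql.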

By default, the monthly model builds the current month’s table, but you can also pass in vars at runtime to build a previous month’s table. This is nice because you can address the challenge of “Late-Arriving Facts” by regularly rebuilding models for previous months as well; it’s an “eventually consistent” approach. Performance of the monthly models can be tuned by increasing or decreasing the size of the partitions (years, quarters, weeks, days, hours, minutes, etc.). Previous partitions can be generated programmatically using this script I wrote: Script for backfilling DBT models which use partitions · GitHub
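Concretely, a backfill can be driven from the command line by passing vars for each partition (a hypothetical invocation; the model and var names follow the snippets above). The commands are only printed here; drop the echo to actually run them:

```shell
# Rebuild the first four monthly partitions of 2019, one dbt run per month.
for month in 01 02 03 04; do
  echo "dbt run --models events_monthly --vars '{year: 2019, month: \"$month\"}'"
done
```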

Downstream, other models only reference events.sql, such that, to the analysts who consume these models, there’s really only one events model. However, the challenge with this approach is isolating production and development environments. Ideally, an analyst shouldn’t need to build all monthly partitions in order to work with the events model. For that reason, we hardcode the production schema in events.sql. This means that when you call {{ ref('events') }} it will only build a view of the monthly partitions in your development schema. Obviously, this is not ideal, but we’ve thoroughly explained this behavior to all dbt users, so they’re aware of what’s going on.

How could we bring this strategy into the dbt paradigm without breaking some of its core tenets?
