A materialization for problematic initial builds

Some tables that dbt materializes are either so big, or have so much heavy lifting happening in the SQL, that running an initial build can be problematic! So I wrote this insert_by_period materialization to help solve this problem.

This materialization is appropriate for event data that can be processed in discrete periods. It is similar in concept to the built-in incremental materialization, but has the added benefit of building the model in one period (e.g. day, week) at a time, even during a full refresh.

For example, assume we have a model called events, like so:

{{
  config(
    materialized = "insert_by_period",
    period = "day",
    timestamp_field = "sent_at",
    start_date = "2018-01-01",
    stop_date = "2018-06-01", 
  )
}}

select
id,
event,
user_id,
sent_at
from public.events
where __PERIOD_FILTER__ -- This will be replaced with a filter in the materialization code

When the model is first run, an empty target table is created:

create table analytics.events as (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where false
)

Then, the model is run for a single period and the results are inserted into the target table. Each subsequent period is inserted as a separate query.

Conceptually, the materialization does the following (though there’s a bit more going on under the hood to handle edge cases):

-- insert sql for day 1
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        sent_at >  '2018-01-01 00:00:00'::timestamp + interval '0 day' and
        sent_at <= '2018-01-01 00:00:00'::timestamp + interval '0 day' + interval '1 day' and
        sent_at < '2018-06-01 00:00:00'::timestamp
    )
) ;

-- insert sql for day 2
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        sent_at >  '2018-01-01 00:00:00'::timestamp + interval '1 day' and
        sent_at <= '2018-01-01 00:00:00'::timestamp + interval '1 day' + interval '1 day' and
        sent_at < '2018-06-01 00:00:00'::timestamp
    )
) ;

...

-- insert sql for {{period}} {{n}}
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        {{timestamp_field}} >  '{{start_date}}'::timestamp + interval '{{n-1}} {{period}}' and
        {{timestamp_field}} <= '{{start_date}}'::timestamp + interval '{{n-1}} {{period}}' + interval '1 {{period}}' and
        {{timestamp_field}} < '{{stop_date}}'::timestamp
    )
) ;
...

Progress is logged to the command line, like so:

$ dbt run --models events --full-refresh
Found 1 models, 0 tests, 0 archives, 0 analyses, 122 macros, 0 operations, 0 seed files

20:37:04 | Concurrency: 1 threads (target='prod')
20:37:04 |
20:37:04 | 1 of 1 START insert_by_period model analytics.events [RUN]
         + 20:37:09 Running for day 1 of 151
         + 20:37:12 Ran for day 1 of 151; 47315 records inserted
         + 20:37:12 Running for day 2 of 151
         + 20:37:14 Ran for day 2 of 151; 67037 records inserted
         + 20:37:14 Running for day 3 of 151
         + 20:37:17 Ran for day 3 of 151; 69831 records inserted
         ...
20:43:02 | 1 of 1 OK created insert_by_period model analytics.events [INSERT 16446197 in 356.89s]
20:43:02 |
20:43:02 | Finished running 1 insert_by_period models in 361.13s.

For subsequent runs, the build starts from the timestamp of the most recent record in the target table and continues to break the model up into periods; in most cases, a deployed model will have no more than one period to process between runs.

Building the model in this way means that should the run get interrupted, a subsequent run will pick up where the last run left off!
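
Conceptually, that resume step boils down to checking the most recent timestamp already in the target table and inserting the next period from there. A rough sketch, using the events example (the 2018-05-29 value is made up for illustration; the actual macro handles this internally):

-- find the most recent record already loaded into the target table
select max(sent_at) from analytics.events;

-- then insert the next period starting from that point
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        sent_at >  '2018-05-29 00:00:00'::timestamp and
        sent_at <= '2018-05-29 00:00:00'::timestamp + interval '1 day' and
        sent_at < '2018-06-01 00:00:00'::timestamp
    )
) ;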

Using the --full-refresh flag will drop the existing table and build the model from scratch.

At present this is only Redshift-compatible, and there are a few caveats on using this materialization (see README.md for more notes).

Thanks to @drew for sharing his gist that served as the basis of this materialization, and also for giving me a hand! And a special mention to @tpham for testing this out!


This is great! I do something really similar, in which I use the var macro to pass in partitions for individual models… (NOTE: this approach requires the alias configuration, which should come out in the next release.)

{% set year = var('year', run_started_at.strftime('%Y')) %}
{% set month = var('month', run_started_at.strftime('%m')) %}

{{
  config({
    'alias': 'my_table_' + year + '_' + month,
    'materialized': 'table'
  })
}}

SELECT
    *
FROM {{ ref('my_model') }}
WHERE year = {{ year }}
    AND month = {{ month }}
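
If you want to build a specific partition, you can override these vars at run time with dbt's --vars flag; the model name below is just a placeholder:

$ dbt run --models my_partitioned_model --vars '{year: "2018", month: "03"}'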

dannnng, really nice use case for alias @abelsonlive!

This is amazing! The only caveat I want to mention is that the time field needs to be part of the final table.
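
To make that concrete with the events example above, a hypothetical model like this wouldn't work, because sent_at (the timestamp_field) is filtered on but never selected, so it never ends up in the target table for later runs to resume from:

select
id,
event,
user_id  -- sent_at is missing, so it won't be in analytics.events
from public.events
where __PERIOD_FILTER__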


I got this running on Postgres 11 by making the following modifications (to get rid of the errors).

On the first try I got the error “subquery in FROM must have an alias”.
Add an alias t to the sub-select in insert_by_period_materialization.sql#L49:

  select
    {{target_cols_csv}}
  from (
    {{filtered_sql}}
  ) t -- this is line 49

Once that was fixed I got another error saying “cannot create temporary relation in non-temporary schema”.
Change the schema parameter to None in insert_by_period_materialization.sql#L130:

    {%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                               schema=None, type='table') -%}  -- this is line 130

This did the trick and it worked for me. I thought I'd document it here, as I don't know enough about materialization macros to contribute this to the project.

There is a GitHub issue regarding the modernization of this macro: fishtown-analytics/dbt-utils/issues/192

EDIT: Error messages and GitHub issue.