A materialization for problematic initial builds


#1

Some tables that dbt materializes are either so big, or have so much heavy lifting happening in the SQL, that running an initial build can be problematic! So I wrote this insert_by_period materialization to help solve that problem.

This materialization is appropriate for event data that can be processed in discrete periods. It is similar in concept to the built-in incremental materialization, but has the added benefit of building the model one period (e.g. day, week) at a time, even during a full refresh.

For example, assume we have a model called events, defined as follows:

{{
  config(
    materialized = "insert_by_period",
    period = "day",
    timestamp_field = "sent_at",
    start_date = "2018-01-01",
    stop_date = "2018-06-01"
  )
}}

select
id,
event,
user_id,
sent_at
from public.events
where __PERIOD_FILTER__ -- This will be replaced with a filter in the materialization code
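Under the hood, the materialization swaps __PERIOD_FILTER__ out for a real predicate before each insert. A minimal sketch of that substitution in Jinja (the variable names here are illustrative - the real macro handles more edge cases):

{%- set period_filter -%}
    (
        {{ timestamp_field }} >  '{{ start_timestamp }}'::timestamp + interval '{{ offset }} {{ period }}' and
        {{ timestamp_field }} <= '{{ start_timestamp }}'::timestamp + interval '{{ offset }} {{ period }}' + interval '1 {{ period }}' and
        {{ timestamp_field }} <  '{{ stop_timestamp }}'::timestamp
    )
{%- endset -%}

{# replace the placeholder in the model's compiled sql #}
{%- set period_sql = sql | replace('__PERIOD_FILTER__', period_filter) -%}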

When the model is first run, an empty target table is created:

create table analytics.events as (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where false
)

Then, the model is run for a single period and the results are inserted into the target table. Each subsequent period is inserted as a separate query.

What’s conceptually happening in the materialization is shown below (though there’s a bit more going on under the hood to handle edge cases):

-- insert sql for day 1
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        sent_at >  '2018-01-01 00:00:00'::timestamp + interval '0 day' and
        sent_at <= '2018-01-01 00:00:00'::timestamp + interval '0 day' + interval '1 day' and
        sent_at < '2018-06-01 00:00:00'::timestamp
    )
) ;

-- insert sql for day 2
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        sent_at >  '2018-01-01 00:00:00'::timestamp + interval '1 day' and
        sent_at <= '2018-01-01 00:00:00'::timestamp + interval '1 day' + interval '1 day' and
        sent_at < '2018-06-01 00:00:00'::timestamp
    )
) ;

...

-- insert sql for {{period}} {{n}}
insert into analytics.events (
    select
    id,
    event,
    user_id,
    sent_at
    from public.events
    where (
        {{timestamp_field}} >  '{{start_date}}'::timestamp + interval '{{n-1}} {{period}}' and
        {{timestamp_field}} <= '{{start_date}}'::timestamp + interval '{{n-1}} {{period}}' + interval '1 {{period}}' and
        {{timestamp_field}} < '{{stop_date}}'::timestamp
    )
) ;
...

Progress is logged on the command line, like so:

$ dbt run --models events --full-refresh
Found 1 models, 0 tests, 0 archives, 0 analyses, 122 macros, 0 operations, 0 seed files

20:37:04 | Concurrency: 1 threads (target='prod')
20:37:04 |
20:37:04 | 1 of 1 START insert_by_period model analytics.events [RUN]
         + 20:37:09 Running for day 1 of 151
         + 20:37:12 Ran for day 1 of 151; 47315 records inserted
         + 20:37:12 Running for day 2 of 151
         + 20:37:14 Ran for day 2 of 151; 67037 records inserted
         + 20:37:14 Running for day 3 of 151
         + 20:37:17 Ran for day 3 of 151; 69831 records inserted
         ...
20:43:02 | 1 of 1 OK created insert_by_period model analytics.events [INSERT 16446197 in 356.89s]
20:43:02 |
20:43:02 | Finished running 1 insert_by_period models in 361.13s.

For subsequent runs, the build starts from the timestamp of the most recent record in the target table, and continues to break the model up into periods - in most cases, a deployed model will have no more than one period to process between runs.
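Conceptually, a subsequent run finds its starting point with a query along these lines (just a sketch - the actual macro also falls back to the configured start_date when the target table is empty):

-- find where the last run left off
select coalesce(max(sent_at), '2018-01-01'::timestamp) as start_timestamp
from analytics.events ;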

Building the model in this way means that should the run get interrupted, a subsequent run will pick up where the last run left off!

Using the --full-refresh flag will drop the existing table and build the model from scratch.

At present, this materialization is Redshift-only, and there are a few caveats to using it (see the README.md for more notes).

Thanks to @drew for sharing his gist that served as the basis of this materialization, and also for giving me a hand! And a special mention to @tpham for testing this out!


#4

This is great! I do something really similar in which I use the var() macro to pass in partitions for individual models… (Note: this approach requires the alias configuration, which should come out in the next release.)

{% set year = var('year', run_started_at.strftime('%Y')) %}
{% set month = var('month', run_started_at.strftime('%m')) %}

{{
  config({
    'alias': 'my_table_' + year + '_' + month,
    'materialized':'table'
  })
}}

SELECT
    *
FROM {{ ref('my_model') }}
WHERE year = {{ year }}
    AND month = {{ month }}
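A specific partition can then be built from the command line with dbt's --vars flag, e.g. (assuming the model file is named my_table.sql):

$ dbt run --models my_table --vars '{year: "2018", month: "03"}'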

#5

dannnng, really nice use case for alias @abelsonlive!