Some tables that dbt materializes are either so big, or have so much heavy lifting happening in the SQL, that running an initial build can be problematic! So I wrote this insert_by_period materialization to help solve this problem.
This materialization is appropriate for event data that can be processed in discrete periods. It is similar in concept to the built-in incremental materialization, but has the added benefit of building the model one period (e.g. a day or a week) at a time, even during a full refresh.
For example, assume we have a model called events, defined like so:
{{
  config(
    materialized = "insert_by_period",
    period = "day",
    timestamp_field = "sent_at",
    start_date = "2018-01-01",
    stop_date = "2018-06-01",
  )
}}
select
    id,
    event,
    user_id,
    sent_at
from public.events
where __PERIOD_FILTER__ -- This will be replaced with a filter in the materialization code
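Behind the scenes, the materialization swaps __PERIOD_FILTER__ for a concrete date range before each insert. A minimal sketch of what that substitution could look like as a dbt macro (the macro name and its arguments here are illustrative assumptions, not the actual helpers in the package):

{% macro replace_period_filter(base_sql, timestamp_field, start_timestamp, stop_timestamp, offset, period) %}
    {#- render the filter for one period, `offset` periods after the start -#}
    {%- set period_filter -%}
        ({{ timestamp_field }} >  '{{ start_timestamp }}'::timestamp + interval '{{ offset }} {{ period }}' and
         {{ timestamp_field }} <= '{{ start_timestamp }}'::timestamp + interval '{{ offset }} {{ period }}' + interval '1 {{ period }}' and
         {{ timestamp_field }} <  '{{ stop_timestamp }}'::timestamp)
    {%- endset -%}
    {#- splice the rendered filter into the model sql in place of the placeholder -#}
    {{ return(base_sql | replace('__PERIOD_FILTER__', period_filter)) }}
{% endmacro %}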
When the model is first run, an empty target table is created:
create table analytics.events as (
    select
        id,
        event,
        user_id,
        sent_at
    from public.events
    where false
)
Then, the model is run for a single period and the results are inserted into the target table. Each subsequent period is inserted as a separate query.
What’s conceptually happening in the materialization follows (but there’s a bit more going on under the hood to handle edge cases):
-- insert sql for day 1
insert into analytics.events (
    select
        id,
        event,
        user_id,
        sent_at
    from public.events
    where (
        sent_at >  '2018-01-01 00:00:00'::timestamp + interval '0 day' and
        sent_at <= '2018-01-01 00:00:00'::timestamp + interval '0 day' + interval '1 day' and
        sent_at <  '2018-06-01 00:00:00'::timestamp
    )
);
-- insert sql for day 2
insert into analytics.events (
    select
        id,
        event,
        user_id,
        sent_at
    from public.events
    where (
        sent_at >  '2018-01-01 00:00:00'::timestamp + interval '1 day' and
        sent_at <= '2018-01-01 00:00:00'::timestamp + interval '1 day' + interval '1 day' and
        sent_at <  '2018-06-01 00:00:00'::timestamp
    )
);
...
-- insert sql for {{period}} {{n}}
insert into analytics.events (
    select
        id,
        event,
        user_id,
        sent_at
    from public.events
    where (
        {{timestamp_field}} >  '{{start_date}}'::timestamp + interval '{{n-1}} {{period}}' and
        {{timestamp_field}} <= '{{start_date}}'::timestamp + interval '{{n-1}} {{period}}' + interval '1 {{period}}' and
        {{timestamp_field}} <  '{{stop_date}}'::timestamp
    )
);
...
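To know how many of these inserts to issue, the materialization has to work out how many periods sit between the start and stop dates. Conceptually that's a single datediff against the warehouse; a sketch for the example above (hard-coded to a daily period, and not necessarily the exact query the materialization runs):

select
    datediff(day, '2018-01-01'::timestamp, '2018-06-01'::timestamp) as num_periods; -- 151 days, matching the log output below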
Progress is logged in the command line, like so:
$ dbt run --models events --full-refresh
Found 1 models, 0 tests, 0 archives, 0 analyses, 122 macros, 0 operations, 0 seed files
20:37:04 | Concurrency: 1 threads (target='prod')
20:37:04 |
20:37:04 | 1 of 1 START insert_by_period model analytics.events [RUN]
+ 20:37:09 Running for day 1 of 151
+ 20:37:12 Ran for day 1 of 151; 47315 records inserted
+ 20:37:12 Running for day 2 of 151
+ 20:37:14 Ran for day 2 of 151; 67037 records inserted
+ 20:37:14 Running for day 3 of 151
+ 20:37:17 Ran for day 3 of 151; 69831 records inserted
...
20:43:02 | 1 of 1 OK created insert_by_period model analytics.events [INSERT 16446197 in 356.89s]
20:43:02 |
20:43:02 | Finished running 1 insert_by_period models in 361.13s.
On subsequent runs, the model starts from the timestamp of the most recent record in the target table and continues to break the remaining data up into periods; in most cases, a deployed model will have no more than one period to process between runs.
Building the model in this way means that should the run get interrupted, a subsequent run will pick up where the last run left off!
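To find that starting point, the materialization effectively asks the target table for its most recent timestamp, falling back to the configured start_date if the table is empty. A sketch of that check against the example table (the query in the package may be shaped a little differently):

select
    coalesce(max(sent_at), '2018-01-01'::timestamp) as resume_from
from analytics.events;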
Using the --full-refresh flag will drop the existing table and build the model from scratch.
At present this is only Redshift-compatible, and there are a few caveats on using this materialization (see the README.md for more notes).
Thanks to @drew for sharing his gist that served as the basis of this materialization, and also for giving me a hand! And a special mention to @tpham for testing this out!