Hi all,
We’ve run into scalability issues with models built off our larger datasets (we call them big data (BD) models). They are incremental models, so they tend to run fine in production, but development is hindered by the sheer amount of time it takes to build/rebuild the graph of the models and their dependencies. I’m writing this post to introduce potential solutions and garner feedback.
We probably have a similar setup to most folks here. Our CI uses Snowflake zero-copy cloning to clone production, then runs/tests only changed and downstream models using a modified version of the bash function here. Merging to master triggers a full refresh of the whole database.
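For concreteness, the changed-model selection can be sketched as a small function that turns changed file paths into dbt selectors with a trailing `+` so downstream models are included. This is a hypothetical sketch, assuming models live under `models/` and file names match model names:

```shell
#!/usr/bin/env bash
# Sketch: map changed .sql files to dbt "model+" selectors.
# Assumes the file name (minus .sql) is the model name.

changed_model_selectors() {
  # Read changed file paths on stdin; emit one "model_name+" per line.
  while read -r path; do
    case "$path" in
      models/*.sql)
        printf '%s+\n' "$(basename "$path" .sql)"
        ;;
    esac
  done
}

# In CI this would be wired up roughly as:
#   selectors=$(git diff --name-only origin/master... | changed_model_selectors)
#   dbt run --models $selectors
```

The `+` suffix is dbt's graph operator for "this model and everything downstream"; prefixing `+` as well (`+model+`) would pull in upstream models too.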
However, changing a BD model or introducing a new one takes prohibitively long in CI, and the full refresh triggered on merge is just as slow. Here are some potential solutions:
CI, testing stage:
- Limit the sample used for BD models when running in dev/CI. This can be achieved by adding conditional filters to BD models like so:
{% if target.name == 'ci' %}
WHERE TO_DATE(event_created_at) = CURRENT_DATE - 1
{% endif %}
- When a BD or downstream model is changed, we should rebuild upstream models as well. This ensures that downstream models will be built off the limited sample version of the model during CI.
CI, merge stage:
- Only trigger a dbt run --full-refresh on changed models and downstream models.
- Extend the timeout period for GitLab CI.
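The first point could look roughly like the following sketch, which composes (but does not run) the merge-stage command; the model names are placeholders standing in for whatever the diff detected:

```shell
#!/usr/bin/env bash
# Sketch: full-refresh only the changed models plus everything downstream.

build_full_refresh_cmd() {
  # Take changed model names as arguments, append "+" so downstream
  # models are rebuilt too, and emit the dbt command we would run.
  local selectors=""
  for model in "$@"; do
    selectors="$selectors ${model}+"
  done
  echo "dbt run --full-refresh --models${selectors}"
}

# Example:
#   build_full_refresh_cmd orders events
#   -> dbt run --full-refresh --models orders+ events+
```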
Clustering BD models on date fields also looks promising.
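For reference, clustering can be declared per model via dbt's Snowflake config; a sketch, where the model and column names are placeholders:

```sql
-- e.g. at the top of models/my_bd_model.sql
{{
  config(
    materialized='incremental',
    cluster_by=['to_date(event_created_at)']
  )
}}
```

Snowflake then maintains micro-partition clustering on that expression, which should make the date filters in both the incremental logic and the CI sampling cheaper.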
We are dbt Cloud customers, so we’ve considered triggering Cloud jobs from CI. I like this option, but I’m not sure how to implement it cleanly. To only build/rebuild changed models, CI needs to generate a list of changed models, create a custom Cloud job, trigger it, and then tear it down after the job finishes. Claire mentioned here that Fishtown triggers dbt Cloud jobs for its CI as well, which would involve the same steps.
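For the trigger step, here is a sketch assuming the dbt Cloud v2 "trigger job run" endpoint; the account ID, job ID, and token are placeholders, and the request is only printed, not sent:

```shell
#!/usr/bin/env bash
# Sketch: compose (but don't send) the dbt Cloud API call that triggers a job.
# DBT_CLOUD_TOKEN would come from a CI secret; IDs are placeholders.

trigger_job_cmd() {
  local account_id="$1" job_id="$2"
  echo "curl -X POST" \
    "-H 'Authorization: Token \$DBT_CLOUD_TOKEN'" \
    "-H 'Content-Type: application/json'" \
    "-d '{\"cause\": \"CI trigger\"}'" \
    "https://cloud.getdbt.com/api/v2/accounts/${account_id}/jobs/${job_id}/run/"
}
```

The create/tear-down steps would hit the same API's job endpoints; polling the returned run ID until it completes is the part I'd most want feedback on.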
Would love to hear feedback from you all!