Hi all,
We’ve run into scalability issues with models built off our larger datasets (we call them big data (BD) models). They are incremental models, so they tend to run fine in production, but development is hindered by the sheer amount of time it takes to build/rebuild the graph of the models and their dependencies. I’m writing this post to introduce potential solutions and garner feedback.
We probably have a similar setup to most folks here. Our CI uses Snowflake zero-copy cloning to clone production, then runs/tests only changed and downstream models using a modified version of the bash function here. Merging to master triggers a full refresh of the whole database.
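For concreteness, the changed-model selection can be sketched as a small function that turns changed file paths into dbt selectors with a trailing `+` so downstream models are included. This is a hypothetical sketch, assuming models live under `models/` and file names match model names:

```shell
#!/usr/bin/env bash
# Sketch: map changed .sql files to dbt "model+" selectors.
# Assumes the file name (minus .sql) is the model name.

changed_model_selectors() {
  # Read changed file paths on stdin; emit one "model_name+" per line.
  while read -r path; do
    case "$path" in
      models/*.sql)
        printf '%s+\n' "$(basename "$path" .sql)"
        ;;
    esac
  done
}

# In CI this would be wired up roughly as:
#   selectors=$(git diff --name-only origin/master... | changed_model_selectors)
#   dbt run --models $selectors
```

The `+` suffix is dbt's graph operator for "this model and everything downstream"; prefixing `+` as well (`+model+`) would pull in upstream models too.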
However, changing a BD model or introducing a new one takes prohibitively long in CI, and the full refresh triggered on merge is just as slow. Here are some potential solutions:
CI, testing stage:
- Limit the sample used for BD models when running in dev/CI. This can be achieved by adding conditional filters to BD models like so:
{% if target.name == 'ci' %}
WHERE TO_DATE(event_created_at) = CURRENT_DATE - 1
{% endif %}
- When a BD or downstream model is changed, we should rebuild upstream models as well. This ensures that downstream models will be built off the limited sample version of the model during CI.
CI, merge stage:
- Only trigger a dbt run --full-refresh on changed models and downstream models.
- Extend the timeout period for GitLab CI.
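The first point could look roughly like the following sketch, which composes (but does not run) the merge-stage command; the model names are placeholders standing in for whatever the diff detected:

```shell
#!/usr/bin/env bash
# Sketch: full-refresh only the changed models plus everything downstream.

build_full_refresh_cmd() {
  # Take changed model names as arguments, append "+" so downstream
  # models are rebuilt too, and emit the dbt command we would run.
  local selectors=""
  for model in "$@"; do
    selectors="$selectors ${model}+"
  done
  echo "dbt run --full-refresh --models${selectors}"
}

# Example:
#   build_full_refresh_cmd orders events
#   -> dbt run --full-refresh --models orders+ events+
```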
Clustering BD models on date fields also looks promising.
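For reference, clustering can be declared per model via dbt's Snowflake config; a sketch, where the model and column names are placeholders:

```sql
-- e.g. at the top of models/my_bd_model.sql
{{
  config(
    materialized='incremental',
    cluster_by=['to_date(event_created_at)']
  )
}}
```

Snowflake then maintains micro-partition clustering on that expression, which should make the date filters in both the incremental logic and the CI sampling cheaper.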
We are dbt Cloud customers, so we’ve considered triggering Cloud jobs from CI. I like this option, but I’m not sure how to implement it cleanly. To only build/rebuild changed models, CI needs to generate a list of changed models, create a custom Cloud job, trigger it, and then tear it down after the job finishes. Claire mentioned here that Fishtown triggers dbt Cloud jobs for its CI as well, which would involve the same steps.
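For the trigger step, here is a sketch assuming the dbt Cloud v2 "trigger job run" endpoint; the account ID, job ID, and token are placeholders, and the request is only printed, not sent:

```shell
#!/usr/bin/env bash
# Sketch: compose (but don't send) the dbt Cloud API call that triggers a job.
# DBT_CLOUD_TOKEN would come from a CI secret; IDs are placeholders.

trigger_job_cmd() {
  local account_id="$1" job_id="$2"
  echo "curl -X POST" \
    "-H 'Authorization: Token \$DBT_CLOUD_TOKEN'" \
    "-H 'Content-Type: application/json'" \
    "-d '{\"cause\": \"CI trigger\"}'" \
    "https://cloud.getdbt.com/api/v2/accounts/${account_id}/jobs/${job_id}/run/"
}
```

The create/tear-down steps would hit the same API's job endpoints; polling the returned run ID until it completes is the part I'd most want feedback on.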
Would love to hear feedback from you all!