Edit, October 2021: I wrote this before joining the dbt Labs team, while I was the Head of Data at Education Perfect. Please interpret “we” accordingly!
The core of dbt’s value proposition is that analytics code is an asset, and should be developed and deployed with the same rigorous processes as any other code. A key principle behind this rigour is Continuous Integration: as you modify code, automatically check that your intended changes don’t break the rest of the project. Sounds great, right?
Well, during a recent dbt Staging event, we heard that only about a third of dbt Cloud customers have CI runs configured. This means that everyone else is either finding their bugs in production or manually running tests that a robot could handle for them! [1] Neither of these is a great option.
I want to share our CI history as our project expanded from a tiny one in early 2020 to several hundred models and 1,000+ automated tests today. As it has scaled up, we’ve taken advantage of more advanced features in dbt Core and Cloud to keep a tight feedback loop and help our team be more confident in their work.
NB: we use dbt Cloud at Education Perfect, so this article focuses on that use case. Everything I describe here can also be done via dbt Core - the Cloud product just makes some of the setup easier (especially artifact management).
The start of our dbt journey
Early on, we only had a few models, so doing a full run of everything was pretty quick.
If we made a mistake, it didn’t take long to find out, AND it didn’t hurt anyone! This was a much happier place to be than our pre-dbt paradigm of pushing to live, then frantically refreshing the changed report so that if we’d forgotten a comma we could revert it before anyone noticed.
Saving time with target-aware configuration
As our project matured, we started to build complex models which didn’t perform well as views. Every time someone hit refresh in Mode, Redshift ran the whole query from scratch, and our users had to wait several minutes for results. We changed to materializing these large models as tables, but that made our CI jobs take much longer. Fortunately, there’s an easy fix: dbt lets you provide a different materialization config for CI and dev environments vs prod.
We added a config block to the top of the relevant files (you can also do this in your dbt_project.yml file [2]), along the lines of this sketch (assuming the production target is named prod):
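```sql
-- A sketch of the model-level config: build a table in prod, but a cheap view in dev/CI.
-- Assumes the production target is named 'prod'.
{{
    config(
        materialized=('table' if target.name == 'prod' else 'view')
    )
}}
```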
which resolves to either materialized='table' or materialized='view' depending on the target.
It was a huge win to avoid wasting time generating redundant tables over and over again.
Despite these optimisations, by the six-month mark our CI jobs tipped over 50 minutes - an inevitability given the project’s size and complexity, but much too slow for an effective feedback loop.
Sidebar: How tight is tight enough?
“Tight feedback loop” is one of those sneaky phrases that gets people to nod along without thinking about the specifics. Having experienced run times between 3 minutes and 2 hours over the last year, I think that if you can’t see the results of your changes within a 5-10 minute window, it’s too slow to be effective.
Ironically though, this goal can discourage testing! I noticed that when I should have been adding tests, I was so concerned about our ballooning runtime that I tried to convince myself that new ones weren’t necessary. This is a false economy! Testing our assumptions is critical, and we must be notified when they no longer hold - it saves hours of debugging and ensures that we can always interpret our data correctly.
So, if we want to maintain solid test coverage, but also want to get results in less than 10 minutes, we have to find another solution.
An obvious question is: “if we’re only changing a web sessions model, why are we bothering to re-test the other 99% of the project?” Most models aren’t directly related to one another. Wouldn’t it be great if we could just tell dbt to ignore the other models and focus on what has changed?
We discover Slim CI
Enter Slim CI! dbt can now detect the models that actually need to be tested by comparing each model’s code to the version used in the last successful production run, and only building and testing those that have changed in the PR. When we implemented this by picking the job to compare against in the Cloud UI and adding state:modified to our --models selector, we immediately saw a 2.5x speedup - CI jobs that once took 50 minutes were complete in 20.
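At that stage, the CI job steps looked roughly like this (a sketch - in dbt Cloud, the --defer and --state flags pointing at the comparison job’s artifacts are added for you):

```shell
# Sketch of our early Slim CI steps; dbt Cloud supplies --defer and --state behind the scenes
dbt run --models state:modified
dbt test --models state:modified
```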
Didn’t you say 10x in the title?
There was one big downside remaining: dynamic materialization config, our saviour introduced above, was now our most significant source of wasted effort. Because dbt only compared the CI version of a model file to the production version after rendering Jinja, any model whose config depended on the target always looked modified. Ironically, that meant our slowest models were the only ones still being built and tested on every run.
Fortunately, there’s an improvement in dbt 0.19.0: if you set your config in your dbt_project.yml file instead of inline, the unrendered config is stored for comparison. When that launched, we moved our configurations over and got down to 5-minute runs - a 10x improvement compared to where we were before Slim CI. Historically, best practice has been to put folder-wide settings in the project file, and deviations from the norm inline with the model. In this case, the performance gains are so substantial that I’d recommend defining all dynamic configuration in dbt_project.yml.
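As a sketch, the equivalent project-level config might look something like this (the project and folder names here are placeholders):

```yaml
# dbt_project.yml (sketch - project and folder names are placeholders)
models:
  my_project:
    marts:
      # stored unrendered, so Slim CI no longer flags these models as modified
      +materialized: "{{ 'table' if target.name == 'prod' else 'view' }}"
```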
Where to from here?
We’ve started to spend our savings on a more robust pipeline which tests modified models and their descendants, after being bitten a couple of times by a model which passed on its own but broke something downstream.
We initially did this in the most naive way (dbt test -m state:modified+) but found that to be overkill. Instead, we’re validating the first-level children as well as anything that powers an exposure (dbt test -m state:modified+1 1+exposure:*,state:modified+). This strikes a good balance between catching the most likely failure points and not wasting too much time.
Not sure what all the pluses and stars and commas mean? Check out the docs on graph and set operators, or the appendix below.
The parsing improvements in 0.19.1 are another bonus: with parsing now 2-3x faster, there’s less fixed cost to commands like dbt seed, which parse the whole project even if there’s nothing to do (as is often the case with seeds).
With a combination of Slim CI, moving our materialization configs outside of the model file, and the big speed boosts in the latest versions of dbt, it’s not uncommon for us to now see runs completing in 3-4 minutes, while still maintaining full confidence that we’ll know if something unexpected happens. If you’re not already using CI, start there! But once you’re ready to step it up, it’s very achievable to get even bigger wins in minimal time.
Huge thanks to @Mila for reading early drafts of this post, and to @jerco for finding a bug in my job selectors.
Appendix: Our full CI job definitions
dbt Cloud automatically adds the --defer and --state flags. If you’re using something else to orchestrate CI, you’ll need to include them yourself.
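For example, a self-orchestrated equivalent of the run and test steps below might look roughly like this (the artifact path is a placeholder for wherever you keep the production run’s manifest.json):

```shell
# Sketch: self-orchestrated Slim CI steps. Assumes you've already copied the
# production run's manifest.json into ./prod-run-artifacts (path is a placeholder).
dbt run -m state:modified+1 1+exposure:*,state:modified+ --defer --state ./prod-run-artifacts
dbt test -m state:modified+1 1+exposure:*,state:modified+ --defer --state ./prod-run-artifacts
```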
seed
dbt --warn-error seed --select state:modified sales__currencies --full-refresh
Update, November 2023: there’s an easier way to do the below - see Use warn-error-options in CI to catch all warnings except the unhelpful ones.
We include --warn-error in our seed step to ensure that any project-level warnings are resolved prior to merge. For example, we might move models whose configuration is defined in dbt_project.yml into a different folder. This would normally only be raised as a warning [3], but now we’ll be forced to fix it!
Amusingly, there’s also a warning if your selector returns no results [4], so we have to explicitly rebuild a single tiny seed (sales__currencies) every time.
Finally, we --full-refresh every time to ensure that any table changes (new columns, changed types) are applied.
run
dbt run -m state:modified+1 1+exposure:*,state:modified+
state:modified+1 refers to the modified models and their first-order children. 1+exposure:*,state:modified+ uses the intersection operator to get any models which are referenced in an exposure and also depend on a modified model.
test
dbt test -m state:modified+1 1+exposure:*,state:modified+ --exclude test_name:relationships test_name:relationships_where tag:fragile
The -m selector is the same as in the run job above. We exclude all relationship tests [5] to avoid oddities from the deferral process or pipelines being out of sync, as well as a handful of explicitly-tagged fragile tests that cause more trouble than they’re worth in this context.
They could also be writing perfect code the first time. ↩︎
Spoilers! ↩︎
‘Configuration paths exist in your dbt_project.yml file which do not apply to any resources’, to be precise. ↩︎
‘Nothing to do. Try checking your model configs and model specification args’, to be precise. ↩︎
Looking forward to getting rid of relationships_where in dbt 0.20.0! ↩︎