How we sped up our CI runs by 10x using Slim CI

Edit, October 2021: I wrote this before joining the dbt Labs team, while I was the Head of Data at Education Perfect. Please interpret “we” accordingly!

The core of dbt’s value proposition is that analytics code is an asset like any other piece of code, and should be developed and deployed with the same rigorous processes. A key principle behind this rigour is Continuous Integration: as you modify code, automatically check that your intended changes don’t break the rest of the project. Sounds great, right?

Well, during a recent dbt Staging event, we heard that only about a third of dbt Cloud customers have CI runs configured. This means that everyone else is either finding their bugs in production, or manually running tests that a robot could handle for them! [1] Neither of these is a great option.

I want to share our CI history as we expanded from a tiny project in early 2020 to several hundred models and 1000+ automated tests today. As our project has scaled up, we’ve taken advantage of more advanced features in dbt Core and Cloud to keep a tight feedback loop and help our team be more confident in their work.

NB: we use dbt Cloud at Education Perfect, so this article focuses on that use case. Everything I describe here can also be done via dbt Core - the Cloud product just makes some of the setup easier (especially artifact management).

The start of our dbt journey

Early on, we only had a few models, so doing a full run of everything was pretty quick.
[Image: we'll-take-the-lot]

If we made a mistake, it didn’t take long to find out, AND it didn’t hurt anyone! This was a much happier place to be than our pre-dbt paradigm of pushing to live, then frantically refreshing the changed report so that if we’d forgotten a comma we could revert it before anyone noticed.
[Image: first-ci-job-slack]

Saving time with target-aware configuration

As our project matured, we started to build complex models which didn’t perform well as views. Every time someone hit refresh in Mode, Redshift ran the whole query from scratch, which meant our users had to wait several minutes for results. We changed to materializing these large models as tables, but that made our CI jobs take much longer. Another easy fix: dbt lets you provide a different materialization config for CI or dev environments vs prod.

We added a config block to the top of relevant files (you can also do this in your dbt_project.yml file [2]) like this:
[Image: config-unrendered]
which turns into either [Image: config-rendered-table] or [Image: config-rendered-view].
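In case those screenshots don’t come through, the unrendered version looked roughly like this (a sketch - the real files had a little more going on):

{# prod gets a real table; CI and dev get a cheap view #}
{{ config(materialized=('table' if target.name == 'prod' else 'view')) }}

so the materialization resolves to table in production and view everywhere else.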
It was a huge win to avoid wasting time generating redundant tables over and over again.

Despite these optimisations, by the six-month mark our CI jobs tipped over 50 minutes: an inevitable consequence of the project’s size and complexity, but much too slow for an effective feedback loop.

Sidebar: How tight is tight enough?

“Tight feedback loop” is one of those sneaky phrases that gets people to nod along without thinking about the specifics. Having experienced run times between 3 minutes and 2 hours over the last year, I think that if you can’t see the results of your changes within a 5-10 minute window, it’s too slow to be effective.

Ironically though, this goal can discourage testing! I noticed that when I should have been adding tests, I was so concerned about our ballooning runtime that I tried to convince myself new ones weren’t necessary. This is a false economy! Testing our assumptions is critical: we must be notified when they no longer hold. It saves hours of debugging and means we can keep trusting what our data is telling us.

So, if we want to maintain solid test coverage, but also want to get results in less than 10 minutes, we have to find another solution.

An obvious question is: “if we’re only changing a web sessions model, why are we bothering to re-test the other 99% of the project?” Most models aren’t directly related to one another. Wouldn’t it be great if we could just tell dbt to ignore the other models and focus on what has changed?

We discover Slim CI

Enter Slim CI! dbt can now detect what actually needs to be tested by comparing each model’s code to the version used in the last successful run, and only building and testing the models that have changed in this PR. When we implemented this by picking the job to compare against in the Cloud UI and adding state:modified to our --models selector, we immediately saw a 2.5x speedup - CI jobs that once took 50 minutes completed in 20.
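At this stage, the heart of our CI job was something like the following (a simplified sketch - our full, current job definitions are in the appendix below):

dbt run --models state:modified
dbt test --models state:modified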

Didn’t you say 10x in the title?

There was one significant downside remaining: dynamic materialization config, our saviour from earlier, had become our biggest source of wasted effort. The problem was that dbt only compares the CI version of a model file to the production version after rendering Jinja. Ironically, our slowest models were the only ones still being built and tested on every run, because their CI materialization config differed from production’s.

Fortunately, there’s an improvement in dbt 0.19.0: if you set your config in your dbt_project.yml file instead of inline, the unrendered config is stored for comparison. When that launched, we moved our configurations over and got down to 5-minute runs - a 10x improvement compared to where we were before Slim CI. Historically, best practice has been to put folder-wide settings in the project file, and deviations from the norm inline with the model. In this case, the performance gains are so substantial that I’d recommend putting all dynamic configuration in dbt_project.yml.
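As a sketch of what that can look like (the project and folder names here are illustrative, not our real structure):

models:
  my_project:  # your project name
    marts:  # an illustrative folder of heavyweight models
      +materialized: "{{ 'table' if target.name == 'prod' else 'view' }}"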

Where to from here?

We’ve started to spend our savings on a more robust pipeline which tests modified models and their descendants, after being bitten a couple of times by a model which passed on its own but broke something downstream.

We initially did this in the most naive way (dbt test -m state:modified+) but found that to be overkill. Instead, we’re validating the first-level children as well as anything that powers an exposure (dbt test -m state:modified+1 1+exposure:*,state:modified+). This strikes a good balance between catching the most likely failure points and not wasting too much time.
Not sure what all the pluses and stars and commas mean? Check out the docs on graph and set operators, or the appendix below.

The parsing improvements in 0.19.1 are another bonus: with parsing 2-3x faster, there’s less fixed cost to commands like dbt seed, which parse the whole project even when there’s nothing to do (as is often the case with seeds).

With a combination of Slim CI, moving our materialization configs outside of the model file, and the big speed boosts in the latest versions of dbt, it’s not uncommon for us to now see runs completing in 3-4 minutes, while still maintaining full confidence that we’ll know if something unexpected happens. If you’re not already using CI, start there! But once you’re ready to step it up, it’s very achievable to get even bigger wins in minimal time.

Huge thanks to @Mila for reading early drafts of this post, and @jerco for finding a bug in my job selectors :grimacing:

Appendix: Our full CI job definitions

:warning: dbt Cloud automatically adds the --defer and --state flags. If you’re using something else to orchestrate CI, you’ll need to include them yourself.
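For example, if you were running the run step below outside of dbt Cloud, it would look something like this (the artifacts path is illustrative - point it at a directory containing the manifest.json from your last production run):

dbt run -m state:modified+1 1+exposure:*,state:modified+ --defer --state path/to/prod-artifacts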

seed

dbt --warn-error seed --select state:modified sales__currencies --full-refresh

Update, November 2023: there’s an easier way to do the below - see Use warn-error-options in CI to catch all warnings except the unhelpful ones.

We include --warn-error in our seed step to ensure that any project-level warnings are resolved prior to merge. For example, we might move models whose configuration is defined in dbt_project.yml into a different folder. This would normally only be raised as a warning [3], but now we’ll be forced to fix it!

Amusingly, there’s also a warning if your selector returns no results [4], so we have to explicitly rebuild a single tiny seed (sales__currencies) every time.

Finally, we --full-refresh every time to ensure that any table changes (new columns, changed types) are applied.

run

dbt run -m state:modified+1 1+exposure:*,state:modified+

state:modified+1 refers to the modified models and their first-order children. 1+exposure:*,state:modified+ uses the intersection operator to get any models which are referenced in an exposure and are downstream of (or are themselves) a modified model.
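For context, an exposure is just a YAML entry describing something downstream of your models, such as a dashboard. A hypothetical example (all names here are illustrative):

version: 2

exposures:
  - name: revenue_dashboard  # illustrative exposure
    type: dashboard
    owner:
      email: data-team@example.com
    depends_on:
      - ref('fct_orders')

With this in place, 1+exposure:* selects fct_orders, and intersecting that with state:modified+ means it’s only rebuilt when it, or something upstream of it, has been modified.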

test

dbt test -m state:modified+1 1+exposure:*,state:modified+ --exclude test_name:relationships test_name:relationships_where tag:fragile

The -m selector is the same as the run job above. We exclude all relationship tests [5] to avoid oddities from the deferral process or pipelines being out of sync, as well as a handful of explicitly-tagged fragile tests that cause more trouble than they’re worth in this context.


  1. They could also be writing perfect code first time. ↩︎

  2. Spoilers! ↩︎

  3. ‘Configuration paths exist in your dbt_project.yml file which do not apply to any resources’, to be precise. ↩︎

  4. ‘Nothing to do. Try checking your model configs and model specification args’, to be precise. ↩︎

  5. Looking forward to getting rid of relationships_where in dbt 0.20.0! ↩︎


This is excellent, thanks a lot!

How do you manage multiple branches running on CI? Does each get its own schema?


Yes that’s right! dbt Cloud creates a schema for each PR, something like dbt_cloud_TEAM-ID_PR-ID, then automatically drops the schema once the PR is merged. More info here: Continuous integration in dbt Cloud | dbt Developer Hub


What are your thoughts on using state:modified on production runs? Is that even possible, and is the juice worth the squeeze? Perhaps running only views that have changed plus all physical tables? Views don’t take a ton of resources from our warehouse db (Snowflake), but when you have a few hundred views to run, that starts to add up.

Excellent question! I think it should work - during the beta process, the ability to defer to the same job was added for exactly this use case. I never got around to doing it myself at EP before leaving so can’t give any specific guidance off the top of my head.

If you get it going, please come back and pop anything interesting you learned on the thread!

@joellabes are you holding back on me?! the solution to all our problems is dbt 0.21.0! Configuring incremental models | dbt Docs

I don’t think so!

Normally I’d agree with you - 0.21.0 is very good! I don’t know how incremental models help here though? I don’t recommend moving things to incremental models unless driven by performance needs, and I definitely wouldn’t change all my views to incrementals…

Agreed, there was some hyperbole in my comment. But the on_schema_change definitely helps avoid errors in the slim CI job run.
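For example, a minimal sketch of what that looks like on an incremental model (the model and column names here are illustrative):

{{ config(
    materialized='incremental',
    on_schema_change='append_new_columns'
) }}

select * from {{ ref('stg_orders') }}  -- illustrative upstream model

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}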

I’d recommend adding all dynamic configuration in dbt_project.yml

@joellabes could you please elaborate on how to add dynamic config to the dbt_project.yml file? It would be awesome if you could share a snippet as an example, thank you!

Sure! Borrowed from FAQ: Configuring models based on environments:

models:
  materialized: "{{ 'table' if target.name == 'prod' else 'view' }}"

We found a mix of both works well, as sometimes we want a table, view or incremental.

We wrote a macro that we put at the top of the models that aren’t views (we use views by default) and then have environments like you did (sample/ci/full).

Sample: only a very small subset of data (like <= 200 rows)
CI: last few weeks of data, using tables/views. Sometimes a table is much faster than a view for a model that will be incremental in production: you pay the cost either way, since a view’s query has to run anyway when you test it and then again for each downstream model (so it effectively builds twice), whereas a table is built once up front.
Full: the full dataset, using incremental models, tables, and views as needed. :slight_smile:


I have also been setting the target name to ‘ci’ in the environment where CI runs (via an environment variable), combined with an if block in the .sql files that looks specifically for that target:

with 

long_running_model as (
  select * from {{ ref('fct_model_with_billion_rows') }}

  {% if target.name == 'ci' %}
    limit 1000
  {% endif %}
)
...

which means only 1000 rows flow through in the CI check, rather than all of them, speeding up the process considerably.


Hi @joellabes ,
In a Slim CI setup in dbt Cloud, does state comparison work properly if we defer to a partial prod run instead of deferring to a job that runs all the models? Since a manifest file gets generated for the entire dbt project rather than just the models that were run, does it matter if we defer to a run that only executed a subset of models from the project? Are there any known caveats?

@padmavathy it doesn’t matter - as you correctly note, dbt generates a full manifest no matter what, so a partial prod run would be fine.



Hi @joellabes ,
I have set up Slim CI in dbt and opened a pull request from one branch to another with some changes. Slim CI should run only the models that have changed and are part of the PR, but it’s running all the models in the branch every time. Can you please help me with this?