Is it possible to do a dbt run with the already compiled code, without the compilation being run again? I have a project that takes a long time to compile, while the run time itself is quite short, and I have to run it thousands of times.
@davidmasip This isn’t possible today, and I’ll try to give a little bit of background as to why.
To date, one of dbt’s greatest strengths has been the opinionated insistence that its compilation and execution of SQL ought to be idempotent and stateless. You can dbt run and, no matter the objects in your transformed schema, given the same raw data and the same project code, dbt will produce the same transformed results. This makes dbt easy to use and reason about, without needing to deeply understand all the internals from the first go.
Today, this requires each invocation to have end-to-end completeness: dbt must parse and validate anew the code in your project, construct its manifest and DAG, pick an execution order, build up a runtime cache, and then (one by one) compile and run each model in your project. Because model SQL may be dynamically templated based on the results of a previous model, there’s no way to pre-compile all the SQL and ship it off for execution elsewhere—dbt needs to be involved from beginning to end.
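For instance, a model along these lines (a minimal sketch; the model and column names are hypothetical) can only have its SQL rendered once an upstream model already exists in the warehouse:

```sql
-- models/payments_pivoted.sql
-- the distinct payment methods are fetched at compile time, so this model's
-- final SQL depends on data produced by an earlier model
{% if execute %}
    {% set results = run_query("select distinct payment_method from " ~ ref('stg_payments')) %}
    {% set methods = results.columns[0].values() %}
{% else %}
    {% set methods = [] %}
{% endif %}

select
    order_id
    {%- for method in methods %},
    sum(case when payment_method = '{{ method }}' then amount end) as amount_{{ method }}
    {%- endfor %}
from {{ ref('stg_payments') }}
group by order_id
```

There’s no way to produce this model’s final SQL ahead of time without the upstream data already in place, which is why dbt has to stay involved from parse through execution.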
If you’re running many models at once—your entire project, or a significant subset—those steps are rolled into a fixed “project loading” cost, paid once upfront, that then enables each individual model run to be efficient and accurate. If you’re invoking dbt thousands of times, each time to run one model, you’re going to pay that same fixed cost over and over again.
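In shell terms, the trade-off looks roughly like this (model names are hypothetical):

```shell
# one invocation: project loading is paid once and amortized over every model
dbt run

# many single-model invocations: the same fixed cost is paid again each time
for model in stg_orders fct_orders fct_daily_sales; do
    dbt run --models "$model"   # parse, manifest, DAG, cache -- every time
done
```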
Those are the facts. There are a few angles from which to see them:
1. It isn’t good enough that some projects take minutes between typing dbt run and seeing the first model execute. That’s unacceptable; all projects should have invocations that start up in a matter of seconds. As such, we’re investing significant time and energy this year in reducing these fixed costs by multiple orders of magnitude, with plans to roll out our progress in stages. The first fruits of this effort will soon be arriving in a performance release, v0.19.1.

2. In the meantime, dbt can already be somewhat stateful between invocations: partial parsing lets it reuse the previous parse of unchanged project files (as long as you avoid certain configurations, namely var and env_var), and state-based selection lets you run only changed models by passing a past invocation’s artifacts (see the sketch after this list).

3. In the long long term, I could envision a world where it’s possible to parse a project, save/pickle/zip the parsed representation, and send that file somewhere else for “execution only,” the kind of shortcut your question imagines. To get there, we’ll need to establish really rigorous contracts between steps—parsing, metadata caching, compilation, execution, artifact production—that have blurrier delineations today. I wouldn’t be surprised if, in the context of our work on #2, we make good progress here as well.
This makes sense as default behavior, but we’re kind of stumped how best to continue our deployment at the moment without an optional --no-compile dbt run parameter. Our idea at the outset was to use Prefect to orchestrate all of our data products in a just-in-time ELT method, where one master DAG would run dbt compile once at the beginning, execute/await source data ingestions, kick off individual dbt models as soon as each model’s source dependencies are met, then kick off each downstream ML and BI workload as soon as each exposure’s models have completed their runs and tests.

Since this end-to-end, just-in-time orchestration is a top requirement of our internal customers for our data + analytics engineering overhaul (and what we sold to them as the only acceptable modern way of running things), is waiting a few minutes for the entire project to compile before each model run our only option? Our models are well-written incremental loads, so most of them individually run much faster than the compile step.
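Concretely, the just-in-time shape we have in mind looks something like this (model names are hypothetical, and each step would be a Prefect task invoking the dbt CLI):

```shell
# master DAG: compile the project once at the start
dbt compile

# then, as each model's source dependencies land, kick off that model alone;
# as discussed above, each of these still re-parses and re-compiles the project
dbt run --models stg_orders
dbt run --models fct_orders
dbt test --models fct_orders   # gate downstream BI/ML workloads on tests
```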