Is it possible to do a dbt run with the already compiled code, without the compilation being run again? I have a project that takes a long time to compile, while the run time itself is quite short, and I have to run it thousands of times.
@davidmasip This isn’t possible today, and I’ll try to give a little bit of background as to why.
To date, one of dbt’s greatest strengths has been the opinionated insistence that its compilation and execution of SQL ought to be idempotent and stateless. You can dbt run and, no matter the objects in your transformed schema, given the same raw data and the same project code, dbt will produce the same transformed results. This makes dbt easy to use and reason about, without needing to deeply understand all the internals from the first go.
Today, this requires each invocation to have end-to-end completeness: dbt must parse and validate anew the code in your project, construct its manifest and DAG, pick an execution order, build up a runtime cache, and then (one by one) compile and run each model in your project. Because model SQL may be dynamically templated based on the results of a previous model, there’s no way to pre-compile all the SQL and ship it off for execution elsewhere—dbt needs to be involved from beginning to end.
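For instance, a model along these lines (a minimal sketch; the model and column names are hypothetical) can only have its SQL rendered once an upstream model already exists in the warehouse:

```sql
-- models/payments_pivoted.sql
-- the distinct payment methods are fetched at compile time, so this model's
-- final SQL depends on data produced by an earlier model
{% if execute %}
    {% set results = run_query("select distinct payment_method from " ~ ref('stg_payments')) %}
    {% set methods = results.columns[0].values() %}
{% else %}
    {% set methods = [] %}
{% endif %}

select
    order_id
    {%- for method in methods %},
    sum(case when payment_method = '{{ method }}' then amount end) as amount_{{ method }}
    {%- endfor %}
from {{ ref('stg_payments') }}
group by order_id
```

There’s no way to produce this model’s final SQL ahead of time without the upstream data already in place, which is why dbt has to stay involved from parse through execution.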
If you’re running many models at once—your entire project, or a significant subset—those steps are rolled into a fixed “project loading” cost, paid once upfront, that then enables each individual model run to be efficient and accurate. If you’re invoking dbt thousands of times, each time to run one model, you’re going to pay that same fixed cost over and over again.
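In shell terms, the trade-off looks roughly like this (model names are hypothetical):

```shell
# one invocation: project loading is paid once and amortized over every model
dbt run

# many single-model invocations: the same fixed cost is paid again each time
for model in stg_orders fct_orders fct_daily_sales; do
    dbt run --models "$model"   # parse, manifest, DAG, cache -- every time
done
```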
Those are the facts. There are a few angles from which to see them:
1. It isn’t good enough that some projects take minutes between typing dbt run and seeing the first model execute. That’s unacceptable; all projects should have invocations that start up in a matter of seconds. As such, we’re investing significant time and energy this year in reducing these fixed costs by multiple orders of magnitude, with plans to roll out our progress in stages. The first fruits of this effort will soon be arriving in a performance release, v0.19.1.

2. In the meantime, dbt can already be somewhat stateful between invocations: partial parsing lets it reuse the previous parse of unchanged project files (as long as you avoid certain configurations, namely var and env_var), and state-based selection lets you run only changed models by passing a past invocation’s artifacts (see the sketch after this list).

3. In the long long term, I could envision a world where it’s possible to parse a project, save/pickle/zip the parsed representation, and send that file somewhere else for “execution only,” the kind of shortcut your question imagines. To get there, we’ll need to establish really rigorous contracts between steps—parsing, metadata caching, compilation, execution, artifact production—that have blurrier delineations today. I wouldn’t be surprised if, in the context of our work on #2, we make good progress here as well.
This makes sense as default behavior, but we’re kind of stumped how best to continue our deployment at the moment without an optional --no-compile dbt run parameter. Our idea at the outset was to use Prefect to orchestrate all of our data products in a just-in-time ELT method, where one master DAG would run dbt compile once at the beginning, execute/await source data ingestions, kick off individual dbt models as soon as each model’s source dependencies are met, then kick off each downstream ML and BI workload as soon as each exposure’s models have completed their runs and tests.

Since this end-to-end, just-in-time orchestration is a top requirement of our internal customers for our data + analytics engineering overhaul (and what we sold to them as the only acceptable modern way of running things), is waiting a few minutes for the entire project to compile before each model run our only option? Our models are well-written incremental loads, so most of them individually run much faster than the compile step.
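Concretely, the just-in-time shape we have in mind looks something like this (model names are hypothetical, and each step would be a Prefect task invoking the dbt CLI):

```shell
# master DAG: compile the project once at the start
dbt compile

# then, as each model's source dependencies land, kick off that model alone;
# as discussed above, each of these still re-parses and re-compiles the project
dbt run --models stg_orders
dbt run --models fct_orders
dbt test --models fct_orders   # gate downstream BI/ML workloads on tests
```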