Auto-refreshing incremental models when the schema changes


#1

Is there a way in dbt to auto-refresh incremental models when the schema of these models changes (for example, when a column is dropped)? My company is using dbt, and our data pipeline often breaks because someone changes a model's schema and forgets to refresh the related models.


#2

Hi @chuangl4! Great question. In fact, this is something that people have wanted with dbt for a long time. Here’s an issue that’s been around since July 2017! That’s a good place to start if you want to go deep here.

At a high level, the thing you’re describing would be very useful, and it’s one that we care about. We haven’t implemented it to date because it would require quite a large architectural shift in the way that dbt works.

The core of the issue is one of statefulness. dbt is designed to be stateless: when you type dbt run, dbt compiles and runs your project. Once it’s done, it spins down. It doesn’t maintain any history between runs; it’s just not designed to do that today.

Our goal for dbt is to split it into a client library and a server process. The client library would make requests to the server (things like compile and run) and would be stateless. The server would be stateful. It would be responsible for user authentication, serving requests, and maintaining state between runs. This server process would be the place where we’d want to implement the type of behavior you’re talking about. But this split doesn’t exist at all today! Getting there is going to be a major part of our 2018, and we don’t see a shortcut.

We’re so heavily invested in making this transition because there is actually a whole category of features that would be enabled by a stateful server process. For example: “only run models that I’ve changed since my most recent git commit”. Very useful!
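To make that example concrete, here is a rough sketch of how it could be approximated outside dbt today. It is not a dbt feature. It assumes the dbt project sits at the repository root, that models live under a models/ directory, and that each model is named after its .sql file. The function name changed_model_names is hypothetical, and the input is the list of paths printed by git diff --name-only HEAD.

```python
from pathlib import PurePosixPath


def changed_model_names(changed_paths, models_dir="models"):
    """Map changed file paths to model names.

    Assumption: a model's name is the stem of its .sql file, and all
    models live under `models_dir` at the repository root.
    """
    names = set()
    for raw_path in changed_paths:
        path = PurePosixPath(raw_path)
        # Keep only .sql files whose top-level directory is the models dir.
        if path.suffix == ".sql" and path.parts[0] == models_dir:
            names.add(path.stem)
    return sorted(names)


if __name__ == "__main__":
    # Example input, as it might come from `git diff --name-only HEAD`.
    paths = ["models/orders.sql", "models/staging/stg_users.sql", "README.md"]
    print(changed_model_names(paths))
```

Note that this only finds the models whose files were edited. It knows nothing about downstream models that depend on them, which is part of why proper support belongs in dbt itself.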

Stay tuned here; we’ll absolutely share more information as we make progress.

In the meantime, if you’re going to use incremental models, it’s critical to trigger a full refresh of the data by running dbt with the --full-refresh flag whenever the schema of the model changes. If you’re scheduling your runs via Sinter, that’s quite easy to accomplish. If you’re using Airflow, it can be a bit trickier to give analysts this kind of control.
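If you want to automate that decision in your own scheduler, a minimal sketch might look like the following. This is not something dbt does for you. It assumes your orchestration code can supply the column list from the previous run and the current one (for example, saved to a file or read from the warehouse's information schema); how you obtain them is up to you. The helper names schema_changed and build_dbt_command are hypothetical. The sketch uses the --models flag; newer dbt versions prefer --select.

```python
def schema_changed(previous_columns, current_columns):
    """Return True if any column was added or dropped (order is ignored)."""
    return set(previous_columns) != set(current_columns)


def build_dbt_command(model_name, previous_columns, current_columns):
    """Build the dbt command for one model, adding --full-refresh
    only when the column set differs from the previous run."""
    command = ["dbt", "run", "--models", model_name]
    if schema_changed(previous_columns, current_columns):
        command.append("--full-refresh")
    return command


if __name__ == "__main__":
    # Hypothetical example: the "status" column was dropped from "orders".
    cmd = build_dbt_command("orders", ["id", "amount", "status"], ["id", "amount"])
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run or to whatever task runner you use. Comparing only column names will miss type changes, so treat this as a starting point rather than a complete check.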


#3

Thank you. That makes a lot of sense. Looking forward to the release of this feature.