What does dbt do if I add a new column to an existing incremental model? Add the column and keep the value for it as NULL for older rows?
Great question. dbt is “stateless”: it has no idea whether a model has been run before, whether a model has changed, or whether a file has been edited. As such, when you do a
dbt run, dbt simply generates SQL based on the current code in your project. The code for an incremental model essentially does the following:
- generate a temp table with the contents of your model select statement
- delete rows from your existing table with unique keys that match keys that exist in your temp table
- insert all records from your temp table into the target table
- drop the temp table
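The four steps above can be sketched with plain SQL. Here's a minimal, runnable illustration using sqlite3 in place of a warehouse, with a hypothetical `orders` table and `id` as the unique key (all names are made up for the example, not dbt's actual generated SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Existing target table, as built by previous runs.
cur.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "placed"), (2, "shipped")])

# Step 1: temp table with the model's current select results
# (row 2 was updated at the source, row 3 is new).
cur.execute("CREATE TEMP TABLE orders__tmp (id INTEGER, status TEXT)")
cur.executemany("INSERT INTO orders__tmp VALUES (?, ?)",
                [(2, "delivered"), (3, "placed")])

# Step 2: delete target rows whose unique key exists in the temp table.
cur.execute("DELETE FROM orders WHERE id IN (SELECT id FROM orders__tmp)")

# Step 3: insert all records from the temp table into the target table.
cur.execute("INSERT INTO orders SELECT * FROM orders__tmp")

# Step 4: drop the temp table.
cur.execute("DROP TABLE orders__tmp")

print(sorted(cur.execute("SELECT * FROM orders").fetchall()))
# → [(1, 'placed'), (2, 'delivered'), (3, 'placed')]
```

Note that step 3 is a bare `INSERT ... SELECT *`, which is what makes column mismatches a problem.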
The problem comes when you hit step #3. If the columns in your temp table are not the same as the columns in your target table, the insert will likely fail. This is one of the reasons the
--full-refresh flag exists (docs here)! When you make changes to your incremental models, you need to do a
--full-refresh run, and dbt will rebuild all of your incremental models from scratch using the current code.
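Continuing the sqlite3 sketch above, here's what that failure looks like, and what a full refresh effectively does (drop the target and rebuild it from the current code; table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
cur.execute("INSERT INTO orders VALUES (1, 'placed')")

# The model code now selects an extra column the target doesn't have.
cur.execute("CREATE TEMP TABLE orders__tmp (id INTEGER, status TEXT, amount REAL)")
cur.execute("INSERT INTO orders__tmp VALUES (2, 'shipped', 9.99)")

failed = False
try:
    # Step 3 of the incremental flow: column counts no longer match.
    cur.execute("INSERT INTO orders SELECT * FROM orders__tmp")
except sqlite3.OperationalError as e:
    failed = True
    print("incremental insert failed:", e)

# A full refresh drops the target and rebuilds it from the current code.
cur.execute("DROP TABLE orders")
cur.execute("CREATE TABLE orders AS SELECT * FROM orders__tmp")
```

After the rebuild, the target table has the new three-column shape, so subsequent incremental runs line up again.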
Also: there is an issue here that asks for an auto-full-refresh feature. Feel free to comment there if you’d like, but I don’t anticipate that this is something we’re going to work on in the very near future.
Got it. So, if there’s been an incremental model in production running daily for a while and the business determines that it needs to add a field to it, there’s no way to do that in dbt outside of a full refresh? What if the historical source data isn’t fully available?
What would happen if you add the column manually to the target table?
Hey @Ene! Welcome. If you added the column manually, and backfilled it manually, dbt would happily pick up from there no problem at all.
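In the sqlite3 sketch from earlier, the manual route looks like this: alter the target table yourself, backfill the new column (NULL, or real values if the source history is available), and the next incremental insert lines up again. All names here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
cur.execute("INSERT INTO orders VALUES (1, 'placed')")

# Manually add the new column; old rows get NULL unless you backfill
# them from source data.
cur.execute("ALTER TABLE orders ADD COLUMN amount REAL")

# The next incremental run's temp table now matches the target shape,
# so step 3's insert succeeds.
cur.execute("CREATE TEMP TABLE orders__tmp (id INTEGER, status TEXT, amount REAL)")
cur.execute("INSERT INTO orders__tmp VALUES (2, 'shipped', 9.99)")
cur.execute("INSERT INTO orders SELECT * FROM orders__tmp")
```

dbt neither knows nor cares that the column was added out-of-band; it just generates SQL against whatever shape the table currently has.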
Hi @mehdi.elamine! One of the fundamental assumptions that dbt makes about your database is that your source data is persistent—any dbt-built schema should be able to be dropped and recreated at any point in time simply by executing
dbt run. If you have source data that is not persistent, you may want to look into snapshotting that data on a scheduled basis so that it becomes persistent.
That is a pretty big assumption to have imo, considering GDPR (data minimization) and that you generally are not allowed to keep source data forever if it contains any personal data.
I think a solution is to have “another flow/project” for getting raw data into staging.
Then using staging as a persistent staging area (PSA). That way you can remove anything from the raw data that you have no justification for storing.
Then your “sources” in the dbt project are the PSA tables, not the raw data.