full-refresh is not rebuilding the table but replacing it with only new data

The problem I’m having

dbt run with --full-refresh is not rebuilding the table; instead it replaces the existing table with data for only the given partition. All the previous partitions are removed.

The context of why I’m trying to do this

I need to rebuild the table because two columns have been removed from the model. I can’t use on_schema_change: sync_all_columns because Databricks Delta Lake does not support dropping columns, so in this particular case I have to rebuild the table.
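
(As an aside: recent Delta Lake versions do allow DROP COLUMN once column mapping is enabled on the table. Whether that is viable depends on your Databricks runtime and on downstream readers tolerating the protocol upgrade, so treat the sketch below as something to verify in your environment first; the column names are placeholders.)

```sql
-- Sketch only: Delta Lake supports DROP COLUMN(S) when column mapping
-- is enabled. Enabling it upgrades the table protocol, which older
-- readers/writers may not support -- verify before using in production.
ALTER TABLE falcon_int.bronze.base_exp_valid_sessions SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion'   = '2',
  'delta.minWriterVersion'   = '5'
);

-- old_col_1 / old_col_2 are hypothetical names for the two removed columns.
ALTER TABLE falcon_int.bronze.base_exp_valid_sessions
  DROP COLUMNS (old_col_1, old_col_2);
```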

config in dbt_project.yml

Please see the relevant config from the dbt_project.yml file below.

metrics_store:
    +materialized: incremental
    +incremental_strategy: insert_overwrite
    +on_schema_change: append_new_columns
    +partition_by:
      - user_id
      - report_date
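
For reference, on an ordinary incremental (non-full-refresh) run this configuration makes the adapter overwrite only the partitions present in the new data, roughly with a statement of the following shape (illustrative, not the adapter's literal output; the `__dbt_tmp` relation name is an assumption):

```sql
-- Illustrative shape of an insert_overwrite incremental build: with dynamic
-- partition overwrite, only the partitions that appear in the SELECT output
-- are replaced -- other partitions are left untouched.
insert overwrite table `falcon_int`.`bronze`.`base_exp_valid_sessions`
select * from `base_exp_valid_sessions__dbt_tmp`;
```

This is why the partition filter in the model matters: it determines which partitions show up in the output and therefore which ones get overwritten.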

dbt run command:

dbt run --select base_exp_valid_sessions --full-refresh --vars '{"report_date": "2024-07-18", "user_id": "123"}'

Logs

I would expect full-refresh to create a tmp table and then restore the data. In the logs, dbt is running a create or replace table command, which doesn’t seem correct.


16:14:13  Databricks adapter: Cursor(session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, command-id=Unknown) - Created cursor
16:14:13  SQL status: OK in 0.260 seconds
16:14:13  Databricks adapter: Cursor(session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, command-id=37ff0877-ebc6-45e9-a1db-0b2ef32a0813) - Closing cursor
16:14:13  Writing runtime sql for node "model.metrics_store.base_exp_valid_sessions"
16:14:13  Databricks adapter: DatabricksDBTConnection(id=4933596368, session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, name=model.metrics_store.base_exp_valid_sessions, idle-time=1.022150993347168s, acquire-count=1, language=sql, thread-identifier=(60701, 123145385476096), compute-name=) - Checking idleness
16:14:13  Databricks adapter: DatabricksDBTConnection(id=4933596368, session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, name=model.metrics_store.base_exp_valid_sessions, idle-time=1.0227110385894775s, acquire-count=1, language=sql, thread-identifier=(60701, 123145385476096), compute-name=) - Retrieving connection
16:14:13  Using databricks connection "model.metrics_store.base_exp_valid_sessions"
16:14:13  On model.metrics_store.base_exp_valid_sessions: /* {"app": "dbt", "dbt_version": "1.8.4", "dbt_databricks_version": "1.8.4", "databricks_sql_connector_version": "3.1.2", "profile_name": "metrics_store", "target_name": "int", "node_id": "model.metrics_store.base_exp_valid_sessions"} */



        create or replace table `falcon_int`.`bronze`.`base_exp_valid_sessions`

      using delta

      partitioned by (user_id,report_date)





      as
      with
    filtered as (
        select
            *,
            userid as user_id,
            date('2024-07-18') as report_date
        from `falcon_prd`.`gold`.`session_lookup`
        where
            (
                (userid = '123')
                and (
                    date(session_start) <= date('2024-07-18')
                )

            )
    )

select *
from filtered
where
    user_id = '123'
    and report_date = date('2024-07-18')

16:14:13  Databricks adapter: Cursor(session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, command-id=Unknown) - Created cursor


Question: Why is full-refresh not rebuilding the table, i.e. keeping the existing partitions and adding the new one? Instead, it is removing the old partitions and adding only the new one.

A full-refresh run will execute your SQL code under a CREATE OR REPLACE TABLE statement, so the table is rebuilt from scratch with whatever your model’s SELECT returns.

Usually you use the is_incremental() macro to separate incremental and full-refresh logic. Something like:

select *
from filtered
where
    user_id = '123'
    {% if is_incremental() %}
    and report_date = date('2024-07-18')
    {% endif %}

So if you run it incrementally, it will filter on the partition. If you run a full refresh, it will ignore the partition filter and select everything.
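
Adapted to the vars from the original command, the guarded filter might look like this (a sketch reusing the `user_id` / `report_date` var names from the question; the actual model source isn’t shown in the thread):

```sql
select *
from filtered
where
    user_id = '{{ var("user_id") }}'
    {% if is_incremental() %}
    -- Only applied on incremental runs; a --full-refresh run skips this
    -- filter and rebuilds the table with all partitions.
    and report_date = date('{{ var("report_date") }}')
    {% endif %}
```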

Hi @brunoszdl, thanks for your answer.

This is quite surprising because the same code had been working for us until we migrated to v1.8.x. However, when we tried running it on the previously working version 1.7.x, we saw no difference in behavior. So we can rule out that the change was introduced in 1.8.x.

I’ll check if I can find some logs from previous runs that we believe worked without the is_incremental() condition.

Oh, got it. If it is something that broke when you upgraded, you could raise an issue in the GitHub repo to let them know.