full-refresh is not rebuilding the table but replacing it with only new data

The problem I’m having

dbt run with --full-refresh is not rebuilding the table; instead it replaces the existing table with data for only the given partition. All the previous partitions are removed.

The context of why I’m trying to do this

I need to rebuild the table because two columns have been removed from the model. I can’t use on_schema_change: sync_all_columns because Databricks Delta Lake does not support dropping columns, so in this particular case I have to rebuild the table.
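
(As an aside: recent Delta Lake versions do allow DROP COLUMN once column mapping is enabled on the table. Whether that is viable depends on your Databricks runtime and on downstream readers tolerating the protocol upgrade, so treat the sketch below as something to verify in your environment first; the column names are placeholders.)

```sql
-- Sketch only: Delta Lake supports DROP COLUMN(S) when column mapping
-- is enabled. Enabling it upgrades the table protocol, which older
-- readers/writers may not support -- verify before using in production.
ALTER TABLE falcon_int.bronze.base_exp_valid_sessions SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion'   = '2',
  'delta.minWriterVersion'   = '5'
);

-- old_col_1 / old_col_2 are hypothetical names for the two removed columns.
ALTER TABLE falcon_int.bronze.base_exp_valid_sessions
  DROP COLUMNS (old_col_1, old_col_2);
```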

config in dbt_project.yml

Please see the relevant config from the dbt_project.yml file below.

metrics_store:
    +materialized: incremental
    +incremental_strategy: insert_overwrite
    +on_schema_change: append_new_columns
    +partition_by:
      - user_id
      - report_date
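
For reference, on an ordinary incremental (non-full-refresh) run this configuration makes the adapter overwrite only the partitions present in the new data, roughly with a statement of the following shape (illustrative, not the adapter's literal output; the `__dbt_tmp` relation name is an assumption):

```sql
-- Illustrative shape of an insert_overwrite incremental build: with dynamic
-- partition overwrite, only the partitions that appear in the SELECT output
-- are replaced -- other partitions are left untouched.
insert overwrite table `falcon_int`.`bronze`.`base_exp_valid_sessions`
select * from `base_exp_valid_sessions__dbt_tmp`;
```

This is why the partition filter in the model matters: it determines which partitions show up in the output and therefore which ones get overwritten.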

dbt run command:

dbt run --select base_exp_valid_sessions --full-refresh --vars '{"report_date": "2024-07-18", "user_id": "123"}'

Logs

I would expect full-refresh to create a tmp table and then restore the data. In the logs, dbt is running a create or replace table command, which doesn’t seem correct.


16:14:13  Databricks adapter: Cursor(session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, command-id=Unknown) - Created cursor
16:14:13  SQL status: OK in 0.260 seconds
16:14:13  Databricks adapter: Cursor(session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, command-id=37ff0877-ebc6-45e9-a1db-0b2ef32a0813) - Closing cursor
16:14:13  Writing runtime sql for node "model.metrics_store.base_exp_valid_sessions"
16:14:13  Databricks adapter: DatabricksDBTConnection(id=4933596368, session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, name=model.metrics_store.base_exp_valid_sessions, idle-time=1.022150993347168s, acquire-count=1, language=sql, thread-identifier=(60701, 123145385476096), compute-name=) - Checking idleness
16:14:13  Databricks adapter: DatabricksDBTConnection(id=4933596368, session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, name=model.metrics_store.base_exp_valid_sessions, idle-time=1.0227110385894775s, acquire-count=1, language=sql, thread-identifier=(60701, 123145385476096), compute-name=) - Retrieving connection
16:14:13  Using databricks connection "model.metrics_store.base_exp_valid_sessions"
16:14:13  On model.metrics_store.base_exp_valid_sessions: /* {"app": "dbt", "dbt_version": "1.8.4", "dbt_databricks_version": "1.8.4", "databricks_sql_connector_version": "3.1.2", "profile_name": "metrics_store", "target_name": "int", "node_id": "model.metrics_store.base_exp_valid_sessions"} */



        create or replace table `falcon_int`.`bronze`.`base_exp_valid_sessions`

      using delta

      partitioned by (user_id,report_date)





      as
      with
    filtered as (
        select
            *,
            userid as user_id,
            date('2024-07-18') as report_date
        from `falcon_prd`.`gold`.`session_lookup`
        where
            (
                (userid = '123')
                and (
                    date(session_start) <= date('2024-07-18')
                )

            )
    )

select *
from filtered
where
    user_id = '123'
    and report_date = date('2024-07-18')

16:14:13  Databricks adapter: Cursor(session-id=d0918ca8-bd43-4894-818c-88f91a934ff6, command-id=Unknown) - Created cursor


Question: Why is full-refresh not rebuilding the table, i.e. keeping the existing partitions and adding the new one? Instead, it is removing the old partitions and adding only the new one.

A full-refresh run will execute your SQL code under a CREATE OR REPLACE TABLE statement, so the table is rebuilt from scratch with whatever your model’s SELECT returns.

Usually you use the is_incremental() macro to separate incremental and full-refresh logic. Something like:

select *
from filtered
where
    user_id = '123'
    {% if is_incremental() %}
    and report_date = date('2024-07-18')
    {% endif %}

So if you run it incrementally, it will filter on the partition. If you run a full refresh, it will ignore the partition filter and select everything.
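
Adapted to the vars from the original command, the guarded filter might look like this (a sketch reusing the `user_id` / `report_date` var names from the question; the actual model source isn’t shown in the thread):

```sql
select *
from filtered
where
    user_id = '{{ var("user_id") }}'
    {% if is_incremental() %}
    -- Only applied on incremental runs; a --full-refresh run skips this
    -- filter and rebuilds the table with all partitions.
    and report_date = date('{{ var("report_date") }}')
    {% endif %}
```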

Hi @brunoszdl, thanks for your answer.

This is quite surprising because the same code had been working for us until we migrated to v1.8.x. However, when we tried running it on the previously working version 1.7.x, we saw no difference in behavior. So we can rule out that the change was introduced in 1.8.x.

I’ll check if I can find some logs from previous runs that we believe worked without the is_incremental() condition.

Oh, got it. If it is something that broke when you upgraded, you could raise an issue in the GitHub repo to let them know.