The problem I’m having
Hello,
I'm implementing an incremental model on Databricks, where existing data should be replaced based on a column that is not unique.
I know the `insert_overwrite` strategy with `partition_by` should work, but I'm using Databricks SQL Warehouses, which don't support `insert_overwrite`.
I also tried the `replace_where` strategy, which replaces data based on a boolean condition. However, in our use case the condition would need to be defined by a subquery, which is not possible:
```sql
INSERT INTO <target_table>
REPLACE WHERE key IN (SELECT key FROM <new_data>) -- <- not possible to use a subquery
SELECT * FROM <new_data>;
```
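For context, a static predicate with literal values does work here; it is specifically the subquery form that is rejected. So a variant like the following runs, but it would require computing and inlining the keys before building the statement:

```sql
-- Works: REPLACE WHERE with a static, literal predicate.
-- The key values would have to be known up front and inlined.
INSERT INTO <target_table>
REPLACE WHERE key IN ('2023-01-02')
SELECT * FROM <new_data>;
```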
To expand on the use case, assume a table that is updated using its `date` column. The current state of the table is:
| date | value |
|--------------|-------|
| "2023-01-01" | a |
| "2023-01-02" | b1 |
| "2023-01-02" | b2 |
and the new data is:
| date | value |
|--------------|-------|
| "2023-01-02" | b3 |
The final result should be:
| date | value |
|--------------|-------|
| "2023-01-01" | a |
| "2023-01-02" | b3 |
Does anyone have other ideas for how I could implement this model?
Thank you