dbt BigQuery scanning the entire table in merge

First post, first problem :frowning:

What’s the problem?

When I run incremental updates, BigQuery scans the entire destination table in the merge statement, not just the partitions being replaced.

For big tables (40 GB plus), this is a significant cost, and exactly the thing I’m trying to avoid with incremental updates.

To be clear, I am avoiding the cost of querying the entire raw source table, but I’m not avoiding the cost of querying the entire dbt table: the new results are only a handful of KB, yet the merge statement scans the whole table (73 MB here).
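(For anyone wanting to check their own numbers: the bytes scanned per merge can be pulled from the BigQuery job history, roughly like the query below. This assumes the jobs run in the US multi-region; adjust the region qualifier otherwise.)

select
    creation_time,
    statement_type,
    total_bytes_processed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where statement_type = 'MERGE'
    and creation_time > timestamp_sub(current_timestamp(), interval 1 day)
order by creation_time desc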

Is this a bug, something wrong with my setup, or something unavoidable?

Details

I’m running a decent number of incremental models on BigQuery. In their basic form, they all look something like this:

The example below runs on Google Ads data, but the other models aren’t too different: they read from a partitioned table in BigQuery, munge some data, and return the aggregated result.

{{
    config(
        materialized="incremental",
        partition_by={"field": "pt", "data_type": "timestamp", "granularity": "day"},
        incremental_strategy="insert_overwrite",
    )
}}

with
    new_api_lookup as (
        select
            campaign_id as campaign_id,
            ad_group_id as adgroup_id,
            _partitiontime as pt,
            sum(metrics_clicks) as total_clicks,
            sum(metrics_impressions) as total_impressions,
            sum(metrics_conversions_value) as conversion_value
        from `{{ var("ad_bq_export")["new"]["t_adGroupStats"] }}`
        where
            date(_partitiontime) > date('{{ var("bq_ad_api_migrate_date") }}')
            {% if is_incremental() %}

-- When rerun, this will delete the last day's data.
            and _partitiontime >= _dbt_max_partition
            {% endif %}
        group by campaign_id, adgroup_id, pt
    )
    {% if var("bq_ads_old_api") %}
    ,
    old_api_lookup as (
        select
            campaignid as campaign_id,
            adgroupid as adgroup_id,
            _partitiontime as pt,
            sum(clicks) as total_clicks,
            sum(impressions) as total_impressions,
            sum(conversionvalue) as conversion_value
        from `{{ var("ad_bq_export")["old"]["t_adGroupStats"] }}`
        where
            date(_partitiontime) <= date('{{ var("bq_ad_api_migrate_date") }}')
            {% if is_incremental() %}

-- When rerun, this will delete the last day's data.
            and _partitiontime >= _dbt_max_partition
            {% endif %}
        group by campaign_id, adgroup_id, pt
    )
    {% endif %}
select *
from
    (
        select *
        from new_api_lookup
        {% if var("bq_ads_old_api") %}
        union all
        select *
        from old_api_lookup
        {% endif %}
    )

When the model runs, dbt compiles this into a BigQuery script; the problematic part is the merge statement:

    -- 3. run the merge statement

    merge into `my-project`.`table`.`_ads_adGroup_stats` as DBT_INTERNAL_DEST
        using (
        select
        * from `my-project`.`table`.`_ads_adGroup_stats__dbt_tmp`
      ) as DBT_INTERNAL_SOURCE
        on FALSE

    when not matched by source
         and timestamp(timestamp_trunc(DBT_INTERNAL_DEST.pt, day)) in unnest(dbt_partitions_for_replacement) 
        then delete

    when not matched then insert
        (`campaign_id`, `adgroup_id`, `pt`, `total_clicks`, `total_impressions`, `conversion_value`)
    values
        (`campaign_id`, `adgroup_id`, `pt`, `total_clicks`, `total_impressions`, `conversion_value`)

According to the blog post on this topic, this shouldn’t be happening, should it?

_dbt_max_partition does contain the expected date. What am I missing?
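(For reference, _dbt_max_partition is declared by the adapter at the top of the generated script, roughly along these lines:)

    declare _dbt_max_partition timestamp default (
        select max(pt)
        from `my-project`.`table`.`_ads_adGroup_stats`
        where pt is not null
    );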

Update: this turned out to be caused by a bug in 1.4 and was mostly fixed by moving to 1.5.
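If anyone else lands here: besides upgrading, recent dbt-bigquery versions also support a copy_partitions flag inside partition_by, which swaps the merge for BigQuery partition copy jobs and sidesteps the scan entirely. A sketch, untested against this exact model — check the docs for your adapter version:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "pt",
            "data_type": "timestamp",
            "granularity": "day",
            "copy_partitions": true,
        },
    )
}}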