First post, first problem
What’s the problem?
When I run incremental updates, BigQuery scans the entire destination table in the merge statement, i.e. the table that is being incrementally updated.
For big tables (40 GB+), that's a pretty significant cost and exactly the thing I'm trying to avoid by running incremental updates.
Just to be clear: I am avoiding the cost of querying the entire raw source table, but I'm not avoiding the cost of querying the entire dbt table. You can see the new results are only a handful of KBs, yet the merge statement hits the entire table (73 MB).
Is this a bug, something wrong with my setup, or something unavoidable?
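(For anyone who wants to reproduce the numbers, a query along these lines against BigQuery's INFORMATION_SCHEMA should surface the per-job bytes billed; the region qualifier and the one-day window are just placeholders for illustration.)
select
    creation_time,
    job_id,
    total_bytes_processed,
    total_bytes_billed,
    query
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where
    creation_time > timestamp_sub(current_timestamp(), interval 1 day)
    and job_type = 'QUERY'
order by total_bytes_billed desc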
Details
I’m running a decent number of incremental models on BigQuery. The example below runs on Google Ads data, but the others aren't too different: they read from a partitioned table in BigQuery, munge some data, and return the result. In their basic form, they all look something like this:
{{
    config(
        materialized="incremental",
        partition_by={"field": "pt", "data_type": "timestamp", "granularity": "day"},
        incremental_strategy="insert_overwrite",
    )
}}

with

new_api_lookup as (
    select
        campaign_id as campaign_id,
        ad_group_id as adgroup_id,
        _partitiontime as pt,
        sum(metrics_clicks) as total_clicks,
        sum(metrics_impressions) as total_impressions,
        sum(metrics_conversions_value) as conversion_value
    from `{{ var("ad_bq_export")["new"]["t_adGroupStats"] }}`
    where
        date(_partitiontime) > date('{{ var("bq_ad_api_migrate_date") }}')
        {% if is_incremental() %}
        -- When rerun, this will delete the last day's data.
        and _partitiontime >= _dbt_max_partition
        {% endif %}
    group by campaign_id, adgroup_id, pt
)

{% if var("bq_ads_old_api") %}
,
old_api_lookup as (
    select
        campaignid as campaign_id,
        adgroupid as adgroup_id,
        _partitiontime as pt,
        sum(clicks) as total_clicks,
        sum(impressions) as total_impressions,
        sum(conversionvalue) as conversion_value
    from `{{ var("ad_bq_export")["old"]["t_adGroupStats"] }}`
    where
        date(_partitiontime) <= date('{{ var("bq_ad_api_migrate_date") }}')
        {% if is_incremental() %}
        -- When rerun, this will delete the last day's data.
        and _partitiontime >= _dbt_max_partition
        {% endif %}
    group by campaign_id, adgroup_id, pt
)
{% endif %}

select *
from
(
    select *
    from new_api_lookup
    {% if var("bq_ads_old_api") %}
    union all
    select *
    from old_api_lookup
    {% endif %}
)
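When this runs, the insert_overwrite strategy wraps the model in a BigQuery script that first declares _dbt_max_partition off the existing table. Roughly (sketched from memory, so the exact SQL may differ between dbt-bigquery versions):
-- declared at the top of the compiled script
declare _dbt_max_partition timestamp default (
    select max(pt) from `my-project`.`table`.`_ads_adGroup_stats`
    where pt is not null
);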
The problematic part of the compiled script is the merge statement:
-- 3. run the merge statement
merge into `my-project`.`table`.`_ads_adGroup_stats` as DBT_INTERNAL_DEST
using (
    select * from `my-project`.`table`.`_ads_adGroup_stats__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE

when not matched by source
    and timestamp(timestamp_trunc(DBT_INTERNAL_DEST.pt, day)) in unnest(dbt_partitions_for_replacement)
    then delete

when not matched then insert
    (`campaign_id`, `adgroup_id`, `pt`, `total_clicks`, `total_impressions`, `conversion_value`)
values
    (`campaign_id`, `adgroup_id`, `pt`, `total_clicks`, `total_impressions`, `conversion_value`)
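For completeness, earlier in the same script the partitions to replace are collected from the temp table into dbt_partitions_for_replacement, roughly like this (again sketched from memory, version-dependent):
declare dbt_partitions_for_replacement array<timestamp>;

-- 2. define partitions to update
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct timestamp_trunc(pt, day))
    from `my-project`.`table`.`_ads_adGroup_stats__dbt_tmp`
);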
According to the blog post on this topic, this shouldn't be happening, right? _dbt_max_partition does contain the date I expect. What am I missing?