hello - trying to figure out why this model has the same or more compute costs when merging 0 rows as opposed to a full-refresh that builds it from scratch. seems pretty basic, not sure what i’m missing.
let’s say a full-refresh processes 5GB
if i run it as it is below after that full-refresh, it will merge 0 rows and process 10GB
if i run it w/ unique_key='id' removed after that full-refresh, it will merge 0 rows and process 5GB
it is true that run time goes down for the incremental runs.
“_last_updated” is a timestamp.
{{
    config(
        materialized='incremental',
        unique_key='id'
    )
}}

select
    *
from {{ ref('my_table_name_here') }}

{% if is_incremental() %}
where _last_updated > (select max(_last_updated) from {{ this }})
{% endif %}
Hey @testing, I assume you’re using BigQuery since you’re talking about data scanned?
I’m not super familiar with how BQ bills, but I wonder whether it’s having to scan the whole table checking the _last_updated value for each row. Does BQ have indexes you can add or anything?
One way to work around this might be to run a separate query to check the value. Something like this:
{% if is_incremental() %}
    {# fetch both high-water marks with standalone queries at run time #}
    {% set max_last_updated_input = dbt_utils.get_single_value("select max(_last_updated) from " ~ ref('the_table')) %}
    {% set max_last_updated_this = dbt_utils.get_single_value("select max(_last_updated) from " ~ this) %}
    {# only add the filter when the two values differ #}
    {% if max_last_updated_input != max_last_updated_this %}
where _last_updated > '{{ max_last_updated_this }}'
    {% endif %}
{% endif %}
In BigQuery, an incremental model with the merge strategy scans the source table with the specified filters and the whole target table, looking for changes by the unique_key. If your target table grows, that means you scan more data each time to check for changes to any row.
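Roughly speaking, a merge-strategy run compiles to a single MERGE statement along these lines (just a sketch: the project/dataset names are placeholders, only two columns are shown, and the real statement selects from the temp relation dbt builds for the new rows and lists every column):

merge into `my_project.my_dataset.my_model` as DBT_INTERNAL_DEST
using (
    -- only the rows that passed the is_incremental() filter (0 rows in your test)
    select id, _last_updated
    from `my_project.my_dataset.my_model__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id  -- joining on the unique_key reads the whole target
when matched then update set
    _last_updated = DBT_INTERNAL_SOURCE._last_updated
when not matched then insert (id, _last_updated)
    values (DBT_INTERNAL_SOURCE.id, DBT_INTERNAL_SOURCE._last_updated)

That lines up with the 10GB you’re seeing: roughly 5GB to build the filtered source plus roughly 5GB to join it against the existing table on id, even when the source side ends up empty.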
If you have a large partitioned target table, you can use an incremental model with the insert_overwrite strategy. In this case, dbt will not scan the full target table but rather drop the relevant partitions and insert the new data in their place.
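A minimal sketch of that setup, assuming the table can be partitioned by day on _last_updated (the partition field, type, and granularity here are assumptions to adapt to your data):

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': '_last_updated',
            'data_type': 'timestamp',
            'granularity': 'day'
        }
    )
}}

select
    *
from {{ ref('my_table_name_here') }}

{% if is_incremental() %}
-- with insert_overwrite, dbt-bigquery also declares a _dbt_max_partition scripting
-- variable you could filter on here instead of querying the existing table yourself
where _last_updated > (select max(_last_updated) from {{ this }})
{% endif %}

On each run dbt works out which partitions show up in the new rows and replaces only those partitions in the target, so the bytes scanned scale with the partitions touched rather than with the whole table.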