How to dynamically generate input to `partitions` config in BigQuery

I'm doing some work on BigQuery and trying to do some cost optimisation. We're working under the assumption that using the dbt BigQuery partitions config will minimise table scans and merge statements compared to just using an incremental materialisation with a partition_by config.

As a result, we would like to do something like the following:

• write a macro that identifies which partitions need updating
• pass that macro’s output to the partitions config in the dbt model
• have it re-write each of those partitions with write_truncate instead of merge (which I think is the behaviour with the partitions config?)

I was trying to test this with basically the following code:

{% macro get_partitions(partition_column, base_table) %}

{% set pc = partition_column %}
{%- call statement('partitions', fetch_result=True) -%}

    select distinct DATE({{pc}}) as part from {{ ref(base_table) }}

{%- endcall -%}

{% set partitions_result = load_result('partitions') %}
{% set partitions = [] %}
{% for partition in partitions_result['data'] %}
{% do partitions.append(partition[0].strftime('%Y%m%d')) %}
{% endfor %}
{{ log( 'Partitions | -> ' ~ partitions, info=True) }}
{{ return(partitions) }}

{% endmacro %}
{{
    config(materialized='table',
           partition_by='DATE(refund_authorisation_date_time)',
           partitions=get_partitions('refund_authorisation_date_time', 'raw_refunds'),
           verbose=True
           )
}}

with source as (

    select * from {{ ref('raw_refunds') }}

),

stage_refunds as (

    select
       *
    from source
    where DATE(refund_authorisation_date_time) =  PARSE_DATE('%Y%m%d','[DBT__PARTITION_DATE]')

)

select * from stage_refunds

I had two questions:

• Is the thinking around the optimisation correct?
• The macro doesn't seem to run its SQL during the compile stage, which breaks the run: dbt tries to build the model with a plain CREATE TABLE instead of updating each partition. It works if I force the macro to pass a specific list of dates. Am I missing something obvious to make that happen?

Hey @dylanbaker! The approach you’re using here creates an ingestion-time partitioned table. This is actually doing something totally different than a “regular” incremental model in dbt.

In a “regular” incremental model, dbt will:

  • execute a create table as statement if the model doesn’t already exist in BQ
  • execute a merge (or delete+insert) statement to upsert data into the table if it already exists (roughly sketched just below this list)
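
For context, that merge looks roughly like the sketch below on BigQuery. Everything here is a placeholder: the project, dataset, table, and key names are made up for illustration, and in practice dbt enumerates every column in the model rather than the two shown:

merge into `my_project`.`my_dataset`.`stg_refunds` as DBT_INTERNAL_DEST
using (
    -- the compiled select statement from your model
    select * from ...
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.refund_id = DBT_INTERNAL_DEST.refund_id

when matched then update set
    refund_id = DBT_INTERNAL_SOURCE.refund_id,
    refund_authorisation_date_time = DBT_INTERNAL_SOURCE.refund_authorisation_date_time

when not matched then insert
    (refund_id, refund_authorisation_date_time)
values
    (refund_id, refund_authorisation_date_time)

Without a unique_key configured, dbt drops the when matched clause and the merge is effectively insert-only.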

In an ingestion-time partitioned model, dbt instead does something totally different:

  • if the table doesn’t already exist, dbt will use the first partition to make an API call to BigQuery which runs the query specified in your model (for the given partition). This query job specifies a destination table for the result set (i.e. it does not use a create table as statement, but it’s the same idea)
    • dbt will then modify that table’s partitioning_type to be 'DAY', converting the table into an ingestion-time partitioned table via the API
  • for each subsequent partition specified in the partitions config, dbt will execute a query and persist the result set to a table (again via the API) specifying a “table decorator”. This table decorator looks like my_table_name$20190101 – the 20190101 here specifies the partition for the loaded data, and is accessible via the _PARTITIONTIME pseudo-column (illustrated just below this list).
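
To make the decorator concrete: a query job with a destination of my_dataset.my_table_name$20190101 (dataset and table names made up here) writes its result set into just the 2019-01-01 partition, and you can read that partition back out later with something like:

select *
from my_dataset.my_table_name
where _PARTITIONTIME = timestamp('2019-01-01')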

I tried to describe this as simply as I could, but… it’s complicated! Fortunately, BigQuery released column-level partitioning which is much simpler and actually reasonable to use for data modeling use cases like yours.
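
To make that concrete, here's a rough sketch of how your model could be set up as a column-partitioned incremental model instead, reusing the column and source names from your post. The incremental filter is just an example, and whether you also want a unique_key depends on your data:

{{
    config(materialized='incremental',
           partition_by='DATE(refund_authorisation_date_time)'
           )
}}

select *
from {{ ref('raw_refunds') }}

{% if is_incremental() %}
  -- only pull rows newer than what's already in the target table
  where refund_authorisation_date_time > (select max(refund_authorisation_date_time) from {{ this }})
{% endif %}

On the first run dbt does a plain create table as; on subsequent runs it merges in only the newly-selected rows, and the partition column gives BigQuery a chance to prune partitions when the table is queried downstream.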

If you want to further explore how to optimize your spend using column partitions, I’m super happy to explore that further. We’re aware of issues around how the merge statement is billed in practice, and I think there are some viable workarounds that can be implemented in user-space, as well as some things that we can do in dbt to make it less onerous to implement.

To, uh, actually answer your question though :wink: :

  • your general thinking here sounds appropriate to me (with the noted caveats about ingestion-time partitioned tables)
  • model configs like partitions are collected at parse-time, and dbt (intentionally) doesn’t run queries in statement blocks at parse-time. So, the behavior you’re seeing is expected, though I certainly agree that it’s unpleasant! (See the sketch just after this list.)
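
That's also why forcing the macro to pass a specific list of dates works: a literal list is known at parse-time. For example (the dates here are placeholders):

{{
    config(materialized='table',
           partition_by='DATE(refund_authorisation_date_time)',
           partitions=['20190101', '20190102', '20190103'],
           verbose=True
           )
}}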

If you choose to keep going down this path, you can:

  1. use adapter.execute directly (not documented, nor recommended!). Queries made by adapter.execute will run at parse-time, though that might change in the future. Check out how statements are implemented for a usage example, and see the rough sketch just after this list.
  2. leverage some built-in dbt macros to generate a list of dates in a given range. There are some examples of how to use these macros in the docs which might be able to address your use case!
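
For option 1, here's a rough, caveat-laden sketch of how your macro could be rewritten to call adapter.execute directly (same column and table arguments as your version):

{% macro get_partitions(partition_column, base_table) %}

    {% set query %}
        select distinct DATE({{ partition_column }}) as part from {{ ref(base_table) }}
    {% endset %}

    {# adapter.execute returns a (status, agate table) tuple; unlike a statement block, it will also run at parse-time (for now!) #}
    {% set result = adapter.execute(query, fetch=True) %}

    {% set partitions = [] %}
    {% for row in result[1].rows %}
        {% do partitions.append(row[0].strftime('%Y%m%d')) %}
    {% endfor %}

    {{ return(partitions) }}

{% endmacro %}

Again: this relies on undocumented behaviour that may change, so treat it as a stopgap rather than something to build on.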

Let me know what you think about all this - hope it helps!