We are facing an issue when the Terraform configuration argument default_partition_expiration_ms is combined with the dbt configuration argument partition_expiration_days.
Specifically, the problem occurs when partition_expiration_days is greater than default_partition_expiration_ms converted to days.
The issue arises when inserting data into an existing table. If the partition date of the data being written is older than days(default_partition_expiration_ms) but newer than partition_expiration_days, the data in that partition is not available, when it should be.
NB: This issue happens only when "insert_overwrite" is used as the incremental_strategy in the dbt model definition.
How to reproduce the issue:
- Create a dataset via Terraform, setting default_partition_expiration_ms to 2592000000 (that is, 30 days).
You can write something like this:
resource "google_bigquery_dataset" "datasets" {
  project = var.project_id
  ...
  ...
  default_partition_expiration_ms = 2592000000
  ...
}
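As a sanity check on the conversion between the two settings, here is a short sketch (plain Python; the constant and function names are ours, not part of Terraform or dbt):

```python
# BigQuery expiration settings: dataset-level is in milliseconds,
# dbt's table-level setting is in days.
MS_PER_DAY = 1000 * 60 * 60 * 24  # 86_400_000

def ms_to_days(ms: int) -> float:
    """default_partition_expiration_ms expressed in days."""
    return ms / MS_PER_DAY

def days_to_ms(days: int) -> int:
    """partition_expiration_days expressed in milliseconds."""
    return days * MS_PER_DAY

print(ms_to_days(2592000000))  # 30.0 -> the Terraform dataset default
print(days_to_ms(60))          # 5184000000 -> the dbt model setting
```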
- Run the dbt model below against the dataset created in step 1, setting partition_expiration_days to 60. If everything works correctly, the table created in BigQuery will have 'Partition expiration' equal to 60 days.
You can write something like this:
{{ config(
    materialized = "incremental",
    incremental_strategy = "insert_overwrite",
    partition_by = {
        "field": "my_partition_date",
        "data_type": "date",
        "granularity": "day"
    },
    partition_expiration_days = 60,
) }}
WITH source_data AS (
    SELECT
        1 AS my_id,
        DATE_SUB(CURRENT_DATE(), INTERVAL 45 DAY) AS my_partition_date
)
SELECT
    my_id,
    my_partition_date
FROM
    source_data
- Re-run the dbt model; this time dbt will perform an incremental operation on the table. You will need to change the DATE_SUB expression to something like:
...
DATE_SUB(CURRENT_DATE(), INTERVAL 50 DAY) AS my_partition_date
...
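Note that both runs write partitions older than the dataset default: 45 and 50 days both exceed the 30-day default_partition_expiration_ms while remaining inside the 60-day partition_expiration_days window. A small sketch of that check (plain Python; the names are ours):

```python
DATASET_DEFAULT_DAYS = 30   # days(default_partition_expiration_ms)
MODEL_EXPIRATION_DAYS = 60  # partition_expiration_days in the dbt model

def at_risk(partition_age_days: int) -> bool:
    """A partition is at risk if it should survive in the destination table
    (age < 60 days) but would expire in any table that inherits the dataset
    default (age > 30 days)."""
    return DATASET_DEFAULT_DAYS < partition_age_days < MODEL_EXPIRATION_DAYS

# The partitions written in steps 2 and 3:
print([at_risk(age) for age in (45, 50)])  # [True, True]
```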
- In BigQuery, run the query below and inspect the results.
SELECT * FROM `<PROJECT_ID>.<DATASET_ID>.<TABLE_NAME>`
Actual behaviour
The result of the query in step 4 contains only 1 row, namely the one added in step 2.
The row added in step 3 is not available.
(Note that there are 2 partitions in the table.)
Expected behaviour
The result of the query in step 4 contains 2 rows.
Possible cause
We investigated how dbt performs the insert under the hood. With insert_overwrite, dbt first writes the new data to a temporary partitioned table created in the same dataset as the destination table. The tmp table therefore gets the dataset's default partition expiration, i.e. days(default_partition_expiration_ms) = 30 days, so any partition older than that expires from the tmp table before it is merged into the destination.
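This hypothesis can be sketched as a toy simulation (plain Python, illustrative only; dbt's actual MERGE SQL differs):

```python
from datetime import date, timedelta

def expire(table: dict, expiration_days: int, today: date) -> dict:
    """Drop partitions older than the expiration, as BigQuery would."""
    cutoff = today - timedelta(days=expiration_days)
    return {d: rows for d, rows in table.items() if d >= cutoff}

today = date(2024, 4, 1)  # arbitrary reference date
dest = {today - timedelta(days=45): [{"my_id": 1}]}  # state after step 2

# Step 3: insert_overwrite stages new data in a tmp table that inherits
# the dataset default expiration (30 days), not the model's 60 days.
tmp = {today - timedelta(days=50): [{"my_id": 1}]}
tmp = expire(tmp, 30, today)    # the 50-day-old partition is dropped here

dest.update(tmp)                # merge tmp into the destination
dest = expire(dest, 60, today)  # destination keeps its 60-day expiration

print(len(dest))  # 1 -> only the step-2 row survives; 2 were expected
```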
Versions used
dbt-core: 1.7.11
dbt-bigquery: 1.7.7
References:
- Terraform BigQuery configuration: Terraform Registry
- dbt BigQuery configuration: BigQuery configurations | dbt Developer Hub