We are facing an issue when the Terraform configuration argument default_partition_expiration_ms is combined with the dbt configuration argument partition_expiration_days.
Specifically, the problem occurs when partition_expiration_days is greater than default_partition_expiration_ms converted to days.
The issue arises when inserting data into an existing table. If the partition date of the data being written is older than days(default_partition_expiration_ms) but newer than partition_expiration_days, the data in that partition is not available, when it should be.
NB: This issue happens only when "insert_overwrite" is used as the incremental_strategy in the dbt model definition.
How to reproduce the issue:
- Create a dataset via Terraform, setting default_partition_expiration_ms to 2592000000 (that is, 30 days).
You can write something like this:
resource "google_bigquery_dataset" "datasets" {
  project = var.project_id
  ...
  ...
  default_partition_expiration_ms = 2592000000
  ...
}
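As a sanity check on the conversion between the two settings, here is a short sketch (plain Python; the constant and function names are ours, not part of Terraform or dbt):

```python
# BigQuery expiration settings: dataset-level is in milliseconds,
# dbt's table-level setting is in days.
MS_PER_DAY = 1000 * 60 * 60 * 24  # 86_400_000

def ms_to_days(ms: int) -> float:
    """default_partition_expiration_ms expressed in days."""
    return ms / MS_PER_DAY

def days_to_ms(days: int) -> int:
    """partition_expiration_days expressed in milliseconds."""
    return days * MS_PER_DAY

print(ms_to_days(2592000000))  # 30.0 -> the Terraform dataset default
print(days_to_ms(60))          # 5184000000 -> the dbt model setting
```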
- Run the dbt model below against the dataset created in step 1, setting partition_expiration_days to 60. If everything works correctly, the table created in BigQuery will have 'Partition expiration' equal to 60 days.
You can write something like this:
{{ config(
    materialized = "incremental",
    incremental_strategy = "insert_overwrite",
    partition_by = {
        "field": "my_partition_date",
        "data_type": "date",
        "granularity": "day"
    },
    partition_expiration_days = 60,
) }}
WITH source_data AS (
    SELECT
        1 AS my_id,
        DATE_SUB(CURRENT_DATE(), INTERVAL 45 DAY) AS my_partition_date
)
SELECT
    my_id,
    my_partition_date
FROM
    source_data
- Re-run the dbt model; this time dbt will perform an incremental operation on the table. You will need to change the DATE_SUB expression to something like:
...
DATE_SUB(CURRENT_DATE(), INTERVAL 50 DAY) AS my_partition_date
...
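Note that both runs write partitions older than the dataset default: 45 and 50 days both exceed the 30-day default_partition_expiration_ms while remaining inside the 60-day partition_expiration_days window. A small sketch of that check (plain Python; the names are ours):

```python
DATASET_DEFAULT_DAYS = 30   # days(default_partition_expiration_ms)
MODEL_EXPIRATION_DAYS = 60  # partition_expiration_days in the dbt model

def at_risk(partition_age_days: int) -> bool:
    """A partition is at risk if it should survive in the destination table
    (age < 60 days) but would expire in any table that inherits the dataset
    default (age > 30 days)."""
    return DATASET_DEFAULT_DAYS < partition_age_days < MODEL_EXPIRATION_DAYS

# The partitions written in steps 2 and 3:
print([at_risk(age) for age in (45, 50)])  # [True, True]
```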
- In BigQuery, run the query below and inspect the results.
SELECT * FROM `<PROJECT_ID>.<DATASET_ID>.<TABLE_NAME>`
Actual behaviour
The result of the query in step 4 contains only 1 row, namely the one added in step 2.
The row added in step 3 is not available.
(Note that there are 2 partitions in the table.)
Expected behaviour
The result of the query in step 4 contains 2 rows.
Possible cause
We investigated how dbt performs the insert under the hood. With insert_overwrite, dbt first writes the new data to a temporary partitioned table created in the same dataset as the destination table. The tmp table therefore gets the dataset's default partition expiration, i.e. days(default_partition_expiration_ms) = 30 days, so any partition older than that expires from the tmp table before it is merged into the destination.
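This hypothesis can be sketched as a toy simulation (plain Python, illustrative only; dbt's actual MERGE SQL differs):

```python
from datetime import date, timedelta

def expire(table: dict, expiration_days: int, today: date) -> dict:
    """Drop partitions older than the expiration, as BigQuery would."""
    cutoff = today - timedelta(days=expiration_days)
    return {d: rows for d, rows in table.items() if d >= cutoff}

today = date(2024, 4, 1)  # arbitrary reference date
dest = {today - timedelta(days=45): [{"my_id": 1}]}  # state after step 2

# Step 3: insert_overwrite stages new data in a tmp table that inherits
# the dataset default expiration (30 days), not the model's 60 days.
tmp = {today - timedelta(days=50): [{"my_id": 1}]}
tmp = expire(tmp, 30, today)    # the 50-day-old partition is dropped here

dest.update(tmp)                # merge tmp into the destination
dest = expire(dest, 60, today)  # destination keeps its 60-day expiration

print(len(dest))  # 1 -> only the step-2 row survives; 2 were expected
```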
Versions used
dbt-core: 1.7.11
dbt-bigquery: 1.7.7
References:
- Terraform BigQuery configuration: Terraform Registry
- dbt BigQuery configuration: BigQuery configurations | dbt Developer Hub