I’m encountering an issue post running a dbt model adapted from iceberg example in the docs & the sql just transforms data by reading iceberg data present in Athena and dumping it in AWS Athena. We have a table created via this mentioned dbt process, core_acct_monthly & this needs to run for multiple partition dates.
Initially, we loaded data for the partition 20241231
, resulting in approximately 45 million records(expected & correct). Subsequently, we loaded data for the partition 20241130
, containing about 44 million records(expected & right).
My dbt model config is like this:
{{
config(
alias='core_acct_monthly',
materialized='incremental',
format='parquet',
table_type='iceberg',
partitioned_by=['ptn_val_txt']
unique_key = ['acct_id'],
incremental_strategy='merge'
)
}}
When querying the table info in Athena via $partitions command and using other iceberg related commands & snapshots, the record counts for the partitions appear correct when viewed individually (44 million for 20241130
and 45 million for 20241231
when checked immediately after load). In S3, I’m also seeing a lot of files that also reflect these numbers (data and metadata folders are getting created correctly too).
However, when I do count(*) for that partition in Athena, the record count for the older partition (20241231
) drastically reduces to approximately 700,000 records without any deletion or any operation apart from running the dbt model. The count for the newer partition (20241130
) remains correct at 44 million.
This discrepancy only appears in Athena queries. The metadata and files in the underlying S3 bucket appear to reflect the original, higher counts for both partitions. It seems the issue is with how Athena/Iceberg is presenting the data, despite the underlying data existing.
Could you please help me understand why this issue of incorrect record count is happening? Is there anything I’m missing from dbt-athena and iceberg adapter? We’re concerned that Athena is not correctly reflecting the data stored by Iceberg. Is this expected Iceberg behavior related to snapshots or some other mechanism? If so, how can we ensure Athena queries accurately reflect the actual data in S3? Is there something wrong with Glue metadata creation in this case?