dbt heavy lifting for Iceberg tables into AWS Athena gives incorrect count

cs_bot97 · February 7, 2025, 3:54am

I’m encountering an issue post running a dbt model adapted from iceberg example in the docs & the sql just transforms data by reading iceberg data present in Athena and dumping it in AWS Athena. We have a table created via this mentioned dbt process, core_acct_monthly & this needs to run for multiple partition dates.

Initially, we loaded data for the partition `20241231` , resulting in approximately 45 million records(expected & correct). Subsequently, we loaded data for the partition `20241130` , containing about 44 million records(expected & right).

My dbt model config is like this:

{{
config(
alias='core_acct_monthly',
materialized='incremental',
format='parquet',
table_type='iceberg',
partitioned_by=['ptn_val_txt']
unique_key = ['acct_id'],
incremental_strategy='merge'
)
}}

When querying the table info in Athena via $partitions command and using other iceberg related commands & snapshots, the record counts for the partitions appear correct when viewed individually (44 million for 20241130 and 45 million for 20241231 when checked immediately after load). In S3, I’m also seeing a lot of files that also reflect these numbers (data and metadata folders are getting created correctly too).

However, when I do count(*) for that partition in Athena, the record count for the older partition (20241231) drastically reduces to approximately 700,000 records without any deletion or any operation apart from running the dbt model. The count for the newer partition (20241130) remains correct at 44 million.

This discrepancy only appears in Athena queries. The metadata and files in the underlying S3 bucket appear to reflect the original, higher counts for both partitions. It seems the issue is with how Athena/Iceberg is presenting the data, despite the underlying data existing.

Could you please help me understand why this issue of incorrect record count is happening? Is there anything I’m missing from dbt-athena and iceberg adapter? We’re concerned that Athena is not correctly reflecting the data stored by Iceberg. Is this expected Iceberg behavior related to snapshots or some other mechanism? If so, how can we ensure Athena queries accurately reflect the actual data in S3? Is there something wrong with Glue metadata creation in this case?

Topic		Replies	Views
Incremental model error in dbt-athena Help incremental	1	148	January 6, 2025
On the limits of incrementality In-Depth Discussions incremental , best-practice	17	70520	November 2, 2023
Store count of inserted messages from an incremental model into a separate metadata table Help	0	140	July 12, 2024
Schema evolution dbt + Iceberg Help	1	1271	January 29, 2025
BigQuery + dbt: Incremental Changes Archive	8	39931	June 19, 2020

dbt heavy lifting for Iceberg tables into AWS Athena gives incorrect count

Initially, we loaded data for the partition 20241231 , resulting in approximately 45 million records(expected & correct). Subsequently, we loaded data for the partition 20241130 , containing about 44 million records(expected & right).

Related topics

Initially, we loaded data for the partition `20241231` , resulting in approximately 45 million records(expected & correct). Subsequently, we loaded data for the partition `20241130` , containing about 44 million records(expected & right).