I’m new to dbt and diving deep. I believe there is a bug in how snapshot insertions are processed and handled.
For starters, it is not uncommon for tables to have an etl_create_date and an etl_update_date. In testing my snapshot, I created a raw table with mocked up records, giving them an etl_create_date of current_timestamp() and a null etl_update_date. When updates are made to the table, the etl_update_date gets updated with the current_timestamp(). This is a valid, normal scenario.
I run the snapshot with etl_update_date as my update_at configuration, and the snapshot table does not populate the dbt_updated_at or dbt_valid_from date. What this tells me is dbt won’t accept null values in the updated_at config value. I don’t agree with that.
If you examine the logs, the insertions_source cte has the dbt_updated_at and dbt_valid_from value as my null etl_update_date. So at this point, all the dbt dates in my snapshot are null. At the very least, why doesn’t dbt just add a coalesce around this and populated current_timestamp() if my value is null?
insertions_source_data as (
select *,
claim_num as dbt_unique_key
,
etl_update_date as dbt_updated_at,
etl_update_date as dbt_valid_from,
coalesce(nullif(etl_update_date, etl_update_date), null)
as dbt_valid_to
,
md5(coalesce(cast(claim_num as varchar ), '')
|| '|' || coalesce(cast(etl_update_date as varchar ), '')
) as dbt_scd_id
from snapshot_query
)
If I then mock up one of my raw record with a change, to see if the snapshot table adds the subsequent record, nothing will change in my snapshot because the dbt dates are all null. The insertions and updates cte in the logs reference this below. Again, why don’t they use a coalesce around the dbt_valid_from value and use something like ‘1900-01-01’? It does not logically make sense that the updated_at config value cannot be null.
snapshotted_data.dbt_valid_from < source_data.etl_update_date