Handling hard-deletes from source tables in Snapshots

Long-overdue update: Here’s the final macro we’re using to identify hard deletes and create a is_deleted column.

First and foremost, it’s worth saying that if you can use @claire’s setup, you definitely should. For us, some of the hard deletes that we’re snapshotting are associations between two entities which could be deleted then re-inserted, which makes the is_deleted column important instead of just closing out the dbt_valid_to field.

We’ve landed on a two-step approach:

  1. Take a standard dbt snapshot of all live data
  2. Take another snapshot, this time of the live data plus any records which exist in the first snapshot but not in the live data (i.e. those which have been deleted).

Here’s the macro we use (I’ve deleted my earlier post to remove my first attempt at the macro, it doesn’t work)

{% macro snapshot_hard_delete(current_data_ref, scratch_snapshot_ref, unique_key) %}
    with live as (
      select * 
      from {{current_data_ref}}
    ), 

    snap as (
      select *
      from {{scratch_snapshot_ref}}
    ),

    deleted_records as (
      select 
        {{dbt_utils.star(scratch_snapshot_ref, except=['dbt_scd_id', 'dbt_updated_at', 'dbt_valid_from', 'dbt_valid_to'])}} 
        from snap
      where {{unique_key}} not in (
        select {{unique_key}} from live
      )
      and dbt_valid_to is null
    ),

    final as (
      select *, false as is_deleted
      from live

      union
      
      select *, true as is_deleted
      from deleted_records
    )

    select * 
    from final
{% endmacro %}

Usage:

{% snapshot snap_sales__revenuedetails %}

  {% set key_pattern = 'id' %}
    {{
        config(
          target_database='reporting',
          target_schema='snapshots',
          unique_key=key_pattern,
          
          strategy='check',
          check_cols = 'all',
        )
    }}

    {{
      snapshot_hard_delete(
        current_data_ref = source('sales', 'revenuedetails'), 
        scratch_snapshot_ref = ref('snap_sales__revenuedetails_scratch'),
        unique_key = key_pattern
      )
    }}

{% endsnapshot %}

Worth noting that key_pattern can be anything that evaluates to valid SQL, so could be a combination of several columns.

1 Like