Long-overdue update: Here’s the final macro we’re using to identify hard deletes and create a is_deleted column.
First and foremost, it’s worth saying that if you can use @claire’s setup, you definitely should. For us, some of the hard deletes that we’re snapshotting are associations between two entities which could be deleted then re-inserted, which makes the is_deleted column important instead of just closing out the dbt_valid_to field.
We’ve landed on a two-step approach:
- Take a standard dbt snapshot of all live data
- Take another snapshot, this time of the live data plus any records which exist in the first snapshot but not in the live data (i.e. those which have been deleted).
Here’s the macro we use (I’ve deleted my earlier post to remove my first attempt at the macro, it doesn’t work)
{% macro snapshot_hard_delete(current_data_ref, scratch_snapshot_ref, unique_key) %}
with live as (
select *
from {{current_data_ref}}
),
snap as (
select *
from {{scratch_snapshot_ref}}
),
deleted_records as (
select
{{dbt_utils.star(scratch_snapshot_ref, except=['dbt_scd_id', 'dbt_updated_at', 'dbt_valid_from', 'dbt_valid_to'])}}
from snap
where {{unique_key}} not in (
select {{unique_key}} from live
)
and dbt_valid_to is null
),
final as (
select *, false as is_deleted
from live
union
select *, true as is_deleted
from deleted_records
)
select *
from final
{% endmacro %}
Usage:
{% snapshot snap_sales__revenuedetails %}
{% set key_pattern = 'id' %}
{{
config(
target_database='reporting',
target_schema='snapshots',
unique_key=key_pattern,
strategy='check',
check_cols = 'all',
)
}}
{{
snapshot_hard_delete(
current_data_ref = source('sales', 'revenuedetails'),
scratch_snapshot_ref = ref('snap_sales__revenuedetails_scratch'),
unique_key = key_pattern
)
}}
{% endsnapshot %}
Worth noting that key_pattern
can be anything that evaluates to valid SQL, so could be a combination of several columns.