Handling hard-deletes from source tables in Snapshots

Hi all,

We’ve got a number of tables we’re snapshotting daily, but their source tables are hard deleted so these changes are never marked as expired in the snapshots. It’s not possible for us to change to soft deletes on the application side, so I’m trying to work out what the best approach is to work it out from dbt.

An initial assumption:
One assumption I’ve made is that an explicit is_deleted column which starts a new dbt_valid_from range:

id | price| quantity | is_deleted | dbt_valid_from | dbt_valid_to
1  | 100  | 500      | false      | 2020-01-01     | 2020-01-03
1  | 100  | 500      | true       | 2020-01-03     | null 

is better than just closing off dbt_valid_to:

id | price| quantity | dbt_valid_from  | dbt_valid_to
1  | 100  | 500      |  2020-01-01     | 2020-01-03

because I can’t think of a dbt-y way to update those records after the snapshot is taken (maybe a post-run-hook, but it feels fragile to not do everything inside of the one transaction.)

Finding the records
To get the missing records, I can select from the source table and then do an outer join on the most recent snapshot to find any additional rows:

select 
  coalesce(src.id, snap.id) as id,
  coalesce(src.price, snap.price) as price,
  coalesce(src.quantity, snap.quantity) as quantity,
  coalesce(src.modifieddate, getdate()) as modifieddate,
  src.id is null as is_deleted
from {{source('sales', 'revenuedetails')}} as src
full outer join {{ref('sales__revenuedetails')}} as snap on src.id = snap.id 
  and snap.dbt_valid_to is null

Right now we’re using the timestamp strategy since we have a modification date column, however as far as I can tell I’d have to use getdate() for the deleted rows, meaning that every snapshot we take would create another row (for deleted records at least) and I’m effectively getting close to creating a SCD-4 history table.

The alternative is to move away from timestamps to the check strategy (excluding the modifieddate column) but this table easily has a dozen columns that need checked and that feels equally gross.

Three questions:

  • Am I correct that it’s better to add an is_deleted column over just updating the final dbt_valid_to column?
  • If so, are my only choices creating redundant rows for deleted records due to using getdate() as the modified date, or only having one row per “true” change but have to provide a very long check_cols argument?
  • If so, do you have a preference?

Thank you!

Hi,

This is a great question and here’s the way we have gone about handling hard deletes:

  1. Make a table with the view of your data (ex. schema.current_table), but add a pre-hook that creates a table based on what is currently in prod (ex. CREATE OR REPLACE TABLE schema.previous_table AS SELECT * FROM schema.current_table). The table created via the pre-hook will help you identify deletes.
  2. Create a table that has the deleted records by using the MINUS operator (disclosure: I use Snowflake). The new table will be called something like schema.table_deleted_records.
SELECT
   column_a
  , column_b
FROM schema.table_previous

MINUS
SELECT
  column_a
  , column_b
FROM schema.table_current
  1. Then in another table, union the data and add a column is_deleted (FALSE for current data, TRUE for deleted_records).
    SELECT
      column_a
     , column_b
    , TRUE AS is_deleted
    FROM schema.previous_table
    INNER JOIN schema.table_deleted_records
       ON << unique key >>

    UNION

    SELECT
     column_a
     , column_b
     , FALSE AS is_deleted
    FROM schema.current_table
  1. Then, you will run your snapshots on the table created in Step 3 with the check strategy (we know that it is best to use a timestamp, so we do that elsewhere for tables that don’t have a hard delete).

This is just what has worked for us and is by no means the only way to go about it. For example, you could join based on the historical snapshot table as you mentioned. I think the is_deleted column is better than overwriting the dbt_valid_to column.

2 Likes

Thanks @ddaltonrtr, that’s really useful! We’re on Redshift which doesn’t look to support OR REPLACE so I think I’ll have to stick with joining back to the historical snapshot table, but it’s still great to know that I should stick with the check strategy and add an is_deleted column.

I quite like the idea of doing two SELECTs and then UNIONing them with a true/false for is_deleted, that might be a bit more elegant than having to COALESCE everything.

Hi - In Redshift, you could truncate the table first and then create the table if you wanted to follow this pattern. Otherwise, you could continue to join back onto the historical snapshot table.

The one part I don’t understand with truncating/replacing the table is that if you create table_deleted_records based solely on the current contents of table, and then subtract one from the other, won’t your result be an empty set? How do you keep track of the records which have been deleted if you truncate table_deleted_records and rebuild from scratch every time?

Hi, I’m just now coming across the hard delete complications. I’m trying to follow some of what you each have listed, did either of you try joining the snapshot to the source and create a circular reference?

Encountered an error:
Found a cycle: snapshot.project.snapshot_table --> model.project.added_is_deleted_table --> snapshot.project.snapshot_table

The error makes sense :slight_smile: It is!
Did anyone hit this and find a way past while adding the is_deleted column to a table that is then snapshotted?
I’m assuming I’d need to remove a {{ref(...)}} somewhere, but feels hacky and against the point of dbt.

P.S. In the original post, you mentioned: " just closing off dbt_valid_to :" - Do you know a way to do that? I’d find that much neater for my scenarios if there’s a way? I thought Issue #249 implied there was no way at the moment though.
Also - thanks for the ideas for me to try on this problem!

Hi,
I typically create the table_deleted_records based on what is actively in your materialized table in the database right at that moment (before you recreate the table with your latest data). Then, I create the new table and minus that from the table_deleted_records. This will only be an empty set if there are not any deleted records. I keep track of the deleted records in the snapshot table. Once they are deleted and recorded in the snapshot table, I don’t need to store that information anywhere else.

I’m not sure I understand the circular reference part of your question, but I can comment on closing off dbt_valid_to. For that, we just use a macro (which we call the_distant_future and define in the dbt_project.yml. Then, we just coalesce the dbt_valid_to and the variable!

Another approach is to write your own custom snapshot strategy that handles hard deletes. Here’s a Loom video of some work I did on this front! (https://www.loom.com/share/001412e57d8b40d4952590c2f6f0459e)

Here’s a link to how I added this just to my (test) project. And one for an open PR on dbt to change this core functionality.

Hi Joel,
For my purposes (I;m still in prototype), I’ve probably done similar. I’ve abandoned the {{ref(snapshot_table)}} and just used FROM snapshot_table. Would like to find a dbt-friendly way for a Prod deploy though.
My only other observation after doing this was that I realised the chances of getting a raw source table named is_deleted were high so I renamed that on 2nd pass. Renaming it after I’d already run it the 1st time was not a great idea :slight_smile:
Agree on macro!

Hi Claire,
That look awesome! Much neater than what I had. I gave it a go for my scenario.

  • Note - I care about strategy=check & check_cols="all" only - that’s the scenario & data I have.
  • I am testing an edit in page (between snapshots)
  • When I run the next snapshot, I get a syntax error for a trailing “)”
  • After reading logs/dbt.log, it looks like staging_table_insertions is running (but this is very new territory to me)

I’ve made it work by going to line 31 in your macros/snapshot_check_cols_with_hard_delete.sql , and changing this:

{{ current_rel }}.{{ primary_key }}) is null or

To this (remove that closing-round-bracket after primary_key):

{{ current_rel }}.{{ primary_key }} is null or

Is there another reason that closing bracket is in there? I’m only testing my own scenario, I have no idea what else is out there!

That closing bracket was giving me a little bit of grief — I knew it shouldn’t be there, but when I was using check_cols=['col_a', 'col_b'] I couldn’t get it to work without it. Rather than go too deep, I figured it was easier to commit slightly broken code in the hopes that someone else would fix it :joy:

Long-overdue update: Here’s the final macro we’re using to identify hard deletes and create a is_deleted column.

First and foremost, it’s worth saying that if you can use @claire’s setup, you definitely should. For us, some of the hard deletes that we’re snapshotting are associations between two entities which could be deleted then re-inserted, which makes the is_deleted column important instead of just closing out the dbt_valid_to field.

We’ve landed on a two-step approach:

  1. Take a standard dbt snapshot of all live data
  2. Take another snapshot, this time of the live data plus any records which exist in the first snapshot but not in the live data (i.e. those which have been deleted).

Here’s the macro we use (I’ve deleted my earlier post to remove my first attempt at the macro, it doesn’t work)

{% macro snapshot_hard_delete(current_data_ref, scratch_snapshot_ref, unique_key) %}
    with live as (
      select * 
      from {{current_data_ref}}
    ), 

    snap as (
      select *
      from {{scratch_snapshot_ref}}
    ),

    deleted_records as (
      select 
        {{dbt_utils.star(scratch_snapshot_ref, except=['dbt_scd_id', 'dbt_updated_at', 'dbt_valid_from', 'dbt_valid_to'])}} 
        from snap
      where {{unique_key}} not in (
        select {{unique_key}} from live
      )
      and dbt_valid_to is null
    ),

    final as (
      select *, false as is_deleted
      from live

      union
      
      select *, true as is_deleted
      from deleted_records
    )

    select * 
    from final
{% endmacro %}

Usage:

{% snapshot snap_sales__revenuedetails %}

  {% set key_pattern = 'id' %}
    {{
        config(
          target_database='reporting',
          target_schema='snapshots',
          unique_key=key_pattern,
          
          strategy='check',
          check_cols = 'all',
        )
    }}

    {{
      snapshot_hard_delete(
        current_data_ref = source('sales', 'revenuedetails'), 
        scratch_snapshot_ref = ref('snap_sales__revenuedetails_scratch'),
        unique_key = key_pattern
      )
    }}

{% endsnapshot %}

Worth noting that key_pattern can be anything that evaluates to valid SQL, so could be a combination of several columns.

1 Like

TLDR: I also have the use case where records get deleted and then brought back to life. I am pro writing to dbt_valid_to on deletion and creating a new record on reinstatement to make sure I have a complete history. I’m super interested to hear if people have a similar problem and if there’s another way of looking at it :slight_smile:

We have a similar case, where we have a source system that represents relationships between entities in a table with a composite key of the unique identifiers of the entities. These relationships can be hard-deleted and then reinstated.

Since dbt snapshots are so convenient for tracking history over time and understanding what has happened. I wanted to make sure how we tracked these deletes and reinstantiation fit my mental model that I was using for snapshots.

What I am trying to achieve with snapshots:

  1. For any point in time (at the granularity of the frequency of the snapshot runs), we can understand what was the status of the world at that point in time. (IE, who were members of which organization)
  2. For any relationship, what is the history of those two entities (IE. was this user a member of this organization through august, then it was deleted, but they’ve just been invited back)

dbt_valid_to and dbt_valid_from are integral pieces for me to be able to understand what existed at any point in time and recreate the history. So I wanted the case of a hard delete, to lead to invalidating the previous existing record, ie set dbt_valid_to. When the record is reinstated, I wanted it to create a new row with the current time’s timestamp. That way I can understand the period of time when the record had been deleted as the delta of time between the previous dbt_valid_to and the current record’s dbt_valid_from.

Heres the logic I used to achieve this:

{% macro snapshot_check_delete_strategy(node, snapshotted_rel, current_rel, config, target_exists) %}
    {% set check_cols_config = config['check_cols'] %}
    {% set primary_key = config['unique_key'] %}
    {% set select_current_time -%}
        select {{ snapshot_get_time() }} as snapshot_start
    {%- endset %}
    {# don't access the column by name, to avoid dealing with casing issues on snowflake #}
    {%- set now = run_query(select_current_time)[0][0] -%}
    {% if now is none or now is undefined -%}
        {%- do exceptions.raise_compiler_error('Could not get a snapshot start time from the database') -%}
    {%- endif %}
    {% set updated_at = snapshot_string_as_time(now) %}
    {% set column_added = false %}
    {% if check_cols_config == 'all' %}
        {% set column_added, check_cols = snapshot_check_all_get_existing_columns(node, target_exists) %}
    {% elif check_cols_config is iterable and (check_cols_config | length) > 0 %}
        {% set check_cols = check_cols_config %}
    {% else %}
        {% do exceptions.raise_compiler_error("Invalid value for 'check_cols': " ~ check_cols_config) %}
    {% endif %}
    {%- set row_changed_expr -%}
    (
    {%- if column_added -%}
        TRUE
    {%- else -%}
    -- this indicates that the source record was hard-deleted
    ({{ current_rel }}.{{ primary_key }} is null) or
    {% for col in check_cols -%}
        {{ snapshotted_rel }}.{{ col }} != {{ current_rel }}.{{ col }}
        or
        ({{ snapshotted_rel }}.{{ col }} is null) != ({{ current_rel }}.{{ col }} is null)
        {%- if not loop.last %} or {% endif -%}
    {%- endfor -%}
    {%- endif -%}
    )
    {%- endset %}
    {% set scd_id_expr = snapshot_hash_arguments([primary_key, updated_at]) %}
    {% do return({
        "unique_key": primary_key,
        "updated_at": updated_at,
        "row_changed": row_changed_expr,
        "scd_id": scd_id_expr
    }) %}
{% endmacro %}
{% macro snapshot_staging_table(strategy, source_sql, target_relation) -%}
    with snapshot_query as (
        {{ source_sql }}
    ),
    snapshotted_data as (
        select *,
            {{ strategy.unique_key }} as dbt_unique_key
        from {{ target_relation }}
    ),
    insertions_source_data as (
        select
            *,
            {{ strategy.unique_key }} as dbt_unique_key,
            {{ strategy.updated_at }} as dbt_updated_at,
            {{ strategy.updated_at }} as dbt_valid_from,
            nullif({{ strategy.updated_at }}, {{ strategy.updated_at }}) as dbt_valid_to,
            {{ strategy.scd_id }} as dbt_scd_id
        from snapshot_query
    ),
    updates_source_data as (
        select
            *,
            {{ strategy.unique_key }} as dbt_unique_key,
            {{ strategy.updated_at }} as dbt_updated_at,
            {{ strategy.updated_at }} as dbt_valid_from,
            {{ strategy.updated_at }} as dbt_valid_to

        from snapshot_query
    ),
    deleted_records as (
      select
        {{ strategy.unique_key }} as dbt_unique_key
        from snapshotted_data
        group by 1 having  count(dbt_valid_to) = count(*)
    ),

    insertions as (

        select
            'insert' as dbt_change_type,
            source_data.*
        from insertions_source_data as source_data
        left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_unique_key is null
           or (
                snapshotted_data.dbt_unique_key is not null
            and snapshotted_data.dbt_valid_to is null
            and (
                {{ strategy.row_changed }}
            )
        )
    ),
    updates as (
        select
            'update' as dbt_change_type,
            source_data.* except(dbt_valid_to),
            coalesce(source_data.dbt_valid_to,{{ strategy.updated_at }}) as dbt_valid_to,
            snapshotted_data.dbt_scd_id
        from snapshotted_data
        left join updates_source_data as source_data
        on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_valid_to is null
        and (
            {{ strategy.row_changed }}
        )
    ),

    reinsertions as (
      select
            'insert' as dbt_change_type,
            source_data.*
        from insertions_source_data as source_data
        join deleted_records on deleted_records.dbt_unique_key = source_data.dbt_unique_key
    )

    select * from insertions
    union all
    select * from updates
    union all
    select * from reinsertions


{%- endmacro %}

Curious if anyone else has different ways of thinking about this problem :smiley: