Handling hard-deletes from source tables in Snapshots

Hi all,

We’ve got a number of tables we’re snapshotting daily, but rows in their source tables are hard-deleted, so those deletions are never marked as expired in the snapshots. It’s not possible for us to change to soft deletes on the application side, so I’m trying to work out the best way to detect the deletions from within dbt.

An initial assumption:
One assumption I’ve made is that an explicit is_deleted column, which starts a new dbt_valid_from range:

id | price| quantity | is_deleted | dbt_valid_from | dbt_valid_to
1  | 100  | 500      | false      | 2020-01-01     | 2020-01-03
1  | 100  | 500      | true       | 2020-01-03     | null 

is better than just closing off dbt_valid_to:

id | price| quantity | dbt_valid_from  | dbt_valid_to
1  | 100  | 500      |  2020-01-01     | 2020-01-03

because I can’t think of a dbt-y way to update those records after the snapshot is taken (maybe a post-run-hook, but it feels fragile to not do everything inside of the one transaction.)

Finding the records
To get the missing records, I can select from the source table and then do a full outer join against the current snapshot records to find any rows that have disappeared:

select
  coalesce(src.id, snap.id) as id,
  coalesce(src.price, snap.price) as price,
  coalesce(src.quantity, snap.quantity) as quantity,
  coalesce(src.modifieddate, getdate()) as modifieddate,
  src.id is null as is_deleted
from {{ source('sales', 'revenuedetails') }} as src
full outer join {{ ref('sales__revenuedetails') }} as snap
  on src.id = snap.id
  and snap.dbt_valid_to is null

Right now we’re using the timestamp strategy since we have a modification date column. However, as far as I can tell, I’d have to use getdate() for the deleted rows, meaning that every snapshot we take would create another row (for deleted records at least), and I’m effectively getting close to creating an SCD-4 history table.

The alternative is to move from the timestamp strategy to the check strategy (excluding the modifieddate column), but this table easily has a dozen columns that would need to be checked, and that feels equally gross.
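For reference, the check-strategy version might look roughly like this. The snapshot name, target schema, and upstream model name are illustrative, and the column list is abbreviated:

```sql
{% snapshot sales__revenuedetails_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='check',
      -- the real table has a dozen-odd columns; every one except
      -- modifieddate would need listing here
      check_cols=['price', 'quantity', 'is_deleted']
    )
}}

select * from {{ ref('sales__revenuedetails_with_deletes') }}

{% endsnapshot %}
```

There’s no “all columns except X” option for check_cols, which is why the list gets long: check_cols='all' would pull modifieddate back in and defeat the point.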

Three questions:

  • Am I correct that it’s better to add an is_deleted column over just updating the final dbt_valid_to column?
  • If so, is my only choice between creating redundant rows for deleted records (due to using getdate() as the modified date) and having only one row per “true” change at the cost of a very long check_cols argument?
  • If so, do you have a preference?

Thank you!


This is a great question and here’s the way we have gone about handling hard deletes:

  1. Make a table with the view of your data (e.g. schema.current_table), but add a pre-hook that creates a table based on what is currently in prod (e.g. CREATE OR REPLACE TABLE schema.previous_table AS SELECT * FROM schema.current_table). The table created via the pre-hook will help you identify deletes.
  2. Create a table that holds the deleted records by using the MINUS operator (disclosure: I use Snowflake). The new table will be called something like schema.table_deleted_records:

     SELECT column_a
       , column_b
     FROM schema.previous_table
     MINUS
     SELECT column_a
       , column_b
     FROM schema.current_table

  3. Then in another table, union the data and add a column is_deleted (FALSE for current data, TRUE for deleted records):

     SELECT column_a
       , column_b
       , TRUE AS is_deleted
     FROM schema.previous_table
     INNER JOIN schema.table_deleted_records
        ON << unique key >>

     UNION ALL

     SELECT column_a
       , column_b
       , FALSE AS is_deleted
     FROM schema.current_table

  4. Then, you will run your snapshots on the table created in step 3 with the check strategy (we know that it is best to use a timestamp, so we do that elsewhere for tables that don’t have hard deletes).
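As a sketch, step 1 could be wired into a dbt model roughly like this. The model, schema, and source names are placeholders, and note that on the very first run the model’s table won’t exist yet, so the pre-hook would need a guard for that case:

```sql
-- models/current_table.sql (hypothetical)
-- The pre-hook copies whatever is in prod right now, before the rebuild.
{{
    config(
        materialized='table',
        pre_hook="create or replace table {{ target.schema }}.previous_table as select * from {{ this }}"
    )
}}

select * from {{ source('my_source', 'my_table') }}
```

Because dbt renders hook strings when the model runs, {{ this }} resolves to the model’s own relation, i.e. the previous day’s build.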

This is just what has worked for us and is by no means the only way to go about it. For example, you could join based on the historical snapshot table as you mentioned. I think the is_deleted column is better than overwriting the dbt_valid_to column.


Thanks @ddaltonrtr, that’s really useful! We’re on Redshift which doesn’t look to support OR REPLACE so I think I’ll have to stick with joining back to the historical snapshot table, but it’s still great to know that I should stick with the check strategy and add an is_deleted column.

I quite like the idea of doing two SELECTs and then UNIONing them with a true/false for is_deleted, that might be a bit more elegant than having to COALESCE everything.

Hi - In Redshift, you could truncate the table first and then create the table if you wanted to follow this pattern. Otherwise, you could continue to join back onto the historical snapshot table.
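If it helps, the Redshift variant of that pre-hook might look something like this. Table names are the placeholders from the earlier steps, and this assumes previous_table has already been created once:

```sql
-- Redshift has no CREATE OR REPLACE TABLE, so empty the copy and refill it
truncate table schema.previous_table;

insert into schema.previous_table
select * from schema.current_table;
```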

The one part I don’t understand with truncating/replacing the table is that if you create table_deleted_records based solely on the current contents of table, and then subtract one from the other, won’t your result be an empty set? How do you keep track of the records which have been deleted if you truncate table_deleted_records and rebuild from scratch every time?

Hi, I’m just now coming across the hard-delete complications. I’m trying to follow some of what you each have listed. Did either of you try joining the snapshot back to the source, only to create a circular reference?

Encountered an error:
Found a cycle: snapshot.project.snapshot_table --> model.project.added_is_deleted_table --> snapshot.project.snapshot_table

The error makes sense :slight_smile: it is indeed a cycle!
Did anyone hit this and find a way past while adding the is_deleted column to a table that is then snapshotted?
I’m assuming I’d need to remove a {{ref(...)}} somewhere, but feels hacky and against the point of dbt.

P.S. In the original post, you mentioned “just closing off dbt_valid_to”. Do you know a way to do that? I’d find that much neater for my scenarios, if there’s a way. I thought Issue #249 implied there was no way at the moment, though.
Also - thanks for the ideas for me to try on this problem!

I typically create the table_deleted_records based on what is actively in your materialized table in the database right at that moment (before you recreate the table with your latest data). Then, I create the new table and minus that from the table_deleted_records. This will only be an empty set if there are not any deleted records. I keep track of the deleted records in the snapshot table. Once they are deleted and recorded in the snapshot table, I don’t need to store that information anywhere else.

I’m not sure I understand the circular reference part of your question, but I can comment on closing off dbt_valid_to. For that, we just use a variable (which we call the_distant_future and define in the dbt_project.yml). Then, we just coalesce dbt_valid_to with that variable!
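If it’s useful to anyone, that pattern might look roughly like this (the var name and sentinel date are just what we happen to use, and the snapshot name is a placeholder):

```sql
-- assumes dbt_project.yml contains:
--   vars:
--     the_distant_future: '9999-12-31'
select
    *,
    -- open-ended records get a concrete far-future end date
    coalesce(dbt_valid_to, '{{ var("the_distant_future") }}') as dbt_valid_to_closed
from {{ ref('my_snapshot') }}
```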

@Jeb yep, I tried setting this up properly yesterday and discovered that flaw in my cunning plan. I’m currently working towards setting up a “snapshot sandwich” so that dbt can be in charge of dealing with new columns etc, instead of doing it as a pre-hook.

So far this is the approach I’ve taken (made into a macro so I can reuse it):

{% macro snapshot_hard_delete(current_data_ref, scratch_snapshot_ref) %}
    with live as (
      select *
      from {{ current_data_ref }}
    ),

    snap as (
      select {{ dbt_utils.star(scratch_snapshot_ref, except=['dbt_scd_id', 'dbt_updated_at', 'dbt_valid_from', 'dbt_valid_to']) }}
      from {{ scratch_snapshot_ref }}
    ),

    -- records that exist in the scratch snapshot but no longer in the source
    deleted_records as (
      select * from snap
      except
      select * from live
    ),

    final as (
      select *, false as is_deleted
      from live

      union all

      select *, true as is_deleted
      from deleted_records
    )

    select *
    from final
{% endmacro %}

scratch_snapshot_ref being the first snapshot I take, which is in its own schema and will never be used for any modelling. If I get it all sorted and reliable, I’ll post the full summary back here, but hopefully that helps in the meantime.
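For anyone following along, calling the macro from a model might look like this (the model and snapshot names here are placeholders for my actual ones):

```sql
-- models/sales__revenuedetails_with_deletes.sql (hypothetical)
{{ snapshot_hard_delete(
    source('sales', 'revenuedetails'),
    ref('sales__revenuedetails_scratch')
) }}
```

This model is the filling of the “snapshot sandwich”: the scratch snapshot feeds it, and its output is what the real, check-strategy snapshot then runs over.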

Another approach is to write your own custom snapshot strategy that handles hard deletes. Here’s a Loom video of some work I did on this front!

Here’s a link to how I added this just to my (test) project. And one for an open PR on dbt to change this core functionality.

Hi Joel,
For my purposes (I’m still in prototype), I’ve probably done something similar. I’ve abandoned the {{ref(snapshot_table)}} and just used FROM snapshot_table directly. I would like to find a dbt-friendly way for a prod deploy though.
My only other observation after doing this: I realised the chances were high of a raw source table already having a column named is_deleted, so I renamed mine on the 2nd pass. Renaming it after I’d already run it the 1st time was not a great idea :slight_smile:
Agree on macro!

Hi Claire,
That looks awesome! Much neater than what I had. I gave it a go for my scenario.

  • Note - I care about strategy=check & check_cols="all" only - that’s the scenario & data I have.
  • I am testing an edit in page (between snapshots)
  • When I run the next snapshot, I get a syntax error for a trailing “)”
  • After reading logs/dbt.log, it looks like staging_table_insertions is running (but this is very new territory to me)

I’ve made it work by going to line 31 in your macros/snapshot_check_cols_with_hard_delete.sql , and changing this:

{{ current_rel }}.{{ primary_key }}) is null or

To this (removing the closing round bracket after primary_key):

{{ current_rel }}.{{ primary_key }} is null or

Is there another reason that closing bracket is in there? I’m only testing my own scenario, I have no idea what else is out there!

That closing bracket was giving me a little bit of grief — I knew it shouldn’t be there, but when I was using check_cols=['col_a', 'col_b'] I couldn’t get it to work without it. Rather than go too deep, I figured it was easier to commit slightly broken code in the hopes that someone else would fix it :joy: