Hi,
This is a great question and here’s the way we have gone about handling hard deletes:
- Make a table with a view of your data (e.g. schema.current_table), but add a pre-hook that creates a table based on what is currently in prod (e.g. schema.previous_table):
CREATE OR REPLACE TABLE schema.previous_table AS SELECT * FROM schema.current_table
The table created via the pre-hook will help you identify deletes.
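If it helps, the pre-hook can live in the model's config block. This is just a minimal sketch, assuming the model is named current_table; the hook SQL is the same statement as above, and the upstream relation is a hypothetical stand-in:

-- models/current_table.sql (model and upstream names are assumptions)
-- the pre-hook runs before the model is rebuilt, so it captures what is currently in prod
{{ config(
    materialized='table',
    pre_hook='CREATE OR REPLACE TABLE schema.previous_table AS SELECT * FROM schema.current_table'
) }}

SELECT * FROM raw.my_source_table

One caveat: on the very first run schema.current_table won't exist yet, so you may need to create it once by hand or guard the hook.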
- Create a table that has the deleted records by using the MINUS operator (disclosure: I use Snowflake). The new table will be called something like schema.table_deleted_records:
SELECT
column_a
, column_b
FROM schema.previous_table
MINUS
SELECT
column_a
, column_b
FROM schema.current_table
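As an aside, MINUS in Snowflake is just a synonym for the ANSI EXCEPT operator, so on a warehouse without MINUS the same diff can be written as:

SELECT column_a, column_b FROM schema.previous_table
EXCEPT
SELECT column_a, column_b FROM schema.current_table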
- Then, in another table, union the data and add an is_deleted column (FALSE for current data, TRUE for the deleted records):
SELECT
prev.column_a
, prev.column_b
, TRUE AS is_deleted
FROM schema.previous_table AS prev
INNER JOIN schema.table_deleted_records AS deleted
ON << unique key >>
UNION
SELECT
column_a
, column_b
, FALSE AS is_deleted
FROM schema.current_table
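Putting that together as its own model (which is what the snapshot in the next step selects from) looks roughly like this. The model name current_with_deletes and the key column id are hypothetical stand-ins, and whatever your real unique key is would also need to be carried through the MINUS query above:

-- models/current_with_deletes.sql (model name and id key are hypothetical)
{{ config(materialized='table') }}

SELECT
    prev.id
    , prev.column_a
    , prev.column_b
    , TRUE AS is_deleted
FROM schema.previous_table AS prev
INNER JOIN schema.table_deleted_records AS deleted
    ON prev.id = deleted.id

UNION

SELECT
    id
    , column_a
    , column_b
    , FALSE AS is_deleted
FROM schema.current_table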
- Then run your snapshots on the table created in the previous step using the check strategy (we know it is best to use a timestamp strategy when you can, so we do that elsewhere for tables that don't have hard deletes).
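For reference, the snapshot ends up looking something like the sketch below; the snapshot name, target schema, unique key, and check_cols are all assumptions to swap for your own:

-- snapshots/current_with_deletes_snapshot.sql (all names are assumptions)
{% snapshot current_with_deletes_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols=['column_a', 'column_b', 'is_deleted']
    )
}}

SELECT * FROM schema.current_with_deletes

{% endsnapshot %}

Including is_deleted in check_cols is what makes the snapshot record a new version of a row when it gets hard-deleted upstream.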
This is just what has worked for us and is by no means the only way to go about it. For example, you could join against the historical snapshot table as you mentioned. I think the is_deleted column is better than overwriting the dbt_valid_to column.