We have a ‘lakehouse’ set up in which the source data is brought into a staging schema, then history is tracked and stored in an ods schema on top of that using SCD type 2.
The tables I am now modelling for the warehouse side have many months of history already stored in this ods layer as SCD type 2. The changes are tracked across all columns in this ods layer.
However, for our dimensional data warehouse, only some of the columns in this table need to be tracked for changes to build an timestamped accumulating snapshot fact table.
dbt snapshotting capability seems to be solely for tracking changes which occur to SCD type 1 table (overwrite).
If I follow this rule, I would need to create the snapshot from our staging schema. The baseline snapshot would then be the most recent version of each row, and I would lose all of the accumulated history already stored in the ods.
Alternatively, I try to create an SCD type 2 table over the existing SCD type 2 table (ods schema), tracking against fewer columns using the dbt ‘check strategy’. There are then two issues to tackle:
- We need an initial load which re-constructs the SCD type 2 (but tracking changes over fewer columns) for all the existing history.
- dbt needs to be able to track changes which are loaded as new rows rather than overwrites.
Issue number (2) in particular I would expect dbt to have a solution for. It is not uncommon to have a source table which is incrementally loaded rather than type 1 updates. We have plenty of data sources which are event based xml, so each time a change is made the whole row is extracted and sent to us as xml. If dbt is unable to create SCD on this type of incrementally loaded source data, this is a serious limitation.
Has anyone found a solution to this?