Snapshotting source and dimension tables?

ShaunLF · November 18, 2022, 12:21pm

Hi all,

I have read that snapshotting source data is best practice, but we are planning on snapshotting our dimension tables once we have modelled the data.

Should we do both? Is the source snapshot just used as a fail-safe to roll back to any point in time if anything goes seriously wrong? Or are there any additional advantages of doing this?

Thanks!

viciwuoha · November 18, 2022, 2:01pm

@ShaunLF, I actually think one is fine, unless you are performing serious upstream changes on your source tables to arrive at your dimension…
If so, then it’s logical and efficient to stick to snapshotting your dimensions when they are SCDs.
Yes, a source snapshot can serve as a failover, but I assume your source tables landed from a process that can be regenerated.
It’s a two-way thing, depending on what works for your use case.
I wouldn’t want to create snapshots and pay for extra storage on sources if I can regenerate them from a data lake layer in case of failure. That’s why dimension snapshotting will be better for tracking changes over time.

ShaunLF · November 19, 2022, 9:23am

Thanks Victor - appreciate your response!

We are currently ingesting from a SQL database that overwrites changes, so we wouldn’t be able to roll back to a given point in time.
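
To make the overwriting problem concrete, here is a toy Python sketch of what a type 2 snapshot records when it is run repeatedly against a table that only ever holds the latest values. It is not dbt's implementation: the function name `take_snapshot`, the `valid_from`/`valid_to` column names, and the sample rows are all made up for illustration, and it ignores deleted rows entirely.

```python
from datetime import datetime

def take_snapshot(history, current_rows, now, key="id"):
    """Compare the current source rows with the open history rows.

    Changed rows get their open version closed (valid_to = now) and a new
    version appended; unchanged rows are left alone. Hypothetical helper.
    """
    open_rows = {r[key]: r for r in history if r["valid_to"] is None}
    for row in current_rows:
        prev = open_rows.get(row[key])
        if prev is not None:
            unchanged = all(prev[col] == val for col, val in row.items())
            if unchanged:
                continue
            prev["valid_to"] = now  # close the superseded version
        history.append({**row, "valid_from": now, "valid_to": None})
    return history

history = []
# First run: the source holds the original email.
take_snapshot(history, [{"id": 1, "email": "a@old.example"}], datetime(2022, 11, 1))
# Later run: the source has overwritten the email in place.
take_snapshot(history, [{"id": 1, "email": "a@new.example"}], datetime(2022, 11, 19))

for version in history:
    print(version)
```

After the second run the history holds both versions of the row, even though the source itself only remembers the latest one.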

joellabes · November 25, 2022, 2:44am

The use case for snapshotting your source data is described here: Add snapshots to your DAG | dbt Developer Hub

Also check out the best practices detailed further down on the same page: Add snapshots to your DAG | dbt Developer Hub

The reason to avoid snapshotting your final dimension tables is that if you find a bug and need to change your modelling code, you can find yourself with inaccurate results and no way of recalculating the correct data. By contrast, if you snapshot the source data and build on top of that, you can always recalculate everything from first principles if/when you need to.
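
A small Python sketch of that argument, under invented assumptions: `source_snapshot` stands in for a snapshot of the raw table, and `build_dim_v1` / `build_dim_v2` are hypothetical versions of the modelling logic before and after a bug fix. None of these names come from dbt.

```python
# Hypothetical history captured by snapshotting the raw source table.
source_snapshot = [
    {"id": 1, "country": "uk", "valid_from": "2022-01-01", "valid_to": "2022-06-01"},
    {"id": 1, "country": "GB", "valid_from": "2022-06-01", "valid_to": None},
]

def build_dim_v1(rows):
    # Original modelling logic with a bug: "uk" is passed through as-is.
    return [{**r, "country_code": r["country"]} for r in rows]

def build_dim_v2(rows):
    # Fixed logic: map known aliases, otherwise upper-case the raw value.
    aliases = {"uk": "GB"}
    return [
        {**r, "country_code": aliases.get(r["country"].lower(), r["country"].upper())}
        for r in rows
    ]

# If only the v1 output had been snapshotted, the first period would be
# stuck with the wrong code and the raw value would be gone.
stale = build_dim_v1(source_snapshot)

# With the raw history kept, the fixed logic can be re-run over every period.
rebuilt = build_dim_v2(source_snapshot)

print([r["country_code"] for r in stale])
print([r["country_code"] for r in rebuilt])
```

The design point is that the snapshot stores inputs rather than outputs, so a fix to the modelling code applies to historical periods as well as new ones.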

I certainly snapshot some final dimensional tables when the additional effort to make the table fully independent of time isn’t worth it (or I can’t be bothered), but it’s pretty much always in addition to snapshotting the source table, not a replacement for it.

ShaunLF · November 25, 2022, 1:32pm

Thanks for your input Joel - much appreciated! Pretty much what I was thinking, and we have now reworked our project to snapshot source data.

