Data idempotence

Hey there, I was implementing a new model yesterday that runs as a daily snapshot. This particular model would really benefit from being backfilled, but because we use Stitch and get incremental updates, there’s no way to access prior states of the data.

I’ve tinkered with some of the ideas written about here, but with other work to do, I didn’t spend much time experimenting.

I’m wondering how other people deal with data idempotence (or don’t bother dealing with it at all).

Hey @jesse! This is a great topic.

dbt has a feature called archive that is meant for exactly this purpose! The current version of archival builds type-2 slowly-changing-dimension tables (if I remember the definitions correctly): instead of storing a full copy of the table every day, it stores a new row only when a record changes. That generally achieves the same result as snapshotting every day while avoiding the overhead of maintaining duplicate historical data. In the future, archive will support more “strategies”; that’s coming fairly soon, actually.
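
To show the general idea of a type-2 slowly changing dimension, here is a minimal Python sketch. It is not dbt's implementation: the function name `apply_scd2` and the `valid_from` / `valid_to` column names are assumptions made up for illustration, and it ignores deleted records.

```python
from datetime import datetime


def apply_scd2(history, source_rows, key, tracked, as_of):
    """Fold the current state of a mutable source table into a type-2 history.

    history     -- list of dicts, each with `valid_from` and `valid_to` (hypothetical names)
    source_rows -- the source table as it looks right now
    key         -- name of the primary-key column
    tracked     -- columns whose changes should create a new version
    as_of       -- timestamp of this run
    """
    # The open version of each record is the one with no end date yet.
    open_rows = {row[key]: row for row in history if row["valid_to"] is None}

    for src in source_rows:
        current = open_rows.get(src[key])
        if current is not None and all(current[c] == src[c] for c in tracked):
            continue  # nothing changed, so no new row is stored

        if current is not None:
            current["valid_to"] = as_of  # close the superseded version

        new_row = {key: src[key], "valid_from": as_of, "valid_to": None}
        new_row.update({c: src[c] for c in tracked})
        history.append(new_row)

    return history
```

Running this once a day against a record that changes once a week would store one row per week rather than one per day, which is where the saving over full daily snapshots comes from.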

Have you taken a look at archive yet? If not, highly recommended! We definitely do not recommend using dbt’s modeling functionality to build models that are non-idempotent.

Hey, thanks for the quick reply. For whatever reason I’d never looked closely at archive. So do you typically archive “raw” tables?

Yes, 100%. Archive is meant to run on top of raw, or “source” data, and then it creates additional “source” data. This process is not idempotent because, as you say, Stitch-ingested data is (in most cases) mutable: what gets captured depends on when archive runs. So you need to treat your archival data very differently from your modeled data. Generally we put it in a different schema at a minimum.
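
To make the distinction concrete: once archived history exists, a model built on top of it can be rebuilt for any past date and give the same answer every time, which is what makes backfilling possible. A rough Python sketch, assuming hypothetical `id`, `status`, `valid_from`, and `valid_to` columns (not dbt's actual archive schema):

```python
from datetime import date


def state_as_of(history, as_of):
    """Return {id: status} as the source looked on `as_of`.

    This is a pure function of the archived rows, so calling it again
    with the same inputs always returns the same result.
    """
    state = {}
    for row in history:
        started = row["valid_from"] <= as_of
        still_valid = row["valid_to"] is None or as_of < row["valid_to"]
        if started and still_valid:
            state[row["id"]] = row["status"]
    return state


# Illustrative archived history for a single record (made-up values).
history = [
    {"id": 1, "status": "trial", "valid_from": date(2018, 1, 1), "valid_to": date(2018, 1, 3)},
    {"id": 1, "status": "paid", "valid_from": date(2018, 1, 3), "valid_to": None},
]

backfilled = state_as_of(history, date(2018, 1, 2))
```

Only dates after archiving started can be reconstructed this way; anything the source overwrote before the first archive run is gone.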
