But I’m curious: is this worth it in the long run?
The example in that forum post syncs rows from within a hard-coded last-24-hours window, which seems unwise; I would actually want a non-idempotent system to sync all rows starting from the last synced row, so that no rows are ever missed. So I feel the comparison in that post is a bit unfair.
My main question, though, is about the long-term aspects of using dbt: as my dataset grows, I’ll end up with millions of rows, so I would expect my dbt runs to get slower over time, right?
If dbt only processed the new rows instead, it wouldn’t get slower over time.
How do you think about this trade-off? Is it acceptable because we expect processing time to grow only sub-linearly with the number of rows, since SQL transformations are efficient?
The hard thing is that data transformation logic changes pretty regularly. At least, it has in all the projects I’ve ever worked on. Combine that with a team working on the code, and all the other things that can be forgotten or go wrong, and it’s just simpler to say: “every time we process the data, we redo it all. That way we can be sure there are no lingering issues with state, or somebody forgetting to redo/purge something that changed.”
Now, can you get data sets large enough that it’s difficult to reprocess everything every time? Yes, absolutely. On Snowflake, BigQuery, or similar systems I would say millions of rows are small, but if you get into the billions and need to reprocess your data every hour, or even every day, it can get a little expensive. In those cases I’ve tended to use incremental models, which do introduce some notion of state if the schema or transformation logic changes. What I’ve usually done then is a once-a-week `--full-refresh`, which rebuilds the incremental models from scratch, so even if we forget to redo something during the regular scheduled runs, the `--full-refresh` catches it.
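Concretely, that weekly reset is just the regular run command with the `--full-refresh` flag added; how it actually gets scheduled (cron, Airflow, dbt Cloud, etc.) is up to whatever orchestrator you use:

```shell
# Regular scheduled runs: incremental models only process new rows
dbt run

# Weekly job: rebuild incremental models from scratch, wiping any drifted state
dbt run --full-refresh
```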
In short - I absolutely think that idempotency is still worth it in the long run, even balanced against the increased processing costs. Every data warehousing project I’ve worked on that wasn’t using an idempotent approach gradually built up a huge amount of technical debt dealing with all of these potential failure modes, and the team had to spend a lot of time working around them, troubleshooting them, or, in some cases, shipping incorrect results or silent errors/bugs because of them. Yes, there are some compromises or workarounds required at large data volumes in dbt (like incremental models, or in some cases materialized views that have to be maintained separately).
This is possible if you take advantage of the `{{ this }}` variable, as described here. In short, as long as you have a reliable high-water mark in your table, you can do something like `where src.id > (select max(id) from {{ this }})` instead of a hard-coded window.
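Pulled into a full model, that pattern looks something like this (a minimal sketch - the model, source, and column names are hypothetical, and it assumes `id` only ever increases):

```sql
-- models/events_incremental.sql (hypothetical model, source, and column names)
{{ config(materialized='incremental', unique_key='id') }}

select
    src.id,
    src.event_type,
    src.created_at
from {{ source('app', 'events') }} as src

{% if is_incremental() %}
  -- Only pull rows beyond the current high-water mark; `this` resolves to
  -- the already-built table in the warehouse.
  where src.id > (select max(id) from {{ this }})
{% endif %}
```

On the first run, or with `--full-refresh`, `is_incremental()` is false, so the filter is skipped and the whole source is selected - which is what gives you the periodic clean slate @josh described.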
On the limits of incrementality is another good post as a chaser to your original link. And as @josh says, occasionally doing a `--full-refresh` is a way to clean up if you have very late-arriving facts that you care about.
> The hard thing is that data transformation logic changes pretty regularly.
What if, in some cases, you want values to be processed only once and then kept fixed?
Would you suggest using dbt snapshots to keep track of how rows in a model have changed over time?
Yes, snapshots are a great way to keep track of data where it’s very important that it never change again (e.g. numbers that have already been given in a board report).
In that case, you’re trading off flexibility for reliability. If you know that’s the right decision for your context, you’re OK!
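If it helps, a snapshot is just a select statement wrapped in a snapshot block. Here’s a minimal sketch using the timestamp strategy, with hypothetical table and column names:

```sql
-- snapshots/orders_snapshot.sql (hypothetical names)
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Each run of `dbt snapshot` compares updated_at and preserves prior versions
select * from {{ source('app', 'orders') }}

{% endsnapshot %}
```

Running `dbt snapshot` adds `dbt_valid_from` / `dbt_valid_to` columns and keeps every prior version of a row, so values that have already been reported stay queryable even if the source later changes.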