@tristan – Great post, thank you.
One additional thing I would like to see discussed further. You mentioned that incrementality is good except in cases that involve windowing. However, a classic case that involves windowing is deduplicating incoming data. I’ll usually write a window function for this, partitioning over the fields in the incoming data that duplicate rows share.
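For concreteness, here is a minimal sketch of the kind of deduplication model I mean. The source and column names (`raw.events`, `event_id`, `payload`, `loaded_at`) are hypothetical stand-ins, not anything from the original post:

```sql
-- Hypothetical staging model: keep one row per event_id,
-- preferring the most recently loaded copy.
with ranked as (

    select
        event_id,
        payload,
        loaded_at,
        row_number() over (
            partition by event_id      -- the fields that define a duplicate
            order by loaded_at desc    -- keep the latest arrival
        ) as row_num
    from {{ source('raw', 'events') }} -- raw/staging table fed by the pipeline

)

select
    event_id,
    payload,
    loaded_at
from ranked
where row_num = 1
```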
When calling APIs or consuming any sort of event-stream data, an ETL architect normally has two choices:
- At-least-once: If there is any problem with an API call, batch, workflow, or anything else along the way, the affected data will be reprocessed and in some cases re-sent to the target (which is likely your raw/staging area in the database, which dbt then consults as Sources). This means incoming data may be duplicated on import, but that is an acceptable price to pay for ensuring the data will for sure (for some practical and limited definition of “sure”) arrive even if there is a problem along the way. Usually used in distributed systems where fully reliable incoming data is a higher priority than the performance overhead of extra validation and deduplication at the target.
- At-most-once: Incoming data may be dropped along the way if there is a problem somewhere. Hopefully this happens rarely in practice, but it is still possible. Usually used in systems where directional accuracy matters more than 100% reliability and the team wants to avoid the extra complexity, delay, and performance overhead of issue detection and deduplication.
More info on at-least-once versus at-most-once delivery here: https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
In practice, I have almost always used “at-least-once” semantics. Getting users to agree to “at-most-once” is hard, because their very logical follow-up question will be “Tell me how much data I am going to lose, and when?”, to which there is no great answer. You can’t know that until you actually receive all of the data, validate it, and thoroughly verify it against the source system, and by the time you have done that you have probably already built an at-least-once system.
So, if we are usually building at-least-once systems, which require deduplicating incoming data, and the easiest way to deduplicate is with a window function, doesn’t that mean that in most cases this whole discussion about performant incrementality is moot, because we have to do windowing anyway?
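To make the tension concrete, here is a rough sketch of what that usually ends up looking like as an incremental dbt model. Again, the source and column names are hypothetical, and this is just one common shape, not a recommendation:

```sql
-- Hypothetical incremental model illustrating the question above:
-- even on the incremental path, deduplication still needs a window function.
{{ config(
    materialized = 'incremental',
    unique_key   = 'event_id'
) }}

with new_rows as (

    select
        event_id,
        payload,
        loaded_at
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
    -- only pull rows loaded since the last run of this model
    where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}

),

deduped as (

    select
        *,
        row_number() over (
            partition by event_id
            order by loaded_at desc
        ) as row_num
    from new_rows

)

select
    event_id,
    payload,
    loaded_at
from deduped
where row_num = 1
```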
If others are mostly building at-most-once systems where windowing and/or deduplication is not required, I would be interested to hear about it. My experience has generally been “at-least-once”, which requires windowing.