Thanks for the detailed question! As I understand it, the reason you wouldn't want to use the `insert_overwrite` strategy is that sometimes records from several days ago get inserted alongside yesterday's data. If you used the dynamic `insert_overwrite` strategy, you'd risk replacing a full partition of preexisting data with only a handful of new records.
I have two thoughts:

- It sounds like you may want something in between simple `merge` and dynamic `insert_overwrite`: a strategy that dynamically calculates which partitions to merge into, and then performs the merge with a `unique_key`, rather than replacing the entire partition. The code for that is definitely possible (there's a rough sketch below), though slower than `insert_overwrite`. If this is a common problem that folks run into, we may want to add support for it as a new incremental strategy in a future release.
- How many late-arriving records do you have? Depending on the data source, we often find that the proportion of late-arriving facts (e.g. web/mobile events) is low enough that we can ignore any records which arrive more than 3 days late during incremental runs, and fold them in during weekly or monthly full refreshes. If you've got 10+ TB of source data, that $50+ full-refresh build feels expensive, but paying for it once a week or once a month is far cheaper than trying to manage more complex code to handle 0.001% of your edge cases. Plus, if you're always running static `insert_overwrite` jobs for the prior 3 days (see the config sketch at the end), you save enough in both time and money to more than make up for the occasional full rebuild.
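
For the first idea, here's a rough sketch of what that in-between approach could look like: a `merge` with a `unique_key`, where the partitions being merged into are limited by a lookback window rather than truly calculated from the incoming data. The model name, columns, source, and 7-day window are all placeholders, and the `incremental_predicates` config assumes a dbt-bigquery version that supports it.

```sql
-- hypothetical events model; names and the 7-day window are placeholders
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'merge',
        unique_key = 'event_id',
        partition_by = {'field': 'event_date', 'data_type': 'date'},
        -- prune the target side of the MERGE so only recent partitions are scanned;
        -- assumes a dbt-bigquery version that supports incremental_predicates
        incremental_predicates = [
            "DBT_INTERNAL_DEST.event_date >= date_sub(current_date, interval 7 day)"
        ]
    )
}}

select
    event_id,
    event_date,
    user_id,
    payload
from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  -- only reprocess source rows from the lookback window, which covers late arrivals
  where event_date >= date_sub(current_date, interval 7 day)
{% endif %}
```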
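
And for the second idea, this is roughly what a static `insert_overwrite` job covering the prior 3 days could look like (again, the model name, columns, and source are placeholders):

```sql
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)',
    'date_sub(current_date, interval 2 day)',
    'date_sub(current_date, interval 3 day)'
] %}

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        partition_by = {'field': 'event_date', 'data_type': 'date'},
        -- a static list of partitions means dbt skips the extra query that
        -- figures out which partitions to replace, so the cost is predictable
        partitions = partitions_to_replace
    )
}}

select
    event_id,
    event_date,
    user_id,
    payload
from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  -- only select source rows that fall in the partitions being replaced
  where event_date in ({{ partitions_to_replace | join(', ') }})
{% endif %}
```

Each incremental run replaces exactly those three partitions, and anything that shows up more than 3 days late gets folded in by the periodic full refresh.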