BigQuery + dbt: Incremental Changes

Thanks for the detailed question! As I understand it, the reason you wouldn’t want to leverage the insert_overwrite strategy is because

sometimes there’s going to be records from several days ago getting inserted as well as yesterday’s data

If you used the dynamic insert_overwrite strategy, you’d risk replacing a full partition of preexisting data with only a handful of new records.

I have two thoughts:

  1. It sounds like you may want something in between a simple merge and the dynamic insert_overwrite: a strategy that dynamically calculates which partitions to merge into, and then performs the merge with a unique_key rather than replacing each partition wholesale. The code for that is definitely possible to write (there’s a rough sketch of the SQL it would need to generate after this list), though it will be slower than insert_overwrite. If this is a common problem that folks run into, we may want to add support for it as a new incremental strategy in a future release.

  2. How many late-arriving records do you have? Depending on the data source, we often find that the proportion of late-arriving facts (e.g. web/mobile events) is low enough that we can ignore any records that arrive more than 3 days late during incremental runs, and fold them in during weekly or monthly full refreshes. If you’ve got 10+ TB of source data, that $50+ full-refresh build feels expensive, but paying for it once a week or once a month is far cheaper than maintaining more complex code to handle 0.001% of your edge cases. Plus, if you’re always running static insert_overwrite jobs for the prior 3 days (example model config below), you save enough in both time and money to more than make up for the occasional full rebuild.
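To make the first idea concrete, here is a rough sketch of the BigQuery script such an “in between” strategy would need to generate. Everything here is illustrative: the table names (fct_events and its temp table), the unique key (event_id), the partition column (event_date), and the measure column (revenue) are made-up stand-ins, and the partition-pruning pattern is borrowed from what the dynamic insert_overwrite strategy already compiles to.

```sql
-- Hypothetical compiled SQL: merge on a unique key, but only scan the
-- destination partitions that the new batch actually touches.
declare partitions_to_merge array<date>;

-- find which destination partitions appear in the new batch
set (partitions_to_merge) = (
  select as struct
    array_agg(distinct event_date)
  from `my-project.analytics.fct_events__dbt_tmp`
);

merge into `my-project.analytics.fct_events` as dest
using `my-project.analytics.fct_events__dbt_tmp` as src
  -- prune the destination scan to the touched partitions, then match on
  -- the unique key instead of replacing each partition wholesale
  on dest.event_date in unnest(partitions_to_merge)
  and dest.event_id = src.event_id
when matched then
  update set revenue = src.revenue
when not matched then
  insert (event_id, event_date, revenue)
  values (src.event_id, src.event_date, src.revenue);
```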
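And for the second idea, a static insert_overwrite model that always reprocesses the last few days looks roughly like this. The model name, source, and date column are assumptions for the sake of the example:

```sql
-- models/fct_events.sql (hypothetical model; source and column names are
-- assumptions for the example)

{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)',
    'date_sub(current_date, interval 2 day)',
    'date_sub(current_date, interval 3 day)'
] %}

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'},
    partitions = partitions_to_replace
  )
}}

select *
from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  -- only process today plus the prior 3 days; records that arrive later
  -- than that get folded in by the periodic full refresh
  where event_date in ({{ partitions_to_replace | join(', ') }})
{% endif %}
```

Because the partitions to replace are fixed expressions, BigQuery can prune both the source scan and the overwrite to just those four partitions, which is what keeps the day-to-day runs cheap.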
