We have an incremental model that will update a very large Snowflake table (several billion rows). The table is clustered by date and will be updated by the model each day.
We’re weighing up the pros and cons of the “merge” strategy vs. the “append” strategy.
One potential con of the “merge” strategy is that it has to perform a table scan of the destination table to look for matching rows, which could be expensive. However, given that the destination table is clustered by date, I’m wondering if the merge will take advantage of this clustering and only scan the necessary micro-partitions. Does anyone know if this is the case?
Is your data event data or dimensional data? Is it feasible for a new row that updates or overwrites an old entity to arrive significantly later? What is the unique key of the data?
The reason to merge is that you have a good reason to scan all the data and update old rows rather than simply appending new ones to the table. Do you have such a reason? If not, you can just append.
You can also limit how far back your merge statement will scan the DESTINATION table for rows to update with the incremental_predicates config (see the Incremental models page on the dbt Developer Hub).
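For example, here's a minimal sketch of what that could look like; the model, table, and column names and the 7-day window are assumptions for illustration, not from your project:

```
-- models/fct_events.sql (hypothetical model)
-- The incremental_predicates entry below tells dbt to only scan the last
-- 7 days of the destination table (aliased DBT_INTERNAL_DEST in dbt's
-- generated merge) when looking for matching rows.
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='event_id',
        cluster_by=['event_date'],
        incremental_predicates=[
            "DBT_INTERNAL_DEST.event_date >= dateadd('day', -7, current_date)"
        ]
    )
}}

select event_id, event_date, payload
from {{ ref('stg_events') }}
{% if is_incremental() %}
-- only pull new source rows on incremental runs
where event_date >= (select max(event_date) from {{ this }})
{% endif %}
```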
Thank you, that’s super helpful. I just learned about incremental predicates from the link you sent, which could be perfect for improving the efficiency of the merge (if we decide to use it).
We can probably get away with using an append strategy, but we're just slightly nervous about duplicates sneaking into the destination table, which is something we can't afford.
Well, if you need it to be dead-on balls accurate and there's no heuristic you can use to simplify the search for duplicates (e.g., a guarantee that a duplicate can never arrive more than 1 month after a row is created), then you simply cannot avoid a whole table scan. That's basically physics: you're always going to need to read all the data so you can compare against it. It definitely is expensive, but the expense comes from your requirement that you can't have any duplicates, not from MERGE.
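If you do have a heuristic like that, checking for duplicates stays cheap too. A rough sketch, assuming made-up table and column names and a 1-month window:

```
-- Look for duplicate keys only in the last month of the destination table,
-- relying on the "no duplicates arrive more than 1 month later" heuristic
-- so Snowflake can prune older micro-partitions.
select event_id, count(*) as n
from analytics.fct_events
where event_date >= dateadd('month', -1, current_date)
group by event_id
having count(*) > 1;
```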
In Snowflake, a merge on a clustered table can indeed leverage the clustering key to optimize the process, provided the merge conditions reference that key: clustering helps Snowflake locate and access only the relevant data, minimizing the need for a full table scan.
When you perform a merge operation on a clustered table, Snowflake will typically only need to scan the micro-partitions relevant to the data being merged. This can significantly reduce the amount of data that needs to be processed compared to a full table scan, especially if your clustering key aligns well with the merge operation (such as updating rows based on date).
So, using the “merge” strategy with a clustered table in Snowflake is generally efficient and can scale well even for tables with several billion rows. It combines the benefits of the merge operation for updating existing rows with the efficiency of the clustering key for minimizing unnecessary data scans.
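To make that concrete, here's roughly the shape of merge that benefits from pruning. This is an illustrative sketch, not dbt's exact generated SQL, and the table/column names and 7-day window are made up:

```
merge into analytics.fct_events as dest
using analytics.fct_events__dbt_tmp as src
    on  dest.event_id = src.event_id
    -- the extra predicate on the clustering key (event_date) is what lets
    -- Snowflake prune micro-partitions instead of scanning the whole table
    and dest.event_date >= dateadd('day', -7, current_date)
when matched then update set
    payload = src.payload
when not matched then insert (event_id, event_date, payload)
    values (src.event_id, src.event_date, src.payload);
```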
Agreed. And if you include your cluster field (date) in the list of unique keys, it will force a join on the date field in the merge, so nothing is left to chance.
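For instance, something like this (column names are hypothetical), so the generated merge joins on both the id and the date:

```
-- Sketch: putting the clustering column in unique_key makes dbt include it
-- in the merge's ON clause alongside the id.
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=['event_id', 'event_date'],
        cluster_by=['event_date']
    )
}}

select event_id, event_date, payload
from {{ ref('stg_events') }}
```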