The behavior of cluster_by with incremental tables is a bit counterintuitive and isn’t described thoroughly in the docs. I ran into this issue myself and saw that several other people were discussing it on Slack. I thought I’d summarize the behavior of the cluster_by config when used with incremental tables. Please chime in if I’ve made a mistake or if you have more to add!
TL;DR - If you add a cluster_by config to an existing incremental table, you have to either do a full refresh or manually run the alter table… cluster by… statement.
Background
The snowflake adapter has an additional config option, “cluster_by”, which will tell Snowflake to cluster the table on the specified columns. The docs for that config are here. For a regular “table” materialization, setting this config does two things:
- Sorts the table by the clustering keys by adding an “order by” clause at the end of the SQL statement.
- After creating the table, it runs an additional SQL statement, altering the table to cluster by the keys.
Using this with incremental models
When cluster_by is set for an incremental table, it:
- Always sorts by the clustering keys (for both incremental runs and full refreshes)
- Alters the table to set clustering keys ONLY when you do a full refresh or the table is created for the first time.
Therefore, if you add a cluster_by config to an existing incremental model, you must either:
- Do a full-refresh, or
- Manually run the SQL statement that sets clustering keys for the table. Here are the Snowflake docs for that statement type.
Code details
- The cluster by behavior lives in the snowflake__create_table_as macro.
- If there is a value for cluster_by, it includes the order by statement here
- If there is a value for cluster_by and the target table is not temporary, it runs the alter table…cluster by… statement here
- When you do a dbt run with an incremental materialization, and…
- You are doing a full refresh, or the target table doesn’t exist, or the target table is a view, it will run the create_table_as macro where the target table is your actual model’s final table. Therefore it will order by the cluster keys AND run the alter table…cluster by… command. See the code here for a full-refresh, here for a target that doesn’t exist, and here for a view.
- Otherwise, if it’s an incremental run on an existing table, it will run the create_table_as macro on a tmp table. It will always sort the incoming data on the cluster keys. But, because it’s targeting a temporary table, it will not do any alter table… cluster by… statement on an incremental run.