My data engineering team has recently been working on converting the generation/maintenance of some very large data sets that our data science team had been managing over to dbt. We use Snowflake, and for many of our large data sets we define custom cluster keys on the tables. As we worked through these conversions, the query times for producing some of these large models (tens of billions of rows) were far longer than we had expected. Examining the query plans showed that the longest step was a sort over the full data set, which is not something we had in our dbt model definitions. After some digging, we discovered that this is a product of how dbt handles Snowflake clustering.
Basically, dbt takes advantage of the fact that if you insert sorted data into an empty table (or as part of a CTAS), and then apply clustering on the same sort keys, the data is already well clustered by default. So if you have added the “cluster by” config to your model, when the model gets compiled, dbt appends a sort on the cluster key(s) to the SQL. This decision appears to have been made partly in response to the fact that Snowflake’s normal clustering mechanism involves a background process that reorganizes the underlying data of clustered tables until they are close to optimal, and that process does have real expense ($$) associated with it.
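To make that concrete, here is a rough sketch of what this looks like in practice. The model and column names here are made up for illustration, and the compiled SQL is an approximation of what dbt-snowflake generates, not the exact macro output:

```sql
-- models/big_model.sql (hypothetical model)
{{ config(
    materialized = 'table',
    cluster_by   = ['event_date', 'account_id']
) }}

select ...


-- What dbt compiles this to, approximately: the model's select gets
-- wrapped in an outer ORDER BY on the cluster keys, so the CTAS writes
-- pre-sorted (i.e., already clustered) data.
create or replace table analytics.big_model as (
    select * from (
        select ...
    )
    order by (event_date, account_id)
);
alter table analytics.big_model cluster by (event_date, account_id);
```

That outer `order by` is the full-table sort we were seeing as the longest step in our query plans.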
Another way to think about it is that dbt enforces immediate consistency on clustered tables, while normally you would have eventual consistency. This makes sense in the context of a dbt process on its own: presumably you have clustered your data to make it more efficient to use it downstream, and in a dbt process presumably your downstream use is right away, so you want immediate consistency.
However, in our use case we do actually want eventual consistency: we want to leverage Snowflake’s background process. The trade-off we are making is that the dbt process runs faster, with the knowledge that people will be leveraging the outputs at some later point, and by then they want it clustered.
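If you go this route, Snowflake gives you a way to watch the background process converge. As a sketch (the table and key names are the same illustrative ones as above), you can check clustering quality at any point with:

```sql
-- Returns a JSON summary of how well the table's micro-partitions are
-- clustered on the given keys (average depth, overlap histogram, etc.).
-- Run it periodically to confirm the background reclustering is catching up.
select system$clustering_information(
    'analytics.big_model',
    '(event_date, account_id)'
);
```

A shrinking average clustering depth over time is the signal that the “eventual” part of eventual consistency is actually happening.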
Anyway, we thought those were some really interesting “down in the weeds” details on how dbt snowflake clustering actually works, and wanted to share with the community.
Also, our solution for our particular case is to stop using the “cluster by” config to apply the clustering to our models, and instead to apply it via a post-hook. That way dbt does not compile the sort into the SQL, but we still get that “eventual consistency” behavior.
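In dbt config terms, that swap looks roughly like this (again, names are illustrative; the key point is moving the clustering from `cluster_by` into a `post_hook`):

```sql
-- models/big_model.sql (hypothetical model)
-- No cluster_by config, so dbt does not inject an ORDER BY into the CTAS.
-- Instead, the post-hook declares the cluster key after the table is built,
-- and Snowflake's background service reclusters it over time.
{{ config(
    materialized = 'table',
    post_hook    = "alter table {{ this }} cluster by (event_date, account_id)"
) }}

select ...
```

The `{{ this }}` relation is rendered by dbt at execution time, so the hook targets whatever schema/table the model materializes into.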