Snowflake cluster keys in dbt -- learnings and thoughts

nadelman · September 17, 2022, 4:18pm

My data engineering team has recently been converting the generation and maintenance of some very large data sets, previously managed by our data science team, to dbt. We use Snowflake, and for many of our large data sets we define custom cluster keys on the tables. As we worked through these conversions, the query times for producing some of the large models (tens of billions of rows) were far longer than we had expected. Examining the query plans showed that the longest step was a sort on the full data set, which is not something we had in our dbt model definition. After some digging, we discovered that this is a product of how dbt handles Snowflake clustering.

References:

github.com/dbt-labs/dbt-core, issue: Support "cluster by" on Snowflake
Opened by drewbanin on 05 Jan 2018 (03:28PM UTC), closed 12 Sep 2019 (07:13PM UTC)
Labels: enhancement, snowflake, good_first_issue

Docs: https://docs.snowflake.net/manuals/sql-reference/sql/create-table.html …

The `cluster by` keyword also requires a column definition list in the `create table as` statement. Example:

```sql
create table "dbt_dbanin".test_cluster (id int, name string) cluster by(id) as (
    select 1 as id, 2 as name
);
```

This can be implemented with a custom table materialization override for Snowflake.

Basically, dbt takes advantage of the fact that if you insert sorted data into an empty table (or sort it as part of a CTAS) and then define clustering on the same keys, the data is already well clustered from the start. So if you have added the “cluster by” config to your model, dbt appends a sort on the cluster key(s) to the end of the SQL when it compiles the model. This decision appears to have been made partly because Snowflake’s normal clustering mechanism is a background process that reorganizes the underlying data of a clustered table until it is close to optimal, and that process has a cost ($$) associated with it.
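To make the compiled behavior concrete, here is a rough sketch of the shape of the statements this produces for a table model. The schema, table, and column names are hypothetical, and the exact SQL that dbt emits varies by adapter version, so treat this as an approximation rather than the literal compiled output.

```sql
-- Hypothetical model configured with: cluster_by=['event_date']
-- Approximate shape of what gets run on Snowflake.
create or replace table analytics.big_events as (
    select * from (
        -- the model's own select statement
        select event_id, event_date, payload
        from raw.events
    )
    order by (event_date)  -- the appended sort; on very large tables this dominates runtime
);

-- The cluster key is then declared on the already-sorted table.
alter table analytics.big_events cluster by (event_date);
```

The `order by` is the full-table sort that shows up in the query plan; the `alter table` afterwards is cheap because the data arrives already ordered on the key.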

Another way to think about it is that dbt enforces immediate consistency on clustered tables, while normally you would have eventual consistency. This makes sense in the context of a dbt process on its own: presumably you have clustered your data to make it more efficient to use it downstream, and in a dbt process presumably your downstream use is right away, so you want immediate consistency.

However, in our use case, we actually do want eventual consistency: we want to leverage Snowflake’s background process. The trade-off we are making is that we want the dbt process to run faster, knowing that people will be using the outputs at some later point in time, and that they will want the data clustered by then.

Anyway, we thought those were some really interesting “down in the weeds” details on how dbt snowflake clustering actually works, and wanted to share with the community.

Also, our solution for our particular case is now to not use the “cluster by” config to apply clustering to our models, but instead to apply it via a post-hook. That way dbt does not compile the sort into the model SQL, but we still get the “eventual consistency” behavior.
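A minimal sketch of what that post-hook approach can look like in a model file. The model name, upstream ref, and column names are hypothetical; the point is only that the cluster key is declared after the table is built, so no `order by` is added to the model query.

```sql
-- models/big_events.sql (hypothetical model)
{{
    config(
        materialized='table',
        post_hook="alter table {{ this }} cluster by (event_date)"
    )
}}

select event_id, event_date, payload
from {{ ref('stg_events') }}
```

With this setup the table is written unsorted, and Snowflake's automatic clustering service reorganizes it in the background, with the associated cost. If automatic clustering has been suspended on the table, it would need to be resumed (for example with `alter table ... resume recluster`) for the background process to do anything. Clustering progress can be inspected with `select system$clustering_information('analytics.big_events', '(event_date)');`, where the table name is again a placeholder.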
