SET spark.sql.shuffle.partitions=auto

The problem I’m having

My team and I are running our dbt project on an all-purpose cluster. When we build some of our larger tables, we get a performance hint in the Spark UI telling us to increase the spark.sql.shuffle.partitions setting. I found a Stack Overflow answer describing how to set this in dbt. Instead of putting the setting in each model's config, I added it to dbt_project.yml, as you can see in the example code section below. When I ran one of the big models again I still got the performance warning, so it seems the setting did not have the effect I was hoping for.

Does anybody here know how to set this correctly?

I'm not sure, but I think this setting is applied at the cluster level in Databricks, so it doesn't really make sense to set it for every model anyway.
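
For reference, the per-model variant from the Stack Overflow answer would look something like this (just a sketch; my_big_model.sql is a made-up name):

-- models/my_big_model.sql
{{ config(
    pre_hook="SET spark.sql.shuffle.partitions=auto"
) }}

select 1 as id  -- placeholder model body

Repeating that hook in every large model felt wrong, which is why I tried the project-level config instead.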

Context

We are running Azure Databricks, and we tried this on an all-purpose cluster with 96 cores and 384 GiB of memory.

What I’ve already tried

Some example code or error messages

models:
  +file_format: delta
  +pre-hook:
    sql: "SET spark.sql.shuffle.partitions=auto"
    transaction: false
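
If it helps with debugging, one way to check whether the hook is actually being sent is to run dbt with debug logging (the --select value is a placeholder) and look for the SET statement among the SQL statements dbt logs as it sends them to the cluster:

dbt --debug run --select my_big_model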

It turns out that Databricks / Spark saves the configured spark.sql.shuffle.partitions value into the stream checkpoint, so an existing table keeps using the value it was first built with, regardless of the session setting.

I had to do a full refresh on the aggregation tables in question to get my new setting to take effect.
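
For anyone else running into this, the refresh is just the standard flag (the model selector is a placeholder):

dbt run --full-refresh --select my_aggregation_model

After the rebuild, the new shuffle-partitions setting was picked up.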

See e.g. "Solved: Changing shuffle.partitions with spark.conf in a s..." on the Databricks Community forum (post 27389).