The problem I’m having
My team and I are running our dbt project on an all-purpose cluster. When running some of our larger tables we get a performance hint in the Spark UI telling us to increase the spark.sql.shuffle.partitions
setting. I found this Stack Overflow answer on how to set it in dbt. Instead of putting the setting in the model config I added it to dbt_project.yml,
as you can see below. When I ran one of the big models again I still got the performance warning, so it seems the setting did not have the effect I was hoping for.
Is there anybody here that knows how to set this?
I'm not sure, but I think this setting lives at the cluster level in Databricks, so it does not really make sense to set it for every model.
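If it really is a cluster-level thing, my assumption is that it would go in the cluster's Spark config (under Advanced options) as a plain key/value line, something like the line below, but I haven't verified that this is the right place or syntax:

    spark.sql.shuffle.partitions auto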
Context
We are running Azure Databricks and we tried this on an all-purpose cluster with 96 cores and 384 GiB of memory.
What I’ve already tried
Some example code or error messages
models:
  +file_format: delta
  +pre-hook:
    sql: "SET spark.sql.shuffle.partitions=auto"
    transaction: false
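For comparison, this is roughly what I understand the model-level alternative to look like, i.e. a config block at the top of the model's .sql file (the pre_hook dict with sql/transaction keys is taken from the dbt docs, not from our project, so treat it as a sketch; the model name below is just a placeholder):

    {{ config(
        file_format='delta',
        pre_hook={"sql": "SET spark.sql.shuffle.partitions=auto", "transaction": false}
    ) }}

    -- my_upstream_model is a placeholder name
    select * from {{ ref('my_upstream_model') }}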