The problem I’m having
My team and I are running our dbt project on an all-purpose cluster. When running some of our larger tables we get a performance hint in the Spark UI telling us to increase the spark.sql.shuffle.partitions
setting. I found this Stack Overflow answer on how to set it in dbt. Instead of putting the setting in the model config I added it to dbt_project.yml,
as you can see below. When I ran one of the big models again I still got the performance warning, so it seems the setting did not have the effect I was hoping for.
Is there anybody here that knows how to set this?
I'm not sure, but I think this setting lives at the cluster level in Databricks, so it does not really make sense to set it for every model.
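If it really is a cluster-level thing, my assumption is that it would go in the cluster's Spark config (under Advanced options) as a plain key/value line, something like the line below, but I haven't verified that this is the right place or syntax:

    spark.sql.shuffle.partitions auto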
Context
We are running Azure Databricks and we tried this on an all-purpose cluster with 96 cores and 384 GiB of memory.
What I’ve already tried
Some example code or error messages
models:
  +file_format: delta
  +pre-hook:
    sql: "SET spark.sql.shuffle.partitions=auto"
    transaction: false
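For comparison, this is roughly what I understand the model-level alternative to look like, i.e. a config block at the top of the model's .sql file (the pre_hook dict with sql/transaction keys is taken from the dbt docs, not from our project, so treat it as a sketch; the model name below is just a placeholder):

    {{ config(
        file_format='delta',
        pre_hook={"sql": "SET spark.sql.shuffle.partitions=auto", "transaction": false}
    ) }}

    -- my_upstream_model is a placeholder name
    select * from {{ ref('my_upstream_model') }}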