Extremely slow dbt run (15+ min) with a Databricks Unity Catalog schema containing over 3,500 tables
Setup
- python 3.11
- dbt-core=1.7.2
- dbt-spark=1.7.1
- dbt-databricks=1.7.1
- databricks-sql-connector=2.9.3
- Multi-node cluster running Databricks Runtime 12.2 LTS ML (Apache Spark 3.3.2, Scala 2.12)
Issue
When we `dbt run` our models, even before the actual queries execute, dbt appears to parse the entire set of tables in the schema. Although the model itself takes only 2 seconds to run, this parsing step takes over 15 minutes. Our schema has over 3,500 tables, and for governance reasons we do not have the ability to create our own dedicated schema, as recommended by the dbt documentation.
We have tried some cache-related CLI options such as `--cache-selected-only`, `--select`, and `--no-populate-cache`, but they seem to have no impact.
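For reference, these are roughly the invocations we tried (`my_model` is a placeholder; `--cache-selected-only` and `--no-populate-cache` are global flags in the dbt 1.7 CLI, so they are placed before the subcommand):

```
# my_model stands in for one of our actual models
dbt --cache-selected-only run --select my_model
dbt --no-populate-cache run --select my_model
```

In every case the run still spends roughly 15 minutes on schema parsing before the model executes.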
Requested Outcome
What configurations can we make, besides creating our own separate schema, to shorten this 15-minute schema parsing?
Reference
For context, I have been following this GitHub issue for some time, waiting for a solution:
GitHub issue (opened 21 Sep 2021; labels: enhancement, good_first_issue, jira):
### Describe the feature
I am currently facing an issue using dbt with Spark in an AWS/Glue/EMR environment, as already discussed in https://github.com/dbt-labs/dbt-spark/issues/215 (and raised earlier in https://github.com/dbt-labs/dbt-spark/issues/93).
The current issue is about the adapter's method `list_relations_without_caching`:
https://github.com/dbt-labs/dbt/blob/HEAD/core/dbt/include/global_project/macros/adapters/common.sql#L240
which in the Spark Adapter implementation is:
https://github.com/dbt-labs/dbt-spark/blob/a8a85c54d10920af1c5efcbb4d2a51eb7cfcad11/dbt/include/spark/macros/adapters.sql#L133-L139
In this case you can see that the command `show table extended in {{ relation }} like '*'` is executed. It forces Spark to go through the metadata of every table in the schema (since Spark has no information-schema layer) in a sort of "discovery mode", and this approach produces two main issues:
1) Bad performance: some environments can have hundreds or even thousands of tables in the same schema, generated not only by dbt but also by other processes. In that case this operation can be very costly, especially when different dbt processes update a few tables at different times.
2) Instability: as I verified in an AWS/Glue/EMR environment, you can have views without an "S3 Location" defined, such as an Athena/Presto view, which will crash a dbt process running Spark SQL on EMR with errors like:
```
show table extended in my_schema like '*'
' with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:00:22 INFO SparkExecuteStatementOperation: Running query with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 INFO DAGScheduler: Asked to cancel job group 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 ERROR SparkExecuteStatementOperation: Error executing query with 62bb6394-b13a-4b79-92dd-2e0918831cf3, currentState RUNNING,
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
### Describe alternatives you've considered
I do not see why the dbt process should care about the "rest of the world", such as the Athena views mentioned above or tables created by other processes in the same schema.
So ideally I would replace the command:
`show table extended in <schema> like '*'`
with something like:
`show table extended in <schema> like ('<table1>|<table2>|…')`
where my `<table1>`, `<table2>`, etc. are determined automatically when I run a command like
`dbt run --models my_folder`
where `my_folder` contains the files `table1.sql`, `table2.sql`, etc. However, with the current macro interface only the schema parameter can be passed.
Two questions here: how can I automatically infer the names of the tables involved when a command like `dbt run --models my_folder` runs, and how can I then pass them to `list_relations_without_caching`?
### Additional context
I found this relevant for Spark in an AWS environment, but it can potentially be a similar issue for other implementations.
### Who will this benefit?
On dbt's Slack channel I talked to another user "affected" by a similar issue, but probably anyone using Spark in a distributed environment (AWS or not) can be affected by this.
### Are you interested in contributing this feature?
Sure, both coding and testing.
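For illustration, the change that issue proposes amounts to narrowing the `like '*'` pattern. One interim workaround would be to shadow the adapter macro in our own project, roughly as sketched below. This is unvalidated: the macro body is taken from dbt-spark 1.x, dbt-databricks 1.7 may dispatch a different implementation, and the `stg_*|mart_*` prefixes are hypothetical stand-ins for the prefixes our models actually use.

```
{# macros/list_relations_without_caching.sql -- unvalidated sketch #}
{% macro spark__list_relations_without_caching(relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    {# Narrow the pattern from '*' to the prefixes used by our dbt models #}
    show table extended in {{ relation }} like 'stg_*|mart_*'
  {% endcall %}
  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}
```

Assuming dbt's macro dispatch resolves the project-level override first, as it normally does for global macros, this would keep the metadata scan away from the other 3,500+ tables, at the cost of dbt no longer seeing relations outside those prefixes.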
amy (December 4, 2023, 4:14pm):
Hello @takahiro.watanabe1,
When you go through the dbt logs, are you able to see whether it is running a lot of `describe extended` queries? Also, if you don't mind opening an issue on the dbt-databricks adapter repo, that would be the best way for the Databricks team to triage this.
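As a rough way to check, something like this should count those statements in the debug log (assuming the default `logs/dbt.log` location):

```
# Count `describe extended` statements from recent runs
# (dbt writes debug-level logs to logs/dbt.log by default)
grep -ci "describe extended" logs/dbt.log
```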