Extremely slow dbt run (15+ min) with a Databricks Unity Catalog schema containing over 3,500 tables
Setup
- python 3.11
- dbt-core=1.7.2
- dbt-spark=1.7.1
- dbt-databricks=1.7.1
- databricks-sql-connector=2.9.3
- Multi-node cluster running Databricks Runtime 12.2 LTS ML (Apache Spark 3.3.2, Scala 2.12)
Issue
When we `dbt run` our models, even before the actual queries execute, dbt appears to parse the entire set of tables in the schema. Although the model itself takes only 2 seconds to run, this parsing step takes over 15 minutes. Our schema has over 3,500 tables, and for governance reasons we do not have the ability to create our own dedicated schema, as recommended by the dbt documentation.
We have tried some cache-related CLI options such as `--cache-selected-only`, `--select`, and `--no-populate-cache`, but they seem to have no impact.
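For reference, these are roughly the invocations we tried (`my_model` is a placeholder; `--cache-selected-only` and `--no-populate-cache` are global flags in the dbt 1.7 CLI, so they are placed before the subcommand):

```
# my_model stands in for one of our actual models
dbt --cache-selected-only run --select my_model
dbt --no-populate-cache run --select my_model
```

In every case the run still spends roughly 15 minutes on schema parsing before the model executes.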
Requested Outcome
What configurations can we make, besides creating our own separate schema, to shorten this 15-minute schema parsing?
Reference
For context, I have been following this GitHub issue for some time, waiting for a solution:
GitHub issue (opened 21 Sep 2021; labels: enhancement, good_first_issue, jira):
### Describe the feature
I am currently facing an issue using dbt with Spark in an AWS/Glue/EMR environment, as already discussed in https://github.com/dbt-labs/dbt-spark/issues/215 (and raised earlier in https://github.com/dbt-labs/dbt-spark/issues/93).
The current issue is about the adapter's method `list_relations_without_caching`:
https://github.com/dbt-labs/dbt/blob/HEAD/core/dbt/include/global_project/macros/adapters/common.sql#L240
which in the Spark Adapter implementation is:
https://github.com/dbt-labs/dbt-spark/blob/a8a85c54d10920af1c5efcbb4d2a51eb7cfcad11/dbt/include/spark/macros/adapters.sql#L133-L139
In this case you can see that the command `show table extended in {{ relation }} like '*'` is executed. It forces Spark to go through the metadata of every table in the schema (since Spark has no information-schema layer) in a sort of "discovery mode", and this approach produces two main issues:
1) Bad performance: some environments can have hundreds or even thousands of tables in the same schema, generated not only by dbt but also by other processes. In that case this operation can be very costly, especially when different dbt processes update a few tables at different times.
2) Instability: as I verified in an AWS/Glue/EMR environment, you can have views without an "S3 Location" defined, such as an Athena/Presto view, which will crash a dbt process running Spark SQL on EMR with errors like:
```
show table extended in my_schema like '*'
' with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:00:22 INFO SparkExecuteStatementOperation: Running query with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 INFO DAGScheduler: Asked to cancel job group 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 ERROR SparkExecuteStatementOperation: Error executing query with 62bb6394-b13a-4b79-92dd-2e0918831cf3, currentState RUNNING,
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
### Describe alternatives you've considered
I do not see why the dbt process should care about the "rest of the world", such as the Athena views mentioned above or tables created by other processes in the same schema.
So ideally I would replace the command:
`show table extended in <schema> like '*'`
with something like:
`show table extended in <schema> like ('<table1>|<table2>|…')`
where my `<table1>`, `<table2>`, etc. are determined automatically when I run a command like
`dbt run --models my_folder`
where `my_folder` contains the files `table1.sql`, `table2.sql`, etc. However, with the current macro interface only the schema parameter can be passed.
Two questions here: how can I automatically infer the names of the tables involved when a command like `dbt run --models my_folder` runs, and how can I then pass them to `list_relations_without_caching`?
### Additional context
I found this relevant for Spark in an AWS environment, but it can potentially be a similar issue for other implementations.
### Who will this benefit?
On dbt's Slack channel I talked to another user "affected" by a similar issue, but probably anyone using Spark in a distributed environment (AWS or not) can be affected by this.
### Are you interested in contributing this feature?
Sure, both coding and testing.
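For illustration, the change that issue proposes amounts to narrowing the `like '*'` pattern. One interim workaround would be to shadow the adapter macro in our own project, roughly as sketched below. This is unvalidated: the macro body is taken from dbt-spark 1.x, dbt-databricks 1.7 may dispatch a different implementation, and the `stg_*|mart_*` prefixes are hypothetical stand-ins for the prefixes our models actually use.

```
{# macros/list_relations_without_caching.sql -- unvalidated sketch #}
{% macro spark__list_relations_without_caching(relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    {# Narrow the pattern from '*' to the prefixes used by our dbt models #}
    show table extended in {{ relation }} like 'stg_*|mart_*'
  {% endcall %}
  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}
```

Assuming dbt's macro dispatch resolves the project-level override first, as it normally does for global macros, this would keep the metadata scan away from the other 3,500+ tables, at the cost of dbt no longer seeing relations outside those prefixes.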
amy (December 4, 2023, 4:14pm):
Hello @takahiro.watanabe1,
When you go through the dbt logs, are you able to see whether it is running a lot of `describe extended` queries? Also, if you don't mind opening an issue on the dbt-databricks adapter repo, that would be the best way for the Databricks team to triage this.
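As a rough way to check, something like this should count those statements in the debug log (assuming the default `logs/dbt.log` location):

```
# Count `describe extended` statements from recent runs
# (dbt writes debug-level logs to logs/dbt.log by default)
grep -ci "describe extended" logs/dbt.log
```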