I’m having a problem populating a Databricks table from a dbt Python model that is structured something like this:
```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import (
    col, when, lit, split, sum as _sum, min as _min, max as _max,
    coalesce, date_format, udf, current_timestamp, to_date, explode
)
import pyspark.sql.functions as F
from functools import reduce
import datetime
from pyspark.sql.types import DoubleType

def do_the_work(dbt) -> DataFrame:
    # Lots of stuff in here that manipulates my data to populate a dataframe
    # This takes about 30 minutes
    return mydata_df

# ------------------------------------------------------------------------
# Main
# ------------------------------------------------------------------------
def model(dbt, session: SparkSession) -> DataFrame:
    dbt.config(
        materialized="table",
        file_format="delta",
        unique_key=["my_id"],
        submission_method="job_cluster",
        job_cluster_config={
            "autoscale": {"min_workers": 2, "max_workers": 4},
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "i3.2xlarge",
        },
    )
    processed_df = do_the_work(dbt)
    processed_df = processed_df.withColumn("update_timestamp", current_timestamp())
    return processed_df
```
The extraction of the data and construction of the dataframe takes about 30 minutes and could no doubt be improved; that’s not what I’m asking for help with (yet).
At the end of the process it adds an update_timestamp column populated with current_timestamp(). Spark evaluates that expression when the rows are actually written, so the recorded value is the date/time the data is written to the table, not the time the row was added to the dataframe.
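As an aside, if the column is meant to record when the rows were produced rather than when they were written, one option (a sketch, not necessarily the right semantics for this table) is to capture a driver-side datetime eagerly and embed it with `lit(...)` instead of using `current_timestamp()`. The lazy-vs-eager distinction can be shown with plain Python:

```python
import time
from datetime import datetime, timezone

def lazy_ts():
    # Evaluated only when called, analogous to Spark resolving
    # current_timestamp() at the moment the write actually runs.
    return datetime.now(timezone.utc)

# Eager capture: fixed the moment this line runs, analogous to embedding a
# driver-side datetime into the dataframe with lit(datetime.now(timezone.utc)).
eager_ts = datetime.now(timezone.utc)

time.sleep(1)  # stand-in for the long write/merge phase

write_time_ts = lazy_ts()
assert write_time_ts > eager_ts  # the lazy value reflects the later moment
```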
This code is triggered by a regular dbt run, alongside a load of ordinary SQL-based models. It all works: everything gets populated as expected, but there’s a big time delay coming from somewhere. In this morning’s run, every row in the table has an update timestamp of 2026-02-18T01:55:50.098+00:00, yet the log shows that the processing of that table started at 01:24:32 and finished at 02:37:43. So that part of the job was doing something for more than 40 minutes after the data was written to the table.
How can I find out what’s happening in that dead time, and how can I stop it happening? This job is already taking a long time to run, but it then takes even longer to do nothing!
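One low-tech way I could narrow it down (a sketch; the `phase` helper and the suggested split points are hypothetical, not part of dbt’s API) is to bracket each stage of the model with timestamped log lines, so the model’s own output shows exactly when my code finishes and any remaining gap must come from the write or from dbt/adapter polling:

```python
import time
from contextlib import contextmanager
from datetime import datetime, timezone

@contextmanager
def phase(name, sink=print):
    """Bracket a block with timestamped START/END lines to show
    where the wall-clock time goes."""
    start = time.monotonic()
    sink(f"{datetime.now(timezone.utc).isoformat()} START {name}")
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        sink(f"{datetime.now(timezone.utc).isoformat()} END {name} ({elapsed:.1f}s)")

# Hypothetical usage inside model():
#     with phase("do_the_work"):
#         processed_df = do_the_work(dbt)

# Demonstration with an in-memory sink instead of print:
lines = []
with phase("demo", sink=lines.append):
    time.sleep(0.2)
```

Comparing those lines against the dbt log and the job cluster’s Spark UI timeline should show whether the 40 minutes is spent in Spark or after it.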
I found this supposedly-fixed bug that may be related: [Bug] dbt is waiting for a long time even after the Python model has already completed (dataproc batch succeeded) · Issue #1102 · dbt-labs/dbt-adapters · GitHub