I’m having a problem populating a Databricks table from a dbt Python model that is structured something like this:
```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import (
    col, when, lit, split, sum as _sum, min as _min, max as _max,
    coalesce, date_format, udf, current_timestamp, to_date, explode
)
import pyspark.sql.functions as F
from functools import reduce
import datetime
from pyspark.sql.types import DoubleType

def do_the_work(dbt) -> DataFrame:
    # Lots of stuff in here that manipulates my data to populate a dataframe
    # This takes about 30 minutes
    return mydata_df

# ------------------------------------------------------------------------
# Main
# ------------------------------------------------------------------------
def model(dbt, session: SparkSession) -> DataFrame:
    dbt.config(
        materialized="table",
        file_format="delta",
        unique_key=["my_id"],
        submission_method="job_cluster",
        job_cluster_config={
            "autoscale": {"min_workers": 2, "max_workers": 4},
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "i3.2xlarge",
        },
    )
    processed_df = do_the_work(dbt)
    processed_df = processed_df.withColumn("update_timestamp", current_timestamp())
    return processed_df
```
The extraction of the data and construction of the dataframe takes about 30 minutes and could no doubt be improved; that’s not what I’m asking for help with (yet).
At the end of the process it adds an update_timestamp column populated with current_timestamp(). Spark evaluates that expression when the rows are actually written, so the recorded value is the date/time the data is written to the table, not the time the row was added to the dataframe.
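As an aside, if the column is meant to record when the rows were produced rather than when they were written, one option (a sketch, not necessarily the right semantics for this table) is to capture a driver-side datetime eagerly and embed it with `lit(...)` instead of using `current_timestamp()`. The lazy-vs-eager distinction can be shown with plain Python:

```python
import time
from datetime import datetime, timezone

def lazy_ts():
    # Evaluated only when called, analogous to Spark resolving
    # current_timestamp() at the moment the write actually runs.
    return datetime.now(timezone.utc)

# Eager capture: fixed the moment this line runs, analogous to embedding a
# driver-side datetime into the dataframe with lit(datetime.now(timezone.utc)).
eager_ts = datetime.now(timezone.utc)

time.sleep(1)  # stand-in for the long write/merge phase

write_time_ts = lazy_ts()
assert write_time_ts > eager_ts  # the lazy value reflects the later moment
```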
This code is triggered by a regular dbt run, alongside a load of ordinary SQL-based models. It all works: everything gets populated as expected, but there’s a big time delay coming from somewhere. In this morning’s run, every row in the table has an update timestamp of 2026-02-18T01:55:50.098+00:00, yet the log shows that the processing of that table started at 01:24:32 and finished at 02:37:43. So that part of the job was doing something for more than 40 minutes after the data was written to the table.
How can I find out what’s happening in that dead time, and how can I stop it happening? This job is already taking a long time to run, but it then takes even longer to do nothing!
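One low-tech way I could narrow it down (a sketch; the `phase` helper and the suggested split points are hypothetical, not part of dbt’s API) is to bracket each stage of the model with timestamped log lines, so the model’s own output shows exactly when my code finishes and any remaining gap must come from the write or from dbt/adapter polling:

```python
import time
from contextlib import contextmanager
from datetime import datetime, timezone

@contextmanager
def phase(name, sink=print):
    """Bracket a block with timestamped START/END lines to show
    where the wall-clock time goes."""
    start = time.monotonic()
    sink(f"{datetime.now(timezone.utc).isoformat()} START {name}")
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        sink(f"{datetime.now(timezone.utc).isoformat()} END {name} ({elapsed:.1f}s)")

# Hypothetical usage inside model():
#     with phase("do_the_work"):
#         processed_df = do_the_work(dbt)

# Demonstration with an in-memory sink instead of print:
lines = []
with phase("demo", sink=lines.append):
    time.sleep(0.2)
```

Comparing those lines against the dbt log and the job cluster’s Spark UI timeline should show whether the 40 minutes is spent in Spark or after it.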
I found this supposedly-fixed bug that may be related: [Bug] dbt is waiting for a long time even after the Python model has already completed (dataproc batch succeeded) · Issue #1102 · dbt-labs/dbt-adapters · GitHub