I’m currently using PySpark to run custom code in my dbt models. My tables are hosted in BigQuery, and I’m running a Spark cluster through Google Cloud’s Dataproc service.
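For context, the model is essentially a dbt Python model along these lines (a simplified sketch — the upstream model name, column names, and transformation are placeholders, not my actual code):

```python
import pyspark.pandas as ps


def model(dbt, session):
    # Materialize as a table and run on the existing Dataproc cluster
    dbt.config(materialized="table", submission_method="cluster")

    # Upstream dbt model, read from BigQuery as a Spark DataFrame
    df = dbt.ref("my_upstream_model")

    # Convert to pandas-on-Spark for the custom transformation (placeholder logic)
    pdf = df.pandas_api()
    pdf["amount_usd"] = pdf["amount"] * pdf["fx_rate"]

    # dbt writes the returned Spark DataFrame back to the BigQuery table
    return pdf.to_spark()
```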
While the model runs, my initial reading of the Spark logs suggests that the Spark processing itself completes correctly, but an intermittent failure occurs when the job attempts to write to the table:
/usr/lib/spark/python/lib/pyspark.zip/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
/usr/lib/spark/python/lib/pyspark.zip/pyspark/pandas/utils.py:1016: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `to_spark`, the existing index is lost when converting to Spark DataFrame.
24/05/10 14:56:26 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-335962519198-wojzyb24/fa977bf1-a1f3-47b6-aceb-7f265f113b11/spark-job-history/application_1715347348557_0011.inprogress [CONTEXT ratelimit_period="1 MINUTES [skipped: 66]" ]
24/05/10 14:56:26 WARN TaskSetManager: Stage 20 contains a task of very large size (3799 KiB). The maximum recommended task size is 1000 KiB.
24/05/10 14:56:29 WARN TaskSetManager: Lost task 4.0 in stage 20.0 (TID 314) (feature-forge-cluster-test-m.c.codapro-csl-stg.internal executor 1): com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Could not create write-stream after multiple retries
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.<init>(BigQueryDirectDataWriterHelper.java:97)
at com.google.cloud.spark.bigquery.write.context.BigQueryDirectDataWriterContext.<init>(BigQueryDirectDataWriterContext.java:74)
at com.google.cloud.spark.bigquery.write.context.BigQueryDirectDataWriterContextFactory.createDataWriterContext(BigQueryDirectDataWriterContextFactory.java:72)
at com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler.call(DataSourceWriterContextPartitionHandler.java:57)
at com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler.call(DataSourceWriterContextPartitionHandler.java:34)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:592)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:551)
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.retryCallable(BigQueryDirectDataWriterHelper.java:150)
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.retryCreateWriteStream(BigQueryDirectDataWriterHelper.java:117)
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.<init>(BigQueryDirectDataWriterHelper.java:95)
... 22 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:90)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
Any pointers would be appreciated.
Thanks in advance!