I’m currently using PySpark to run custom code in my dbt models. My tables are hosted in BigQuery, and I’m running a Spark cluster through Google Cloud’s Dataproc service.
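For context, the model is essentially a dbt Python model along these lines (a simplified sketch — the upstream model name, column names, and transformation are placeholders, not my actual code):

```python
import pyspark.pandas as ps


def model(dbt, session):
    # Materialize as a table and run on the existing Dataproc cluster
    dbt.config(materialized="table", submission_method="cluster")

    # Upstream dbt model, read from BigQuery as a Spark DataFrame
    df = dbt.ref("my_upstream_model")

    # Convert to pandas-on-Spark for the custom transformation (placeholder logic)
    pdf = df.pandas_api()
    pdf["amount_usd"] = pdf["amount"] * pdf["fx_rate"]

    # dbt writes the returned Spark DataFrame back to the BigQuery table
    return pdf.to_spark()
```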
While the model runs, my initial reading of the Spark logs suggests that the Spark processing itself completes correctly, but an intermittent failure occurs when the job attempts to write to the table:
/usr/lib/spark/python/lib/pyspark.zip/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
/usr/lib/spark/python/lib/pyspark.zip/pyspark/pandas/utils.py:1016: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `to_spark`, the existing index is lost when converting to Spark DataFrame.
24/05/10 14:56:26 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-335962519198-wojzyb24/fa977bf1-a1f3-47b6-aceb-7f265f113b11/spark-job-history/application_1715347348557_0011.inprogress [CONTEXT ratelimit_period="1 MINUTES [skipped: 66]" ]
24/05/10 14:56:26 WARN TaskSetManager: Stage 20 contains a task of very large size (3799 KiB). The maximum recommended task size is 1000 KiB.
24/05/10 14:56:29 WARN TaskSetManager: Lost task 4.0 in stage 20.0 (TID 314) (feature-forge-cluster-test-m.c.codapro-csl-stg.internal executor 1): com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Could not create write-stream after multiple retries
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.<init>(BigQueryDirectDataWriterHelper.java:97)
at com.google.cloud.spark.bigquery.write.context.BigQueryDirectDataWriterContext.<init>(BigQueryDirectDataWriterContext.java:74)
at com.google.cloud.spark.bigquery.write.context.BigQueryDirectDataWriterContextFactory.createDataWriterContext(BigQueryDirectDataWriterContextFactory.java:72)
at com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler.call(DataSourceWriterContextPartitionHandler.java:57)
at com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler.call(DataSourceWriterContextPartitionHandler.java:34)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:592)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:551)
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.retryCallable(BigQueryDirectDataWriterHelper.java:150)
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.retryCreateWriteStream(BigQueryDirectDataWriterHelper.java:117)
at com.google.cloud.bigquery.connector.common.BigQueryDirectDataWriterHelper.<init>(BigQueryDirectDataWriterHelper.java:95)
... 22 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:90)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
Any pointers would be appreciated.
Thanks in advance!