Hi,
I am using PySpark to run custom code in dbt models. My tables are in BigQuery and the Spark cluster runs on GCP (Dataproc).
When running the model it fails. Looking at the Spark logs, the Spark processing works fine until it tries to write the table, where it fails. Roughly 1 run in 10 succeeds (I can't find a pattern for this).
We are using:
dbt-bigquery 1.6.4
dbt-core 1.6.2
dbt-extractor 0.4.1
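
For context, the failing model is essentially trivial. Our Python models follow the standard dbt shape, roughly like this (model and ref names below are placeholders, not our real ones):

import pyspark.sql.functions as F

def model(dbt, session):
    # session is the SparkSession provided by the Dataproc job
    dbt.config(submission_method="cluster")  # we submit to an existing Dataproc cluster
    df = dbt.ref("some_upstream_model")  # hypothetical upstream model
    # custom PySpark transformations go here; the failure happens later,
    # when dbt saves the returned DataFrame back to BigQuery
    return df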
Spark log:
23/11/09 15:30:15 ERROR org.apache.spark.scheduler.TaskSetManager: task 0.0 in stage 3.0 (TID 3) had a not serializable result: com.google.cloud.spark.bigquery.repackaged.io.grpc.Status
Serialization stack:
- object not serializable (class: com.google.cloud.spark.bigquery.repackaged.io.grpc.Status, value: Status{code=NOT_FOUND, description=Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ, cause=null})
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException, com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException, com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class java.util.concurrent.ExecutionException, java.util.concurrent.ExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.bigquery.connector.common.BigQueryConnectorException, com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Could not retrieve AppendRowsResponse)
- field (class: com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1, name: val$e, type: class java.lang.Exception)
- object (class com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1, com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1@527b3afb)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1); not retrying
23/11/09 15:30:15 ERROR org.apache.spark.scheduler.TaskSetManager: task 1.0 in stage 3.0 (TID 4) had a not serializable result: com.google.cloud.spark.bigquery.repackaged.io.grpc.Status
Serialization stack:
- object not serializable (class: com.google.cloud.spark.bigquery.repackaged.io.grpc.Status, value: Status{code=NOT_FOUND, description=Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc3Zi0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ, cause=null})
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException, com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc3Zi0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException, com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc3Zi0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class java.util.concurrent.ExecutionException, java.util.concurrent.ExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc3Zi0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.bigquery.connector.common.BigQueryConnectorException, com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Could not retrieve AppendRowsResponse)
- field (class: com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1, name: val$e, type: class java.lang.Exception)
- object (class com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1, com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1@4b29b286)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1); not retrying
23/11/09 15:30:15 WARN com.google.cloud.spark.bigquery.write.context.BigQueryDirectDataSourceWriterContext: BigQuery Data Source writer 6cd8b42c-b849-41dc-a7cc-482ec82e7afe aborted
Traceback (most recent call last):
File "/tmp/spark-af92d649-568e-4334-8535-abca0670a49b/dummy_py_model.py", line 196, in <module>
df.write \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.save.
: com.google.cloud.bigquery.connector.common.BigQueryConnectorException: unexpected issue trying to save [pat_id: string, task_uuid: string ... 2 more fields]
at com.google.cloud.spark.bigquery.write.BigQueryDataSourceWriterInsertableRelation.insert(BigQueryDataSourceWriterInsertableRelation.java:128)
at com.google.cloud.spark.bigquery.write.CreatableRelationProviderHelper.createRelation(CreatableRelationProviderHelper.java:54)
at com.google.cloud.spark.bigquery.v2.Spark31BigQueryTableProvider.createRelation(Spark31BigQueryTableProvider.java:65)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:362)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 3.0 (TID 3) had a not serializable result: com.google.cloud.spark.bigquery.repackaged.io.grpc.Status
Serialization stack:
- object not serializable (class: com.google.cloud.spark.bigquery.repackaged.io.grpc.Status, value: Status{code=NOT_FOUND, description=Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ, cause=null})
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException, com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException, com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class java.util.concurrent.ExecutionException, java.util.concurrent.ExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.NotFoundException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found. Entity: projects/X/datasets/ai_dw_mart_normal/tables/dummy_py_model/streams/CiQ3ZDg4ODc4Yy0wMDAwLTI4YTctYjg4Yi0yNDA1ODg3NzZiOTQ)
- writeObject data (class: java.lang.Throwable)
- object (class com.google.cloud.bigquery.connector.common.BigQueryConnectorException, com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Could not retrieve AppendRowsResponse)
- field (class: com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1, name: val$e, type: class java.lang.Exception)
- object (class com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1, com.google.cloud.spark.bigquery.write.DataSourceWriterContextPartitionHandler$1@527b3afb)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2304)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2252)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2252)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2491)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2433)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2422)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2204)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2225)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2244)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at com.google.cloud.spark.bigquery.write.BigQueryDataSourceWriterInsertableRelation.insert(BigQueryDataSourceWriterInsertableRelation.java:89)
... 34 more
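
The df.write at line 196 of the traceback is the save step that dbt-bigquery generates, not our own code. If I am reading the generated script correctly, it boils down to something like this (my reconstruction, not verified against the actual template):

df.write \
    .mode("overwrite") \
    .format("bigquery") \
    .option("writeMethod", "direct") \
    .save("X.ai_dw_mart_normal.dummy_py_model")

The "streams/..." entity in the NOT_FOUND error looks like a BigQuery Storage Write API write stream, which is what the connector's direct write method uses, so the failure seems to happen inside that write path rather than in our transformations. Any ideas what could make the write stream intermittently disappear, or a recommended workaround (e.g. forcing the indirect write method through a temporary GCS bucket)?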