How to allow new fields when running a PySpark model in GCP Dataproc Serverless

The problem I’m having

When I update my Python model to add more fields, the PySpark job warns that the number of fields is mismatched:
WARN BigQueryDataSourceWriterInsertableRelation: unexpected issue trying to save [col1: string, col2: timestamp … 12 more fields]
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Inserted row has wrong column count; Has 14, expected 8 at [4:30]

The context of why I’m trying to do this

We have a Python model that writes to a BigQuery table.
The PySpark job is submitted to Dataproc Serverless.
The problem occurs when we update the model to add new fields.
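
For context, a stripped-down version of the model looks roughly like the sketch below; the upstream model name, the added column, and the config values are placeholders, not our real code:

    # dbt Python model running as a Dataproc Serverless batch
    from pyspark.sql import functions as F

    def model(dbt, session):
        dbt.config(
            materialized="table",
            submission_method="serverless",  # dbt-bigquery submits the job to Dataproc Serverless
        )

        # DataFrame produced by an upstream dbt model (placeholder name)
        df = dbt.ref("stg_events")

        # Newly added column; the destination BigQuery table still has the old 8-column schema
        df = df.withColumn("ingested_at", F.current_timestamp())

        # dbt writes the returned DataFrame to BigQuery via the spark-bigquery-connector
        return df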

What I’ve already tried

  • Added the allowFieldAddition property in profiles.yml:
runtime_config:
  properties: 
    allowFieldAddition: 'true'
  • Set the Spark config in the Python model (see the sketch after this list):
    global spark
    spark.conf.set("temporaryGcsBucket","temp_bucket")
    spark.conf.set("allowFieldAddition","true")

Some example code or error messages

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
GET https://www.googleapis.com/bigquery/v2/projects/*******/queries/*******************************?location=**************&maxResults=0&prettyPrint=false
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "location" : "q",
    "locationType" : "parameter",
    "message" : "Inserted row has wrong column count; Has 14, expected 8 at [4:30]",
    "reason" : "invalidQuery"
  } ],
  "message" : "Inserted row has wrong column count; Has 14, expected 8 at [4:30]",
  "status" : "INVALID_ARGUMENT"
}
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:576)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:692)
	... 60 more
23/08/08 05:08:26 WARN BigQueryDirectDataSourceWriterContext: BigQuery Data Source writer c0f75ced-4543-4722-b974-0be9bceecc4a aborted


Hi <@U05LS2DELJF>, any luck on this? I am having the exact same problem.
Cheers

Note: @r.barata originally posted this reply in Slack. It might not have transferred perfectly.

Hi pips,
Have you tried running the model on a newly released Dataproc cluster or Dataproc Serverless runtime? Dataproc Serverless release notes  |  Dataproc Serverless Documentation  |  Google Cloud
We were experiencing the same painful problem, but with the new Dataproc images/runtime versions it disappeared, and we are now able to seamlessly add and remove columns.
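
In case it helps, one way to pin a newer Serverless runtime from dbt is the dataproc_batch block in profiles.yml, which recent versions of dbt-bigquery pass through to the Dataproc Batch API; the keys below are a sketch based on that assumption, and the project, dataset, bucket, and region values are placeholders:

    my_profile:
      target: dev
      outputs:
        dev:
          type: bigquery
          method: oauth
          project: my-gcp-project          # placeholder
          dataset: my_dataset              # placeholder
          gcs_bucket: my-staging-bucket    # placeholder
          dataproc_region: us-central1     # placeholder
          submission_method: serverless
          dataproc_batch:                  # passed to the Dataproc Batch API (check your dbt-bigquery version supports this)
            runtime_config:
              version: "2.1"               # pin a newer Dataproc Serverless runtime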