How to configure a PySpark model on GCP Dataproc Serverless to allow new fields

The problem I’m having

When I update my Python model to add more fields, the PySpark job warns that the number of fields is mismatched:
WARN BigQueryDataSourceWriterInsertableRelation: unexpected issue trying to save [col1: string, col2: timestamp … 12 more fields]
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Inserted row has wrong column count; Has 14, expected 8 at [4:30]

The context of why I’m trying to do this

We have a Python model that writes to a BigQuery table.
The PySpark job is submitted to Dataproc Serverless (a minimal sketch of the setup is shown below).
The problem occurs when we update the model to add new fields.
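
For reference, a stripped-down version of the setup looks roughly like this. It is only a sketch: the upstream model name (stg_events) and the column names are hypothetical, and it assumes the standard dbt Python model signature for the BigQuery adapter.

    # models/my_pyspark_model.py -- minimal sketch of a dbt Python model
    # that runs on Dataproc Serverless and is materialized as a BigQuery table.
    def model(dbt, session):
        dbt.config(
            submission_method="serverless",  # submit the job to Dataproc Serverless
            materialized="table",
        )

        # dbt resolves the upstream relation to a Spark DataFrame.
        df = dbt.ref("stg_events")

        # Newly added columns must also exist in the target BigQuery table,
        # otherwise the connector reports a column-count mismatch on write.
        df = df.withColumn("new_field", df["col1"])

        # dbt writes the returned DataFrame to BigQuery via the
        # spark-bigquery connector.
        return df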

What I’ve already tried

  • Adding the allowFieldAddition property in profiles.yml:
runtime_config:
  properties: 
    allowFieldAddition: 'true'
  • Setting the Spark config in the Python model (a writer-level sketch follows this list):
    # inside the Python model, tweaking the active Spark session's config
    global spark
    spark.conf.set("temporaryGcsBucket", "temp_bucket")
    spark.conf.set("allowFieldAddition", "true")

Some example code or error messages

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
GET https://www.googleapis.com/bigquery/v2/projects/*******/queries/*******************************?location=**************&maxResults=0&prettyPrint=false
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "location" : "q",
    "locationType" : "parameter",
    "message" : "Inserted row has wrong column count; Has 14, expected 8 at [4:30]",
    "reason" : "invalidQuery"
  } ],
  "message" : "Inserted row has wrong column count; Has 14, expected 8 at [4:30]",
  "status" : "INVALID_ARGUMENT"
}
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:576)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:692)
	... 60 more
23/08/08 05:08:26 WARN BigQueryDirectDataSourceWriterContext: BigQuery Data Source writer c0f75ced-4543-4722-b974-0be9bceecc4a aborted


Hi <@U05LS2DELJF>, any luck on this? I am having the exact same problem.
Cheers

Note: @r.barata originally posted this reply in Slack. It might not have transferred perfectly.

Hi pips,
Have you tried running the model on a newly released Dataproc cluster or Dataproc Serverless runtime? Dataproc Serverless release notes  |  Dataproc Serverless Documentation  |  Google Cloud
We were experiencing the same painful problem, but with the newer Dataproc images/runtime versions it disappeared, and we are now able to seamlessly add and remove columns.

I’m not sure the issue comes from an old image/version, since this same issue happens in SQL models for BigQuery.
What I think is happening is that the dbt backend works with a cached version of the table schema that does not get updated even when you set the config to allow field addition. This is especially true when the columns are not explicitly stated in the model code; for example, a select * in a model whose upstream changed in ways that are not reflected in the downstream model’s schema, hence the column mismatch.
I think this is a dbt bug.
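
One way to test that theory (and to work around it in the meantime) is to avoid select * and project the columns explicitly in the model, so the schema is stated in the code rather than inherited from whatever dbt has cached. A rough illustration with hypothetical model and column names:

    # Hypothetical illustration of the select * problem described above.
    def model(dbt, session):
        df = dbt.ref("stg_events")  # upstream model that recently gained columns

        # Equivalent of `select *`: silently inherits whatever schema the
        # upstream relation (or dbt's cached view of it) currently exposes.
        # return df

        # Explicit projection: a schema change has to be made here, in the
        # model code, instead of surfacing as a writer error in BigQuery.
        return df.select("col1", "col2", "new_field")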