How to configure a PySpark model on GCP Dataproc Serverless to allow new fields

The problem I’m having

When I update my Python model to add more fields, the PySpark job warns that the number of fields is mismatched:
WARN BigQueryDataSourceWriterInsertableRelation: unexpected issue trying to save [col1: string, col2: timestamp … 12 more fields]
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Inserted row has wrong column count; Has 14, expected 8 at [4:30]

The context of why I’m trying to do this

We have a Python model that writes to a BigQuery table.
The PySpark job is submitted to Dataproc Serverless (a minimal sketch of the setup is shown below).
The problem occurs when we update the model to add new fields.
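
For reference, a stripped-down version of the setup looks roughly like this. It is only a sketch: the upstream model name (stg_events) and the column names are hypothetical, and it assumes the standard dbt Python model signature for the BigQuery adapter.

    # models/my_pyspark_model.py -- minimal sketch of a dbt Python model
    # that runs on Dataproc Serverless and is materialized as a BigQuery table.
    def model(dbt, session):
        dbt.config(
            submission_method="serverless",  # submit the job to Dataproc Serverless
            materialized="table",
        )

        # dbt resolves the upstream relation to a Spark DataFrame.
        df = dbt.ref("stg_events")

        # Newly added columns must also exist in the target BigQuery table,
        # otherwise the connector reports a column-count mismatch on write.
        df = df.withColumn("new_field", df["col1"])

        # dbt writes the returned DataFrame to BigQuery via the
        # spark-bigquery connector.
        return df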

What I’ve already tried

  • Adding the allowFieldAddition property in profiles.yml:
runtime_config:
  properties: 
    allowFieldAddition: 'true'
  • Setting the Spark config in the Python model (a writer-level sketch follows this list):
    # inside the Python model, tweaking the active Spark session's config
    global spark
    spark.conf.set("temporaryGcsBucket", "temp_bucket")
    spark.conf.set("allowFieldAddition", "true")

Some example code or error messages

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
GET https://www.googleapis.com/bigquery/v2/projects/*******/queries/*******************************?location=**************&maxResults=0&prettyPrint=false
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "location" : "q",
    "locationType" : "parameter",
    "message" : "Inserted row has wrong column count; Has 14, expected 8 at [4:30]",
    "reason" : "invalidQuery"
  } ],
  "message" : "Inserted row has wrong column count; Has 14, expected 8 at [4:30]",
  "status" : "INVALID_ARGUMENT"
}
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
	at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:576)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:692)
	... 60 more
23/08/08 05:08:26 WARN BigQueryDirectDataSourceWriterContext: BigQuery Data Source writer c0f75ced-4543-4722-b974-0be9bceecc4a aborted


Hi <@U05LS2DELJF>, any luck on this? I am having the exact same problem.
Cheers

Note: @r.barata originally posted this reply in Slack. It might not have transferred perfectly.

Hi pips,
Have you tried running the model on a newly released Dataproc cluster or Dataproc Serverless runtime? Dataproc Serverless release notes  |  Dataproc Serverless Documentation  |  Google Cloud
We were experiencing the same painful problem, but with the newer Dataproc images/runtime versions it disappeared, and we are now able to seamlessly add and remove columns.

I’m not sure the issue comes from an old image/version, since this same issue happens in SQL models for BigQuery.
What I think is happening is that the dbt backend works with a cached version of the table schema that does not get updated even when you set the config to allow field addition. This is especially true when the columns are not explicitly stated in the model code; for example, a select * in a model whose upstream changed in ways that are not reflected in the downstream model’s schema, hence the column mismatch.
I think this is a dbt bug.
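
One way to test that theory (and to work around it in the meantime) is to avoid select * and project the columns explicitly in the model, so the schema is stated in the code rather than inherited from whatever dbt has cached. A rough illustration with hypothetical model and column names:

    # Hypothetical illustration of the select * problem described above.
    def model(dbt, session):
        df = dbt.ref("stg_events")  # upstream model that recently gained columns

        # Equivalent of `select *`: silently inherits whatever schema the
        # upstream relation (or dbt's cached view of it) currently exposes.
        # return df

        # Explicit projection: a schema change has to be made here, in the
        # model code, instead of surfacing as a writer error in BigQuery.
        return df.select("col1", "col2", "new_field")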