The problem I’m having
We have encountered discrepancies between Spark and BigQuery when using dbt unit tests. Specifically, the issues arise when handling input fixture data for columns of type `ARRAY<STRING>`. Below is a more detailed explanation with examples to illustrate the problems.
The context of why I’m trying to do this
We are working on a dbt project migration from Spark to BigQuery. To ensure that our models behave consistently across both platforms, we rely on dbt unit tests with mock input data. However, discrepancies in how Spark and BigQuery handle `ARRAY<STRING>` columns in input fixtures are making it difficult to write and execute these tests effectively.
- In Spark:
  - If an input column of type `ARRAY<STRING>` is unused in the test, it can be omitted without causing any issues.
  - When a column is omitted, Spark handles it gracefully by interpreting it as `CAST(NULL AS ARRAY<STRING>)`.
- In BigQuery:
  - If a column is not mocked, BigQuery attempts to generate a `SAFE_CAST(NULL AS ARRAY)` statement, which results in a syntax error because BigQuery does not allow casting `NULL` to an array without specifying the element type.
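For illustration, here is a minimal sketch of the difference (the column names are hypothetical, not the SQL dbt literally emits):

```sql
-- Spark: an omitted ARRAY<STRING> column is rendered as a typed NULL, which parses and runs
select 1 as id, cast(null as array<string>) as tags;

-- BigQuery: the generated cast omits the element type, so the statement fails to parse
select 1 as id, safe_cast(null as array) as tags;  -- syntax error: ARRAY requires <element_type>
```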
Root Cause
The differences in behavior appear to stem from how Spark and BigQuery handle `NULL` values and array data during compilation and execution:
- Spark allows implicit casting of `NULL` to `ARRAY<STRING>` without requiring explicit type information.
- BigQuery requires explicit type information when casting `NULL` to an array.
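As a quick sanity check (a sketch based on standard BigQuery cast rules, not anything dbt-specific), the cast succeeds once the element type is spelled out:

```sql
-- Fails on BigQuery: ARRAY without an element type is not a valid type name
-- select safe_cast(null as array) as tags;

-- Works on BigQuery: the element type is explicit
select cast(null as array<string>) as tags;
```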
What I’ve already tried
I tried mocking the data in the CSV fixture in several different ways, as listed below.
1. Quoting the array value in the CSV, e.g. `"['a','b']"` or `"[""a"", ""b""]"`.
2. Setting the column's `data_type` to `array` in the dbt schema file.
3. Using `safe_cast` in the model, e.g. `safe_cast(tags as array) as tags`.
4. Using the YAML fixture format instead of CSV, which works fine (see the sketch after this list).
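For reference, a minimal sketch of the working YAML-style fixture (the model and input names are hypothetical):

```yaml
# A minimal sketch; dict is the default fixture format and, unlike csv,
# it supports complex types such as arrays.
unit_tests:
  - name: test_tags_model
    model: my_model                  # hypothetical model name
    given:
      - input: ref('source_table')   # hypothetical input
        format: dict                 # default; csv fixtures cannot express arrays
        rows:
          - {id: 1, tags: ['a', 'b', 'c']}
          - {id: 2, tags: ['x', 'y']}
          - {id: 3, tags: null}      # a null array is accepted here
    expect:
      rows:
        - {id: 1, tags: ['a', 'b', 'c']}
        - {id: 2, tags: ['x', 'y']}
        - {id: 3, tags: null}
```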
Some example code or error messages
The CSV input fixture in question:

```csv
id,tags
1,"['a', 'b', 'c']"
2,"['x', 'y']"
3,NULL
```