A dynamic schema test for nested & repeated BigQuery fields

Nested and repeated records (STRUCT and ARRAY of STRUCT types) in BigQuery are really powerful for performance and organizing, but can be frustrating to test in dbt because they require some extra SQL (namely, the use of UNNEST) to expose correctly.

I spent some time a few months ago working on a macro to dynamically unnest a column and run a not null test based on a . delimited specifier. I learned a lot about Jinja, especially its recursive capabilities. Here’s how I did it.

Example case

Imagine we have a nested order/order items table called orders that looks like this:

order_id items.item_id items.quantity promo_codes address.city address.state
6301418492 13939 1 [] New York NY
32482 4
59221 1
4929138818 13193 3 [HOLIDAY2019, EMP40] Baltimore MD

In this table, items is an ARRAY of STRUCT (nested and repeated), promo_codes is an ARRAY of STRING (repeated but not nested), and address is a STRUCT (nested but not repeated).

Our goal is to write a generalizable schema test called not_null_nr that finds null records in any kind of nested/repeated data at any depth of nesting.

We should be able to write our schema.yml file like this:

columns:
  - name: items.item_id
    tests:
      - not_null_nr
  - name: promo_codes
    tests:
      - not_null_nr
  - name: address.city
    tests:
      - not_null_nr

Writing the tests manually

Let’s start by imaging how we would write these tests manually in SQL. If we want to test that order_id is not null, we can write a simple test like this:

select order_id
from orders
where order_id is null

If we try to write the same test on a nested, repeated field like items.item_id

select items.item_id
from orders
where items.item_id is null

…we get an error from BigQuery:

Cannot access field item_id on a value with type ARRAY

Turns out, if we want to test item_id, we have to UNNEST and join our items field to our orders table. Now our test might look like this:

select order_id, item.item_id
from orders
left join unnest(items) item
where item.item_id is null

What about our address.city field, which is nested but not repeated? In that case the test is simpler because we don’t need to UNNEST.

select order_id, address.city
from orders
where address.city is null

But wait, there’s more! It’s also possible that we have a field that is repeated but not made up of STRUCT types, like an ARRAY of INT64.

Take a look at our promo_codes column, which falls into this category. We want to make sure none of those promo codes are null. We write a test in that situation like this:

select order_id, code
from orders
left join unnest(orders.promo_codes) code
where code is not null

In this test we still need to UNNEST, but we don’t need a . selector because the items in the array are not STRUCT type.

Generalizing the schema test

So to generalize this schema test properly, it needs to cover 3 cases*:

  1. The field is nested but not repeated, e.g. {'item_id': 100, 'qty': 10}
  2. The field is nested and repeated, e.g. [{'item_id': 100, 'qty': 1}, {'item_id': 200, 'qty': 2}]
  3. The field is not nested but is repeated, e.g. [1, 2, 3, 4]

*There is one more case, neither nested nor repeated, but I’m leaving it up to the user to run regular schema tests on those columns instead.

The macro should accept the name of the model and the name of the field to be tested. We should be able to specify any depth of nesting in the field name, e.g. level_1.level_2.level_3.level_4 and the macro should perform the appropriate UNNEST. This suggests some kind of recursion, which we’ll get into in a minute.

Constructing the macro

I start off by defining the macro itself and the parameters it should accept. It has to start with test_ so dbt recognizes it as a custom schema test.

{% macro test_not_null_nr(model, column_name) -%}

We’re going to be getting into some nested loops, and Jinja doesn’t have the ability to track variables from outer loops within the inner loop (see the section here on Scoping Behavior).

As a workaround, I create a namespace called vars with two variables, level and column_list, which will allow us to track some variables globally.

{%- set vars = namespace(level=1, column_list=[]) -%}

Next, I split the column_name on . so we can see the requested levels of nesting and assign it to a new variable called hierarchy.

{%- set hierarchy = column_name.split('.') -%}

I pull all the database fields from the model and get the one that corresponds to our top level hierarchy field so we have a rich representation of the column and its children. For example, dbt has a flag to tell us if columns are repeated or not based on the information it gets from the BigQuery.

The selectattr filter here gets the item in the list columns that has attribute name equal to the first element in our hierarchy list.

{%- set columns = adapter.get_columns_in_table(model.schema, model.table) -%}
{%- set root = columns | selectattr('name', 'equalto', hierarchy[0]) | list -%}

Now that we’ve set up our variables, I inject the basic SQL to construct the test, counting the number of rows with a FROM that corresponds to our first level (the model itself).

select count(*)
from {{ model }} level_{{ vars.level }}

Next, I’m going to recurse through the children of our root object until we arrive at the final level of hierarchy. Jinja2 has a recursive loop definition.

{% for child in root recursive %}

The loop is responsible for generating the correct UNNEST statements to unroll the table to a state where we can access the final field in hierarchy. Within the loop, I do the following:

  1. Add the name of the child to the column_list, which we’ll use for creating strings like level_1.level_2.level_3
  2. Whenever we reach a child that is a repeated field (as indicated by the mode attribute on the child), we increase vars.level, we need to UNNEST and join the field so we can access it.
  3. If the child has a fields attribute, we need to go deeper, so I call the loop variable to recurse on the field that matches the name in the corresponding level of hierarchy.
{% for child in root recursive %}
  {%- set vars.column_list = vars.column_list + [child.name] -%}
  {%- if child.mode == 'REPEATED' -%}
    {%- set vars.level = vars.level + 1 %}
    left join unnest(level_{{ vars.level - 1 }}.{{ vars.column_list | join('.') | string }}) as level_{{ vars.level }}
    {%- set vars.column_list = [] -%}
  {%- endif -%}

  {%- if child.fields -%}
    {{ loop(child.fields | selectattr('name', 'equalto', hierarchy[loop.depth]) | list) }}
  {%- endif -%}

{% endfor %}

Finally, I add a WHERE clause to perform the actual test. At this point, vars.level corresponds to the deepest level (the loop exited at) and we can build the field selector from our column_list.

where level_{{ vars.level }}
{%- if vars.column_list -%}
.{{ vars.column_list | join('.') | string }}
{%- endif %}
is null

The completed macro

As far as I can tell (I’ve tested this on a bunch of sample combinations), this works. Of course, adapting this for a unique test or other type of custom schema test will require more effort, but the bones should be there.

If you end up identifying any issues or improving on this in any way, let me know!

{% macro test_not_null_nested(model, column_name) -%}

{# Workaround to avoid Jinja loop scope issues #}
{%- set vars = namespace(level=1, column_list=[]) -%}

{%- set hierarchy = column_name.split('.') -%}
{%- set columns = adapter.get_columns_in_table(model.schema, model.table) -%}
{%- set root = columns | selectattr('name', 'equalto', hierarchy[0]) | list -%}
select count(*)
from {{ model }} level_{{ vars.level }}

{% for child in root recursive %}
{%- set vars.column_list = vars.column_list + [child.name] -%}
{%- if child.mode == 'REPEATED' -%}
{%- set vars.level = vars.level + 1 %}
left join unnest(level_{{ vars.level - 1 }}.{{ vars.column_list | join('.') | string }}) as level_{{ vars.level }}
{%- set vars.column_list = [] -%}
{%- endif -%}

{%- if child.fields -%}
{{ loop(child.fields | selectattr('name', 'equalto', hierarchy[loop.depth]) | list) }}
{%- endif -%}

{% endfor %}

where level_{{ vars.level }}
{%- if vars.column_list -%}
.{{ vars.column_list | join('.') | string }}
{%- endif %}
is null

{%- endmacro %}
5 Likes