Nested and repeated records (STRUCT
and ARRAY
of STRUCT
types) in BigQuery are really powerful for performance and organizing, but can be frustrating to test in dbt because they require some extra SQL (namely, the use of UNNEST
) to expose correctly.
I spent some time a few months ago working on a macro to dynamically unnest a column and run a not null test based on a .
delimited specifier. I learned a lot about Jinja, especially its recursive capabilities. Here’s how I did it.
Example case
Imagine we have a nested order/order items table called orders
that looks like this:
order_id | items.item_id | items.quantity | promo_codes | address.city | address.state |
---|---|---|---|---|---|
6301418492 | 13939 | 1 | New York | NY | |
32482 | 4 | ||||
59221 | 1 | ||||
4929138818 | 13193 | 3 | [HOLIDAY2019, EMP40] | Baltimore | MD |
In this table, items
is an ARRAY
of STRUCT
(nested and repeated), promo_codes
is an ARRAY of STRING
(repeated but not nested), and address
is a STRUCT
(nested but not repeated).
Our goal is to write a generalizable schema test called not_null_nr
that finds null records in any kind of nested/repeated data at any depth of nesting.
We should be able to write our schema.yml
file like this:
columns:
- name: items.item_id
tests:
- not_null_nr
- name: promo_codes
tests:
- not_null_nr
- name: address.city
tests:
- not_null_nr
Writing the tests manually
Let’s start by imaging how we would write these tests manually in SQL. If we want to test that order_id
is not null, we can write a simple test like this:
select order_id
from orders
where order_id is null
If we try to write the same test on a nested, repeated field like items.item_id
…
select items.item_id
from orders
where items.item_id is null
…we get an error from BigQuery:
Cannot access field item_id on a value with type ARRAY
Turns out, if we want to test item_id
, we have to UNNEST
and join our items
field to our orders
table. Now our test might look like this:
select order_id, item.item_id
from orders
left join unnest(items) item
where item.item_id is null
What about our address.city
field, which is nested but not repeated? In that case the test is simpler because we don’t need to UNNEST
.
select order_id, address.city
from orders
where address.city is null
But wait, there’s more! It’s also possible that we have a field that is repeated but not made up of STRUCT
types, like an ARRAY
of INT64
.
Take a look at our promo_codes
column, which falls into this category. We want to make sure none of those promo codes are null. We write a test in that situation like this:
select order_id, code
from orders
left join unnest(orders.promo_codes) code
where code is not null
In this test we still need to UNNEST
, but we don’t need a .
selector because the items in the array are not STRUCT
type.
Generalizing the schema test
So to generalize this schema test properly, it needs to cover 3 cases*:
- The field is nested but not repeated, e.g.
{'item_id': 100, 'qty': 10}
- The field is nested and repeated, e.g.
[{'item_id': 100, 'qty': 1}, {'item_id': 200, 'qty': 2}]
- The field is not nested but is repeated, e.g.
[1, 2, 3, 4]
*There is one more case, neither nested nor repeated, but I’m leaving it up to the user to run regular schema tests on those columns instead.
The macro should accept the name of the model and the name of the field to be tested. We should be able to specify any depth of nesting in the field name, e.g. level_1.level_2.level_3.level_4
and the macro should perform the appropriate UNNEST
. This suggests some kind of recursion, which we’ll get into in a minute.
Constructing the macro
I start off by defining the macro itself and the parameters it should accept. It has to start with test_
so dbt recognizes it as a custom schema test.
{% macro test_not_null_nr(model, column_name) -%}
We’re going to be getting into some nested loops, and Jinja doesn’t have the ability to track variables from outer loops within the inner loop (see the section here on Scoping Behavior).
As a workaround, I create a namespace
called vars
with two variables, level
and column_list
, which will allow us to track some variables globally.
{%- set vars = namespace(level=1, column_list=[]) -%}
Next, I split the column_name
on .
so we can see the requested levels of nesting and assign it to a new variable called hierarchy
.
{%- set hierarchy = column_name.split('.') -%}
I pull all the database fields from the model and get the one that corresponds to our top level hierarchy
field so we have a rich representation of the column and its children. For example, dbt has a flag to tell us if columns are repeated or not based on the information it gets from the BigQuery.
The selectattr
filter here gets the item in the list columns
that has attribute name
equal to the first element in our hierarchy
list.
{%- set columns = adapter.get_columns_in_table(model.schema, model.table) -%}
{%- set root = columns | selectattr('name', 'equalto', hierarchy[0]) | list -%}
Now that we’ve set up our variables, I inject the basic SQL to construct the test, counting the number of rows with a FROM
that corresponds to our first level (the model itself).
select count(*)
from {{ model }} level_{{ vars.level }}
Next, I’m going to recurse through the children of our root
object until we arrive at the final level of hierarchy
. Jinja2 has a recursive loop definition.
{% for child in root recursive %}
The loop is responsible for generating the correct UNNEST
statements to unroll the table to a state where we can access the final field in hierarchy
. Within the loop, I do the following:
- Add the
name
of the child to thecolumn_list
, which we’ll use for creating strings likelevel_1.level_2.level_3
- Whenever we reach a child that is a repeated field (as indicated by the
mode
attribute on the child), we increasevars.level
, we need toUNNEST
and join the field so we can access it. - If the child has a
fields
attribute, we need to go deeper, so I call theloop
variable to recurse on the field that matches the name in the corresponding level ofhierarchy
.
{% for child in root recursive %}
{%- set vars.column_list = vars.column_list + [child.name] -%}
{%- if child.mode == 'REPEATED' -%}
{%- set vars.level = vars.level + 1 %}
left join unnest(level_{{ vars.level - 1 }}.{{ vars.column_list | join('.') | string }}) as level_{{ vars.level }}
{%- set vars.column_list = [] -%}
{%- endif -%}
{%- if child.fields -%}
{{ loop(child.fields | selectattr('name', 'equalto', hierarchy[loop.depth]) | list) }}
{%- endif -%}
{% endfor %}
Finally, I add a WHERE
clause to perform the actual test. At this point, vars.level
corresponds to the deepest level (the loop exited at) and we can build the field selector from our column_list
.
where level_{{ vars.level }}
{%- if vars.column_list -%}
.{{ vars.column_list | join('.') | string }}
{%- endif %}
is null
The completed macro
As far as I can tell (I’ve tested this on a bunch of sample combinations), this works. Of course, adapting this for a unique test or other type of custom schema test will require more effort, but the bones should be there.
If you end up identifying any issues or improving on this in any way, let me know!
{% macro test_not_null_nested(model, column_name) -%}
{# Workaround to avoid Jinja loop scope issues #}
{%- set vars = namespace(level=1, column_list=[]) -%}
{%- set hierarchy = column_name.split('.') -%}
{%- set columns = adapter.get_columns_in_table(model.schema, model.table) -%}
{%- set root = columns | selectattr('name', 'equalto', hierarchy[0]) | list -%}
select count(*)
from {{ model }} level_{{ vars.level }}
{% for child in root recursive %}
{%- set vars.column_list = vars.column_list + [child.name] -%}
{%- if child.mode == 'REPEATED' -%}
{%- set vars.level = vars.level + 1 %}
left join unnest(level_{{ vars.level - 1 }}.{{ vars.column_list | join('.') | string }}) as level_{{ vars.level }}
{%- set vars.column_list = [] -%}
{%- endif -%}
{%- if child.fields -%}
{{ loop(child.fields | selectattr('name', 'equalto', hierarchy[loop.depth]) | list) }}
{%- endif -%}
{% endfor %}
where level_{{ vars.level }}
{%- if vars.column_list -%}
.{{ vars.column_list | join('.') | string }}
{%- endif %}
is null
{%- endmacro %}