I’m really diving into data testing at the moment (thanks to @joshtemple’s Medium post!). By default, dbt comes with 4 schema tests: not null, unique, relationships, and accepted values. dbt also supports custom schema tests, which let you define your own.
I’m interested in how others have defined custom schema tests for their data models. It’d be great to have several examples here to inspire the rest of us in brainstorming data tests that fit our data needs.
For example, I work with North American phone numbers a lot, so I have a custom schema test to check that phone number fields always contain ten-digit number (TDN) strings:
{# Verifies all elements in field are TDNs. If validation_errors returns >0 rows, test fails. #}
{% macro test_is_tdn(model, column_name) %}

with validation as (
    select
        {{ column_name }} as tdn_field
    from {{ model }}
),

validation_errors as (
    select
        tdn_field
    from validation
    where tdn_field not rlike '^\\d{10}$'
)

select count(*)
from validation_errors

{% endmacro %}
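For reference, here is a sketch of how this test could be attached to a column in a v2 schema.yml (the model and column names are just placeholders):

version: 2

models:
  - name: dim_customers        # placeholder model name
    columns:
      - name: phone_number     # placeholder column name
        tests:
          - is_tdn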
What are some custom data schema tests you’ve developed?
I have two simple modifications of the standard schema tests: not_null_where and unique_together:
{% macro test_not_null_where(model, field, condition) %}

with exceptions as (
    select count(*)
    from {{ model }}
    where {{ field }} is null
      and {{ condition }}
)

select * from exceptions

{% endmacro %}
{% macro test_unique_together(model, fields) %}

with exceptions as (
    select
        {% for field in fields %}
        {{ field }}{% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ model }}
    {{ dbt_utils.group_by(n=fields|length) }}
    having count(*) > 1
)

select count(*) from exceptions

{% endmacro %}
You use the following syntax to call these tests (v1 schema.yml):
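Roughly like this (a sketch; my_model, the field names, and the condition are placeholders, and the argument keys mirror the macro signatures above):

my_model:
  constraints:
    not_null_where:
      - {field: 'email', condition: "signup_source = 'web'"}
    unique_together:
      - {fields: ['order_id', 'order_line_number']}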
We have built a few custom schema tests for ourselves at PageUp, which can be viewed here. Some of these are specific to our use cases but are interesting nonetheless.
Equality for range tests that two tables have the same data in them for rows within a given ID range. This is useful for testing against a controlled data set, as it lets you test specific use cases on a subset of data while other data in the model may be used by other tests.
Has timestamp and Has version are tests that enforce certain design patterns of ours. For example, we have (had) a rule that every denormalised table or foreign key in a model should have its own timestamp column, which is then used for incremental updates later. This test ensures that a given column has a matching timestamp column, and that the timestamp column is populated whenever it needs to be.
Null when parent column null is a junk-data/model-design test. It ensures that when you denormalise a table, its columns are either all populated or (in the event that the foreign key was null) all empty.
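For illustration, a minimal sketch of what a test like that could look like as a custom schema test (the macro name and arguments are made up, not PageUp's actual implementation, and it checks a single denormalised column against its parent key):

{% macro test_null_when_parent_column_null(model, column_name, parent_column) %}

-- fails if the denormalised column and its parent key disagree about being populated
select count(*)
from {{ model }}
where ({{ parent_column }} is null and {{ column_name }} is not null)
   or ({{ parent_column }} is not null and {{ column_name }} is null)

{% endmacro %}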
We put all of the custom schema tests we find useful in dbt-utils.
One pattern which is super simple but really useful is the expression_is_true test, which lets you define your own logic without writing the SQL for a test. I used this a lot in financial models when I had to validate that subtotal columns added up to a total, that financial values were positive, and so on, so my schema.yml would look like:
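Something along these lines, using dbt_utils.expression_is_true (the model and column names here are placeholders):

version: 2

models:
  - name: fct_invoices          # placeholder model name
    tests:
      - dbt_utils.expression_is_true:
          expression: "line_subtotal + tax_amount = invoice_total"
      - dbt_utils.expression_is_true:
          expression: "invoice_total >= 0"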
We wanted to be able to test that numbers (float or not) fall between two given values, so we did this:
{% macro test_is_between(model, column_name, bottom_number, top_number) %}

with validation as (
    select
        {{ column_name }} as field_to_test
    from {{ model }}
),

validation_errors as (
    select
        field_to_test
    from validation
    where field_to_test > {{ top_number }}
       or field_to_test < {{ bottom_number }}
)

select count(*)
from validation_errors

{% endmacro %}
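In a v2 schema.yml it could be applied like this (the model, column, and bounds are placeholders):

version: 2

models:
  - name: fct_orders             # placeholder model name
    columns:
      - name: discount_percent   # placeholder column name
        tests:
          - is_between:
              bottom_number: 0
              top_number: 100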
Hey ted,
Quick question:
If I want to set “condition” as an optional argument and set a default value for it, then can I do it like this:
{% macro test_not_null_where(model, field, condition="1=1", condition2="1=1") %}

with exceptions as (
    select count(*)
    from {{ model }}
    where {{ field }} is null
      and {{ condition }}
      and {{ condition2 }}
)

select * from exceptions

{% endmacro %}
Also, can I use it in a test block instead of a macro block?
Our team needed to delineate between warning and error thresholds for a number of tests, including dbt built-ins like unique and not_null, based on a configurable threshold. For the built-ins, we overrode the core implementation, adding support for thresholds and severity. Here is our overridden implementation of unique:
{% test unique(model, column_name, severity_type='warn', percent_threshold=0.0) %}

-- overridden unique test allowing for a threshold of non-unique values
-- the threshold is calculated as the number of values with more than one row / the total number of distinct values
{% if severity_type == 'warn' %}
    {{ config(tags=["critical"], severity=severity_type, warn_if='>0') }}
{%- else -%}
    {{ config(tags=["critical"], severity=severity_type, error_if='>0') }}
{% endif %}

with row_count as (
    select count(1) as total_rows
    from {{ model }}
),

total_distinct_values as (
    select count(distinct {{ column_name }}) as distinct_vals_count
    from {{ model }}
    where {{ column_name }} is not null
),

non_unique_values as (
    select
        {{ column_name }} as unique_field,
        count(*) as n_records
    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1
),

error_count as (
    select count(*) as n_errors
    from non_unique_values
),

combined as (
    select
        case
            when distinct_vals_count > 0 then
                case
                    when cast(n_errors as decimal) / distinct_vals_count <= {{ percent_threshold }} then 0
                    else cast(n_errors as decimal) / distinct_vals_count
                end
            else 0
        end as result
    from error_count
    cross join total_distinct_values
)

select
    result as percentage_duplicate_values,
    v.*
from combined a
cross join non_unique_values v
where a.result <> 0

{% endtest %}
In order to support different thresholds at different severities, we actually have to apply two differently configured unique tests to the same column in the YAML config, such as:
- name: SOME_FIELD
  data_type: varchar
  description: >
    A field that is supposed to have unique values, but we know there are a small
    number of cases outside of our team's control where non-unique values occur.
    We want to be aware those exist, but only want to error if we see a lot of cases.
  tests:
    - unique:
        severity_type: 'warn'
        percent_threshold: 0.000001
    - unique:
        severity_type: 'error'
        percent_threshold: 0.0001
My post is over 4 years old. I’d recommend using the built-in not_null test, which now supports the where config (like all generic tests). You would use it like so:
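(The model, column, and where condition below are placeholders.)

version: 2

models:
  - name: my_model              # placeholder model name
    columns:
      - name: my_column         # placeholder column name
        tests:
          - not_null:
              config:
                where: "order_date >= '2020-01-01'"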