Examples of custom schema tests

I’m really diving into data testing at the moment (thanks to @joshtemple’s Medium post!). By default, dbt comes with four schema tests: not_null, unique, relationships, and accepted_values. dbt also supports custom schema tests, which let you define additional checks of your own.

I’m interested in how others have defined custom schema tests for their data models. It’d be great to have several examples here to inspire the rest of us in brainstorming data tests that fit our data needs.

For example, I work with North American phone numbers a lot, so I have a custom schema test to check that phone number fields always contain ten-digit number (TDN) strings:

{# Verifies all elements in field are TDNs. If validation_errors returns >0 rows, test fails. #}

{% macro test_is_tdn(model, column_name) %}

with validation as (
  select
    {{ column_name }} as tdn_field
  from {{ model }}
),

validation_errors as (
  select
    tdn_field
  from validation
  where tdn_field not rlike '^\\d{10}$'
)

select count(*)
from validation_errors

{% endmacro %}
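
For reference, hooking this up in a v2 schema.yml looks something like the below (the model and column names are just placeholders for your own):

version: 2

models:
  - name: customers          # placeholder model name
    columns:
      - name: phone_number   # placeholder column holding TDN strings
        tests:
          - is_tdn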

What are some custom data schema tests you’ve developed?


I have two simple variations on the standard schema tests: not_null_where and unique_together:

{% macro test_not_null_where(model, field, condition) %}

  with exceptions as (
    select
      count(*)

    from
      {{ model }}

    where
      {{ field }} is null
      and {{ condition }}

  )

  select * from exceptions

{% endmacro %}
{% macro test_unique_together(model, fields) %}

  with exceptions as (
    select
      {% for field in fields %}
        {{ field }}{% if not loop.last %},{% endif %}
      {% endfor %}

    from
      {{ model }}

    {{ dbt_utils.group_by(n=fields|length) }}

    having
      count(*) > 1
  )

  select count(*) from exceptions

{% endmacro %}

You use the following syntax to call these tests (v1 schema.yml):

not_null_where:
  - {field: 'field_1', condition: "another_field = 'something'"}
  - {field: 'field_1', condition: "another_field > 0"}
  - {field: 'field_1', condition: "not(field_2 = 'a' and field_3 = 'b')"}
unique_together:
  - {fields: ['field_1', 'field_2']}
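
If you’re on a newer dbt version, dbt-utils now ships a very similar test to unique_together: dbt_utils.unique_combination_of_columns. In a v2 schema.yml it looks something like:

version: 2

models:
  - name: my_model
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - field_1
            - field_2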

We have built a few custom schema tests for ourselves at PageUp, which can be viewed here. Some of these are specific to our use cases but are interesting nonetheless.

  • Equality for range tests that two tables contain the same data for rows in a given id range. This is useful for testing against a controlled data set, as it lets you test specific use cases on a subset of data while other data in the model is used by other tests.
  • Has timestamp and Has version enforce some of our design patterns. For example, we have (had) a rule that every denormalised table or foreign key in a model should have its own timestamp column, which is then used later for incremental updates. The test ensures that a given column has a matching timestamp column, and that the timestamp column is populated whenever it needs to be.
  • Null when parent column null is a junk-data/model-design test. It ensures that when you denormalise a table, the denormalised columns are either all populated or (in the event that the foreign key was null) all not populated; see the sketch below.
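
To make that last one concrete, here is a minimal sketch of the idea (an illustration only, not our actual implementation; the macro and argument names are made up):

{% macro test_null_when_parent_column_null(model, column_name, parent_column) %}

{# Counts rows where the parent foreign key is null but the denormalised column is populated; a non-zero count fails the test. #}
select count(*)
from {{ model }}
where {{ parent_column }} is null
  and {{ column_name }} is not null

{% endmacro %}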

We put all of the custom schema tests we find useful in dbt-utils.

One pattern which is super simple but really useful is the expression_is_true test: it lets you assert your own logic without writing the SQL for a whole test. I used this a lot in financial models where I had to validate that subtotal columns added up to a total, that financial values were positive, and so on, so my schema.yml would look like:

version: 2

models:
  - name: payments
    tests:
      - dbt_utils.expression_is_true:
          expression: "inbound_amount = payout_amount + refund_amount"
      - dbt_utils.expression_is_true:
          expression: "inbound_amount = inbound_cash_amount + inbound_credit_amount"
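
If I remember correctly, expression_is_true can also be applied at the column level, in which case the column name is prepended to the expression:

models:
  - name: payments
    columns:
      - name: inbound_amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"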

We wanted to be able to test that numeric values (float or integer) fell between a given pair of bounds. So we did this:

{% macro test_is_between(model, column_name, bottom_number, top_number) %}

with validation as (

    select
        {{ column_name }} as field_to_test

    from {{ model }}

),

validation_errors as (

    select
        field_to_test

    from validation
    where field_to_test > {{ top_number }} or field_to_test < {{ bottom_number }}

)

select count(*)
from validation_errors

{% endmacro %}

And here is an example of a test:

  - name: flight_end_long
    tests:
      - is_between: {'bottom_number' : -180, 'top_number' : 180}

I’m using the is_between macro now, thanks! For schema.yml v2, use:

        - name: flight_end_long
          tests:
              - is_between:
                  bottom_number: -180
                  top_number: 180

Hey ted,
Quick question:
If I want to make “condition” an optional argument with a default value, can I do it like this:

{% macro test_not_null_where(model, field, condition="1=1", condition2="1=1") %}

  with exceptions as (
    select
      count(*)

    from
      {{ model }}

    where
      {{ field }} is null
      and {{ condition }}
      and {{ condition2 }}

  )

  select * from exceptions

{% endmacro %}

Also, can I use it in a test block instead of a macro block?
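
For what it’s worth, on dbt v0.20+ the same pattern can be written as a test block, and default argument values work there too. A minimal sketch (note that modern generic tests return the failing rows rather than a count):

{% test not_null_where(model, column_name, condition="1=1") %}

  {# Returns the failing rows; the test passes when no rows come back. #}
  select *
  from {{ model }}
  where {{ column_name }} is null
    and {{ condition }}

{% endtest %}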

Our team needed to distinguish between warning and error severities for a number of tests, including dbt built-ins like unique and not_null, based on a configurable threshold. For the built-ins, we overrode the core implementation to add threshold and severity support. Here is our overridden implementation of unique:

{% test unique(model, column_name, severity_type='warn', percent_threshold=0.0) %}
-- overridden unique test allowing for a threshold of non-unique values
-- the threshold is calculated as the number of values with more than one row divided by the total number of distinct values
{% if severity_type == 'warn' %}
    {{ config(tags=["critical"], severity = severity_type, warn_if = '>0') }}
{%- else -%}
    {{ config(tags=["critical"], severity = severity_type, error_if = '>0') }}
{% endif %}

with row_count as (
    select
        count(1) as total_rows
    from
        {{ model }}
),
total_distinct_values as (
    select
        count(distinct {{ column_name }}) as distinct_vals_count
    from
        {{ model }}
    where
        {{ column_name }} is not null
)
, non_unique_values as (
    select
        {{ column_name }} as unique_field,
        count(*) as n_records
    from
        {{ model }}
    where
        {{ column_name }} is not null
    group by
        {{ column_name }}
    having count(*) > 1
)
, error_count as (
    select
        count(*) as n_errors
    from
        non_unique_values
)
, combined as (
    select
        case
            when distinct_vals_count > 0 then
                case
                    when cast(n_errors as decimal)/distinct_vals_count <= {{ percent_threshold }} then 0
                    else cast(n_errors as decimal)/distinct_vals_count
                end
            else 0
        end as result
    from
        error_count
        cross join total_distinct_values
)
select
  result as percentage_duplicate_values
  , v.*
from
  combined a
  cross join non_unique_values v
where a.result <> 0

{% endtest %}

In order to pair different thresholds with different severities, we actually have to apply two differently configured unique tests to the same column in the yaml config, such as:

    - name: SOME_FIELD
      data_type: varchar
      description: A field that is supposed to have unique values, but we know there are a small number of non-unique values that are outside of our team's control. We want to be aware those exist, but only want to error if we see a lot of cases.
      tests:
        - unique:
            severity_type: 'warn'
            percent_threshold: 0.000001
        - unique:
            severity_type: 'error'
            percent_threshold: 0.0001

My post is over 4 years old. I’d recommend using the built-in not_null test, which now supports the where config (like all generic tests). You would use it like so:

models:
  - name: my_model
    columns:
      - name: my_not_null_col
        tests:
          - not_null:
              config:
                where: >
                  foo == 1
                  and bar == 2