Advanced schema tests; thresholds, exclusions and date limits

At Rezdy we’ve got our fair share of failing dbt tests and it goes without saying that when you have a failing test you should always address the root cause of the problem.

That being said it’s a lot faster just to sweep non-critical failing tests under the rug and get on with your day. These “advanced” schema tests are like the trap door under the rug.

Inspired by a post on creating error thresholds for schema tests, we set out to create a malleable dbt schema test for our most troublesome tests(unique and not_null)

The test handles 3 cases:

  1. Date scoping the dataset before running the test
  2. Implementing either a absolute or relative error threshold for row failure
  3. Explicitly exclude dodgy rows from tests

Date scoping has been covered a fair bit already and is likely to be made redundant by either dbt_utils or the new capability of the not_null schema test in dbt v0.20.0 and thresholds have received a fair bit of attention so this post is mostly about the more idiosyncratic test exclusions option.

The test exclusion functionality allows us to test a dataset as normal but put in a small clause such as:

Even though all customers who sign up should have a trial event before they have a purchase event(trial_id not null in fct_purchase), there was that one time back in 2018 where a customer managed to somehow do it because of a bug. We’ll make a note of the customers ID and the reason why they don’t have a trial in our exclusion file (a seed file or a google sheet if you can swing it Hail BigQuery) then move on with our lives.

Note:

  • There are separate test macros for not_null and unique but they are identical except for the logic that captures duplicates or nulls in the perform_test CTE.
  • These models are currently running on a BigQuery warehouse, you might need to tweak some syntax to get it to run on other warehouses.*
  • The two supporting macros; get_model_ref and exclude_rows_from_test are also included down the bottom.
  • You can name the macros test_not_null and test_unique without the adv prefix and they will just replace the ones that come with dbt, but then you have to go and update your whole project.
  • tbl_test_exclusions is the name of the table where we store both the rows that will be excluded and the reason for doing so you can call this whatever you like.
  • If you’ve been doing some black magic with condition blocks and ref statements and encountered the below test ERROR, it might be worth checking out the get_model_ref macro down the bottom.
Compilation Error in test some_test_name (models/staging/something/something.yml)
  dbt was unable to infer all dependencies for the model "some_test_name".
  This typically happens when ref() is placed within a conditional block.

  To fix this, add the following hint to the top of the model "some_test_name":

Not Null


/* -----------------------------------------------------

This is an advanced not_null schema test.
It is used the same way the normal not_null schema test is used inside the .yml file
however instead of:
              - not_null
use
              - adv_not_null:

The test has 3 functions:
 - Thresholds (absolute and relative)
 - Date Limits (by day)
 - Row Exclusions (nit picking out bad rows)

Usage Notes:
  Thresholds are calculated after date scoping.
  When testing relative thresholds use values 1 to 100 indicating the % of bad rows expected.
  The error output for threshold fields indicates the number of fails beyond the threshold.
    not the absolute number of fails. (The absolute number can be obtained from a normal test.)
  A single test can use all 3 functions.
  For specifcs on the row exclusions functionality check the exclude_rows_from_test macro


config for .yml file
            - adv_unique:
                model:                          # Taken from the model the test is on
                column_name:                    # Taken from the column the test is on
                event_timestamp_column_name:    # [The date column for date limit tests]
                number_of_days_to_check:        # [The number of days for date limit tests]
                threshold_type:                 # relative / absolute (relative in %)
                threshold_value:                # integer to indicate accepted errors for abs/rel
                column_to_exclude_on:           # if the column has exclusions on a field, enter that column name here

*/ -----------------------------------------------------
{%macro test_adv_not_null(
  model,
  column_name,
  event_timestamp_column_name = None,
  number_of_days_to_check = None,
  threshold_type = None,
  threshold_value = 1,
  column_to_exclude_on = None ) %}

/* -----------------------------------------------------
Scope to valid test rows.
*/ -----------------------------------------------------
WITH valid_test_rows AS (
SELECT *
  FROM {{ model }}
 WHERE 1 = 1

   {% if column_to_exclude_on !=  None %} --This will check your test exclusion
    AND {{ exclude_rows_from_test( model.name, column_to_exclude_on ) }}
   {% endif %}

   {% if number_of_days_to_check != None and event_timestamp_column_name != None%}
    AND {{ event_timestamp_column_name }} > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{number_of_days_to_check}} DAY)
   {% endif %}
 )
/*-----------------------------------------------------
Set Threshold if applicable.
*/ -----------------------------------------------------

, threshold_limit AS (
SELECT

    {% if threshold_type ==  'relative' %}
      ROUND(COUNT(*) * ({{ threshold_value }} / 100),0) AS max_errors  --find the maximum amount of rows that can fail the test
      FROM valid_test_rows --This allows relative tests to be effective on date range tests

    {% elif threshold_type ==  'absolute' %}
      {{ threshold_value }} AS max_errors  --find the maximum amount of rows that can fail the test

    {% else %} --ELSE
      0 AS max_errors  --find the maximum amount of rows that can fail the test
    {% endif %}
)

/*-----------------------------------------------------
Perform the Test.
*/-----------------------------------------------------

, perform_test AS (

        SELECT *
        FROM valid_test_rows
        WHERE {{ column_name }} IS NULL


)
/*-----------------------------------------------------
Count the Errors.
*/ -----------------------------------------------------
, validation_errors AS (
SELECT
COUNT(*) AS n_errors
FROM perform_test
)
/*-----------------------------------------------------
Check if the number of errors is greater than the threshold.
*/ -----------------------------------------------------
SELECT
    CASE WHEN ve.n_errors > tl.max_errors THEN ve.n_errors - tl.max_errors
         WHEN ve.n_errors < tl.max_errors THEN 0
    ELSE 0
    END AS result
FROM validation_errors ve
CROSS JOIN threshold_limit tl


   {% endmacro %}


Unique

/* -----------------------------------------------------
This is an advanced unique schema test.
It is used the same way the normal unique schema test is used inside the .yml file
however instead of:
              - unique
use
              - adv_unique:

The test has 3 functions:
 - Thresholds (absolute and relative)
 - Date Limits (by day)
 - Row Exclusions

Usage Notes:
  Thresholds are calculated after date scoping.
  When testing relative thresholds use values 1 to 100 indicating the % of bad rows expected.
  The error output for threshold fields indicates the number of fails beyond the threshold.
    not the absolute number of fails. (The absolute number can be obtained from a normal test.)
  A single test can use all 3 functions.
  For specifics on the row exclusions functionality check the exclude_rows_from_test macro


config for .yml file
            - adv_unique:
                model:                          # Taken from the model the test is on
                column_name:                    # Taken from the column the test is on
                event_timestamp_column_name:    # [The date column for date limit tests]
                number_of_days_to_check:        # [The number of days for date limit tests]
                threshold_type:                 # relative / absolute (relative in %)
                threshold_value:                # integer to indicate accepted errors for abs/rel
                column_to_exclude_on:           # if the column has exclusions on a field, enter that column name here

*/ -----------------------------------------------------
{%macro test_adv_unique(
  model,
  column_name,
  event_timestamp_column_name = None,
  number_of_days_to_check = None,
  threshold_type = None,
  threshold_value = 1,
  column_to_exclude_on = None ) %}

/* -----------------------------------------------------
Scope to valid test rows.
*/ -----------------------------------------------------
WITH valid_test_rows AS (
SELECT *
  FROM {{ model }}
 WHERE 1 = 1

   {% if column_to_exclude_on !=  None %} --This will check your test exclusion
    AND {{ exclude_rows_from_test( model.name, column_to_exclude_on ) }}
   {% endif %}

   {% if number_of_days_to_check != None and event_timestamp_column_name != None%}
    AND {{ event_timestamp_column_name }} > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{number_of_days_to_check}} DAY)
   {% endif %}
 )
/*-----------------------------------------------------
Set Threshold if applicable.
*/ -----------------------------------------------------

, threshold_limit AS (
SELECT

    {% if threshold_type ==  'relative' %}
    ROUND(COUNT(*) * ({{ threshold_value }} / 100),0) AS max_errors  --find the maximum amount of rows that can fail the test
      FROM valid_test_rows --This allows relative tests to be effevtive on date range tests

    {% elif threshold_type ==  'absolute' %}
      {{ threshold_value }} AS max_errors  --find the maximum amount of rows that can fail the test

    {% else %} --ELSE
      0 AS max_errors  --find the maximum amount of rows that can fail the test
    {% endif %}
)

/*-----------------------------------------------------
Perform the Test.
*/-----------------------------------------------------

, perform_test AS (

        select
            {{ column_name }}

        from valid_test_rows
        where {{ column_name }} is not null
        group by {{ column_name }}
        having count(*) > 1


)
/*-----------------------------------------------------
Count the Errors.
*/ -----------------------------------------------------
, validation_errors AS (
SELECT
COUNT(*) AS n_errors
FROM perform_test
)
/*-----------------------------------------------------
Check if the number of errors is greater than the threshold.
*/ -----------------------------------------------------
SELECT
    CASE WHEN ve.n_errors > tl.max_errors THEN ve.n_errors - tl.max_errors
         WHEN ve.n_errors < tl.max_errors THEN 0
    ELSE 0
    END AS result
FROM validation_errors ve
CROSS JOIN threshold_limit tl


   {% endmacro %}

The Test Exclusion Macro

/* -----------------------------------------------------
This is the macro that allows the advanced schema tests to access the tbl_test_exclusions 
table.

Arguments:
- model_name: The model you wish to exclude from
- column_name: The column name you wish to exclude on

Output:
A SQL clause that will exclude rows from a table.

Usage:
  Example use case:
  In 2019 a bug allowed a single customer (dim_customer.customer_id =123)
  to be able to sign up without a last name. Then the not_null schema test on
  dim_customer.last_name would always fail on a single row. How Annoying!

  Usage of this macro depends on the configuration of the tbl_test_exclusions
  table schema. It should have the following columns:
    id:               surrogate key for the exclusion table.[N]
    exclusion_value:  unique key of the model that you wish to exclude a row from for testing.[123]
    model_name:       The name of the model that you wish to exclude a row from. [dim_customer]
    column_name:      The name of the column you wish to exclude on. [customer_id]
    owner:            String field to indicate who decided to exclude the row from testing.[your_name]
    description:      String field to explain why this row is special. [Bug back in 2019 that allowed a single customer to sign up with no last name.]

Notes:
The exclusion table can be any table in the database. dbt seed files and Gsheets
work well.
The macro should be placed inside a SQL WHERE clause, either:
- after WHERE
- or after AND
Because of the way get_mode_ref() works, if you choose to create a base table for your
tbl_test_exclusions model you will need to set the schema=none


*/ -----------------------------------------------------
{% macro exclude_rows_from_test(model_name, column_name) %}
CAST({{ column_name }} AS STRING) NOT IN
(
SELECT CAST(exclusion_value AS STRING) AS exclusion_value  -- eg, id
   FROM {{ get_model_ref('tbl_test_exclusions') }} RIRE
  WHERE
  RIRE.model_name =  '{{ model_name }}'
  AND
  RIRE.column_name =  '{{ column_name }}'
)
{% endmacro %}


The base table for the test exclusions macro


{{
  config(
    materialized='table',
    schema=none
  )
}}
/* -----------------------------------------------------

Scope: Identifying data for rows in models that we wish to exclude from certain tests because
they will cause the test to constantly fail.

Grain: id(surrogate key from source table which is a gsheet.)

Purpose: to create a base model for the test exclusion dataset.

Config:
This model has to be a table because the source data is a a gsheet which has strange errors
when it is repeatedly queried.

The schema must be set to none because it is referred to using the get_model_ref macro
which is designed to get around a strange issue with using refs in conditional blocks.
https://github.com/dbt-labs/dbt/issues/1077.


Notes:

*/ -----------------------------------------------------

SELECT rite.id,
       rite.exclusion_value,
       rite.model_name,
       rite.column_name,
       rite.owner,
       rite.description
  FROM {{ source('internal_datasets','tbl_test_exclusions') }} rite
 WHERE rite.id IS NOT NULL
   AND rite.exclusion_value IS NOT NULL
   AND rite.model_name IS NOT NULL
   AND rite.column_name IS NOT NULL


Handy macro "get_model_ref"

/*
Purpose:
to provide an alternative for {{ ref('model_name') }}
that will allow dbt to compile:
- when using a diabled model
- when there is a {{ ref('model_name') }} inside a conditional block.

Arguments:
    model_name: the model name that might be disabled but needs to be referenced

Output:
project_id.dataset_name.table_name

Notes:
This doesn't include any custom schemas.
*/

{% macro get_model_ref(model_name) %}
`{{target.project}}.{{this.schema}}.{{model_name}}`
{% endmacro %}

3 Likes

This is cool! We’ve historically filtered out handfuls of misbehaving records by adding their IDs to the where clause of the standard tests, but if you’ve got a lot of them and want more observability this is more scalable once it’s properly set up.