Advanced schema tests; thresholds, exclusions and date limits

dominic · July 18, 2021, 9:04am

At Rezdy we’ve got our fair share of failing dbt tests and it goes without saying that when you have a failing test you should always address the root cause of the problem.

That being said it’s a lot faster just to sweep non-critical failing tests under the rug and get on with your day. These “advanced” schema tests are like the trap door under the rug.

Inspired by a post on creating error thresholds for schema tests, we set out to create a malleable dbt schema test for our most troublesome tests(unique and not_null)

The test handles 3 cases:

Date scoping the dataset before running the test
Implementing either a absolute or relative error threshold for row failure
Explicitly exclude dodgy rows from tests

Date scoping has been covered a fair bit already and is likely to be made redundant by either dbt_utils or the new capability of the not_null schema test in dbt v0.20.0 and thresholds have received a fair bit of attention so this post is mostly about the more idiosyncratic test exclusions option.

The test exclusion functionality allows us to test a dataset as normal but put in a small clause such as:

Even though all customers who sign up should have a trial event before they have a purchase event(trial_id not null in fct_purchase), there was that one time back in 2018 where a customer managed to somehow do it because of a bug. We’ll make a note of the customers ID and the reason why they don’t have a trial in our exclusion file (a seed file or a google sheet if you can swing it Hail BigQuery) then move on with our lives.

Note:

There are separate test macros for not_null and unique but they are identical except for the logic that captures duplicates or nulls in the perform_test CTE.
These models are currently running on a BigQuery warehouse, you might need to tweak some syntax to get it to run on other warehouses.*
The two supporting macros; get_model_ref and exclude_rows_from_test are also included down the bottom.
You can name the macros test_not_null and test_unique without the adv prefix and they will just replace the ones that come with dbt, but then you have to go and update your whole project.
tbl_test_exclusions is the name of the table where we store both the rows that will be excluded and the reason for doing so you can call this whatever you like.
If you’ve been doing some black magic with condition blocks and ref statements and encountered the below test ERROR, it might be worth checking out the get_model_ref macro down the bottom.

Compilation Error in test some_test_name (models/staging/something/something.yml)
  dbt was unable to infer all dependencies for the model "some_test_name".
  This typically happens when ref() is placed within a conditional block.

  To fix this, add the following hint to the top of the model "some_test_name":

Not Null


/* -----------------------------------------------------

This is an advanced not_null schema test.
It is used the same way the normal not_null schema test is used inside the .yml file
however instead of:
              - not_null
use
              - adv_not_null:

The test has 3 functions:
 - Thresholds (absolute and relative)
 - Date Limits (by day)
 - Row Exclusions (nit picking out bad rows)

Usage Notes:
  Thresholds are calculated after date scoping.
  When testing relative thresholds use values 1 to 100 indicating the % of bad rows expected.
  The error output for threshold fields indicates the number of fails beyond the threshold.
    not the absolute number of fails. (The absolute number can be obtained from a normal test.)
  A single test can use all 3 functions.
  For specifcs on the row exclusions functionality check the exclude_rows_from_test macro


config for .yml file
            - adv_unique:
                model:                          # Taken from the model the test is on
                column_name:                    # Taken from the column the test is on
                event_timestamp_column_name:    # [The date column for date limit tests]
                number_of_days_to_check:        # [The number of days for date limit tests]
                threshold_type:                 # relative / absolute (relative in %)
                threshold_value:                # integer to indicate accepted errors for abs/rel
                column_to_exclude_on:           # if the column has exclusions on a field, enter that column name here

*/ -----------------------------------------------------
{%macro test_adv_not_null(
  model,
  column_name,
  event_timestamp_column_name = None,
  number_of_days_to_check = None,
  threshold_type = None,
  threshold_value = 1,
  column_to_exclude_on = None ) %}

/* -----------------------------------------------------
Scope to valid test rows.
*/ -----------------------------------------------------
WITH valid_test_rows AS (
SELECT *
  FROM {{ model }}
 WHERE 1 = 1

   {% if column_to_exclude_on !=  None %} --This will check your test exclusion
    AND {{ exclude_rows_from_test( model.name, column_to_exclude_on ) }}
   {% endif %}

   {% if number_of_days_to_check != None and event_timestamp_column_name != None%}
    AND {{ event_timestamp_column_name }} > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{number_of_days_to_check}} DAY)
   {% endif %}
 )
/*-----------------------------------------------------
Set Threshold if applicable.
*/ -----------------------------------------------------

, threshold_limit AS (
SELECT

    {% if threshold_type ==  'relative' %}
      ROUND(COUNT(*) * ({{ threshold_value }} / 100),0) AS max_errors  --find the maximum amount of rows that can fail the test
      FROM valid_test_rows --This allows relative tests to be effective on date range tests

    {% elif threshold_type ==  'absolute' %}
      {{ threshold_value }} AS max_errors  --find the maximum amount of rows that can fail the test

    {% else %} --ELSE
      0 AS max_errors  --find the maximum amount of rows that can fail the test
    {% endif %}
)

/*-----------------------------------------------------
Perform the Test.
*/-----------------------------------------------------

, perform_test AS (

        SELECT *
        FROM valid_test_rows
        WHERE {{ column_name }} IS NULL


)
/*-----------------------------------------------------
Count the Errors.
*/ -----------------------------------------------------
, validation_errors AS (
SELECT
COUNT(*) AS n_errors
FROM perform_test
)
/*-----------------------------------------------------
Check if the number of errors is greater than the threshold.
*/ -----------------------------------------------------
SELECT
    CASE WHEN ve.n_errors > tl.max_errors THEN ve.n_errors - tl.max_errors
         WHEN ve.n_errors < tl.max_errors THEN 0
    ELSE 0
    END AS result
FROM validation_errors ve
CROSS JOIN threshold_limit tl


   {% endmacro %}

Unique

/* -----------------------------------------------------
This is an advanced unique schema test.
It is used the same way the normal unique schema test is used inside the .yml file
however instead of:
              - unique
use
              - adv_unique:

The test has 3 functions:
 - Thresholds (absolute and relative)
 - Date Limits (by day)
 - Row Exclusions

Usage Notes:
  Thresholds are calculated after date scoping.
  When testing relative thresholds use values 1 to 100 indicating the % of bad rows expected.
  The error output for threshold fields indicates the number of fails beyond the threshold.
    not the absolute number of fails. (The absolute number can be obtained from a normal test.)
  A single test can use all 3 functions.
  For specifics on the row exclusions functionality check the exclude_rows_from_test macro


config for .yml file
            - adv_unique:
                model:                          # Taken from the model the test is on
                column_name:                    # Taken from the column the test is on
                event_timestamp_column_name:    # [The date column for date limit tests]
                number_of_days_to_check:        # [The number of days for date limit tests]
                threshold_type:                 # relative / absolute (relative in %)
                threshold_value:                # integer to indicate accepted errors for abs/rel
                column_to_exclude_on:           # if the column has exclusions on a field, enter that column name here

*/ -----------------------------------------------------
{%macro test_adv_unique(
  model,
  column_name,
  event_timestamp_column_name = None,
  number_of_days_to_check = None,
  threshold_type = None,
  threshold_value = 1,
  column_to_exclude_on = None ) %}

/* -----------------------------------------------------
Scope to valid test rows.
*/ -----------------------------------------------------
WITH valid_test_rows AS (
SELECT *
  FROM {{ model }}
 WHERE 1 = 1

   {% if column_to_exclude_on !=  None %} --This will check your test exclusion
    AND {{ exclude_rows_from_test( model.name, column_to_exclude_on ) }}
   {% endif %}

   {% if number_of_days_to_check != None and event_timestamp_column_name != None%}
    AND {{ event_timestamp_column_name }} > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{number_of_days_to_check}} DAY)
   {% endif %}
 )
/*-----------------------------------------------------
Set Threshold if applicable.
*/ -----------------------------------------------------

, threshold_limit AS (
SELECT

    {% if threshold_type ==  'relative' %}
    ROUND(COUNT(*) * ({{ threshold_value }} / 100),0) AS max_errors  --find the maximum amount of rows that can fail the test
      FROM valid_test_rows --This allows relative tests to be effevtive on date range tests

    {% elif threshold_type ==  'absolute' %}
      {{ threshold_value }} AS max_errors  --find the maximum amount of rows that can fail the test

    {% else %} --ELSE
      0 AS max_errors  --find the maximum amount of rows that can fail the test
    {% endif %}
)

/*-----------------------------------------------------
Perform the Test.
*/-----------------------------------------------------

, perform_test AS (

        select
            {{ column_name }}

        from valid_test_rows
        where {{ column_name }} is not null
        group by {{ column_name }}
        having count(*) > 1


)
/*-----------------------------------------------------
Count the Errors.
*/ -----------------------------------------------------
, validation_errors AS (
SELECT
COUNT(*) AS n_errors
FROM perform_test
)
/*-----------------------------------------------------
Check if the number of errors is greater than the threshold.
*/ -----------------------------------------------------
SELECT
    CASE WHEN ve.n_errors > tl.max_errors THEN ve.n_errors - tl.max_errors
         WHEN ve.n_errors < tl.max_errors THEN 0
    ELSE 0
    END AS result
FROM validation_errors ve
CROSS JOIN threshold_limit tl


   {% endmacro %}

The Test Exclusion Macro

/* -----------------------------------------------------
This is the macro that allows the advanced schema tests to access the tbl_test_exclusions 
table.

Arguments:
- model_name: The model you wish to exclude from
- column_name: The column name you wish to exclude on

Output:
A SQL clause that will exclude rows from a table.

Usage:
  Example use case:
  In 2019 a bug allowed a single customer (dim_customer.customer_id =123)
  to be able to sign up without a last name. Then the not_null schema test on
  dim_customer.last_name would always fail on a single row. How Annoying!

  Usage of this macro depends on the configuration of the tbl_test_exclusions
  table schema. It should have the following columns:
    id:               surrogate key for the exclusion table.[N]
    exclusion_value:  unique key of the model that you wish to exclude a row from for testing.[123]
    model_name:       The name of the model that you wish to exclude a row from. [dim_customer]
    column_name:      The name of the column you wish to exclude on. [customer_id]
    owner:            String field to indicate who decided to exclude the row from testing.[your_name]
    description:      String field to explain why this row is special. [Bug back in 2019 that allowed a single customer to sign up with no last name.]

Notes:
The exclusion table can be any table in the database. dbt seed files and Gsheets
work well.
The macro should be placed inside a SQL WHERE clause, either:
- after WHERE
- or after AND
Because of the way get_mode_ref() works, if you choose to create a base table for your
tbl_test_exclusions model you will need to set the schema=none


*/ -----------------------------------------------------
{% macro exclude_rows_from_test(model_name, column_name) %}
CAST({{ column_name }} AS STRING) NOT IN
(
SELECT CAST(exclusion_value AS STRING) AS exclusion_value  -- eg, id
   FROM {{ get_model_ref('tbl_test_exclusions') }} RIRE
  WHERE
  RIRE.model_name =  '{{ model_name }}'
  AND
  RIRE.column_name =  '{{ column_name }}'
)
{% endmacro %}

The base table for the test exclusions macro


{{
  config(
    materialized='table',
    schema=none
  )
}}
/* -----------------------------------------------------

Scope: Identifying data for rows in models that we wish to exclude from certain tests because
they will cause the test to constantly fail.

Grain: id(surrogate key from source table which is a gsheet.)

Purpose: to create a base model for the test exclusion dataset.

Config:
This model has to be a table because the source data is a a gsheet which has strange errors
when it is repeatedly queried.

The schema must be set to none because it is referred to using the get_model_ref macro
which is designed to get around a strange issue with using refs in conditional blocks.
https://github.com/dbt-labs/dbt/issues/1077.


Notes:

*/ -----------------------------------------------------

SELECT rite.id,
       rite.exclusion_value,
       rite.model_name,
       rite.column_name,
       rite.owner,
       rite.description
  FROM {{ source('internal_datasets','tbl_test_exclusions') }} rite
 WHERE rite.id IS NOT NULL
   AND rite.exclusion_value IS NOT NULL
   AND rite.model_name IS NOT NULL
   AND rite.column_name IS NOT NULL

Handy macro “get_model_ref”

/*
Purpose:
to provide an alternative for {{ ref('model_name') }}
that will allow dbt to compile:
- when using a diabled model
- when there is a {{ ref('model_name') }} inside a conditional block.

Arguments:
    model_name: the model name that might be disabled but needs to be referenced

Output:
project_id.dataset_name.table_name

Notes:
This doesn't include any custom schemas.
*/

{% macro get_model_ref(model_name) %}
`{{target.project}}.{{this.schema}}.{{model_name}}`
{% endmacro %}

joellabes · July 19, 2021, 5:30am

This is cool! We’ve historically filtered out handfuls of misbehaving records by adding their IDs to the where clause of the standard tests, but if you’ve got a lot of them and want more observability this is more scalable once it’s properly set up.

Topic		Replies	Views
Limiting the not_null test Help testing , best-practice	3	7679	July 16, 2021
Examples of custom schema tests Show and Tell testing	8	29944	January 19, 2023
Schema tests - Disabling SELECT * in Column tests for a model Help testing , bigquery , dbt-core	4	1039	January 17, 2024
Error: dbt Tests on partitioned table fail Help jinja , testing , bigquery , dbt-core	1	129	August 12, 2024
Select rows based on test pass/fail status Help testing , best-practice	1	2300	June 27, 2023

Advanced schema tests; thresholds, exclusions and date limits

Related topics