At Rezdy we’ve got our fair share of failing dbt tests and it goes without saying that when you have a failing test you should always address the root cause of the problem.
That being said it’s a lot faster just to sweep non-critical failing tests under the rug and get on with your day. These “advanced” schema tests are like the trap door under the rug.
Inspired by a post on creating error thresholds for schema tests, we set out to create a malleable dbt schema test for our most troublesome tests(unique and not_null)
The test handles 3 cases:
- Date scoping the dataset before running the test
- Implementing either a absolute or relative error threshold for row failure
- Explicitly exclude dodgy rows from tests
Date scoping has been covered a fair bit already and is likely to be made redundant by either dbt_utils or the new capability of the not_null schema test in dbt v0.20.0 and thresholds have received a fair bit of attention so this post is mostly about the more idiosyncratic test exclusions option.
The test exclusion functionality allows us to test a dataset as normal but put in a small clause such as:
Even though all customers who sign up should have a trial event before they have a purchase event(trial_id
not null in fct_purchase
), there was that one time back in 2018 where a customer managed to somehow do it because of a bug. We’ll make a note of the customers ID and the reason why they don’t have a trial in our exclusion file (a seed file or a google sheet if you can swing it Hail BigQuery) then move on with our lives.
Note:
- There are separate test macros for not_null and unique but they are identical except for the logic that captures duplicates or nulls in the
perform_test
CTE. - These models are currently running on a BigQuery warehouse, you might need to tweak some syntax to get it to run on other warehouses.*
- The two supporting macros;
get_model_ref
andexclude_rows_from_test
are also included down the bottom. - You can name the macros test_not_null and test_unique without the adv prefix and they will just replace the ones that come with dbt, but then you have to go and update your whole project.
-
tbl_test_exclusions
is the name of the table where we store both the rows that will be excluded and the reason for doing so you can call this whatever you like. - If you’ve been doing some black magic with condition blocks and ref statements and encountered the below test ERROR, it might be worth checking out the
get_model_ref
macro down the bottom.
Compilation Error in test some_test_name (models/staging/something/something.yml)
dbt was unable to infer all dependencies for the model "some_test_name".
This typically happens when ref() is placed within a conditional block.
To fix this, add the following hint to the top of the model "some_test_name":
Not Null
/* -----------------------------------------------------
This is an advanced not_null schema test.
It is used the same way the normal not_null schema test is used inside the .yml file
however instead of:
- not_null
use
- adv_not_null:
The test has 3 functions:
- Thresholds (absolute and relative)
- Date Limits (by day)
- Row Exclusions (nit picking out bad rows)
Usage Notes:
Thresholds are calculated after date scoping.
When testing relative thresholds use values 1 to 100 indicating the % of bad rows expected.
The error output for threshold fields indicates the number of fails beyond the threshold.
not the absolute number of fails. (The absolute number can be obtained from a normal test.)
A single test can use all 3 functions.
For specifcs on the row exclusions functionality check the exclude_rows_from_test macro
config for .yml file
- adv_unique:
model: # Taken from the model the test is on
column_name: # Taken from the column the test is on
event_timestamp_column_name: # [The date column for date limit tests]
number_of_days_to_check: # [The number of days for date limit tests]
threshold_type: # relative / absolute (relative in %)
threshold_value: # integer to indicate accepted errors for abs/rel
column_to_exclude_on: # if the column has exclusions on a field, enter that column name here
*/ -----------------------------------------------------
{%macro test_adv_not_null(
model,
column_name,
event_timestamp_column_name = None,
number_of_days_to_check = None,
threshold_type = None,
threshold_value = 1,
column_to_exclude_on = None ) %}
/* -----------------------------------------------------
Scope to valid test rows.
*/ -----------------------------------------------------
WITH valid_test_rows AS (
SELECT *
FROM {{ model }}
WHERE 1 = 1
{% if column_to_exclude_on != None %} --This will check your test exclusion
AND {{ exclude_rows_from_test( model.name, column_to_exclude_on ) }}
{% endif %}
{% if number_of_days_to_check != None and event_timestamp_column_name != None%}
AND {{ event_timestamp_column_name }} > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{number_of_days_to_check}} DAY)
{% endif %}
)
/*-----------------------------------------------------
Set Threshold if applicable.
*/ -----------------------------------------------------
, threshold_limit AS (
SELECT
{% if threshold_type == 'relative' %}
ROUND(COUNT(*) * ({{ threshold_value }} / 100),0) AS max_errors --find the maximum amount of rows that can fail the test
FROM valid_test_rows --This allows relative tests to be effective on date range tests
{% elif threshold_type == 'absolute' %}
{{ threshold_value }} AS max_errors --find the maximum amount of rows that can fail the test
{% else %} --ELSE
0 AS max_errors --find the maximum amount of rows that can fail the test
{% endif %}
)
/*-----------------------------------------------------
Perform the Test.
*/-----------------------------------------------------
, perform_test AS (
SELECT *
FROM valid_test_rows
WHERE {{ column_name }} IS NULL
)
/*-----------------------------------------------------
Count the Errors.
*/ -----------------------------------------------------
, validation_errors AS (
SELECT
COUNT(*) AS n_errors
FROM perform_test
)
/*-----------------------------------------------------
Check if the number of errors is greater than the threshold.
*/ -----------------------------------------------------
SELECT
CASE WHEN ve.n_errors > tl.max_errors THEN ve.n_errors - tl.max_errors
WHEN ve.n_errors < tl.max_errors THEN 0
ELSE 0
END AS result
FROM validation_errors ve
CROSS JOIN threshold_limit tl
{% endmacro %}
Unique
/* -----------------------------------------------------
This is an advanced unique schema test.
It is used the same way the normal unique schema test is used inside the .yml file
however instead of:
- unique
use
- adv_unique:
The test has 3 functions:
- Thresholds (absolute and relative)
- Date Limits (by day)
- Row Exclusions
Usage Notes:
Thresholds are calculated after date scoping.
When testing relative thresholds use values 1 to 100 indicating the % of bad rows expected.
The error output for threshold fields indicates the number of fails beyond the threshold.
not the absolute number of fails. (The absolute number can be obtained from a normal test.)
A single test can use all 3 functions.
For specifics on the row exclusions functionality check the exclude_rows_from_test macro
config for .yml file
- adv_unique:
model: # Taken from the model the test is on
column_name: # Taken from the column the test is on
event_timestamp_column_name: # [The date column for date limit tests]
number_of_days_to_check: # [The number of days for date limit tests]
threshold_type: # relative / absolute (relative in %)
threshold_value: # integer to indicate accepted errors for abs/rel
column_to_exclude_on: # if the column has exclusions on a field, enter that column name here
*/ -----------------------------------------------------
{%macro test_adv_unique(
model,
column_name,
event_timestamp_column_name = None,
number_of_days_to_check = None,
threshold_type = None,
threshold_value = 1,
column_to_exclude_on = None ) %}
/* -----------------------------------------------------
Scope to valid test rows.
*/ -----------------------------------------------------
WITH valid_test_rows AS (
SELECT *
FROM {{ model }}
WHERE 1 = 1
{% if column_to_exclude_on != None %} --This will check your test exclusion
AND {{ exclude_rows_from_test( model.name, column_to_exclude_on ) }}
{% endif %}
{% if number_of_days_to_check != None and event_timestamp_column_name != None%}
AND {{ event_timestamp_column_name }} > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{number_of_days_to_check}} DAY)
{% endif %}
)
/*-----------------------------------------------------
Set Threshold if applicable.
*/ -----------------------------------------------------
, threshold_limit AS (
SELECT
{% if threshold_type == 'relative' %}
ROUND(COUNT(*) * ({{ threshold_value }} / 100),0) AS max_errors --find the maximum amount of rows that can fail the test
FROM valid_test_rows --This allows relative tests to be effevtive on date range tests
{% elif threshold_type == 'absolute' %}
{{ threshold_value }} AS max_errors --find the maximum amount of rows that can fail the test
{% else %} --ELSE
0 AS max_errors --find the maximum amount of rows that can fail the test
{% endif %}
)
/*-----------------------------------------------------
Perform the Test.
*/-----------------------------------------------------
, perform_test AS (
select
{{ column_name }}
from valid_test_rows
where {{ column_name }} is not null
group by {{ column_name }}
having count(*) > 1
)
/*-----------------------------------------------------
Count the Errors.
*/ -----------------------------------------------------
, validation_errors AS (
SELECT
COUNT(*) AS n_errors
FROM perform_test
)
/*-----------------------------------------------------
Check if the number of errors is greater than the threshold.
*/ -----------------------------------------------------
SELECT
CASE WHEN ve.n_errors > tl.max_errors THEN ve.n_errors - tl.max_errors
WHEN ve.n_errors < tl.max_errors THEN 0
ELSE 0
END AS result
FROM validation_errors ve
CROSS JOIN threshold_limit tl
{% endmacro %}
The Test Exclusion Macro
/* -----------------------------------------------------
This is the macro that allows the advanced schema tests to access the tbl_test_exclusions
table.
Arguments:
- model_name: The model you wish to exclude from
- column_name: The column name you wish to exclude on
Output:
A SQL clause that will exclude rows from a table.
Usage:
Example use case:
In 2019 a bug allowed a single customer (dim_customer.customer_id =123)
to be able to sign up without a last name. Then the not_null schema test on
dim_customer.last_name would always fail on a single row. How Annoying!
Usage of this macro depends on the configuration of the tbl_test_exclusions
table schema. It should have the following columns:
id: surrogate key for the exclusion table.[N]
exclusion_value: unique key of the model that you wish to exclude a row from for testing.[123]
model_name: The name of the model that you wish to exclude a row from. [dim_customer]
column_name: The name of the column you wish to exclude on. [customer_id]
owner: String field to indicate who decided to exclude the row from testing.[your_name]
description: String field to explain why this row is special. [Bug back in 2019 that allowed a single customer to sign up with no last name.]
Notes:
The exclusion table can be any table in the database. dbt seed files and Gsheets
work well.
The macro should be placed inside a SQL WHERE clause, either:
- after WHERE
- or after AND
Because of the way get_mode_ref() works, if you choose to create a base table for your
tbl_test_exclusions model you will need to set the schema=none
*/ -----------------------------------------------------
{% macro exclude_rows_from_test(model_name, column_name) %}
CAST({{ column_name }} AS STRING) NOT IN
(
SELECT CAST(exclusion_value AS STRING) AS exclusion_value -- eg, id
FROM {{ get_model_ref('tbl_test_exclusions') }} RIRE
WHERE
RIRE.model_name = '{{ model_name }}'
AND
RIRE.column_name = '{{ column_name }}'
)
{% endmacro %}
The base table for the test exclusions macro
{{
config(
materialized='table',
schema=none
)
}}
/* -----------------------------------------------------
Scope: Identifying data for rows in models that we wish to exclude from certain tests because
they will cause the test to constantly fail.
Grain: id(surrogate key from source table which is a gsheet.)
Purpose: to create a base model for the test exclusion dataset.
Config:
This model has to be a table because the source data is a a gsheet which has strange errors
when it is repeatedly queried.
The schema must be set to none because it is referred to using the get_model_ref macro
which is designed to get around a strange issue with using refs in conditional blocks.
https://github.com/dbt-labs/dbt/issues/1077.
Notes:
*/ -----------------------------------------------------
SELECT rite.id,
rite.exclusion_value,
rite.model_name,
rite.column_name,
rite.owner,
rite.description
FROM {{ source('internal_datasets','tbl_test_exclusions') }} rite
WHERE rite.id IS NOT NULL
AND rite.exclusion_value IS NOT NULL
AND rite.model_name IS NOT NULL
AND rite.column_name IS NOT NULL
Handy macro “get_model_ref”
/*
Purpose:
to provide an alternative for {{ ref('model_name') }}
that will allow dbt to compile:
- when using a diabled model
- when there is a {{ ref('model_name') }} inside a conditional block.
Arguments:
model_name: the model name that might be disabled but needs to be referenced
Output:
project_id.dataset_name.table_name
Notes:
This doesn't include any custom schemas.
*/
{% macro get_model_ref(model_name) %}
`{{target.project}}.{{this.schema}}.{{model_name}}`
{% endmacro %}