Make test failures more visible

We are interested in using dbt test to documenting data quality issues. For instance, some of when tests fail because of bad user input (e.g., missing references for mandatory fields, or invalid fields in applications).

However, processing failed tests is quite an effort. When a test fails we follow theses steps to figure out the offending rows:

  1. Check the output for failed tests
  2. Browse to the log with the query
  3. Update the query to return the offending rows (e.g, the null check query returns the amount of rows)
  4. Document and/or reach out to the person who can solve the issue at the source

We’d like to use the test output as a communication tool, so we can document and discuss data quality issues with people not working directly with dbt.

Does this make sense? If so, any suggestions how we can achieve this goal.

2 Likes

Hey @rafael! We have a long-standing issue for this which you can check out over here: https://github.com/fishtown-analytics/dbt/issues/517

My thinking is that we can just update the schema test macros that dbt uses to include a debug query in a SQL comment. Take this test macro for example: https://github.com/fishtown-analytics/dbt/blob/dev/0.14.1/core/dbt/include/global_project/macros/schema_tests/unique.sql

If we updated the macro contents to:

{% macro test_unique(model) %}

{% set column_name = kwargs.get('column_name', kwargs.get('arg')) %}

/*
---------------------------------------
-- Find failing rows for this test with:

select *
from {{ model }}
where {{ column_name }} in (
   select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1
) failing_records
limit 100

---------------------------------------
*/

select count(*)
from (

    select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1

) validation_errors

{% endmacro %}

If we wanted to use CTEs, we could do something like:

with failures as (
    select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1
),
audit as (
  select * from {{ model }}
  where {{ column_name }} in (select {{ column_name }} from failures)
  limit 100
),
validation_errors as (
  select * from failures
)
select * from validation_errors

Then, users could just pick up the compiled SQL and change validation_errors in the outer select to audit. We historically have tried to avoid CTEs in schema tests because they can be very slow on some databases (looking at Postgres) but it’s worth considering!

Last, we’ve talked about persisting test failures as tables in the database. This would effectively involve running:

create table test_name_here__timestamp as (
 -- query that finds test failures
);

dbt would then query this table to find the number of records in the table. If it’s zero, then dbt can delete the table, otherwise, dbt can leave the table there for users to inspect. I imagine users would want to configure the number of sampled records, and we’d need to provide a way for dbt to clean up these tables too!

So, these are the ideas in my head - what do you think?

1 Like

Yes, I like the overall approach of being able to work with the actual offending rows instead of counts. That’s something we do for our custom tests, and the performance cost is acceptable for us.

You seemed to outline two solutions:

  1. CLI output by using comments and easy to change queries. My suggestion here would be to make it “first class”. You could add a dbt test --audit model, that would make the changes you recommended automatically. With Jinja this should be easy to do.

  2. Persisting results on the DB. This is cool! Although it’s obviously what I want, I was already thinking on how to make an ETL to get the CLI output to import in the DB :man_facepalming:

Because this issue is quite high on our backlog, I will try to hack something up based on these ideas. I’ll make a PR if I get it anywhere.

Great stuff!

Sounds great! Feel free to follow up in the linked issue (or create a new one) as you get underway. Super happy to help with figuring out the workflow + implementation however I can

PostgreSQL’s CTEs should not be materialized by default in v12.

As when working with DBT one tends to see everything as a model, we materialized ‘data’ tests in a model unioning all the tests (as data tests return dataset, not just count()). Then we created a ‘data’ test based on a count() on that model group by test label.

It would be nice to have a kind of “metrics vault” filled by dbt somewhere in the dwh, containing logs and test results, but that implies adding a backend to the tool.

1 Like