Make test failures more visible

Hey @rafael! We have a long-standing issue for this which you can check out over here: https://github.com/fishtown-analytics/dbt/issues/517

My thinking is that we can just update the schema test macros that dbt uses to include a debug query in a SQL comment. Take this test macro for example: https://github.com/fishtown-analytics/dbt/blob/dev/0.14.1/core/dbt/include/global_project/macros/schema_tests/unique.sql

If we updated the macro contents to:

{% macro test_unique(model) %}

{% set column_name = kwargs.get('column_name', kwargs.get('arg')) %}

/*
---------------------------------------
-- Find failing rows for this test with:

select *
from {{ model }}
where {{ column_name }} in (
   select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1
) failing_records
limit 100

---------------------------------------
*/

select count(*)
from (

    select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1

) validation_errors

{% endmacro %}

If we wanted to use CTEs, we could do something like:

with failures as (
    select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}
    having count(*) > 1
),
audit as (
  select * from {{ model }}
  where {{ column_name }} in (select {{ column_name }} from failures)
  limit 100
),
validation_errors as (
  select * from failures
)
select * from validation_errors

Then, users could just pick up the compiled SQL and change validation_errors in the outer select to audit. We historically have tried to avoid CTEs in schema tests because they can be very slow on some databases (looking at Postgres) but it’s worth considering!

Last, we’ve talked about persisting test failures as tables in the database. This would effectively involve running:

create table test_name_here__timestamp as (
 -- query that finds test failures
);

dbt would then query this table to find the number of records in the table. If it’s zero, then dbt can delete the table, otherwise, dbt can leave the table there for users to inspect. I imagine users would want to configure the number of sampled records, and we’d need to provide a way for dbt to clean up these tables too!

So, these are the ideas in my head - what do you think?

1 Like