How to not lose your mind when auditing data: Part II

claire · October 2, 2019, 11:15pm

This article is the second part of a series (I didn’t know I was going to write a Part II, yet, here we are ). Read the first part here.

TL,DR: The original macros in the audit-helper package were super helpful in making sure you don’t introduce regressions when refactoring SQL. Now, we have new macros that help identify the problematic column when a regression is introduced.

Auditing data: no longer my least favorite task in analytics

In my previous article, I shared how our audit-helper package can be used to help confirm that two queries that should give you the same results are actually giving you the same results. Essentially, that you haven’t introduced a regression.

I’ve been using it pretty heavily while migrating some legacy ETL code into dbt, and then refactoring it to make it follow our coding conventions, all while checking I’m doing a “like for like” replacement. And it’s been going great!

If you find yourself refactoring SQL, you should definitely give this package a go.

Until it suddenly was my least favorite task again

Unfortunately, there have been a few cases lately where I have introduced regressions to my SQL. Now I needed to dive into why my records weren’t matching up.

Problem 1: Mismatching column types

During a recent model migration, I went to run my validation query to compare my old relation to my new one, and I got this back:

[Amazon](500310) Invalid operation: UNION types character varying and bigint cannot be matched;

Cool cool cool.

giphy%20(1)

So one of my columns in my old relation is a character varying, whereas the new one is a bigint. But which column?!

The relation I was moving over to dbt was pretty wide (~40 columns), so I didn’t want to inspect this by visually checking the results if I were to select * from my relation. Instead, I chose to query information_schema.columns to compare the column types, and check the the output of that query:

with a_cols as (
    select 
      * 
    from information_schema.columns
    where table_schema = 'old_etl_schema'
      and table_name = 'fct_orders'
),

b_cols as (
    select 
      * 
    from information_schema.columns
    where table_schema = 'dbt_claire'
      and table_name = 'fct_orders'
)

select
    column_name,
    a_cols.ordinal_position as a_ordinal_position,
    b_cols.ordinal_position as b_ordinal_position,
    a_cols.data_type as a_data_type,
    b_cols.data_type as b_data_type
from a_cols
full outer join b_cols using (column_name)
order by a_ordinal_position, b_ordinal_position

Oooh, the rare full outer join spotted in the wild!

(If you aren’t on Redshift skip this paragraph. If you are, my condolences…)

But it turns out that one of my relations is a late binding view, and Redshift doesn’t store late binding view columns in the the information schema. If I wanted to get the column types from this early binding view, I either had to:

switch it to being an early-binding view, or
instead query pg_get_late_binding_view_cols() (docs)

.

I went with the latter, because fortunately, @drew has already done all the hard work to write this query in dbt (let’s face it, he’s the real MVP).

(OK if you aren’t on Redshift, you can come back now, we hope you enjoyed whatever party you got to go to during that part, we’re all jealous!)

And then because I never want to solve this problem again, and I don’t want anyone else to have to either, I packaged it up in a macro and added it to audit-helper.

{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="old_etl_schema",
      identifier="fct_orders"
) -%}

{% set dbt_relation=ref('fct_orders') %}

{{ audit_helper.compare_relation_columns(
    a_relation=old_etl_relation,
    b_relation=dbt_relation
) }}

The compiled query helped me quickly find my offending columns:

column_name	a_ordinal_position	b_ordinal_position	a_data_type	b_data_type
order_id	1	1	integer	integer
customer_id	2	2	character varying	bigint
…	…	…	…	…

Once I found that customer_id was the issue, I was able to cast it to the correct data type, and run my compare_relations query.

Problem 2: Lots of rows that don’t match

So once I could run my comparison query, it turned out that my records did not match up across my two relations. In fact, almost 5k of my records didn’t match:

in_a	in_b	count
true	true	35 657
true	false	4 805
false	true	4 774

With almost 5k records, and 40 columns, this wasn’t going to be something I could (read: wanted to) debug by hand.

The solution? A macro ofc! I wrote a macro where, for a given primary key (used to match records), and for a given column, it would tell me whether the values matched across two relations. Here’s the SQL for comparing the status column:

with a_query as (
    select * from old_etl_schema.fct_orders
),

b_query as (
    select * from dbt_claire.fct_orders
),

joined as (
    select
        order_id
        a_query.status as a_query_value,
        b_query.status as b_query_value,
        case
          when a_query.status = b_query.status
            then '✅: perfect match'
          ...
        end as match_status
    from a_query
    full outer join b_query using (order_id)
),

final as (
    select
        match_status,
        count(*) as count_records
    from joined
    group by 1
)

select * from final

Another full outer join! So cool! And is that an emoji in a query output? I’ve gone too far. Someone save me from myself, please.

Again, to save myself from ever writing this again, and so others don’t have to either, I’ve added it to audit-helper.

{% set old_etl_relation_query %}
    select * from old_etl_schema.fct_orders
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('fct_orders') }}
{% endset %}

{% set audit_query = audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="order_id",
    column_to_compare="status"
) %}

Then, to make my debugging really extra, I iterated through the columns of my relation, and printed the results to the command line. Here are the highlights:

From here, I could quickly tell that it was my status column was causing a lot of issues, and could debug my SQL accordingly. (The full instructions for this usage are here)

Announcing audit-helper v0.0.3

These new macros, compare_relation_columns and compare_column_values are now in the latest release of audit helper. If you’re new to dbt packages, check out the docs.

vish · April 7, 2020, 7:32am

How did you trigger the comparisons on dbt compile?

claire · April 13, 2020, 2:24pm

Hi Vish
I used the print_table method — check out the README for usage details.

Topic		Replies	Views
How to not lose your mind when auditing data Show and Tell data-quality	1	13182	September 16, 2019
Error message running compare_relations macro within dbt-audit-helper package Help testing , bigquery , dbt-cloud , macros	6	1099	May 15, 2024
Compilation error - No macro names 'get_columns_in_relation_sql' found Archive	2	3475	December 23, 2021
Can you use the comparison models for testing with different technologies IE SQL Server and Snowflake Help	2	843	May 4, 2023
Compare table from MySQL and Redshift DW table In-Depth Discussions redshift , macros , dbt-core	0	883	October 17, 2023