How to not lose your mind when auditing data

claire · July 11, 2019, 12:05pm

TL:DR; We’ve got a new package, audit_helper to help with data audits.

Auditing data: my least favorite task in analytics

As a data analyst, there’s one task that used to instill more dread in me than any other – auditing data . This task generally pops up in one of two situations:

I’m refactoring the SQL of an existing dbt model and want to make sure I haven’t introduced a regression
I’m creating dbt models that replace a data transformation that’s happening somewhere outside of dbt (and sometimes not in SQL!), and need to show my stakeholder that my new model is a like-for-like replacement.

Sure, you get to feel a bit like a detective, but in reality, I felt this dread for two reasons:

The SQL involved in auditing data is often harder than the SQL for building models – writing a query to compare two relations record-by-record uses intersect and except, a query to compare relation structures requires querying information_schema.columns (unless you’re using a Redshift late binding view, in which case you also need to select pg_get_late_binding_view_cols() ) .
And it’s also tedious, bordering on frustrating – if your audit doesn’t match up, it can be really hard to track down exactly why that is – finding the exact records that don’t match can be difficult, and finding the SQL that has introduced the regression can be even more so! (Especially if you leave it to the end.)

But it’s also a really important skill. When I’ve opened up a PR that has a “validation query”, it demonstrates to my reviewer (and to me!) that my work is rock solid, meaning we can get the work merged in more quickly. Here’s a screenshot from a recent PR that I opened:

(I was also able to demonstrate exactly why those 45 records didn’t match.)

Zen and the art of data auditing

I recently changed my workflow around data audits, borrowing one of the best practices from software engineering development – test driven development.

When refactoring an existing dbt model, rather than making all my changes, and then checking at the end that my models matched up, I did the following:

Write an audit query (my test!) first.
Run a development version of your model without making any changes, and check that the dev and prod versions of your model match up (they should! You haven’t made any changes yet!)
Refactor a small piece of the model.
Run the model.
Run the audit query.
If the query doesn’t pass, debug. Once it passes, commit.
Repeat steps 3 through 6.

I felt Zen as I did this, I worked quickly, and I got the tricky part (writing the audit query) out of the way first! This tackled the “tedious” aspect of the problem, but that audit query was still more difficult than it needed to be!

Introducing the `audit_helper` package

I’ve packaged up my most used audit queries into a new dbt package, audit_helper (new to dbt packages? Check the docs here!).

There’s two macros in here that are super useful:

compare_relations: This macro generates SQL that can be used to do a row-by-row validation of two relations.
compare_queries: This macro generates SQL that can be used to do a row-by-row validation of two queries – sometimes when refactoring, I change a column name, or intentionally change the values in the column (e.g. coalescing NULLs with 0s). By comparing the results of two queries, I can take these changes into account!

Importantly, both of these macros are written in a way that means they can be reused across warehouses and projects, and the SQL in them is written in a way that helps you debug!

In the past, we’ve written lots of macros that are intended to become part of your project; whether they are used in models, custom schema tests, hooks, or operations, they end up living in the version controlled code. Here, we’re writing a new kind of macro; a macro that you should use while developing, but not past that. And the cool thing here is that we recently released a great interface for these class of macros – dbt Develop.

Here’s a quick demo of how you can use it in dbt Develop!

In this video, you can see:

My project has the audit_helper package in the packages.yml file.
I used the audit_helper.compare_relations macro to generate a query to compare an existing table, with a model (yup! I could ref!).

{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="old_etl_schema",
      identifier="stg_jaffle_shop__orders" )%}

{% set dbt_relation=ref('stg_jaffle_shop__orders') %}

{{ audit_helper.compare_relations(
    a_relation=old_etl_relation,
    b_relation=dbt_relation,
    exclude_columns=["loaded_at"],
    primary_key="order_id"
) }}

Two of my records didn’t match perfectly:

in_a	in_b	count
true	true	97
true	false	2
false	true	2

Fortunately I could copy/paste/uncomment the compiled SQL to find those exact records that didn’t match to help me debug!

order_id	customer_id	order_date	status	in_a	in_b
44	66	2018-02-16	completed	true	false
44	66	2018-02-17	completed	false	true
56	79	2018-02-28	returned	true	false
56	79	2018-02-28	completed	false	true

Planned changes

This package is very much a v0.0.1 – we intend to polish up the existing macros, and to add to it over time, creating macros that compare the columns of two relations, the relations within a schema, and more! If you’ve got an audit query that you find useful, PRs are very welcome

Happy auditing!

mzaccaron · September 16, 2019, 10:12am

Love this, thanks to the contributors!

Topic		Replies	Views
How to not lose your mind when auditing data: Part II Show and Tell data-quality	2	9942	April 13, 2020
Error message running compare_relations macro within dbt-audit-helper package Help testing , bigquery , dbt-cloud , macros	6	1101	May 15, 2024
Can you use the comparison models for testing with different technologies IE SQL Server and Snowflake Help	2	844	May 4, 2023
Compare two models' schemas Help testing , data-quality	2	4179	September 20, 2022
Compilation error - No macro names 'get_columns_in_relation_sql' found Archive	2	3475	December 23, 2021

How to not lose your mind when auditing data

Auditing data: my least favorite task in analytics

Zen and the art of data auditing

Introducing the audit_helper package

Planned changes

Related topics

Introducing the `audit_helper` package