How to not lose your mind when auditing data

TL:DR; We’ve got a new package, audit_helper to help with data audits.

Auditing data: my least favorite task in analytics

As a data analyst, there’s one task that used to instill more dread in me than any other – auditing data :weary:. This task generally pops up in one of two situations:

  • I’m refactoring the SQL of an existing dbt model and want to make sure I haven’t introduced a regression
  • I’m creating dbt models that replace a data transformation that’s happening somewhere outside of dbt (and sometimes not in SQL!), and need to show my stakeholder that my new model is a like-for-like replacement.

Sure, you get to feel a bit like a detective, but in reality, I felt this dread for two reasons:

  1. The SQL involved in auditing data is often harder than the SQL for building models – writing a query to compare two relations record-by-record uses intersect and except, a query to compare relation structures requires querying information_schema.columns (unless you’re using a Redshift late binding view, in which case you also need to select pg_get_late_binding_view_cols() :man_facepalming:) .
  2. And it’s also tedious, bordering on frustrating – if your audit doesn’t match up, it can be really hard to track down exactly why that is – finding the exact records that don’t match can be difficult, and finding the SQL that has introduced the regression can be even more so! (Especially if you leave it to the end.)

But it’s also a really important skill. When I’ve opened up a PR that has a “validation query”, it demonstrates to my reviewer (and to me!) that my work is rock solid, meaning we can get the work merged in more quickly. Here’s a screenshot from a recent PR that I opened:


(I was also able to demonstrate exactly why those 45 records didn’t match.)

Zen and the art of data auditing

I recently changed my workflow around data audits, borrowing one of the best practices from software engineering development – test driven development.

When refactoring an existing dbt model, rather than making all my changes, and then checking at the end that my models matched up, I did the following:

  1. Write an audit query (my test!) first.
  2. Run a development version of your model without making any changes, and check that the dev and prod versions of your model match up (they should! You haven’t made any changes yet!)
  3. Refactor a small piece of the model.
  4. Run the model.
  5. Run the audit query.
  6. If the query doesn’t pass, debug. Once it passes, commit.
  7. Repeat steps 3 through 6.

I felt Zen as I did this, I worked quickly, and I got the tricky part (writing the audit query) out of the way first! This tackled the “tedious” aspect of the problem, but that audit query was still more difficult than it needed to be!

Introducing the audit_helper package

I’ve packaged up my most used audit queries into a new dbt package, audit_helper (new to dbt packages? Check the docs here!).

There’s two macros in here that are super useful:

  • compare_relations: This macro generates SQL that can be used to do a row-by-row validation of two relations.
  • compare_queries: This macro generates SQL that can be used to do a row-by-row validation of two queries – sometimes when refactoring, I change a column name, or intentionally change the values in the column (e.g. coalescing NULLs with 0s). By comparing the results of two queries, I can take these changes into account!

Importantly, both of these macros are written in a way that means they can be reused across warehouses and projects, and the SQL in them is written in a way that helps you debug! :woman_in_lotus_position::om:

In the past, we’ve written lots of macros that are intended to become part of your project; whether they are used in models, custom schema tests, hooks, or operations, they end up living in the version controlled code. Here, we’re writing a new kind of macro; a macro that you should use while developing, but not past that. And the cool thing here is that we recently released a great interface for these class of macros – dbt Develop.

Here’s a quick demo of how you can use it in dbt Develop!

In this video, you can see:

  1. My project has the audit_helper package in the packages.yml file.
  2. I used the audit_helper.compare_relations macro to generate a query to compare an existing table, with a model (yup! I could ref!).
{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="old_etl_schema",
      identifier="stg_jaffle_shop__orders" )%}

{% set dbt_relation=ref('stg_jaffle_shop__orders') %}

{{ audit_helper.compare_relations(
    a_relation=old_etl_relation,
    b_relation=dbt_relation,
    exclude_columns=["loaded_at"],
    primary_key="order_id"
) }} 
  1. Two of my records didn’t match perfectly:
in_a in_b count
true true 97
true false 2
false true 2
  1. Fortunately I could copy/paste/uncomment the compiled SQL to find those exact records that didn’t match to help me debug!
order_id customer_id order_date status in_a in_b
44 66 2018-02-16 completed true false
44 66 2018-02-17 completed false true
56 79 2018-02-28 returned true false
56 79 2018-02-28 completed false true

Planned changes

This package is very much a v0.0.1 – we intend to polish up the existing macros, and to add to it over time, creating macros that compare the columns of two relations, the relations within a schema, and more! If you’ve got an audit query that you find useful, PRs are very welcome :smiley:

Happy auditing!

14 Likes

Love this, thanks to the contributors!

2 Likes