TL:DR; We’ve got a new package, audit_helper to help with data audits.
Auditing data: my least favorite task in analytics
As a data analyst, there’s one task that used to instill more dread in me than any other – auditing data . This task generally pops up in one of two situations:
- I’m refactoring the SQL of an existing dbt model and want to make sure I haven’t introduced a regression
- I’m creating dbt models that replace a data transformation that’s happening somewhere outside of dbt (and sometimes not in SQL!), and need to show my stakeholder that my new model is a like-for-like replacement.
Sure, you get to feel a bit like a detective, but in reality, I felt this dread for two reasons:
- The SQL involved in auditing data is often harder than the SQL for building models – writing a query to compare two relations record-by-record uses
intersect
andexcept
, a query to compare relation structures requires queryinginformation_schema.columns
(unless you’re using a Redshift late binding view, in which case you also need toselect pg_get_late_binding_view_cols()
) . - And it’s also tedious, bordering on frustrating – if your audit doesn’t match up, it can be really hard to track down exactly why that is – finding the exact records that don’t match can be difficult, and finding the SQL that has introduced the regression can be even more so! (Especially if you leave it to the end.)
But it’s also a really important skill. When I’ve opened up a PR that has a “validation query”, it demonstrates to my reviewer (and to me!) that my work is rock solid, meaning we can get the work merged in more quickly. Here’s a screenshot from a recent PR that I opened:
(I was also able to demonstrate exactly why those 45 records didn’t match.)
Zen and the art of data auditing
I recently changed my workflow around data audits, borrowing one of the best practices from software engineering development – test driven development.
When refactoring an existing dbt model, rather than making all my changes, and then checking at the end that my models matched up, I did the following:
- Write an audit query (my test!) first.
- Run a development version of your model without making any changes, and check that the dev and prod versions of your model match up (they should! You haven’t made any changes yet!)
- Refactor a small piece of the model.
- Run the model.
- Run the audit query.
- If the query doesn’t pass, debug. Once it passes, commit.
- Repeat steps 3 through 6.
I felt Zen as I did this, I worked quickly, and I got the tricky part (writing the audit query) out of the way first! This tackled the “tedious” aspect of the problem, but that audit query was still more difficult than it needed to be!
Introducing the audit_helper
package
I’ve packaged up my most used audit queries into a new dbt package, audit_helper (new to dbt packages? Check the docs here!).
There’s two macros in here that are super useful:
-
compare_relations
: This macro generates SQL that can be used to do a row-by-row validation of two relations. -
compare_queries
: This macro generates SQL that can be used to do a row-by-row validation of two queries – sometimes when refactoring, I change a column name, or intentionally change the values in the column (e.g. coalescing NULLs with 0s). By comparing the results of two queries, I can take these changes into account!
Importantly, both of these macros are written in a way that means they can be reused across warehouses and projects, and the SQL in them is written in a way that helps you debug!
In the past, we’ve written lots of macros that are intended to become part of your project; whether they are used in models, custom schema tests, hooks, or operations, they end up living in the version controlled code. Here, we’re writing a new kind of macro; a macro that you should use while developing, but not past that. And the cool thing here is that we recently released a great interface for these class of macros – dbt Develop.
Here’s a quick demo of how you can use it in dbt Develop!
In this video, you can see:
- My project has the
audit_helper
package in thepackages.yml
file. - I used the
audit_helper.compare_relations
macro to generate a query to compare an existing table, with a model (yup! I couldref
!).
{% set old_etl_relation=adapter.get_relation(
database=target.database,
schema="old_etl_schema",
identifier="stg_jaffle_shop__orders" )%}
{% set dbt_relation=ref('stg_jaffle_shop__orders') %}
{{ audit_helper.compare_relations(
a_relation=old_etl_relation,
b_relation=dbt_relation,
exclude_columns=["loaded_at"],
primary_key="order_id"
) }}
- Two of my records didn’t match perfectly:
in_a | in_b | count |
---|---|---|
true | true | 97 |
true | false | 2 |
false | true | 2 |
- Fortunately I could copy/paste/uncomment the compiled SQL to find those exact records that didn’t match to help me debug!
order_id | customer_id | order_date | status | in_a | in_b |
---|---|---|---|---|---|
44 | 66 | 2018-02-16 | completed | true | false |
44 | 66 | 2018-02-17 | completed | false | true |
56 | 79 | 2018-02-28 | returned | true | false |
56 | 79 | 2018-02-28 | completed | false | true |
Planned changes
This package is very much a v0.0.1
– we intend to polish up the existing macros, and to add to it over time, creating macros that compare the columns of two relations, the relations within a schema, and more! If you’ve got an audit query that you find useful, PRs are very welcome
Happy auditing!