This article is the second part of a series (I didn’t know I was going to write a Part II, yet, here we are ). Read the first part here.
TL,DR: The original macros in the audit-helper package were super helpful in making sure you don’t introduce regressions when refactoring SQL. Now, we have new macros that help identify the problematic column when a regression is introduced.
Auditing data: no longer my least favorite task in analytics
In my previous article, I shared how our audit-helper package can be used to help confirm that two queries that should give you the same results are actually giving you the same results. Essentially, that you haven’t introduced a regression.
I’ve been using it pretty heavily while migrating some legacy ETL code into dbt, and then refactoring it to make it follow our coding conventions, all while checking I’m doing a “like for like” replacement. And it’s been going great!
If you find yourself refactoring SQL, you should definitely give this package a go.
Until it suddenly was my least favorite task again
Unfortunately, there have been a few cases lately where I have introduced regressions to my SQL. Now I needed to dive into why my records weren’t matching up.
Problem 1: Mismatching column types
During a recent model migration, I went to run my validation query to compare my old relation to my new one, and I got this back:
[Amazon](500310) Invalid operation: UNION types character varying and bigint cannot be matched;
Cool cool cool.
So one of my columns in my old relation is a character varying
, whereas the new one is a bigint
. But which column?!
The relation I was moving over to dbt was pretty wide (~40 columns), so I didn’t want to inspect this by visually checking the results if I were to select * from
my relation. Instead, I chose to query information_schema.columns
to compare the column types, and check the the output of that query:
with a_cols as (
select
*
from information_schema.columns
where table_schema = 'old_etl_schema'
and table_name = 'fct_orders'
),
b_cols as (
select
*
from information_schema.columns
where table_schema = 'dbt_claire'
and table_name = 'fct_orders'
)
select
column_name,
a_cols.ordinal_position as a_ordinal_position,
b_cols.ordinal_position as b_ordinal_position,
a_cols.data_type as a_data_type,
b_cols.data_type as b_data_type
from a_cols
full outer join b_cols using (column_name)
order by a_ordinal_position, b_ordinal_position
Oooh, the rare full outer join
spotted in the wild!
(If you aren’t on Redshift skip this paragraph. If you are, my condolences…)
But it turns out that one of my relations is a late binding view, and Redshift doesn’t store late binding view columns in the the information schema. If I wanted to get the column types from this early binding view, I either had to:
- switch it to being an early-binding view, or
- instead query
pg_get_late_binding_view_cols()
(docs)
.
I went with the latter, because fortunately, @drew has already done all the hard work to write this query in dbt (let’s face it, he’s the real MVP).
(OK if you aren’t on Redshift, you can come back now, we hope you enjoyed whatever party you got to go to during that part, we’re all jealous!)
And then because I never want to solve this problem again, and I don’t want anyone else to have to either, I packaged it up in a macro and added it to audit-helper.
{% set old_etl_relation=adapter.get_relation(
database=target.database,
schema="old_etl_schema",
identifier="fct_orders"
) -%}
{% set dbt_relation=ref('fct_orders') %}
{{ audit_helper.compare_relation_columns(
a_relation=old_etl_relation,
b_relation=dbt_relation
) }}
The compiled query helped me quickly find my offending columns:
column_name | a_ordinal_position | b_ordinal_position | a_data_type | b_data_type |
---|---|---|---|---|
order_id | 1 | 1 | integer | integer |
customer_id | 2 | 2 | character varying | bigint |
… | … | … | … | … |
Once I found that customer_id
was the issue, I was able to cast it to the correct data type, and run my compare_relations
query.
Problem 2: Lots of rows that don’t match
So once I could run my comparison query, it turned out that my records did not match up across my two relations. In fact, almost 5k of my records didn’t match:
in_a | in_b | count |
---|---|---|
true | true | 35 657 |
true | false | 4 805 |
false | true | 4 774 |
With almost 5k records, and 40 columns, this wasn’t going to be something I could (read: wanted to) debug by hand.
The solution? A macro ofc! I wrote a macro where, for a given primary key (used to match records), and for a given column, it would tell me whether the values matched across two relations. Here’s the SQL for comparing the status
column:
with a_query as (
select * from old_etl_schema.fct_orders
),
b_query as (
select * from dbt_claire.fct_orders
),
joined as (
select
order_id
a_query.status as a_query_value,
b_query.status as b_query_value,
case
when a_query.status = b_query.status
then '✅: perfect match'
...
end as match_status
from a_query
full outer join b_query using (order_id)
),
final as (
select
match_status,
count(*) as count_records
from joined
group by 1
)
select * from final
Another full outer join
! So cool! And is that an emoji in a query output? I’ve gone too far. Someone save me from myself, please.
Again, to save myself from ever writing this again, and so others don’t have to either, I’ve added it to audit-helper.
{% set old_etl_relation_query %}
select * from old_etl_schema.fct_orders
{% endset %}
{% set new_etl_relation_query %}
select * from {{ ref('fct_orders') }}
{% endset %}
{% set audit_query = audit_helper.compare_column_values(
a_query=old_etl_relation_query,
b_query=new_etl_relation_query,
primary_key="order_id",
column_to_compare="status"
) %}
Then, to make my debugging really extra, I iterated through the columns of my relation, and printed the results to the command line. Here are the highlights:
From here, I could quickly tell that it was my status
column was causing a lot of issues, and could debug my SQL accordingly. (The full instructions for this usage are here)
Announcing audit-helper v0.0.3
These new macros, compare_relation_columns
and compare_column_values
are now in the latest release of audit helper. If you’re new to dbt packages, check out the docs.