Build snapshot-based tests to detect regressions in historic data

Every morning your CEO opens up their favorite dashboard. But one particular Monday, she finds that all of the historic revenue figures have changed! Suddenly, your Slack is filling up with messages asking what went wrong! And honestly, you’re not sure at first — all of your dbt test steps passed earlier today, how could things have changed? Turns out, that one of the upstream teams has changed possible classifications which your model and tests didn’t account for properly, and they applied those changes retroactively for the entire year.

You find the root-cause and fix it, however, you spent a couple of hours debugging what has happened, and your CEO is not particularly happy with you and seems to be growing suspicious of whether you really know what you’re doing. No one wants to be there, right?

While dbt’s schema tests are extremely useful, sometimes you need an extra layer of assurance that nothing is broken to get ahead of these issues. In this write-up, we’re going to share how Zoox uses dbt’s snapshots to detect when historic data changes as the result of execution of your pipeline, so that our data-team is the first one to know about those issues and can be proactive instead of reactive.

Note: We’re assuming you are familiar with snapshots — if you’re not, go read this article about them first.

The basic idea

We use dbt snapshots to detect changes in historic values for our rollup metrics and leverage the way snapshots invalidate the records that have changed for our test. A minimum working project that implements the whole workflow can be found here (huge shoutout to Claire for putting it together!). Feel free cloning and trying it out!

Now let’s go through the logic in more detail.

Test implementation

Lets say we have a fct_orders table:

order_id customer_id order_date amount
1 1 2020-07-01 10
2 3 2020-07-01 20
3 94 2020-07-02 1
4 50 2020-07-03 25
5 64 2020-07-03 17
6 54 2020-07-04 6

In our project, we build a snapshots on top of our most important models (like fct_orders) that roll the table up to one record per day. For this rollup query, assume that the order can be refunded/altered within the next calendar day, and that will alter the order amount. Hence, the revenue for orders that is older than 2 days should never change, right?

{% snapshot historic_revenue_snapshot %}

    {{
        config(
          target_schema='dbt_snapshots',
          strategy='check',
          unique_key='date_day',
          check_cols=['total_revenue']
        )
    }}

    select
        order_date as date_day,
        sum(amount) as total_revenue
    from {{ ref('fct_orders') }}
    where order_date < current_date - interval '1 day'

    group by 1

{% endsnapshot %}

After our first dbt snapshot, it looks something like this:

date_day total_revenue dbt_valid_from dbt_valid_to
2020-07-01 30 2020-07-15 null
2020-07-02 1 2020-07-15 null
2020-07-03 42 2020-07-15 null
2020-07-04 6 2020-07-15 null

Each day, we run this command, and since our historic daily revenue never changes, the dbt_valid_to field always should always be null. Right? Nothing ever goes wrong when it comes to data.

Except one day, something does go wrong. Somehow, order_id = 4 got filtered out, so our revenue for 2020-07-03 drops from 42 to 17 (it’s an illustrative example so let’s not get hang up on the feasibility of a record disappearing).

When we run our dbt snapshot command we can see that this historic value has changed — dbt has invalidated the old record, and created a new record for the affected day.

date_day total_revenue dbt_valid_from dbt_valid_to
2020-07-01 30 2020-07-15 null
2020-07-02 1 2020-07-15 null
2020-07-03 42 2020-07-15 2020-07-16
2020-07-03 17 2020-07-16 null
2020-07-04 6 2020-07-15 null

So, the concept here is we can add an is_null test to the dbt_valid_to column to detect when a metric has changed.

We can do this with a data test, or by creating a custom schema test and applying it to the column.

In case if you go down the path of schema test, that’s how your schema.yml in snapshots folder would look like:

version: 2

snapshots:
  - name: historic_revenue_snapshot
    columns:
      - name: dbt_valid_to
        tests:
          - is_null # you need to create this test in your project

Then when you run dbt test you will see the following output:

$ dbt test
Running with dbt=0.17.1
Found 0 models, 1 test, 1 snapshot, 0 analyses, 137 macros, 0 operations, 1 seed file, 0 sources

19:06:32 | Concurrency: 8 threads (target='redshift')
19:06:32 |
19:06:32 | 1 of 1 START test is_null_historic_revenue_snapshot_dbt_valid_to..... [RUN]
19:06:34 | 1 of 1 FAIL 1 is_null_historic_revenue_snapshot_dbt_valid_to......... [FAIL 1 in 1.96s]
19:06:34 |
19:06:34 | Finished running 1 test in 4.50s.

Completed with 1 error and 0 warnings:

Failure in test is_null_historic_revenue_snapshot_dbt_valid_to (snapshots/schema.yml)
  Got 1 result, expected 0.

  compiled SQL at target/compiled/acme/snapshots/schema.yml/schema_test/is_null_historic_revenue_snapshot_dbt_valid_to.sql

Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

(The code for the custom schema test can be found here)

Now, when any historic records change, we’ll be proactively notified about it!

Healing the test

Now that we have a rogue record, our is_null test is going to fail every single time we run dbt test, even though we might have already fixed the issue.

To make our test “self-healing”, we have an extra operation in our project to move invalidated records into an audit table.

You can find the full code here, but the idea is that you want to insert the invalidated version of the record into a different table, and delete it from the original table:

INSERT INTO dbt_snapshots.historic_revenue_snapshot_invalidated (
            SELECT
                *,
                CURRENT_TIMESTAMP as _inserted_at
            FROM {{ ref('historic_revenue_snapshot') }}
            WHERE dbt_valid_to IS NOT null
        );

DELETE FROM {{ ref('historic_revenue_snapshot') }} WHERE dbt_valid_to IS NOT null

The historic_revenue_snapshot_invalidated contains all records that have caused an error, while the historic_revenue_snapshot table no longer has records that will fail the is_null test:

date_day total_revenue dbt_valid_from dbt_valid_to
2020-07-01 30 2020-07-15 null
2020-07-02 1 2020-07-15 null
2020-07-03 17 2020-07-15 null
2020-07-04 6 2020-07-15 null

So failures will only cause an error once, and it’s extremely easy to debug any past failures thanks to the _invalidated table (it’s a good practice to additionally have snapshots of your core tables to make debugging even easier, i.e. so that you will be able to zero in on exact record that caused you trouble, even if you’re not running any tests on them).

Putting it all together

Running this in production does get a little complex — we chose to create one job per important model that covers all the steps.

$ dbt run-operation historic_revenue_snapshot_cleanup
$ dbt run -m +fct_orders
$ dbt snapshot --select historic_revenue_snapshot
$ dbt test -m historic_revenue_snapshot

By adding this to your production run of dbt, you can get ahead of any mistakes, building trust with your stakeholders in the process!

4 Likes