Build snapshot-based tests to detect regressions in historic data

Every morning your CEO opens up her favorite dashboard. But one particular Monday, she finds that all of the historic revenue figures have changed! Suddenly, your Slack is filling up with messages asking what went wrong. And honestly, you’re not sure at first: all of your dbt test steps passed earlier today, so how could things have changed? It turns out that one of the upstream teams changed the possible classifications, which your model and tests didn’t account for properly, and they applied those changes retroactively for the entire year.

You find the root cause and fix it, but you’ve spent a couple of hours debugging what happened, and your CEO is not particularly happy with you and seems to be growing suspicious of whether you really know what you’re doing. No one wants to be there, right?

While dbt’s schema tests are extremely useful, sometimes you need an extra layer of assurance that nothing is broken so you can get ahead of these issues. In this write-up, we’re going to share how Zoox uses dbt’s snapshots to detect when historic data changes as a result of a pipeline run, so that our data team is the first to know about these issues and can be proactive instead of reactive.

Note: We’re assuming you are familiar with snapshots — if you’re not, go read this article about them first.

The basic idea

We use dbt snapshots to detect changes in historic values for our rollup metrics, and we leverage the way snapshots invalidate changed records to power our test. A minimum working project that implements the whole workflow can be found here (huge shoutout to Claire for putting it together!). Feel free to clone it and try it out!

Now let’s go through the logic in more detail.

Test implementation

Let’s say we have a fct_orders table:

| order_id | customer_id | order_date | amount |
| -------- | ----------- | ---------- | ------ |
| 1        | 1           | 2020-07-01 | 10     |
| 2        | 3           | 2020-07-01 | 20     |
| 3        | 94          | 2020-07-02 | 1      |
| 4        | 50          | 2020-07-03 | 25     |
| 5        | 64          | 2020-07-03 | 17     |
| 6        | 54          | 2020-07-04 | 6      |

In our project, we build snapshots on top of our most important models (like fct_orders) that roll the table up to one record per day. For this rollup query, assume that an order can be refunded or altered within the next calendar day, which changes the order amount. Hence, the revenue for order dates older than two days should never change, right?

```sql
{% snapshot historic_revenue_snapshot %}

    {{
        config(
          target_schema='dbt_snapshots',
          strategy='check',
          unique_key='date_day',
          check_cols=['total_revenue']
        )
    }}

    select
        order_date as date_day,
        sum(amount) as total_revenue
    from {{ ref('fct_orders') }}
    where order_date < current_date - interval '1 day'
    group by 1

{% endsnapshot %}
```

After our first dbt snapshot, it looks something like this:

| date_day   | total_revenue | dbt_valid_from | dbt_valid_to |
| ---------- | ------------- | -------------- | ------------ |
| 2020-07-01 | 30            | 2020-07-15     | null         |
| 2020-07-02 | 1             | 2020-07-15     | null         |
| 2020-07-03 | 42            | 2020-07-15     | null         |
| 2020-07-04 | 6             | 2020-07-15     | null         |

Each day, we run dbt snapshot, and since our historic daily revenue never changes, the dbt_valid_to field should always be null. Right? Nothing ever goes wrong when it comes to data.

Except one day, something does go wrong. Somehow, order_id = 4 got filtered out, so our revenue for 2020-07-03 drops from 42 to 17 (it’s an illustrative example, so let’s not get hung up on the feasibility of a record disappearing).

When we run our dbt snapshot command, we can see that this historic value has changed: dbt has invalidated the old record and created a new record for the affected day.

| date_day   | total_revenue | dbt_valid_from | dbt_valid_to |
| ---------- | ------------- | -------------- | ------------ |
| 2020-07-01 | 30            | 2020-07-15     | null         |
| 2020-07-02 | 1             | 2020-07-15     | null         |
| 2020-07-03 | 42            | 2020-07-15     | 2020-07-16   |
| 2020-07-03 | 17            | 2020-07-16     | null         |
| 2020-07-04 | 6             | 2020-07-15     | null         |

So, the concept here is that we can add an is_null test to the dbt_valid_to column to detect when a metric has changed.

We can do this with a data test, or by creating a custom schema test and applying it to the column.
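
If you take the data test route, a minimal sketch might look like the following (the filename is hypothetical): a data test is just a SQL file in your test paths (e.g. the tests/ directory), and it fails if the query returns any rows.

```sql
-- tests/historic_revenue_has_not_changed.sql (hypothetical filename)
-- Fails whenever a snapshot record has been invalidated,
-- i.e. whenever a historic value has changed.
select *
from {{ ref('historic_revenue_snapshot') }}
where dbt_valid_to is not null
```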

If you go down the schema test path instead, here is how the schema.yml file in your snapshots folder would look:

```yaml
version: 2

snapshots:
  - name: historic_revenue_snapshot
    columns:
      - name: dbt_valid_to
        tests:
          - is_null # you need to create this test in your project
```

Then, when you run dbt test, you will see the following output:

```
$ dbt test
Running with dbt=0.17.1
Found 0 models, 1 test, 1 snapshot, 0 analyses, 137 macros, 0 operations, 1 seed file, 0 sources

19:06:32 | Concurrency: 8 threads (target='redshift')
19:06:32 |
19:06:32 | 1 of 1 START test is_null_historic_revenue_snapshot_dbt_valid_to..... [RUN]
19:06:34 | 1 of 1 FAIL 1 is_null_historic_revenue_snapshot_dbt_valid_to......... [FAIL 1 in 1.96s]
19:06:34 |
19:06:34 | Finished running 1 test in 4.50s.

Completed with 1 error and 0 warnings:

Failure in test is_null_historic_revenue_snapshot_dbt_valid_to (snapshots/schema.yml)
  Got 1 result, expected 0.

  compiled SQL at target/compiled/acme/snapshots/schema.yml/schema_test/is_null_historic_revenue_snapshot_dbt_valid_to.sql

Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
```

(The code for the custom schema test can be found here)
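
If you’d rather not follow the link, here is a minimal sketch of what such a macro might look like for the dbt version used above (0.17), where a custom schema test is a macro named test_<name> that returns a query yielding the number of failing records (in dbt 0.20+ you would instead define a generic test with a {% test %} block that returns the failing rows). The linked implementation may differ.

```sql
-- macros/test_is_null.sql (a sketch; the linked implementation may differ)
{% macro test_is_null(model, column_name) %}

    -- the test passes when this count is 0,
    -- i.e. when every value in the column is null
    select count(*)
    from {{ model }}
    where {{ column_name }} is not null

{% endmacro %}
```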

Now, when any historic records change, we’ll be proactively notified about it!

Healing the test

Now that we have a rogue record, our is_null test is going to fail every single time we run dbt test, even though we might have already fixed the issue.

To make our test “self-healing”, we have an extra operation in our project to move invalidated records into an audit table.

You can find the full code here, but the idea is that you want to insert the invalidated version of the record into a different table, and delete it from the original table:

```sql
INSERT INTO dbt_snapshots.historic_revenue_snapshot_invalidated (
    SELECT
        *,
        CURRENT_TIMESTAMP as _inserted_at
    FROM {{ ref('historic_revenue_snapshot') }}
    WHERE dbt_valid_to IS NOT null
);

DELETE FROM {{ ref('historic_revenue_snapshot') }} WHERE dbt_valid_to IS NOT null;
```
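
To run this as the dbt operation used in the job below, one option is to wrap the two statements in a macro whose name matches the run-operation call. This is only a sketch: it assumes the _invalidated audit table already exists with a matching schema, and it uses run_query, which is available in current dbt versions (the linked code may differ).

```sql
-- macros/historic_revenue_snapshot_cleanup.sql (a sketch; the linked code may differ)
{% macro historic_revenue_snapshot_cleanup() %}

    {# assumes dbt_snapshots.historic_revenue_snapshot_invalidated already exists,
       with the snapshot's columns plus an _inserted_at timestamp #}

    {% set insert_sql %}
        insert into dbt_snapshots.historic_revenue_snapshot_invalidated (
            select
                *,
                current_timestamp as _inserted_at
            from {{ ref('historic_revenue_snapshot') }}
            where dbt_valid_to is not null
        )
    {% endset %}

    {% set delete_sql %}
        delete from {{ ref('historic_revenue_snapshot') }}
        where dbt_valid_to is not null
    {% endset %}

    {% do run_query(insert_sql) %}
    {% do run_query(delete_sql) %}

{% endmacro %}
```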

The historic_revenue_snapshot_invalidated table contains all records that have caused an error, while the historic_revenue_snapshot table no longer has records that will fail the is_null test:

| date_day   | total_revenue | dbt_valid_from | dbt_valid_to |
| ---------- | ------------- | -------------- | ------------ |
| 2020-07-01 | 30            | 2020-07-15     | null         |
| 2020-07-02 | 1             | 2020-07-15     | null         |
| 2020-07-03 | 17            | 2020-07-16     | null         |
| 2020-07-04 | 6             | 2020-07-15     | null         |

So failures will only cause an error once, and it’s extremely easy to debug any past failures thanks to the _invalidated table. (It’s also good practice to have snapshots of your core tables themselves to make debugging even easier, i.e. so that you can zero in on the exact record that caused you trouble, even if you’re not running any tests on them.)
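
For example, a plain row-level snapshot of fct_orders might look like the sketch below; the snapshot name and the choice of check_cols are assumptions for illustration.

```sql
{% snapshot fct_orders_snapshot %}

    {{
        config(
          target_schema='dbt_snapshots',
          strategy='check',
          unique_key='order_id',
          check_cols=['amount']
        )
    }}

    -- no rollup here: one row per order, so when a daily total changes
    -- you can see exactly which order records were altered
    select * from {{ ref('fct_orders') }}

{% endsnapshot %}
```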

Putting it all together

Running this in production does get a little complex; we chose to create one job per important model that covers all of the steps:

```
$ dbt run-operation historic_revenue_snapshot_cleanup
$ dbt run -m +fct_orders
$ dbt snapshot --select historic_revenue_snapshot
$ dbt test -m historic_revenue_snapshot
```

By adding this to your production run of dbt, you can get ahead of any mistakes, building trust with your stakeholders in the process!


Hey @evgeny.r thanks so much for this inspiring article!

We actually created a regression testing setup based on it and just published a Medium article about our approach.
We also refer back to this post, since it’s what introduced us to the solution. Big thanks for that!

Best,
Thomas

@evgeny.r I have one query regarding using snapshot for fact tables.

Snapshots use SCD Type 2, which makes me think they are meant for tracking dimensions and not facts.

Do you think this will scale well, or have you seen it scale well for large data volumes? I am talking about hourly pipelines for 20 to 50+ API sources bringing in order transaction data on, let’s say, a Black Friday.

Snapshotting a fact table makes me think we are trying to apply a dimension technique (and a slowly changing one at that) to change data capture on facts that arrive in bulk and change much faster than dimensions would.

@somtom I checked your article too and realized you were snapshotting summary facts (sessions), which are surely not going to be huge compared to the actual sessions data.

Do you think snapshotting is a viable option for staged data (stg_sessions) rather than modelled data (fct_sessions), or will it become a performance bottleneck?

Any help would be appreciated. :slight_smile:

@urwa when snapshotting staged data, there are a few things to consider when it comes to performance concerns, or the question of whether you need snapshots at all:

  • Can you snapshot it properly at all? Do you, for instance, have a unique key, or do you have an EL job that appends new or changed records? If the latter, a snapshot might not make much sense, since your EL job is already doing something snapshot-like from the source.
  • Would you expect the historical staging data to change at all? If not, you might not need snapshots. And if historical data does not change, then your snapshot table should not be much bigger than the original staging table itself.

Those are just some quick thoughts that came to my mind.

In our case, as you correctly wrote, we are snapshotting historical summary stats, mostly on modeled fact tables. That way, we can check that they don’t change. The main reasons for changes are that certain dimensions are adjusted in the source, or that we make code changes that have unintended side effects. We have not yet used snapshotting for staging tables, though.