Data tests for eventually consistent data


#1

Moving a topic here from a Slack discussion. Note that I’ve lightly edited and shuffled the conversation to read better in discourse, so the below suggestions are not directly attributable quotes.

The original problem statement: I’m wondering if there’s a way to do the relationship tests such that it will only test prior to a certain create date.

Claire suggested: I have the same issue, but haven’t yet fixed it. My plan is to find the max created timestamp for each of my tables in question, and then find the min (or “least” in sql terms) of those timestamps. Then execute that query in a statement to return it to my jinja context, and then filter my base tables to only show records created before that timestamp.

We use Fivetran, so theoretically I could also query their audit tables to tell me when the most recent sync of each table started. That might be even better if records in one of your tables are not being created overly frequently (for example, some of my tables only have records added once every few weeks, so I wouldn’t want to use the most recent created_at). Depends on what audit tables you have available to you though.

Drew suggested: I think one approach to fixing this might be to add a “maxerror” parameter to tests. When i see this, it’s usually like 1 or 2 failing rows. So i could pretty happily set maxerror to 10 and ignore any missing values due to bad sync timing.

Harlan suggested: We’ve done the max-error thing. We also for a while had an Airflow process set up so that if dbt test failed, it’d re-run dbt run and try again, to try to clear the tests. Which sometimes worked, but was a pain, so we’re not currently doing that…

Tim suggested: You could create filtered views from those datasets, orders_thruYesterday and contacts_thruYesterday and set up your relationship tests to run on those subsets of records. Or instead of “yesterday” you could filter everything to some max timestamp as @claire is describing. I can’t think of exactly what the pros/cons are of filtering everything up-front to be a consistent timeframe vs. filtering your built “final” tabled just before testing. I’m sure there are scenarios where you’d want to do one of the other.

Four reasonable approaches depending on the requirements.

Based on my requirements: This is a custom etl job. Nobody has prebuilt connectors for this vendor. The underlying API is excruciatingly slow and the volume is high so I don’t think maxerror will work, we’d have to set it so high as to be worthless and then miss actual data integrity issues. I like least(max(t1.created_at), max(t2.created_at), ...), it makes sense and is somewhat elegant.

I also think Tim’s suggestion would work if there’s a guaranteed consistency point, e.g. yesterday or possibly 2 days ago depending on expected data integration performance.

This is an ongoing project so I will update as to which approach I take and how it goes. I would also love to hear your suggestions and experiences!


#2

Some useful snippets to find the start time of the most recent fivetran sync, and then only include data that was created before this date:

{% macro get_last_sync() %}

  {%- call statement('last_sync', fetch_result=True) -%}

    select
    max(update_started) as "last_sync"
    from {{var('base_schema')}}.fivetran_audit

  {%- endcall -%}

  {%- set last_sync = load_result('last_sync').table.columns['last_sync'].values()[0] -%}

  {{return(last_sync)}}

{% endmacro %}

and then in my model:

select
...
from {{var('base_schema')}}.payments
{%- if execute %}
  where created_at < '{{get_last_sync()}}'
{%- endif %}