Testing with a fixed data set

I would like to test our models with fixed test data, not the real data. These tests would verify the model output against known input data. The goal is to separate testing of the model logic from testing of the actual data.

I’m thinking about seeding the model source tables with fixed data and then running the tests. We’re using Snowflake and usually conduct development and testing in cloned test databases, so truncating the sources and replacing them with test data is not a problem.

A problem I see with my approach is that these logic tests would fail if run against the CI or PROD databases, because I can’t seed the source tables there. Is there a way to group tests and, for example, only run the data tests?

Is anyone practising this type of logic testing?

Hey @jtke! Such a good question! I have done a version of this where I use fake data while developing models: usually my fake data consists of all my known edge cases so that I can check that my SQL works. Unfortunately, I usually only do that during development, and switch out my references to the fake data before merging the code.

However, I have thought about extending this to be more long-lived in my project! Here’s what I would do:

1. Add seed files for your expected input and output to your dbt project

Name them sensibly, like input_subscription_events and expected_subscription_periods. Normally I like to design these seed files to have lots of tricky cases so I can be sure that my SQL is working as intended. Don’t forget to run dbt seed.
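For illustration, a tiny input_subscription_events.csv might look something like this (the columns are made up for this example; your real seed would mirror the source table your model reads from):

subscription_id,event_type,event_date
1,subscription_started,2020-01-01
1,subscription_cancelled,2020-03-14
2,subscription_started,2020-02-01

And expected_subscription_periods.csv would contain the rows you expect your model to produce from exactly that input.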

If you want to run things with “real” but static data, you could take the time to recreate the tables in a separate schema – check out this post on some approaches there!

2. In your models, ref the static data

But wrap it up in some conditional logic! (Yup, you can ref seeds if you decide to go down that route).

-- subscription_periods.sql

with subscription_events as (
  {% if target.name == 'ci' %}
  select * from {{ ref('input_subscription_events') }}
  {% else %}
  select * from {{ ref('stg_subscription_events') }}
  {% endif %}
)
...

3. Add an “equality” test to check that the model output matches the expected output

We use this test to check that two relations match exactly. You can see this testing pattern used throughout dbt_utils in its own integration tests (e.g. here).

# schema.yml
version: 2

models:
  - name: subscription_periods
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_subscription_periods')

4. Run your test with --target ci

The magic incantation here is:

dbt test --target ci

It should all pass if your transformations work!

But then when you run it in any other environment, that test will fail, so…

5. Parameterize the equality test to accept an environment argument

You’ll have to add your own version of the equality test to your project, and add an extra argument. You want to get to something like this:

# schema.yml

version: 2

models:
  - name: subscription_periods
    tests:
      - jaffle_shop.equality: # replace with your project name
          compare_model: ref('expected_subscription_periods')
          env: ci

I haven’t done this last step for this particular macro, but I wrote about the structure of making a test environment aware in another Discourse article.
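For what it’s worth, here’s a rough sketch of what that environment-aware macro could look like. The macro name and the env argument match the schema.yml above, but the rest is my own assumption: it presumes a dbt version where a test passes when its query returns zero rows, and it reimplements the comparison as a simple EXCEPT in both directions rather than delegating to dbt_utils:

-- macros/test_equality.sql (a sketch, not battle-tested!)
{% macro test_equality(model, compare_model, env=none) %}

{%- if env is not none and target.name != env -%}

{# Outside the requested environment: return zero rows so the test passes trivially #}
select 1 as placeholder where 1 = 0

{%- else -%}

{# Rows present in one relation but not the other; zero rows means the relations match #}
(select * from {{ model }} except select * from {{ compare_model }})
union all
(select * from {{ compare_model }} except select * from {{ model }})

{%- endif -%}

{% endmacro %}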

Now when you execute dbt test --target ci, you’ll be comparing your static data. In any other environment, you’ll be running against the real data, and the test shouldn’t fail in either case.

Let me know how you go!

Hi

The approach I’m now pursuing is a bit similar, but I found it easier to use a separate run-unit-tests.sh script to organize files into the correct places and then call dbt inside the script. I have grouped data tests by model under “tests/”, and the setup script copies the relevant tests into “tests/run” at runtime.

Test input data lives in seed files named like “test__&lt;model_source_ref&gt;.csv”, and I use a macro to substitute the real ref with the test ref when target == “unit-test”. Preferably I would like to override ref() itself and make it do the substitution transparently, but I don’t know if that’s possible. In any case, I would like to avoid having all the conditional logic in the model file that you show in step #2.

I wish I could override project settings, like the test paths, with command-line parameters. Then I would not need to copy files around. As it stands, I might end up modifying dbt_project.yml in the setup script before running the unit tests, depending on how complex the seeding becomes.

This is still a work in progress and not yet functional. Hopefully I’ll get it working today.

I got the basics working and I can now test a model with a predefined data set. I made a somewhat ugly macro to replace the model ref with a test data source when target == unit-test. Any ideas on how to make this better?

{% macro ref2(table_name) %}

{% if target.name == 'unit-test' %}
{{ return(ref(table_name).database ~ '.' ~ ref(table_name).schema ~ '.test__' ~ ref(table_name).identifier) }}
{% else %}
{{ return(ref(table_name)) }}
{% endif %}

{% endmacro %}

You can use the get_relation method here to do this a bit more cleanly (docs) – it’s longer but IMO more readable.

{% macro ref_for_env(model_name) %}

{%- set normal_ref_relation = ref(model_name) -%}

{% if target.name == 'unit-test' %}

{%- set test_ref_relation = adapter.get_relation(
      database = normal_ref_relation.database,
      schema = normal_ref_relation.schema,
      identifier = 'test__' ~ normal_ref_relation.identifier
) -%}
      
{{ return(test_ref_relation) }}

{% else %}

{{ return(normal_ref_relation) }}

{% endif %}

{% endmacro %}

^ You might want to do some whitespace control on that.
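For reference, in your models you’d then just swap ref for this macro, e.g.:

select * from {{ ref_for_env('stg_subscription_events') }}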