State of testing in dbt

Hello!

My team and I have been thinking about testing lately, and as part of that thinking I did a deep dive into the current state of testing in dbt. In this post, I will try to motivate the importance of testing, present research I’ve collected about testing in dbt, draw some conclusions from that research, and finally make some suggestions for advancing testing in dbt.

I tagged a lot of folks on this post and would love to get more input, more research, and opinions here. Thanks!

Motivation

Presently, dbt offers two types of tests: schema tests and data tests. For the purposes of this post, I will refer to these as dbt’s primitive tests, in the sense that they are implemented in dbt core and can be composed into more sophisticated tests. Schema tests are good for enforcing constraints that traditional relational databases typically offer but analytical databases don’t. Data tests, meanwhile, are arbitrary SQL queries and can assert nearly anything.
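To make the primitives concrete, here is a minimal sketch (model, column, and file names are invented for illustration). A schema test is declared in YAML on a model’s column:

version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

A data test is a standalone SQL file under tests/ that selects rows that should not exist; the test passes when the query returns no rows:

-- tests/assert_no_negative_amounts.sql
select *
from {{ ref('orders') }}
where amount < 0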

Zooming out a bit from dbt, I would like to motivate the value of testing in software. The points here are applicable to all software projects including dbt projects.

Why should I test software?

With software tests, an engineer can:

  • reduce the likelihood of an unexpected outcome or side-effect from the code or system under test,
  • reduce the likelihood that a current or future code change will introduce a system error,
  • inspire confidence in the test author and others that the code or system under test works as intended, and
  • document code and system behavior by illustrating edge-cases.

There are two parts to testing: writing tests and running tests. Both of these parts are necessary for a useful test suite to provide the above benefits.

How should I test software?

A good test typically verifies something: that a piece of code behaves a certain way under certain conditions, or that a system functions in a certain way.

There are many established ways to test software, each with its own strengths and suitability. I would argue that the best way to test depends heavily on what you are trying to accomplish, e.g., preventing code regressions, ensuring a contract, or introducing a new working feature. The level of testing effort should scale with the situation.

I would like to highlight two testing categories that I think are relevant to dbt:

  • Smoke testing: “Where there’s smoke, there’s fire.” A smoke test confirms that the system under test looks and/or behaves a certain way. Smoke tests are used in blue/green deployment pipelines as a last-step check before cutover; if the smoke test fails, then the deployment pipeline fails. Smoke tests are difficult to run at development-time because they require the entire production system to be functioning for the developer. Smoke tests are particularly effective in continuous deployment pipelines to help catch issues that are difficult to detect at development-time. Smoke tests are a sign of a mature system.
  • Unit testing: A unit test proves that a “unit” of code works as expected. A unit test commonly has three steps: setup, execution of the code under test, and verification/teardown. A unit can vary in size, from a single function to an entire class. Unit tests are self-contained and typically do not test the functionality of external systems (e.g., using HTTP calls or database connections). “Mocking” is a common technique used during unit tests and is often essential for unit testing. Unit tests are run at development-time and during continuous integration to detect breakages before code reaches production.

Testing can occur at a variety of points in the software development lifecycle. Unit tests are typically run at development-time (i.e., when the code is being developed) and during continuous integration (i.e., when the code is being merged into the main branch of the project). Smoke tests are typically run during continuous deployment (i.e., when the code/data is being deployed to production).

(For more on testing in software, I recommend consulting Martin Fowler’s writing!)

Community research

I have conducted research about dbt by reading through blog posts, GitHub issues, and dbt Discourse. I found a number of community members who have described their attempts at, and/or the challenges of, building advanced tests to suit their needs:

  • Petyo Pahunchev pointed out that primitive tests (schema tests and data tests) leave a gap in dbt development, saying “there is no tool to let us deterministically re-create a set of conditions, execute our code and validate the output.” In his post, he suggested a way to create tests using a Python library to allow for a behaviour-driven development methodology in dbt.
  • @michael-kaminsky outlined the challenges of adopting a test driven development approach in ELT. He made a number of good points, including “it takes a lot of work to generate realistic test-cases”, and “executing a pipeline and testing it can take upwards of 10 minutes — this is way too slow for doing real test-driven development”.
  • MichelleArk described their team’s attempt at hand-constructing static CSVs, building macros, and running primitive tests to assert model behavior generally and in edge cases. Ultimately, they wrap up the GitHub issue saying “a general issue with this approach is that writing assertions for unit tests feels quite unnatural in SQL - it’s tricky even to get the right semantics for an equality check.” Later in the post, I described my approach to black-box testing PL/pgSQL transformations using Python.
  • @fabrice.etanchaud noted that with the new --defer flag, there may be a way to automate non-regression tests (see diagram). Later in the post, @jtcohen described how to mock a dbt model using macros and run-time variables.

Despite these challenges, there are successful examples of building advanced tests using dbt’s primitive tests or by building scaffolding around dbt:

  • @claire described how she tests with dbt using a fixed dataset. Her fixed dataset consists of “fake data consisting of all my known edge cases so that I can check that my SQL works.” She noted she only does this testing during development, and it’s not clear if engineers returning to her code are able to run these tests.
  • @claus presented how to run primitive dbt tests as a smoke test during a blue/green data deployment. If the smoke tests pass, then the dbt output is made available for use; otherwise the data deployment fails.
  • @gnilrets built a well-documented testing framework outside of dbt. A developer using the framework may write testing specifications for a data transformation, and the framework verifies the spec.
  • The dbt-utils project has a series of integration tests that verify project functionality across the supported databases. Meanwhile, dbt-audit-helper has macros that can supercharge dbt data tests.

Conclusions

Based on this research, I conclude the following:

  1. Strategies for smoke testing in dbt have been discovered and documented by the community.
  2. Meanwhile, there is a high barrier to entry for advanced development-time dbt testing strategies. To achieve advanced dbt testing, one either needs expert-level knowledge of dbt or a high investment of individual time.
  3. There’s a general desire for advancing development-time testing capabilities inside or outside of dbt, and community members are looking for capabilities well established in the field of software testing.

Suggestions

Here are some concrete suggestions that I think will advance testing in dbt:

  • Product: Map out the current development-time testing workflow and map out an improved development-time testing workflow.
  • Technical: Make “mocking” a dbt model and dbt source a well-designed and first-class construct in dbt. I think many of the community members I referenced above could benefit from a well-defined approach to mocking in dbt.
  • Documentation: Motivate testing in the dbt docs and present the trade-offs of testing. Promote testing best-practices and advanced testing strategies in the dbt docs. (What does a well-tested model look like? When should tests be run? See React Testing for an example.)

Looking forward to continuing the discussion!

21 Likes

Hi!

This is something I think about as well, as a software engineer coming into data and using dbt. I love dbt and how it works; it’s such a great tool.

One of the things I have also thought about is testing, and how we ensure data is correct when building our pipeline:

  • We use schema tests in our models to ensure not_null etc., and use the dbt-utils package for helpers.
  • We have a CSV which has specific columns, set up like this:
description,table_name,where_field,where_value,test_field,test_value
A customer has this weird edge case for a field,fct_foobar,customer_id,cd8e741c-1436-4145-bf00-b3e59fbf73f7,my_awesome_field,foopytoopy

A Python script then looks for this file under data/data_tests.csv and builds out tests in tests/generated/xxx.sql, each with the following content:

{{ config(tags=['tests']) }}
{{test_where_exists('fct_foobar','customer_id','cd8e741c-1436-4145-bf00-b3e59fbf73f7','my_awesome_field','foopytoopy')}}

The macro looks like this:

{% macro test_where_exists(model, where_field, where_value, test_field, test_value) %}
select {{where_field}}, {{test_field}} from {{ ref(model) }} where {{where_field}} = '{{where_value}}' and {{test_field}} != '{{ test_value }}'
{% endmacro %}

This is great as we can build this against random edge cases and “known good” orders. If something breaks we’ll know straight away.

We’re also planning on using a tool like Great Expectations to test our data models.

2 Likes

@joshtaylor thanks for sharing! This is quite clever. At which point in the development lifecycle do you run these tests: at development-time, or at deploy-time?

I like the name test_where_exists, and I wonder if semantic macros in a dbt package could be added to improve the baseline of development-time testing. An idea I have for this is exact_match, like so:

-- BigQuery Standard SQL:

{% macro exact_match(test_dataset_query, query_under_test) %}
(
    ({{ test_dataset_query }})
    except distinct
    ({{ query_under_test }})
)
union all 
(
    ({{ query_under_test }})
    except distinct
    ({{ test_dataset_query }})
)
{% endmacro %}

Then two more macros that can be used similarly:

  • is_contained_within(test_dataset_query, query_under_test)
  • does_not_contain(test_dataset_query, query_under_test)
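As a rough sketch, here is how those two could be implemented on top of the same set operators; the exact semantics (which side is “expected” and which is “actual”) are my assumption, not an established package API:

{% macro is_contained_within(test_dataset_query, query_under_test) %}
-- failing rows: test dataset rows that are missing from the query under test
(
    ({{ test_dataset_query }})
    except distinct
    ({{ query_under_test }})
)
{% endmacro %}

{% macro does_not_contain(test_dataset_query, query_under_test) %}
-- failing rows: forbidden test dataset rows that do appear in the query under test
(
    ({{ test_dataset_query }})
    intersect distinct
    ({{ query_under_test }})
)
{% endmacro %}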

Applying exact_match in a data test (borrowing from jaffle_shop):

{% set test_dataset_query %}
select 'mperez0@chronoengine.com' as email, 33 as customer_lifetime_value union all
select 'smccoy1@reddit.com' as email, 23 as customer_lifetime_value union all
select 'kpayne2@cargocollective.com' as email, 65 as customer_lifetime_value
{% endset %}

{% set query_under_test %}
select
    email,
    customer_lifetime_value
from {{ ref('dim_customers') }} as dim_customers
left join {{ ref('raw_customers') }} as raw_customers
    on raw_customers.id = dim_customers.customer_id
{% endset %}

{{ exact_match(test_dataset_query, query_under_test) }}

Now imagine being able to define, in the above data test file, which refs are mocked and how. I think that would be a great developer experience for a self-contained “unit test” in a single file. @jerco curious to get your thoughts on the feasibility of that idea.
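For illustration only, here is a very rough sketch of one direction that could take today, building on the run-time variable approach @jtcohen described (the mock_refs variable name and the raw_customers_fixture seed are hypothetical, and it assumes a dbt version where the builtins namespace is available):

-- macros/ref.sql: override the built-in ref() so a model can be swapped for a mock at test time
{% macro ref(model_name) %}
    {% set mocks = var('mock_refs', {}) %}
    {% if model_name in mocks %}
        {{ return(builtins.ref(mocks[model_name])) }}
    {% else %}
        {{ return(builtins.ref(model_name)) }}
    {% endif %}
{% endmacro %}

-- invoked at test time, e.g.:
-- dbt test --vars '{mock_refs: {raw_customers: raw_customers_fixture}}'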

1 Like

We run these tests in development using filtering, but we also have a “sample” target set in the dbt profiles.yml.

We then have a filter in our models to filter by this id, as we usually want to keep only the orders that match :slight_smile:

So inside of our staging models, we do:

{{filter_if_sample_target_by('order_id')}}

We then have a Python script which builds this macro:

# "ids" is assumed to be the list of sample order ids gathered earlier in the script
with open("./macros/join_orders.sql", "w") as file:
    macro = "{% macro filter_if_sample_target_by(source_column_name) %} {% if target.name == 'sample' %} where {{source_column_name}} in ("
    macro += "'" + "','".join(ids) + "'"
    macro += ") {% endif %} {% endmacro %}"
    file.write(macro)
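For reference, the generated macros/join_orders.sql then contains something like the following (ids invented, and reformatted here for readability; the script writes it on a single line):

{% macro filter_if_sample_target_by(source_column_name) %}
    {% if target.name == 'sample' %}
        where {{source_column_name}} in ('order-id-1','order-id-2','order-id-3')
    {% endif %}
{% endmacro %}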

Now our dataset in development goes down to ~150 records and is a lot faster to build against. :slight_smile:

1 Like

@boxysean thank you so much for pulling together this excellent literature review. You’ve really nailed the capabilities and constraints of testing with dbt at the end of 2020, as they are and (even more importantly) as they’re understood to be.

I’m excited to watch two talks about testing + dbt over the next two days; and since I’m already overdue in responding to this, I thought I’d leave a few thoughts here now.

If you can write it in SQL, you can test it

As I see it today, dbt testing can be quite full-featured—the primitive tests can do a lot—so long as you’re willing to write the code for it. Data tests are a great tool for one-off assurances; you just need the SQL that selects the rows you’re sure shouldn’t be there. When those assurances need to become assertions, standardized and cross-applicable, rather than copy-pasting data tests and swapping out a line or two, it’s simple enough to DRY up that code, parametrize the inputs (model, column_name), and wrap it in a Jinja macro prefixed test__. Write a bit more Jinja, and you can end up somewhere very fancy indeed:

  • unit testing of one model’s logic via in-file fixtures, following Sean’s and Josh’s examples above
  • mock data in seeds, supplied by an external library like faker, that stand in place of your real source data via a source() override, and enable full end-to-end DAG testing
  • leveraging inferential statistics to identify outliers or aberrations, via BigQuery ML + dbt_ml, or by building on a unified multilingual platform like Databricks
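To make that “DRY it up and wrap it in a macro” step concrete, here is a minimal sketch of a custom schema test in the pre-v0.20 style (macro, model, and column names are invented); the macro takes a model and a column_name, and returns a query counting failing rows:

{% macro test_positive_values(model, column_name) %}
select count(*)
from {{ model }}
where {{ column_name }} <= 0
{% endmacro %}

which can then be attached to any column in schema.yml:

models:
  - name: payments
    columns:
      - name: amount
        tests:
          - positive_values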

All of which is to say, the fact that dbt only ships with four schema tests never struck us as a limitation: those four are the starting points, and the code is simple enough that users can write many, many more. There are a dozen custom schema tests in dbt-utils. @claus wrote several dozen more in dbt-expectations, replicating many of the great ones in Great Expectations.

In this way and many others, dbt is a framework—a way of writing code, wrapped in boilerplate—more than it is a library. The same holds true for materializations, strategies, default macros, and the rest: the fact that a user can override, adapt, and extend built-in code to accomplish their unique use case is one of dbt’s biggest features.

Better built-ins

All of that said, I think there are a number of obvious, uncontroversial ways to make tests more accessible and powerful out-of-the-box. By virtue of their age, there are also a few things about tests that don’t make a lot of sense, and we should fix that. We want to make those changes soon, ahead of dbt v1.0.

Here are some of the things on my wishlist:

  • Data tests should be configurable and documentable.
  • Schema tests and data tests should both return a set of failing rows, rather than today’s split of a numeric count(*) for the former and a set of rows for the latter. Why? Because…
  • dbt should make it easier to debug errant tests in development, by (optionally) writing the failing records to database tables.
  • Users should have the option of setting warn and error severity via thresholds (absolute or a relative %), similar to source freshness.
  • When a user creates a schema test, they should feel they are building a reusable asset, not just hacking together one more macro to maintain. Schema test blocks (!) should have optional configurations like description and human-friendly failure messages. Those assets are even more valuable when packaged up and open sourced.
  • At the same time, you shouldn’t need a custom schema test to do something obvious, like add a where filter or a limit. All schema tests should have the ability to apply those configs, right out of the box.
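Purely for illustration, the last two bullets might read something like this in schema.yml (hypothetical syntax at the time of writing, not an existing feature):

models:
  - name: orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
              config:
                where: "order_date >= '2020-01-01'"
                limit: 100
                severity: warn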

(Check the v0.20.0 milestone for detailed issues on most of the above.)

Lowering barriers

Will this solve everything? No, definitely not—it doesn’t even begin to touch on some of the coolest things Sean noted above. I agree first and foremost, however, that there is too high a barrier separating unique + not_null from more advanced usage, and not enough scaffolding along the way.

I’d love to end up in a place where we add native support for mock data + unit testing, following an opinionated + thoughtful approach that reflects a community-driven consensus. If we’re going to get there, we need to focus on solidifying the simple stuff today: making the case that dbt testing is fully featured in its own right, and fully extensible in yours.

I want dbt developers to feel comfortable investing their time and energy in pushing the limits of tests. I know we have work to do here; I’m reminded of it every time I hear someone suggest that dbt’s built-in schema tests are fine as baby steps, on the way toward using a separate dedicated test framework. (Other frameworks, including GE, are great! If it’s not already clear, they inspire a lot of my thinking here.) I hope that the technical changes I listed above are a reasonable set of steps on the way there.

At the same time, I so agree with Sean’s suggestion that a big, big piece of this is less about technical change and more about a clearer narrative in the documentation:

  • We should talk about schema test definitions more like models, snapshots, and seeds—things you expect to create—rather than just as things you expect to use, and quickly grow out of.
  • We should clarify the building blocks for folks who want to build more complex systems. If you’re writing unit tests, should you make your equality comparisons using custom code? The widely used dbt_utils.equality() schema test? The beloved audit_helper.compare_relations() macro? Why not the adapter.get_rows_different_sql() adapter method, which is used in our adapter-testing suite?
  • We should encourage more community-contributed tooling that reinforces a culture of testing, tightly integrated with models, such as @tnightengale’s new package that offers a way to enforce test + documentation coverage.
  • We should tell a story about tests being expressive attributes of models, ones that can validate, document, and communicate their meaning. They are (dare I say it) rich pieces of metadata.

Wherever possible, we should have solid recommendations, use cases, and trade-offs. The most intrepid dbt developers have already done promising work here. I want to see many more community members taking measurable risks, pulling on some threads, and reporting back with their ideas and findings—without feeling like they’ve gone off the deep end. I promise, I’m right there with you.

8 Likes

Hi all !

As a former software engineer, I came across something called “design by contract”. The idea, proposed by Bertrand Meyer (another French citizen, sorry ;-)) and implemented in the Eiffel language, is derived from Hoare triples, and I wonder how it could be used to validate a model. I’d like to take a little time to explain myself, and ask you to question the idea heartily!

Folks in the Eiffel world use this every day, and it heavily reduces the need for unit tests, which can only test for the presence of an error. (In fact, there is even a way to automatically produce unit tests based on the notion below.)

The idea of testing for correctness, instead of looking for defects, could be translated into our dbt world like this:
given:

  • a model M, the implementation of a data transformation referencing zero or more direct upstream models as input
  • a list of tests (data tests, schema tests on the direct upstream models) known as M’s precondition
  • a list of tests (data tests, schema tests on M), known as M’s postcondition

M’s correctness can be expressed like this:

Any run of the model started in a state where its precondition holds will end up in a state where its postcondition holds, something like dbt test_precondition -m M; dbt run -m M; dbt test_postcondition -m M

It’s quite intuitive: think of a transformation where a numeric column of an upstream model must be non-null and positive, for example, because M computes its square root in its transformation.

Then, to prove correctness of all or part of your transformation chain, you only have to run it with preconditions and/or postconditions enabled on all (or a subset) of the models. If dbt finishes successfully, everything is correct.

What do you think of this idea ?

With these glasses on, one could already see schema tests as being:

  • part of the postcondition of the model (as it constrains the underlying transformation)
  • part of the precondition of all direct downstream models (as direct downstream transformations may rely on M’s schema tests).

So one could envision:

New model properties :

  • require: precondition tests, much like model-level tests, but with the ability to also list data tests
  • ensure (could remain tests): postcondition tests, likewise, but at model or column level.
    In fact, should a direct upstream model have no schema tests, one could even add column-level tests in M’s require block for the upstream columns it needs.

New model configuration :

  • assertion: an array of possible 'require' and/or 'ensure' values, used to configure the level of correctness checking for that model (will test_pre and/or test_post be executed?).
    Thinking out loud, this could also be useful in Jinja code, to assert that some condition holds at a given point in a macro’s code. One could envision a macro like check(boolean). In that case, one could add an extra 'check' value to the list, used to enable assertions in Jinja.
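To make the proposal concrete, a hypothetical schema.yml for a model M could look something like the following (every property, config, and test name below is illustrative only):

models:
  - name: M
    config:
      assertion: ['require', 'ensure']
    require:
      # precondition: tests on direct upstream models, run before `dbt run -m M`
      - model: upstream_model
        columns:
          - name: amount
            tests:
              - not_null
              - positive_value
    ensure:
      # postcondition: tests on M itself, run after `dbt run -m M`
      columns:
        - name: amount_sqrt
          tests:
            - not_null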

So yes, it all comes down to a kind of intertwined dbt run + test at the model level!
And to follow on @jerco’s comment, these tests are more than tests: they are contracts, first-class citizens of a model’s metadata, stating “if you run me under these preconditions, I will ensure these postconditions”

Of course, the new config above would be used to disable these tests in production.
To enforce eventual consistency, this mechanism should allow deferring relationship tests to the end of the dbt run.

Do you think it could be valuable ?

In fact, this notion is so powerful that it could even be used to synchronize parallel executions of DAG parts (Eiffelists call that SCOOP).

Imagine you partition the DAG, assigning a processor/thread to each partition, with model B and model A in two distinct partitions assigned to distinct execution threads. B’s precondition contains a data test T referencing A. If T is false, dbt should not fail, but instead wait until T holds, and then run model B and the downstream models in B’s partition.

So dbt run could have an “infinite” behaviour mode: each processor doing an endless loop over its DAG partition, applying any wait condition at each model execution. Data tests in a model’s require config that reference only models in the same partition would behave as standard require tests.

We could even add source freshness to the list of waiting conditions.

I am not a graph theory specialist, and have more experience in imperative than functional programming, so there are edge cases for sure…
and easier said than done, for sure…