Testing dbt data transformations

Testing dbt data transformations

In a previous post, I introduced dtspec, a new framework for specifying and testing data transformations. The general idea and framework has held up quite well over the last year for our internal purposes. However, one aspect that was severely lacking was more direct integration with dbt. Previously, using dtspec required you to write your own (Python) process for loading and extracting data from a test database to be used by the dtspec api. Furthermore, dtspec wasn’t aware of dbt sources, refs, and aliases. This meant that sometimes you might refer to a model in dbt with a particular name, but you’d have to use a different name in dtspec. These issues have been resolved in the most recent version of dtspec with the introduction of a simple helper CLI.

Note that these new features are currently only expected to work with Postgres and Snowflake warehouses. PRs always welcome!

CLI

Details about using dtspec and the CLI can be found on the github project page. Here, I’ll just highlight some of the key components.

One of the first things you need to do in setting up dtspec is create a test database environment that mimics the table schemas present in your production warehouse. This is done by using the dtspec CLI to first reflect the production schemas via

dtspec db --fetch-schemas

This saves the table definitions in your github repository. Initializing a new test database with these table definitions is then accomplished via

dtspec db --init-test-db

At this point, the test database would have the same tables as your production databbase, but they would all be empty.

In order to run the dtspec tests, just run

dtspec test-dbt

Spec parsing and Jinja context

While using dtspec at InsideTrack, we found it was useful to separate our dtspec schema into several files. Usually, we’d have a single file for each target, using tests to describe how it was created, along with separate files for source definitions. For example, our dtspec/specs folder looked like:

├── dtspec/specs
    ├── main.yml
    ├── identifiers.yml
    ├── factories.yml # for factories shared across many scenarios
    ├── sources
    |   └── raw.yml
    └── marts
        └── core
            ├── fct_orders.yml
            └── dim_customers.yml

The dtspec cli will search for all *.yml files in the dbt/dtspec/specs folder and compile them into one spec file to pass on to the dtspec testing API.

When the dtspec CLI is collecting these files, it’s also parsing them with Jinja. The full Jinja context is documented on the github project page. Some important components of the context include dbt source and ref context, as well as relative dates. For example, a dtspec data source may look like:

- source: {{ dbt_source('raw', 'raw_products') }}
   table: |
    | export_time                          | product_id | product_name |
    | -                                    | -          | -            |
    | {{ YESTERDAY }}                      | milk       | Milk         |
    | {{ TODAY - relativedelta(days=5) }}  | milk       | Milk         |

After the dtspec CLI parses this yaml file, it would insert yesterday’s date for {{ YESTERDAY }}.

Conclusion

The dtspec CLI dramatically enhances the ease of getting a dbt project working with dtspec. If you want to see a full project using dtspec, check out the dbt-container-skeleton.

2 Likes

I like the idea of testing transformations on dummy data, but at first glance dtspec looks overly complicated to me.

I’ll give a potential implementation of a "poor man’s dtspec":

  1. Use seeds to create tables with some special prefix - for example test_. For a source table source('jaffle_shop', 'orders'), the test table would be named test_jaffle_shop__orders. For a model ref(customers), the test table would be named test_customers. The idea is to use the naming convention to create a 1 to 1 relationship between sources/models and corresponding test tables.
  2. Find all test seeds that refer to models - either by eliminating the ones that have two underscores __ in their name or by cross-referencing with the graph variable. For each such model:
    1. Somehow temporarily and recursively make the ref('model') macro resolve to test_model and source('source', 'table') resolve to test_source__table.
    2. Run the generated SQL and make sure the output matches the contents of the corresponding test seed.

All this can be attained just by using seeds and with minimal configuration. So what’s the advantage of using actual dtspec?

Those are good point, @apanasenco. I think your approach would work well for a small number of tests where the data doesn’t have too many relationships. The problem I see is that once you write more than few tests, you quickly lose the context around what each row is doing in the seed and what it’s being used to test. Additionally, each time you add another record to the seed to test something new, you’d also have to add records to related object – every time you add a new orders record to test something unrelated to the other orders tests, you may also have to create a new customers record. Managing all this could get really cumbersome really quickly.

dtspec automatically handles setting up the relationships you need (via factories), keeps track of which records belong to which test (via identifiers), and provides a focused context for why a test exists (via scenarios/cases – note that example data and expectations don’t require you to specify data for all fields, just those that matter to what you’re testing in a particular case).

1 Like

I see, I thought the point was to create some really nasty mock data and make sure it works in all directions. However, that’d be one big integration test, whereas what you built with dtspec is more like a true unit testing framework.