Testing dbt data transformations
In a previous post, I introduced dtspec, a new framework for specifying and testing data transformations. The general idea and framework have held up quite well over the last year for our internal purposes. However, one aspect was severely lacking: more direct integration with dbt. Previously, using dtspec required you to write your own (Python) process for loading and extracting data from a test database to be used by the dtspec API. Furthermore, dtspec wasn't aware of dbt sources, refs, and aliases. This meant that you might refer to a model in dbt by one name, but have to use a different name in dtspec. These issues have been resolved in the most recent version of dtspec with the introduction of a simple helper CLI.
Note that these new features are currently only expected to work with Postgres and Snowflake warehouses. PRs always welcome!
CLI
Details about using dtspec and the CLI can be found on the GitHub project page. Here, I'll just highlight some of the key components.
One of the first things you need to do when setting up dtspec is to create a test database environment that mimics the table schemas present in your production warehouse. This is done by first using the dtspec CLI to reflect the production schemas via
dtspec db --fetch-schemas
This saves the table definitions in your GitHub repository. Initializing a new test database with these table definitions is then accomplished via
dtspec db --init-test-db
At this point, the test database would have the same tables as your production database, but they would all be empty.
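To make "table definitions" a little more concrete: what gets reflected and saved is essentially the column-level structure of each table. Purely as an illustration (this is not necessarily dtspec's actual on-disk format, and the column types here are guesses), the definition captured for the raw_products table used later in this post would boil down to something like:

raw_products:
  columns:
    - name: export_time
      type: timestamp
    - name: product_id
      type: varchar
    - name: product_name
      type: varchar

That structural information is enough for --init-test-db to build empty copies of your production tables.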
In order to run the dtspec tests, just run
dtspec test-dbt
Spec parsing and Jinja context
While using dtspec at InsideTrack, we found it was useful to separate our dtspec schema into several files. Usually, we'd have a single file for each target, with tests describing how that target gets built, along with separate files for source definitions. For example, our dtspec/specs folder looked like:
dtspec/specs
├── main.yml
├── identifiers.yml
├── factories.yml   # for factories shared across many scenarios
├── sources
│   └── raw.yml
└── marts
    └── core
        ├── fct_orders.yml
        └── dim_customers.yml
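To give a feel for what goes into one of the target files, here is a rough sketch of what a pared-down marts/core/fct_orders.yml could contain. Treat it as an approximation rather than a copy of our real spec: the columns, factory name, and scenario are invented for illustration, and I'm assuming a dbt_ref Jinja helper as the counterpart to the dbt_source helper shown a bit further down (the exact spec keys and Jinja helpers are documented on the dtspec project page).

targets:
  - target: {{ dbt_ref('fct_orders') }}

scenarios:
  - scenario: Building fct_orders
    factory:
      parents:
        - BasicOrders  # a shared factory defined in factories.yml
    cases:
      - case: one fact row per order in the raw data
        expected:
          data:
            - target: {{ dbt_ref('fct_orders') }}
              table: |
                | order_id | product_id | order_total |
                | -        | -          | -           |
                | order1   | milk       | 3.50        |

Roughly speaking, dtspec generates the factory's source data, loads it into the test database, the dbt models get built, and the resulting rows in fct_orders are compared against the expected table.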
The dtspec CLI will search for all *.yml files in the dbt/dtspec/specs folder and compile them into one spec file to pass on to the dtspec testing API.
When the dtspec CLI is collecting these files, it also parses them with Jinja. The full Jinja context is documented on the GitHub project page. Some important pieces of that context include dbt source and ref helpers, as well as relative dates. For example, a dtspec data source may look like:
- source: {{ dbt_source('raw', 'raw_products') }}
  table: |
    | export_time                         | product_id | product_name |
    | -                                   | -          | -            |
    | {{ YESTERDAY }}                     | milk       | Milk         |
    | {{ TODAY - relativedelta(days=5) }} | milk       | Milk         |
After the dtspec CLI parses this YAML file, it substitutes yesterday's date for {{ YESTERDAY }} and the date five days before today for {{ TODAY - relativedelta(days=5) }}.
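For instance, if the tests were run on 2021-06-15 (an arbitrary date chosen for illustration), the table portion would render to something like the following; the exact date formatting may differ, and the dbt_source call similarly resolves to the actual table that the dbt source points at in the test database.

table: |
  | export_time | product_id | product_name |
  | -           | -          | -            |
  | 2021-06-14  | milk       | Milk         |
  | 2021-06-10  | milk       | Milk         |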
Conclusion
The dtspec CLI makes it dramatically easier to get a dbt project working with dtspec. If you want to see a full project using dtspec, check out the dbt-container-skeleton.