How to get started with data testing

The ability to test data in dbt is something that dbt power users love and yet, overall, it’s an underused piece of functionality – out of the 684 dbt projects active last week, only 36% ran a test.

I’d like to compile expertise from this group into an article that helps analysts think about how to use tests to improve the quality of their data.

A few questions for the testers in the community:

  1. What are some tests you currently run in dbt?
  2. Why are these tests important for you and your team?
  3. What advice do you have for an analyst who isn’t currently testing their data but wants to get started?
  4. And, just because we’re curious, what’s your current ratio of tests to models?
6 Likes

Across several projects, we run tests on:

  • sources: mostly simply uniqueness, not_null schema tests
  • staging models: occasionally to make sure transformations didn’t create duplicates etc
  • dim/fact models: this is where test the core of the business logic, e.g. relationship tests, expression, recency tests etc

We also have a series of custom tests that test conditions across models.
For example, one project checks to make sure all orders contain at least one item from a certain product category. Another tests whether we’ve received a reasonable amount (above some threshold) of Fedex tracking data for our shipments (there is a time delay).

In our largest project, we currently have 315 models to 191 tests, where all dim/fact models are covered by at least one test.

2 Likes

We use tests extensively and they have enabled us to dramatically improve the quality of our transformations.

For several years prior to using dbt, we did all of our transformations in (untested) Looker PDTs. Eventually, we started having problems as our engineering team grew and our product development velocity increased. Since we adopted dbt in 2018, we’ve used it to validate our assumptions. IMO Josh Temple’s Automated testing the modern data warehouse is a must read.

I’d highly recommend testing to analysts that haven’t gotten started yet. We got by without them for awhile, but in the end I wish we started sooner. Trust in analytics can be fickle and finding out about issues before your end users report them makes a big difference.

We use all of the built-in tests (uniqueness, not null, accepted values, and relationships), have created some macros for repeated patterns (unique_where, not_null_where, etc.) along with custom data tests for business logic.

Our project has 430 models and 2287 tests (admittedly this may be a little excessive, but it’s easy to add/remove them).

To truly reap the benefit of tests and ensure data quality, some activities at the periphery of the tests are also required. I highly recommend:

  • Monitoring the health of your data warehouse (i.e. disk usage) and that dbt is actually running on the schedule it’s supposed to, within expected run-times, and alerting as necessary
  • Routing test failures to a tracking tool (i.e. Rollbar) rather than Slack or email so you’re not interrupted unless it’s truly urgent. We’re excited to use the new error levels functionality to further segment test failures here.
  • Systematically addressing patterns of false positives (i.e. noisy tests) by pruning and updating tests
  • Systematically addressing false negatives (i.e. data quality issues not caught by tests) through something analogous to postmortems
4 Likes

What are some tests you currently run in dbt?

We run all the common dbt tests (unique, not null, relation, etc). Our most valuable ones though are our custom tests. These typically validate more complicated assumptions about our models. For some of these, they also store in code some aspect of an upstream business process. Our zuora tests (Files · master · GitLab Data / GitLab Data Team · GitLab) are our most fleshed out custom tests. We have one that validates that every new subscription is linked to a Salesforce account. If this fails, it flags us to make and issue for the Finance team to fix it. We also validate the lineage of subscriptions so that renewals can’t be in the past. These are very valuable for data integrity purposes.

Why are these tests important for you and your team?

It gives us confidence to move quicker. We trust that our data is correct and we trust that we can make changes without breaking things because we have the proper checks in place. It removes some of the cognitive load on the analyst because they don’t have to self-check all of these assumptions - they can look at a test and know “Oh, every value in this column is unique” and they can move forward with those assumptions instead of constantly second-guessing.

What advice do you have for an analyst who isn’t currently testing their data but wants to get started?

Don’t think you have to do a big refactor to get going. Two is better than one and one is better than none. If you’re using dbt (which, really, you should) then start with the basis of unique and not null tests. Keep adding as you’re building. If you find a bug in your data, then write a test for it so you can catch it sooner. Start small and build the habit. Eventually you have a good number of tests and then you can invest even more in better coverage.

And, just because we’re curious, what’s your current ratio of tests to models?

226 models, 909 tests

4 Likes

While our use of dbt is still experimental and limited in use at HubSpot, we have instrumented testing/validation for our fact table/derived table pipeline. Each table is conceptually very similar to a dbt model (or combination of models). The pipeline generates 400+ tables and predominantly represents the work of data analysts across the company. The bulk of reporting and analytics queries initiated from our BI tool (Looker) are run off these tables.

We currently use airflow to orchestrate daily script runs that generate/append data to the tables and this includes running test scripts against the data in the tables. After the tests run, any tests that don’t pass are configured to alert users in slack so that the errors can be investigated ASAP.

What are some tests you currently run in dbt?

  1. primary key checks: (since snowflake does not enforce pk constraints). This can help identify fanout that was inadvertently introduced by changes in the up stream source system, an up stream script, or a code change to the current script.

  2. acceptable value checks: typically for data stored as a string, can be useful in cases where you are applying logic to a set of values you are expecting in a given column. If a value suddenly dissapears or a new one appears you will want to be alerted and update your logic to appropriately handle the new value. (ie: a column called program_name has expected values of silver, gold, platinum, diamond and then one day titanium appears – wtf?!)

  3. data volume checks: this type of tests monitors the volume of total rows or new rows inserted into your table and compares it against a defined baseline threshold. If the volume is higher or lower than usual it’s valuable to be able to investigate and identify possible upstream ingestion issues. Unmonitored, these types of issues can be problematic because most reports will still run but the numbers could be inaccurate.

  4. value comparison checks: compare the values in 2 fields to ensure they conform to expected results and make sense. A common use case where we run this type of test is to check for overlapping or out of order date ranges in scenarios where rows should be mutually exclusive, as well as logical (ie: start_date is not AFTER end_date)

Why are these tests important for you and your team?

Above all else, these tests are a critical tool for helping establish and maintain trust with business users who rely on this data.

They help analysts:

  • stay ahead of bad data
  • uncover problems BEFORE they impact reporting consumed across the business
  • monitor for upstream changes that are largely outside their control
  • produce consistent and accurate reports
  • measure the quality and reliability of their work
  • communicate and elicit empathy from software engineers who control data in the source systems
  • be intentional about what they are building and how it should work

What advice do you have for an analyst who isn’t currently testing their data but wants to get started?

Don’t hesitate to get started! Implementing testing will help create clarity and confidence in your data.

  • To avoid feeling overwhelmed, start with a focus on everything you build/update going forward, or your most heavily consumed data
  • If you know how to write a SQL query, you know how to write a validation test

SELECT count(*) as row_count , count(distinct pkey) as pkey_count , sum(iff(field_name NOT IN ('VALUE 1', 'VALUE 2', 'VALUE 3','VALUE 4'),1,0)) as relationship_type_out_of_range , sum(iff(start_date > end_date,1,0)) start_end_date_check FROM schema.table_name

  • At a very minimum you can start by checking your primary keys, and if you don’t have those defined in the database that’s okay, you should still know what 1 row of data represents for each table and can use a simple row() function to generate the key in the table

  • If you can, I recommend storing the results of the tests in a table in the database so you can build some meta reporting on your data validity/pipeline health

  • Adopt testing into your standard process and make it required to release new code

And, just because we’re curious, what’s your current ratio of tests to models?

Here’s the moment of truth – right now we have < 10% coverage (40+ tests) for our 400+ tables. Test scripts are currently NOT a requirement at HubSpot and we definitely feel that pain.

Testing really should be considered analogous to brushing your teeth. Would you ever think twice about brushing your teeth? It’s a basic form of hygeine that produces a STANK when skipped. Avoid the decay & test your data!

3 Likes