How do you test your data?

I’m curious to hear about approaches to testing data to ensure quality. Many of us in the data world are not trained software engineers and aren’t as familiar with things like CI/CD, unit or integration tests, separate testing environments, or containers.

If you use any creative dbt testing approaches, continuous integration/deployment, automation of SQL tests through a workflow tool like Airflow, or otherwise, I would love to hear about it! How do you test?

2 Likes

This is a really good question, and something we’ve been asking ourselves more recently. We add the typical unique/not-null test constraints, plus some relationship tests between our models. A test type we’ve incorporated more recently is the regression test, which is especially useful when we find a bug in our code with unintended consequences. Before attempting to fix the issue, we write a test that catches the bug, then fix the code until the test passes, which is similar to test-driven development.
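As a rough sketch (the model and column names here are made up), a regression test like this can be a SQL file in the tests/ directory that selects the offending rows; the test fails while the bug exists and passes once the fix lands:

-- tests/assert_no_duplicate_orders.sql (hypothetical example)
-- A dbt data test fails if this query returns any rows, so it
-- captures the duplicate-row bug until the model is fixed.
select
    order_id,
    count(*) as n_rows
from {{ ref('fct_orders') }}
group by order_id
having count(*) > 1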

1 Like

Hey @joshtemple - we use testing in a few different capacities.

  1. During development

When creating or modifying a table, I like to add tests for invariants like the uniqueness or non-nullness of a given column. This helps me gain confidence that the code I’m writing works as intended. @kriselle makes a great point about TDD above, too: for well-defined models, I like to write my tests first, then build the model around the tests. This flow also helps reduce friction when merging PRs… more on that below.
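As a minimal sketch of those invariant tests (the model and column names are invented for illustration), the schema.yml entries look something like:

# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id

Writing these entries before building the model is what the TDD-style flow above amounts to in practice.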

  2. During PRs

We’ve built build-on-PR functionality into Sinter. Whenever a pull request is opened against our internal-analytics repo, Sinter kicks off a full run and test of our dbt project in a scratch schema. If any of the models or tests fail, Sinter puts a big red X on the pull request; if everything works as intended, Sinter shows a nice green check mark. These contextualized test successes/failures help guide code review and act as a last check before code gets merged into production. In practice, it looks like this:

[screenshot: Sinter status check on a pull request]

  3. In production

We also use Sinter to run our models on a scheduled basis. A typical deployment of dbt includes running:

dbt seed
dbt run
dbt test

So, after every invocation of dbt run, we also run all of the tests in the project. If any of these steps fail, we get email notifications from Sinter. This alerts us when bad or invalid data has arrived from a data source, or when some logic in a model doesn’t hold up over time.

There’s lots more to say about testing, deployment, CI, etc. I know many dbt users are doing pretty interesting/complex things. You should cross-post this to #general in Slack :slight_smile:

2 Likes

I’d like to echo Drew’s sentiments above, and add two thoughts.

During Development

Before working on an existing model, I like to capture a few key metrics about it: things like the row count, the sum of key columns (e.g. revenue), and the count of distinct values in other columns. If the row count at the beginning and end of the development process doesn’t match, that’s fine, but I need to be able to justify the change.
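As a rough illustration (the table and column names are made up), the snapshot query looks something like this, run once against the existing model and once against my development build, then compared by hand:

-- Pre/post-development audit snapshot; names are illustrative only.
select
    count(*)                    as row_count,
    sum(revenue)                as total_revenue,
    count(distinct customer_id) as distinct_customers
from analytics.fct_orders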

After Development
We haven’t implemented this yet, but we hope to soon.

I think a summary metrics table is a good idea: pick 10 or so key business metrics from a bunch of different data sources and decide on a tolerance for change. If yesterday’s values fall outside that tolerance, have a test fail. A little noise (growth that looks too rapid, a slow holiday weekend, etc.) is better than a data problem your end users discover first.
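A sketch of what such a test could look like, assuming a hypothetical metrics_daily model with one row per metric per day and a 20% day-over-day tolerance:

-- tests/assert_metrics_within_tolerance.sql (sketch; the model name
-- and tolerance are assumptions). Fails if any metric moved more
-- than 20% versus the prior day.
with daily as (
    select
        metric_name,
        metric_date,
        value,
        lag(value) over (partition by metric_name order by metric_date) as prior_value
    from {{ ref('metrics_daily') }}
)
select *
from daily
where metric_date = current_date
  and prior_value is not null
  and abs(value - prior_value) / nullif(prior_value, 0) > 0.20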

I’ve spent a lot of time lately thinking about what CI looks like for data. Thanks for starting the conversation.

3 Likes

Thank you all for your input!

We have a similar setup where we use GitLab CI to create and test the entire data model in a build schema before merging into master. We also trigger our GitLab pipeline for daily runs after our EL jobs finish, which executes dbt seed/run/test in a build schema, then dbt seed/run in production if the tests pass.

This is helpful when issues with newly loaded source data cause a test to fail. The side effect is that if a single test fails, we don’t update our production schema at all, since we don’t have a good way to pass or fail specific models and promote only the passing results to production. I’d be interested to hear if anyone else has found a better way to do that.
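As a rough sketch (not our actual configuration), assuming dbt profile targets named build and prod are defined in profiles.yml, the pipeline looks roughly like this in .gitlab-ci.yml:

# .gitlab-ci.yml sketch; job names, stages, and targets are illustrative.
stages:
  - build_and_test
  - deploy

test_in_build_schema:
  stage: build_and_test
  script:
    - dbt seed --target build
    - dbt run --target build
    - dbt test --target build

deploy_to_prod:
  stage: deploy
  script:
    - dbt seed --target prod
    - dbt run --target prod
  only:
    - master

Because later stages only run when earlier ones succeed, a single failing test blocks the production run, which is exactly the limitation described above.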

@emilie, I like your idea of checking a few “health” metrics during and after development. At a previous company, we had a daily email that displayed the difference in record count between today and yesterday for each table. Even that simple count helped us catch a lot of issues.

@joshtemple – you might be interested in checking out this issue; there’s a fair bit of discussion there about how dbt might handle this in the future! Feel free to contribute to the discussion as well :slight_smile:

1 Like

I’ve thought about this a lot too, and I’ve come to the conclusion that testing is a data analysis problem… something we are all familiar with! To that end, I like to think about instrumentation, storage, and analysis as discrete problems within testing: something collects the data, the data is stored, and then you figure out what it means. Keeping it modular lets us evolve each piece as we need to. As for the noise referenced earlier, over time our analysis should learn to deal with that.

Hey folks :wave: - I recently wrote a post on this topic sharing one approach I’ve used in the past. May be of interest to this group: Testing SQL – Ian Whitestone

@joshtemple - Can you please share more details on the setup below?

We have a similar setup where we use GitLab CI to create and test the entire data model in a build schema before merging into master. We also trigger our GitLab pipeline for daily runs after our EL jobs finish, which executes dbt seed/run/test in a build schema, then dbt seed/run in production if the tests pass.