While our use of dbt at HubSpot is still experimental and limited, we have instrumented testing/validation for our fact/derived table pipeline. Each table is conceptually very similar to a dbt model (or a combination of models). The pipeline generates 400+ tables and predominantly represents the work of data analysts across the company. The bulk of reporting and analytics queries initiated from our BI tool (Looker) are run off these tables.
We currently use Airflow to orchestrate the daily script runs that generate/append data to the tables, and this includes running test scripts against the data in the tables. After the tests run, any tests that don’t pass are configured to alert users in Slack so that the errors can be investigated ASAP.
What are some tests you currently run in dbt?
- primary key checks (since Snowflake does not enforce PK constraints): these can help identify fanout that was inadvertently introduced by changes in the upstream source system, an upstream script, or a code change to the current script (sketched in SQL after this list)
- acceptable value checks (sketch below): typically for data stored as a string; useful in cases where you are applying logic to a set of values you expect in a given column. If a value suddenly disappears or a new one appears, you will want to be alerted so you can update your logic to appropriately handle the new value (i.e. a column called program_name has expected values of silver, gold, platinum, diamond and then one day titanium appears – wtf?!)
- data volume checks (sketch below): this type of test monitors the volume of total rows or new rows inserted into your table and compares it against a defined baseline threshold. If the volume is higher or lower than usual it’s valuable to be able to investigate and identify possible upstream ingestion issues. Left unmonitored, these types of issues can be problematic because most reports will still run but the numbers could be inaccurate.
- value comparison checks (sketch below): compare the values in two fields to ensure they conform to expected results and make sense. A common use case where we run this type of test is checking for overlapping or out-of-order date ranges in scenarios where rows should be mutually exclusive, as well as logical (i.e. start_date is not AFTER end_date)
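To make these concrete, here are rough sketches of each check in Snowflake SQL. These aren’t our production scripts; schema.table_name, pkey, and the other identifiers are placeholders. First, a primary key check: total rows should equal distinct keys.

```sql
-- Primary key check: row_count and pkey_count should match.
-- schema.table_name and pkey are placeholder names.
SELECT
    count(*) AS row_count,
    count(DISTINCT pkey) AS pkey_count,
    iff(count(*) = count(DISTINCT pkey), 'PASS', 'FAIL') AS pk_check_status
FROM schema.table_name;
```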
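An acceptable value check can simply surface any values outside the expected set, reusing the program_name example above:

```sql
-- Acceptable value check: any rows returned are unexpected values (hello, titanium)
-- and should trigger an alert.
SELECT
    program_name,
    count(*) AS row_count
FROM schema.table_name
WHERE program_name NOT IN ('silver', 'gold', 'platinum', 'diamond')
GROUP BY program_name;
```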
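A data volume check needs a baseline to compare against. One sketch is a trailing seven-day average, assuming the table carries an inserted_at timestamp (the 50%/150% thresholds here are illustrative, not our actual settings):

```sql
-- Data volume check: compare today's inserted rows against a trailing 7-day average.
WITH daily_counts AS (
    SELECT
        date_trunc('day', inserted_at)::date AS load_date,
        count(*) AS rows_loaded
    FROM schema.table_name
    GROUP BY 1
)
SELECT
    t.rows_loaded AS today_rows,
    b.avg_rows AS baseline_rows,
    iff(t.rows_loaded BETWEEN b.avg_rows * 0.5 AND b.avg_rows * 1.5,
        'PASS', 'FAIL') AS volume_check_status
FROM daily_counts t
CROSS JOIN (
    SELECT avg(rows_loaded) AS avg_rows
    FROM daily_counts
    WHERE load_date >= dateadd('day', -7, current_date)
      AND load_date < current_date
) b
WHERE t.load_date = current_date;
```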
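And a value comparison check for date ranges, assuming rows should be mutually exclusive per entity_id (a hypothetical grain column):

```sql
-- Value comparison check: count out-of-order and overlapping date ranges.
-- Both counts should be zero; anything else pages the Slack channel.
WITH ordered AS (
    SELECT
        start_date,
        end_date,
        lag(end_date) OVER (PARTITION BY entity_id ORDER BY start_date) AS prev_end_date
    FROM schema.table_name
)
SELECT
    sum(iff(start_date > end_date, 1, 0)) AS out_of_order_rows,
    sum(iff(start_date <= prev_end_date, 1, 0)) AS overlapping_rows
FROM ordered;
```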
Why are these tests important for you and your team?
Above all else, these tests are a critical tool for helping establish and maintain trust with business users who rely on this data.
They help analysts:
- stay ahead of bad data
- uncover problems BEFORE they impact reporting consumed across the business
- monitor for upstream changes that are largely outside their control
- produce consistent and accurate reports
- measure the quality and reliability of their work
- communicate with and elicit empathy from the software engineers who control data in the source systems
- be intentional about what they are building and how it should work
What advice do you have for an analyst who isn’t currently testing their data but wants to get started?
Don’t hesitate to get started! Implementing testing will help create clarity and confidence in your data.
- To avoid feeling overwhelmed, start by focusing on everything you build/update going forward, or on your most heavily consumed data
- If you know how to write a SQL query, you know how to write a validation test:

```sql
SELECT
    count(*) AS row_count,
    count(DISTINCT pkey) AS pkey_count,
    sum(iff(field_name NOT IN ('VALUE 1', 'VALUE 2', 'VALUE 3', 'VALUE 4'), 1, 0)) AS field_name_out_of_range,
    sum(iff(start_date > end_date, 1, 0)) AS start_end_date_check
FROM schema.table_name
```
- At a very minimum you can start by checking your primary keys, and if you don’t have those defined in the database that’s okay: you should still know what one row of data represents for each table and can use a simple row_number() window function to generate the key in the table (see the sketch after this list)
- If you can, I recommend storing the results of the tests in a table in the database so you can build some meta reporting on your data validity/pipeline health (example after this list)
- Adopt testing into your standard process and make it a requirement for releasing new code
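On the surrogate key point, here is a minimal sketch of generating a key with row_number(); the grain columns (account_id, event_date) are hypothetical stand-ins for whatever one row represents in your table:

```sql
-- Generate a surrogate key when the table has no natural primary key.
-- account_id / event_date are placeholders for your table's grain.
SELECT
    row_number() OVER (ORDER BY t.account_id, t.event_date) AS surrogate_pkey,
    t.*
FROM schema.table_name t;
```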
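And for storing test results, a minimal sketch of what that could look like; the analytics.test_results table and its columns are hypothetical, not our actual setup:

```sql
-- Hypothetical results table for meta reporting on data validity / pipeline health.
CREATE TABLE IF NOT EXISTS analytics.test_results (
    test_name     varchar,
    table_name    varchar,
    failure_count number,
    run_at        timestamp_ntz DEFAULT current_timestamp()
);

-- Each test script appends its outcome; trend failure_count over time
-- to report on overall pipeline health.
INSERT INTO analytics.test_results (test_name, table_name, failure_count)
SELECT
    'pk_uniqueness',
    'schema.table_name',
    count(*) - count(DISTINCT pkey)
FROM schema.table_name;
```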
And, just because we’re curious, what’s your current ratio of tests to models?
Here’s the moment of truth – right now we have < 10% coverage (40+ tests) for our 400+ tables. Test scripts are currently NOT a requirement at HubSpot and we definitely feel that pain.
Testing really should be considered analogous to brushing your teeth. Would you ever think twice about brushing your teeth? It’s a basic form of hygiene that produces a STANK when skipped. Avoid the decay & test your data!