Introduction: In Pursuit of High Quality Data
When teams decide to implement dbt Cloud, their motivations often include decreasing the amount of bad data that makes it in front of data consumers. This sounds great in theory - but once the product is in front of them, how exactly do they implement this?
This is a question we get a lot from new dbt Cloud users. Teams want to know how others are building out their jobs to minimize issues in production data. Our answer to this often comes back to one of my favorite foods - lasagna.
My mom has her own lasagna recipe that she has been making for 40+ years. When I first learned how to make lasagna from my mom, she instilled in me a very particular order of operations for assembly, and whenever I cook up a lasagna, I still refer to her notes. I will now pass on the data version of this recipe to you.
The Family Lore: How is My Data Model a Lasagna?
One of the beauties of a lasagna is its layers - you get cheese, tomato sauce, whatever vegetables you like all with their distinct flavours, all held up by the integrity of the pasta sheet. We believe your data model should emulate this culinary masterpiece. By using a layered approach in your model, you’ll ensure the integrity of your production data, and make sure end users get to enjoy the full delight that is your product with a fork on a plate, rather than as a mess of tomato and pasta you need to eat with a spoon in a bowl.
Back in 2019, Claire wrote a great article about how we at dbt Labs (then Fishtown Analytics) recommend structuring your dbt projects. The long and short of it gets back to layers - differentiate your sources, staging and mart models within your project with both naming conventions and directories, with those three layers defined as follows:
Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.
Staging models: The atomic unit of data modelling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.
Marts models: Models that represent business processes and entities, abstracted from the data sources that they are based on.
This approach is what we recommend leveraging in deployment to ensure the best possible product gets to our consumers, but with some pasta sheets thrown in to ensure your tomatoes stay tomato-y and your cheese stays cheese-y.
The Ingredients: Tools You’ll Use to Make your Cloud Jobs a Success
The dbt pantry is full of ingredients we can use to make lasagna. Your flour, your tomatoes, your cheese - it’s all there but you’ve got to know how to use it. Here are some ingredients you should know about and leverage to ensure your data is top quality:
Source Freshness - Source freshness allows you to check that your data pipelines are in a healthy state. By setting warn and error intervals on your data age, you can use this information to go ahead with or cancel any transformation runs you would normally do on this data.
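As a sketch of what this looks like, freshness thresholds are declared in your sources YAML alongside a loaded-at timestamp column. The source, table, and column names here are hypothetical; adjust them to your own project:

```yaml
# models/sources.yml -- hypothetical source; swap in your own names
version: 2

sources:
  - name: jaffle_shop            # example source name
    database: raw
    loaded_at_field: _etl_loaded_at   # timestamp column your EL tool writes
    freshness:                   # default thresholds for tables in this source
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

With this in place, the freshness check warns when the newest record is over 12 hours old and errors past 24 hours.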
Tests - Tests are assertions you make about your models and other resources in your dbt project. You can and should set tests on all layers of your project, to make sure your sources, staging models, and mart models are all working as expected, and so you can decide whether or not to go ahead with further transformations.
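For example, the built-in generic tests can be attached to columns in a schema YAML file. The model and column names below are hypothetical:

```yaml
# models/staging/schema.yml -- example tests on a hypothetical staging model
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique        # no duplicate orders
          - not_null      # every row has an id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```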
Continuous Integration - Continuous integration in dbt allows you to trigger jobs to run when a pull request is created. This checks your changes against your production environment before they are merged into your production branch, and lets you know if there are any problems.
With these three ingredients, we are ready to make a top-notch dbt job that allows our data team to sleep at night knowing they are doing everything they can to ensure consumers are getting correct data.
The Recipe: A Step by Step Guide to Running your Cloud Jobs
Now that you know how the pantry is organized, let’s get cooking.
For production jobs, we want to make sure the integrity is there every step of the way, so we catch as many errors as possible before they end up in front of data consumers. Here are our recommended steps to do this:
Use Source Freshness:
Make sure we start from a good place by checking your source freshness. This will make sure that the data you are transforming is where it’s supposed to be (and not stuck in the EL layer of your pipeline). In dbt Cloud, this is available as a checkmark under ‘execution settings’.
Test your sources:
Set tests on your sources to check that nothing has gone wrong in your EL process. If your data is normalized, this might look like checking that each entry is unique, or making sure values are not null when they should be populated.
Run your staging models:
Now that you know you’re working with good ingredients, run your first layer of transformations.
Test your staging models:
Before moving on, test your staging models to make sure they worked as expected.
Run your mart models:
Now that you know your first round of transformations has run successfully, run your second layer.
Test your mart models:
Again, test that your transformations are working as you’d like them to.
Now that you know everything has worked every step of the way, generate your documentation for end users to refer to. This is also available as a checkmark in the ‘execution settings’ section of a job, which ensures docs are generated even if your job errors out.
An example of this might be as follows:
```
dbt source snapshot-freshness # or use the check mark
dbt test --models source:*
dbt run --models path:staging
dbt test --models path:staging
dbt run --models path:marts
dbt test --models path:marts
dbt docs generate # or use the check mark
```
Slim CI Jobs
These jobs are all about validating your code as fast as possible. This means the steps used in these jobs differ slightly from full-on production runs, as follows:
- No freshness checks needed (you can – but is fresh data required to test your code?)
- No source tests required (this depends on what you are testing - do you need validation on the code only, or the source as well?)
- dbt run --models state:modified+ (the state:modified+ selector ensures that you build only changed models and their downstream dependents, allowing your code to be validated as fast as possible. See the Slim CI docs for reference)
- dbt test --models state:modified+
- Generating docs is not needed, as this isn’t going to be what your end users refer to
This job might look like this:
```
dbt run --models state:modified+
dbt test --models state:modified+
```
Advanced Ingredients to Also Consider
Another complementary approach is to create multiple environments (dev, test, and prod), whereby commits are merged into a test branch, a dbt job is executed with the test branch code, then once confirmed the commits are merged into the prod branch. If you have a low tolerance for error in your data, you might consider this approach for extra security in your end product.
If you are serving up your data models to a specific dashboard, you might consider also using exposures to define the use of data downstream of dbt. In turn, this will allow you to run, test, and list resources that feed into your exposure.
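As a sketch, an exposure is defined in YAML and points at the models your dashboard depends on. Everything here - the exposure name, URL, models, and owner - is a hypothetical example:

```yaml
# models/exposures.yml -- hypothetical exposure for a revenue dashboard
version: 2

exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi.example.com/dashboards/42   # placeholder URL
    description: Weekly revenue reporting for the finance team
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
    owner:
      name: Data Team
      email: data@example.com
```

You could then run everything feeding this dashboard with dbt run --models +exposure:weekly_revenue_dashboard.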
Dashboard Status Tiles:
You might also consider leveraging the metadata API along with your exposure to display the quality and freshness of your data to end-users. Dashboard status tiles allow you to add a tile to your dashboard inside your BI layer showing whether the freshness and data tests have passed.
Leverage the API:
Our API makes it possible to create dependencies between jobs and other parts of your stack through using third party orchestration tools, such as Airflow. Here’s a great write up that explains one approach for this.
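As a minimal sketch of what an orchestrator-side trigger might look like, the snippet below POSTs a run request to dbt Cloud's v2 API trigger-run endpoint. The account ID, job ID, and token are placeholders you would supply from your own account:

```python
"""Sketch: trigger a dbt Cloud job from an external orchestrator.

Assumes dbt Cloud API v2; account/job IDs and token are placeholders.
"""
import json
import urllib.request

API_BASE = "https://cloud.getdbt.com/api/v2"


def job_run_url(account_id: int, job_id: int) -> str:
    """Build the trigger-run endpoint for a given account and job."""
    return f"{API_BASE}/accounts/{account_id}/jobs/{job_id}/run/"


def trigger_job(account_id: int, job_id: int, token: str,
                cause: str = "Triggered via API") -> dict:
    """POST a run request and return the decoded JSON response."""
    req = urllib.request.Request(
        job_run_url(account_id, job_id),
        data=json.dumps({"cause": cause}).encode(),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

An Airflow task could call trigger_job and then poll the run's status endpoint until it completes, giving you cross-tool dependencies without dbt Cloud's scheduler.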
The Final Product - Make it Your Own
As with any recipe, these are just some guidelines to follow when you first start out with dbt Cloud. As you get comfortable, you might find other approaches work better for you in specific situations. For example, you might find breaking your production jobs in two and doing tests separate from your runs works best for you.
With this in mind, go out and make this lasagna yours!
UPDATE (November 15, 2021): In dbt Core versions 0.21+, some of this syntax should be modified to align with best practices. In particular, in 0.21+:
- dbt source snapshot-freshness should become dbt source freshness
- best practice is to use --select instead of --models, though the --models syntax is still supported in 0.21.x
Altogether, this means the example syntax for production jobs would be:
```
dbt source freshness # or use the check mark
dbt test --select source:*
dbt run --select path:staging
dbt test --select path:staging
dbt run --select path:marts
dbt test --select path:marts
dbt docs generate # or use the check mark
```
and the example syntax for CI jobs would be:
```
dbt run --select state:modified+
dbt test --select state:modified+
```
If you have any questions on this, please refer to this documentation or ask a question below.