The environments we have
We typically think about three environments:
- dev: A dbt user developing on their own computer. Each dbt user has a separate dev environment.
- ci: When we open a Pull Request on a dbt repo, our transformations are run to check for any failures. We have one ci environment for each PR.
- prod: The “source of truth” – our regularly scheduled runs write to this environment.
Typically, we implement environments using schemas within the same database. You can read more about it here. We use dbt Cloud for both our CI and prod runs – Cloud takes care of running against a separate schema for each PR we open, and tears them down for us.
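As an illustration, a schema-per-environment setup can be expressed in profiles.yml. This sketch assumes a Postgres warehouse, and all the names (project, hosts, users, schemas) are hypothetical:

```yaml
# Hypothetical profiles.yml – one database, one schema per environment
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: claire
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_claire   # each dbt user develops in their own schema
      threads: 4
    prod:
      type: postgres
      host: warehouse.internal
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: analytics    # the "source of truth" schema
      threads: 4
```

With a setup like this, `dbt run --target prod` writes to the analytics schema, while a plain `dbt run` builds in the developer's own schema.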
The commands we run in CI and prod
The commands we tend to run in both CI and prod are:
$ dbt deps
$ dbt seed
$ dbt run
$ dbt test
Testing at the end of the run in prod can often feel less than ideal though, because if a test fails, the model has already been created.
Understanding why tests fail
An important thing to consider here is why tests fail. Conceptually, I think we can sort test failures into three categories:
1. The SQL in your model doesn’t do what you intended. For example, you forgot to write a distinct, resulting in duplicate values.
2. An assumption about your source data is wrong. For example, you assumed a key in a source table was unique.
3. A previously-true assumption about your underlying source data is no longer true. For example, your data source started including duplicate IDs, or your ETL process is duplicating rows.
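For instance, a uniqueness assumption like the one above is typically written as a schema test. The model and column names here are hypothetical:

```yaml
# Hypothetical schema.yml – asserts that order_id is unique and not null
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

If duplicate order_ids ever show up, `dbt test` fails – regardless of which of the three scenarios caused them.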
In scenarios 1 and 2, 99% of the time, we catch these failing tests in ci, and the code doesn’t get merged until they are fixed.
However, scenario 3 presents a bit of a challenge – failing tests will happen on a production run, meaning the (potentially unreliable) model has already been created in our production schema.
So how should I handle scenario 3?
Generally, we rely on notifications to tell us that a production test is failing (Cloud has built-in email and Slack notifications), and try to investigate as promptly as possible, communicating to any stakeholders if data in our production analytics environment is unreliable. These kinds of failures happen so infrequently that we’re comfortable with this approach.
If you’re not comfortable with this, it may be more desirable to instead prevent your model from being created when a test fails. We have lots of discussion on this GitHub issue about how we might be able to change dbt to support testing during a
dbt run, with a model only being created if the tests pass (the summary of the comments is: it’s really hard).
In the meantime, some changes in v0.13.0 could help out here! Since scenario 3 arises when an assumption about a data source has changed, you could define these data sources as Sources. Once they are defined as Sources, you can write the tests that represent your assumptions about your data against your sources instead of your models! Then in your production run, implement logic like:
- Test my sources first
- If the tests pass, run all my models
- Test all my models (for good measure, though we expect all the tests to pass)
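Declaring the assumption against the source instead might look like this (the source, table, and column names are hypothetical):

```yaml
# Hypothetical schema.yml – the same assumptions, now tested on the source
version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null
```

Tests defined here are selectable with the `source:*` selector, so they can run on the raw data before any models are built.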
This workflow might look something like:
$ dbt deps
$ dbt seed
$ dbt test --models source:*
$ dbt run
$ dbt test --exclude source:*
Hope that helps! Let me know if anything is unclear, this answer went pretty deep!