What are the dbt commands you run in your production deployment of dbt?

Hi community!

I’d like to know how different companies handle their development and production runs of dbt. In particular:

  • What are the execution steps that you run in production? Do you have different “layers” that you run and then test? For example, based on @claire’s recent post on How Fishtown Analytics structure their dbt projects, the sequence seems to be:

    1. Run ‘staging’ models
    2. Test ‘staging’ models (same source file for raw tables + staging tables)
    3. Run ‘marts’ models
    4. Test ‘marts’ models (same schema file for intermediate tables + final tables)
  • What is the syntax for these different steps? Do you do something specific? (I’ve sketched my guess below.)

  • Do you use more than one database/schema during this process or just the same database/schema for the whole process?
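
For what it’s worth, in command form I imagine that sequence looking roughly like this (selector syntax is my guess, assuming folder-based selection):

$ dbt run --models staging
$ dbt test --models staging
$ dbt run --models marts
$ dbt test --models marts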

Thanks!

4 Likes

The environments we have

We typically think about three environments:

  • dev: A dbt user developing on their own computer. Each dbt user has a separate dev environment
  • ci: When we open up a Pull Request on a dbt repo, our transformations are run to check for any failures. We have one ci environment for each PR.
  • prod: The “source of truth” – our regularly scheduled runs write to this environment

Typically, we implement environments using schemas within the same database. You can read more about it here. We use dbt Cloud for both our ci and prod runs – Cloud takes care of running against a separate schema for each PR we open, and tears them down for us.
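
As a concrete sketch, a profiles.yml along these lines implements this (the adapter, hosts, and credentials are placeholders – swap in your own warehouse’s settings):

# profiles.yml (sketch)
my_project:
  target: dev
  outputs:
    dev:
      type: postgres                 # placeholder: use your warehouse's adapter
      host: localhost
      user: alice
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics              # same database across environments
      schema: dbt_alice              # one schema per developer
      threads: 4
    prod:
      type: postgres
      host: warehouse.internal       # placeholder host
      user: dbt_prod
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: analytics              # the "source of truth" schema
      threads: 8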

The commands we run in CI and prod

The commands we tend to run in both CI and prod are:

$ dbt deps
$ dbt seed
$ dbt run
$ dbt test

Testing at the end of the run in prod can feel less than ideal, though: if a test fails, the model has already been created.

Understanding why tests fail

An important thing to consider here is why tests fail. Conceptually, I think we can sort test failures into three categories:

  1. The SQL in your model doesn’t do what you intended. For example, you forgot to write a distinct, resulting in duplicate values.
  2. An assumption about your source data is wrong. For example, you assumed a key in a source table was unique.
  3. A previously-true assumption about your underlying source data is no longer true. For example, your data source started including duplicate IDs, or your ETL process is duplicating rows.
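
For context, the assumptions in categories 2 and 3 are the kind you’d typically express as schema tests – a minimal sketch, with hypothetical model and column names:

# models/schema.yml (sketch)
version: 2
models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null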

In scenarios 1 and 2, 99% of the time, we catch these failing tests in dev or ci, and the code doesn’t get merged until they are fixed.

However, scenario 3 presents a bit of a challenge – failing tests will happen on a production run, meaning the (potentially unreliable) model has already been created in our production schema.

So how should I handle scenario 3?

Generally, we rely on notifications to tell us that a production test is failing (Cloud has built-in email and Slack notifications), and we try to investigate as promptly as possible, communicating to any stakeholders if data in our production analytics environment is unreliable. These kinds of failures happen so infrequently that we’re comfortable with this approach.

If you’re not comfortable with this, it may be more desirable to prevent your model from being created when a test fails. There’s lots of discussion on this GitHub issue about how we might change dbt to support testing during a dbt run, with a model only being created if its tests pass (the summary of the comments: it’s really hard).

In the meantime, some changes in v0.13.0 could help out here! Since scenario 3 arises when an assumption about a data source has changed, you could define these data sources as Sources. Once they are defined as Sources, you can write the tests that represent your assumptions about the data against your sources instead of your models (see the sketch below)! Then, in your production run, implement logic like:

  1. Test my sources first
  2. If the tests pass, run all my models
  3. Test all my models (for good measure, though we expect all the tests to pass)
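
As a sketch, a source definition with tests attached might look like this (source, table, and column names are hypothetical):

# models/sources.yml (sketch)
version: 2
sources:
  - name: app_database
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null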

This workflow might look something like:

$ dbt deps
$ dbt seed
$ dbt test --models source:*
$ dbt run
$ dbt test --exclude source:*

Hope that helps! Let me know if anything is unclear – this answer went pretty deep!

11 Likes

Great write up!

In one of my projects, which was developed before sources became available, I run dbt test --models myfolder.base before dbt run --models myfolder+. Then, I test everything except the base models that were tested initially.
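
In command form, that sequence is roughly (using --exclude for the final step, which is how I’d express “everything except the base models”):

$ dbt test --models myfolder.base
$ dbt run --models myfolder+
$ dbt test --exclude myfolder.base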

If testing sources is not enough, I imagine one could use tags to flag models that need to be tested prior to the project being built.
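
For example, something like this might work (tag name hypothetical; models would opt in via {{ config(tags=['pre-run']) }}):

$ dbt test --models tag:pre-run
$ dbt run
$ dbt test --exclude tag:pre-run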

One extra note! We typically run any snapshot (formerly archive) commands separately from our prod runs! This is because we don’t want to tie the timing of these commands to the runs of our project, so we separate them out.

Further, we also run snapshot-freshness commands as separate jobs for the same reason :slight_smile:
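
Concretely, those separate jobs just run (command names as of recent dbt versions):

$ dbt snapshot
$ dbt source snapshot-freshness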

Thanks for your explanations, @claire. Does this mean that for your ci runs you always run all transformations with all available data from the sources? As far as I understand the config in dbt Cloud, I can only specify per job whether it should be run on opening a pull request. However, I have no way to specify the target of the run (e.g. to apply a date filter). Do I understand this correctly?

To limit the amount of data processed in the CI build, there are a few ideas in this dbt issue.

We ended up using a modified version of option #3. In our case, we find it beneficial to process the entire data set for CI purposes. Since we don’t have a formal ‘stage’ environment, we use the PR schema that’s created to do any manual data testing to compare the current production data vs the new ‘staged’ data in the PR.

It is sort of wasteful and slow, but right now, in the early stages of building out the full warehouse, it is worth it. Later we may switch to a more limited approach, which might leverage the dbt audit-helper package to look at a limited time range.

The engineering lifecycle Dev -> PR -> Stage -> Prod is important to us and we can get pretty close to the dream of full CI/CD for data engineering. Some day…

Yep, @drew_beaupre is on the money here!

@christian – it’s worth noting that the writeup on that issue doesn’t cover the currently most popular solution, which is to write some Jinja directly in your models! Check out the best practice docs on limiting the data processed in dev – you can definitely extend this pattern to cover ci as well (or use target.name != 'prod'); just be sure to name your targets well!
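
The pattern looks roughly like this (source, table, and column names hypothetical; the date function here assumes Snowflake – adjust for your warehouse):

select * from {{ source('app_database', 'events') }}
{% if target.name != 'prod' %}
where created_at > dateadd('day', -3, current_date)
{% endif %}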

If you need some help with naming your targets in dbt Cloud, you can reach out on Intercom (i.e. the chat bubble) on Cloud; we’re best able to help you through that interface.

Thanks for sharing your experiences on these @drew_beaupre!

@claire Thanks also for the addition. I believe that when run through dbt Cloud, the target will remain the same for both runs from master and runs from PRs about to be merged – so in both cases it will be production. However, I was wondering whether target.schema would hold the new target schema (for CI), and whether I could use that for filtering?

Oh! We often run our PRs against a separate “job” in dbt Cloud – this way we can control things like the target name. It also helps in a few other cases, like getting clearer notifications when failures come from PR runs versus prod runs. Here are the jobs from one of my dbt projects!

It is a little bit of extra work to maintain, but I recommend it as a pattern.

1 Like

Sounds interesting. Would you mind sharing what exactly it means to run your PRs “against a separate job in dbt Cloud”? What does your configuration look like? Is this job also pointing to the master branch (like your Production Run), and have you activated the CI integration, meaning this job usually is neither scheduled nor run manually but only kicks off when you create the PR?

That’s exactly right! The CI job is also connected to the “master” branch, but because it is only triggered by a PR in GitHub, it only builds off the open branch we’re working on.

If you need any help setting this up in Cloud, reach out on Intercom – we try to keep this forum for dbt Core discussion!

1 Like

Apologies in advance if I say something silly, but I am 100% new to dbt and 90% new to Snowflake…

For those of us using Snowflake, it might help to use the database “clone” capability to do a “100% like PROD” final deployment before actually deploying to prod?

Looks like an easy thing to do (and cheap)
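
For reference, a zero-copy clone in Snowflake is a one-liner (database names hypothetical):

create database analytics_clone clone analytics;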

Cheers

Eric

1 Like

Hi @claire,

I have been investigating dbt and it looks awesome :-). One thing that isn’t clear to me is how you build SCD2 dimensions.

  • The best practice is to only snapshot sources
  • You seem to run snapshots at different times than your dimensions/facts

So, my question is: how do you build SCD2 dimensions? Do you build them on top of snapshots? Or do you not use SCD2 dimensions at all, with everything in your dimensions being ‘current’?