How to configure your dbt repository (one or many)?

At Fishtown Analytics, as more folks adopt dbt, we have started to see more and more use cases that push the boundaries of our established best practices. This is especially true for those adopting dbt in the enterprise space.

After two years of helping companies of 20 to 10,000+ employees implement dbt and dbt Cloud, what follows is my best attempt to answer the question: “Should I have one repository for my dbt project or many?” Alternative title: “To mono-repo or not to mono-repo, that is the question!”

Before we jump into specific structures, I want to start by emphasizing that our guiding principle has always been that simpler is better, especially when you are getting started! It should also be noted that everything presented below builds upon Jeremy’s excellent write-up on this topic from a few years back. That is the prerequisite to this article.

Before we get started, we need to take inventory. Consider the workflow and teams that will be using dbt.

From a workflow perspective, consider:

  • What will the review process look like at your organization?
    • Who can approve pull requests?
    • Who will be able to merge code to production?
  • For more complex environments that have a dev/qa/prod git branching paradigm:
    • Who has access to the objects created in the dev environment? In the qa environment?
    • Who needs to be alerted when code has been released to the qa branch?
    • Who is responsible for promoting objects from dev to qa? From qa to prod?

From a people or team perspective, consider:

  • How do teams using dbt usually work together?

  • Do those teams have different code styles, review processes, and chief maintainers?

  • Do the teams using dbt ever use the same data sources? Is the raw data located somewhere that all teams using dbt will have access to?

  • Is there SQL that one team should have access to but another team should not? Can folks see the SQL behind the object creation?

  • Are there objects that one team is responsible for that other teams are the consumers of?

The answers to these questions should help you navigate the four options detailed below. I also want to make it clear: which of these options fits best will likely be influenced by your data team(s) size, but that should not be the only factor to consider. I have seen a team of 30 folks use option 1 and a team of 10 use option 3. It truly depends on where your priorities lie.

Note: One repository in this context equates to one dbt project with one dbt_project.yml. It does not need to have a 1:1 relationship with a dbt Cloud project.

Option 1: One Repository

This is the most common structure we see for dbt repository configuration. Though the illustration separates models by business unit, all of the SQL files are stored and organized in a single repository.

Strengths

  • Easy to share and maintain the same core business logic
  • Full dependency lineage - your dbt generated DAG encompasses all of your data transformations for your entire company

Weaknesses

  • Too many people! Your repository could have a lot of concurrently open issues/pull requests.
  • Too many models! Your analysts are now wading through hundreds of files when their team only works on one business unit’s models
  • Pull Request approval can be challenging (who has approval for which team? who approves changes to core models used across teams?)

This is our most time-tested option and the one we recommend most. However, we have started to see folks “size out” of this approach. While it’s difficult to pin down exactly when a team has outgrown this model, here are some factors that might push you toward the alternatives:

  • Your project has 500+ models and the time it takes to compile your dbt project hinders your developers’ workflow*
  • Your git workflow is starting to become cumbersome because there are too many hands in the pot in terms of who needs to approve what

*We are making significant efforts to improve this on larger projects but this is something to keep in mind.

Option 2: Separate Team Repository with One Shared Repository

This is one of the first structures we see people move toward when they “outgrow” the mono-repo: there is one “core” repository that is incorporated into team-specific repositories as a package. If you aren’t familiar with packages, see the documentation for more information.

How would the above function? Each team works in its own repository, and shared items go into the shared repository, which is then installed as a package in each team repository. Some common things to put into that shared repository would be:

  • A core dim_customers model that is relevant across the marketing and finance departments.
  • An all_days or calendar model that encodes your specific business logic around your financial-year calendar and company holidays.
  • Macros used across your business units (date conversions, etc.) and seed files that help segment company-wide attributes.
  • Shared sources (sources.yml files plus staging models for those sources)

What doesn’t go into that shared repository?

  • Models specific to the team (things like fct_transactions or fct_ads) would live in the unique team repos.
  • Team specific logic (things like if you have different definitions of what revenue is, etc)
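In practice, wiring this up takes just a few lines of packages.yml in each team repository, followed by dbt deps. Here is a minimal sketch; the organization, repository name, and tag below are hypothetical placeholders:

```yaml
# packages.yml in a team repository
# (the git URL and the 1.2.0 tag are hypothetical placeholders)
packages:
  - git: "https://github.com/example-org/dbt-shared-core.git"
    revision: 1.2.0  # pin a tag so shared-repo changes can't silently break this project
```

After running `dbt deps`, the shared models and macros are available to the team project, and pinning `revision` to a tag is one way to get the semantic-versioning discipline that becomes important as downstream dependencies grow.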

Strengths

  • Easier approval workflows in terms of team-specific models
  • Easier to control user permissions (especially if you have sensitive data or SQL)
  • Fewer people contributing to each repository

Weaknesses

  • Hard to decide what goes into the Shared Repository
  • Maintaining downstream dependencies of macros and models. You will need a CI/CD process that assures changes in the shared repository do not negatively impact the downstream repositories, and you may have to introduce semantic versioning to mitigate miscommunication about breaking changes.
  • Incomplete lineage/documentation for objects not in the shared repository

This is the option I recommend most when one must stray away from Option 1. It follows our dbt viewpoint best in terms of DRY code and collaboration, as opposed to Options 3 & 4.

Option 3: Completely Separate Repositories

Then, there is the “don’t allow any overlap” complete separation of repositories within a single organization.

Strengths

  • Simple approval process
  • Fitting if different teams have separate Snowflake Accounts/Redshift instances

Weaknesses

  • Easy to create duplicate business logic or out of sync business logic between repositories
    • A less-than-ideal workaround: consumers from other teams can subscribe to another team’s releases to stay aware of changes.
  • Non-collaborative approach
  • Incomplete lineage/documentation of company wide data transformations

There is a time and a place where this makes sense, but you start to lose the code reusability that is one of dbt’s biggest strengths! Unless there is a really good security reason and a true separation of analytics needs across the teams, this is the approach we recommend avoiding as much as possible.

Option 4: Separate Team Repositories + One Documentation Repository

This approach is nearly identical to the previous one (completely separate repositories) but solves one of its weaknesses (“incomplete lineage/documentation”) by introducing an additional repository. If you need something akin to Option 3, this is the better approach.

Strengths

  • Creates a project to provide an overview of the entire organization’s dbt projects*
  • Simple maintenance
  • Takes advantage of the strengths from completely separate repositories (see above example)

Weaknesses

  • Creates an extraneous project for administrative oversight
  • Does not prevent conflicting business logic or duplicate macros
  • All models must have unique names across all packages

* The project will include the information from the dbt projects but might be missing information that is pulled from your data warehouse if you are on multiple Snowflake accounts/Redshift instances. This is because dbt is only able to query the information schema from that one connection.
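Concretely, the documentation repository can be a near-empty dbt project whose only real content is a packages.yml that installs every team project. A sketch, with hypothetical repository names:

```yaml
# packages.yml in the documentation-only project
# (the git URLs below are hypothetical placeholders for your team repos)
packages:
  - git: "https://github.com/example-org/dbt-finance.git"
    revision: main
  - git: "https://github.com/example-org/dbt-marketing.git"
    revision: main
```

Running `dbt deps` and then `dbt docs generate` from this project yields documentation and a DAG spanning every installed project, which is also why model names must be unique across all of the packages.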

So… to mono-repo or not to mono-repo?

All of the above configurations “work”. And as detailed, they each solve for a different use case and business priority. At the end of the day, you need to choose what makes sense for your team today and what your team will need 6 months from now. My recommendations are:

  1. Ask the above questions.
  2. Figure out what may be a pain point in the future and try to plan for it from the beginning.
  3. Don’t over-complicate things until you have the right reason. As I said in my Coalesce talk: don’t drag your skeletons from one closet to another. :skull:

Note: Our aim in writing guides like this and How we structure our dbt projects isn’t to convince you that our way is right; it is to hopefully save you the hundreds of hours it has taken us to form these opinions!

Thanks for this very timely article, as we are facing problems with our one-repo approach. Too many people and too many models are causing headaches.
We’ve decided on one repo with packages for common code/models and then, for the moment, a package per subject area. We are still testing this idea, but it’s great to see that we are not the only ones needing to rethink how we do our dbt repos.

@meurant.naude I’m so glad to hear that! You are definitely not alone. This conversation also comes up a lot in dbt Slack, so hopefully we can start to capture more insights on this post.

This is really helpful.

We’re trying to go for something that looks like option 2; however, we’d like to expose models from the shared repo as “sources” for domain repos, rather than allow domain repos to {{ ref() }} models in the shared repo.

My understanding is that if a domain repo references a model from the shared repo, installed as a package, it will attempt to build the model in the domain database. We want to avoid this; we want users to reuse the table materialised by the shared repo. Is my understanding correct?

Should our shared repo be a “shell” with just source yaml file instead?

@serge-bou I’ve seen two approaches in the wild for this.

  1. You could declare the model as a dbt source in your own dbt project and, using Jinja, switch back and forth between ref() and source() depending on the environment, very much like we do here with the target name in the examples.

  2. Alternatively, if you install the shared repo as a package, you could just exclude its models from the run.
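A sketch of the first approach: a small wrapper macro that resolves to ref() when running inside the shared project’s own environment and to source() everywhere else. The macro name, the target name, and the source name are all hypothetical and would need to match your own project setup:

```sql
-- macros/shared_model.sql (macro, target, and source names are hypothetical)
{% macro shared_model(model_name) %}
    {% if target.name == 'shared' %}
        {# inside the shared project's own runs: resolve via ref() #}
        {{ return(ref(model_name)) }}
    {% else %}
        {# inside a domain project: point at the table the shared repo already built #}
        {{ return(source('shared', model_name)) }}
    {% endif %}
{% endmacro %}
```

Domain models would then select from {{ shared_model('dim_customers') }}. For the second approach, something like `dbt run --exclude package:shared_core` (again, a hypothetical package name) keeps the shared package’s models out of the domain runs.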

I do want to push back on the decision to go with option 2. Why did you decide to put shared models in a separate repo rather than going with a mono-repo?

Hi Amy

Thanks for your suggestions.

I ran a quick test. So let’s label these as “shared” repo (contains definition for all curated sources) and “domain” repo.

In the domain repo, I installed “shared” as a package and managed to build a domain model using either source() or ref() to reference a shared table. It compiled correctly, referencing our source database tables and building into the domain database schema. That’s pretty much what we expect, except that avoiding the option to use model ref() would be cleaner/more consistent.

The issue here is that we won’t have as much control over domain repos, since we want to “hand over” control to teams of analysts, so enforcing the use of macros or exclusions is probably going to be difficult. The macro seems better for us; I think we could supply it with repos or as part of our package. Thanks for this, I will explore further.

In terms of motivations, we actually postponed the decision for as long as we could. However, we’re now facing the following issues:

  • repo size and compile time
  • contention in CI/CD builds and deployments

The latter is now a major problem: with daily master builds and deployments, any issue prevents other domain analysts from releasing changes to their models.

There’s also a clear split between “source models” (limited ETL complexity, simple hard rules, deduplication, standard metadata fields and naming conventions, usually pattern-based, etc.) and traditional “analytics models”. Ultimately I suspect the shared repo will become a “shell” with very limited models supplied and an emphasis on source definitions in YAML, tests, etc.

Serge

Hi Amy,
Just an update: we ended up using option 2. We have dbt repos split by subject and a common code repo that is installed as a package. We’ve found that this approach bypasses the CI/CD issues mentioned by Serge, which were our main problem with a mono-repo approach.