Should I have an organisation-wide project (a monorepo), or should each workflow have its own?

Thanks for the great question, Liam! I’ve been thinking a lot about this topic recently. Just to be clear, these are definitely my opinions, not yet established Fishtown Facts™.

Is it possible?

Can an organization’s dbt codebase be split across multiple projects/packages/repositories? Absolutely. dbt ships with a first-class understanding of packages, which are really other people’s projects (or dbt’s own) that sit beneath your “main project”—your primary codebase and working directory.

There are many public dbt packages, such as the ones at hub.getdbt.com, but it is equally possible to install private packages, as long as you have access to that package’s codebase. For example, a private package may be hosted in a private repository in your organization’s GitHub account. As long as your personal GitHub account has access to that repository, and your computer has an SSH key configured in your GitHub account, you will be able to pull down a clone of that repo upon running dbt deps. So long as you’ve properly configured a deploy key or the dbt Cloud GitHub app, your production runs will be just fine, too.
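For concreteness, here’s roughly what that looks like in packages.yml. The private repo URL and tag below are hypothetical, but the git/revision syntax is dbt’s standard mechanism for installing git-based packages:

```yaml
# packages.yml
packages:
  # A public package from hub.getdbt.com
  - package: fishtown-analytics/dbt_utils
    version: 0.6.4

  # A private package installed over SSH (hypothetical repo)
  - git: "git@github.com:your-org/internal-dbt-package.git"
    revision: v0.1.0  # pin to a tag, branch, or commit SHA
```

Running dbt deps clones both into your project (dbt_modules/ in older dbt versions, dbt_packages/ in newer ones), after which their models and macros are available to your main project.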

Is it desirable?

We already create, maintain, and install public packages to do a bunch of things:

  • Versatile macros, especially for “How to Write this in SQL” problems (dbt_utils)
  • Common staging models for one data source (facebook)
  • Ad hoc queries you find yourself writing over and over (audit_helper)
  • Industry standards for boilerplate-yet-complex modeling (snowplow)

The chief reason for creating these packages is reusability. In all cases, though, we put in a lot of extra work to make these packages as cross-compatible as possible: wherever we can, there’s handling for multiple data warehouses, SQL dialects, and data-loading vendors. Packages like snowplow also include several configurable variables that change based on an organization’s desired applications.
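To illustrate, those configurations live in the installing project’s dbt_project.yml (in recent dbt versions, under a top-level vars: key). The variable names below are only indicative; the package’s README documents the exact ones it expects:

```yaml
# dbt_project.yml of the project installing the snowplow package
# (variable names are illustrative; consult the package README)
vars:
  snowplow:
    'snowplow:timezone': 'America/New_York'
    'snowplow:events': "{{ source('snowplow', 'event') }}"
```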

By contrast, you’re the only ones who are going to be using your own data transformations. You know who’s loading your data and which warehouse you’re running on. The primary motivations to create private packages are driven by people and workflow.

When I talk about using separate packages, I am talking about storing separate dbt projects in separate git repositories, where one or more of them are ultimately installed as package dependencies of another (“main”) project. I’m not talking about the following cases:

Unrelated projects: Two dbt projects that have no formal relationship. One doesn’t ref() models in the other or share any of its sources. This might make sense if you’re using dbt to sub in for ETL-like operations, such as “loading” tables via a Postgres foreign data wrapper. You want to keep your true analytics work separate from code that is less dbt-onic.

"Local" packages: Multiple dbt projects contained in the same repo. For whatever reason, it makes sense to have a separate project that is an extension or variation of its “parent” project. A good example of this is our integration testing setup in packages with good coverage, such as dbt-utils and snowplow.

So, why multiple repos?

git reasons

  • You want different analysts to have different permissions on different repos.
  • Too many cooks! Your dbt repository has dozens of concurrently open issues/PRs, and analysts’ work is overlapping or mutually blocking.

dbt reasons

  • Each analyst has to navigate hundreds of model files, when their work only touches a few dozen. (Not to mention the slower development cycle of needing to run all those models in their sandbox schema.)
  • Different teams have different code styles, review processes, and chief maintainers. Or, you might want to physically separate legacy cruft from net-new models guided by best practices.

Is it feasible?

In any event, there are some real logistical and organizational challenges to
working across multiple repositories. Separating code does not solve problems
of team cohesion; if anything, it will exacerbate them.

| Motivation | Caveats | Possible approach | How could dbt better support this? |
| --- | --- | --- | --- |
| Different permissions on different repos | Analysts can’t run `dbt deps` on packages they don’t have access to | All analysts get at least read-only access to all repos. Data is permissioned, but transformation logic is still visible to all. | Env-aware deps; env-aware model enabling in `dbt_project.yml` |
| Too noisy, competing changes | Breaking changes are still there, just quieter | Use GitHub releases. Full CI build of all packages (slow). | Version resolution for non-hub packages, or a private hub. Smarter CI?? |
| Too many models | `dbt run` will still run `+my_package` | Analysts always run `dbt run -m my_package` only. Have a “superpackage” that installs all deps for prod runs. | A `dbt run` flag to compile + run only the models defined in the current package? |
| Kicking models out over code style or legacy cruft | Someone’s package needs to be highest up; you can’t `ref` both ways | Pick the best-maintained package as the “main” project. | idk! |
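To make the “superpackage” row above concrete, here’s a minimal sketch: a thin, production-only project whose packages.yml installs each team’s package (the repo names and tags are hypothetical). Analysts develop inside their own repo and select just their package’s models, while the production job runs everything from the superpackage:

```yaml
# prod_superpackage/packages.yml (hypothetical repos and tags)
packages:
  - git: "git@github.com:your-org/dbt-core-models.git"
    revision: v1.2.0
  - git: "git@github.com:your-org/dbt-finance.git"
    revision: v0.4.1
  - git: "git@github.com:your-org/dbt-marketing.git"
    revision: v0.9.0
```

Day to day, a finance analyst would run dbt run -m dbt_finance from their own repo, while the scheduled production job runs a plain dbt run from the superpackage.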

The big theme here is that teams need to have mutual agreements (“contracts”) about what their sources, models, macros, and tests ought to do, and a protocol for when and how they are allowed to introduce breaking changes. With that in mind, here are a couple of scenarios where I think multiple projects make a lot of sense:

  • The SLA between data engineers and data analysts is rooted in dbt sources. It is the data engineers’ responsibility to ensure the successful loading of data into the warehouse (or staging of files in external storage), to register that data as dbt sources, and to guarantee freshness thresholds (a minimal sources spec is sketched after this list). The package of unified sources could then flow into one or more dbt projects, maintained by team(s) of analysts.
  • The forecasting & risk team wants to build on top of core business logic maintained by analysts in finance & operations, but they also wish to keep private their potentially sensitive modeling that projects organizational growth. They “extend” the main dbt project by installing it as a package dependency, pinning to a release tag. Crucially, they also keep an open line of communication with the analysts “upstream” of them, to prevent database errors that can’t be caught by that team’s CI.
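As a sketch of the data-engineering side of that first scenario (source, schema, and column names are hypothetical), the shared package would declare each source with a loaded_at_field and freshness thresholds; dbt source freshness (dbt source snapshot-freshness in older versions) then checks the warehouse against those SLAs:

```yaml
# models/sources.yml in the shared sources package (names hypothetical)
version: 2

sources:
  - name: app_database
    schema: raw_app
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
```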

I’m curious to hear what other folks have to say on the topic! Over the past few months, we’ve been witness to dbt rolling out at larger and larger organizations. The question of how to smartly separate analytics work across larger and more heterogeneous teams is only going to become more important in the months and years to come.

p.s. These thoughts grew out of a presentation and conversation among Fishtown’s consulting team. Major thanks to Erica for her notes from that meeting :slight_smile:
