How to balance the need for controlled development of core models with rapid development of peripheral models


#1

I’m struggling to have two different development ideologies coexist within dbt. On the one hand is my team (BI), which is responsible for developing the core data model for the business. Because those core components have broad impact across the business, their development needs to be pretty regimented.

On the other hand, there is a broad periphery of data models that need to be built and maintained by the business analyst group. The analysts are assigned to specific business units, and given the pace of development in the business overall, they will need/want to iterate more quickly and are more tolerant of flaws in their models.

So how should we structure dbt models to facilitate both ideals? One thought I had was to have multiple repos (one for the core, and one or more for the data marts the BAs will create). But it’s not clear to me how execution would work in that sort of structure; would I include the data mart package(s) in the core package and dbt run from the core project?

Anyone else dealing with similar challenges?


#2

And what about development in the data mart package(s)? They’d need to include the core package so that the analysts could actually run their models to test them. That would be a circular dependency, and in that case dbt deps just spins around and around forever.

It would seem, therefore, that the data mart models should refer to the actual database objects generated by my team’s dbt repo (rather than using {{ref()}}), and that the data mart package(s) should be executed independently. But this seems antithetical to “the dbt way” of doing things.
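For concreteness, here is a rough sketch of what “referring to the actual database objects” could look like in the data mart project, assuming a dbt version that supports sources (a feature in newer dbt releases). The schema and table names are hypothetical placeholders:

```yaml
# models/sources.yml in the data mart project (hypothetical names throughout)
version: 2

sources:
  - name: core                # label the mart models use to refer to core objects
    schema: analytics         # assumed schema that the core project builds into
    tables:
      - name: dim_customers   # assumed core model, already built by the BI team's run
      - name: fct_orders      # assumed core model
```

A mart model would then select from {{ source('core', 'dim_customers') }} rather than {{ ref('dim_customers') }}, so the mart project never has to pull in the core package.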


#3

Ooh! I’ve done similar things to what you’re suggesting before. I like where your head’s at.

When I develop open source packages, I build the “library” (the thing I want to share with other people) and then a “testing project” (the thing I actually run the code from). This is necessary because in the library, I can’t specify things like the profile and many variable names, because those need to be configured by the core project that includes the library. So I build the shared code in the library, write the specific code that allows me to test and run it (profiles, variable names, etc.) in the “testing project”, and run everything from there. I never actually push the testing project to any git repo; it’s just for testing. But I come away with code in a library that I can run pretty seamlessly.

This translates to your situation pretty directly, even if it sounds a bit roundabout. Your analysts would be able to clone your “core” project but not push to it. They would have editor permissions on your “marts” project. They would edit the code in marts, which would get pulled in as a dependency in core, and they’d always run from core (which isn’t a problem; they can get the code, just not edit it).
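As a rough sketch of that setup, assuming a current dbt version where dependencies are declared in a packages.yml file in the core project. The repo URL, branch, and local path below are hypothetical:

```yaml
# packages.yml in the root of the "core" project
packages:
  # Pull the analysts' marts project in as a dependency (hypothetical repo URL)
  - git: "https://github.com/example-org/dbt-marts.git"
    revision: main   # assumed branch name; a tag or commit also works here

  # While developing locally, an analyst could instead point at their own
  # checkout of marts by swapping the entry above for this one (hypothetical path):
  # - local: ../dbt-marts
```

After a dbt deps in core, a dbt run from core builds the core models and the marts models together, and the marts models can use {{ref()}} against core as usual.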

Again, I recognize that this seems roundabout, but it does work quite nicely in practice. Please let me know if there’s some part of this that I can make clearer.


#4

Thanks @tristan, I think I’m getting the gist of what you’re saying. By any chance, is there a (dummy) GitHub repo where one can see this kind of structure?


#5

This isn’t really something that gets checked into a package; it’s more about a local workflow for developing packages. You can see packages that have been developed with this workflow in our git org; check out Stripe or Snowplow or Quickbooks or any of them.


#6

Copy that, thanks.

So in your approach, wouldn’t core get re-run each time?


#7

Assuming you do an entire dbt run, yep! You can of course just choose to run whatever subgraph you like though.
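One way to pin down such a subgraph is a named YAML selector. This is only a sketch, assuming a dbt version that supports selectors.yml and that the analysts’ package is named marts (a hypothetical name):

```yaml
# selectors.yml in the root of the "core" project
selectors:
  - name: marts_only
    description: "Only the models that live in the marts package"
    definition:
      method: package   # select nodes by the package they belong to
      value: marts      # hypothetical package name
```

With that in place, dbt run --selector marts_only would build only the marts models and leave core untouched. Selecting by package name directly on the command line is an alternative, though the exact flag has varied across dbt versions.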