We are exploring the implementation of a Data Mesh design within our company. Currently, we’re leaning towards a microservices-style architecture, where each domain has its own dbt core project, each connected to a separate GCP BigQuery project. This setup allows each domain to fully own its git repository and models. We would like to implement a setup similar to Option 4 in this post, aiming to create organization-wide documentation that illustrates the lineage across all projects.
Given that this post is a bit dated, we’re wondering if there have been any new methods or best practices developed for capturing the data lineage across multiple separate dbt projects while reflecting their distinct BigQuery projects?
Even if we were to use the approach of having a single repository that imports all projects as packages (with unique model names), the lineage would inaccurately show all data residing in one BigQuery project rather than spread across the different ones. Is there a better way to import the packages so that the lineage reflects each project's distinct BigQuery project?
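To make the single-repository approach concrete, this is roughly what we mean; the project and repo names below are made up for illustration:

```yaml
# packages.yml of a hypothetical umbrella project that imports every domain project
packages:
  - git: "https://github.com/acme/finance_dbt.git"    # placeholder repo URLs
    revision: main
  - git: "https://github.com/acme/marketing_dbt.git"
    revision: main
```

When we generate docs from this umbrella project, the imported models all compile against the umbrella project's own BigQuery target, which is why the lineage ends up showing everything in a single GCP project.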
That’s an interesting case. How are you generating the lineage exactly, is it the built-in dbt one? In such a setup it’s kind of difficult to even use {{ ref(...) }}.
Anyway, at my company we also keep all the projects in separate locations; we declare the upstream sources in the data contract and then generate the project’s sources.yml from it. In the models we then use the {{ source(...) }} macro to reference models in other projects. I’m not sure whether the built-in dbt lineage will work in this case; we use Dagster and it works there, so I’d guess dbt will handle it as well.
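The generated sources.yml ends up looking roughly like this (the domain, dataset, and project names here are made up):

```yaml
# sources.yml generated from the data contract for one upstream domain
version: 2

sources:
  - name: finance                     # upstream domain (illustrative)
    database: finance-gcp-project     # the upstream team's BigQuery project
    schema: marts                     # BigQuery dataset exposed by that team
    tables:
      - name: fct_invoices
```

In a model you’d then reference it as {{ source('finance', 'fct_invoices') }}. Because `database:` points at the upstream BigQuery project, the source node at least carries the correct GCP project, whichever tool ends up rendering the lineage.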
Yes, currently we are generating lineage for each project using built-in dbt docs.
Nice workflow! Just to clarify: do you maintain a custom data-contract / lineage file where you define which models from one project are consumed by other projects, and based on that file generate the sources.yml files that are uploaded to the respective projects?
I’m not very familiar with Dagster, but I think relying solely on the built-in dbt docs might not fully address this case. While this workflow would help with importing models into other projects, the documentation would still be limited to individual projects, since each project’s docs are generated separately. That means we wouldn’t have a centralized place for all documentation, if I understand the workflow correctly.
Yeah, that’s right: in the data contract we specify upstream dependencies via a database / schema / object reference (not project/model), and the generated sources.yml is then uploaded to the project.
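For illustration, an upstream dependency entry in the contract looks something like this (field names are simplified, not our exact schema):

```yaml
# data-contract excerpt: upstream dependencies declared by db / schema / object
dependencies:
  - database: finance-gcp-project   # BigQuery project of the upstream domain
    schema: marts                   # BigQuery dataset
    object: fct_invoices            # table / view name
```

A small script maps each entry to a source table like in the sources.yml above and commits the file into the downstream project.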
It depends: if the sources file is parsed when building the lineage it might still work, but you’re right, you’ll need a centralized place, e.g. a metadata catalog, if you want to capture the global lineage.
Yes, thank you for the suggestion! We’ve actually started considering DataHub and OpenMetadata. We’re still exploring which one would better suit our needs, though both seem like solid options for our case.
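For reference, the kind of ingestion recipe we’re looking at for DataHub is roughly the sketch below, one recipe per dbt project; the paths and server are placeholders and we haven’t validated any of this yet:

```yaml
# datahub_dbt_finance.yml — hypothetical DataHub ingestion recipe for one domain project
source:
  type: dbt
  config:
    manifest_path: "/artifacts/finance/manifest.json"   # output of `dbt docs generate`
    catalog_path: "/artifacts/finance/catalog.json"
    target_platform: bigquery                            # map dbt nodes to BigQuery tables
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms.internal:8080"           # placeholder GMS endpoint
```

Run with `datahub ingest -c datahub_dbt_finance.yml`. The hope is that, after ingesting every domain’s artifacts, the catalog stitches the source tables referenced in one project to the models that produce them in another, giving us the organization-wide lineage in a single place.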