We are exploring the implementation of a Data Mesh design within our company. Currently, we’re leaning towards a microservices-style architecture, where each domain has its own dbt core project, each connected to a separate GCP BigQuery project. This setup allows each domain to fully own its git repository and models. We would like to implement a setup similar to Option 4 in this post, aiming to create organization-wide documentation that illustrates the lineage across all projects.
Given that this post is a bit dated, we’re wondering if there have been any new methods or best practices developed for capturing the data lineage across multiple separate dbt projects while reflecting their distinct BigQuery projects?
Even if we were to use the approach of having a single repository that imports all projects as packages (with unique model names), the lineage would inaccurately show all data residing in one BigQuery project rather than spread across the different ones. Is there a better way to import the packages so that the lineage reflects each project's distinct BigQuery project?
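To make the single-repository approach concrete, this is roughly what we mean; the project and repo names below are made up for illustration:

```yaml
# packages.yml of a hypothetical umbrella project that imports every domain project
packages:
  - git: "https://github.com/acme/finance_dbt.git"    # placeholder repo URLs
    revision: main
  - git: "https://github.com/acme/marketing_dbt.git"
    revision: main
```

When we generate docs from this umbrella project, the imported models all compile against the umbrella project's own BigQuery target, which is why the lineage ends up showing everything in a single GCP project.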
That’s an interesting case. How are you generating the lineage exactly, is it the built-in dbt one? In such a setup it’s kind of difficult to even use {{ ref(...) }}.
Anyway, at my company we also keep all the projects in separate locations; we declare the upstream sources in the data contract and then generate the project’s sources.yml from it. In the models we then use the {{ source(...) }} macro to reference models in other projects. I’m not sure whether the built-in dbt lineage will work in this case; we use Dagster and it works there, so I’d guess dbt will handle it as well.
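The generated sources.yml ends up looking roughly like this (the domain, dataset, and project names here are made up):

```yaml
# sources.yml generated from the data contract for one upstream domain
version: 2

sources:
  - name: finance                     # upstream domain (illustrative)
    database: finance-gcp-project     # the upstream team's BigQuery project
    schema: marts                     # BigQuery dataset exposed by that team
    tables:
      - name: fct_invoices
```

In a model you’d then reference it as {{ source('finance', 'fct_invoices') }}. Because `database:` points at the upstream BigQuery project, the source node at least carries the correct GCP project, whichever tool ends up rendering the lineage.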
Yes, currently we are generating lineage for each project using built-in dbt docs.
Nice workflow! Just to clarify: do you maintain a custom data-contract / lineage file where you define which models from one project are consumed by other projects, and based on that file generate the sources.yml files that are uploaded to the respective projects?
I’m not very familiar with Dagster, but I think relying solely on the built-in dbt docs might not fully address this case. While this workflow would help with importing models into other projects, the documentation would still be limited to individual projects, since each project’s docs are generated separately. That means we wouldn’t have a centralized place for all documentation, if I understand the workflow correctly.
Yeah, that’s right: in the data contract we specify upstream dependencies via a database / schema / object reference (not project/model), and the generated sources.yml is then uploaded to the project.
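For illustration, an upstream dependency entry in the contract looks something like this (field names are simplified, not our exact schema):

```yaml
# data-contract excerpt: upstream dependencies declared by db / schema / object
dependencies:
  - database: finance-gcp-project   # BigQuery project of the upstream domain
    schema: marts                   # BigQuery dataset
    object: fct_invoices            # table / view name
```

A small script maps each entry to a source table like in the sources.yml above and commits the file into the downstream project.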
It depends: if the sources file is parsed when building the lineage it might still work, but you’re right, you’ll need a centralized place, e.g. a metadata catalog, if you want to capture the global lineage.
Yes, thank you for the suggestion! We’ve actually started considering DataHub and OpenMetadata. We’re still exploring which one would better suit our needs, though both seem like solid options for our case.
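For reference, the kind of ingestion recipe we’re looking at for DataHub is roughly the sketch below, one recipe per dbt project; the paths and server are placeholders and we haven’t validated any of this yet:

```yaml
# datahub_dbt_finance.yml — hypothetical DataHub ingestion recipe for one domain project
source:
  type: dbt
  config:
    manifest_path: "/artifacts/finance/manifest.json"   # output of `dbt docs generate`
    catalog_path: "/artifacts/finance/catalog.json"
    target_platform: bigquery                            # map dbt nodes to BigQuery tables
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms.internal:8080"           # placeholder GMS endpoint
```

Run with `datahub ingest -c datahub_dbt_finance.yml`. The hope is that, after ingesting every domain’s artifacts, the catalog stitches the source tables referenced in one project to the models that produce them in another, giving us the organization-wide lineage in a single place.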