Our dbt project has grown quickly over the past year as we migrate many legacy data pipelines to dbt. Our architecture supports many customers in single-tenant databases, with the same canonical data models existing, identically structured, in each customer’s database. One of our challenges has been how best to scale management of the dbt project’s content, especially the many yaml files.
We are turning to greatly expanding our code generation capabilities to help with this. For instance, code generation can inspect metadata indicating which customers should have which data models/pipelines, and then generate the corresponding assets (models, seeds, and matching yaml) in the right locations in the dbt project. Generating the yaml in particular is a little daunting, and we are hoping that others in the dbt community can share approaches they have taken. Our current approach is simplistic:
- Python code gen builds dictionaries in memory and uses pyyaml’s dump to write them out to the corresponding yaml files
- In some cases, we read an existing yaml file with pyyaml, append to it, and write it back out to the same file
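To make the two patterns above concrete, here is a minimal sketch, assuming PyYAML is installed; the model/column names and the helper function names are illustrative, not part of our actual codebase:

```python
import yaml


def model_schema(model_name, columns):
    """Build the in-memory dict for one model's schema.yml entry."""
    return {"name": model_name, "columns": [{"name": c} for c in columns]}


def render_schema_yaml(models):
    """Dump a full version-2 schema document to a yaml string."""
    return yaml.safe_dump({"version": 2, "models": models}, sort_keys=False)


def append_model(yaml_text, model_name, columns):
    """Load existing yaml, append a model entry, and re-dump it."""
    doc = yaml.safe_load(yaml_text)
    doc.setdefault("models", []).append(model_schema(model_name, columns))
    return yaml.safe_dump(doc, sort_keys=False)
```

In practice the rendered strings are written to files under the dbt project tree; `sort_keys=False` keeps the keys in the order we built them, which makes the generated files easier to diff.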
Programmatically building the correct structures in python using basic dicts and lists has so far been pretty cumbersome. I’m curious whether anybody out there has worked on similar code-gen support, and if so, whether anybody has gone down the path of building python classes that map to the components of the dbt yaml schemas (similar to what you might do with an ORM), or is aware of any open source projects that aim to facilitate that sort of thing. I’m thinking we may end up doing something like that, but would love not to re-invent something that already exists.
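For what it’s worth, here is the rough shape I have in mind for that ORM-style direction, using stdlib dataclasses. The class and field names are my own assumptions about how to mirror pieces of dbt’s schema.yml, not an existing library’s API:

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Mirrors one entry under a model's `columns:` key."""
    name: str
    description: str = ""
    tests: list = field(default_factory=list)

    def to_dict(self):
        d = {"name": self.name}
        if self.description:
            d["description"] = self.description
        if self.tests:
            d["tests"] = self.tests
        return d


@dataclass
class Model:
    """Mirrors one entry under the top-level `models:` key."""
    name: str
    description: str = ""
    columns: list = field(default_factory=list)

    def to_dict(self):
        d = {"name": self.name}
        if self.description:
            d["description"] = self.description
        if self.columns:
            d["columns"] = [c.to_dict() for c in self.columns]
        return d


@dataclass
class SchemaFile:
    """Mirrors a whole schema.yml document (version 2)."""
    models: list = field(default_factory=list)

    def to_dict(self):
        return {"version": 2, "models": [m.to_dict() for m in self.models]}
```

Code gen would then compose `SchemaFile`/`Model`/`Column` objects instead of raw dicts, and a single `yaml.safe_dump(schema.to_dict(), ...)` call at the end would produce the file, with the classes enforcing structure along the way.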