scalable yaml code generation

nadelman · February 17, 2023, 3:08am

Our dbt project has grown very quickly over the past year as we migrate many legacy data pipelines to dbt. Our data architecture supports many customers, each in a single-tenant database, and the same identically structured canonical data models exist in every customer’s database. One of our challenges has been how best to scale management of the content of our dbt project, especially the many yaml files.

We are turning to greatly expanding our code generation capabilities to support this. For instance, code generation can inspect metadata that indicates which customers should have which data models and pipelines, and then generate the corresponding assets (models, seeds, and their yaml) in the appropriate locations in the dbt project. Generating the yaml in particular is a little daunting, and we are hoping that others in the dbt community can share approaches they have taken. Our current approach is simplistic:

Python code gen builds dictionaries in memory and uses pyyaml’s dump to write them to the corresponding yaml files.
In some cases, we read an existing yaml file with pyyaml, append to it, and write it back out to the same file.
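
To make the two steps above concrete, here is a minimal sketch of the read, append, write-back cycle. The names `merge_models` and `update_schema_file` are made up for illustration, and merging by model name (rather than blindly appending) is an assumption about what we want, not something our generator necessarily does today.

```python
from pathlib import Path


def merge_models(doc, new_models):
    """Return a copy of a schema-file dict with new_models merged in by name."""
    merged = dict(doc or {})  # an empty yaml file loads as None
    merged.setdefault("version", 2)
    by_name = {m["name"]: m for m in merged.get("models") or []}
    for model in new_models:
        # A regenerated model replaces the existing entry with the same name,
        # so re-running the generator does not append duplicates.
        by_name[model["name"]] = model
    merged["models"] = list(by_name.values())
    return merged


def update_schema_file(path, new_models):
    """Read a schema file if it exists, merge in new_models, write it back."""
    import yaml  # PyYAML; imported here so merge_models itself has no dependency

    path = Path(path)
    doc = yaml.safe_load(path.read_text()) if path.exists() else {}
    merged = merge_models(doc, new_models)
    # sort_keys=False keeps "name" ahead of the other keys in the output.
    path.write_text(yaml.safe_dump(merged, sort_keys=False))
    return merged
```

One caveat with this round trip: pyyaml does not preserve comments or original formatting when a file is loaded and dumped again, so hand-edited yaml files lose them.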

Programmatically building the correct models in Python using basic data structures has so far been pretty cumbersome. I’m curious whether anybody out there has worked on similar code-gen support and, if so, whether anybody has gone down the path of building Python classes that map to the components of the dbt yaml schemas (similar to what you might do with an ORM), or is aware of any open source projects that aim to facilitate that kind of thing. I think we might end up doing something like that, but would love not to reinvent something that already exists somewhere else.
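
As a rough idea of what I mean by classes that map to the yaml schema, here is a sketch using standard-library dataclasses. The class names (`Column`, `Model`, `SchemaFile`) and the helper `_prune` are hypothetical, and they cover only a tiny, assumed subset of the model properties (name, description, columns, tests), not the full dbt spec.

```python
from dataclasses import asdict, dataclass, field


def _prune(value):
    """Recursively drop None and empty values so the output stays minimal."""
    if isinstance(value, dict):
        pruned = {k: _prune(v) for k, v in value.items()}
        return {k: v for k, v in pruned.items() if v not in (None, [], {}, "")}
    if isinstance(value, list):
        return [_prune(v) for v in value]
    return value


@dataclass
class Column:
    name: str
    description: str = ""
    tests: list = field(default_factory=list)  # e.g. ["unique", "not_null"]


@dataclass
class Model:
    name: str
    description: str = ""
    columns: list = field(default_factory=list)  # list of Column


@dataclass
class SchemaFile:
    version: int = 2
    models: list = field(default_factory=list)  # list of Model

    def to_dict(self):
        # asdict() recurses into nested dataclasses and lists of them.
        return _prune(asdict(self))


schema = SchemaFile(
    models=[
        Model(
            name="orders",
            columns=[
                Column(name="id", tests=["unique", "not_null"]),
                Column(name="status"),
            ],
        )
    ]
)
schema_dict = schema.to_dict()
```

The resulting plain dictionary can then be handed to pyyaml’s dump exactly as in our current approach, so the classes only replace the hand-built dictionaries, not the output step.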

nadelman · February 27, 2023, 6:01pm

Just to clarify with an example of what I’m hoping exists out there somewhere: an analogous project might be swagger-to (on PyPI), which can be used to generate language-specific bindings (as classes) for a specific schema (in that case, a Swagger spec, which details the schema used for an API). I’m looking for something similar for dbt yaml support.
