Advantages of One Monolithic Schema.yml file vs Multiple

As we’ve scaled out more and more with dbt, my team and I have kept questioning how the schema.yml files within each model folder should be structured and how we are using them.

With tests, documentation and design all driven by these files, I’ve mostly wondered whether there is any advantage to keeping the entire model set in one file per directory, as opposed to creating one .yml file for each model, so that you would have dim_date.yml, dim_customer.yml, and so on, each holding the corresponding information.

I know that dbt supports both layouts, but I’m not sure whether there are any hidden gotchas with either approach.

Any thoughts or insight would be beneficial.

At my previous job we migrated from a single file per directory to a single file per model. I think one file per directory is currently considered best practice.

The big pros of one file per model are marginally easier auto-generation of the .yml files, a built-in naming convention (use the model name!), and less Ctrl-F to edit super long .yml files.

The cons are that you can’t use YAML anchors across files, and that you’ll probably want to nest all of your yml files into a sub-folder, instead of interleaving them with the model files.

All in all, it doesn’t make much of a difference.

It is really easy to write a Python script to do this for you. You can read each big file into memory with pyyaml and then dump each model out to its own file. IIRC the annoying part is that it takes some fiddling with a custom Dumper to print out folded scalars (multiline strings) nicely.
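For what it’s worth, here is a rough sketch of that kind of script, not the one we actually used. It assumes each big file has a top-level `models:` list and that PyYAML is installed; the names `split_models`, `write_split_files` and `BlockDumper` are made up for illustration. It only handles the `models:` key (sources, seeds and so on are ignored), and round-tripping through PyYAML drops comments and expands any anchors. The custom Dumper here uses literal block style (`|`) for multiline strings rather than folded style.

```python
from pathlib import Path


def split_models(doc: dict) -> dict[str, dict]:
    """Map '<model name>.yml' to a properties document holding just that model."""
    version = doc.get("version", 2)
    return {
        f"{model['name']}.yml": {"version": version, "models": [model]}
        for model in doc.get("models", [])
    }


def write_split_files(schema_path: Path, out_dir: Path) -> None:
    """Read one big schema file and write one .yml file per model into out_dir."""
    # Imported lazily so split_models stays usable without PyYAML installed.
    import yaml

    class BlockDumper(yaml.SafeDumper):
        """SafeDumper that writes multiline strings as literal blocks."""

    def represent_str(dumper, data):
        # Use '|' block style only when the string spans several lines.
        style = "|" if "\n" in data else None
        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)

    BlockDumper.add_representer(str, represent_str)

    doc = yaml.safe_load(schema_path.read_text())
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, content in split_models(doc).items():
        # sort_keys=False keeps 'name', 'description', 'columns' in their original order.
        text = yaml.dump(content, Dumper=BlockDumper, sort_keys=False)
        (out_dir / filename).write_text(text)
```

You would call it with something like `write_split_files(Path("models/marts/schema.yml"), Path("models/marts/_yml"))` (both paths are hypothetical), check the output, and then delete the original file so dbt doesn’t see each model defined twice.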

I think that if you’re writing a decent number of tests (unless they’re all singular tests, which I guess could happen) plus basic descriptions, and you’re running a lot of models, you’ll pretty naturally hit a point (as I guess the subtext of your question suggests!) where it becomes quite inconvenient to find what you’re looking for in a massive yml file hundreds (or thousands!) of lines long.

Agree with Ted that having one per directory is a nice midpoint - although, as it’s the only yml file in the directory, I guess you can stick with schema.yml as the filename!