Managing hundreds of schemas in the dbt_project.yml

Our group has more than 400 source schemas that we use for various data downstream into snowflake. We’re in the process of migrating to dbt and are trying to think through how to structure the project at this scale. One question came up with how to handle maintaining schema-specific definitions in the project config file.

Do I have to make an entry for every schema that we have? Is there a dynamic way of doing this or a way to not bloat the config file?

Curious to see how others have solved this problem.

1 Like

At our company, we have defined what we call a client_schema table. We also deal with 1000s of schemas(the number is growing exponentially). Every time a new client comes in, we store their data in a specific schema and create a record in that client_schema table containing the client name, a client unique identifier(client_id) and the relevant schema for the client, through the backend. If this is a solution you are interested in, we can discuss how we do it using Snowflake external functions, AWS Lambda and AWS API.

Yes! I’d love to hear how you’ve set this up. I haven’t used snowflake external functions yet so this could be a great use case.

How do you consolidate data all the clients data with dbt?

We also deal with hundreds of schemas but we use BigQuery – so we don’t need to define the schemas in dbt_project.yml. :thinking:

All schemas contain the same set of tables and columns and they all follow the pattern schema_ORG.

We combine “raw” tables from all “schemas” (aka “organizations”) into one table under the raw dataset. We created a dbt macro to combine them all into one view. Ex:

# models/raw/raw_users.sql
  ['id', 'team_id']

The macro:

  1. pulls the list of ORG from public_organizations
  2. pulls the list of columns from one of the ORG. Ex: acme.users
  3. builds a view that UNION ALL the tables from all ORGs:
    3.1 we add one extra column org to be able to scope queries down to a specific organization
    3.2 we prefix all id columns (here: id and team_id) with the org to prevent conflicts. Ex: acme/123

So for example, raw_users is:

org    | id        | name        | team_id        |
acme   | acme/123  | angela      | acme/300       |
acme   | acme/124  | chris       | acme/400       |
foo    | foo/123   | carl        | foo/350        |

I hope that helps!

Ha ha! I just came across this post: Unioning identically-structured data sources

It’s a much better write up! :slight_smile: