Managing hundreds of schemas in the dbt_project.yml

smartinez · February 7, 2022, 9:26pm

Our group has more than 400 source schemas that we use for various data downstream into snowflake. We’re in the process of migrating to dbt and are trying to think through how to structure the project at this scale. One question came up with how to handle maintaining schema-specific definitions in the project config file.

Do I have to make an entry for every schema that we have? Is there a dynamic way of doing this or a way to not bloat the config file?

Curious to see how others have solved this problem.

datachef · February 8, 2022, 4:54am

At our company, we have defined what we call a client_schema table. We also deal with 1000s of schemas(the number is growing exponentially). Every time a new client comes in, we store their data in a specific schema and create a record in that client_schema table containing the client name, a client unique identifier(client_id) and the relevant schema for the client, through the backend. If this is a solution you are interested in, we can discuss how we do it using Snowflake external functions, AWS Lambda and AWS API.

smartinez · February 8, 2022, 4:39pm

Yes! I’d love to hear how you’ve set this up. I haven’t used snowflake external functions yet so this could be a great use case.

plotnikov.sseu · August 18, 2022, 11:04am

How do you consolidate data all the clients data with dbt?

pcreux · August 26, 2022, 10:14am

We also deal with hundreds of schemas but we use BigQuery – so we don’t need to define the schemas in dbt_project.yml.

All schemas contain the same set of tables and columns and they all follow the pattern schema_ORG.

We combine “raw” tables from all “schemas” (aka “organizations”) into one table under the raw dataset. We created a dbt macro to combine them all into one view. Ex:

# models/raw/raw_users.sql
{{
combine_organization_tables(
  'users',
  ['id', 'team_id']
)
}}

The macro:

pulls the list of ORG from public_organizations
pulls the list of columns from one of the ORG. Ex: acme.users
builds a view that UNION ALL the tables from all ORGs:
3.1 we add one extra column org to be able to scope queries down to a specific organization
3.2 we prefix all id columns (here: id and team_id) with the org to prevent conflicts. Ex: acme/123

So for example, raw_users is:

org    | id        | name        | team_id        |
acme   | acme/123  | angela      | acme/300       |
acme   | acme/124  | chris       | acme/400       |
foo    | foo/123   | carl        | foo/350        |

I hope that helps!

pcreux · August 26, 2022, 10:15am

Ha ha! I just came across this post: Unioning identically-structured data sources

It’s a much better write up!

Topic		Replies	Views
Stop schemas from concatenating Help bigquery , custom-schema , dbt-core	3	1056	September 13, 2024
Best practice for dev environment in a project writing to multiple databases Help environments , snowflake	0	1603	January 23, 2023
dbt package for multiple sources Help bigquery , dbt-cloud	2	1677	July 4, 2023
How to structure different databases and schemas with more than one platform? Archive	2	4779	August 10, 2021
Structure Snowflake database, schema In-Depth Discussions	19	14936	June 1, 2021

Managing hundreds of schemas in the dbt_project.yml

Related topics