Structuring Snowflake databases and schemas

Hey! Are there any best practices on Snowflake for how to structure your databases/schemas? In conventional databases, you're not really able to query data across multiple databases. Usually, we've had one schema per data source (raw) and one schema for cleaned data sources (etl).

I was wondering what’s the common approach on Snowflake to structure those datasets?

Using Segment data as an example, I’m thinking of two main ways to approach this:

a)
Database: segment_data
Schema: raw_website_data
Schema: raw_mobile_data

Schema: etl_website_data
Schema: etl_mobile_data

b)
Database: raw_segment_data
Schema: website_data
Schema: mobile_data

Database: etl_segment_data
Schema: website_data
Schema: mobile_data

What are your thoughts?

I do not know what’s “common”, but we decided on a few things when setting ours up:

  1. We use databases to separate user role permissions (raw for loaders to load raw data, dbt_dev for analysts to build development dbt models, and analytics for our production models). A rough sketch of the corresponding grants follows the examples below.

  2. The raw database houses our raw data as ingested by connectors. Connectors use the loader role, which is the only role with write access to this db, and it does not have access to any other db. The schemas in this database are named with a {source}_{connector} convention (e.g. salesforce_stitch). The purpose here is twofold: 1) it makes it easy to identify which connector is being used to ingest the data, and 2) it removes any naming-conflict awkwardness when changing a source connector (e.g. if we moved to Fivetran we would just create salesforce_fivetran):

raw.salesforce_stitch.leads
raw.delighted_stitch.responses
raw.paycom_airflow.employees, etc.

  3. The dbt_dev database is for (surprise) our analysts to develop dbt models. Each analyst has a namespaced default schema (e.g. dbt_dev.dpearce). Analysts use the transformer role, which has read access to all dbs but can only write to the dbt_dev database.

dbt_dev.dpearce
dbt_dev.dpearce_base, etc.

  4. Finally, the analytics database is for our production models. We separate our base tables and transformed models into different schemas (analytics.base and analytics.analytics). Our base models use a naming convention of {source}_{table}. Only our Sinter account can write to this database. Our BI tool (Looker) is read-only and only has access to the analytics.analytics schema. So:

analytics.base.freshdesk_agents
analytics.base.paycom_employees
analytics.base.salesforce_users

analytics.analytics.employees, etc.
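
For reference, here's a rough, illustrative sketch of the grants behind a setup like this (the loader and transformer roles are ours; the looker role name and the exact statements are an approximation, not our actual configuration):

-- loader: the only role that can write to raw, and it sees nothing else
GRANT USAGE, CREATE SCHEMA ON DATABASE raw TO ROLE loader;

-- transformer: read access to raw, write access only to dbt_dev
GRANT USAGE ON DATABASE raw TO ROLE transformer;
GRANT USAGE ON ALL SCHEMAS IN DATABASE raw TO ROLE transformer;
GRANT SELECT ON ALL TABLES IN DATABASE raw TO ROLE transformer;
GRANT SELECT ON FUTURE TABLES IN DATABASE raw TO ROLE transformer;
GRANT USAGE, CREATE SCHEMA ON DATABASE dbt_dev TO ROLE transformer;

-- BI (Looker): read-only, and only on the analytics.analytics schema
GRANT USAGE ON DATABASE analytics TO ROLE looker;
GRANT USAGE ON SCHEMA analytics.analytics TO ROLE looker;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.analytics TO ROLE looker;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.analytics TO ROLE looker;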

3 Likes

After debating the idea in my head and reading further documentation on warehouses (effectively a database), wouldn't it be more manageable to have each data source reflected as its own warehouse?

Let’s say that we have 3 “connections”: Segment, Stitch, Luigi.

We would then have

segment.source_name.table_name
stitch.source_name.table_name
luigi.source_name.table_name

Then we could scale these warehouses up or down individually! Let's imagine that we have an influx of frontend events and need to scale up the Segment database for better processing; it would then only affect that source of data, instead of potentially scaling up for every data source, as in the example of a "raw" database containing everything.

Have you hit any issues or slowdowns by having all the "connector" syncs go into the same database? @dapearce

“Warehouses” and “databases” are not the same thing within Snowflake. A database in Snowflake really just represents a storage partition, while warehouses are compute resources.

You can have multiple warehouses processing data in the same “database” concurrently. For example, you can have a “segment” warehouse writing to the “raw” database at the same time a “stitch” warehouse is writing to the “raw” database. I am not aware of any scaling considerations with “databases” within Snowflake. My understanding is that a database is just an organizational partition on top of the storage to help with stuff like permissions. Your scaling would be handled with warehouses.
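
As a rough illustration (the warehouse names are just examples), each loader can get its own warehouse and be resized independently, while all of them write into the same raw database:

-- compute is per warehouse and independent of the database being written to
CREATE WAREHOUSE segment_wh WAREHOUSE_SIZE = 'XSMALL';
CREATE WAREHOUSE stitch_wh WAREHOUSE_SIZE = 'XSMALL';

-- an influx of frontend events only requires resizing the Segment compute;
-- both warehouses can keep loading into raw concurrently
ALTER WAREHOUSE segment_wh SET WAREHOUSE_SIZE = 'MEDIUM';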

3 Likes

You are totally right, I understood the concept wrong and jumped to conclusions. Thanks for your input, it helped clarify things!

I love this question @cab – thanks so much for posting it to Discourse. I’ve included some advice for folks that aren’t on Snowflake too.

I totally agree with @dapearce’s reply!

We often leverage the raw.<source>.* pattern on Snowflake (often, all our data sources are loaded by the same tool, but I can see the utility in including <connector>!). Note that for warehouses where you can't do cross-database queries (read: Redshift), I recommend using a raw_<source>.* pattern to get a similar result.

In terms of databases/schemas for objects built by dbt, in Snowflake we often use an analytics.<mart>.* pattern. On Redshift, I like to use a pattern like zone_<mart>.* so the schemas get ordered nicely in your warehouse, but that's my own preference!
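
So, with made-up source and mart names, that ends up looking something like:

Snowflake: raw.segment.tracks (source) and analytics.marketing.campaigns (mart)
Redshift: raw_segment.tracks (source) and zone_marketing.campaigns (mart)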

Personally, I'd steer clear of using separate databases for each source (or collection of sources) in Snowflake; it feels like it would add too much complexity to the design without much benefit.

1 Like

Thanks for doing this.

I'm curious about warehouse naming and conventions. Perhaps a new thread is required (please let me know), but if not:
I've set up two warehouses, etl_wh and query_wh, per this article. I've set up all ETL tools + dbt to have permissions on etl_wh, and I'll set up users and BI tools to use query_wh.

Will this scale? How are you thinking about warehouses? There is very little in the public domain about this type of stuff, so it's very useful to hear your thoughts.

@matt we have been incrementally switching over from Redshift and have thus far gotten by with a single small warehouse for everything. However, we are basically at our limit there now and will need to break it out soon. So my first-hand experience is limited so far, but my understanding is that with warehouses you want to consider a few things:

  1. The biggest cost with Snowflake is warehouse usage, which is dependent on a combination of the size of the warehouse and how many minutes it runs.

  2. The bigger the warehouse, the faster things run. So it's very possible for a big query to be cheaper to run on a large warehouse than on a small one: although a large warehouse costs more per minute, the query takes less time to finish. (For example, assuming a Small warehouse bills roughly 2 credits/hour and a Large roughly 8, a query that takes 20 minutes on the Small costs about 0.67 credits, while the same query finishing in 4 minutes on the Large costs about 0.53.)

It's recommended to configure a warehouse to auto-suspend. This will pause the warehouse after a specified period of inactivity (thus stopping the billing).
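
For example (the warehouse name and timeout are placeholders), auto-suspend and auto-resume are set on the warehouse itself:

CREATE WAREHOUSE transforming_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity (stops billing)
  AUTO_RESUME = TRUE         -- wake back up automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;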

For us, because of how Stitch loads data, it keeps a small warehouse running almost 24/7 without using its full capacity. So we have so far been able to get by with our dbt modeling and BI tool on the same warehouse without too much issue (thus minimizing our billing costs). However, we moved over some of our data science jobs and are pretty close to separating things out now.

What's "best" is likely different for a lot of organizations. For example, if you have some data science models that run once a week and use a lot of compute, but infrequently, you may assign them their own XL warehouse, which you only get billed for when they run. If you have a lot of BI users and want to ensure they never compete with ingestion or data science computing, you could give the BI tool its own warehouse.

However, if you give Stitch its own warehouse, and Segment its own warehouse, and Looker its own warehouse, and dbt its own warehouse, and Airflow its own warehouse, and the most any of these ever uses is 20% of the compute capacity, then you've probably over-optimized and are paying more than you need to.

With that in mind, I think warehouse naming conventions are probably dependent on what you are assigning them to. I think what you have makes sense if you just want to make sure BI users aren't impacted by ETL compute, and you can get by with a single warehouse for ETL compute. Warehouses are also pretty easy to create, delete, and reassign, so I think the consequences of warehouse naming conventions aren't the same as with a database.

2 Likes

Perfect, thank you, that makes sense.

I'm not sure I understand the billing well enough, but is it not the case that compute resources are strictly pay-per-use, so that even if each tool had its own warehouse, you would only pay for the use of each warehouse, and not for the fact that they exist? (assuming parity of warehouse size)

@matt, correct, you only pay for the time that you use each warehouse, not that they exist. So you can have as many warehouses as you want, and if they aren’t running you won’t be billed anything.

The consideration comes with when and how much they are being utilized. For example, let’s consider your data loading usage (Stitch) and your dbt usage (Sinter or Airflow).

You can give each of them their own small warehouse to ensure they don’t compete with each other. Or you can put them on the same warehouse. Which to choose?

Let’s say they both are scheduled to run at the top of the hour. Stitch runs for 15 minutes and dbt runs for 10 minutes. If you put them on separate warehouses you will be billed:

stitch_wh: 15 mins X 24 hours = 360 mins/day
sinter_wh: 10 mins X 24 hours = 240 mins/day
Total: 600 mins/day

However, do they need to be on separate warehouses? If they each only use 20% compute, then you could put them on the same warehouse and they would still only use 40% of the compute (meaning you could probably put your BI tool on there too).

In this scenario, they both still run at the top of the hour, and both still take the same amount of time to complete, but you are only running one warehouse, and since they run concurrently you are only billed for the max 15 mins per hour that Stitch uses (Sinter's 10 mins overlap with this), so:

shared_wh: 15 mins * 24 hours = total 360 mins/day (much cheaper)

Of course, if Sinter’s run is scheduled for half-past the hour, because you want to run it AFTER your Stitch load, then they would not run concurrently, and your billed minutes would be the same with one warehouse as it is for two (in this case it’s really no benefit either way, since they wouldn’t compete even on a shared warehouse).

So when scaling, yes, warehouse strategy can become complicated. You may need to consider the number of concurrent queries, the amount of compute needed, the priority of the job or the importance of how fast it’s completed, etc.

If you have a simple setup, my recommendation is to just put it all on a small warehouse. Make sure each piece of the system uses its own Snowflake user; then, once you start to hit performance issues, you can dig into usage per user and consider moving one or more users to separate warehouses (rough usage query below).
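
If it helps, something along these lines (using the account_usage query history, which lags a bit) gives a rough per-user, per-warehouse picture to base that decision on:

-- rough query load per user and warehouse over the last 30 days
SELECT user_name,
       warehouse_name,
       COUNT(*) AS query_count,
       SUM(total_elapsed_time) / 1000 / 60 AS total_query_minutes
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY total_query_minutes DESC;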

2 Likes

@dapearce thank you, that is a great explanation, and certainly new insight for me. I'll stick to a single warehouse for now, as reducing cost is the initial priority.

This is a really useful thread.

Please note that this exists:

Jeremy Cohen @ Fishtown Analytics, How we configure Snowflake

1 Like

For those of you who use separate databases like @dapearce describes, could you post how you set up your profiles.yml and dbt_project.yml to connect to the different "databases"? Or whatever mechanism you use to source data from one database/schema and create artifacts in another database/schema?

@dapearce Late comment, but I was able to reduce my Stitch-related Snowflake spend by about 80% by scheduling my pipelines to be active at the same intervals. Being able to choose run times for your pipelines was a feature Stitch introduced some time last summer.

I like the notion of separate databases for raw, ETL, analytics. However, how do you deal with the orthogonal dimension of dev/UAT/prod?

@crholliday you don't need to do anything in these files to allow dbt to connect to different databases. You just need to make sure the dbt user has permissions to access the different databases, and then you query them by including the database in the table identifier:

SELECT * FROM database1.schema.table
LEFT JOIN database2.schema.table USING (join_key)

If you are talking about ensuring dbt builds to the correct database (developers build to dbt_dev but production builds to analytics), then you just specify that in the database parameter of the profiles.yml. Here’s what our profiles.yml template looks like:

snowflake:
  outputs:
    # This is the dev profile. Configure it by adding
    # your Snowflake username and password, and setting a
    # default dev schema for your dbt models.
    dev:
      type: snowflake
      threads: 1
      account: company.us-east-1
      user:     # snowflake username
      password:  # snowflake password
      role: TRANSFORMER
      database: DBT_DEV
      warehouse: SOLO_WH
      schema:  # {firstinitiallastname} ## Schema your dev models will build to (dpearce)

  target: dev   # Default target is dev unless changed at run time

We use dbt Cloud to build our production models, so our configuration there is to build to our analytics database.

@gordonwong That’s awesome

We use dbt_dev for analysts building dbt models locally, and raw_dev for data engineers developing custom ingesters. We don’t have much UAT other than analysts testing their own stuff. We do use dbt Cloud and it runs our dbt tests as part of a CI process anytime a PR is created or updated (it’s configured to build to a continuous_integration database).

1 Like

David, thanks for sharing your modeling outline. I’m doing some more reading on modeling methods/standards and was curious if your above description has evolved over the last year.

Specifically, I'm curious if you (or anyone) explored an architecture like analytics.base_{source}.{table}, or possibly a three-db approach of raw, staging, and analytics? i.e. something that maintains schemas per source, even for base tables.

I keep thinking that having all base models end up in the analytics.base schema will create lots of clutter; am I missing something, some trade-off or unexpected advantage? It seems the standard is to go with a two-DB (raw & analytics) approach based on [1], as you've outlined, so I'd love to hear comments on why going beyond this is not recommended.

I found a PR [2] that introduces the ability to route models to a specific database, so having the ability to read from raw and then output base models to base.{source}.{table}, while having all analytics-related models go to an analytics DB, might not be too hard to manage.

Would love to hear thoughts on this or any reference to other useful examples. Thanks in advance

[1] https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355

[2] https://github.com/fishtown-analytics/dbt/issues/1183

Hey Kevin, when we designed our architecture it was not possible with dbt to build to multiple databases. I haven’t followed dbt updates closely over the last ~6 months so this is actually news to me that it’s possible now.

At that time, the primary reason we put all base models in a single schema (and likewise all analytic models) was to simplify permissions. If we had analytic models in multiple schemas, and base models in multiple schemas (within the same database), then any time we wanted to add new schemas or change permissions, it would be difficult to manage.

Of course, with the ability to configure dbt to build to multiple databases, permissions can be maintained at the database level. This can certainly provide for an additional level of organization.

Since I have not tried this in practice, I may be overlooking something, but to answer your question no I don’t see any immediate concerns or tradeoffs with this approach. I think it’s particularly appealing in a very large project. With that said, we have ~250 models in our project and have not found our current setup to be cumbersome or difficult to manage. We are very rarely browsing through the list of tables directly in Snowflake, and when we do, our base tables are prefixed with the source, so they are all grouped together and sorted alphabetically.
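
To make the permissions point concrete, with per-database routing the grants can stay coarse. A rough sketch (the base database and reporter role are just illustrative names):

-- grant read access once per database rather than per schema
GRANT USAGE ON DATABASE base TO ROLE reporter;
GRANT USAGE ON ALL SCHEMAS IN DATABASE base TO ROLE reporter;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE base TO ROLE reporter;
GRANT SELECT ON ALL TABLES IN DATABASE base TO ROLE reporter;
GRANT SELECT ON FUTURE TABLES IN DATABASE base TO ROLE reporter;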

Hey Kevin

I've been trying to get started with dbt using the "multiple databases" approach in Snowflake, and have struggled to understand the PR [2] that you referenced. Moreover, Snowflake is about to introduce "Organization accounts" (similar to AWS Organizations), so we'll have the option of using multiple Snowflake accounts going forward.

David previously kindly provided an example of his profiles.yml file. Kevin, could you share what a “multiple databases” profiles.yml should look like, and how to use the multiple database connections in your models?

Beyond that, I wonder if @claire or another dbt team member might share some additional insight on this topic. As you noted in your post, Kevin, the current dbt docs seem to focus on working with multiple schemas in a single database like analytics.

I appreciate any help that folks could offer here. Thanks!
Kyle

Hi Claire, thanks for this post. We're setting up dbt on Synapse, so the non-Snowflake perspective was quite useful. It would be awesome to get your thoughts on the following:

  • Would the raw_ and zone_ schemas be within the same database? If yes, is the rationale to simplify cross-db queries?
  • Would you create a separate db/schema for dbt_dev vs prod or just use the ‘analytics’ db/schema for both?

We're new to dbt and in the process of figuring out the basics, like the best way to structure it along with the underlying data warehouse. Thanks a lot!