Structuring Snowflake databases and schemas


#1

Hey! Are there any best practices for structuring your databases / schemas on Snowflake? In conventional databases, you’re not really able to query data across multiple databases. Usually, we’ve had one schema per data source for raw data (raw) and one schema for cleaned data (etl).

I was wondering what’s the common approach on Snowflake to structure those datasets?

Using Segment data as an example, I’m thinking of two main ways to approach this:

a)
Database: segment_data
Schema: raw_website_data
Schema: raw_mobile_data

Schema: etl_website_data
Schema: etl_mobile_data

b)
Database: raw_segment_data
Schema: website_data
Schema: mobile_data

Database: etl_segment_data
Schema: website_data
Schema: mobile_data

What are your thoughts?


#2

I do not know what’s “common”, but we decided on a few things when setting ours up:

  1. We use databases to separate user role permissions (raw for loaders to load raw data, dbt_dev for analysts to build development dbt models, and analytics for our production models). A rough sketch of the grants is at the end of this post.

  2. The raw database houses our raw data as ingested by connectors. Connectors use the loader role, which is the only role with write access to this db, and it does not have access to any other db. The schemas in this database are named with a {source}_{connector} convention (eg salesforce_stitch). The purpose here is twofold: 1) it makes it easy to identify which connector is being used to ingest the data, and 2) it removes any naming-conflict awkwardness when changing a source connector (eg if moving to Fivetran we would just create salesforce_fivetran):

raw.salesforce_stitch.leads
raw.delighted_stitch.responses
raw.paycom_airflow.employees, etc.

  3. The dbt_dev database is for (surprise) our analysts to develop dbt models. Each analyst has a namespaced default schema (eg dbt_dev.dpearce). Analysts use the transformer role, which has read access to all dbs but can only write to the dbt_dev database.

dbt_dev.dpearce
dbt_dev.dpearce_base, etc.

  4. Finally, the analytics database is for our production models. We separate our base tables and transformed models into different schemas (analytics.base and analytics.analytics). Our base models use a naming convention of {source}_{table}. Only our Sinter account can write to this database. Our BI tool (Looker) is read-only and only has access to the analytics.analytics schema. So:

analytics.base.freshdesk_agents
analytics.base.paycom_employees
analytics.base.salesforce_users

analytics.analytics.employees, etc.
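
A rough sketch of those grants in Snowflake SQL (the exact privileges here are an approximation of the setup described above, not a literal copy of ours):

create database raw;
create database dbt_dev;
create database analytics;

create role loader;       -- connectors (Stitch, Airflow) ingest raw data
create role transformer;  -- analysts / dbt develop models

-- loader can create schemas and write only in raw
grant usage, create schema on database raw to role loader;

-- transformer can read raw but only write to dbt_dev
grant usage on database raw to role transformer;
grant usage on all schemas in database raw to role transformer;
grant select on all tables in database raw to role transformer;
grant select on future tables in database raw to role transformer;
grant usage, create schema on database dbt_dev to role transformer;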


#3

After debating the idea in my head and reading further documentation on warehouses (effectively databases), wouldn’t it be more manageable to have each data source reflected as its own warehouse?

Let’s say that we have 3 “connections”: Segment, Stitch, Luigi.

We would then have

segment.source_name.table_name
stitch.source_name.table_name
luigi.source_name.table_name

Then we could scale these warehouses up or down individually! Let’s imagine we have an influx of frontend events and need to scale up the Segment database for better processing; it would only affect that source of data, instead of potentially scaling up for every data source, as in the example of a single “raw” database containing everything.

Have you hit issues or slowdowns from having all the “connector” syncs going into the same database? @dapearce


#4

“Warehouses” and “databases” are not the same thing within Snowflake. A database in Snowflake really just represents a storage partition, while warehouses are compute resources.

You can have multiple warehouses processing data in the same “database” concurrently. For example, you can have a “segment” warehouse writing to the “raw” database at the same time a “stitch” warehouse is writing to the “raw” database. I am not aware of any scaling considerations with “databases” within Snowflake. My understanding is that a database is just an organizational partition on top of the storage to help with stuff like permissions. Your scaling would be handled with warehouses.
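
A minimal illustration (the warehouse names and the segment table below are made up; raw.salesforce_stitch.leads comes from the earlier post): compute is created separately from storage, and either warehouse can run statements against the same database.

create warehouse segment_wh with warehouse_size = 'XSMALL' auto_suspend = 60 auto_resume = true;
create warehouse stitch_wh  with warehouse_size = 'XSMALL' auto_suspend = 60 auto_resume = true;

use warehouse segment_wh;
select count(*) from raw.segment_website.pages;   -- hypothetical table

use warehouse stitch_wh;
select count(*) from raw.salesforce_stitch.leads; -- same database, different compute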


#5

You are totally right, I misunderstood the concept and jumped to the wrong conclusions. Thanks for your input, it helped clarify things!


#6

I love this question @cab – thanks so much for posting it to Discourse. I’ve included some advice for folks that aren’t on Snowflake too.

I totally agree with @dapearce’s reply!

We often leverage the raw.<source>.* pattern on Snowflake (often, all our data sources are loaded by the same tool, but I can see the utility in including <connector>!). Note that for warehouses where you can’t do cross-database queries (read: Redshift), I recommend using a raw_<source>.* pattern to get a similar result.

In terms of databases/schemas for objects built by dbt, in Snowflake we often use an analytics.<mart>.* pattern. On Redshift, I like to use a pattern like zone_<mart>.* so the schemas get ordered nicely in your warehouse, but that’s my own preference!
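
To make the two patterns concrete (the table and mart names below are just placeholders):

-- Snowflake: cross-database queries work, so raw data and dbt output can live in separate databases
select * from raw.segment.pages;
select * from analytics.marketing.fct_sessions;

-- Redshift: a single database, so the same split is encoded in schema prefixes
select * from raw_segment.pages;
select * from zone_marketing.fct_sessions;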

Personally, I’d steer clear of using separate databases for each source (or collection of sources) in Snowflake; it feels like it would add too much complexity to the design without much benefit.


#7

Thanks for doing this.

I’m curious about warehouse naming and conventions. Perhaps a new thread is required (please let me know), but if not:
I’ve set up two warehouses, etl_wh and query_wh, per this article. I’ve set up all ETL tools + dbt to have permissions on etl_wh, and I’ll set up users and BI tools to use query_wh.
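
For reference, this is roughly what that setup looks like in SQL (etl_role and reporting_role are just placeholders for however the tools and users are grouped):

create warehouse etl_wh;
create warehouse query_wh;

grant usage on warehouse etl_wh   to role etl_role;        -- ETL tools + dbt
grant usage on warehouse query_wh to role reporting_role;  -- users + BI tools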

Will this scale? How are you thinking about warehouses? There is very little in the public domain about this type of stuff, so it’s very useful to hear your thoughts.


#8

@matt we have been incrementally switching over from Redshift and have thus far gotten by with a single small warehouse for everything. However, we are basically at our limit there now and will need to break it out soon. So my first-hand experience is limited so far, but my understanding is that you want to consider a few things:

  1. The biggest cost with Snowflake is warehouse usage, which is dependent on a combination of the size of the warehouse and how many minutes it runs.

  2. The bigger the warehouse, the faster things run. So it’s very possible a big query can be cheaper to run on a large warehouse than on a small one, because although a large warehouse costs more per minute, the query takes less time to complete.

It’s recommended to configure a warehouse to auto-suspend. This will pause the warehouse after a specified period of inactivity (and thus stop the billing).
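
Auto-suspend is just a property on the warehouse; for example (60 seconds is an illustrative value, and etl_wh is the warehouse from your setup):

alter warehouse etl_wh set
  auto_suspend = 60     -- suspend after 60 seconds of inactivity
  auto_resume  = true;  -- resume automatically when the next query arrives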

For us, because of how Stitch loads data, it is running a small warehouse almost 24/7 but not using the full capacity of the warehouse. So far we have been able to get by with our dbt modeling and BI tool on the same warehouse without too much issue (thus minimizing our billing costs). However, we moved over some of our data science jobs and are pretty close to separating things out now.

What’s “best” is likely different for a lot of organizations. For example, if you have some data science models that use a lot of compute but only run once a week, you may assign them their own XL warehouse, which you only get billed for when they run. If you have a lot of BI users and want to ensure they never compete with ingestion or data science computing, you could give the BI tool its own warehouse.

However, if you give Stitch its own warehouse, and Segment its own warehouse, and Looker its own warehouse, and dbt its own warehouse, and Airflow its own warehouse, and the most any of these ever uses is 20% of the compute capacity, then you’ve probably over-optimized and are paying more than you need to.

With that in mind, I think warehouse naming conventions are probably dependent on what you are assigning them to. I think what you have makes sense if you just want to make sure BI users aren’t impacted by ETL compute and you can get by with a single warehouse for ETL compute. Warehouses are also pretty easy to create, delete, and reassign, so I think the consequences of warehouse naming conventions aren’t the same as with a database.


#9

Perfect, thank you, that makes sense.

I’m not sure I understand the billing well enough, but is it not the case that compute resources are strictly pay-per-use, so that even if each tool had its own warehouse, you would only pay for the use of each warehouse, and not for the fact that they exist (assuming parity of warehouse size)?


#10

@matt, correct, you only pay for the time that you use each warehouse, not that they exist. So you can have as many warehouses as you want, and if they aren’t running you won’t be billed anything.

The consideration comes with when and how much they are being utilized. For example, let’s consider your data loading usage (Stitch) and your dbt usage (Sinter or Airflow).

You can give each of them their own small warehouse to ensure they don’t compete with each other. Or you can put them on the same warehouse. Which to choose?

Let’s say they both are scheduled to run at the top of the hour. Stitch runs for 15 minutes and dbt runs for 10 minutes. If you put them on separate warehouses you will be billed:

stitch_wh: 15 mins × 24 hours = 360 mins/day
sinter_wh: 10 mins × 24 hours = 240 mins/day
Total: 600 mins/day

However, do they need to be on separate warehouses? If they each only use 20% compute, then you could put them on the same warehouse and they would still only use 40% of the compute (meaning you could probably put your BI tool on there too).

In this scenario, they both still run at the top of the hour, and both still take the same amount of time to complete, but you are only running one warehouse, and since they run concurrently you are only billed for the max 15 mins per hour that Stitch uses (Sinter’s 10 mins overlap with this), so:

shared_wh: 15 mins × 24 hours = 360 mins/day total (much cheaper)

Of course, if Sinter’s run is scheduled for half-past the hour, because you want to run it AFTER your Stitch load, then they would not run concurrently, and your billed minutes would be the same with one warehouse as with two (in this case there’s really no benefit either way, since they wouldn’t compete even on a shared warehouse).
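
Either way, you can check the actual billed usage per warehouse against Snowflake’s account usage views to see which situation you’re in:

select warehouse_name,
       date_trunc('day', start_time) as usage_date,
       sum(credits_used) as credits
from snowflake.account_usage.warehouse_metering_history
where start_time > dateadd('day', -30, current_timestamp())
group by 1, 2
order by 1, 2;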

So when scaling, yes, warehouse strategy can become complicated. You may need to consider the number of concurrent queries, the amount of compute needed, the priority of the job or the importance of how fast it’s completed, etc.

If you have a simple setup, my recommendation is to just put it all on a small warehouse. Make sure each piece of the system uses its own Snowflake user; then, once you start to hit performance issues, you can dig into usage per user and consider moving one or more users to separate warehouses.
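
A rough starting point for that kind of digging (this queries Snowflake’s standard account usage view, grouped by user and warehouse):

select user_name,
       warehouse_name,
       count(*) as query_count,
       sum(total_elapsed_time) / 1000 / 60 as total_minutes
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
group by 1, 2
order by total_minutes desc;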


#11

@dapearce thank you, that is a great explanation, and certainly new insight for me. I’ll stick to a single warehouse for now, as reducing cost is the initial priority.