I was hoping to get some thoughts on what you would consider best practices for a typical SaaS situation:
- We are a B2B SaaS tool
- We use PostgreSQL as our main operational database
- Each of our customers has their own schema in that database (all schemas have the same structure; we keep them separate to prevent data leaking between customers in case of bugs)
- There are many of these schemas (as in thousands)
- We also have a Redshift cluster with lots of other data from Salesforce, Stripe, etc., and we currently use dbt + dbt Cloud to model that data and prepare it for our BI tools
For the sake of argument, let’s say we have a table “sale” in each of these PostgreSQL operational schemas. For analytics purposes, we’d like to be able to answer questions like how many sales there were on a given day across all of our customers, i.e. across all of the schemas.
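Concretely, if we had some unified “sale” table (I’ll call it analytics.sale here, with made-up column names), the query we’d want to run is something like:

```sql
-- Daily sale counts across every customer schema
select
    date_trunc('day', created_at) as sale_day,
    count(*)                      as sales
from analytics.sale   -- hypothetical unified table we'd need to build somehow
group by 1
order by 1;
```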
How would you set up the data pipeline and model this?
My initial thoughts are:
Option A - we could replicate all of the schemas from Postgres to Redshift using something like FlyData, then use dbt to build a view that unions the “sale” tables across all of those schemas in Redshift. I’m not sure this would even work at that scale, though.
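To make that concrete, this is roughly the kind of dbt model I’m imagining for Option A, assuming the replicated schemas land in Redshift under some predictable naming convention (customer_% here) and that the “sale” columns are id, amount, and created_at (all of that is invented for illustration):

```sql
-- models/unified_sale.sql (hypothetical model name)
-- Discover the replicated customer schemas at compile time.
{% set schema_query %}
    select schema_name
    from information_schema.schemata
    where schema_name like 'customer_%'   -- made-up naming convention
{% endset %}

{% if execute %}
    {% set schemas = run_query(schema_query).columns[0].values() %}
{% else %}
    {% set schemas = [] %}
{% endif %}

-- Union the per-customer "sale" tables into one relation.
{% for s in schemas %}
select
    '{{ s }}'  as customer_schema,
    id,
    amount,
    created_at
from {{ s }}.sale
{% if not loop.last %}union all{% endif %}
{% endfor %}
```

With thousands of schemas that generated SQL gets enormous, which is a big part of why I’m not sure it would work.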
Option B - we use some kind of ETL tool to listen to changes on the “sale” tables in all of the schemas in Postgres and write a “clean” version into a unified “sale” table in Redshift. I’m not sure what tools would be a good fit for this either.
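For Option B, the target I’m picturing in Redshift is a single table that records which schema each row came from, roughly like this (table and column names are again made up):

```sql
-- Hypothetical unified target table that the ETL tool would write into
create table analytics.sale (
    customer_schema varchar(128) not null,  -- which Postgres schema the row came from
    sale_id         bigint       not null,  -- "id" from the source "sale" table
    amount          numeric(12,2),
    created_at      timestamp,
    primary key (customer_schema, sale_id)  -- informational only; Redshift doesn't enforce it
);
```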
Any thoughts and guidance would be appreciated! Thanks.