SaaS multi-schema best practices

I was hoping to get some thoughts on what you would consider best practices for a typical SaaS situation:

  • We are a B2B SaaS tool
  • We use PostgreSQL as our main operational database
  • Each one of our customers has their own schema in our database (all have the same data structure - we just have separate schemas to prevent leaks in case of bugs)
  • There are many of these schemas (as in thousands)
  • We also have a Redshift cluster with lots of other data from Salesforce, Stripe, etc., and currently use dbt + dbt Cloud to model that data and prepare it for our BI tools

For the sake of argument, let’s say we have a table “sale” in each of these PostgreSQL operational schemas. For analytics purposes, we’d like to be able to answer questions like how many sales there were on a given day, across all of our customers, i.e. all of the schemas.

How would you set up the data pipeline and model this?

My initial thoughts are:

Option A - we could replicate all of the schemas from Postgres to Redshift using something like FlyData, then use dbt to create a view across the “sale” tables in all of these schemas in Redshift. Not sure this would even work.

Option B - we use some kind of ETL tool to listen to changes on all the “sale” tables in all the schemas in Postgres, then write a “clean” version into a unified “sale” table in Redshift. Not sure what tools would be good for this either.

Any thoughts and guidance would be appreciated! Thanks.


I’ve got a similar (although not exactly the same) situation in both MSSQL and MongoDB (2 different SaaS products). In MSSQL, we’ve got 1 database per “State” and in Mongo, we’ve got 1 db per customer.

Right now, for MSSQL, I’m using Stitch to replicate each table from each source db into its own schema (a BigQuery dataset in this case). I then use dbt to loop through a list of these schemas to create a monster view across all the source tables. So if we’ve got 12 databases (again, one for each state), I end up with one PurchaseOrders view that essentially unions all the individual db_{{statecode}}.PurchaseOrders tables together. Then I can easily query across all of them as one - or, if performance is an issue, I materialize it as a table. With 1000+ of these, this seems like it might hit a wall, but perhaps not?
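If you want to avoid hand-maintaining the schema list, a rough sketch of that kind of model (not my exact code; the db_% pattern and the table name are illustrative) using the get_relations_by_pattern and union_relations macros from dbt_utils would be:

-- models/purchase_orders.sql (sketch only)
-- Find every PurchaseOrders table across the db_<statecode> schemas and union them
{% set po_relations = dbt_utils.get_relations_by_pattern(
    schema_pattern='db_%',
    table_pattern='PurchaseOrders'
) %}

{{ dbt_utils.union_relations(relations=po_relations) }}

union_relations also adds a _dbt_source_relation column by default, which makes it easy to tell which state each row came from.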

For Mongo, where my numbers are headed in your direction (we’ve got 100+ dbs today), I’m planning on rolling my own Singer tap configs and just adding the tenant/customer ID as a new column for every collection.

Would definitely be interested in others’ solutions as well.

Hey @daniellang – welcome :wave:

A little late to the party here, but hopefully you’ll let me join.

My preferred approach looks like @warpraptor’s! We use Stitch to replicate each database into a separate schema, then, in dbt, union them all together.

Here are some tips:
1. We define our raw data as sources.
A neat trick here is to use YAML anchors to reduce code duplication:

version: 2

sources:
  - name: shopify_us
    tables: &shopify_tables
      - name: products
        columns:
          - name: id
            tests:
              - unique
              - not_null
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null

  - name: shopify_uk
    tables: *shopify_tables

  - name: shopify_au
    tables: *shopify_tables

2. Use Jinja in dbt to union the tables together
For our first version, we’ll often just union the tables together manually (using a little Jinja to help us out):

{%- set countries = ['us', 'uk', 'au'] -%}
with products as (
{% for country in countries %}
  select * from {{ source('shopify_' ~ country, 'products') }}
  {% if not loop.last %} union all {% endif %}
{% endfor %}
),
...

Then, inevitably, there will be some table that has its columns in a different order. So we’ll use the union_relations macro from dbt_utils to help us out:

{%- set countries = ['us', 'uk', 'au'] -%}

{%- set shopify_product_relations=[] -%}

{% for country in countries %}
    {% do shopify_product_relations.append(source('shopify_' ~ country, 'products')) %}
{% endfor %}

with products as (
    {{ dbt_utils.union_relations(shopify_product_relations) }}
),
...

As the number of tables we need to union grows, we end up repeating the first 6 lines of this across multiple models. So we might then abstract it into a macro. Here’s an example:

{% macro union_shopify_sources(table_name) %}

    {%- set countries=['us', 'uk', 'au'] -%}

    {%- set shopify_relations=[] -%}

    {% for country in countries %}
        {% do shopify_relations.append(source('shopify_' ~ country, table_name)) %}
    {% endfor %}


    {{ dbt_utils.union_relations(shopify_relations) }}

{% endmacro %}

Then our downstream model looks like:

with products as (
    {{ union_shopify_sources('products') }}
),
...

Even though we’ve reduced the repetition in our code, there’s a tradeoff: we’ve obfuscated this code a little – it’s harder to understand exactly what’s happening without going to another code location.

3. Prefix unique IDs with country codes
My last tip for this kind of work is that I generally create a new unique ID for the unioned data by prefixing the old ID with a country code, e.g. US-187, AU-87546. I find that this is the most user-friendly version of the unique ID, and it allows me to write clean joins downstream (since I also do this for foreign keys). In one project, I even went so far as to macro-fy this logic in my union_shopify_sources, so that any field named id or ending with _id automatically had the country code prefixed.
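To make that concrete, here’s a simplified sketch of the prefixing (the column names are made up for illustration, and I’m assuming a country_code column has already been derived from the source upstream):

-- Sketch only: prefix the keys so they stay unique after the union
with unioned as (

    {{ union_shopify_sources('orders') }}

)

select
    country_code || '-' || id          as order_id,
    country_code || '-' || customer_id as customer_id,
    order_date,
    total_price

from unioned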


Thank you @claire and @warpraptor - this is very helpful and great to hear what you’re doing.

I had thought of something similar but ran into a couple of limitations, due to the fact that we have thousands of schemas and new ones are created all the time (basically whenever a new trial gets started):

  • A view would be slow at query execution time across that many schemas
  • A materialized table would be pretty slow and expensive at dbt job execution time as we would recreate the entire table every single time dbt runs
  • Stitch cannot automatically sync all newly added schemas from Postgres to Redshift as far as I know

Because of that, we ended up taking a slightly different approach, and it’s working great so far:

  1. We added a changed timestamp column to all the tables in our production database that we care about. We use a Postgres trigger to update it on every write
  2. We have an AWS Lambda function that runs every 30 minutes, iterates over all the schemas, and queries the tables using a timestamp cursor that we save for each schema + table combination in DynamoDB
  3. Any rows changed since the last execution get batch-sent to Amazon Kinesis Data Firehose, which writes them to a table in Redshift. These are just the updates, not the final table; for example, we have a sale_cdc table that has all the changed rows appended on every run
  4. We have a dbt job that uses a window function on the CDC table to get only the most recent record for each unique schema + id combination, and materializes it into a final sale table that contains all the sale rows from all the schemas in a format that is easy to use for further modeling and querying (see the sketch below)
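A rough sketch of that step 4 model (simplified; the column and source names are illustrative) looks something like this:

-- models/sale.sql (simplified sketch)
{{ config(materialized='table') }}

with ranked as (

    select
        *,
        row_number() over (
            partition by schema_name, id
            order by changed desc
        ) as row_num

    from {{ source('cdc', 'sale_cdc') }}  -- illustrative source name

)

-- keep only the most recent CDC row per schema + id
select *
from ranked
where row_num = 1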

Hope this might help someone in the future!


@daniellang thank you for your post!
Could you elaborate on items 3-4 and/or share your code?

About YAML anchors for mere mortals: YAML Basics: Anchors and Aliases - YouTube