I was hoping to get some thoughts on what you would consider best practices for a typical SaaS situation:
- We are a B2B SaaS tool
- We use PostgreSQL as our main operational database
- Each of our customers has their own schema in that database (all schemas have the same structure; we keep them separate to prevent data leaking between customers in case of bugs)
- There are many of these schemas (as in thousands)
- We also have a Redshift cluster with lots of other data from Salesforce, Stripe, etc., and we currently use dbt + dbt Cloud to model that data and prepare it for our BI tools
For the sake of argument, let’s say we have a table “sale” in each of these PostgreSQL operational schemas. For analytics purposes, we’d like to be able to answer questions like how many sales there were on a given day across all of our customers, i.e. across all of the schemas.
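Concretely, if we had some unified “sale” table (I’ll call it analytics.sale here, with made-up column names), the query we’d want to run is something like:

```sql
-- Daily sale counts across every customer schema
select
    date_trunc('day', created_at) as sale_day,
    count(*)                      as sales
from analytics.sale   -- hypothetical unified table we'd need to build somehow
group by 1
order by 1;
```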
How would you set up the data pipeline and model this?
My initial thoughts are:
Option A - we could replicate all of the schemas from Postgres to Redshift using something like FlyData, then use dbt to build a view that unions the “sale” tables across all of those schemas in Redshift. I’m not sure this would even work at that scale, though.
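To make that concrete, this is roughly the kind of dbt model I’m imagining for Option A, assuming the replicated schemas land in Redshift under some predictable naming convention (customer_% here) and that the “sale” columns are id, amount, and created_at (all of that is invented for illustration):

```sql
-- models/unified_sale.sql (hypothetical model name)
-- Discover the replicated customer schemas at compile time.
{% set schema_query %}
    select schema_name
    from information_schema.schemata
    where schema_name like 'customer_%'   -- made-up naming convention
{% endset %}

{% if execute %}
    {% set schemas = run_query(schema_query).columns[0].values() %}
{% else %}
    {% set schemas = [] %}
{% endif %}

-- Union the per-customer "sale" tables into one relation.
{% for s in schemas %}
select
    '{{ s }}'  as customer_schema,
    id,
    amount,
    created_at
from {{ s }}.sale
{% if not loop.last %}union all{% endif %}
{% endfor %}
```

With thousands of schemas that generated SQL gets enormous, which is a big part of why I’m not sure it would work.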
Option B - we use some kind of ETL tool to listen to changes on the “sale” tables in all of the schemas in Postgres and write a “clean” version into a unified “sale” table in Redshift. I’m not sure what tools would be a good fit for this either.
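For Option B, the target I’m picturing in Redshift is a single table that records which schema each row came from, roughly like this (table and column names are again made up):

```sql
-- Hypothetical unified target table that the ETL tool would write into
create table analytics.sale (
    customer_schema varchar(128) not null,  -- which Postgres schema the row came from
    sale_id         bigint       not null,  -- "id" from the source "sale" table
    amount          numeric(12,2),
    created_at      timestamp,
    primary key (customer_schema, sale_id)  -- informational only; Redshift doesn't enforce it
);
```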
Any thoughts and guidance would be appreciated! Thanks.