Creating a separate database for datasets created by Python pipelines

Hello!

Please let me know if this is the wrong category to post this in!

A bit of background on our current database setup: we have a staging database for preparing datasets from external provider databases, and a development database that mirrors the production database. We also have a number of Python pipelines deployed on a server producing datasets, such as forecasting results and web scraping data.

We are currently trying to create a proper staging-to-data-mart flow. Should we:

1. Create a separate database for these pipeline-produced datasets, treating them as a raw data source that is then brought into the staging database, or
2. Put them into a “pipelines” schema directly in the staging database, since we have control over the formats and quality of these datasets? (This references a post by Claire about putting seeds directly in the staging database, since they can already be put into a “staging” format.)

Any thoughts would be greatly appreciated! :slight_smile:

Nancy

Hey @nancy.chelaru, this is a really interesting question! I think my answer depends on the specifics of your setup:

How much transformation and cleaning happens in Python before the datasets are loaded into the database, and does your team control that process directly? If the data is already in a perfect format for analysis, and that format isn’t going to change for a long time, you could probably get away with loading it directly into the staging db.

“Could probably get away with” is not exactly inspiring though! On the whole, I’d recommend still loading it into a separate raw database. Your staging layer might wind up being `select * from "raw"."pipelines"."web"` for now, but it maintains optionality if your data format changes in the future - you can make changes in your staging layer to maintain the interface/contract that your marts rely on.
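
To make that concrete, here’s a minimal sketch of what such a pass-through staging model could look like in dbt (the `pipelines` source and `web` table names are just illustrative, not from your setup):

```sql
-- models/staging/pipelines/stg_pipelines__web.sql
-- Pass-through staging model over the raw pipeline output.
-- Names here are illustrative, not a prescribed setup.

with source as (

    select * from {{ source('pipelines', 'web') }}

),

renamed as (

    select
        -- rename/cast columns here if the pipeline output format ever changes,
        -- so downstream marts keep seeing the same interface
        *

    from source

)

select * from renamed
```

If the pipeline output changes later, only this model needs to absorb the change; the marts built on top of it keep their contract.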

Thank you so much for your reply, @joellabes!

Great to hear your take on it! Our team does have complete control over these pipelines. I too was leaning towards having a separate raw database for the resultant datasets. It just didn’t quite feel right for them to land directly in the staging database, which is supposed to be, as you said, a layer that maintains the contracts that downstream models rely on. Having them in a separate database also makes the lineage graph clearer.
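
For completeness, a rough sketch of how those pipeline outputs might be declared as a dbt source in the separate raw database (database, schema, and table names here are placeholders, not our actual setup):

```yaml
# models/staging/pipelines/_pipelines__sources.yml
# Declares the Python pipeline outputs, landed in a separate raw database,
# as a dbt source so they show up clearly in the lineage graph.
version: 2

sources:
  - name: pipelines
    database: raw        # separate database the pipelines write to
    schema: pipelines
    tables:
      - name: web        # web scraping output
      - name: forecasts  # forecasting results
```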
