Creating a separate database for datasets created by Python pipelines

Hello!

Please let me know if this is the wrong category to post this in!

A bit of background on our current database setup: we have a staging database for preparing datasets from external provider databases, and a development database that mirrors the production database. We also have a number of Python pipelines deployed on a server producing datasets, such as forecasting results and web scraping data.

We are currently trying to create a proper staging-to-data-mart flow. Should we:

1. Create a separate database for these pipeline-produced datasets, treating them as a raw data source that is then brought into the staging database, or
2. Put them into a “pipelines” schema directly in the staging database, since we have control over the formats and quality of these datasets? (This references a post by Claire about putting seeds directly in the staging database, since they can already be put into a “staging” format.)

Any thoughts would be greatly appreciated! :slight_smile:

Nancy

Hey @nancy.chelaru, this is a really interesting question! I think my answer depends on the specifics of your setup:

How much transformation and cleaning happens in Python before the datasets are loaded into the database, and does your team control that process directly? If the data is already in a perfect format for analysis, and that format isn’t going to change for a long time, you could probably get away with loading it directly into the staging db.

“Could probably get away with” is not exactly inspiring though! On the whole, I’d recommend still loading it into a separate raw database. Your staging layer might wind up being `select * from "raw"."pipelines"."web"` for now, but it maintains optionality if your data format changes in the future - you can make changes in your staging layer to maintain the interface/contract that your marts rely on.
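
To make that concrete, here’s a minimal sketch of what such a pass-through staging model could look like in dbt (the `pipelines` source and `web` table names are just illustrative, not from your setup):

```sql
-- models/staging/pipelines/stg_pipelines__web.sql
-- Pass-through staging model over the raw pipeline output.
-- Names here are illustrative, not a prescribed setup.

with source as (

    select * from {{ source('pipelines', 'web') }}

),

renamed as (

    select
        -- rename/cast columns here if the pipeline output format ever changes,
        -- so downstream marts keep seeing the same interface
        *

    from source

)

select * from renamed
```

If the pipeline output changes later, only this model needs to absorb the change; the marts built on top of it keep their contract.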

Thank you so much for your reply, @joellabes!

Great to hear your take on it! Our team does have complete control over these pipelines. I too was leaning towards having a separate raw database for the resultant datasets. It just didn’t quite feel right for them to land directly in the staging database, which is supposed to be, as you said, a layer that maintains the contracts that downstream models rely on. Having them in a separate database also makes the lineage graph clearer.
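
For completeness, a rough sketch of how those pipeline outputs might be declared as a dbt source in the separate raw database (database, schema, and table names here are placeholders, not our actual setup):

```yaml
# models/staging/pipelines/_pipelines__sources.yml
# Declares the Python pipeline outputs, landed in a separate raw database,
# as a dbt source so they show up clearly in the lineage graph.
version: 2

sources:
  - name: pipelines
    database: raw        # separate database the pipelines write to
    schema: pipelines
    tables:
      - name: web        # web scraping output
      - name: forecasts  # forecasting results
```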
