Hello All -
Question how everyone handles their schema management for staging data versus transformations done on staging data.
It’s a good rule of thumb to not let users access the staging data, because as you build a warehouse you often need to constantly change how the staging and transformation process works, and if users write analytics/reports directly against the staging data the analytics will end up breaking as you change the staging and loading processes. And if users can somehow find and access data they will write reports against it one way or another.
To enable this, in a “traditional” DW, I am accustomed to having a separate schema or even database for the staging data versus the end-user accessible transformed data area and using the schema as a global way to ensure that the groups/roles to which end-users are assigning are blocked from the staging area.
We are trying to figure out how to handle this with dbt. With our current warehouse project in Snowflake, we have a “staging” schema and an “analytics” schema. We have multiple analysts editing and working on dbt code and data transformations. We also want to keep all of the dbt transformations in one dbt project to have one DAG that makes understanding the data workflow simpler for all concerned rather than splitting “staging” processing into one dbt project and “analytics” processing into another.
The problem that we are encountering is that in order to allow analysts to be working on different code branches and data transformation flows independently, we want to use dbt’s capabilities to allow them to just rename the schema and dbt should automatically create their own independent schemas. However, this doesn’t work so well with the staging schema because we only want to maintain one copy of that, and some of the dbt tasks involve creating additional worktables and views in the staging schema. We also have some parts of the process that go staging -> analytics -> staging -> analytics because of the complexity of some transformations.
How do others handle this? Do you just put everything in one schema and try to be careful at a granular level about granting end-users access to only particular tables/views? Does everybody get their own copy of two schemas? Or are we just over-complicating all of this?
Thanks in advance for any thoughts!